CN112733672A - Monocular camera-based three-dimensional target detection method and device and computer equipment - Google Patents

Monocular camera-based three-dimensional target detection method and device and computer equipment Download PDF

Info

Publication number
CN112733672A
CN112733672A (application CN202011631597.1A)
Authority
CN
China
Prior art keywords
target object
image
feature map
detection model
upsampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011631597.1A
Other languages
Chinese (zh)
Inventor
刘明
廖毅雄
马福龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yiqing Innovation Technology Co ltd
Original Assignee
Shenzhen Yiqing Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yiqing Innovation Technology Co ltd filed Critical Shenzhen Yiqing Innovation Technology Co ltd
Priority to CN202011631597.1A priority Critical patent/CN112733672A/en
Publication of CN112733672A publication Critical patent/CN112733672A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The application relates to a monocular camera-based three-dimensional target detection method and apparatus, computer device, and storage medium. The method includes: acquiring an image captured by a monocular camera in an autonomous driving scenario; inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer; performing feature enhancement on the fused feature map; regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, the orientation of the target object, and the offset of the target object's center point; and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result for the target object. The method can improve the accuracy of three-dimensional target detection.

Description

Monocular camera-based three-dimensional target detection method and device and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a monocular camera-based three-dimensional target detection method, apparatus, and computer device.
Background
With the development of computer technology, autonomous driving has become a research hotspot. In an autonomous driving scenario, accurately detecting surrounding objects is essential. To reduce cost, mass-produced systems mainly capture images of surrounding obstacles with a camera and detect surrounding objects from the captured images.
However, conventional methods extract features from the captured image and detect objects on the output feature map. Because the receptive field of a feature map obtained by direct feature extraction is small, and the captured image is distorted relative to the real scene, detecting directly on such a low-receptive-field feature map cannot accurately detect surrounding objects in an autonomous driving scenario.
Disclosure of Invention
In view of the above, it is necessary to provide a monocular camera-based three-dimensional target detection method, apparatus, computer device, and storage medium capable of improving detection accuracy.
A monocular camera-based three-dimensional target detection method, the method comprising:
acquiring an image captured by a monocular camera in an autonomous driving scenario;
inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer;
performing feature enhancement on the fused feature map;
regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, the orientation of the target object, and the offset of the target object's center point; and
adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result for the target object.
In one embodiment, the step of inputting the image into a split-attention residual network in a trained target object detection model and upsampling the image using deformable convolution further includes:
inputting the image into a split-attention residual network in a trained target object detection model;
in the process of upsampling the image multiple times using deformable convolution, for each upsampling, deforming a convolution kernel according to the geometric shape of a target object in the image to obtain a convolution kernel adapted to that geometric shape; and
upsampling the image based on the convolution kernel adapted to the geometric shape to obtain a feature map whose receptive field matches the size of the target object.
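The core mechanism of the deformable convolution above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patent's implementation: a learned per-tap offset shifts each sampling location of a 3x3 kernel, and bilinear interpolation handles the fractional positions that result.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample a (H, W) feature map at a fractional location (y, x)."""
    h, w = feat.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def deformable_conv_point(feat, weight, offsets, cy, cx):
    """One output value of a 3x3 deformable convolution centered at (cy, cx).
    offsets has shape (3, 3, 2): a learned (dy, dx) shift per kernel tap,
    which lets the kernel deform to the target object's geometry."""
    out = 0.0
    for i, ky in enumerate((-1, 0, 1)):
        for j, kx in enumerate((-1, 0, 1)):
            dy, dx = offsets[i, j]
            out += weight[i, j] * bilinear_sample(feat, cy + ky + dy, cx + kx + dx)
    return out

feat = np.arange(25, dtype=float).reshape(5, 5)
weight = np.full((3, 3), 1.0 / 9.0)                 # simple averaging kernel
zero = np.zeros((3, 3, 2))                          # zero offsets: ordinary conv
shift = np.zeros((3, 3, 2)); shift[..., 1] = 1.0    # every tap shifted right by 1

regular = deformable_conv_point(feat, weight, zero, 2, 2)
deformed = deformable_conv_point(feat, weight, shift, 2, 2)
```

With zero offsets the result equals an ordinary convolution; a non-zero offset field moves the sampling grid, which is what allows the receptive field to follow the object's shape.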
In one embodiment, the step of upsampling the image using deformable convolution and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer further includes:
upsampling the image multiple times using deformable convolution, and fusing the feature map obtained after each upsampling with the feature map of the preceding lower layer;
where the feature map of the preceding lower layer is the corresponding feature map output by the preceding upsampling that has not yet been fused before the current upsampling.
In one embodiment, the step of regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, the orientation of the target object, and the offset of the target object's center point further includes:
adding prior frames to the image based on the enhanced feature map, and regressing the length and width of the three-dimensional frame from the prior frames;
regressing the orientation of the target object, the offset of the target object's center point, and the height of the three-dimensional frame according to the distance between the target object's center point and the monocular camera; and
obtaining the three-dimensional frame for identifying the target object in the image from the length, width, and height of the three-dimensional frame.
In one embodiment, the method further comprises:
labeling a sample image with label information according to point cloud coordinates, where the point cloud coordinates come from a point cloud collected by a LiDAR for the sample scene;
converting the point cloud coordinates into coordinates in the camera coordinate system using the camera extrinsic parameters in the label information;
converting the coordinates in the camera coordinate system into the pixel coordinate system using the camera intrinsic parameters in the label information; and
training the target object detection model on the sample image converted into the pixel coordinate system.
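The two coordinate conversions in this embodiment can be sketched as follows. The calibration values (R, t, K) below are hypothetical placeholders for illustration, not values from the patent: the extrinsics (R, t) map a LiDAR point into the camera frame, and the intrinsic matrix K projects camera-frame points onto the image plane.

```python
import numpy as np

# Hypothetical calibration (not from the patent):
R = np.eye(3)                       # assume LiDAR and camera axes are aligned
t = np.array([0.0, -1.2, 0.0])      # camera mounted 1.2 m above the LiDAR
K = np.array([[721.5,   0.0, 609.6],
              [  0.0, 721.5, 172.9],
              [  0.0,   0.0,   1.0]])

def lidar_to_pixel(p_lidar):
    """Project one LiDAR point to pixel coordinates (u, v)."""
    p_cam = R @ p_lidar + t          # point cloud -> camera coordinate system
    u, v, z = K @ p_cam              # camera coordinates -> image plane
    return np.array([u / z, v / z])  # perspective division yields pixels

point = np.array([2.0, 1.2, 10.0])   # a labeled point 10 m in front of the camera
pixel = lidar_to_pixel(point)
```

The same two-step transform, applied to every labeled point, converts point-cloud labels into pixel-coordinate labels for training.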
In one embodiment, the method further comprises:
when using the trained target object detection model, quantizing the 32-bit or 16-bit data to be computed in the model into 8-bit integer form using an inference optimizer in the model, by minimizing KL divergence; and
performing the computation of the target object detection model on the data quantized into 8-bit integer form.
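A simplified sketch of KL-divergence-based INT8 calibration follows. Real inference optimizers (e.g., TensorRT, which this embodiment resembles) use a more elaborate histogram procedure, so treat this as an illustration of the idea only: among candidate saturation thresholds, pick the one whose quantize-dequantize distribution stays closest (in KL divergence) to the original activation distribution.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL divergence between two (unnormalized) histograms."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def calibrate_int8_scale(activations, num_candidates=50, bins=128):
    """Pick the saturation threshold whose INT8 quantization minimizes the
    KL divergence from the original activation distribution, and return the
    resulting per-tensor scale (threshold / 127)."""
    a = np.abs(activations)
    max_val = a.max()
    ref_hist, _ = np.histogram(a, bins=bins, range=(0, max_val))
    best_t, best_kl = max_val, np.inf
    for t in np.linspace(max_val / num_candidates, max_val, num_candidates):
        scale = t / 127.0
        q = np.clip(np.round(a / scale), 0, 127) * scale   # quantize-dequantize
        q_hist, _ = np.histogram(q, bins=bins, range=(0, max_val))
        kl = kl_divergence(ref_hist.astype(float), q_hist.astype(float))
        if kl < best_kl:
            best_kl, best_t = kl, t
    return best_t / 127.0

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, 10000).astype(np.float32)  # stand-in fp32 activations
scale = calibrate_int8_scale(acts)
```

With the calibrated scale, each 32-bit value x is stored as the 8-bit integer round(x / scale), clipped to [-127, 127].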
A monocular camera-based three-dimensional target detection apparatus, the apparatus comprising:
an image acquisition module, configured to acquire an image captured by a monocular camera in an autonomous driving scenario;
a feature extraction module, configured to input the image into a split-attention residual network in a trained target object detection model, upsample the image using deformable convolution, and fuse each feature map obtained after upsampling with the feature map of the preceding lower layer;
an enhancement module, configured to perform feature enhancement on the fused feature map;
a regression module, configured to regress, based on the enhanced feature map, a three-dimensional frame for identifying a target object, the orientation of the target object, and the offset of the target object's center point; and
a detection module, configured to adjust the position of the three-dimensional frame according to the offset and obtain a target detection result for the target object.
In one embodiment, the feature extraction module is further configured to input the image into a split-attention residual network in a trained target object detection model; in the process of upsampling the image multiple times using deformable convolution, for each upsampling, deform a convolution kernel according to the geometric shape of a target object in the image to obtain a convolution kernel adapted to that geometric shape; and upsample the image based on the convolution kernel adapted to the geometric shape to obtain a feature map whose receptive field matches the size of the target object.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps:
acquiring an image captured by a monocular camera in an autonomous driving scenario;
inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer;
performing feature enhancement on the fused feature map;
regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, the orientation of the target object, and the offset of the target object's center point; and
adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result for the target object.
A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:
acquiring an image captured by a monocular camera in an autonomous driving scenario;
inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer;
performing feature enhancement on the fused feature map;
regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, the orientation of the target object, and the offset of the target object's center point; and
adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result for the target object.
According to the monocular camera-based three-dimensional target detection method and apparatus, computer device, and storage medium, an image captured by a monocular camera in an autonomous driving scenario is acquired; compared with a binocular camera, this halves the cost. The acquired image is input into a trained target object detection model that uses a split-attention residual network as the backbone feature extraction network; compared with a plain residual network, the added split-attention module lets the network focus more on important features. In the split-attention residual network, the image is upsampled multiple times using deformable convolution, each feature map obtained after upsampling is fused with the feature map of the preceding lower layer, and the result is further enhanced, so the output feature map has a relatively large receptive field. The three-dimensional frame and the offset of the target object's center point are regressed from this feature map with a larger receptive field, and the three-dimensional frame is adjusted according to the offset, which avoids inaccurate framing of the target object caused by truncation of the three-dimensional frame. In summary, from extracting features with a residual network combined with a split-attention module, to enhancing the feature map obtained through multiple rounds of upsampling and fusion, to detecting the target object with the adjusted three-dimensional frame, the accuracy of target object detection is improved overall.
Drawings
FIG. 1 is a diagram of an application environment of a monocular camera-based three-dimensional object detection method in one embodiment;
FIG. 2 is a schematic flow chart illustrating a monocular camera-based three-dimensional object detection method according to an embodiment;
FIG. 3 is a diagram illustrating the results of a monocular camera-based three-dimensional object detection method in one embodiment;
FIG. 4 is a block diagram of a monocular camera-based three-dimensional object detection device in one embodiment;
FIG. 5 is a block diagram of a monocular camera-based three-dimensional object detection device in another embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The monocular camera-based three-dimensional target detection method provided by this application can be applied in the environment shown in FIG. 1, in which the monocular camera 102 is connected to the vehicle 104 via a network. The vehicle 104 is provided with a target detection device. The monocular camera 102 passes the images captured in the autonomous driving scenario to the target detection device of the vehicle 104 for detection.
In one embodiment, as shown in fig. 2, a monocular camera-based three-dimensional object detection method is provided, which is described by taking the application of the method to the object detection device in fig. 1 as an example, and includes the following steps:
step 202, acquiring an image acquired by a monocular camera under an automatic driving scene.
The monocular camera is a camera. It is understood that a binocular camera or a multi-view camera uses two or more cameras. The automatic driving scene is a scene in which the vehicle automatically travels.
In one embodiment, the vehicle in the autonomous driving scenario may or may not have a driver.
In one embodiment, pictures of different automatic driving scenes, including pictures of various automatic driving scenes such as daytime, backlight, night, rainy days and foggy days, can be collected through the monocular camera.
Specifically, in a vehicle automatic driving scene, a monocular camera is arranged on the vehicle, and the monocular camera can shoot the automatic driving scene so as to acquire an image in the automatic driving scene.
Step 204, inputting the image into a split-attention residual network in the trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer.
The target object detection model is a model for detecting a target object in an image. It comprises a split-attention residual network, a detection head, and a task head. The split-attention residual network is a residual network to which a split-attention module is added; it extracts features of the target object from the image and may also be called a feature extractor. A deformable convolution is a convolution whose kernel can change shape. A feature map is a map carrying features of the target object. The target object is the object to be detected.
Specifically, the target detection device feeds the image captured by the monocular camera into the trained target object detection model. In the split-attention residual network of the trained model, the target detection device downsamples the image by a preset stride and outputs a feature map. The target detection device then upsamples the downsampled feature map using deformable convolution in the split-attention residual network, outputs each feature map obtained after upsampling through different numbers of channels, and superimposes them into the upsampled feature map. In the split-attention residual network, the target detection device fuses the upsampled feature map with the feature map of the preceding lower layer.
In one embodiment, the step size of the down-sampling may be 32.
Step 206, performing feature enhancement on the fused feature map.
The detection head is the module that enhances the features of the target object.
Specifically, the target detection device convolves the fused feature map through the detection head in the target object detection model to enhance the features of the fused feature map.
Step 208, regressing, based on the enhanced feature map, a three-dimensional frame for identifying the target object, the orientation of the target object, and the offset of the target object's center point.
The three-dimensional frame is a solid geometric frame containing three-dimensional information.
Specifically, based on the enhanced feature map, the target detection device uses the task head in the target object detection model to regress the length, width, and height of a three-dimensional frame for identifying the target object in the image, the orientation of the target object, and the offset of the identified target object's center point. The task head is the module that regresses the three-dimensional frame and the offset of the target object's center point.
Step 210, adjusting the position of the three-dimensional frame according to the offset, and obtaining a target detection result for the target object.
The target detection result includes the category, orientation, position, and length, width, and height of the target object.
Specifically, during model training, a user may manually adjust the camera parameters and thereby adjust the offset of the camera coordinates relative to the world coordinates; the offsets are stored as parameters of the trained target object detection model. The target detection device adjusts the position of the three-dimensional frame according to the stored offset through the task head in the target object detection model, thereby detecting the category, orientation, position, and length, width, and height of the three-dimensional target object.
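The adjustment in step 210 can be illustrated with a small sketch. The stride, coarse center, and offset values here are hypothetical illustrations, not from the patent: a coarse box center predicted on the downsampled feature map is shifted by the regressed sub-cell offset before being mapped back to image pixels.

```python
import numpy as np

stride = 4                                  # feature-map-to-image stride (assumed)
coarse_center = np.array([30.0, 45.0])      # (x, y) cell of the box center
predicted_offset = np.array([0.25, -0.4])   # regressed center-point offset

def refine_center(cell, offset, stride):
    """Shift the coarse center by the regressed offset, then map to pixels."""
    return (cell + offset) * stride

center_px = refine_center(coarse_center, predicted_offset, stride)
```

Without the offset the center would snap to the cell grid; applying it recovers the sub-cell position, which is what keeps the three-dimensional frame from drifting off the object.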
According to the above monocular camera-based three-dimensional target detection method, an image captured by a monocular camera in an autonomous driving scenario is acquired; compared with a binocular camera, this halves the cost. The acquired image is input into a trained target object detection model that uses a split-attention residual network as the backbone feature extraction network; compared with a plain residual network, the added split-attention module lets the network focus more on important features. In the split-attention residual network, the image is upsampled multiple times using deformable convolution, each feature map obtained after upsampling is fused with the feature map of the preceding lower layer, and the result is further enhanced, so the output feature map has a relatively large receptive field. The three-dimensional frame and the offset of the target object's center point are regressed from this feature map with a larger receptive field, and the three-dimensional frame is adjusted according to the offset, which avoids inaccurate framing of the target object caused by truncation of the three-dimensional frame. In summary, from extracting features with a residual network combined with a split-attention module, to enhancing the feature map obtained through multiple rounds of upsampling and fusion, to detecting the target object with the adjusted three-dimensional frame, the accuracy of target object detection is improved overall.
In one embodiment, the step of inputting the image into a split-attention residual network in a trained target object detection model and upsampling the image using deformable convolution further comprises: inputting the image into a split-attention residual network in a trained target object detection model; in the process of upsampling the image multiple times using deformable convolution, for each upsampling, deforming a convolution kernel according to the geometric shape of a target object in the image to obtain a convolution kernel adapted to that geometric shape; and upsampling the image based on the convolution kernel adapted to the geometric shape to obtain a feature map whose receptive field matches the size of the target object.
A convolution kernel is a matrix convolved with the pixel matrix corresponding to the image. The receptive field describes how much of the input the feature map perceives when identifying the target object: if the target object can be easily identified from the feature map, the receptive field is said to be large; otherwise it is said to be small.
Specifically, the target detection device upsamples the feature map of the downsampled image multiple times using deformable convolution in the split-attention residual network of the trained target object detection model. In each upsampling, a convolution kernel deformed according to the geometric shape of the target object is convolved with the pixel matrix corresponding to the feature map, so as to obtain a feature map whose receptive field matches the size of the target object.
In this embodiment, upsampling with deformable convolution enhances the target object detection model's adaptability to changes in the geometric shape of the target object in the image.
In one embodiment, the step of upsampling the image using deformable convolution and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer further comprises: fusing the feature map obtained after each upsampling with the feature map of the preceding lower layer, where the feature map of the preceding lower layer is the corresponding unfused feature map output by the preceding upsampling, before the current upsampling.
Here, each convolutional layer in the split-attention residual network corresponds to a layer of neurons. It can be understood that each layer of the network convolves the image input to it; that input may be the original image or the feature map output by the convolution of the layer above.
Specifically, through the split-attention residual network, the target detection device fuses each feature map obtained by upsampling in each convolutional layer with the not-yet-fused feature map obtained by upsampling in the preceding lower convolutional layer.
In one embodiment, the number of upsamplings may be three. In the split-attention residual network of the trained target object detection model, the target detection device first upsamples the feature map output after downsampling using deformable convolution, outputs the feature map through 256 channels, and superimposes the outputs to obtain the first unfused upsampled feature map; this is fused with the feature map output after downsampling to obtain the first fused feature map. The target detection device then performs a second upsampling on the first fused feature map using deformable convolution through the split-attention residual network, outputs the feature map through 128 channels, and superimposes the outputs to obtain the second unfused upsampled feature map, which is fused with the first fused feature map to obtain the second fused feature map. Finally, the target detection device performs a third upsampling on the second fused feature map using deformable convolution through the split-attention residual network, outputs the feature maps through 64 channels, and superimposes them to obtain the third unfused upsampled feature map, which is fused with the second fused feature map to obtain the third fused feature map.
In this embodiment, the receptive field of the downsampled feature map is small, which is unfavorable for target detection; upsampling the output feature map improves its expressive power. Deformable convolution adapts to changes in the target's geometric shape and improves the generalization of the convolutional network.
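The three upsampling-and-fusion stages with 256/128/64 channels can be sketched as a toy decoder. This NumPy sketch substitutes nearest-neighbour upsampling and random 1x1-style channel projections for the patent's deformable convolutions, and uses random stand-ins for the unfused preceding lower-layer feature maps; only the shapes and the fuse-after-each-upsampling structure are meant to match the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def project(x, c_out):
    """1x1-convolution-like channel projection with random weights."""
    w = rng.normal(size=(c_out, x.shape[0])) / np.sqrt(x.shape[0])
    return np.einsum('oc,chw->ohw', w, x)

# Backbone output after 32x downsampling of a hypothetical 512x512 image.
fused = rng.normal(size=(512, 16, 16))
# Stand-ins for the unfused preceding lower-layer feature maps.
laterals = [rng.normal(size=(256, 32, 32)),
            rng.normal(size=(128, 64, 64)),
            rng.normal(size=(64, 128, 128))]

for channels, lateral in zip((256, 128, 64), laterals):
    up = project(upsample2x(fused), channels)  # upsampled, not yet fused
    fused = up + lateral                       # fuse with preceding lower layer
```

After the three stages the fused map has 64 channels at 8x the backbone's output resolution, mirroring the 256 -> 128 -> 64 channel schedule described above.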
In one embodiment, the step of regressing, based on the enhanced feature map, the three-dimensional frame for identifying the target object, the orientation of the target object, and the offset of the target object's center point further comprises: adding prior frames to the image based on the enhanced feature map, and regressing the length and width of the three-dimensional frame from the prior frames; regressing the orientation of the target object, the offset of the target object's center point, and the height of the three-dimensional frame according to the distance between the target object's center point and the monocular camera; and obtaining the three-dimensional frame for identifying the target object in the image from the length, width, and height of the three-dimensional frame.
A prior frame is an initial two-dimensional geometric frame; a two-dimensional geometric frame has only a length and a width. There may be multiple prior frames, and an accurate detection frame is finally obtained through continual regression.
Specifically, based on the feature map enhanced by the detection head, the target detection device adds multiple prior frames to the image through the task head in the target object detection model, and obtains the optimal prior frame, that is, regresses the length and width of the three-dimensional frame, by continually comparing a threshold computed from the prior frames with a reference threshold in the trained target object detection model. According to the distance from the target object's center point to the monocular camera, the target detection device regresses, through the task head, the orientation of the target object, the offset of the target object's center point, and the height of the three-dimensional frame, and obtains the three-dimensional frame for identifying the target object in the image from the length, width, and height of the three-dimensional frame.
In one embodiment, as shown in FIG. 3, from the image captured by the monocular camera, a monocular camera-based three-dimensional object detection device on the vehicle may identify the target object on the image with a three-dimensional frame.
In this embodiment, based on the enhanced feature map, the task head in the target object detection model regresses the offset of the target object's center point to adjust the three-dimensional frame used to identify the target object in the image, reducing cases where the three-dimensional frame extends beyond the image, that is, cases where the frame is truncated, thereby avoiding truncation of the identified target object and improving the accuracy of the three-dimensional frame.
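The prior-frame matching in this embodiment can be sketched with a simple IoU comparison. The boxes, threshold, and selection rule below are illustrative assumptions, since the patent does not spell out how the "threshold computed from the prior frames" is defined; IoU against a candidate region is a common choice for such a score.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical prior frames and a candidate object region, in pixels.
priors = np.array([[0, 0, 40, 40], [10, 10, 70, 70], [20, 20, 60, 100]], dtype=float)
target = np.array([12, 12, 68, 72], dtype=float)
threshold = 0.5                              # assumed reference threshold

scores = np.array([iou(p, target) for p in priors])
best = priors[scores.argmax()]               # prior kept for length/width regression
kept = scores.max() >= threshold
```

The best-matching prior then serves as the starting point from which the length and width of the three-dimensional frame are regressed.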
In one embodiment, the method further comprises: labeling the sample image with label information according to point cloud coordinates, the point cloud coordinates being formed from sample images collected by a laser radar; converting the point cloud coordinates into coordinates in a camera coordinate system through the camera external parameters in the label information; converting the coordinates in the camera coordinate system into a pixel coordinate system through the camera internal parameters in the label information; and training the target object detection model according to the sample image converted into the pixel coordinate system.
The sample image is an image of a sample used for detection. The point cloud coordinates are coordinates annotated on a plurality of cloud-like points. The camera coordinate system is a coordinate system defined from the viewpoint of the camera. The camera external parameters comprise the rotation parameters about the three coordinate axes of the camera and the translation parameters along those three axes. The camera internal parameters comprise the radial distortion coefficients and tangential distortion coefficients of the camera. The sample image converted into the pixel coordinate system is a pixel matrix.
Specifically, the user annotates the sample images acquired by the monocular camera. In labeling software, the user can input a sample image acquired by the monocular camera, adjust the point cloud coordinates added to the target object, add the values corresponding to those coordinates, and attach label information to the labeled target object. The label information includes: a picture identifier, the image category, the camera internal parameters, the two-dimensional frame, whether the object is truncated, the degree of occlusion, the orientation, and the three-dimensional dimensions. Truncation may be represented by "0" for not truncated and "1" for truncated. Sample images in an automatic driving scene are acquired by the monocular camera, and the user adds point cloud coordinates to the sample target object in each sample image and labels it with label information. The target detection device converts the point cloud coordinates into coordinates in the camera coordinate system through the camera external parameters in the label information, and then converts the coordinates in the camera coordinate system into the pixel coordinate system through the camera internal parameters in the label information. The user then trains the target object detection model with the sample images converted into the pixel coordinate system.
In one embodiment, fine-tuning the camera external parameters may adjust the three rotational degrees of freedom of the camera coordinate system, namely pitch, yaw, and roll, and translate the coordinate axes of the camera coordinate system, namely the x, y, and z axes. The target detection device can convert the point cloud coordinates into coordinates in the camera coordinate system through the offsets obtained from this fine-tuning in the label information. The target detection device then converts the coordinates in the camera coordinate system into the pixel coordinate system through the radial distortion coefficients, the tangential distortion coefficients, and the pixel scale of the camera.
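Under standard pinhole-camera assumptions, the two conversions above can be sketched as follows; `R`, `t`, and `K` stand for the extrinsic rotation/translation and the intrinsic matrix taken from the label information, and lens distortion is omitted for brevity:

```python
import numpy as np

def lidar_to_pixel(pts_lidar, R, t, K):
    """Project Nx3 lidar points to pixel coordinates.

    R (3x3), t (3,): camera extrinsic parameters (lidar -> camera frame).
    K (3x3): camera intrinsic matrix (distortion ignored in this sketch).
    """
    pts_cam = pts_lidar @ R.T + t   # point cloud -> camera coordinate system
    uvw = pts_cam @ K.T             # homogeneous pixel coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]   # perspective division -> pixel coordinate system
    return uv, pts_cam[:, 2]        # pixel coordinates and depth
```

For instance, a point 10 m straight ahead of a camera with focal length 700 and principal point (320, 240) projects to the principal point itself.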
In this embodiment, after sample images in different automatic driving scenes are labeled, they are fed into the target object detection model for training, which can improve the detection accuracy of the target object detection model.
In one embodiment, the method further comprises: in the process of using the trained target object detection model, 32-bit or 16-bit data to be calculated in the target object detection model is quantized into data in an 8-bit integer form by using an inference optimizer in the target object detection model in a mode of minimizing KL divergence (Kullback-Leibler divergence); and calculating the data quantized into the form of 8-bit integers according to the target object detection model.
Specifically, the target detection device quantizes 32-bit or 16-bit data to be calculated in the target object detection model into data in the form of an 8-bit integer by using an inference optimizer in the target object detection model in a manner of minimizing the KL divergence, and then performs calculation.
In one embodiment, for example, before upsampling, the 32-bit or 16-bit data corresponding to the feature map to be subjected to deformable convolution may be quantized into data in the form of 8-bit integers, and the deformable convolution is then performed.
In one embodiment, one tenth of the training data may be taken as a calibration data set; the target objects in the images are first inferred with 32-bit data in the target object detection model, a histogram of the activation values of each layer is then collected, the saturated quantization distributions under different thresholds are computed, and finally the threshold that minimizes the KL divergence is selected.
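The calibration loop can be sketched roughly as below. This is a simplified version of the entropy-calibration idea only; the bin counts and the bin-merging scheme are illustrative assumptions, not the patent's or TensorRT's exact procedure:

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence between two (unnormalized) histograms."""
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-12))))

def find_calibration_threshold(activations, num_bins=1024, num_quant_bins=128):
    """Pick a saturation threshold for INT8 quantization by minimizing the KL
    divergence between the FP32 activation histogram and its quantized,
    saturated counterpart (illustrative sketch)."""
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_t, best_kl = edges[-1], float("inf")
    for i in range(num_quant_bins, num_bins + 1):
        ref = hist[:i].astype(np.float64)
        ref[-1] += hist[i:].sum()  # saturate the clipped tail into the last bin
        # collapse the first i bins into num_quant_bins levels, then expand back
        quant = np.zeros(i)
        for chunk in np.array_split(np.arange(i), num_quant_bins):
            total = hist[chunk].sum()
            nonzero = np.count_nonzero(hist[chunk])
            if nonzero:
                quant[chunk] = np.where(hist[chunk] > 0, total / nonzero, 0.0)
        kl = kl_divergence(ref, quant)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t
```

Activations beyond the returned threshold are clipped before the linear mapping to the 8-bit integer range.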
In one embodiment, the inference optimizer may be "TensorRT," a forward-propagation-only deep learning framework.
In this embodiment, quantizing the 32-bit or 16-bit data to be calculated into data in the form of 8-bit integers increases the operation speed of the CPU or GPU of the target detection device, and thereby increases the speed at which the target object detection model detects the target object; specifically, the speed can be increased by at least 1.5 times compared with the speed before acceleration.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown, and may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided a monocular camera-based three-dimensional object detecting device 400, comprising: an image acquisition module 402, a feature extraction module 404, an enhancement module 406, a regression module 408, and a detection module 410, wherein:
The image acquisition module 402 is configured to acquire an image captured by a monocular camera in an automatic driving scene.
The feature extraction module 404 is configured to input the image into the discrete attention residual network in the trained target object detection model, perform upsampling on the image by using deformable convolution, and fuse each feature map obtained after upsampling with the feature map of the previous lower layer.
The enhancement module 406 is configured to perform feature enhancement on the fused feature map.
The regression module 408 is configured to regress, based on the enhanced feature map, a three-dimensional frame for identifying the target object, the orientation of the target object, and the offset of the center point of the target object.
The detection module 410 is configured to adjust the position of the three-dimensional frame according to the offset, and obtain a target detection result of the target object.
In one embodiment, the feature extraction module 404 is further configured to: input the image into the discrete attention residual network in the trained target object detection model; in the process of performing upsampling on the image multiple times by using deformable convolution, deform the convolution kernel according to the geometric shape of the target object in the image for each upsampling, to obtain a convolution kernel adapted to that geometric shape; and upsample the image based on the convolution kernel adapted to the geometric shape, to obtain a feature map whose receptive field matches the size of the target object.
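As an illustration of the idea only (not the model's actual layer), a single output position of a 3x3 deformable convolution can be computed by bilinearly sampling the feature map at offset-shifted tap locations:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) location."""
    h, w = feat.shape
    y0 = int(np.clip(np.floor(y), 0, h - 1))
    x0 = int(np.clip(np.floor(x), 0, w - 1))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0, x0] + (1 - dy) * dx * feat[y0, x1]
            + dy * (1 - dx) * feat[y1, x0] + dy * dx * feat[y1, x1])

def deformable_conv_at(feat, weight, offsets, cy, cx):
    """3x3 deformable convolution at one output position (cy, cx): every
    kernel tap is shifted by a learned (dy, dx) offset before sampling,
    which lets the receptive field follow the shape of the target object."""
    out, k = 0.0, 0
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            dy, dx = offsets[k]
            out += weight[i + 1, j + 1] * bilinear_sample(feat, cy + i + dy, cx + j + dx)
            k += 1
    return out
```

With all offsets at zero this reduces to an ordinary 3x3 convolution; the learned offsets are what adapt the kernel to the object's geometry.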
In one embodiment, the feature extraction module 404 is further configured to perform upsampling on the image multiple times by using deformable convolution, and to fuse the feature map obtained after each upsampling with the feature map of the previous lower layer, where the feature map of the previous lower layer is the corresponding feature map that was output after the previous upsampling and had not yet been fused before the current upsampling.
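The repeated upsample-then-fuse step can be sketched as a top-down pass over a feature pyramid. Nearest-neighbor upsampling stands in for the deformable-convolution upsampling and element-wise addition stands in for the fusion; both are simplifying assumptions for illustration:

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbor 2x upsampling (placeholder for deformable-conv upsampling)."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def top_down_fuse(pyramid):
    """pyramid: feature maps from deepest (smallest) to shallowest (largest),
    each level twice the spatial size of the previous one. The map obtained
    after each upsampling is fused with the not-yet-fused map of the
    previous lower layer."""
    fused = pyramid[0]
    outputs = [fused]
    for lower in pyramid[1:]:
        fused = upsample2x(fused) + lower  # fuse with the previous lower layer
        outputs.append(fused)
    return outputs
```

The final, highest-resolution output is the fused feature map that feature enhancement is then applied to.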
In one embodiment, the regression module 408 is further configured to add prior frames to the image based on the enhanced feature map and regress the length and the width of the three-dimensional frame according to the prior frames; regress the orientation of the target object, the offset of the center point of the target object, and the height of the three-dimensional frame according to the distance between the center point of the target object and the monocular camera; and obtain the three-dimensional frame for identifying the target object in the image from the length, the width, and the height of the three-dimensional frame.
In one embodiment, the apparatus further comprises:
The training module 401 is configured to label the sample image with label information according to point cloud coordinates, the point cloud coordinates being formed from sample images collected by a laser radar; convert the point cloud coordinates into coordinates in a camera coordinate system through the camera external parameters in the label information; convert the coordinates in the camera coordinate system into a pixel coordinate system through the camera internal parameters in the label information; and train the target object detection model according to the sample image converted into the pixel coordinate system.
As shown in fig. 5, in one embodiment, the apparatus further comprises: a training module 401 and an acceleration module 412;
The acceleration module 412 is configured to, in the process of using the trained target object detection model, quantize the 32-bit or 16-bit data to be calculated in the target object detection model into data in the form of 8-bit integers by using the inference optimizer in the target object detection model in a manner of minimizing the KL divergence, and to perform calculation on the data quantized into the form of 8-bit integers according to the target object detection model.
For specific limitations of the monocular camera-based three-dimensional object detection device, reference may be made to the above limitations of the monocular camera-based three-dimensional object detection method, which are not described herein again. The modules in the monocular camera-based three-dimensional object detecting device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be an object detection device on a vehicle in an autonomous driving scenario, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with external target detection equipment, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a monocular camera-based three-dimensional object detection method.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring an image acquired by a monocular camera in an automatic driving scene; inputting the image into the discrete attention residual network in a trained target object detection model, performing upsampling on the image by using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the previous lower layer; performing feature enhancement on the fused feature map; regressing a three-dimensional frame for identifying the target object, the orientation of the target object, and the offset of the center point of the target object based on the enhanced feature map; and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object.
In one embodiment, the processor, when executing the computer program, further performs the steps of: inputting the image into the discrete attention residual network in the trained target object detection model; in the process of performing upsampling on the image multiple times by using deformable convolution, deforming the convolution kernel according to the geometric shape of the target object in the image for each upsampling, to obtain a convolution kernel adapted to the geometric shape; and upsampling the image based on the convolution kernel adapted to the geometric shape, to obtain a feature map whose receptive field matches the size of the target object.
In one embodiment, the processor, when executing the computer program, further performs the steps of: performing upsampling on the image multiple times by using deformable convolution, and fusing the feature map obtained after each upsampling with the feature map of the previous lower layer, where the feature map of the previous lower layer is the corresponding feature map that was output after the previous upsampling and had not yet been fused before the current upsampling.
In one embodiment, the processor, when executing the computer program, further performs the steps of: adding prior frames to the image based on the enhanced feature map, and regressing the length and the width of the three-dimensional frame according to the prior frames; regressing the orientation of the target object, the offset of the center point of the target object, and the height of the three-dimensional frame according to the distance between the center point of the target object and the monocular camera; and obtaining the three-dimensional frame for identifying the target object in the image from the length, the width, and the height of the three-dimensional frame.
In one embodiment, the processor, when executing the computer program, further performs the steps of: labeling the sample image with label information according to point cloud coordinates, the point cloud coordinates being formed from sample images collected by a laser radar; converting the point cloud coordinates into coordinates in a camera coordinate system through the camera external parameters in the label information; converting the coordinates in the camera coordinate system into a pixel coordinate system through the camera internal parameters in the label information; and training the target object detection model according to the sample image converted into the pixel coordinate system.
In one embodiment, the processor, when executing the computer program, further performs the steps of: in the process of using the trained target object detection model, quantizing the 32-bit or 16-bit data to be calculated in the target object detection model into data in the form of 8-bit integers by using the inference optimizer in the target object detection model in a manner of minimizing the KL divergence; and performing calculation on the data quantized into the form of 8-bit integers according to the target object detection model.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring an image acquired by a monocular camera in an automatic driving scene; inputting the image into the discrete attention residual network in a trained target object detection model, performing upsampling on the image by using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the previous lower layer; performing feature enhancement on the fused feature map; regressing a three-dimensional frame for identifying the target object, the orientation of the target object, and the offset of the center point of the target object based on the enhanced feature map; and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: inputting the image into the discrete attention residual network in the trained target object detection model; in the process of performing upsampling on the image multiple times by using deformable convolution, deforming the convolution kernel according to the geometric shape of the target object in the image for each upsampling, to obtain a convolution kernel adapted to the geometric shape; and upsampling the image based on the convolution kernel adapted to the geometric shape, to obtain a feature map whose receptive field matches the size of the target object.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: performing upsampling on the image multiple times by using deformable convolution, and fusing the feature map obtained after each upsampling with the feature map of the previous lower layer, where the feature map of the previous lower layer is the corresponding feature map that was output after the previous upsampling and had not yet been fused before the current upsampling.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: adding prior frames to the image based on the enhanced feature map, and regressing the length and the width of the three-dimensional frame according to the prior frames; regressing the orientation of the target object, the offset of the center point of the target object, and the height of the three-dimensional frame according to the distance between the center point of the target object and the monocular camera; and obtaining the three-dimensional frame for identifying the target object in the image from the length, the width, and the height of the three-dimensional frame.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: labeling the sample image with label information according to point cloud coordinates, the point cloud coordinates being formed from sample images collected by a laser radar; converting the point cloud coordinates into coordinates in a camera coordinate system through the camera external parameters in the label information; converting the coordinates in the camera coordinate system into a pixel coordinate system through the camera internal parameters in the label information; and training the target object detection model according to the sample image converted into the pixel coordinate system.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: in the process of using the trained target object detection model, quantizing the 32-bit or 16-bit data to be calculated in the target object detection model into data in the form of 8-bit integers by using the inference optimizer in the target object detection model in a manner of minimizing the KL divergence; and performing calculation on the data quantized into the form of 8-bit integers according to the target object detection model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A monocular camera-based three-dimensional target detection method is characterized by comprising the following steps:
acquiring an image acquired by a monocular camera under an automatic driving scene;
inputting the image into a discrete attention residual network in a trained target object detection model, performing upsampling on the image by using deformable convolution, and fusing each feature map obtained after upsampling with a feature map of the previous lower layer;
performing feature enhancement on the fused feature map;
regressing a three-dimensional frame for identifying a target object, the orientation of the target object and the offset of the central point of the target object based on the enhanced feature map;
and adjusting the position of the three-dimensional frame according to the offset, and obtaining a target detection result of the target object.
2. The method of claim 1, wherein the step of inputting the image into a discrete attention residual network in a trained target object detection model and upsampling the image using a deformable convolution further comprises:
inputting the image into the discrete attention residual network in the trained target object detection model;
in the process of performing upsampling on the image multiple times by using deformable convolution, deforming a convolution kernel according to the geometric shape of a target object in the image for each upsampling, to obtain a convolution kernel adapted to the geometric shape;
and upsampling the image based on the convolution kernel adapted to the geometric shape, to obtain a feature map whose receptive field matches the size of the target object.
3. The method according to claim 1, wherein the step of upsampling the image by using deformable convolution and fusing each feature map obtained after the upsampling with the feature map of the previous lower layer further comprises:
performing upsampling on the image multiple times by using deformable convolution, and fusing the feature map obtained after each upsampling with the feature map of the previous lower layer;
wherein the feature map of the previous lower layer is the corresponding feature map that was output after the previous upsampling and had not yet been fused before the current upsampling.
4. The method of claim 1, wherein the step of regressing a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object based on the enhanced feature map further comprises:
adding a prior frame in the image based on the enhanced feature map, and regressing the length and the width of the three-dimensional frame according to the prior frame;
regressing the orientation of the target object, the offset of the central point of the target object and the height of the three-dimensional frame according to the distance between the central point of the target object and the monocular camera;
and obtaining a three-dimensional frame for identifying the target object in the image according to the length and the width of the three-dimensional frame and the height of the three-dimensional frame.
5. The method of claim 1, wherein the target object detection model is obtained by a model training step comprising:
labeling label information on the sample image according to the point cloud coordinates; the point cloud coordinates are formed by sample images collected by a laser radar;
converting the point cloud coordinates into coordinates under a camera coordinate system through camera external parameters of the label information;
converting the coordinates under the camera coordinate system to a pixel coordinate system through the camera internal parameters of the label information;
and training a target object detection model according to the sample image converted into the pixel coordinate system.
6. The method according to any one of claims 1 to 5, further comprising:
in the process of using the trained target object detection model, quantizing 32-bit or 16-bit data to be calculated in the target object detection model into data in an 8-bit integer form by using an inference optimizer in the target object detection model in a mode of minimizing KL divergence;
and calculating the data quantized into the form of 8-bit integers according to the target object detection model.
7. A monocular camera-based three-dimensional object detection apparatus, the apparatus comprising:
the image acquisition module is used for acquiring an image acquired by the monocular camera under an automatic driving scene;
the feature extraction module is used for inputting the image into a discrete attention residual network in a trained target object detection model, performing upsampling on the image by using deformable convolution, and fusing each feature map obtained after upsampling with a feature map of the previous lower layer;
the enhancement module is used for carrying out feature enhancement on the fused feature map;
a regression module for regressing a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object based on the enhanced feature map;
and the detection module is used for adjusting the position of the three-dimensional frame according to the offset and obtaining a target detection result of the target object.
8. The apparatus of claim 7, wherein the feature extraction module is further configured to input the image into the discrete attention residual network in the trained target object detection model; in the process of performing upsampling on the image multiple times by using deformable convolution, deform the convolution kernel according to the geometric shape of the target object in the image for each upsampling, to obtain a convolution kernel adapted to the geometric shape; and upsample the image based on the convolution kernel adapted to the geometric shape, to obtain a feature map whose receptive field matches the size of the target object.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202011631597.1A 2020-12-31 2020-12-31 Monocular camera-based three-dimensional target detection method and device and computer equipment Pending CN112733672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011631597.1A CN112733672A (en) 2020-12-31 2020-12-31 Monocular camera-based three-dimensional target detection method and device and computer equipment


Publications (1)

Publication Number Publication Date
CN112733672A true CN112733672A (en) 2021-04-30

Family

ID=75609926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011631597.1A Pending CN112733672A (en) 2020-12-31 2020-12-31 Monocular camera-based three-dimensional target detection method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN112733672A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837977A (en) * 2021-09-22 2021-12-24 马上消费金融股份有限公司 Object tracking method, multi-target tracking model training method and related equipment
CN114550163A (en) * 2022-02-25 2022-05-27 清华大学 Imaging millimeter wave three-dimensional target detection method based on deformable attention mechanism
WO2023050810A1 (en) * 2021-09-30 2023-04-06 上海商汤智能科技有限公司 Target detection method and apparatus, electronic device, storage medium, and computer program product
CN117336459A (en) * 2023-10-10 2024-01-02 雄安雄创数字技术有限公司 Three-dimensional video fusion method and device, electronic equipment and storage medium
CN117336459B (en) * 2023-10-10 2024-04-30 雄安雄创数字技术有限公司 Three-dimensional video fusion method and device, electronic equipment and storage medium

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020042345A1 (en) * 2018-08-28 2020-03-05 初速度(苏州)科技有限公司 Method and system for acquiring line-of-sight direction of human eyes by means of single camera
US20200160559A1 (en) * 2018-11-16 2020-05-21 Uatc, Llc Multi-Task Multi-Sensor Fusion for Three-Dimensional Object Detection
WO2020221990A1 (en) * 2019-04-30 2020-11-05 Huawei Technologies Co., Ltd. Facial localisation in images
JP2020205048A (en) * 2019-06-18 2020-12-24 富士通株式会社 Object detection method based on deep learning network, apparatus, and electronic device
CN110427867A (en) * 2019-07-30 2019-11-08 华中科技大学 Human facial expression recognition method and system based on residual error attention mechanism
CN110674866A (en) * 2019-09-23 2020-01-10 兰州理工大学 Method for detecting X-ray breast lesion images by using transfer learning characteristic pyramid network
CN110717905A (en) * 2019-09-30 2020-01-21 上海联影智能医疗科技有限公司 Brain image detection method, computer device, and storage medium
CN111126385A (en) * 2019-12-13 2020-05-08 哈尔滨工程大学 Deep learning intelligent identification method for deformable living body small target
CN111369617A (en) * 2019-12-31 2020-07-03 浙江大学 3D target detection method of monocular view based on convolutional neural network
CN111382677A (en) * 2020-02-25 2020-07-07 华南理工大学 Human behavior identification method and system based on 3D attention residual error model
CN111340864A (en) * 2020-02-26 2020-06-26 浙江大华技术股份有限公司 Monocular estimation-based three-dimensional scene fusion method and device
CN111047516A (en) * 2020-03-12 2020-04-21 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN111415342A (en) * 2020-03-18 2020-07-14 北京工业大学 Attention mechanism fused automatic detection method for pulmonary nodule image of three-dimensional convolutional neural network
CN111539484A (en) * 2020-04-29 2020-08-14 北京市商汤科技开发有限公司 Method and device for training neural network
CN111626159A (en) * 2020-05-15 2020-09-04 南京邮电大学 Human body key point detection method based on attention residual error module and branch fusion
CN111695448A (en) * 2020-05-27 2020-09-22 东南大学 Roadside vehicle identification method based on visual sensor
CN111932550A (en) * 2020-07-01 2020-11-13 浙江大学 3D ventricle nuclear magnetic resonance video segmentation system based on deep learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
LONG CHEN: "Semantic Segmentation via Structured Refined Prediction and Dual Global Priors", 2019 IEEE 4th International Conference on Advanced Robotics and Mechatronics (ICARM) *
YUNJI ZHAO: "Fault Diagnosis Based on Space Mapping and Deformable Convolution Networks", IEEE Access, vol. 8 *
YAN JUAN; FANG ZHIJUN; GAO YONGBIN: "3D object detection combining mixed-domain attention and dilated convolution" (in Chinese), Journal of Image and Graphics (中国图象图形学报), no. 06, 16 June 2020 (2020-06-16) *
HOU XIANGDAN; ZHAO YIHAO; LIU HONGPU; GUO HONGYONG; YU XIXIN; DING MENGYUAN: "UNet optic disc segmentation with a fused residual attention mechanism" (in Chinese), Journal of Image and Graphics (中国图象图形学报), no. 09, 16 September 2020 (2020-09-16) *
WANG WENCHAO: "Sketch image retrieval based on deformable convolution" (in Chinese), Computer Systems & Applications (计算机系统应用), no. 07, 15 July 2020 (2020-07-15) *
SU JUNXIONG; JIAN XUETING; LIU WEI; HUA JUNDA; ZHANG SHENGXIANG: "Gesture recognition method based on deformable convolutional neural networks" (in Chinese), Computer and Modernization (计算机与现代化), no. 04, 20 April 2018 (2018-04-20) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837977A (en) * 2021-09-22 2021-12-24 马上消费金融股份有限公司 Object tracking method, multi-target tracking model training method and related equipment
WO2023050810A1 (en) * 2021-09-30 2023-04-06 上海商汤智能科技有限公司 Target detection method and apparatus, electronic device, storage medium, and computer program product
CN114550163A (en) * 2022-02-25 2022-05-27 清华大学 Imaging millimeter wave three-dimensional target detection method based on deformable attention mechanism
CN114550163B (en) * 2022-02-25 2023-02-03 清华大学 Imaging millimeter wave three-dimensional target detection method based on deformable attention mechanism
CN117336459A (en) * 2023-10-10 2024-01-02 雄安雄创数字技术有限公司 Three-dimensional video fusion method and device, electronic equipment and storage medium
CN117336459B (en) * 2023-10-10 2024-04-30 雄安雄创数字技术有限公司 Three-dimensional video fusion method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112733672A (en) Monocular camera-based three-dimensional target detection method and device and computer equipment
CN111797657A (en) Vehicle peripheral obstacle detection method, device, storage medium, and electronic apparatus
CN110889464B (en) Neural network training method for detecting target object, and target object detection method and device
US20160360186A1 (en) Methods and systems for human action recognition using 3d integral imaging
CN110827202A (en) Target detection method, target detection device, computer equipment and storage medium
CN112036455B (en) Image identification method, intelligent terminal and storage medium
CN109741241B (en) Fisheye image processing method, device, equipment and storage medium
CN111222387B (en) System and method for object detection
CN114998856B (en) 3D target detection method, device, equipment and medium for multi-camera image
CN112200056A (en) Face living body detection method and device, electronic equipment and storage medium
WO2023284255A1 (en) Systems and methods for processing images
US11605220B2 (en) Systems and methods for video surveillance
CN114359361A (en) Depth estimation method, depth estimation device, electronic equipment and computer-readable storage medium
CN113496472A (en) Image defogging model construction method, road image defogging device and vehicle
CN112241963A (en) Lane line identification method and system based on vehicle-mounted video and electronic equipment
CN112132753A (en) Infrared image super-resolution method and system for multi-scale structure guide image
CN115115552B (en) Image correction model training method, image correction device and computer equipment
CN110633692A (en) Pedestrian identification method and related device for unmanned aerial vehicle aerial photography
CN116543143A (en) Training method of target detection model, target detection method and device
CN111382654A (en) Image processing method and apparatus, and storage medium
CN112818743B (en) Image recognition method and device, electronic equipment and computer storage medium
CN115222621A (en) Image correction method, electronic device, storage medium, and computer program product
CN114882465A (en) Visual perception method and device, storage medium and electronic equipment
CN113963060A (en) Vehicle information image processing method and device based on artificial intelligence and electronic equipment
CN112634331A (en) Optical flow prediction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination