WO2023050810A1 - Target detection method and apparatus, electronic device, storage medium, and computer program product

Info

Publication number: WO2023050810A1
Application number: PCT/CN2022/090957
Authority: WO (WIPO, PCT)
Other languages: French (fr), Chinese (zh)
Prior art keywords: dimensional, target object, information, detection, depth
Inventors: 刘配, 杨国润, 王哲, 石建萍
Applicant: 上海商汤智能科技有限公司

Classifications

    • G06F 18/23: Pattern recognition; Analysing; Clustering techniques
    • G06F 18/23213: Clustering techniques; Non-hierarchical techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06N 3/045: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/08: Computing arrangements based on biological models; Neural networks; Learning methods

Definitions

  • The present disclosure relates to the technical field of image processing, and in particular to a target detection method, device, electronic equipment, storage medium, and computer program product.
  • Compared with two-dimensional (2D) target detection tasks, three-dimensional (3D) target detection tasks are more difficult and more complex: they require detecting, from the 3D scene, the 3D geometric information and semantic information of the target, mainly including the target's length, width, height, center point, and orientation angle.
  • Because it is economical and practical, monocular-image 3D target detection is widely used in various fields, such as the field of autonomous driving.
  • However, 3D object detection techniques based on monocular images mainly rely on external subtasks responsible for tasks such as 2D object detection and depth map estimation. Because these subtasks are trained separately, they introduce accuracy losses that limit the performance upper bound of the network model and cannot meet the accuracy requirements of 3D detection.
  • Embodiments of the present disclosure provide at least one object detection method, device, electronic device, storage medium, and computer program product, so as to improve the accuracy of 3D object detection.
  • In a first aspect, an embodiment of the present disclosure provides a target detection method, the method comprising: performing feature extraction on an image to be detected to obtain a feature map of the image to be detected; based on the feature map, generating a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground; and, based on the depth map and the feature map, determining the three-dimensional detection information of the target object.
  • With the above target detection method, not only can feature extraction be performed on the image to be detected, but a depth map corresponding to the projection area where the target object projects onto the ground can also be generated from the extracted feature map, and the 3D detection information of the target object then determined based on the depth map and the feature map.
  • Because the generated depth map points to the target object in the image to be detected and corresponds to the projection area where the target object is projected onto the ground, the projection area is associated with the target object to a certain extent; the depth map of this local ground can therefore serve as a guide when three-dimensional detection is performed on the feature map of the target object on that local ground, improving detection accuracy.
  • In a possible implementation, after the feature map of the image to be detected is obtained, the method further includes: detecting the feature map to obtain two-dimensional detection information for the target object.
  • Determining the three-dimensional detection information of the target object based on the depth map and the feature map then includes: determining three-dimensional prior frame information for the target object based on the two-dimensional detection information; and, based on the three-dimensional prior frame information, the depth map, and the feature map, determining the 3D detection frame information of the target object.
  • The 3D detection here can incorporate the information of a 3D prior frame. The 3D prior frame constrains the starting position of the 3D detection to a certain extent, so that the 3D detection frame information is searched for near that starting position, further improving the accuracy of the 3D detection.
  • In a possible implementation, the two-dimensional detection information includes the two-dimensional detection frame information of the target object and the category information of the target object, and determining the 3D prior frame information for the target object based on the two-dimensional detection information includes: based on the category information, determining the clustering information of each subcategory included in the category to which the target object belongs; and, according to the clustering information of each subcategory and the two-dimensional detection frame information of the target object, determining the three-dimensional prior frame information for the target object.
  • The three-dimensional prior frame here can be determined in combination with the category information of the target object. The size and position of the corresponding three-dimensional prior frame may differ for different categories of target objects, so the category information can assist in determining the position of the three-dimensional prior frame with high accuracy.
  • In a possible implementation, determining the three-dimensional prior frame information for the target object according to the clustering information of each subcategory and the two-dimensional detection frame information of the target object includes: for each subcategory, determining the depth value corresponding to that subcategory based on the cluster height value included in its clustering information and the width value included in the two-dimensional detection frame information; and, based on the clustering information of the subcategory and the corresponding depth value, determining one piece of three-dimensional prior frame information for the target object.
  • In a possible implementation, determining the 3D detection frame information of the target object based on the 3D prior frame information, the depth map, and the feature map includes: determining a 3D detection frame offset according to the depth map and the feature map; and determining the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset.
  • In a possible implementation, determining the offset of the 3D detection frame according to the depth map and the feature map includes: based on the position range included in the two-dimensional detection frame information, extracting from the depth map and the feature map, respectively, a depth map and a feature map matching that position range; the three-dimensional detection frame offset is then determined based on the depth map and feature map matched with the position range.
  • In a possible implementation, determining the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset includes: determining the 3D detection frame information of the target object based on each piece of 3D prior frame information, the weight corresponding to each piece of 3D prior frame information, and the 3D detection frame offset.
  • In a possible implementation, the method further includes determining the weight corresponding to each piece of three-dimensional prior frame information; this determination includes: based on the prediction probability of each subcategory, determining the weight of the three-dimensional prior frame information corresponding to that subcategory.
  • The greater the prediction probability, the more likely the target object belongs to the corresponding subcategory, and the higher the weight that can be given to the corresponding three-dimensional prior frame information; in this way, the prediction accuracy of the final 3D detection frame is further improved.
  • In a possible implementation, detecting the feature map to obtain the two-dimensional detection information for the target object includes: determining a two-dimensional detection frame offset according to the feature map; and determining the two-dimensional detection information of the target object based on the preset two-dimensional prior frame information and the two-dimensional detection frame offset.
  • In a possible implementation, the depth map is determined by a trained depth map generation network; the depth map generation network is obtained by training with image samples and with annotated depth maps determined from the three-dimensional annotation frame information of the target objects marked in those image samples.
  • In a possible implementation, the three-dimensional annotation frame information of the target object includes the position coordinates and depth value of the center point of the bottom surface of the annotation frame, and the annotated depth map is acquired according to the following steps: based on the correspondence between the 3D coordinate system of the annotation frame and the ground coordinate system of the bottom-center point, the annotated three-dimensional frame information is projected onto the ground to obtain the projection area of the target object on the ground and the extended area in which that projection area lies; the depth value of each three-dimensional annotation point on the extended area is determined from the position coordinates and depth value of the bottom-center point; each three-dimensional annotation point on the extended area, expressed in the camera coordinate system, is projected onto the pixel plane in the image coordinate system to obtain projection points in the pixel plane; and the annotated depth map is obtained based on the depth values of the three-dimensional annotation points on the extended area and the projection points in the pixel plane.
  • The annotated depth map here is obtained by combining ground projection operations with conversion operations between coordinate systems. The extended area can completely cover the local ground area that contains the target object, and the annotated depth map obtained from the 3D projection result of the extended area can reflect the depth information of the extended area, including the local ground area, which can specifically assist the three-dimensional detection of the target object on the corresponding local ground area.
  • In a possible implementation, determining the depth value of each three-dimensional annotation point on the extended area based on the position coordinates and depth value of the bottom-center point included in the three-dimensional annotation frame information includes: determining the depth value and position coordinates of the bottom-center point as the depth value and position coordinates of the center point of the extended area, respectively; and, with the depth value of the center point of the extended area as the initial depth value, determining the depth value of each three-dimensional annotation point in the extended area at a preset depth interval.
  • An embodiment of the present disclosure also provides a target detection device, the device comprising: an extraction module configured to perform feature extraction on an image to be detected to obtain a feature map of the image to be detected; a generation module configured to generate, based on the feature map, a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground; and a first detection module configured to determine the three-dimensional detection information of the target object based on the depth map and the feature map.
  • An embodiment of the present disclosure further provides an electronic device, including a processor, a memory, and a bus; the memory stores machine-readable instructions executable by the processor, and when the electronic device runs, the processor communicates with the memory through the bus. When the machine-readable instructions are executed by the processor, the steps of the target detection method described in the first aspect, or in any one of its implementation manners, are performed.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the target detection method described in the first aspect, or in any one of its implementation manners, are executed.
  • Embodiments of the present disclosure further provide a computer program product, including a computer-readable storage medium storing program code; when the instructions included in the program code are executed by the processor of a computer device, the steps of the target detection method described in the first aspect can be implemented.
  • FIG. 1 shows a flow chart of a method for target detection provided by an embodiment of the present disclosure
  • FIG. 2A shows a flowchart of a method for determining two-dimensional detection information of a target object provided by an embodiment of the present disclosure
  • FIG. 2B shows a flow chart of a method for determining 3D prior frame information for a target object provided by an embodiment of the present disclosure
  • FIG. 2C shows a flow chart of a method for determining one piece of three-dimensional prior frame information for a target object provided by an embodiment of the present disclosure
  • FIG. 2D shows a flow chart of a method for determining the three-dimensional detection frame information of a target object provided by an embodiment of the present disclosure
  • FIG. 2E shows a flow chart of a method for determining the offset of a three-dimensional detection frame of a target object provided by an embodiment of the present disclosure
  • FIG. 2F shows a flowchart of a method for training a depth map generation network provided by an embodiment of the present disclosure
  • FIG. 2G shows a flow chart of a method for obtaining a marked depth map provided by an embodiment of the present disclosure
  • FIG. 2H shows a flowchart of a method for determining the depth value of each three-dimensional label point on the extended area provided by an embodiment of the present disclosure
  • FIG. 2I shows a schematic diagram of the application of a method for generating local ground depth labels provided by an embodiment of the present disclosure
  • FIG. 2J shows a schematic diagram of the application of a target detection method provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of a target detection device provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
  • In the related art, detection accuracy has reached a very high level, and the commonly used 3D target detection methods are based on LiDAR data; however, the expensive data-collection equipment makes large-scale application and deployment difficult.
  • In contrast, a monocular-image 3D target detection method can use an ordinary vehicle-mounted camera, which is economical and readily available. The task of monocular 3D detection is to detect the 3D geometric information and semantic information of the target object from the 3D scene, including the target object's length, width, height, center point, and orientation angle.
  • Monocular image-based 3D object detection technology relies on external subtasks responsible for tasks such as 2D object detection and depth map estimation. Because the subtasks are trained separately, they introduce accuracy losses that limit the performance upper bound of the network model and cannot meet the accuracy requirements of 3D detection, so this technology is difficult to use in practical applications.
  • The difficulty of current 3D object detection methods lies in the depth prediction of the 3D detection frames. The labels for 3D target detection provide depth information only for the center point or corners of the target frame, which is difficult for the network to learn from and cannot yield more, or more accurate, depth information.
  • Some 3D object detection methods in the related art guide the learning of the 3D detection frame through subtasks such as depth estimation, pseudo point cloud generation, and semantic segmentation. However, these subtasks require a large number of accurate depth labels, which is difficult to satisfy in practical applications; moreover, the accuracy of the subtasks limits the performance upper bound of the 3D target detection, making it unreliable.
  • To this end, the present disclosure provides a target detection method, device, electronic device, storage medium, and computer program product, so as to improve the accuracy of 3D object detection.
  • The execution subject of the target detection method provided in the embodiments of the present disclosure is generally a computer device with certain computing capabilities. The computer device may be, for example, a terminal device, a server, or other processing device; the terminal device may be user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the target detection method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • FIG. 1 is a flow chart of a target detection method provided by an embodiment of the present disclosure. The method is executed by an electronic device and includes steps S101 to S103, wherein:
  • Step S101: perform feature extraction on the image to be detected to obtain a feature map of the image to be detected;
  • Step S102: based on the feature map, generate a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground;
  • Step S103: based on the depth map and the feature map, determine the three-dimensional detection information of the target object.
  • The above target detection method can be applied in the field of computer vision, for example to vehicle detection in autonomous driving and to drone detection. Considering the wide application of autonomous driving, vehicle detection is taken as the example below.
  • The 3D object detection technology in the related art relies on external subtasks responsible for tasks such as 2D object detection and depth map estimation; because the subtasks are trained separately, they introduce accuracy losses, resulting in low final 3D detection accuracy. In view of this, the embodiments of the present disclosure provide a solution that combines a local depth map with a feature map for three-dimensional detection, achieving high detection accuracy.
  • The image to be detected in the embodiments of the present disclosure may be an image collected in a target scene; different application scenes correspond to different collected images. Taking autonomous driving as an example, the image to be detected can be an image collected by the camera device installed on the driverless vehicle while it is driving, and the image can include any target objects within the camera device's field of view. The target object here may be a vehicle ahead or a pedestrian ahead, which is not limited here.
  • The embodiments of the present disclosure may use various feature extraction methods to extract the feature map of the image to be detected. For example, the feature map can be extracted from the image to be detected through image processing, or by using a trained feature extraction network. The following description takes the feature extraction network as an example.
  • The feature extraction network here can be a convolutional neural network (CNN). In practical applications, it can be implemented with a CNN model composed of convolutional blocks, dense blocks, and transition blocks. A convolutional block can consist of a convolutional layer, a batch normalization layer, and a rectified linear unit (ReLU) layer; a dense block can consist of multiple convolutional blocks with multiple skip connections; and a transition block generally consists of a convolutional block and an average pooling layer. The exact composition of these blocks, for example how many convolutional layers and average pooling layers are used, can be determined according to the application scenario and is not limited here.
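  • As an illustration of the block structure described above, the following is a minimal PyTorch sketch; the class names, channel counts, and layer counts are placeholders chosen for the example, not values taken from the patent:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Sequential):
    # Convolutional layer + batch normalization + linear rectification (ReLU).
    def __init__(self, c_in, c_out, stride=1):
        super().__init__(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

class DenseBlock(nn.Module):
    # Multiple convolutional blocks joined by skip connections (concatenation).
    def __init__(self, c_in, growth, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            ConvBlock(c_in + i * growth, growth) for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # skip connection
        return x

class TransitionBlock(nn.Sequential):
    # Convolutional block followed by an average pooling layer.
    def __init__(self, c_in, c_out):
        super().__init__(ConvBlock(c_in, c_out), nn.AvgPool2d(kernel_size=2))
```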
  • After the feature map is extracted, the embodiments of the present disclosure can generate a local depth map from it. The local depth map corresponds to the area where the target object in the image to be detected projects onto the ground, and therefore carries the local ground depth information associated with the target object. Because the local ground is bound to the target object to a certain extent, combining this depth map with the feature map extracted above allows the target object to be detected more accurately.
  • The above local depth map may be determined by using a trained depth map generation network. During training, the network learns the correspondence between the features of an image sample and the per-pixel depths of its annotated depth map; once trained, feeding it the extracted feature map yields the depth map corresponding to the ground projection area associated with the target object.
  • In implementation, the feature map and the depth map can be cropped using ROI-Align (region-of-interest alignment), and the three-dimensional detection of the target object can then be performed on the cropped depth map and feature map corresponding to the target object.
  • The 3D detection in the embodiments of the present disclosure may be based on residual prediction against a 3D prior frame. In residual prediction, the information of the original 3D prior frame guides the subsequent detection: the prior frame serves as the initial position, and the 3D detection frame is searched for near that position. Especially when the accuracy of the 3D prior frame is relatively high, this significantly improves detection efficiency compared with direct 3D detection.
  • Here, the above-mentioned 3D prior frame may be determined based on the 2D detection information, so that 3D detection can be realized based on the 3D prior frame information, the depth map, and the feature map.
  • In the embodiments of the present disclosure, the two-dimensional detection information of the target object can be determined according to the steps shown in FIG. 2A, wherein the steps include:
  • Step S201: determine the offset of the two-dimensional detection frame according to the feature map;
  • Step S202: based on the preset two-dimensional prior frame information and the offset of the two-dimensional detection frame, determine the two-dimensional detection information of the target object.
  • In implementation, the two-dimensional detection information can be determined from the result of combining the two-dimensional detection frame offset with the preset two-dimensional prior frame information.
  • In the embodiments of the present disclosure, the two-dimensional detection information may be obtained by applying a trained first target detection network to the feature map.
  • The first target detection network can be trained on either of two correspondences: between the feature map of an image sample and its two-dimensional label information, or between the feature map of an image sample and an offset (the difference between the two-dimensional label frame and the two-dimensional prior frame). With the former, the two-dimensional detection information of the target object in the image to be detected can be determined directly; with the latter, the offset is determined first, and the sum of the offset and the two-dimensional prior frame then gives the two-dimensional detection information of the target object.
  • In implementation, the determined two-dimensional detection information may include the two-dimensional detection frame position information (x_2d, y_2d, w_2d, h_2d), the center point position information (x_p, y_p), the orientation angle (θ_3d), the category information (cls) to which the target object belongs, and other information related to two-dimensional detection, which is not limited here.
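  • A minimal sketch of this residual decoding, assuming the common convention that the detection frame is the element-wise sum of the prior frame parameters and the predicted offset (the patent states only that the sum of the offset and the 2D prior frame gives the detection information); all numbers are hypothetical:

```python
import numpy as np

def decode_2d(prior, offset):
    # (x, y, w, h) of the preset 2D prior frame plus the predicted residual
    # gives the 2D detection frame (x_2d, y_2d, w_2d, h_2d).
    return np.asarray(prior) + np.asarray(offset)

box_2d = decode_2d(prior=[120.0, 80.0, 64.0, 48.0], offset=[3.2, -1.5, 4.0, 2.1])
print(box_2d)  # [123.2  78.5  68.   50.1]
```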
  • The first target detection network here can perform two-dimensional residual prediction. It can first reduce dimensionality through a convolutional layer and a linear rectification layer, and then perform residual prediction of the two-dimensional detection frame through multiple convolutional layers, which yields higher prediction accuracy.
  • In the embodiments of the present disclosure, the three-dimensional prior frame information for the target object can be determined according to the steps shown in FIG. 2B, wherein the steps include:
  • Step S301: based on the category information of the target object, determine the clustering information of each subcategory included in the category to which the target object belongs;
  • Step S302: according to the clustering information of each subcategory and the two-dimensional detection frame information where the target object is located, determine the three-dimensional prior frame information for the target object.
  • Here, the three-dimensional prior frame information is determined by combining the clustering information of each subcategory under the category to which the target object belongs with the two-dimensional detection frame information where the target object is located.
  • This is because targets in the same category can differ greatly in their 3D detection results across subcategories: for example, among targets that all belong to the vehicle category, the 3D detection frame of the car subcategory differs greatly in size from that of the large-truck subcategory.
  • Therefore, subcategories can be divided in advance, and the corresponding three-dimensional prior frame information determined from the clustering information of each divided subcategory. In implementation, once the category information of the target object is determined, the clustering result corresponding to that category information can be determined.
  • Taking vehicles as an example, vehicle image samples covering various subcategories may be collected in advance, and information such as the length, width, and height of each vehicle determined from them.
  • Clustering can then be performed on the height values, so that vehicle image samples falling within the same height range are grouped into one subcategory, and the clustering information of that subcategory is determined.
  • In implementation, clustering methods such as the K-means algorithm can be used to realize the above clustering process.
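  • For example, the height-based subcategory clustering could be realized as follows; the three clusters and the sample dimensions are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (length, width, height) statistics of collected vehicle samples, in meters.
dims = np.array([
    [4.5, 1.8, 1.5], [4.3, 1.7, 1.4],    # car-like samples
    [6.8, 2.2, 2.6], [7.1, 2.3, 2.7],    # van-like samples
    [12.0, 2.5, 3.8], [11.5, 2.5, 3.9],  # truck-like samples
])

# Cluster on the height value, so samples in the same height range form one subcategory.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dims[:, 2:3])

# Clustering information per subcategory, e.g. the mean dimensions of its members.
for k in range(3):
    members = dims[km.labels_ == k]
    print(f"subcategory {k}: mean (l, w, h) = {members.mean(axis=0)}")
```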
  • In the embodiments of the present disclosure, the process of determining the three-dimensional prior frame information from the clustering information and the two-dimensional detection frame information where the target object is located can follow the steps shown in FIG. 2C, wherein the steps include:
  • Step S3021: for each subcategory, determine the depth value corresponding to that subcategory based on the cluster height value included in its clustering information and the width value included in the two-dimensional detection frame information where the target object is located;
  • Step S3022: based on the clustering information of the subcategory and the depth value corresponding to the subcategory, determine one piece of 3D prior frame information for the target object.
  • Each subcategory can correspond to one piece of 3D prior frame information; information such as the size of the 3D prior frame is determined from the clustering information of the corresponding subcategory, while the associated depth information is determined from the cluster height value and the width value included in the two-dimensional detection frame information.
  • In implementation, this can be realized by first taking the ratio of the cluster height value to the width value, and then multiplying by the focal length of the camera device.
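  • Read literally, the depth of a subcategory's prior frame is the camera focal length times the ratio of the clustered (real-world) height to the width value of the 2D detection frame. A pinhole-camera sketch of that computation, with hypothetical numbers:

```python
def subcategory_depth(cluster_height_m, box_width_px, focal_px):
    # Similar-triangles estimate: depth = focal length * (real size / image size).
    return focal_px * cluster_height_m / box_width_px

z_prior = subcategory_depth(cluster_height_m=1.5, box_width_px=64.0, focal_px=720.0)
# z_prior = 16.875 (meters), used as the depth of this subcategory's 3D prior frame.
```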
  • After the 3D prior frame information is determined, the embodiments of the present disclosure can combine this information with the depth map and the feature map to determine the three-dimensional detection frame information, as shown in FIG. 2D, through the following steps:
  • Step S1031: determine the offset of the three-dimensional detection frame according to the depth map and the feature map;
  • Step S1032: determine the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset.
  • In implementation, a second target detection network can be used to realize the three-dimensional detection: the 3D detection frame offset output by the second target detection network is obtained, and the 3D detection frame information of the target object is then determined based on the 3D prior frame information and that offset.
  • Here, the above three-dimensional detection frame information may include the shape information (w_3d, h_3d, l_3d) and the depth information (z_3d) of the detection frame.
  • In addition, the three-dimensional prediction can determine further information about the target object; for example, it can determine which subcategory of the target's category the object belongs to, such as whether a vehicle-category target is a car or a truck.
  • In implementation, a 3D detection frame offset can be predicted based on each 3D prior frame. Considering that different 3D prior frames correspond to different subcategories, and that the prediction probabilities of those subcategories also differ, weights can be assigned to the 3D prior frame information of each subcategory according to the subcategory prediction probabilities; the 3D detection frame information of the target object is then determined based on each piece of 3D prior frame information, the weight corresponding to each piece, and the 3D detection frame offset.
  • Subcategories with higher predicted probabilities can be given higher weights to strengthen the role of their 3D prior frames in the subsequent 3D detection, while subcategories with lower predicted probabilities can be given smaller weights to weaken the role of their prior frames, making the determined 3D detection frame information more accurate.
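  • One plausible reading of this weighting scheme is sketched below: the subcategory probabilities weight the per-subcategory prior frames, and the predicted residual is added on top. The exact combination rule is not spelled out in the patent, so the formula and all numbers here are assumptions:

```python
import numpy as np

def decode_3d(priors, probs, offset):
    """priors: (K, 4) array, one (w_3d, h_3d, l_3d, z_3d) prior frame per subcategory;
    probs: (K,) predicted subcategory probabilities, used as weights;
    offset: (4,) predicted 3D detection frame residual."""
    weights = probs / probs.sum()                  # higher probability -> higher weight
    weighted_prior = (weights[:, None] * priors).sum(axis=0)
    return weighted_prior + offset                 # residual prediction

priors = np.array([[1.8, 1.5, 4.5, 15.0],    # hypothetical car-like prior
                   [2.3, 2.7, 7.0, 14.0],    # hypothetical van-like prior
                   [2.5, 3.8, 12.0, 13.5]])  # hypothetical truck-like prior
box_3d = decode_3d(priors, probs=np.array([0.7, 0.2, 0.1]),
                   offset=np.array([0.05, -0.02, 0.1, 0.4]))
```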
  • In the embodiments of the present disclosure, the depth map and the feature map can be cropped first and the 3D detection then performed, as shown in FIG. 2E, through the following steps:
  • Step S1031a: based on the position range included in the two-dimensional detection frame information where the target object is located, extract from the depth map and the feature map, respectively, a depth map and a feature map matching that position range;
  • Step S1031b: determine the offset of the 3D detection frame based on the depth map and feature map matched with the position range.
  • Based on the position range included in the two-dimensional detection frame information, the corresponding portions of the depth map and the feature map can be cropped out, that is, a local depth map and a local feature map pointing to the target object are obtained. In this way, the corresponding 3D detection frame offset can be determined based on the local depth map and the local feature map; the offset here can likewise be determined using the second target detection network, and because the input focuses on the target object, the prediction accuracy is higher.
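  • A sketch of this cropping step using torchvision's ROI-Align operator; the map sizes, the 2D box, and the 1/4 feature stride are placeholder assumptions:

```python
import torch
from torchvision.ops import roi_align

feat  = torch.randn(1, 256, 96, 320)  # feature map of the image to be detected
depth = torch.randn(1, 1, 96, 320)    # generated local ground depth map

# Position range of the 2D detection frame as (batch_index, x1, y1, x2, y2)
# in input-image pixels; spatial_scale maps it onto the 1/4-resolution maps.
rois = torch.tensor([[0.0, 480.0, 160.0, 736.0, 352.0]])

feat_crop  = roi_align(feat,  rois, output_size=(7, 7), spatial_scale=0.25, aligned=True)
depth_crop = roi_align(depth, rois, output_size=(7, 7), spatial_scale=0.25, aligned=True)
# Both crops (here 1 x C x 7 x 7) are fed to the second target detection network.
```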
  • Before the above detection is performed, the first target detection network and the second target detection network need to be trained. In combination with the corresponding supervisory signals (that is, the labeled frame information), the corresponding loss function values can be determined; based on these loss function values, network training can be guided by backpropagation, which is not limited here.
  • Similarly, the embodiments of the present disclosure set a corresponding supervisory signal (that is, the annotated depth map) for the depth map generated by the depth map generation network. In implementation, the training process of the depth map generation network is shown in FIG. 2F and includes the following steps:
  • Step S401: acquire an image sample and an annotated depth map determined based on the three-dimensional annotation frame information of the target object annotated in the image sample;
  • Step S402: perform feature extraction on the image sample to obtain a feature map of the image sample;
  • Step S403: input the feature map of the image sample into the depth map generation network to be trained, obtain the depth map output by the network, and determine the loss function value based on the similarity between the output depth map and the annotated depth map;
  • Step S404: when the loss function value is greater than a preset threshold, adjust the network parameter values of the depth map generation network and input the feature map of the image sample into the adjusted network again, until the loss function value is less than or equal to the preset threshold.
  • The image samples here are acquired in a manner similar to that of the image to be detected. For the extraction of the feature map of an image sample, refer to the feature map extraction process for the image to be detected described above.
  • In this way, the embodiments of the present disclosure can determine the loss function value based on the similarity between the depth map output by the depth map generation network and the annotated depth map, and adjust the network parameter values of the depth map generation network according to the loss function value, so that the network output tends to become the same as, or closer to, the annotated depth map.
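  • A minimal training-loop sketch following steps S401 to S404; the one-layer network, the random data, the mean-squared-error similarity loss, and the threshold are all stand-ins for details the patent leaves unspecified:

```python
import torch
import torch.nn as nn

depth_net = nn.Conv2d(256, 1, kernel_size=3, padding=1)  # stand-in depth map generation network
optimizer = torch.optim.Adam(depth_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()   # similarity between output and annotated depth map
threshold = 0.01         # hypothetical preset loss threshold

feat_sample = torch.randn(1, 256, 96, 320)  # feature map of an image sample (S402)
label_depth = torch.rand(1, 1, 96, 320)     # annotated depth map (S401)

loss = torch.tensor(float("inf"))
while loss.item() > threshold:              # S404: repeat until loss <= threshold
    optimizer.zero_grad()
    pred_depth = depth_net(feat_sample)     # S403: depth map output by the network
    loss = loss_fn(pred_depth, label_depth)
    loss.backward()                         # adjust the network parameter values
    optimizer.step()
```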
  • The annotated depth map described in step S401 above can be obtained according to the steps shown in FIG. 2G:
  • Step S4011: based on the correspondence between the 3D coordinate system where the 3D annotation frame is located and the ground coordinate system where the bottom-center point of the frame is located, project the annotated 3D frame information onto the ground to obtain the projection area of the target object on the ground and the extended area in which the projection area lies;
  • Step S4012: determine the depth value of each three-dimensional annotation point on the extended area based on the position coordinates and depth value of the bottom-center point included in the 3D annotation frame information;
  • Step S4013: based on the correspondence between the camera coordinate system and the image coordinate system, project the three-dimensional annotation points on the extended area from the camera coordinate system onto the pixel plane in the image coordinate system to obtain the projection points in the pixel plane;
  • Step S4014: obtain the annotated depth map based on the depth values of the three-dimensional annotation points on the extended area and the projection points in the pixel plane.
  • In implementation, determining the depth value of each three-dimensional annotation point on the extended area based on the position coordinates and depth value of the bottom-center point included in the 3D annotation frame information, as shown in FIG. 2H, includes the following steps:
  • Step S41: determine the depth value and position coordinates of the bottom-center point of the annotation frame as the depth value and position coordinates of the center point of the extended area, respectively;
  • Step S42: with the position coordinates of the center point of the extended area determined, take the depth value of that center point as the initial depth value and determine the depth value of each three-dimensional annotation point in the extended area at a preset depth interval.
  • An embodiment of the present disclosure provides a method for generating local ground depth labels, illustrated in FIG. 2I. The depth information of the surrounding ground (corresponding to the extended area) can be derived from the position of the bottom-center point of the three-dimensional annotation frame included in the target object's annotation information, since that center point falls on the ground.
  • Considering that the bottom-center point of the three-dimensional annotation frame is at the same height as the surrounding ground, as shown in panel (b) of FIG. 2I, a large number of three-dimensional annotation points 23 can be generated within an extended area 22 around the center point. These three-dimensional points include the center point itself and carry the initial depth value of the center of the extended area 22, with each annotation point 23 on the extended area determined at the preset depth interval.
  • Also in panel (b) of FIG. 2I, the projective relationship is used to project each three-dimensional annotation point 23 onto the pixel plane, obtaining the projection point corresponding to each annotation point and recording the depth value of the annotation point 23. Since several three-dimensional points may project to the same pixel, the average depth value of the at least one corresponding three-dimensional annotation point can be computed for each projection point, yielding the annotated depth map 24 shown in panel (c) of FIG. 2I.
  • From the annotated depth map 24, the depth information of the surrounding ground (corresponding to the extended area) can be obtained.
  • In implementation, the projection from the camera coordinate system to the pixel plane can be written as [x_p, y_p, 1]^T ∝ P_rect · R_rect · [x_3d, y_3d, z_3d, 1]^T, where (x_3d, y_3d, z_3d) are the camera coordinates of a three-dimensional annotation point, (x_p, y_p) is the projection point of that annotation point on the pixel plane, and P_rect and R_rect are the projection matrix and the rotation correction matrix, respectively.
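  • A numpy sketch of steps S4011 to S4014 under the projection above, assuming KITTI-style matrices (P_rect of shape 3x4, R_rect of shape 4x4); the extent of the extended area and the preset interval are hypothetical:

```python
import numpy as np
from collections import defaultdict

def annotated_depth_map(center_xyz, P_rect, R_rect, img_hw, half_extent=3.0, step=0.5):
    """center_xyz: camera coordinates of the annotation frame's bottom-center point."""
    cx, cy, cz = center_xyz
    # S41/S42: lay out 3D annotation points over the extended area at the same
    # height as the center point, spaced at the preset interval; depths start
    # from the center point's depth cz.
    xs = np.arange(cx - half_extent, cx + half_extent + step, step)
    zs = np.arange(cz - half_extent, cz + half_extent + step, step)
    pts = np.array([[x, cy, z, 1.0] for x in xs for z in zs])  # homogeneous coords

    # S4013: project the camera-frame points onto the pixel plane.
    uvw = (P_rect @ R_rect @ pts.T).T
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)

    # S4014: average the depths of all 3D points that hit the same pixel.
    buckets = defaultdict(list)
    for (u, v), z in zip(uv, pts[:, 2]):
        if 0 <= u < img_hw[1] and 0 <= v < img_hw[0]:
            buckets[(v, u)].append(z)
    depth_map = np.zeros(img_hw, dtype=np.float32)
    for (v, u), z_list in buckets.items():
        depth_map[v, u] = np.mean(z_list)
    return depth_map
```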
  • After training is completed, inputting the feature map of the image to be detected into the trained depth map generation network yields the depth map corresponding to the projection area where the target object in the image to be detected projects onto the ground; the three-dimensional prediction of the target object can then be realized in combination with the feature map and the 3D prior frame information.
  • In implementation, as shown in FIG. 2J, the feature map of the image to be detected is first extracted by the feature extraction network 32. Then, on the one hand, two-dimensional detection is performed by the first target detection network 33 to obtain the two-dimensional detection information of the target object; on the other hand, the depth map generation network produces the depth map 35.
  • Based on the two-dimensional detection information of the target object, the three-dimensional prior frame information can be determined; FIG. 2J shows, as an example, three pieces of 3D prior frame information determined for the corresponding three subcategories 36.
  • Based on the two-dimensional detection information, the depth map and the feature map are cropped in the ROI-Align manner, and the cropped depth map and feature map are input into the second target detection network 37 to obtain the corresponding 3D detection frame offsets, such as the Δ(w,h,l)_3d and Δz_3d shown in FIG. 2J.
  • Combining the above 3D detection frame offsets with the 3D prior frame information, the 3D detection information can be determined. In practical applications, this three-dimensional detection information can be presented on the image to be detected.
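  • Putting the FIG. 2J pipeline together as a runnable toy sketch; every module below is a random or hard-coded stand-in for the corresponding trained network, and all shapes and numbers are hypothetical:

```python
import numpy as np

feature_extractor = lambda img: np.random.rand(256, 96, 320)        # network 32
first_detector = lambda f: [((480, 160, 736, 352), "vehicle",       # network 33:
                             np.array([0.7, 0.2, 0.1]))]            # box, category, subcat probs
depth_generator = lambda f: np.random.rand(1, 96, 320)              # depth map 35
second_detector = lambda fc, dc: np.array([0.05, -0.02, 0.1, 0.4])  # network 37: offsets

priors_by_category = {"vehicle": np.array([[1.8, 1.5, 4.5, 15.0],   # three subcategory
                                           [2.3, 2.7, 7.0, 14.0],   # prior frames 36
                                           [2.5, 3.8, 12.0, 13.5]])}

def crop(m, box, scale=0.25):  # crude stand-in for the ROI-Align cropping
    x1, y1, x2, y2 = (int(v * scale) for v in box)
    return m[:, y1:y2, x1:x2]

def detect_3d(image):
    feat = feature_extractor(image)
    depth_map = depth_generator(feat)
    results = []
    for box2d, category, probs in first_detector(feat):
        offset = second_detector(crop(feat, box2d), crop(depth_map, box2d))
        priors = priors_by_category[category]
        weighted = (probs / probs.sum()) @ priors   # probability-weighted priors
        results.append(weighted + offset)           # add the residual offsets
    return results

print(detect_3d(np.zeros((384, 1280, 3))))
```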
  • An embodiment of the present disclosure also provides a target detection device corresponding to the target detection method. Since the device in the embodiments of the present disclosure solves the problem on a principle similar to that of the above target detection method, its implementation can refer to the implementation of the method.
  • FIG. 3 is a schematic diagram of a target detection device provided by an embodiment of the present disclosure. The device includes an extraction module 301, a generation module 302, and a first detection module 303, wherein:
  • the extraction module 301 is configured to perform feature extraction on the image to be detected to obtain a feature map of the image to be detected;
  • the generation module 302 is configured to generate a depth map corresponding to a projection area where the target object in the image to be detected is projected to the ground based on the feature map;
  • the first detection module 303 is configured to determine three-dimensional detection information of the target object based on the depth map and the feature map.
  • With the above target detection device, not only can feature extraction be performed on the image to be detected, but a depth map corresponding to the projection area where the target object projects onto the ground can also be generated from the extracted feature map, and the 3D detection information of the target object then determined based on the depth map and the feature map. Because the generated depth map points to the target object in the image to be detected and corresponds to that object's projection area on the ground, the projection area is associated with the target object to a certain extent; the depth map of this local ground can therefore serve as a guide for the three-dimensional detection, improving detection accuracy.
  • In a possible implementation, the above device further includes: a second detection module 304, configured to detect the feature map after the feature map of the image to be detected is obtained, to obtain two-dimensional detection information for the target object.
  • The first detection module 303 is configured to determine the three-dimensional detection information of the target object based on the depth map and the feature map according to the following steps: determining three-dimensional prior frame information for the target object based on the two-dimensional detection information; and, based on the three-dimensional prior frame information, the depth map, and the feature map, determining the 3D detection frame information of the target object.
  • In a possible implementation, the two-dimensional detection information includes the two-dimensional detection frame information where the target object is located and the category information of the target object; the first detection module 303 is configured to determine the three-dimensional prior frame information for the target object based on the two-dimensional detection information according to the following steps: based on the category information, determining the clustering information of each subcategory included in the category to which the target object belongs; and, according to the clustering information of each subcategory and the two-dimensional detection frame information, determining the three-dimensional prior frame information for the target object.
  • In a possible implementation, the first detection module 303 is configured to determine the three-dimensional prior frame information for the target object according to the clustering information of each subcategory and the two-dimensional detection frame information where the target object is located according to the following steps: for each subcategory, determining the depth value corresponding to that subcategory based on the cluster height value included in its clustering information and the width value included in the two-dimensional detection frame information; and, based on the clustering information of the subcategory and the corresponding depth value, determining one piece of three-dimensional prior frame information for the target object.
  • In a possible implementation, the first detection module 303 is configured to determine the 3D detection frame information of the target object based on the 3D prior frame information, the depth map, and the feature map according to the following steps: determining a 3D detection frame offset according to the depth map and the feature map; and determining the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset.
  • In a possible implementation, the first detection module 303 is configured to determine the offset of the three-dimensional detection frame according to the depth map and the feature map according to the following steps: based on the position range included in the two-dimensional detection frame information, extracting from the depth map and the feature map, respectively, a depth map and a feature map matching that position range; and determining the 3D detection frame offset based on the matched depth map and feature map.
  • In a possible implementation, the first detection module 303 is configured to determine the 3D detection frame information of the target object based on the 3D prior frame information and the offset of the 3D detection frame according to the following steps: determining the 3D detection frame information based on each piece of 3D prior frame information, the weight corresponding to each piece of 3D prior frame information, and the 3D detection frame offset.
  • In a possible implementation, the first detection module 303 is configured to determine the weight corresponding to each piece of three-dimensional prior frame information according to the following steps: based on the prediction probability of each subcategory, determining the weight of the three-dimensional prior frame information corresponding to that subcategory.
  • In a possible implementation, the second detection module 304 is configured to detect the feature map to obtain the two-dimensional detection information for the target object according to the following steps: determining a two-dimensional detection frame offset according to the feature map; and determining the two-dimensional detection information of the target object based on the preset two-dimensional prior frame information and the two-dimensional detection frame offset.
  • In a possible implementation, the depth map is determined by a trained depth map generation network; the depth map generation network is obtained by training with image samples and with annotated depth maps determined from the three-dimensional annotation frame information of the target objects marked in the image samples.
  • In a possible implementation, the three-dimensional annotation frame information of the target object includes the position coordinates and depth value of the bottom-center point of the annotation frame; the generation module 302 is configured to obtain the annotated depth map according to the following steps: projecting the annotated three-dimensional frame information onto the ground based on the correspondence between the 3D coordinate system of the annotation frame and the ground coordinate system of the bottom-center point, obtaining the projection area of the target object on the ground and the extended area where the projection area lies; determining the depth value of each three-dimensional annotation point on the extended area; projecting each three-dimensional annotation point on the extended area from the camera coordinate system onto the pixel plane in the image coordinate system to obtain the projection points in the pixel plane; and obtaining the annotated depth map based on the depth values of the three-dimensional annotation points on the extended area and the projection points in the pixel plane.
  • In a possible implementation, the generation module 302 is configured to determine the depth value of each three-dimensional annotation point on the extended area based on the position coordinates and depth value of the bottom-center point included in the three-dimensional annotation frame information according to the following steps: determining the depth value and position coordinates of the bottom-center point as the depth value and position coordinates of the center point of the extended area, respectively; and, with the depth value of the center point of the extended area as the initial depth value, determining the depth values of the three-dimensional annotation points in the extended area at preset depth intervals.
  • For the electronic device provided by an embodiment of the present disclosure, FIG. 4 is a schematic structural diagram; the device includes a processor 401, a memory 402, and a bus 403. The memory 402 stores machine-readable instructions executable by the processor 401 (for example, the execution instructions corresponding to the extraction module 301, the generation module 302, and the first detection module 303 of the device in FIG. 3). When the electronic device runs, the processor 401 communicates with the memory 402 through the bus 403, and when the machine-readable instructions are executed by the processor 401, the following processes are performed: performing feature extraction on the image to be detected to obtain a feature map of the image to be detected; based on the feature map, generating a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground; and, based on the depth map and the feature map, determining the 3D detection information of the target object.
  • Embodiments of the present disclosure further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the steps of the method for object detection described in the foregoing method embodiments are executed.
  • The storage medium may be a volatile or non-volatile computer-readable storage medium.
  • An embodiment of the present disclosure also provides a computer program product carrying program code; the instructions included in the program code can be used to execute the steps of the target detection method described in the above method embodiments, to which reference may be made for details.
  • The above-mentioned computer program product may be realized by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a software development kit (SDK).
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
  • If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a non-volatile computer-readable storage medium executable by a processor. Based on this understanding, the technical solution of the present disclosure, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • In summary, the target detection method includes: performing feature extraction on the image to be detected to obtain a feature map of the image to be detected; based on the feature map, generating a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground; and, based on the depth map and the feature map, determining the three-dimensional detection information of the target object.
  • With the above target detection method, not only can feature extraction be performed on the image to be detected, but a depth map corresponding to the projection area where the target object projects onto the ground can also be generated from the extracted feature map, and the 3D detection information of the target object then determined based on the depth map and the feature map. Because the generated depth map points to the target object in the image to be detected and corresponds to that object's projection area on the ground, the projection area is associated with the target object to a certain extent; the depth map of this local ground can therefore serve as a guide when three-dimensional detection is performed on the feature map of the target object on that local ground, improving detection accuracy.

Abstract

The present disclosure provides a target detection method and apparatus, an electronic device, a storage medium, and a computer program product. The method comprises: performing feature extraction on an image to be detected to obtain a feature map of said image; on the basis of the feature map, generating a depth map corresponding to a projection region in which a target object in said image is projected to the ground; and, on the basis of the depth map and the feature map, determining three-dimensional detection information of the target object. The projection region in the present disclosure is associated with the target object to a certain degree; in this way, the depth map corresponding to the local ground can guide, in a targeted manner, the three-dimensional detection performed on the feature map of the target object on the local ground, thereby improving the accuracy of detection.

Description

一种目标检测的方法、装置、电子设备及存储介质、计算机程序产品A target detection method, device, electronic equipment, storage medium, and computer program product
相关申请的交叉引用Cross References to Related Applications
本公开基于申请号为202111164729.9、申请日为2021年9月30日、发明名称为“一种目标检测的方法、装置、电子设备及存储介质”的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本公开作为参考。This disclosure is based on a Chinese patent application with the application number 202111164729.9, the filing date is September 30, 2021, and the title of the invention is "a method, device, electronic device and storage medium for target detection", and requires the Chinese patent application Priority, the entire content of the Chinese patent application is hereby incorporated by reference into this disclosure.
技术领域technical field
本公开涉及图像处理技术领域,尤其涉及一种目标检测的方法、装置、电子设备及存储介质、计算机程序产品。The present disclosure relates to the technical field of image processing, and in particular to a target detection method, device, electronic equipment, storage medium, and computer program product.
背景技术Background technique
相较于二维(2D,2-Dimension)目标检测任务,三维(3D,3-Dimension)目标检测任务难度更大,复杂度更高,往往需要从3D场景中检测出目标的3D几何信息和语义信息,主要包括目标的长宽高、中心点和朝向角信息。其中,由于单目图像的3D目标检测具有经济适用的优良特性,被广泛应用于各种领域(如无人驾驶领域)。Compared with two-dimensional (2D, 2-Dimension) target detection tasks, three-dimensional (3D, 3-Dimension) target detection tasks are more difficult and more complex, and often need to detect the 3D geometric information and Semantic information mainly includes the length, width, height, center point and orientation angle information of the target. Among them, the 3D target detection of monocular images is widely used in various fields (such as the field of unmanned driving) due to its economical and practical characteristics.
然而,基于单目图像的3D目标检测技术主要依赖于一些外部的子任务,这些子任务负责执行2D目标检测、深度图估计等任务。由于子任务单独训练,本身存在精度损失,限制了网络模型的性能上限,无法满足3D检测的精度需求。However, 3D object detection techniques based on monocular images mainly rely on some external subtasks, which are responsible for performing tasks such as 2D object detection, depth map estimation, etc. Since the sub-tasks are trained separately, there is a loss of accuracy, which limits the upper limit of the performance of the network model and cannot meet the accuracy requirements of 3D detection.
发明内容Contents of the invention
本公开实施例至少提供一种目标检测的方法、装置、电子设备及存储介质、计算机程序产品,以提升3D目标检测的精度。Embodiments of the present disclosure provide at least one object detection method, device, electronic device, storage medium, and computer program product, so as to improve the accuracy of 3D object detection.
第一方面,本公开实施例提供了一种目标检测的方法,所述方法包括:In a first aspect, an embodiment of the present disclosure provides a method for target detection, the method comprising:
对待检测图像进行特征提取,得到所述待检测图像的特征图;基于所述特征图,生成所述待检测图像中的目标对象投影至地面的投影区域对应的深度图;基于所述深度图以及所述特征图,确定所述目标对象的三维检测信息。performing feature extraction on the image to be detected to obtain a feature map of the image to be detected; based on the feature map, generating a depth map corresponding to a projection area where the target object in the image to be detected is projected onto the ground; based on the depth map and The feature map determines the three-dimensional detection information of the target object.
采用上述目标检测的方法,不仅可以对待检测图像进行特征提取,还可以基于提取得到的特征图,生成待检测图像中的目标对象投影至地面的投影区域对应的深度图,进而基于深度图和特征图确定目标对象的三维检测信息。由于生成的深度图是指向待检测图像中的目标对象的,且对应的是目标对象投影到地面的投影区域,该投影区域一定程度上与目标对象关联,这样,在利用局部地面上的目标对象的特征图进行三维检测时可以以该局部地面对应的深度图作为指导,从而提升检测的精度。Using the above target detection method, not only can feature extraction be performed on the image to be detected, but also based on the extracted feature map, a depth map corresponding to the projection area of the target object in the image to be detected can be generated to the ground, and then based on the depth map and features The map determines the 3D detection information of the target object. Since the generated depth map points to the target object in the image to be detected, and corresponds to the projection area where the target object is projected onto the ground, the projection area is associated with the target object to a certain extent. In this way, using the target object on the local ground The depth map corresponding to the local ground can be used as a guide when performing three-dimensional detection on the feature map of the local ground, thereby improving the accuracy of detection.
在一种可能的实施方式中,在得到所述待检测图像的特征图之后,所述方法还包括:In a possible implementation manner, after obtaining the feature map of the image to be detected, the method further includes:
对所述特征图进行检测,得到针对所述目标对象的二维检测信息;Detecting the feature map to obtain two-dimensional detection information for the target object;
所述基于所述深度图以及所述特征图,确定所述目标对象的三维检测信息,包括:The determining the three-dimensional detection information of the target object based on the depth map and the feature map includes:
基于所述二维检测信息,确定针对所述目标对象的三维先验框信息;Based on the two-dimensional detection information, determine three-dimensional prior frame information for the target object;
基于所述三维先验框信息、所述深度图以及所述特征图,确定所述目标对象的三维检测框信息。Based on the 3D priori frame information, the depth map and the feature map, determine the 3D detection frame information of the target object.
这里的三维检测可以是结合三维先验框信息的检测,三维先验框一定程度上可以约束三维检测的起始位置,以在起始位置附近搜索三维检测框信息,从而进一步提升三维检测的精度。The 3D detection here may be performed in combination with 3D prior frame information. The 3D prior frame can, to a certain extent, constrain the starting position of the 3D detection, so that the 3D detection frame information is searched for near the starting position, thereby further improving the accuracy of the 3D detection.
在一种可能的实施方式中,所述二维检测信息包括所述目标对象所在二维检测框信息和所述目标对象所属类别信息;所述基于所述二维检测信息,确定针对所述目标对象的三维先验框信息,包括:In a possible implementation manner, the two-dimensional detection information includes the two-dimensional detection frame information of the target object and the category information of the target object; and the determining, based on the two-dimensional detection information, the three-dimensional prior frame information for the target object includes:
基于所述目标对象所属类别信息,确定所述目标对象所属类别包括的各个子类别的聚类信息;Based on the category information of the target object, determine the clustering information of each subcategory included in the category of the target object;
根据所述各个子类别的聚类信息以及所述目标对象所在二维检测框信息,确定针对所述目标对象的三维先验框信息。According to the clustering information of each subcategory and the two-dimensional detection frame information where the target object is located, the three-dimensional prior frame information for the target object is determined.
这里的三维先验框可以是结合目标对象所属类别信息来确定的。目标对象所属类别不同,所对应的三维先验框的大小、位置等可能都不同,利用类别信息可以辅助确定三维先验框的位置,准确度较高。The three-dimensional prior frame here can be determined in combination with the category information of the target object. The size and position of the corresponding three-dimensional prior frame may be different for different categories of target objects. The use of category information can assist in determining the position of the three-dimensional prior frame with high accuracy.
在一种可能的实施方式中,所述根据所述各个子类别的聚类信息以及所述目标对象所在二维检测框信息,确定针对所述目标对象的三维先验框信息,包括:In a possible implementation manner, the determining the three-dimensional prior frame information for the target object according to the clustering information of each subcategory and the two-dimensional detection frame information of the target object includes:
针对所述各个子类别中的每个子类别,基于该子类别的聚类信息包括的聚类高度值、所述目标对象所在二维检测框信息包括的宽度值,确定与该子类别对应的深度值;For each of the subcategories, determining a depth value corresponding to the subcategory based on the cluster height value included in the clustering information of the subcategory and the width value included in the two-dimensional detection frame information of the target object;
基于该子类别的聚类信息以及与该子类别对应的深度值,确定针对所述目标对象的一个三维先验框信息。Based on the clustering information of the subcategory and the depth value corresponding to the subcategory, determine a 3D prior frame information for the target object.
在一种可能的实施方式中,所述基于所述三维先验框信息、所述深度图以及所述特征图,确定所述目标对象的三维检测框信息,包括:In a possible implementation manner, the determining the 3D detection frame information of the target object based on the 3D prior frame information, the depth map, and the feature map includes:
根据所述深度图以及所述特征图确定三维检测框偏移量;determining a three-dimensional detection frame offset according to the depth map and the feature map;
基于所述三维先验框信息以及所述三维检测框偏移量,确定所述目标对象的三维检测框信息。Determine the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset.
这里可以是针对偏移量的预测,结合偏移量以及三维先验框可以得到更为准确的三维检测框。Here it can be the prediction for the offset. Combining the offset and the 3D prior frame can get a more accurate 3D detection frame.
在一种可能的实施方式中,所述根据所述深度图以及所述特征图确定三维检测框偏移量,包括:In a possible implementation manner, the determining the offset of the 3D detection frame according to the depth map and the feature map includes:
基于所述目标对象所在二维检测框信息包括的位置范围,分别从所述深度图以及所述特征图中提取与所述位置范围匹配的深度图和特征图;Based on the position range included in the two-dimensional detection frame information where the target object is located, extracting a depth map and a feature map matching the position range from the depth map and the feature map respectively;
基于与所述位置范围匹配的深度图和特征图确定所述三维检测框偏移量。The three-dimensional detection frame offset is determined based on a depth map and a feature map matched with the position range.
这里,利用目标对象所在二维检测框信息包括的位置范围可以实现深度图和特征图在对应位置处的裁剪,这将使得所预测得到的偏移量是针对目标对象的,而不包含其他干扰区域的相关信息,提升预测精度。Here, the position range included in the two-dimensional detection frame information of the target object can be used to crop the depth map and the feature map at the corresponding positions. As a result, the predicted offset is specific to the target object and does not contain information from other interfering regions, which improves the prediction accuracy.
在一种可能的实施方式中,所述三维先验框信息为多个;所述基于所述三维先验框信息以及所述三维检测框偏移量,确定所述目标对象的三维检测框信息,包括:In a possible implementation manner, there are multiple pieces of three-dimensional prior frame information; and the determining the three-dimensional detection frame information of the target object based on the three-dimensional prior frame information and the three-dimensional detection frame offset includes:
确定与每个所述三维先验框信息对应的权重;Determining the weight corresponding to each of the three-dimensional prior frame information;
基于各个所述三维先验框信息、每个所述三维先验框信息对应的权重、以及所述三维检测框偏移量,确定所述目标对象的三维检测框信息。The 3D detection frame information of the target object is determined based on each of the 3D prior frame information, the weight corresponding to each of the 3D prior frame information, and the 3D detection frame offset.
在一种可能的实施方式中,所述方法还包括:In a possible implementation manner, the method also includes:
根据所述深度图以及所述特征图确定所述目标对象所在类别信息包括的各个子类别的预测概率;determining the predicted probability of each subcategory included in the category information of the target object according to the depth map and the feature map;
所述确定与每个所述三维先验框信息对应的权重,包括:The determination of the weight corresponding to each of the three-dimensional prior frame information includes:
基于各个子类别的预测概率,确定与每个所述子类别对应的三维先验框信息的权重。Based on the prediction probability of each subcategory, the weight of the three-dimensional prior frame information corresponding to each of the subcategories is determined.
考虑到针对不同的子类别的预测概率并不相同,概率越大,说明目标对象指向对应子类别的可能性也就越高,进而可以为对应的三维先验框信息赋予更高的权重,这将进一步提升最终所得到的三维检测框的预测精度。Considering that the predicted probabilities of different subcategories are not the same, a greater probability indicates a higher possibility that the target object belongs to the corresponding subcategory, so the corresponding three-dimensional prior frame information can be given a higher weight, which will further improve the prediction accuracy of the final three-dimensional detection frame.
在一种可能的实施方式中,所述对所述特征图进行检测,得到针对所述目标对象的二维检测信息,包括:In a possible implementation manner, detecting the feature map to obtain the two-dimensional detection information for the target object includes:
根据所述特征图确定二维检测框偏移量;determining a two-dimensional detection frame offset according to the feature map;
基于预设的二维先验框信息以及所述二维检测框偏移量,确定所述目标对象的二维检测信息。The two-dimensional detection information of the target object is determined based on the preset two-dimensional prior frame information and the offset of the two-dimensional detection frame.
在一种可能的实施方式中,所述深度图为通过训练好的深度图生成网络确定的;所述深度图生成网络是由图像样本、以及基于所述图像样本中标注的目标对象的三维标注框信息确定的标注深度图训练得到的。In a possible implementation manner, the depth map is determined by a trained depth map generation network; the depth map generation network is trained with image samples and labeled depth maps determined based on the three-dimensional annotation frame information of the target objects annotated in the image samples.
在一种可能的实施方式中,所述目标对象的三维标注框信息包括标注框底面中心点的位置坐标和深度值;按照如下步骤获取所述标注深度图:In a possible implementation manner, the three-dimensional annotation frame information of the target object includes position coordinates and depth values of the center point of the bottom surface of the annotation frame; the annotation depth map is acquired according to the following steps:
基于三维标注框所在三维坐标系与标注框底面中心点所在地面坐标系之间的对应关系,将所述目标对象标注的三维标注框信息投影至地面,得到所述目标对象在地面的投影区域及所述投影区域所在延展区域;Based on the correspondence between the three-dimensional coordinate system of the three-dimensional annotation frame and the ground coordinate system of the center point of the bottom surface of the annotation frame, projecting the three-dimensional annotation frame information annotated for the target object onto the ground to obtain the projection area of the target object on the ground and the extended area in which the projection area is located;
基于所述三维标注框信息包括的标注框底面中心点的位置坐标和深度值确定所述延展区域上各三维标注点的深度值;Determining the depth value of each three-dimensional label point on the extended area based on the position coordinates and depth values of the center point of the bottom surface of the label box included in the three-dimensional label frame information;
基于相机坐标系与图像坐标系之间的对应关系,将所述相机坐标系下所述延展区域上的各三维标注点投影至所述图像坐标系下的像素平面,得到所述像素平面中的投影点;Based on the correspondence between the camera coordinate system and the image coordinate system, projecting each three-dimensional annotation point on the extended area in the camera coordinate system to the pixel plane in the image coordinate system, to obtain projection points in the pixel plane;
基于所述延展区域上各三维标注点的深度值以及所述像素平面中的投影点,得到所述标注深度图。The marked depth map is obtained based on the depth value of each three-dimensional marked point on the extended area and the projected point in the pixel plane.
这里的标注深度图可以是结合地面投影操作以及坐标系之间的转换操作实现的。通过延展区域的构建可以使得目标对象在内的局部地面区域得以完整的被覆盖,利用延展区域的三维投影结果可以得到对应的标注深度图,该标注深度图可以反映包括局部地面区域在内的延展区域的深度信息,该深度信息可以针对性的辅助对应局部地面区域上的目标对象进行三维检测。The labeled depth map here can be obtained by combining a ground projection operation with conversions between coordinate systems. Constructing the extended area ensures that the local ground area containing the target object is completely covered; the corresponding labeled depth map can then be obtained from the three-dimensional projection result of the extended area. This labeled depth map reflects the depth information of the extended area, including the local ground area, and this depth information can specifically assist the three-dimensional detection of the target object on the corresponding local ground area.
在一种可能的实施方式中,所述基于所述三维标注框信息包括的标注框底面中心点的位置坐标和深度值确定所述延展区域上各三维标注点的深度值,包括:In a possible implementation manner, the determining the depth value of each three-dimensional label point on the extended area based on the position coordinates and depth values of the center point of the bottom surface of the label box included in the three-dimensional label box information includes:
将所述标注框底面中心点的深度值和位置坐标分别确定为所述延展区域的中心点的深度值和位置坐标;Determining the depth value and position coordinates of the center point of the bottom surface of the label frame as the depth value and position coordinates of the center point of the extended area, respectively;
在确定所述延展区域的中心点的位置坐标的情况下,以所述延展区域的中心点的深度值为起始深度值,以预设深度间隔确定所述延展区域中各三维标注点的深度值。With the position coordinates of the center point of the extended area determined, taking the depth value of the center point of the extended area as the starting depth value, and determining the depth value of each three-dimensional annotation point in the extended area at a preset depth interval.
第二方面,本公开实施例还提供了一种目标检测的装置,所述装置包括:In the second aspect, the embodiment of the present disclosure also provides a target detection device, the device comprising:
提取模块,配置为对待检测图像进行特征提取,得到所述待检测图像的特征图;生成模块,配置为基于所述特征图,生成所述待检测图像中的目标对象投影至地面的投影区域对应的深度图;第一检测模块,配置为基于所述深度图以及所述特征图,确定所述目标对象的三维检测信息。The extraction module is configured to perform feature extraction on the image to be detected to obtain a feature map of the image to be detected; the generation module is configured to generate a projection area corresponding to the projection of the target object in the image to be detected to the ground based on the feature map The depth map; the first detection module is configured to determine the three-dimensional detection information of the target object based on the depth map and the feature map.
第三方面,本公开实施例还提供了一种电子设备,包括:处理器、存储器和总线,所述存储器存储有所述处理器可执行的机器可读指令,当电子设备运行时,所述处理器与所述存储器之间通过总线通信,所述机器可读指令被所述处理器执行时执行如第一方面及其各种实施方式任一所述的目标检测的方法的步骤。In a third aspect, an embodiment of the present disclosure further provides an electronic device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device runs, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the target detection method according to the first aspect or any of its implementation manners are performed.
第四方面,本公开实施例还提供了一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器运行时执行如第一方面及其各种实施方式任一所述的目标检测的方法的步骤。In the fourth aspect, the embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, and the computer program is executed when the processor runs, as in the first aspect and its various implementation modes The steps of any one of the methods for target detection.
第五方面,本公开实施例还提供了一种计算机程序产品,包括存储了程序代码的计算机可读存储介质,所述程序代码包括的指令被计算机设备的处理器运行时,实现如第一方面及其各种实施方式任一所述的目标检测的方法的步骤。In a fifth aspect, an embodiment of the present disclosure further provides a computer program product, including a computer-readable storage medium storing program code; when the instructions included in the program code are executed by a processor of a computer device, the steps of the target detection method according to the first aspect or any of its implementation manners are implemented.
关于上述目标检测的装置、电子设备、及计算机可读存储介质的效果描述参见上述目标检测的方法的说明。For the effect description of the above-mentioned object detection device, electronic equipment, and computer-readable storage medium, refer to the description of the above-mentioned object detection method.
为使本公开的上述目的、特征和优点能更明显易懂,下文特举较佳实施例,并配合所附 附图,作详细说明如下。In order to make the above-mentioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments are specifically cited below, together with the accompanying drawings, and described in detail as follows.
附图说明Description of drawings
为了更清楚地说明本公开实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,此处的附图被并入说明书中并构成本说明书中的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。应当理解,以下附图仅示出了本公开的某些实施例,因此不应被看作是对范围的限定,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他相关的附图。In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings used in the embodiments are briefly introduced below. The drawings here are incorporated into and constitute a part of the specification; they show embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings show only some embodiments of the present disclosure and therefore should not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.
图1示出了本公开实施例所提供的一种目标检测的方法的流程图;FIG. 1 shows a flow chart of a method for target detection provided by an embodiment of the present disclosure;
图2A示出了本公开实施例所提供的一种确定目标对象的二维检测信息的方法的流程图;FIG. 2A shows a flowchart of a method for determining two-dimensional detection information of a target object provided by an embodiment of the present disclosure;
图2B示出了本公开实施例所提供的一种确定针对目标对象的三维先验框信息的方法的流程图;FIG. 2B shows a flow chart of a method for determining 3D prior frame information for a target object provided by an embodiment of the present disclosure;
图2C示出了本公开实施例所提供的一种确定针对目标对象的一个三维先验框信息的方法的流程图;FIG. 2C shows a flow chart of a method for determining a three-dimensional prior frame information for a target object provided by an embodiment of the present disclosure;
图2D示出了本公开实施例所提供的一种确定目标对象的三维检测框信息的方法的流程图;FIG. 2D shows a flow chart of a method for determining the three-dimensional detection frame information of a target object provided by an embodiment of the present disclosure;
图2E示出了本公开实施例所提供的一种确定目标对象的三维检测框偏移量的方法的流程图;FIG. 2E shows a flow chart of a method for determining the offset of a three-dimensional detection frame of a target object provided by an embodiment of the present disclosure;
图2F示出了本公开实施例所提供的一种深度图生成网络的训练过程的方法的流程图;FIG. 2F shows a flowchart of a method for training a depth map generation network provided by an embodiment of the present disclosure;
图2G示出了本公开实施例所提供的一种获取标注深度图的方法的流程图;FIG. 2G shows a flow chart of a method for obtaining a marked depth map provided by an embodiment of the present disclosure;
图2H示出了本公开实施例所提供的一种确定延展区域上各三维标注点的深度值的方法的流程图;FIG. 2H shows a flowchart of a method for determining the depth value of each three-dimensional label point on the extended area provided by an embodiment of the present disclosure;
图2I示出了本公开实施例所提供的一种局部地面深度标签生成的方法的应用示意图;FIG. 2I shows a schematic diagram of the application of a method for generating local ground depth labels provided by an embodiment of the present disclosure;
图2J示出了本公开实施例所提供的一种目标检测的方法的应用示意图;Fig. 2J shows a schematic diagram of the application of a target detection method provided by an embodiment of the present disclosure;
图3示出了本公开实施例所提供的一种目标检测的装置的示意图;FIG. 3 shows a schematic diagram of a target detection device provided by an embodiment of the present disclosure;
图4示出了本公开实施例所提供的一种电子设备的示意图。Fig. 4 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
具体实施方式Detailed Description of Embodiments
为使本公开实施例的目的、技术方案和优点更加清楚,下面将结合本公开实施例中附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本公开一部分实施例,而不是全部的实施例。通常在此处附图中描述和示出的本公开实施例的组件可以以各种不同的配置来布置和设计。因此,以下对在附图中提供的本公开的实施例的详细描述并非旨在限制要求保护的本公开的范围,而是仅仅表示本公开的选定实施例。基于本公开的实施例,本领域技术人员在没有做出创造性劳动的前提下所获得的所有其他实施例,都属于本公开保护的范围。In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part, rather than all, of the embodiments of the present disclosure. The components of the embodiments of the present disclosure, as generally described and illustrated in the figures herein, may be arranged and designed in a variety of different configurations. Accordingly, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。It should be noted that like numerals and letters denote similar items in the following figures, therefore, once an item is defined in one figure, it does not require further definition and explanation in subsequent figures.
本文中术语“和/或”,仅仅是描述一种关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中术语“至少一种”表示多种中的任意一种或多种中的至少两种的任意组合,例如,包括A、B、C中的至少一种,可以表示包括从A、B和C构成的集合中选择的任意一个或多个元素。The term "and/or" herein merely describes an association relationship and indicates that three relationships may exist; for example, A and/or B may mean three cases: A alone, both A and B, or B alone. In addition, the term "at least one" herein means any one of multiple items or any combination of at least two of multiple items; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C.
经研究发现,随着深度学习在目标检测领域,特别是3D目标检测方面上的成功应用,检测精度已经到了非常高的地步。常用的3D目标检测方法是基于LiDAR数据,但是由于采集数据设备昂贵,难以满足大规模的应用和部署。而单目图像的3D目标检测方式可以是采用汽车的车载摄像头,经济可用。对于单个视角的图像,单目3D检测的任务是从3D场景中检测出目标对象的3D几何信息和语义信息,包括目标对象的长宽高、中心点和朝向角信息。Research has found that with the successful application of deep learning in the field of target detection, especially 3D target detection, detection accuracy has reached a very high level. Commonly used 3D target detection methods are based on LiDAR data, but the data acquisition equipment is expensive, making large-scale application and deployment difficult. In contrast, 3D target detection on monocular images can use a vehicle-mounted camera, which is economical and readily available. For a single-view image, the task of monocular 3D detection is to detect the 3D geometric and semantic information of the target object from the 3D scene, including the length, width, height, center point, and orientation angle of the target object.
相关技术中,基于单目图像的3D目标检测技术依赖于一些外部的子任务,这些子任务负责执行2D目标检测、深度图估计等任务。由于子任务单独训练,本身存在精度损失,限制了网络模型的性能上限,无法满足3D检测的精度需求,很难用于实际的应用中。In related technologies, the monocular image-based 3D object detection technology relies on some external subtasks, and these subtasks are responsible for performing tasks such as 2D object detection and depth map estimation. Since the subtasks are trained separately, there is a loss of accuracy, which limits the upper limit of the performance of the network model and cannot meet the accuracy requirements of 3D detection, so it is difficult to be used in practical applications.
目前3D目标检测方法的困难在于3D检测框的深度预测。3D目标检测的标签只提供了目标框中心点或者角点的深度信息,网络难以进行学习,无法生成更多数量、更为准确的深度信息。这是由于相关技术中的3D目标检测方法是通过其子任务,比如深度估计、伪点云生成和语义分割的预测结果来指导3D检测框的学习,但是其子任务需要大量精准的深度标签,很难用于实际应用当中,且子任务的精度限制了3D目标检测的性能上限,在3D目标检测中并不是非常可靠。The difficulty of current 3D target detection methods lies in the depth prediction of the 3D detection frame. The labels for 3D target detection only provide the depth information of the center point or corner points of the target frame, which makes it difficult for the network to learn and prevents it from generating a larger quantity of more accurate depth information. This is because the 3D target detection methods in the related art guide the learning of the 3D detection frame through the prediction results of subtasks such as depth estimation, pseudo point cloud generation, and semantic segmentation; however, these subtasks require a large number of accurate depth labels, which are difficult to obtain in practical applications, and the accuracy of the subtasks limits the performance upper bound of 3D target detection, making them not very reliable for 3D target detection.
基于上述研究,本公开提供了一种目标检测的方法、装置、电子设备及存储介质、计算机程序产品,以提升3D目标检测的精度。Based on the above research, the present disclosure provides a method, device, electronic device, storage medium, and computer program product for object detection, so as to improve the accuracy of 3D object detection.
为便于对本实施例进行理解,首先对本公开实施例所公开的一种目标检测的方法进行详细介绍,本公开实施例所提供的目标检测的方法的执行主体一般为具有一定计算能力的计算机设备,该计算机设备例如包括:终端设备或服务器或其它处理设备,终端设备可以为用户设备(User Equipment,UE)、移动设备、用户终端、终端、蜂窝电话、无绳电话、个人数字助理(Personal Digital Assistant,PDA)、手持设备、计算设备、车载设备、可穿戴设备等。在一些可能的实现方式中,该目标检测的方法可以通过处理器调用存储器中存储的计算机可读指令的方式来实现。In order to facilitate the understanding of this embodiment, a method for target detection disclosed in the embodiments of the present disclosure is firstly introduced in detail. The execution subject of the method for target detection provided in the embodiments of the present disclosure is generally a computer device with certain computing capabilities. The computer equipment includes, for example: terminal equipment or server or other processing equipment, and the terminal equipment can be user equipment (User Equipment, UE), mobile equipment, user terminal, terminal, cellular phone, cordless phone, personal digital assistant (Personal Digital Assistant, PDA), handheld devices, computing devices, vehicle-mounted devices, wearable devices, etc. In some possible implementation manners, the object detection method may be implemented by a processor invoking computer-readable instructions stored in a memory.
图1为本公开实施例提供的一种目标检测的方法的流程图,所述方法由电子设备执行,所述方法包括步骤S101至步骤S103,其中:FIG. 1 is a flow chart of a method for target detection provided by an embodiment of the present disclosure, the method is executed by an electronic device, and the method includes steps S101 to S103, wherein:
步骤S101、对待检测图像进行特征提取,得到待检测图像的特征图;Step S101, performing feature extraction on the image to be detected to obtain a feature map of the image to be detected;
步骤S102、基于特征图,生成待检测图像中的目标对象投影至地面的投影区域对应的深度图;Step S102, based on the feature map, generate a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground;
步骤S103、基于深度图以及特征图,确定目标对象的三维检测信息。Step S103, based on the depth map and the feature map, determine the three-dimensional detection information of the target object.
为了便于理解本公开实施例提供的目标检测的方法,接下来首先对该方法的应用场景进行详细介绍。上述目标检测的方法可以应用于计算机视觉领域,例如可以应用于无人驾驶中的车辆检测、无人机探测等场景。考虑到无人驾驶的广泛应用,接下来多以车辆检测进行示例说明。In order to facilitate the understanding of the target detection method provided by the embodiment of the present disclosure, the application scenario of the method will firstly be introduced in detail. The above target detection method can be applied to the field of computer vision, for example, it can be applied to scenarios such as vehicle detection in unmanned driving, and drone detection. Considering the wide application of unmanned driving, the following is an example of vehicle detection.
相关技术中的3D目标检测技术依赖于一些外部的子任务,这些子任务负责执行2D目标检测、深度图估计等任务。由于子任务单独训练,本身存在精度损失,导致最终的3D检测精度也不高。The 3D object detection technology in the related art relies on some external subtasks, and these subtasks are responsible for performing tasks such as 2D object detection and depth map estimation. Due to the separate training of the subtasks, there is a loss of accuracy in itself, resulting in a low final 3D detection accuracy.
正是为了解决上述问题,本公开实施例才提供了一种结合局部深度图以及特征图进行三维检测的方案,检测的精度较高。Just to solve the above problems, the embodiments of the present disclosure provide a solution for three-dimensional detection in combination with a local depth map and a feature map, and the detection accuracy is high.
本公开实施例中的待检测图像可以是在目标场景下采集到的图像,不同的应用场景所对应采集的图像也不同。以无人驾驶为例,这里的待检测图像可以是无人驾驶汽车上安装的摄像装置在车辆行进的过程中所采集的图像,该图像中可以包括摄像装置的拍摄视野内所有的目标对象,这里的目标对象可以是前方车辆,还可以是前方行人,在此不做限制。The image to be detected in the embodiments of the present disclosure may be an image collected in a target scene, and different application scenes correspond to different collected images. Taking unmanned driving as an example, the image to be detected here can be the image collected by the camera device installed on the driverless car during the driving process of the vehicle, and the image can include all target objects within the shooting field of view of the camera device. The target object here may be a vehicle in front or a pedestrian in front, which is not limited here.
在进行三维检测之前,本公开实施例可以针对待检测图像,利用各种特征提取方法进行特征图的提取。例如,可以通过图像处理,从待检测图像中提取出特征图,再如可以利用训练好的特征提取网络进行特征图的提取。Before performing three-dimensional detection, the embodiments of the present disclosure may use various feature extraction methods to extract feature maps for the image to be detected. For example, the feature map can be extracted from the image to be detected through image processing, and for example, the feature map can be extracted by using a trained feature extraction network.
考虑到特征提取网络可以挖掘更深层次的图像特征,本公开实施例中可以采用特征提取网络提取特征图。这里的特征提取网络可以是卷积神经网络(Convolutional Neural Networks,CNN)。在实际应用中,可以采用包括卷积块、密集块和过渡块的CNN模型来实现,这里的卷积块可以由卷积层、批规范化(Batch normalization)层和线性整流层(Rectified Linear Unit, ReLU)组成,密集块可以由多个卷积块和多个跳转连接(Skip connection)组成,过渡块一般由卷积块和平均池化层组成。有关卷积块、密集块、过渡块的组成方式,例如,应用时包括几个卷积层、几个平均池化层等可以结合应用场景来确定,在此不做限制。Considering that the feature extraction network can mine deeper image features, the feature extraction network can be used to extract the feature map in the embodiments of the present disclosure. The feature extraction network here can be a convolutional neural network (Convolutional Neural Networks, CNN). In practical applications, it can be implemented by using a CNN model including convolutional blocks, dense blocks, and transition blocks. The convolutional blocks here can be composed of convolutional layers, batch normalization (Batch normalization) layers, and linear rectification layers (Rectified Linear Unit, ReLU), the dense block can be composed of multiple convolutional blocks and multiple skip connections (Skip connection), and the transitional block is generally composed of convolutional blocks and average pooling layers. The composition of convolutional blocks, dense blocks, and transitional blocks, for example, including several convolutional layers and average pooling layers during application, can be determined in conjunction with the application scenario, and there is no limitation here.
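For illustration only, the following is a minimal PyTorch-style sketch of the convolution block, dense block, and transition block described above; the channel counts, kernel sizes, and number of layers are assumptions, since the disclosure leaves these choices to the application scenario.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Sequential):
    # Convolution block: convolutional layer + batch normalization + ReLU.
    def __init__(self, in_ch, out_ch):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

class DenseBlock(nn.Module):
    # Dense block: several convolution blocks with skip connections.
    def __init__(self, ch, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(ConvBlock(ch, ch) for _ in range(num_layers))

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)  # skip connection
        return x

class TransitionBlock(nn.Sequential):
    # Transition block: convolution block + average pooling layer.
    def __init__(self, in_ch, out_ch):
        super().__init__(ConvBlock(in_ch, out_ch), nn.AvgPool2d(kernel_size=2))
```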
为了进行三维检测,本公开实施例可以基于提取出的特征图生成局部深度图,该局部深度图是与待检测图像中的目标对象投影至地面的投影区域相对应的,指向的是与目标对象关联的局部地面的深度信息。由于局部地面一定程度上与目标对象存在绑定关系,这样,再结合上述提取出的特征图可以更为准确的检测出目标对象。In order to perform three-dimensional detection, the embodiments of the present disclosure can generate a local depth map based on the extracted feature map. The local depth map corresponds to the projection area where the target object in the image to be detected is projected onto the ground, and points to the depth information of the local ground associated with the target object. Since the local ground is, to a certain extent, bound to the target object, the target object can be detected more accurately when this depth map is combined with the extracted feature map.
其中,上述局部深度图可以是利用训练好的深度图生成网络确定的。该深度图生成网络训练的是图像样本与标注深度图中对应像素点的特征与深度之间的对应关系,这样,在将提取出的特征图输入到训练好的深度图生成网络的情况下,可以输出指向目标对象在地面投影区域所对应的深度图。The local depth map may be determined by a trained depth map generation network. This network is trained on the correspondence between the features of image samples and the depths of the corresponding pixels in the labeled depth maps. In this way, when the extracted feature map is input into the trained depth map generation network, the depth map corresponding to the ground projection area of the target object can be output.
在实际应用中,可以结合区域特征聚类方法(ROI-align)方式对特征图和深度图进行裁剪,根据裁剪得到的与目标对象对应的深度图和特征图实现有关目标对象的三维检测。In practical applications, the feature map and the depth map can be cropped by means of the ROI-align operation, and the three-dimensional detection of the target object can be realized according to the cropped depth map and feature map corresponding to the target object.
本公开实施例中的三维检测可以是基于三维先验框的残差预测,这是考虑到在残差预测中,可以基于原始三维先验框的信息来指导后续的三维检测,例如,可以以原始三维先验框为初始位置,在该初始位置附近进行三维检测框的搜索,特别是在三维先验框的准确度比较高的情况下,这相比直接进行三维检测将显著提升检测的效率。The 3D detection in the embodiments of the present disclosure may be residual prediction based on a 3D prior frame. This is because, in residual prediction, the information of the original 3D prior frame can guide the subsequent 3D detection; for example, the original 3D prior frame can be taken as an initial position, and the 3D detection frame can be searched for near this initial position. Especially when the accuracy of the 3D prior frame is relatively high, this significantly improves detection efficiency compared with direct 3D detection.
上述三维先验框可以是基于二维检测信息确定的,这样即可以基于三维先验框信息、深度图以及特征图实现三维检测。The above-mentioned 3D prior frame may be determined based on 2D detection information, so that 3D detection may be realized based on 3D prior frame information, a depth map, and a feature map.
本公开实施例可以按照图2A所示的步骤确定目标对象的二维检测信息,所述步骤包括:In the embodiment of the present disclosure, the two-dimensional detection information of the target object can be determined according to the steps shown in FIG. 2A, and the steps include:
步骤S201、根据特征图确定二维检测框偏移量;Step S201, determining the offset of the two-dimensional detection frame according to the feature map;
步骤S202、基于预设的二维先验框信息以及二维检测框偏移量,确定目标对象的二维检测信息。Step S202, based on the preset 2D prior frame information and the offset of the 2D detection frame, determine the 2D detection information of the target object.
这里可以基于二维检测框偏移量以及预设的二维先验框信息之间的运算结果,确定二维检测信息。Here, the two-dimensional detection information can be determined based on the calculation result between the offset of the two-dimensional detection frame and the preset two-dimensional prior frame information.
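As a sketch of the arithmetic described here, the code below assumes the common anchor-based parameterization (center offsets scaled by the prior size, log-scale width/height residuals); the disclosure does not fix the exact form of the operation, so this is one plausible realization.

```python
import numpy as np

def decode_2d_box(prior, offset):
    # prior and offset are (cx, cy, w, h)-style vectors; the offset is the
    # residual predicted by the first target detection network.
    cx = prior[0] + offset[0] * prior[2]  # shift the center by a scaled offset
    cy = prior[1] + offset[1] * prior[3]
    w = prior[2] * np.exp(offset[2])      # rescale width/height exponentially
    h = prior[3] * np.exp(offset[3])
    return np.array([cx, cy, w, h])

# Hypothetical example: a 100x80 prior box at (200, 150) with a small residual.
box_2d = decode_2d_box(np.array([200.0, 150.0, 100.0, 80.0]),
                       np.array([0.1, -0.05, 0.2, 0.0]))
```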
其中,本公开实施例中的二维检测信息可以是利用训练好的第一目标检测网络对特征图进行二维检测得到的。这里的第一目标检测网络训练的可以是图像样本的特征图与二维标注信息之间的对应关系,也可以训练的是图像样本的特征图与偏移量(对应二维标注框与二维先验框之间的差值)之间的对应关系,利用前一对应关系可以直接确定待检测图像中目标对象的二维检测信息,利用后一对应关系可以先确定偏移量而后将偏移量与二维先验框求和来确定目标对象的二维检测信息。The two-dimensional detection information in the embodiments of the present disclosure may be obtained by performing two-dimensional detection on the feature map with a trained first target detection network. The first target detection network may be trained on the correspondence between the feature maps of image samples and two-dimensional annotation information, or on the correspondence between the feature maps of image samples and offsets (the differences between two-dimensional annotation frames and two-dimensional prior frames). With the former correspondence, the two-dimensional detection information of the target object in the image to be detected can be determined directly; with the latter, the offset is determined first and then summed with the two-dimensional prior frame to determine the two-dimensional detection information of the target object.
不管是采用上述哪种对应关系,所确定的二维检测信息均可以包括二维检测框的位置信息(x_2d, y_2d, w_2d, h_2d)、中心点位置信息(x_p, y_p),朝向角(α_3d),目标对象所属类别信息(cls),还可以包括其它与二维检测相关的信息,在此不做限制。Regardless of which of the above correspondences is adopted, the determined two-dimensional detection information may include the position information of the two-dimensional detection frame (x_2d, y_2d, w_2d, h_2d), the center point position information (x_p, y_p), the orientation angle (α_3d), and the category information (cls) of the target object, and may also include other information related to two-dimensional detection, which is not limited here.
考虑到残差预测所具有的优良特性,这里的第一目标检测网络可以是二维残差预测。在实际应用中,这里的第一目标检测网络可以先通过一个卷积层和一个线性整流层进行降维,然后通过多个卷积层分别进行二维检测框的残差预测,预测的准确度较高。Considering the favorable properties of residual prediction, the first target detection network here may perform two-dimensional residual prediction. In practical applications, the first target detection network may first perform dimensionality reduction through one convolutional layer and one rectified linear layer, and then perform residual prediction of the two-dimensional detection frame through multiple convolutional layers, which yields high prediction accuracy.
本公开实施例中,基于上述第一目标检测网络所确定的二维检测信息可以确定三维先验框信息,如图2B所示,所述步骤包括:In the embodiment of the present disclosure, based on the two-dimensional detection information determined by the above-mentioned first target detection network, the three-dimensional prior frame information can be determined, as shown in FIG. 2B, the steps include:
步骤S301、基于目标对象所属类别信息,确定目标对象所属类别包括的各个子类别的聚类信息;Step S301, based on the category information of the target object, determine the clustering information of each subcategory included in the category to which the target object belongs;
步骤S302、根据各个子类别的聚类信息以及目标对象所在二维检测框信息,确定针对目标对象的三维先验框信息。Step S302, according to the clustering information of each subcategory and the two-dimensional detection frame information where the target object is located, determine the three-dimensional prior frame information for the target object.
这里,可以结合目标对象所属类别包括的各个子类别的聚类信息以及目标对象所在二维检测框信息来确定三维先验框信息,这是考虑到对于同属一个类别的目标对象而言,不同子类别所对应的三维检测结果将存在一定的区别,例如,针对同属车辆这一类的各个目标而言,对于小汽车这一子类别而言,其三维检测框的大小与大卡车这一子类别的三维检测框大小即存在较大的区别。为了兼顾各个子类别被三维预测到的可能性,这里可以预先进行子类别的划分,并基于每一个划分子类别的聚类信息实现对应三维先验框信息的确定。Here, the three-dimensional prior frame information can be determined by combining the clustering information of each subcategory included in the category of the target object with the two-dimensional detection frame information of the target object. This is because, for target objects of the same category, the three-dimensional detection results corresponding to different subcategories will differ to some extent. For example, among targets that all belong to the vehicle category, the size of the three-dimensional detection frame for the car subcategory differs considerably from that for the truck subcategory. To account for the possibility of each subcategory being predicted in three dimensions, subcategories can be divided in advance, and the corresponding three-dimensional prior frame information can be determined based on the clustering information of each subcategory.
本公开实施例中在确定目标对象所属类别信息的情况下,可以确定这一类别信息所对应的聚类结果。这里仍以车辆作为目标对象为例,预先可以收集包括各种子类别的车辆图像样本,该车辆图像样本确定有车辆的长宽高等信息。针对车辆图像样本而言,可以基于高度值进行聚类,这样,同属一个高度范围的车辆图像样本可以对应划分到一个子类别,进而可以确定出这一子类别的聚类信息。在实际应用中,可以采用K均值聚类算法(k-means clustering algorithm,K-means)等聚类方法来实现上述聚类过程。In the embodiment of the present disclosure, in the case of determining the category information of the target object, the clustering result corresponding to this category information may be determined. Still taking the vehicle as the target object here, vehicle image samples including various subcategories may be collected in advance, and the vehicle image samples are determined to have information such as the length, width, and height of the vehicle. For vehicle image samples, clustering can be performed based on height values, so that vehicle image samples belonging to the same height range can be correspondingly divided into a subcategory, and then the clustering information of this subcategory can be determined. In practical applications, clustering methods such as K-means clustering algorithm (K-means) can be used to realize the above clustering process.
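For illustration, a minimal sketch of this subcategory clustering with scikit-learn's K-means, run on hypothetical annotated vehicle heights; the number of clusters and the sample values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical annotated 3D heights (in meters) of collected vehicle samples.
heights = np.array([[1.5], [1.6], [1.4], [3.2], [3.5], [2.0], [1.55], [3.4]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(heights)
# Each cluster center serves as the cluster height value of one subcategory
# (e.g., car / van / truck).
cluster_heights = sorted(c[0] for c in kmeans.cluster_centers_)
```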
有关结合聚类信息以及目标对象所在二维检测框信息确定三维先验框信息的过程可以按照图2C所示的步骤来实现,所述步骤包括:The process of determining the three-dimensional a priori frame information by combining the clustering information and the two-dimensional detection frame information where the target object is located can be implemented according to the steps shown in Figure 2C, and the steps include:
步骤S3021、针对各个子类别中的每个子类别,基于该子类别的聚类信息包括的聚类高度值、目标对象所在二维检测框信息包括的宽度值,确定与该子类别对应的深度值;Step S3021, for each subcategory in each subcategory, based on the cluster height value included in the cluster information of the subcategory and the width value included in the two-dimensional detection frame information where the target object is located, determine the depth value corresponding to the subcategory ;
步骤S3022、基于该子类别的聚类信息以及与该子类别对应的深度值,确定针对目标对象的一个三维先验框信息。Step S3022, based on the clustering information of the sub-category and the depth value corresponding to the sub-category, determine a 3D prior frame information for the target object.
这里,每个子类别可以对应一个三维先验框信息,有关三维先验框的尺寸等信息可以是由对应子类别的聚类信息来确定的;有关深度信息则可以是由聚类高度值以及二维检测框信息包括的宽度值来确定的,在实际应用中,可以先进行聚类高度值与宽度值之间的比值运算,而后进行摄像装置的焦距的乘法运算来实现。Here, each subcategory can correspond to one piece of three-dimensional prior frame information. Information such as the size of the three-dimensional prior frame can be determined from the clustering information of the corresponding subcategory, while the depth information can be determined from the cluster height value and the width value included in the two-dimensional detection frame information. In practical applications, this can be realized by first computing the ratio between the cluster height value and the width value, and then multiplying by the focal length of the camera.
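A minimal numeric sketch of this ratio-then-multiply rule under a pinhole-camera assumption; the focal length, cluster height, and box size below are hypothetical.

```python
def prior_depth(focal_length_px, cluster_height_m, box_size_px):
    # Depth of the prior box: focal length times the ratio between the
    # subcategory's cluster height and the 2D box size in pixels.
    return focal_length_px * cluster_height_m / box_size_px

# Hypothetical numbers: f = 720 px, a 1.5 m car subcategory, a 90 px box.
z_prior = prior_depth(720.0, 1.5, 90.0)  # -> 12.0 m
```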
在确定三维先验框信息的情况下,本公开实施例可以结合这一信息、以及深度图和特征图来确定三维检测框信息,如图2D所示,包括如下步骤:In the case of determining the three-dimensional prior frame information, the embodiment of the present disclosure may combine this information, as well as the depth map and the feature map to determine the three-dimensional detection frame information, as shown in Figure 2D, including the following steps:
步骤S1031、根据深度图以及特征图确定三维检测框偏移量;Step S1031, determining the offset of the three-dimensional detection frame according to the depth map and the feature map;
步骤S1032、基于三维先验框信息以及三维检测框偏移量,确定目标对象的三维检测框信息。Step S1032: Determine the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset.
这里,可以利用第二目标检测网络实现三维检测,得到第二目标检测网络输出的三维检测框偏移量,而后基于三维先验框信息以及三维检测框偏移量,确定目标对象的三维检测框信息。Here, the second target detection network can be used to realize three-dimensional detection, and the offset of the three-dimensional detection frame output by the second target detection network can be obtained, and then the three-dimensional detection frame of the target object can be determined based on the three-dimensional prior frame information and the offset of the three-dimensional detection frame information.
其中,上述三维检测框信息可以包括检测框的形状信息(w_3d, h_3d, l_3d)和深度信息(z_3d)。Wherein, the above three-dimensional detection frame information may include the shape information (w_3d, h_3d, l_3d) and the depth information (z_3d) of the detection frame.
需要说明的是,相比二维预测而言,三维预测可以确定出目标对象更多维度的信息,例如,这里还可以确定目标对象所在类别信息包括的各个子类别,例如,可以确定属于车辆这一类别的目标对象是小汽车还是大卡车。It should be noted that, compared with two-dimensional prediction, three-dimensional prediction can determine information of the target object in more dimensions. For example, it can also determine the subcategories included in the category information of the target object, such as whether a target object belonging to the vehicle category is a car or a truck.
考虑到本公开实施例中的三维先验框可以有多个,基于每个三维先验框均可以对应预测一个三维检测框偏移量,又考虑到不同三维先验框所对应的子类别也不同,而不同的子类别的预测概率也不同,因而,这里可以首先基于各个子类别的预测概率,为与每个子类别对应的三维先验框信息赋予对应的权重,然后再基于各个三维先验框信息、每个三维先验框信息对应的权重、以及三维检测框偏移量,确定目标对象的三维检测框信息。Considering that there may be multiple 3D priori frames in the embodiments of the present disclosure, a 3D detection frame offset can be predicted based on each 3D priori frame, and considering that the subcategories corresponding to different 3D priori frames are also different, and the prediction probabilities of different subcategories are also different. Therefore, based on the prediction probabilities of each subcategory, the corresponding weights can be given to the three-dimensional prior frame information corresponding to each subcategory, and then based on each three-dimensional priori The frame information, the weight corresponding to each 3D prior frame information, and the offset of the 3D detection frame determine the 3D detection frame information of the target object.
这里,预测概率越高的子类别可以赋予更高的权重,以彰显对应三维先验框在后续三维检测中的作用,同理,预测概率越低的子类别可以赋予更小的权重,以弱化对应三维先验框在后续三维检测中的作用,从而使得所确定的三维检测框信息更为精确。Here, a subcategory with a higher predicted probability can be given a higher weight to emphasize the role of the corresponding 3D prior frame in the subsequent 3D detection; likewise, a subcategory with a lower predicted probability can be given a smaller weight to weaken the role of the corresponding 3D prior frame in the subsequent 3D detection, so that the determined 3D detection frame information is more accurate.
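For illustration, one plausible realization of this probability-weighted fusion of prior boxes is sketched below; the (w, h, l, z) parameterization and all numbers are assumptions.

```python
import numpy as np

def fuse_prior_boxes(priors, probs, offset):
    # Weight each subcategory's 3D prior box by its predicted probability,
    # then apply the predicted 3D residual to the fused prior.
    weights = probs / probs.sum()                  # normalize to sum to 1
    fused_prior = (weights[:, None] * priors).sum(axis=0)
    return fused_prior + offset

# Hypothetical (w, h, l, z) priors for three vehicle subcategories.
priors = np.array([[1.6, 1.5, 3.9, 12.0],
                   [1.9, 2.0, 5.0, 11.0],
                   [2.5, 3.4, 8.0, 10.0]])
probs = np.array([0.7, 0.2, 0.1])           # predicted subcategory probabilities
offset = np.array([0.05, -0.02, 0.1, 0.4])  # predicted 3D residual
box_3d = fuse_prior_boxes(priors, probs, offset)
```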
在一些实施例中,为了提升三维检测的精度,这里可以先进行深度图和特征图的裁剪,而后再进行三维检测,如图2E所示,可以通过如下步骤来实现:In some embodiments, in order to improve the accuracy of the 3D detection, the depth map and the feature map can be clipped first, and then the 3D detection can be performed, as shown in FIG. 2E , which can be achieved by the following steps:
步骤S1031a、基于目标对象所在二维检测框信息包括的位置范围,分别从深度图以及特征图中提取与位置范围匹配的深度图和特征图;Step S1031a, based on the location range included in the two-dimensional detection frame information where the target object is located, extracting a depth map and a feature map that match the location range from the depth map and the feature map, respectively;
步骤S1031b、基于与位置范围匹配的深度图和特征图确定三维检测框偏移量。Step S1031b. Determine the offset of the 3D detection frame based on the depth map and feature map matched with the location range.
这里,可以基于二维检测框信息包括的位置范围实现这一位置范围所对应的深度图和特征图的裁剪,也即,可以得到指向目标对象的局部深度图和局部特征图。基于局部深度图和 局部特征图可以确定对应的三维检测框偏移量,这里的三维检测框偏移量也可以是利用第二目标检测网络确定的。Here, the depth map and feature map corresponding to the position range can be clipped based on the position range included in the two-dimensional detection frame information, that is, the local depth map and local feature map pointing to the target object can be obtained. The corresponding 3D detection frame offset can be determined based on the local depth map and the local feature map, and the 3D detection frame offset here can also be determined by using the second target detection network.
在进行三维检测框偏移量预测的过程中,由于所采用的局部深度图和局部特征图可以排除其它无关特征的影响,使得预测的准确度更高。In the process of predicting the offset of the three-dimensional detection frame, since the local depth map and local feature map adopted can exclude the influence of other irrelevant features, the prediction accuracy is higher.
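A minimal sketch of this cropping step using torchvision's ROI-Align operator; the tensor sizes, box coordinates, and output resolution are hypothetical.

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 64, 96, 320)   # backbone feature map (assumed shape)
depth = torch.randn(1, 1, 96, 320)   # generated local ground depth map
# One 2D detection box: (batch_index, x1, y1, x2, y2) in feature-map coordinates.
rois = torch.tensor([[0.0, 100.0, 30.0, 180.0, 80.0]])

roi_feat = roi_align(feat, rois, output_size=(7, 7))    # (1, 64, 7, 7)
roi_depth = roi_align(depth, rois, output_size=(7, 7))  # (1, 1, 7, 7)
# The cropped depth map and feature map are then fed to the second target
# detection network to regress the 3D detection frame offsets.
```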
为了实现本公开实施例提供的目标检测的方法,需要进行第一目标检测网络和第二目标检测网络的训练。针对不同的目标检测网络可以设置对应的监督信号(即先验框信息),进而可以确定出对应的损失函数值,基于这些损失函数值可以反向传播指导网络训练,在此不做限制。In order to implement the object detection method provided by the embodiments of the present disclosure, training of the first object detection network and the second object detection network is required. For different target detection networks, corresponding supervisory signals (that is, prior frame information) can be set, and then the corresponding loss function values can be determined. Based on these loss function values, network training can be guided by backpropagation, and there is no limitation here.
考虑到目标对象在地面的投影区域所对应的深度图对于上述目标检测过程中的关键作用,本公开实施例还针对深度图设置了对应的监督信号(即标注深度图),可以通过深度图生成网络来实现。该深度图生成网络的训练过程如图2F所示,所述训练过程包括如下步骤:Considering the key role, in the above target detection process, of the depth map corresponding to the projection area of the target object on the ground, the embodiments of the present disclosure also set a corresponding supervisory signal (namely, the labeled depth map) for the depth map, which can be realized through a depth map generation network. The training process of the depth map generation network is shown in FIG. 2F and includes the following steps:
步骤S401、获取图像样本、以及基于图像样本中标注的目标对象的三维标注框信息确定的标注深度图;Step S401, acquiring an image sample and an annotated depth map determined based on the three-dimensional annotation frame information of the target object annotated in the image sample;
步骤S402、对图像样本进行特征提取,得到图像样本的特征图;Step S402, performing feature extraction on the image sample to obtain a feature map of the image sample;
步骤S403、将图像样本的特征图输入到待训练的深度图生成网络,得到深度图生成网络输出的深度图,并基于输出的深度图与标注深度图之间的相似度确定损失函数值;Step S403, input the feature map of the image sample into the depth map generation network to be trained, obtain the depth map output by the depth map generation network, and determine the loss function value based on the similarity between the output depth map and the marked depth map;
步骤S404、在损失函数值大于预设阈值的情况下,调整深度图生成网络的网络参数值,并将图像样本的特征图输入到调整后的深度图生成网络中,直至损失函数值小于或等于预设阈值。Step S404. When the loss function value is greater than the preset threshold, adjust the network parameter value of the depth map generation network, and input the feature map of the image sample into the adjusted depth map generation network until the loss function value is less than or equal to preset threshold.
这里所获取的图像样本与待检测图像的获取方式类似。除此之外,有关图像样本的特征图的提取可以参见上述待检测图像的特征图的提取过程。The image samples here are acquired in a similar way to the image to be detected. In addition, for the extraction of the feature maps of the image samples, reference may be made to the extraction process of the feature map of the image to be detected described above.
本公开实施例可以基于深度图生成网络输出的深度图与标注深度图之间的相似度确定损失函数值,并根据该损失函数值调整深度图生成网络的网络参数值,使得网络输出结果与标注结果趋于一致或者更为接近。The embodiments of the present disclosure can determine the loss function value based on the similarity between the depth map output by the depth map generation network and the labeled depth map, and adjust the network parameter values of the depth map generation network according to this loss function value, so that the network output converges toward, or comes closer to, the annotation result.
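For illustration, a minimal sketch of the depth supervision in steps S403 and S404, assuming an L1 loss restricted to pixels that actually received a label (unlabeled pixels are assumed to be encoded as 0); the network and optimizer objects are placeholders.

```python
import torch
import torch.nn.functional as F

def depth_loss(pred_depth, label_depth):
    # Supervise only where the projected extended regions produced labels.
    mask = label_depth > 0
    return F.l1_loss(pred_depth[mask], label_depth[mask])

# One training step (depth_net, feature_map, label_depth, optimizer assumed):
# loss = depth_loss(depth_net(feature_map), label_depth)
# if loss.item() > threshold:
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```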
其中,上述步骤401中所述标注深度图可以按照如图2G所示的步骤来获取:Wherein, the marked depth map described in the above step 401 can be obtained according to the steps shown in FIG. 2G:
步骤S4011、基于三维标注框所在三维坐标系与标注框底面中心点所在地面坐标系之间的对应关系,将目标对象标注的三维标注框信息投影至地面,得到目标对象在地面的投影区域及投影区域所在延展区域;Step S4011: based on the correspondence between the three-dimensional coordinate system of the three-dimensional annotation frame and the ground coordinate system of the center point of the bottom surface of the annotation frame, project the three-dimensional annotation frame information annotated for the target object onto the ground, to obtain the projection area of the target object on the ground and the extended area in which the projection area is located;
步骤S4012、基于三维标注框信息包括的标注框底面中心点的位置坐标和深度值确定延展区域上各三维标注点的深度值;Step S4012. Determine the depth value of each three-dimensional label point on the extended area based on the position coordinates and depth values of the center point of the bottom surface of the label box included in the three-dimensional label box information;
步骤S4013、基于相机坐标系与图像坐标系之间的对应关系,将相机坐标系下延展区域上的各三维标注点投影至图像坐标系下的像素平面,得到像素平面中的投影点;Step S4013, based on the corresponding relationship between the camera coordinate system and the image coordinate system, project the three-dimensional label points on the extended area under the camera coordinate system to the pixel plane under the image coordinate system to obtain the projection points in the pixel plane;
步骤S4014、基于延展区域上各三维标注点的深度值以及像素平面中的投影点,得到标注深度图。Step S4014, based on the depth values of the three-dimensional marked points on the extended area and the projected points in the pixel plane, the marked depth map is obtained.
在实施时,如图2H所示,基于三维标注框信息包括的标注框底面中心点的位置坐标和深度值确定延展区域上各三维标注点的深度值,包括如下步骤:During implementation, as shown in Figure 2H, the depth value of each three-dimensional label point on the extended area is determined based on the position coordinates and depth values of the center point of the bottom surface of the label box included in the three-dimensional label box information, including the following steps:
步骤S41、将所述标注框底面中心点的深度值和位置坐标分别确定为所述延展区域的中心点的深度值和位置坐标;Step S41, determining the depth value and position coordinates of the center point of the bottom surface of the label frame as the depth value and position coordinates of the center point of the extended area, respectively;
步骤S42、在确定所述延展区域的中心点的位置坐标的情况下,以所述延展区域的中心点的深度值为起始深度值,以预设深度间隔确定所述延展区域中各三维标注点的深度值。Step S42: with the position coordinates of the center point of the extended area determined, take the depth value of the center point of the extended area as the starting depth value, and determine the depth value of each three-dimensional annotation point in the extended area at a preset depth interval.
本公开实施例提供的是一种局部地面深度标签生成的方法。这里,可以利用目标对象的三维标注框信息包括的标注框底面中心点(该中心点落在地面上)的位置,得到周围地面(对应延展区域)的深度信息。在实施时,如图2I所示,图(a)中示出了标注的目标对象20以及目标对象的三维标注框21,这里,三维标注框底面中心点与周围地面在同一高度,在图(b)中,在中心点周围的一个延展区域22中可以生成大量的三维标注点23,这里的三维点包括中心点,还包括以延展区域22的中心点的深度值为起始深度值,以预设深度间隔确定的在延展区域上的各三维标注点23。An embodiment of the present disclosure provides a method for generating local ground depth labels. Here, the position of the center point of the bottom surface of the annotation frame (this center point falls on the ground), included in the three-dimensional annotation frame information of the target object, can be used to obtain the depth information of the surrounding ground (corresponding to the extended area). In implementation, as shown in FIG. 2I, diagram (a) shows the annotated target object 20 and the three-dimensional annotation frame 21 of the target object, where the center point of the bottom surface of the three-dimensional annotation frame is at the same height as the surrounding ground. In diagram (b), a large number of three-dimensional annotation points 23 can be generated in an extended area 22 around the center point; these three-dimensional points include the center point, as well as the three-dimensional annotation points 23 on the extended area determined at a preset depth interval, starting from the depth value of the center point of the extended area 22.
这样,如图2I中的图(b)所示,利用投影关系可以将三维标注点23投影到像素平面上,得到三维标注点在像素平面上对应的投影点,记录三维标注点23的深度值与其投影的投影点之间的对应关系,对每一个投影得到的投影点求出所对应的至少一个三维标注点的平均值深度值即可以得到图2I中的图(c)所示的标注深度图24,从标注深度图24即可得到周围地面(对应延展区域)的深度信息。In this way, as shown in diagram (b) of FIG. 2I, the three-dimensional annotation points 23 can be projected onto the pixel plane using the projection relationship, to obtain the corresponding projection points on the pixel plane. The correspondence between the depth values of the three-dimensional annotation points 23 and their projection points is recorded, and for each projection point, the average depth value of the at least one corresponding three-dimensional annotation point is computed, yielding the labeled depth map 24 shown in diagram (c) of FIG. 2I. The depth information of the surrounding ground (corresponding to the extended area) can then be obtained from the labeled depth map 24.
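For illustration, a minimal sketch of generating the three-dimensional annotation points 23 on the extended area 22; the patch extent and the lateral step are assumptions, since the disclosure only fixes the starting depth value and a preset depth interval.

```python
import numpy as np

def extended_region_points(center, half_width=3.0, depth_range=3.0,
                           lateral_step=0.1, depth_step=0.1):
    # center = (x, y, z): camera coordinates of the bottom-face center point,
    # which lies on the ground. Depths are laid out from the center depth at
    # the preset interval; the ground patch keeps the center point's height y.
    x0, y0, z0 = center
    xs = np.arange(x0 - half_width, x0 + half_width, lateral_step)
    zs = np.arange(z0 - depth_range, z0 + depth_range, depth_step)
    X, Z = np.meshgrid(xs, zs)
    Y = np.full_like(X, y0)  # flat local ground at the center point's height
    return np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

points = extended_region_points((2.0, 1.65, 15.0))  # hypothetical center
```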
其中,上述投影关系可以采用如下公式(1)来实现:Wherein, the above-mentioned projection relationship can be realized by using the following formula (1):
z_3d · [x_p, y_p, 1]^T = P_rect · R_rect · [x_3d, y_3d, z_3d, 1]^T    (1)

其中,(x_3d, y_3d, z_3d)表征的是三维标注点的相机坐标,(x_p, y_p)表征的是三维标注点投影的投影点,P_rect和R_rect分别表征的是旋转校正矩阵和投影矩阵。Here, (x_3d, y_3d, z_3d) denotes the camera coordinates of a three-dimensional annotation point, (x_p, y_p) denotes its projection point on the pixel plane, and P_rect and R_rect denote the rotation rectification matrix and the projection matrix, respectively.
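A minimal sketch of applying formula (1) and then the per-pixel averaging described above; the homogeneous-matrix shapes assumed here (P_rect of size 3x4, R_rect of size 4x4) follow one common convention and may differ from the patent's internal definition.

```python
import numpy as np

def labeled_depth_map(points, P_rect, R_rect, image_hw):
    # Project 3D annotation points to the pixel plane with formula (1) and
    # average the depths of all points that fall on the same pixel.
    h, w = image_hw
    hom = np.hstack([points, np.ones((len(points), 1))])  # (N, 4) homogeneous
    uvz = (P_rect @ R_rect @ hom.T).T                     # (N, 3)
    u = (uvz[:, 0] / uvz[:, 2]).astype(int)
    v = (uvz[:, 1] / uvz[:, 2]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth_sum, count = np.zeros((h, w)), np.zeros((h, w))
    for ui, vi, zi in zip(u[valid], v[valid], points[valid][:, 2]):
        depth_sum[vi, ui] += zi  # accumulate depth per projected pixel
        count[vi, ui] += 1
    return np.divide(depth_sum, count,
                     out=np.zeros((h, w)), where=count > 0)
```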
这样,将待检测图像的特征图输入到训练好的深度图生成网络即可以确定出待检测图像中的目标对象投影至地面的投影区域对应的深度图,进而可以结合特征图、三维先验框信息实现目标对象的三维预测。In this way, inputting the feature map of the image to be detected into the trained depth map generation network determines the depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground; three-dimensional prediction of the target object can then be realized in combination with the feature map and the three-dimensional prior frame information.
为了便于进一步理解上述三维预测的过程,接下来可以结合图2J进行说明。In order to facilitate a further understanding of the above three-dimensional prediction process, it can be explained in conjunction with FIG. 2J next.
如图2J所示,对于包含车辆30这一目标对象的待检测图像31而言,可以先通过特征提取网络32提取出待检测图像的特征图。之后一方面通过第一目标检测网络33进行二维检测,得到针对目标对象的二维检测信息,另一方面利用深度图生成网络34生成待检测图像中的目标对象投影至地面的投影区域对应的深度图35。As shown in FIG. 2J, for an image to be detected 31 containing the target object vehicle 30, the feature map of the image to be detected can first be extracted through the feature extraction network 32. Then, on the one hand, two-dimensional detection is performed through the first target detection network 33 to obtain two-dimensional detection information for the target object; on the other hand, the depth map generation network 34 is used to generate the depth map 35 corresponding to the projection area where the target object in the image to be detected is projected onto the ground.
本公开实施例中,基于上述二维检测信息,可以确定针对目标对象的三维先验框信息。如图2J所示为对应的三个子类别36所确定的三个三维先验框信息的示例性展示。In the embodiments of the present disclosure, based on the above two-dimensional detection information, the three-dimensional prior frame information for the target object may be determined. As shown in FIG. 2J , it is an exemplary display of three 3D a priori frame information determined for the corresponding three subcategories 36 .
这里,在将深度图以及特征图输入到训练好的第二目标检测网络37之前,可以先基于二维检测信息进行ROI-align方式下的裁剪,而后将裁剪得到的深度图和特征图输入到第二目标检测网络37,可以得到对应的三维检测框偏移量,如图2J所示的Δ(w,h,l)_3d、Δz_3d等信息。Here, before the depth map and the feature map are input into the trained second target detection network 37, ROI-align cropping can first be performed based on the two-dimensional detection information; the cropped depth map and feature map are then input into the second target detection network 37 to obtain the corresponding three-dimensional detection frame offsets, such as Δ(w,h,l)_3d and Δz_3d shown in FIG. 2J.
结合上述三维检测框偏移量以及三维先验框信息,可以确定三维检测信息。在实际应用中,可以将上述三维检测信息呈现在待检测图像上。The 3D detection information can be determined by combining the above 3D detection frame offset and 3D prior frame information. In practical applications, the above three-dimensional detection information can be presented on the image to be detected.
本领域技术人员可以理解,在具体实施方式的上述方法中,各步骤的撰写顺序并不意味着严格的执行顺序而对实施过程构成任何限定,各步骤的执行顺序应当以其功能和可能的内在逻辑确定。Those skilled in the art can understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the execution order of the steps should be determined by their functions and possible internal logic.
基于同一发明构思,本公开实施例中还提供了与目标检测的方法对应的目标检测的装置,由于本公开实施例中的装置解决问题的原理与本公开实施例上述目标检测的方法相似,因此装置的实施可以参见方法的实施。Based on the same inventive concept, the embodiments of the present disclosure further provide a target detection apparatus corresponding to the target detection method. Since the problem-solving principle of the apparatus in the embodiments of the present disclosure is similar to that of the above target detection method, the implementation of the apparatus may refer to the implementation of the method.
图3为本公开实施例提供的一种目标检测的装置的示意图,装置包括:提取模块301、生成模块302、第一检测模块303;其中,Fig. 3 is a schematic diagram of a target detection device provided by an embodiment of the present disclosure, the device includes: an extraction module 301, a generation module 302, and a first detection module 303; wherein,
提取模块301,配置为对待检测图像进行特征提取,得到待检测图像的特征图;The extraction module 301 is configured to perform feature extraction on the image to be detected to obtain a feature map of the image to be detected;
生成模块302,配置为基于特征图,生成待检测图像中的目标对象投影至地面的投影区域对应的深度图;The generation module 302 is configured to generate a depth map corresponding to a projection area where the target object in the image to be detected is projected to the ground based on the feature map;
第一检测模块303,配置为基于深度图以及特征图,确定目标对象的三维检测信息。The first detection module 303 is configured to determine three-dimensional detection information of the target object based on the depth map and the feature map.
采用上述目标检测的装置,不仅可以对待检测图像进行特征提取,还可以基于提取得到的特征图,生成待检测图像中的目标对象投影至地面的投影区域对应的深度图,进而基于深度图和特征图确定目标对象的三维检测信息。由于生成的深度图是指向待检测图像中的目标对象的,且对应的是目标对象投影到地面的投影区域,该投影区域一定程度上与目标对象关联,这样,在利用局部地面上的目标对象的特征图进行三维检测时可以以该局面地面对应的 深度图作为指导,提升检测的精度。Using the above target detection device, not only can feature extraction be performed on the image to be detected, but also based on the extracted feature map, a depth map corresponding to the projection area of the target object in the image to be detected can be generated to the ground, and then based on the depth map and features The map determines the 3D detection information of the target object. Since the generated depth map points to the target object in the image to be detected, and corresponds to the projection area where the target object is projected onto the ground, the projection area is associated with the target object to a certain extent. In this way, using the target object on the local ground The depth map corresponding to the ground of the situation can be used as a guide for three-dimensional detection to improve the accuracy of detection.
In a possible implementation, the apparatus further includes:
a second detection module 304, configured to detect the feature map, after the feature map of the image to be detected is obtained, to obtain two-dimensional detection information for the target object;
the first detection module 303 is configured to determine the three-dimensional detection information of the target object based on the depth map and the feature map by the following steps:
determining three-dimensional prior box information for the target object based on the two-dimensional detection information;
determining three-dimensional detection box information of the target object based on the three-dimensional prior box information, the depth map, and the feature map.
In a possible implementation, the two-dimensional detection information includes information on the two-dimensional detection box in which the target object is located and information on the category to which the target object belongs; the first detection module 303 is configured to determine the three-dimensional prior box information for the target object based on the two-dimensional detection information by the following steps:
determining, based on the category information of the target object, clustering information of each subcategory included in the category to which the target object belongs;
determining the three-dimensional prior box information for the target object according to the clustering information of each subcategory and the information on the two-dimensional detection box in which the target object is located.
In a possible implementation, the first detection module 303 is configured to determine the three-dimensional prior box information for the target object, according to the clustering information of each subcategory and the information on the two-dimensional detection box in which the target object is located, by the following steps:
for each of the subcategories, determining a depth value corresponding to that subcategory based on the cluster height value included in the clustering information of that subcategory and the width value included in the information on the two-dimensional detection box in which the target object is located;
determining one piece of three-dimensional prior box information for the target object based on the clustering information of that subcategory and the depth value corresponding to that subcategory.
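For illustration, such a subcategory depth could be obtained by pinhole similar triangles (the disclosure pairs the cluster height value with the width value of the two-dimensional detection box; the exact ratio below is an editorial assumption, not a formula given in the disclosure):

    def subcategory_depth(focal_px, cluster_height_m, box_width_px):
        """Similar-triangles depth estimate for one subcategory.

        focal_px:         camera focal length in pixels (assumed known).
        cluster_height_m: clustered real-world height of the subcategory.
        box_width_px:     width of the 2D detection box, in pixels.
        """
        # Larger boxes imply closer objects; depth scales inversely
        # with the box size for a fixed real-world extent.
        return focal_px * cluster_height_m / box_width_px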
In a possible implementation, the first detection module 303 is configured to determine the three-dimensional detection box information of the target object, based on the three-dimensional prior box information, the depth map, and the feature map, by the following steps:
determining three-dimensional detection box offsets according to the depth map and the feature map;
determining the three-dimensional detection box information of the target object based on the three-dimensional prior box information and the three-dimensional detection box offsets.
In a possible implementation, the first detection module 303 is configured to determine the three-dimensional detection box offsets according to the depth map and the feature map by the following steps:
extracting, based on the position range included in the information on the two-dimensional detection box in which the target object is located, a depth map and a feature map matching the position range from the depth map and the feature map, respectively;
determining the three-dimensional detection box offsets based on the depth map and the feature map matching the position range.
In a possible implementation, there are multiple pieces of three-dimensional prior box information; the first detection module 303 is configured to determine the three-dimensional detection box information of the target object, based on the three-dimensional prior box information and the three-dimensional detection box offsets, by the following steps:
determining a weight corresponding to each piece of three-dimensional prior box information;
determining the three-dimensional detection box information of the target object based on each piece of three-dimensional prior box information, the weight corresponding to each piece of three-dimensional prior box information, and the three-dimensional detection box offsets.
In a possible implementation, the first detection module 303 is configured to determine the weight corresponding to each piece of three-dimensional prior box information by the following steps:
determining, according to the depth map and the feature map, the predicted probability of each subcategory included in the category information of the target object;
determining, based on the predicted probability of each subcategory, the weight of the three-dimensional prior box information corresponding to each subcategory.
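One plausible reading of these two steps is sketched below (the softmax weighting and the weighted averaging of priors are editorial assumptions consistent with, but not mandated by, the disclosure):

    import numpy as np

    def fuse_prior_boxes(priors_whl, priors_z, subcat_logits):
        """Weight per-subcategory 3D prior boxes by predicted probabilities.

        priors_whl:    (S, 3) array, one (w, h, l) prior per subcategory.
        priors_z:      (S,) array of per-subcategory prior depth values.
        subcat_logits: (S,) raw subcategory scores predicted from the
                       depth map and the feature map.
        """
        w = np.exp(subcat_logits - subcat_logits.max())
        w /= w.sum()                       # softmax -> per-subcategory weights
        whl = (w[:, None] * priors_whl).sum(axis=0)
        z = float((w * priors_z).sum())
        # The 3D detection box offsets are then applied to (whl, z).
        return whl, z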
In a possible implementation, the second detection module 304 is configured to detect the feature map, to obtain the two-dimensional detection information for the target object, by the following steps:
determining two-dimensional detection box offsets according to the feature map;
determining the two-dimensional detection information of the target object based on preset two-dimensional prior box information and the two-dimensional detection box offsets.
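As an illustrative sketch, a Faster R-CNN-style parameterization could realize this decoding (the disclosure states only preset prior box plus offset; this specific parameterization is an assumption):

    import numpy as np

    def decode_2d_box(anchor, deltas):
        """Decode a 2D detection box from a preset prior box and offsets.

        anchor: (x_c, y_c, w, h) of the preset 2D prior box.
        deltas: (dx, dy, dw, dh) offsets predicted from the feature map.
        """
        xa, ya, wa, ha = anchor
        dx, dy, dw, dh = deltas
        x_c = xa + dx * wa                 # shift the center
        y_c = ya + dy * ha
        w = wa * np.exp(dw)                # scale the size
        h = ha * np.exp(dh)
        return x_c - w / 2, y_c - h / 2, x_c + w / 2, y_c + h / 2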
In a possible implementation, the depth map is determined by a trained depth map generation network; the depth map generation network is trained with image samples and with annotated depth maps determined based on the three-dimensional annotation box information of the target objects annotated in the image samples.
In a possible implementation, the three-dimensional annotation box information of the target object includes the position coordinates and the depth value of the center point of the bottom surface of the annotation box; the generation module 302 is configured to obtain the annotated depth map by the following steps:
projecting the three-dimensional annotation box information annotated for the target object onto the ground, based on the correspondence between the three-dimensional coordinate system in which the three-dimensional annotation box is located and the ground coordinate system in which the center point of the bottom surface of the annotation box is located, to obtain the projection region of the target object on the ground and the extended region in which the projection region is located;
determining the depth value of each three-dimensional label point in the extended region based on the position coordinates and the depth value of the center point of the bottom surface of the annotation box included in the three-dimensional annotation box information;
projecting, based on the correspondence between the camera coordinate system and the image coordinate system, each three-dimensional label point in the extended region under the camera coordinate system onto the pixel plane under the image coordinate system, to obtain the projection points in the pixel plane;
obtaining the annotated depth map based on the depth values of the three-dimensional label points in the extended region and the projection points in the pixel plane.
In a possible implementation, the generation module 302 is configured to determine the depth value of each three-dimensional label point in the extended region, based on the position coordinates and the depth value of the center point of the bottom surface of the annotation box included in the three-dimensional annotation box information, by the following steps:
determining the depth value and the position coordinates of the center point of the bottom surface of the annotation box as the depth value and the position coordinates of the center point of the extended region, respectively;
with the position coordinates of the center point of the extended region determined, taking the depth value of the center point of the extended region as the starting depth value and determining the depth values of the three-dimensional label points in the extended region at preset depth intervals.
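For illustration only, one way to realize these two steps is sketched below (the grid extent, the 0.5 m interval, and treating the ground plane as y = 0 are editorial placeholders; the disclosure specifies only a starting depth at the center point and preset depth intervals):

    import numpy as np

    def extended_region_depths(center_x, center_z, center_depth,
                               half_extent_m=3.0, step_m=0.5):
        """Assign depth values to 3D label points over the extended region.

        The box-bottom center supplies the center coordinates and the
        starting depth value; points are laid out on the ground plane
        (taken as y = 0 here) and their depth values change with the
        preset interval along the depth axis.
        """
        offs = np.arange(-half_extent_m, half_extent_m + step_m, step_m)
        points, depths = [], []
        for dz in offs:
            for dx in offs:
                points.append((center_x + dx, 0.0, center_z + dz))
                depths.append(center_depth + dz)  # start at center, step out
        return np.array(points), np.array(depths)

Each generated point would then be projected onto the pixel plane, as in the projection sketch earlier in this description, so that its depth value can be written into the annotated depth map.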
For descriptions of the processing flow of each module in the apparatus and of the interaction flow between the modules, reference may be made to the relevant descriptions in the above method embodiments, which are not repeated here.
An embodiment of the present disclosure further provides an electronic device. As shown in FIG. 4, a schematic structural diagram of the electronic device provided by an embodiment of the present disclosure, the device includes a processor 401, a memory 402, and a bus 403. The memory 402 stores machine-readable instructions executable by the processor 401 (for example, execution instructions corresponding to the extraction module 301, the generation module 302, and the first detection module 303 in the apparatus of FIG. 3). When the electronic device runs, the processor 401 and the memory 402 communicate through the bus 403, and when the machine-readable instructions are executed by the processor 401, the following processing is performed:
performing feature extraction on an image to be detected to obtain a feature map of the image to be detected;
generating, based on the feature map, a depth map corresponding to the region where the target object in the image to be detected is projected onto the ground;
determining three-dimensional detection information of the target object based on the depth map and the feature map.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the target detection method described in the above method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a computer program product carrying program code; the instructions included in the program code can be used to execute the steps of the target detection method described in the above method embodiments, to which reference may be made.
The above computer program product may be implemented in hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a Software Development Kit (SDK).
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the working processes of the systems and apparatuses described above, reference may be made to the corresponding processes in the foregoing method embodiments. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and there may be other divisions in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solutions of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only specific implementations of the present disclosure, used to illustrate the technical solutions of the present disclosure rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person familiar with this technical field may still, within the technical scope disclosed by the present disclosure, modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of the technical features; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Industrial Applicability
In the embodiments of the present disclosure, the target detection method includes: performing feature extraction on an image to be detected to obtain a feature map of the image to be detected; generating, based on the feature map, a depth map corresponding to the region where the target object in the image to be detected is projected onto the ground; and determining three-dimensional detection information of the target object based on the depth map and the feature map. With this target detection method, not only can feature extraction be performed on the image to be detected, but a depth map corresponding to the region where the target object is projected onto the ground can also be generated from the extracted feature map, and the three-dimensional detection information of the target object can then be determined based on the depth map and the feature map. Since the generated depth map is specific to the target object in the image to be detected and corresponds to the region where the target object is projected onto the ground, that region is to some extent associated with the target object; thus, when three-dimensional detection is performed with the feature map of the target object on the local ground, the depth map corresponding to that local ground can serve as a guide, thereby improving detection accuracy.

Claims (16)

1. A target detection method, the method comprising:
    performing feature extraction on an image to be detected to obtain a feature map of the image to be detected;
    generating, based on the feature map, a depth map corresponding to a region where a target object in the image to be detected is projected onto the ground;
    determining three-dimensional detection information of the target object based on the depth map and the feature map.
2. The method according to claim 1, wherein, after the feature map of the image to be detected is obtained, the method further comprises:
    detecting the feature map to obtain two-dimensional detection information for the target object;
    wherein determining the three-dimensional detection information of the target object based on the depth map and the feature map comprises:
    determining three-dimensional prior box information for the target object based on the two-dimensional detection information;
    determining three-dimensional detection box information of the target object based on the three-dimensional prior box information, the depth map, and the feature map.
3. The method according to claim 2, wherein the two-dimensional detection information comprises information on a two-dimensional detection box in which the target object is located and information on a category to which the target object belongs; and determining the three-dimensional prior box information for the target object based on the two-dimensional detection information comprises:
    determining, based on the category information of the target object, clustering information of each subcategory included in the category to which the target object belongs;
    determining the three-dimensional prior box information for the target object according to the clustering information of each subcategory and the information on the two-dimensional detection box in which the target object is located.
4. The method according to claim 3, wherein determining the three-dimensional prior box information for the target object according to the clustering information of each subcategory and the information on the two-dimensional detection box in which the target object is located comprises:
    for each of the subcategories, determining a depth value corresponding to the subcategory based on a cluster height value included in the clustering information of the subcategory and a width value included in the information on the two-dimensional detection box in which the target object is located;
    determining one piece of three-dimensional prior box information for the target object based on the clustering information of the subcategory and the depth value corresponding to the subcategory.
5. The method according to claim 3 or 4, wherein determining the three-dimensional detection box information of the target object based on the three-dimensional prior box information, the depth map, and the feature map comprises:
    determining three-dimensional detection box offsets according to the depth map and the feature map;
    determining the three-dimensional detection box information of the target object based on the three-dimensional prior box information and the three-dimensional detection box offsets.
6. The method according to claim 5, wherein determining the three-dimensional detection box offsets according to the depth map and the feature map comprises:
    extracting, based on a position range included in the information on the two-dimensional detection box in which the target object is located, a depth map and a feature map matching the position range from the depth map and the feature map, respectively;
    determining the three-dimensional detection box offsets based on the depth map and the feature map matching the position range.
7. The method according to claim 5 or 6, wherein there are multiple pieces of three-dimensional prior box information; and determining the three-dimensional detection box information of the target object based on the three-dimensional prior box information and the three-dimensional detection box offsets comprises:
    determining a weight corresponding to each piece of three-dimensional prior box information;
    determining the three-dimensional detection box information of the target object based on each piece of three-dimensional prior box information, the weight corresponding to each piece of three-dimensional prior box information, and the three-dimensional detection box offsets.
8. The method according to claim 7, wherein the method further comprises:
    determining, according to the depth map and the feature map, a predicted probability of each subcategory included in the category information of the target object;
    wherein determining the weight corresponding to each piece of three-dimensional prior box information comprises:
    determining, based on the predicted probability of each subcategory, the weight of the three-dimensional prior box information corresponding to each subcategory.
9. The method according to any one of claims 2 to 8, wherein detecting the feature map to obtain the two-dimensional detection information for the target object comprises:
    determining two-dimensional detection box offsets according to the feature map;
    determining the two-dimensional detection information of the target object based on preset two-dimensional prior box information and the two-dimensional detection box offsets.
10. The method according to any one of claims 1 to 9, wherein the depth map is determined by a trained depth map generation network; and the depth map generation network is trained with image samples and with annotated depth maps determined based on three-dimensional annotation box information of target objects annotated in the image samples.
11. The method according to claim 10, wherein the three-dimensional annotation box information of the target object comprises position coordinates and a depth value of a center point of a bottom surface of the annotation box; and the annotated depth map is obtained by the following steps:
    projecting the three-dimensional annotation box information annotated for the target object onto the ground, based on a correspondence between a three-dimensional coordinate system in which the three-dimensional annotation box is located and a ground coordinate system in which the center point of the bottom surface of the annotation box is located, to obtain a projection region of the target object on the ground and an extended region in which the projection region is located;
    determining a depth value of each three-dimensional label point in the extended region based on the position coordinates and the depth value of the center point of the bottom surface of the annotation box included in the three-dimensional annotation box information;
    projecting, based on a correspondence between a camera coordinate system and an image coordinate system, each three-dimensional label point in the extended region under the camera coordinate system onto a pixel plane under the image coordinate system, to obtain projection points in the pixel plane;
    obtaining the annotated depth map based on the depth values of the three-dimensional label points in the extended region and the projection points in the pixel plane.
12. The method according to claim 11, wherein determining the depth value of each three-dimensional label point in the extended region based on the position coordinates and the depth value of the center point of the bottom surface of the annotation box included in the three-dimensional annotation box information comprises:
    determining the depth value and the position coordinates of the center point of the bottom surface of the annotation box as the depth value and the position coordinates of a center point of the extended region, respectively;
    with the position coordinates of the center point of the extended region determined, taking the depth value of the center point of the extended region as a starting depth value and determining the depth values of the three-dimensional label points in the extended region at preset depth intervals.
13. A target detection apparatus, the apparatus comprising:
    an extraction module, configured to perform feature extraction on an image to be detected to obtain a feature map of the image to be detected;
    a generation module, configured to generate, based on the feature map, a depth map corresponding to a region where a target object in the image to be detected is projected onto the ground;
    a first detection module, configured to determine three-dimensional detection information of the target object based on the depth map and the feature map.
14. An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate through the bus; and when the machine-readable instructions are executed by the processor, the steps of the target detection method according to any one of claims 1 to 12 are executed.
15. A computer-readable storage medium on which a computer program is stored, wherein, when the computer program is run by a processor, the steps of the target detection method according to any one of claims 1 to 12 are executed.
16. A computer program product, comprising a computer-readable storage medium storing program code, wherein, when instructions included in the program code are run by a processor of a computer device, the steps of the target detection method according to any one of claims 1 to 12 are implemented.
PCT/CN2022/090957 2021-09-30 2022-05-05 Target detection method and apparatus, electronic device, storage medium, and computer program product WO2023050810A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111164729.9A CN114119991A (en) 2021-09-30 2021-09-30 Target detection method and device, electronic equipment and storage medium
CN202111164729.9 2021-09-30

Publications (1)

Publication Number Publication Date
WO2023050810A1 (en) 2023-04-06

Family

ID=80441823

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090957 WO2023050810A1 (en) 2021-09-30 2022-05-05 Target detection method and apparatus, electronic device, storage medium, and computer program product

Country Status (2)

Country Link
CN (1) CN114119991A (en)
WO (1) WO2023050810A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119991A (en) * 2021-09-30 2022-03-01 深圳市商汤科技有限公司 Target detection method and device, electronic equipment and storage medium
CN116189150A (en) * 2023-03-02 2023-05-30 吉咖智能机器人有限公司 Monocular 3D target detection method, device, equipment and medium based on fusion output

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046767A (en) * 2019-12-04 2020-04-21 武汉大学 3D target detection method based on monocular image
US20200160033A1 (en) * 2018-11-15 2020-05-21 Toyota Research Institute, Inc. System and method for lifting 3d representations from monocular images
CN111832338A (en) * 2019-04-16 2020-10-27 北京市商汤科技开发有限公司 Object detection method and device, electronic equipment and storage medium
CN112733672A (en) * 2020-12-31 2021-04-30 深圳一清创新科技有限公司 Monocular camera-based three-dimensional target detection method and device and computer equipment
CN114119991A (en) * 2021-09-30 2022-03-01 深圳市商汤科技有限公司 Target detection method and device, electronic equipment and storage medium


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116681660A (en) * 2023-05-18 2023-09-01 中国长江三峡集团有限公司 Target object defect detection method and device, electronic equipment and storage medium
CN116681660B (en) * 2023-05-18 2024-04-19 中国长江三峡集团有限公司 Target object defect detection method and device, electronic equipment and storage medium
CN117315402A (en) * 2023-11-02 2023-12-29 北京百度网讯科技有限公司 Training method of three-dimensional object detection model and three-dimensional object detection method

Also Published As

Publication number Publication date
CN114119991A (en) 2022-03-01


Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22874208
Country of ref document: EP
Kind code of ref document: A1