CN114119991A - Target detection method and device, electronic equipment and storage medium - Google Patents

Target detection method and device, electronic equipment and storage medium

Info

Publication number
CN114119991A
Authority
CN
China
Prior art keywords
dimensional
target object
information
determining
detection
Prior art date
Legal status
Pending
Application number
CN202111164729.9A
Other languages
Chinese (zh)
Inventor
刘配
杨国润
王哲
石建萍
Current Assignee
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd filed Critical Shenzhen Sensetime Technology Co Ltd
Priority to CN202111164729.9A priority Critical patent/CN114119991A/en
Publication of CN114119991A publication Critical patent/CN114119991A/en
Priority to PCT/CN2022/090957 priority patent/WO2023050810A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a target detection method, an apparatus, an electronic device and a storage medium, wherein the method comprises: performing feature extraction on an image to be detected to obtain a feature map of the image to be detected; generating, based on the feature map, a depth map corresponding to a projection area of a target object in the image to be detected projected onto the ground; and determining three-dimensional detection information of the target object based on the depth map and the feature map. Because the projection area is associated with the target object to a certain extent, the depth map corresponding to the local ground can pertinently guide three-dimensional detection on the feature map of the target object located on that local ground, which improves detection precision.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method and an apparatus for target detection, an electronic device, and a storage medium.
Background
Compared with a two-dimensional (2D) target detection task, a three-dimensional (3D) target detection task is more difficult and more complex: it needs to detect the 3D geometric information and semantic information of a target in a 3D scene, mainly the length, width, height, center point and orientation angle of the target. Among 3D detection approaches, 3D object detection on monocular images is economical and practical, and is widely used in various fields (such as unmanned driving).
However, monocular-image-based 3D object detection techniques rely primarily on external subtasks that are responsible for 2D object detection, depth map estimation, and so on. Because these subtasks are trained independently, they introduce precision loss, limit the performance upper bound of the network model, and cannot meet the precision requirement of 3D detection.
Disclosure of Invention
The embodiment of the disclosure at least provides a method and a device for target detection, an electronic device and a storage medium, so as to improve the precision of 3D target detection.
In a first aspect, an embodiment of the present disclosure provides a method for target detection, where the method includes:
performing feature extraction on an image to be detected to obtain a feature map of the image to be detected;
generating a depth map corresponding to a projection area of a target object in the image to be detected projected to the ground based on the feature map;
and determining three-dimensional detection information of the target object based on the depth map and the feature map.
By adopting the target detection method, not only can the features of the image to be detected be extracted, but also the depth map corresponding to the projection area of the target object projected to the ground in the image to be detected can be generated based on the extracted feature map, and then the three-dimensional detection information of the target object can be determined based on the depth map and the feature map. The generated depth map is directed to the target object in the image to be detected, and corresponds to a projection area of the target object projected to the ground, and the projection area is associated with the target object to a certain extent, so that the depth map corresponding to the local ground can be used as a guide when the feature map of the target object on the local ground is used for three-dimensional detection, and the detection precision is improved.
In a possible implementation, after obtaining the feature map of the image to be detected, the method further includes:
performing two-dimensional detection on the feature map to obtain two-dimensional detection information for the target object;
the determining three-dimensional detection information of the target object based on the depth map and the feature map comprises:
determining three-dimensional prior frame information for the target object based on the two-dimensional detection information;
and determining three-dimensional detection frame information of the target object based on the three-dimensional prior frame information, the depth map and the feature map.
The three-dimensional detection can be detection combined with three-dimensional prior frame information, and the three-dimensional prior frame can restrict the initial position of the three-dimensional detection to a certain extent so as to search the three-dimensional detection frame information near the initial position, thereby further improving the precision of the three-dimensional detection.
In a possible implementation manner, the two-dimensional detection information includes two-dimensional detection frame information where the target object is located and category information to which the target object belongs; the determining three-dimensional prior frame information for the target object based on the two-dimensional detection information comprises:
based on the category information of the target object, determining clustering information of each sub-category included in the category of the target object;
and determining three-dimensional prior frame information aiming at the target object according to the clustering information of each sub-category and the two-dimensional detection frame information of the target object.
The three-dimensional prior frame can be determined by combining the information of the category to which the target object belongs. Target objects of different categories may correspond to three-dimensional prior frames of different sizes, positions and the like, so the category information can assist in determining the position of the three-dimensional prior frame with high accuracy.
In a possible implementation manner, the determining, according to the clustering information of each sub-category and the two-dimensional detection frame information of the target object, three-dimensional prior frame information for the target object includes:
for each sub-category in the sub-categories, determining a depth value corresponding to the sub-category based on a cluster height value included in cluster information of the sub-category and a width value included in two-dimensional detection frame information of the target object;
determining a piece of three-dimensional prior frame information for the target object based on the clustering information of the sub-category and the depth value corresponding to the sub-category.
In a possible embodiment, the determining three-dimensional detection frame information of the target object based on the three-dimensional prior frame information, the depth map, and the feature map includes:
determining the offset of the three-dimensional detection frame according to the depth map and the feature map;
and determining the three-dimensional detection frame information of the target object based on the three-dimensional prior frame information and the three-dimensional detection frame offset.
The offset can be predicted, and a more accurate three-dimensional detection frame can be obtained by combining the offset and the three-dimensional prior frame.
In a possible implementation, the determining a three-dimensional detection frame offset according to the depth map and the feature map includes:
extracting a depth map and a feature map which are matched with the position range from the depth map and the feature map respectively based on the position range included in the two-dimensional detection frame information of the target object;
and determining the three-dimensional detection frame offset based on the depth map and the feature map which are matched with the position range.
Here, the clipping of the depth map and the feature map at corresponding positions can be realized by using the position range included in the two-dimensional detection frame information of the target object, which enables the predicted offset to be specific to the target object and does not include related information of other interference regions, thereby improving the prediction accuracy.
In a possible implementation, there are a plurality of pieces of three-dimensional prior frame information; the determining three-dimensional detection frame information of the target object based on the three-dimensional prior frame information and the three-dimensional detection frame offset includes:
determining a weight corresponding to each three-dimensional prior frame information;
and determining the three-dimensional detection frame information of the target object based on the three-dimensional prior frame information, the weight corresponding to each piece of the three-dimensional prior frame information and the three-dimensional detection frame offset.
In one possible embodiment, the method further comprises:
determining the prediction probability of each sub-category included in the category information of the target object according to the depth map and the feature map;
the determining the weight corresponding to each three-dimensional prior frame information comprises:
and determining the weight of the three-dimensional prior frame information corresponding to each subcategory based on the prediction probability of each subcategory.
Considering that the prediction probabilities for different sub-categories are different, the probability is higher, which means that the probability that the target object points to the corresponding sub-category is higher, and further higher weight can be given to the corresponding three-dimensional prior frame information, which further improves the prediction accuracy of the finally obtained three-dimensional detection frame.
In a possible implementation manner, the detecting the feature map to obtain two-dimensional detection information for the target object includes:
determining the offset of the two-dimensional detection frame according to the feature map;
and determining the two-dimensional detection information of the target object based on preset two-dimensional prior frame information and the offset of the two-dimensional detection frame.
In one possible embodiment, the depth map is determined by a trained depth map generation network; the depth map generation network is obtained by training an image sample and an annotated depth map determined based on three-dimensional annotated frame information of a target object annotated in the image sample.
In one possible implementation, the three-dimensional labeling box information of the target object includes position coordinates and a depth value of a bottom center point of a labeling box; the marked depth map is obtained according to the following steps:
projecting the information of the three-dimensional marking frame marked by the target object to the ground based on the corresponding relation between the three-dimensional coordinate system of the three-dimensional marking frame and the ground coordinate system of the bottom center point of the marking frame to obtain a projection area of the target object on the ground and an extension area of the projection area;
determining the depth value of each three-dimensional marking point on the extension area based on the position coordinate and the depth value of the bottom center point of the marking frame, which are included in the three-dimensional marking frame information;
based on the corresponding relation between a camera coordinate system and an image coordinate system, projecting each three-dimensional marking point on the extension area under the camera coordinate system to a pixel plane under the image coordinate system to obtain a projection point in the pixel plane;
and obtaining the marked depth map based on the depth value of each three-dimensional marked point on the extension area and the projection point in the pixel plane.
The annotated depth map herein may be implemented in conjunction with ground projection operations and conversion operations between coordinate systems. The construction of the extension area can enable the local ground area including the target object to be completely covered, the corresponding labeled depth map can be obtained by utilizing the three-dimensional projection result of the extension area, the labeled depth map can reflect the depth information of the extension area including the local ground area, and the depth information can pertinently assist the target object on the corresponding local ground area to carry out three-dimensional detection.
In a possible implementation manner, the determining a depth value of each three-dimensional labeling point on the extension area based on the position coordinate of the bottom center point of the labeling frame and the depth value included in the three-dimensional labeling frame information includes:
respectively determining the depth value and the position coordinate of the bottom surface central point of the marking frame as the depth value and the position coordinate of the central point of the extension area;
and under the condition of determining the position coordinates of the central point of the extension area, determining the depth value of each three-dimensional labeling point in the extension area by taking the depth value of the central point of the extension area as an initial depth value and a preset depth interval.
In a second aspect, an embodiment of the present disclosure further provides an apparatus for target detection, where the apparatus includes:
the extraction module is used for performing feature extraction on the image to be detected to obtain a feature map of the image to be detected;
the generating module is used for generating a depth map corresponding to a projection area of a target object in the image to be detected projected to the ground based on the feature map;
and the first detection module is used for determining the three-dimensional detection information of the target object based on the depth map and the feature map.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the method of object detection according to the first aspect and any of its various embodiments.
In a fourth aspect, the disclosed embodiments also provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the method for object detection according to the first aspect and any of its various embodiments.
For the description of the effects of the above target detection apparatus, electronic device, and computer-readable storage medium, reference is made to the description of the above target detection method, which is not repeated here.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below; the drawings are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope, since those skilled in the art can derive additional related drawings from them without any creative effort.
Fig. 1 illustrates a flow chart of a method of target detection provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating an application of the method for target detection provided by the embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an apparatus for target detection provided by an embodiment of the present disclosure;
fig. 4 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Research shows that with the successful application of deep learning in the field of target detection, particularly in 3D target detection, detection precision has reached a very high level. A common 3D object detection approach is based on LiDAR data, but the expensive data acquisition equipment makes it difficult to support large-scale application and deployment. Monocular 3D object detection, in contrast, can rely on an ordinary vehicle-mounted camera, which makes it economical and readily available. For an image with a single viewing angle, the task of monocular 3D detection is to detect the 3D geometric information and semantic information of a target object in a 3D scene, mainly the length, width, height, center point and orientation angle of the target object.
In the related art, the 3D target detection technology based on monocular images mainly relies on some external subtasks, which are responsible for performing tasks such as 2D target detection and depth map estimation. Because the subtasks are trained independently, the precision loss exists, the performance upper limit of the network model is limited, the precision requirement of 3D detection cannot be met, and the method is difficult to be used in practical application.
The difficulty of current 3D object detection methods lies in the depth prediction of the 3D detection frame. The labels for 3D target detection only provide the depth of the center point or corner points of the target frame, which makes the network difficult to train and prevents it from producing denser and more accurate depth information. The main reason is that the 3D target detection methods in the related art mainly guide the learning of the 3D detection frame through subtasks such as depth estimation, pseudo point cloud generation and semantic segmentation prediction; however, these subtasks require a large number of accurate depth labels, are difficult to use in practical applications, and their accuracy limits the performance upper bound of 3D target detection, making them unreliable for 3D target detection.
Based on the above research, the present disclosure provides a method and an apparatus for target detection, an electronic device, and a storage medium, so as to improve the precision of 3D target detection.
To facilitate understanding of the present embodiment, first, a method for object detection disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the method for object detection provided in the embodiments of the present disclosure is generally a computer device with certain computing capability, and the computer device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle mounted device, a wearable device, or a server or other processing device. In some possible implementations, the method of object detection may be implemented by a processor calling computer readable instructions stored in a memory.
Referring to fig. 1, which is a flowchart of a method for object detection provided in the embodiment of the present disclosure, the method includes steps S101 to S103, where:
S101: performing feature extraction on the image to be detected to obtain a feature map of the image to be detected;
S102: generating a depth map corresponding to a projection area of a target object in the image to be detected projected onto the ground based on the feature map;
S103: determining three-dimensional detection information of the target object based on the depth map and the feature map.
In order to facilitate understanding of the method for detecting the target provided by the embodiments of the present disclosure, an application scenario of the method is first described in detail below. The target detection method can be mainly applied to the field of computer vision, for example, can be applied to scenes such as vehicle detection and unmanned aerial vehicle detection in unmanned driving. In view of the wide application of unmanned driving, vehicle detection will be exemplified in the following.
The 3D object detection technique in the related art mainly relies on some external subtasks, which are responsible for performing tasks such as 2D object detection, depth map estimation, and the like. Due to the fact that the subtasks are trained independently, precision loss exists, and the final 3D detection precision is not high.
In order to solve the above problem, the embodiments of the present disclosure provide a scheme for performing three-dimensional detection by combining a local depth map and a feature map, and the detection precision is high.
The image to be detected in the embodiment of the present disclosure may be an image acquired in a target scene, and the images acquired in different application scenes are different. Taking the unmanned driving as an example, the image to be detected may be an image acquired by a camera device mounted on an unmanned automobile during the vehicle traveling process, the image may include all target objects in a shooting view of the camera device, where the target objects may be vehicles ahead or pedestrians ahead, and the image is not limited specifically herein.
Before three-dimensional detection is performed, the embodiment of the disclosure can perform extraction of a feature map by using various feature extraction methods for an image to be detected. For example, a feature map may be extracted from the image to be detected through image processing, and then, for example, a trained feature extraction network may be used to extract the feature map.
In consideration of the fact that a feature extraction network can mine deeper image features, the embodiment of the disclosure may adopt a feature extraction network to extract the feature map. The feature extraction network may be a Convolutional Neural Network (CNN). In a specific application, the CNN model may be implemented by using a convolution block (conv block) including a convolutional layer, a batch normalization layer and a rectified linear unit (ReLU) layer, a dense block including a plurality of convolution blocks and a plurality of skip connections, and a transition block including a convolution block and an average pooling layer. The specific composition of the convolution block, the dense block and the transition block, for example how many convolutional layers and average pooling layers they contain, may be determined according to the specific application scenario, and is not limited herein.
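For illustration only, the following is a minimal PyTorch sketch of a backbone built from such convolution blocks, dense blocks and transition blocks; the layer counts and channel sizes are assumptions chosen for readability, not the configuration used in the embodiments.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Sequential):
    """Convolution block: convolutional layer + batch normalization + ReLU."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

class DenseBlock(nn.Module):
    """Dense block: several conv blocks whose outputs are concatenated (skip connections)."""
    def __init__(self, in_ch, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            ConvBlock(in_ch + i * growth, growth) for i in range(n_layers)
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class TransitionBlock(nn.Sequential):
    """Transition block: conv block followed by average pooling for downsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__(ConvBlock(in_ch, out_ch, k=1), nn.AvgPool2d(2))

class Backbone(nn.Module):
    """Feature extraction network assembled from the blocks above (illustrative sizes)."""
    def __init__(self):
        super().__init__()
        self.stem = ConvBlock(3, 64)
        self.dense1 = DenseBlock(64)                 # outputs 64 + 4 * 32 channels
        self.trans1 = TransitionBlock(64 + 4 * 32, 128)
        self.dense2 = DenseBlock(128)

    def forward(self, image):
        return self.dense2(self.trans1(self.dense1(self.stem(image))))
```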
For three-dimensional detection, the embodiment of the present disclosure may generate a local depth map based on the extracted feature map, where the local depth map corresponds to a projection area of a target object in an image to be detected, which is projected to the ground, and points to local ground depth information associated with the target object. The local ground has a binding relationship with the target object to a certain extent, so that the target object can be more accurately detected by combining the extracted feature map.
The local depth map may be determined by using a trained depth map generation network. The depth map generation network is trained by the corresponding relation between the features and the depths of corresponding pixel points in the image sample and the labeled depth map, so that the depth map corresponding to the ground projection area of the pointed target object can be output under the condition that the extracted feature map is input into the trained depth map generation network.
In practical application, the feature map and the depth map can be cropped in combination with the ROI-align mode, so as to implement three-dimensional detection on the target object according to the depth map and the feature map which are obtained by cropping and correspond to the target object.
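As an illustration of this cropping step, the sketch below uses the ROI-Align operator from torchvision to cut out target-specific patches from the feature map and the depth map; the box format, output size and stride are assumptions.

```python
import torch
from torchvision.ops import roi_align

def crop_local_maps(feature_map, depth_map, boxes_2d, out_size=7, stride=4):
    """Crop target-specific patches from the feature map and the depth map
    with ROI-Align, using 2D detection boxes (x1, y1, x2, y2) in image
    coordinates; `stride` is the assumed downsampling factor of the maps."""
    rois = [boxes_2d]  # single image in this sketch
    feat_crops = roi_align(feature_map, rois, output_size=out_size,
                           spatial_scale=1.0 / stride, aligned=True)
    depth_crops = roi_align(depth_map, rois, output_size=out_size,
                            spatial_scale=1.0 / stride, aligned=True)
    return feat_crops, depth_crops
```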
The three-dimensional detection in the embodiment of the present disclosure may be residual prediction based on a three-dimensional prior frame, which mainly considers that in residual prediction, subsequent three-dimensional detection may be guided based on information of an original three-dimensional prior frame, for example, the original three-dimensional prior frame may be used as an initial position, and a search of the three-dimensional detection frame is performed near the initial position, which significantly improves detection efficiency compared with direct three-dimensional detection in a case that accuracy of the three-dimensional prior frame is relatively high.
The three-dimensional prior frame can be determined based on two-dimensional detection information, so that three-dimensional detection can be realized based on the three-dimensional prior frame information, the depth map and the feature map.
The two-dimensional detection information of the target object can be determined according to the following steps:
step one, determining the offset of a two-dimensional detection frame according to the feature map;
and step two, determining two-dimensional detection information of the target object based on preset two-dimensional prior frame information and two-dimensional detection frame offset.
Here, the two-dimensional detection information may be determined based on an operation result between the two-dimensional detection frame offset and preset two-dimensional prior frame information.
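A minimal sketch of this decoding step is given below; the anchor-box-style offset parameterization (relative center shifts and log-scale size residuals) is a common convention assumed here for illustration, not necessarily the exact operation used in the embodiments.

```python
import torch

def decode_2d_box(prior_box, offset):
    """Decode a 2D detection box from a preset 2D prior box and a predicted offset.
    Boxes are (cx, cy, w, h); the parameterization is an assumed convention."""
    cx = prior_box[..., 0] + offset[..., 0] * prior_box[..., 2]
    cy = prior_box[..., 1] + offset[..., 1] * prior_box[..., 3]
    w = prior_box[..., 2] * torch.exp(offset[..., 2])
    h = prior_box[..., 3] * torch.exp(offset[..., 3])
    return torch.stack([cx, cy, w, h], dim=-1)
```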
The two-dimensional detection information in the embodiment of the present disclosure may be obtained by performing two-dimensional detection on the feature map with the trained first target detection network. The first target detection network may be trained on the correspondence between the feature map of an image sample and two-dimensional annotation information, or on the correspondence between the feature map of an image sample and an offset (the difference between the corresponding two-dimensional annotation frame and the two-dimensional prior frame). The former correspondence can be used to directly determine the two-dimensional detection information of the target object in the image to be detected, while the latter can be used to first determine the offset and then sum the offset and the two-dimensional prior frame to obtain the two-dimensional detection information of the target object.
Regardless of which correspondence is adopted, the determined two-dimensional detection information may include the position information of the two-dimensional detection frame (x_2d, y_2d, w_2d, h_2d), the center point position information (x_p, y_p), the orientation angle (α_3d) and the class information (cls) to which the target object belongs, and may further include other information related to two-dimensional detection, which is not limited herein.
In view of the superior characteristics of residual prediction, the first target detection network here may be implemented as two-dimensional residual prediction. In practical application, the first target detection network can first perform dimension reduction through a convolutional layer and a linear rectification layer, and then perform residual prediction of the two-dimensional detection frame through several separate convolutional layers, so that the prediction accuracy is high.
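The following is an illustrative sketch of such a first target detection network head; the channel widths, the number of categories and the exact set of prediction branches are assumptions.

```python
import torch.nn as nn

class Detection2DHead(nn.Module):
    """First target detection network (2D residual prediction): a conv + ReLU for
    dimension reduction, followed by separate conv layers predicting the 2D box
    offset, the projected center point, the orientation angle and class scores."""
    def __init__(self, in_ch=256, mid_ch=128, num_classes=3):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 3, padding=1),
                                    nn.ReLU(inplace=True))
        self.box_offset = nn.Conv2d(mid_ch, 4, 1)     # (dx, dy, dw, dh)
        self.center = nn.Conv2d(mid_ch, 2, 1)         # (x_p, y_p)
        self.orientation = nn.Conv2d(mid_ch, 1, 1)    # alpha_3d
        self.cls = nn.Conv2d(mid_ch, num_classes, 1)  # category scores

    def forward(self, feature_map):
        x = self.reduce(feature_map)
        return (self.box_offset(x), self.center(x),
                self.orientation(x), self.cls(x))
```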
In the embodiment of the present disclosure, three-dimensional prior frame information may be determined based on the two-dimensional detection information determined by the first target detection network, which may specifically be implemented by the following steps:
step one, based on the category information of the target object, determining the clustering information of each sub-category included in the category of the target object;
and step two, determining three-dimensional prior frame information aiming at the target object according to the clustering information of each sub-category and the two-dimensional detection frame information of the target object.
Here, the three-dimensional prior frame information may be determined by combining the clustering information of each sub-category included in the category to which the target object belongs with the two-dimensional detection frame information of the target object. This mainly considers that, for target objects belonging to one category, the three-dimensional detection results corresponding to different sub-categories differ to some extent; for example, within the vehicle category, the size of the three-dimensional detection frame for the car sub-category differs greatly from that for the truck sub-category. In order to account for the possibility that each sub-category is the one being predicted, the sub-categories may be divided in advance, and the corresponding three-dimensional prior frame information may be determined based on the clustering information of each divided sub-category.
In the embodiment of the present disclosure, under the condition that the category information to which the target object belongs is determined, the clustering result corresponding to that category information may be used. Here, still taking a vehicle as the target object, vehicle image samples covering various sub-categories and annotated with information such as the length, width and height of the vehicle may be collected in advance. These vehicle image samples can be clustered based on the height values, so that samples falling in the same height range are assigned to one sub-category, and the clustering information of each sub-category can then be determined. In practical application, the clustering process can be realized by clustering methods such as K-means, which are not described further herein.
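A possible clustering step is sketched below with scikit-learn's K-means; the number of sub-categories and the annotated dimensions used are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_subcategories(sample_dims, n_subcategories=3):
    """Cluster annotated vehicle dimensions into sub-categories by height.
    `sample_dims` is an (N, 3) array of (length, width, height) taken from
    the image samples; the number of sub-categories is an illustrative choice."""
    heights = sample_dims[:, 2:3]
    kmeans = KMeans(n_clusters=n_subcategories, n_init=10).fit(heights)
    cluster_info = []
    for k in range(n_subcategories):
        members = sample_dims[kmeans.labels_ == k]
        cluster_info.append({
            "length": members[:, 0].mean(),  # mean length of the sub-category
            "width": members[:, 1].mean(),   # mean width of the sub-category
            "height": members[:, 2].mean(),  # cluster height value
        })
    return cluster_info
```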
The process of determining the three-dimensional prior frame information by combining the clustering information and the two-dimensional detection frame information of the target object can comprise the following steps:
step one, aiming at each sub-category in each sub-category, determining a depth value corresponding to the sub-category based on a cluster height value included by cluster information of the sub-category and a width value included by two-dimensional detection frame information of a target object;
and step two, determining three-dimensional prior frame information aiming at the target object based on the clustering information of the sub-category and the depth value corresponding to the sub-category.
Here, each sub-category may correspond to one piece of three-dimensional prior frame information. Information such as the size of the three-dimensional prior frame may be determined from the clustering information of the corresponding sub-category, while the depth information may be determined from the cluster height value and the width value included in the two-dimensional detection frame information; in a specific application, the depth may be obtained by taking the ratio of the cluster height value to that width value and multiplying it by the focal length of the image pickup device.
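This amounts to a single ratio-and-scale rule; the sketch below states it directly. The pairing of the cluster height value with the 2D box width follows the wording above and is otherwise an assumption.

```python
def prior_depth(cluster_height, box_width_2d, focal_length):
    """Depth value of a sub-category's 3D prior frame, taken as the ratio of the
    cluster height value to the 2D detection frame width, multiplied by the
    camera focal length (a sketch of the rule stated above)."""
    return focal_length * cluster_height / box_width_2d
```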
Under the condition of determining three-dimensional prior frame information, the embodiment of the disclosure may determine three-dimensional detection frame information by combining the information, a depth map and a feature map, and specifically includes the following steps:
step one, determining the offset of a three-dimensional detection frame according to a depth map and a feature map;
and secondly, determining three-dimensional detection frame information of the target object based on the three-dimensional prior frame information and the three-dimensional detection frame offset.
Here, the second target detection network may be used to implement three-dimensional detection, obtain a three-dimensional detection frame offset output by the second target detection network, and then determine the three-dimensional detection frame information of the target object based on the three-dimensional prior frame information and the three-dimensional detection frame offset.
The three-dimensional detection frame information may mainly include the shape information (w_3d, h_3d, l_3d) and the depth information (z_3d) of the detection frame.
It should be noted that, compared to two-dimensional prediction, three-dimensional prediction may determine more dimensional information of the target object, for example, each sub-category included in the category information of the target object may also be determined here, for example, it may be determined whether the target object belonging to the category of the vehicle is a car or a truck.
Considering that there may be a plurality of three-dimensional prior frames in the embodiment of the present disclosure, each three-dimensional prior frame may correspondingly predict a three-dimensional detection frame offset. Considering further that different three-dimensional prior frames correspond to different sub-categories, and that the prediction probabilities of different sub-categories also differ, the three-dimensional prior frame information corresponding to each sub-category may be given a weight based on the prediction probability of that sub-category, and the three-dimensional detection frame information of the target object may then be determined based on each piece of three-dimensional prior frame information, the weight corresponding to it, and the three-dimensional detection frame offset.
Here, the sub-category with the higher prediction probability may be given a higher weight to highlight the role of the corresponding three-dimensional prior frame in the subsequent three-dimensional detection, and similarly, the sub-category with the lower prediction probability may be given a lower weight to weaken the role of the corresponding three-dimensional prior frame in the subsequent three-dimensional detection, so that the determined three-dimensional detection frame information is more accurate.
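A sketch of this weighted fusion is given below; additive residual decoding and softmax weighting of the sub-category scores are assumptions used for illustration.

```python
import torch

def decode_3d_box(priors, offsets, subcategory_logits):
    """Fuse several 3D prior frames into one detection frame.
    `priors` and `offsets` are (K, D) tensors holding, per sub-category, the
    prior parameters (e.g. w, h, l, depth) and the predicted residuals;
    `subcategory_logits` is a (K,) tensor of sub-category scores."""
    weights = torch.softmax(subcategory_logits, dim=0)  # higher probability -> higher weight
    boxes = priors + offsets                            # per-sub-category decoded boxes
    return (weights.unsqueeze(-1) * boxes).sum(dim=0)   # weighted fusion of the candidates
```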
In order to further improve the precision of three-dimensional detection, the depth map and the feature map may be firstly cut, and then three-dimensional detection may be performed, which may specifically be implemented by the following steps:
the method comprises the steps that firstly, based on a position range included by information of a two-dimensional detection frame where a target object is located, a depth map and a feature map which are matched with the position range are extracted from the depth map and the feature map respectively;
and step two, determining the offset of the three-dimensional detection frame based on the depth map and the feature map which are matched with the position range.
Here, the clipping of the depth map and the feature map corresponding to the position range may be realized based on the position range included in the two-dimensional detection frame information, that is, the local depth map and the local feature map pointing to the target object may be obtained. And determining the corresponding three-dimensional detection frame offset based on the local depth map and the local feature map, wherein the three-dimensional detection frame offset can also be determined by using the second target detection network.
In the process of predicting the offset of the three-dimensional detection frame, the influence of other irrelevant features can be avoided due to the adopted local depth map and the local feature map, so that the prediction accuracy is higher.
In order to implement the method for target detection provided by the embodiment of the present disclosure, training of the first target detection network and the second target detection network is required. Corresponding supervision signals (namely prior frame information) can be set aiming at different target detection networks, so that corresponding loss function values can be determined, network training can be guided by back propagation based on the loss function values, and specific limitation is not required.
In consideration of the key role of the depth map corresponding to the projection area of the target object on the ground in the target detection process, the embodiment of the present disclosure also sets a corresponding supervision signal (i.e., a labeled depth map) for the depth map, and may be specifically implemented by a depth map generation network. The training process of the depth map generation network specifically comprises the following steps:
acquiring an image sample and an annotation depth map determined based on three-dimensional annotation frame information of a target object annotated in the image sample;
secondly, extracting the features of the image sample to obtain a feature map of the image sample;
inputting the feature map of the image sample into a depth map generation network to be trained to obtain a depth map output by the depth map generation network, and determining a loss function value based on the similarity between the output depth map and the labeled depth map;
and step four, under the condition that the loss function value is larger than the preset threshold value, adjusting the network parameter value of the depth map generation network, and inputting the feature map of the image sample into the adjusted depth map generation network until the loss function value is smaller than or equal to the preset threshold value.
The image sample obtained here is similar to the image to be detected in the obtaining manner, and is not described herein again. In addition, the extraction of the feature map of the image sample can refer to the above extraction process of the feature map of the image to be detected, and is not described herein again.
The embodiment of the disclosure can determine the loss function value based on the similarity between the depth map output by the depth map generation network and the labeled depth map, and adjust the network parameter value of the depth map generation network according to the loss function value, so that the network input result and the labeled result tend to be consistent or closer.
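The training procedure of steps one to four above can be sketched as follows; the use of an L1 loss over labeled pixels as the similarity measure, and the simple threshold-based stopping rule, are assumptions.

```python
import torch
import torch.nn.functional as F

def train_depth_net(depth_net, backbone, samples, labeled_depths,
                    loss_threshold=0.05, lr=1e-3, max_epochs=100):
    """Training sketch for the depth map generation network: the loss measures
    how far the predicted depth map is from the labeled depth map, and the
    network parameters are adjusted until the loss reaches the preset threshold."""
    optimizer = torch.optim.Adam(depth_net.parameters(), lr=lr)
    for _ in range(max_epochs):
        last_loss = None
        for image, labeled_depth in zip(samples, labeled_depths):
            with torch.no_grad():
                feature_map = backbone(image)        # feature map of the image sample
            pred_depth = depth_net(feature_map)      # predicted local ground depth map
            mask = labeled_depth > 0                 # only supervise labeled pixels
            loss = F.l1_loss(pred_depth[mask], labeled_depth[mask])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            last_loss = loss.item()
        if last_loss is not None and last_loss <= loss_threshold:
            break
    return depth_net
```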
The labeled depth map can be obtained according to the following steps:
step one, projecting the information of the three-dimensional marking frame marked by the target object to the ground based on the corresponding relation between the three-dimensional coordinate system of the three-dimensional marking frame and the ground coordinate system of the bottom center point of the marking frame to obtain the projection area of the target object on the ground and the extension area of the projection area;
secondly, determining the depth value of each three-dimensional marking point on the extension area based on the position coordinate and the depth value of the bottom center point of the marking frame, which are included in the three-dimensional marking frame information;
thirdly, projecting each three-dimensional annotation point on the extension area under the camera coordinate system to a pixel plane under the image coordinate system based on the corresponding relation between the camera coordinate system and the image coordinate system to obtain a projection point in the pixel plane;
and fourthly, obtaining a marked depth map based on the depth values of the three-dimensional marked points on the extension area and the projection points in the pixel plane.
In the concrete implementation, the depth value of each three-dimensional labeling point on the extension area is determined based on the position coordinate and the depth value of the bottom center point of the labeling frame, which are included in the three-dimensional labeling frame information, and the method comprises the following steps:
respectively determining the depth value and the position coordinate of the bottom surface central point of the marking frame as the depth value and the position coordinate of the central point of the extension area;
and under the condition of determining the position coordinates of the central point of the extension area, determining the depth value of each three-dimensional labeling point in the extension area by taking the depth value of the central point of the extension area as an initial depth value and a preset depth interval.
The embodiment of the disclosure provides a method for generating a local ground depth label. Here, the depth information of the surrounding ground (corresponding to the extended area) may be obtained by using the position of the center point of the bottom surface of the labeling frame (the center point falls on the ground) included in the three-dimensional labeling frame information of the target object.
Here, the center point of the bottom surface of the labeling frame is at the same height as the surrounding ground, and a large number of three-dimensional labeling points can be generated in an extension area around the center point, where the three-dimensional labeling points include the center point, and each three-dimensional labeling point on the extension area is determined by using the depth value of the center point of the extension area as the starting depth value and at the preset depth interval.
In this way, the three-dimensional annotation point can be projected onto the pixel plane by using the projection relationship, the corresponding relationship between the depth value of the three-dimensional annotation point and the projected point of the three-dimensional annotation point is recorded, and the average depth value of at least one corresponding three-dimensional annotation point is obtained for each projected point obtained by projection, so that the annotation depth map can be obtained.
The projection relationship can be realized by the following formula:
s · (x_p, y_p, 1)^T = P_rect · R_rect · (x_3d, y_3d, z_3d, 1)^T
where s is the homogeneous scale factor of the projection, (x_3d, y_3d, z_3d) are the camera coordinates of a three-dimensional annotation point, (x_p, y_p) is the projected point of the three-dimensional annotation point, and R_rect and P_rect are the rotation correction matrix and the projection matrix, respectively.
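Putting the projection formula and the depth averaging together, a sketch of generating the labeled depth map is given below; the grid extent and depth interval of the extension area, and the 3x4 / 4x4 shapes assumed for P_rect and R_rect, are illustrative assumptions.

```python
import numpy as np

def build_labeled_depth_map(bottom_center, P_rect, R_rect, img_h, img_w,
                            extent=3.0, step=0.1):
    """Generate a labeled local-ground depth map from the bottom-face center
    point of a 3D annotation frame: sample a grid of 3D annotation points on
    the ground plane of the extension area around the center (extent and step
    are illustrative), project them with the formula above, and average the
    depth values of the 3D points that fall onto the same pixel."""
    cx, cy, cz = bottom_center                       # camera coordinates; cy is the ground height
    xs = np.arange(cx - extent, cx + extent, step)
    zs = np.arange(cz - extent, cz + extent, step)   # depth varies at a preset interval
    gx, gz = np.meshgrid(xs, zs)
    pts = np.stack([gx.ravel(), np.full(gx.size, cy), gz.ravel(),
                    np.ones(gx.size)])               # (4, N) homogeneous 3D points
    proj = P_rect @ R_rect @ pts                     # project to the pixel plane
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    depth_sum = np.zeros((img_h, img_w))
    count = np.zeros((img_h, img_w))
    valid = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    np.add.at(depth_sum, (v[valid], u[valid]), pts[2, valid])  # accumulate depth values
    np.add.at(count, (v[valid], u[valid]), 1)
    depth_map = np.zeros((img_h, img_w))
    mask = count > 0
    depth_map[mask] = depth_sum[mask] / count[mask]  # average depth per projected point
    return depth_map
```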
Therefore, the feature map of the image to be detected is input into the trained depth map generation network, so that the depth map corresponding to the projection area of the target object in the image to be detected, which is projected to the ground, can be determined, and then the three-dimensional prediction of the target object can be realized by combining the feature map and the three-dimensional prior frame information.
To facilitate further understanding of the above three-dimensional prediction process, a detailed description will be given below with reference to fig. 2.
As shown in fig. 2, for an image to be detected including a target object of a vehicle, a feature map of the image to be detected may be extracted through a feature extraction network. And then, on one hand, two-dimensional detection is carried out through the first target detection network, and the depth map generation network is utilized to generate a depth map corresponding to a projection area of the target object projected to the ground in the image to be detected aiming at the two-dimensional detection information of the target object.
In the embodiment of the present disclosure, based on the two-dimensional detection information, three-dimensional prior frame information for the target object may be determined. Fig. 2 shows an exemplary illustration of three-dimensional prior frame information determined by the corresponding three subcategories.
Here, before inputting the depth map and the feature map into the trained second object detection network, cropping in the ROI-Align mode may be performed based on the two-dimensional detection information, and the cropped depth map and feature map may then be input into the second object detection network, so as to obtain the corresponding three-dimensional detection frame offsets shown in fig. 2, such as Δ(w, h, l)_3d, Δz_3d, and so on.
And determining the three-dimensional detection information by combining the three-dimensional detection frame offset and the three-dimensional prior frame information. In practical application, the three-dimensional detection information can be presented on an image to be detected.
It will be understood by those skilled in the art that, in the method of the present disclosure, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation; the specific order of execution of the steps should be determined by their function and possible inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a target detection apparatus corresponding to the target detection method, and since the principle of the apparatus in the embodiment of the present disclosure for solving the problem is similar to the target detection method in the embodiment of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 3, a schematic diagram of an apparatus for target detection provided by an embodiment of the present disclosure is shown, where the apparatus includes: an extraction module 301, a generation module 302 and a first detection module 303; wherein:
the extraction module 301 is configured to perform feature extraction on an image to be detected to obtain a feature map of the image to be detected;
a generating module 302, configured to generate a depth map corresponding to a projection area, where a target object in the image to be detected is projected onto the ground, based on the feature map;
and a first detection module 303, configured to determine three-dimensional detection information of the target object based on the depth map and the feature map.
By adopting the target detection device, not only can the features of the image to be detected be extracted, but also a depth map corresponding to a projection area of the target object projected to the ground in the image to be detected can be generated based on the extracted feature map, and then the three-dimensional detection information of the target object can be determined based on the depth map and the feature map. The generated depth map is directed to a target object in the image to be detected, and corresponds to a projection area of the target object projected to the ground, and the projection area is associated with the target object to a certain extent, so that the depth map corresponding to the local ground can be used as a guide when the feature map of the target object on the local ground is used for three-dimensional detection, and the detection precision is improved.
In a possible embodiment, the above apparatus further comprises:
the second detection module 304 is configured to perform two-dimensional detection on the feature map after obtaining the feature map of the image to be detected, so as to obtain two-dimensional detection information for the target object;
a first detection module 303, configured to determine three-dimensional detection information of the target object based on the depth map and the feature map according to the following steps:
determining three-dimensional prior frame information for the target object based on the two-dimensional detection information;
and determining three-dimensional detection frame information of the target object based on the three-dimensional prior frame information, the depth map and the feature map.
In one possible implementation manner, the two-dimensional detection information includes two-dimensional detection frame information where the target object is located and category information to which the target object belongs; a first detection module 303, configured to determine three-dimensional prior frame information for the target object based on the two-dimensional detection information according to the following steps:
based on the category information of the target object, determining the clustering information of each sub-category included in the category of the target object;
and determining three-dimensional prior frame information aiming at the target object according to the clustering information of each sub-category and the two-dimensional detection frame information of the target object.
In a possible implementation manner, the first detection module 303 is configured to determine three-dimensional prior frame information for the target object according to the cluster information of each sub-category and the two-dimensional detection frame information of the target object according to the following steps:
for each sub-category in the sub-categories, determining a depth value corresponding to the sub-category based on a cluster height value included by the cluster information of the sub-category and a width value included by the two-dimensional detection frame information of the target object;
based on the clustering information of the sub-category and the depth value corresponding to the sub-category, determining a piece of three-dimensional prior frame information for the target object.
In a possible implementation, the first detection module 303 is configured to determine three-dimensional detection frame information of the target object based on the three-dimensional prior frame information, the depth map, and the feature map according to the following steps:
determining the offset of the three-dimensional detection frame according to the depth map and the feature map;
and determining the three-dimensional detection frame information of the target object based on the three-dimensional prior frame information and the three-dimensional detection frame offset.
In a possible implementation, the first detection module 303 is configured to determine the three-dimensional detection frame offset according to the depth map and the feature map according to the following steps:
extracting, from the depth map and the feature map respectively, a local depth map and a local feature map that match the position range included in the two-dimensional detection frame information of the target object;
and determining the three-dimensional detection frame offset based on the local depth map and the local feature map that match the position range.
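A hedged sketch of this step follows: the regions of the depth map and the feature map that fall inside the two-dimensional detection frame's position range are cropped (RoIAlign is assumed here as the cropping operation) and a small head regresses the three-dimensional detection frame offset from the cropped tensors; the head architecture and crop size are illustrative.

```python
# Hedged sketch: crop depth map and feature map to the 2D frame's position range,
# then regress a 3D detection frame offset. Box coordinates are assumed to be
# given in feature-map coordinates (spatial_scale = 1.0).
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class OffsetHead(nn.Module):
    def __init__(self, feat_channels=128, roi_size=7, num_offsets=7):
        super().__init__()
        self.roi_size = roi_size
        self.fc = nn.Linear((feat_channels + 1) * roi_size * roi_size, num_offsets)

    def forward(self, feat, depth, boxes_xyxy):
        # prepend batch index 0 to form RoIAlign rois of shape (K, 5)
        rois = torch.cat([torch.zeros(len(boxes_xyxy), 1), boxes_xyxy], dim=1)
        f = roi_align(feat, rois, self.roi_size)   # feature map matched to the position range
        d = roi_align(depth, rois, self.roi_size)  # depth map matched to the position range
        x = torch.cat([f, d], dim=1).flatten(1)
        return self.fc(x)                          # per-box 3D detection frame offset

feat = torch.randn(1, 128, 96, 320)
depth = torch.randn(1, 1, 96, 320)
offsets = OffsetHead()(feat, depth, torch.tensor([[40.0, 20.0, 80.0, 50.0]]))
```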
In one possible implementation, there are a plurality of pieces of three-dimensional prior frame information; the first detection module 303 is configured to determine the three-dimensional detection frame information of the target object based on the three-dimensional prior frame information and the three-dimensional detection frame offset according to the following steps:
determining a weight corresponding to each three-dimensional prior frame information;
and determining the three-dimensional detection frame information of the target object based on the three-dimensional prior frame information, the weight corresponding to each three-dimensional prior frame information and the three-dimensional detection frame offset.
In a possible implementation, the first detection module 303 is configured to determine the weight corresponding to each piece of three-dimensional prior frame information according to the following steps:
determining the prediction probability of each sub-category included in the category information of the target object according to the depth map and the feature map;
and determining the weight of the three-dimensional prior frame information corresponding to each subcategory based on the prediction probability of each subcategory.
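A possible realization of this weighting, sketched below under assumptions, is to turn predicted sub-category logits into weights with a softmax, blend the three-dimensional prior frames with those weights, and add the regressed offset; whether the priors are blended or the highest-weight prior is selected is not fixed by the text above.

```python
# Hedged sketch: combine several 3D prior frames using sub-category prediction
# probabilities as weights, then add the regressed offset.
import numpy as np

def decode_3d_frame(prior_params, subcat_logits, offset):
    """prior_params: (K, D) prior parameters per sub-category; offset: (D,) regressed offset."""
    logits = np.asarray(subcat_logits, dtype=np.float64)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                          # prediction probability per sub-category
    blended_prior = (weights[:, None] * prior_params).sum(axis=0)
    return blended_prior + np.asarray(offset)         # 3D detection frame information

priors = np.array([[10.0, 1.5, 1.7, 4.0],             # e.g. (depth, h, w, l) per sub-category
                   [12.0, 1.8, 2.0, 5.0]])
frame3d = decode_3d_frame(priors, subcat_logits=[2.0, 0.5], offset=[0.3, 0.0, 0.1, -0.2])
```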
In a possible implementation manner, the second detection module 304 is configured to detect the feature map according to the following steps to obtain two-dimensional detection information for the target object:
determining the offset of the two-dimensional detection frame according to the feature map;
and determining two-dimensional detection information of the target object based on preset two-dimensional prior frame information and the offset of the two-dimensional detection frame.
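The text above leaves the offset parameterization open; the sketch below assumes the common center/size delta encoding used by anchor-based two-dimensional detectors, which is an illustrative choice rather than the disclosure's exact scheme.

```python
# Hedged sketch: decode a 2D detection frame from a preset 2D prior (anchor) box
# and a regressed offset, using the center/size delta parameterization.
import math

def decode_2d_frame(anchor_xywh, offset):
    ax, ay, aw, ah = anchor_xywh                 # preset 2D prior frame (cx, cy, w, h)
    dx, dy, dw, dh = offset                      # regressed 2D detection frame offset
    cx, cy = ax + dx * aw, ay + dy * ah          # shift the center proportionally to anchor size
    w, h = aw * math.exp(dw), ah * math.exp(dh)  # scale the size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

frame2d = decode_2d_frame((100.0, 80.0, 40.0, 30.0), (0.1, -0.05, 0.2, 0.0))
```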
In one possible embodiment, the depth map is determined by a trained depth map generation network; the depth map generation network is obtained by training with image samples and annotated depth maps determined based on the three-dimensional annotation frame information of the target object annotated in the image samples.
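Since such an annotated depth map only carries values in the projected ground regions, one plausible training objective, assumed here for illustration only, is a regression loss restricted to the annotated pixels:

```python
# Hedged sketch: L1 depth regression restricted to pixels that received a value
# in the annotated depth map (unlabeled pixels are assumed to hold 0).
import torch

def masked_depth_loss(pred_depth, annotated_depth):
    """pred_depth, annotated_depth: (B, 1, H, W) tensors."""
    valid = annotated_depth > 0
    if valid.sum() == 0:
        return pred_depth.sum() * 0.0   # no supervised pixels in this batch
    return (pred_depth[valid] - annotated_depth[valid]).abs().mean()

loss = masked_depth_loss(torch.rand(1, 1, 96, 320) * 50, torch.rand(1, 1, 96, 320) * 50)
```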
In one possible implementation, the three-dimensional annotation frame information of the target object includes the position coordinates and the depth value of the bottom-face center point of the annotation frame; the generating module 302 is configured to obtain the annotated depth map according to the following steps:
projecting the three-dimensional annotation frame information annotated for the target object to the ground based on the correspondence between the three-dimensional coordinate system in which the three-dimensional annotation frame is located and the ground coordinate system at the bottom-face center point of the annotation frame, to obtain a projection area of the target object on the ground and an extension area in which the projection area is located;
determining the depth value of each three-dimensional annotation point in the extension area based on the position coordinates and the depth value of the bottom-face center point of the annotation frame included in the three-dimensional annotation frame information;
projecting, based on the correspondence between the camera coordinate system and the image coordinate system, each three-dimensional annotation point in the extension area in the camera coordinate system onto the pixel plane in the image coordinate system, to obtain projection points in the pixel plane;
and obtaining the annotated depth map based on the depth values of the three-dimensional annotation points in the extension area and the projection points in the pixel plane.
In a possible implementation manner, the generating module 302 is configured to determine the depth value of each three-dimensional annotation point in the extension area based on the position coordinates and the depth value of the bottom-face center point of the annotation frame included in the three-dimensional annotation frame information according to the following steps:
determining the depth value and the position coordinates of the bottom-face center point of the annotation frame as the depth value and the position coordinates of the center point of the extension area, respectively;
and, with the position coordinates of the center point of the extension area determined, determining the depth value of each three-dimensional annotation point in the extension area at a preset depth interval, taking the depth value of the center point of the extension area as the initial depth value.
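The following sketch walks through these steps for a single object under stated assumptions: the bottom-face center of the annotation frame defines the center of a ground-plane extension area, three-dimensional points are laid out over that area at a preset interval, projected into the pixel plane with camera intrinsics, and their depth values written into an initially empty map. The grid extent, the interval, and the intrinsic parameters are illustrative assumptions.

```python
# Hedged sketch: build an annotated depth map for one object from the annotation
# frame's bottom-face center point and the camera intrinsics.
import numpy as np

def annotated_depth_map(bottom_center_cam, K, img_hw, half_extent=3.0, step=0.2):
    """bottom_center_cam: (x, y, z) of the bottom-face center in the camera frame
    (y points down toward the ground, z is depth); K: 3x3 camera intrinsics."""
    h, w = img_hw
    depth_map = np.zeros((h, w), dtype=np.float32)
    cx, cy, cz = bottom_center_cam
    offsets = np.arange(-half_extent, half_extent + step, step)
    # extension area: a ground-plane grid around the projection area, at the same height cy
    xs, zs = np.meshgrid(cx + offsets, cz + offsets)
    pts = np.stack([xs.ravel(), np.full(xs.size, cy), zs.ravel()], axis=1)
    uv = (K @ pts.T).T                        # project camera-frame points to the pixel plane
    uv = uv[:, :2] / uv[:, 2:3]
    for (u, v), z in zip(uv.astype(int), pts[:, 2]):
        if 0 <= v < h and 0 <= u < w and z > 0:
            depth_map[v, u] = z               # depth value of the 3D annotation point
    return depth_map

K = np.array([[721.5, 0, 609.6], [0, 721.5, 172.9], [0, 0, 1.0]])
dm = annotated_depth_map((1.0, 1.6, 15.0), K, img_hw=(375, 1242))
```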
For the processing flow of each module in the apparatus and the interaction flows between the modules, reference may be made to the related description in the above method embodiments; details are not repeated here.
An embodiment of the present disclosure further provides an electronic device. As shown in fig. 4, which is a schematic structural diagram of the electronic device provided in the embodiment of the present disclosure, the electronic device includes a processor 401, a memory 402, and a bus 403. The memory 402 stores machine-readable instructions executable by the processor 401 (for example, execution instructions corresponding to the extraction module 301, the generation module 302, the first detection module 303, and the like in the apparatus in fig. 3). When the electronic device runs, the processor 401 communicates with the memory 402 via the bus 403, and when the machine-readable instructions are executed by the processor 401, the following processes are performed:
extracting the characteristics of the image to be detected to obtain a characteristic diagram of the image to be detected;
generating a depth map corresponding to a projection area of a target object projected to the ground in the image to be detected based on the characteristic map;
and determining three-dimensional detection information of the target object based on the depth map and the feature map.
The embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the target detection method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product. The computer program product carries program code, and the instructions included in the program code may be used to execute the steps of the target detection method in the foregoing method embodiments; for details, reference may be made to the foregoing method embodiments, which are not repeated here.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the system and the apparatus described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a logical division, and there may be other divisions in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections of devices or units through some communication interfaces, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can, within the technical scope of the present disclosure, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions to some of the technical features thereof; such modifications, changes, or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present disclosure and shall be covered within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (15)

1. A method of target detection, the method comprising:
extracting the characteristics of an image to be detected to obtain a characteristic diagram of the image to be detected;
generating a depth map corresponding to a projection area of a target object in the image to be detected projected to the ground based on the feature map;
and determining three-dimensional detection information of the target object based on the depth map and the feature map.
2. The method according to claim 1, wherein after obtaining the feature map of the image to be detected, the method further comprises:
detecting the characteristic diagram to obtain two-dimensional detection information aiming at the target object;
the determining three-dimensional detection information of the target object based on the depth map and the feature map comprises:
determining three-dimensional prior frame information for the target object based on the two-dimensional detection information;
and determining three-dimensional detection frame information of the target object based on the three-dimensional prior frame information, the depth map and the feature map.
3. The method according to claim 2, wherein the two-dimensional detection information includes two-dimensional detection frame information indicating where the target object is located and category information indicating the category to which the target object belongs; the determining three-dimensional prior frame information for the target object based on the two-dimensional detection information comprises:
based on the category information of the target object, determining clustering information of each sub-category included in the category of the target object;
and determining three-dimensional prior frame information aiming at the target object according to the clustering information of each sub-category and the two-dimensional detection frame information of the target object.
4. The method according to claim 3, wherein the determining three-dimensional prior frame information for the target object according to the clustering information of the sub-categories and the two-dimensional detection frame information of the target object comprises:
for each sub-category in the sub-categories, determining a depth value corresponding to the sub-category based on a cluster height value included in cluster information of the sub-category and a width value included in two-dimensional detection frame information of the target object;
determining, based on the cluster information of the sub-category and the depth value corresponding to the sub-category, three-dimensional prior frame information for the target object.
5. The method according to claim 3 or 4, wherein the determining three-dimensional detection frame information of the target object based on the three-dimensional prior frame information, the depth map and the feature map comprises:
determining the offset of the three-dimensional detection frame according to the depth map and the feature map;
and determining the three-dimensional detection frame information of the target object based on the three-dimensional prior frame information and the three-dimensional detection frame offset.
6. The method of claim 5, wherein determining a three-dimensional detection frame offset from the depth map and the feature map comprises:
extracting a depth map and a feature map which are matched with the position range from the depth map and the feature map respectively based on the position range included in the two-dimensional detection frame information of the target object;
and determining the three-dimensional detection frame offset based on the depth map and the feature map which are matched with the position range.
7. The method according to claim 5 or 6, wherein there are a plurality of pieces of the three-dimensional prior frame information; the determining three-dimensional detection frame information of the target object based on the three-dimensional prior frame information and the three-dimensional detection frame offset includes:
determining a weight corresponding to each three-dimensional prior frame information;
and determining the three-dimensional detection frame information of the target object based on the three-dimensional prior frame information, the weight corresponding to each piece of the three-dimensional prior frame information and the three-dimensional detection frame offset.
8. The method of claim 7, further comprising:
determining the prediction probability of each sub-category included in the category information of the target object according to the depth map and the feature map;
the determining the weight corresponding to each three-dimensional prior frame information comprises:
and determining the weight of the three-dimensional prior frame information corresponding to each subcategory based on the prediction probability of each subcategory.
9. The method according to any one of claims 2 to 8, wherein the detecting the feature map to obtain two-dimensional detection information for the target object includes:
determining the offset of the two-dimensional detection frame according to the feature map;
and determining the two-dimensional detection information of the target object based on preset two-dimensional prior frame information and the offset of the two-dimensional detection frame.
10. The method of any one of claims 1-9, wherein the depth map is determined by a trained depth map generation network; the depth map generation network is obtained by training with an image sample and an annotated depth map determined based on three-dimensional annotation frame information of a target object annotated in the image sample.
11. The method according to claim 10, wherein the three-dimensional annotation frame information of the target object includes position coordinates and a depth value of a bottom-face center point of the annotation frame; the annotated depth map is obtained according to the following steps:
projecting the three-dimensional annotation frame information annotated for the target object to the ground based on the correspondence between the three-dimensional coordinate system in which the three-dimensional annotation frame is located and the ground coordinate system at the bottom-face center point of the annotation frame, to obtain a projection area of the target object on the ground and an extension area in which the projection area is located;
determining the depth value of each three-dimensional annotation point in the extension area based on the position coordinates and the depth value of the bottom-face center point of the annotation frame included in the three-dimensional annotation frame information;
projecting, based on the correspondence between a camera coordinate system and an image coordinate system, each three-dimensional annotation point in the extension area in the camera coordinate system onto a pixel plane in the image coordinate system, to obtain projection points in the pixel plane;
and obtaining the annotated depth map based on the depth values of the three-dimensional annotation points in the extension area and the projection points in the pixel plane.
12. The method according to claim 11, wherein the determining the depth value of each three-dimensional annotation point in the extension area based on the position coordinates and the depth value of the bottom-face center point of the annotation frame included in the three-dimensional annotation frame information comprises:
determining the depth value and the position coordinates of the bottom-face center point of the annotation frame as the depth value and the position coordinates of the center point of the extension area, respectively;
and, with the position coordinates of the center point of the extension area determined, determining the depth value of each three-dimensional annotation point in the extension area at a preset depth interval, taking the depth value of the center point of the extension area as an initial depth value.
13. An apparatus for target detection, the apparatus comprising:
the extraction module is used for extracting the characteristics of the image to be detected to obtain a characteristic diagram of the image to be detected;
the generating module is used for generating a depth map corresponding to a projection area of a target object in the image to be detected projected to the ground based on the feature map;
and the first detection module is used for determining the three-dimensional detection information of the target object based on the depth map and the feature map.
14. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions, when executed by the processor, performing the steps of the method of target detection according to any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when executed by a processor, performs the steps of the method of target detection as set forth in any one of claims 1 to 12.
CN202111164729.9A 2021-09-30 2021-09-30 Target detection method and device, electronic equipment and storage medium Pending CN114119991A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111164729.9A CN114119991A (en) 2021-09-30 2021-09-30 Target detection method and device, electronic equipment and storage medium
PCT/CN2022/090957 WO2023050810A1 (en) 2021-09-30 2022-05-05 Target detection method and apparatus, electronic device, storage medium, and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111164729.9A CN114119991A (en) 2021-09-30 2021-09-30 Target detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114119991A true CN114119991A (en) 2022-03-01

Family

ID=80441823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111164729.9A Pending CN114119991A (en) 2021-09-30 2021-09-30 Target detection method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114119991A (en)
WO (1) WO2023050810A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116681660B (en) * 2023-05-18 2024-04-19 中国长江三峡集团有限公司 Target object defect detection method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11010592B2 (en) * 2018-11-15 2021-05-18 Toyota Research Institute, Inc. System and method for lifting 3D representations from monocular images
CN111832338A (en) * 2019-04-16 2020-10-27 北京市商汤科技开发有限公司 Object detection method and device, electronic equipment and storage medium
CN111046767B (en) * 2019-12-04 2022-06-07 武汉大学 3D target detection method based on monocular image
CN112733672A (en) * 2020-12-31 2021-04-30 深圳一清创新科技有限公司 Monocular camera-based three-dimensional target detection method and device and computer equipment
CN114119991A (en) * 2021-09-30 2022-03-01 深圳市商汤科技有限公司 Target detection method and device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023050810A1 (en) * 2021-09-30 2023-04-06 上海商汤智能科技有限公司 Target detection method and apparatus, electronic device, storage medium, and computer program product
CN116189150A (en) * 2023-03-02 2023-05-30 吉咖智能机器人有限公司 Monocular 3D target detection method, device, equipment and medium based on fusion output
CN116189150B (en) * 2023-03-02 2024-05-17 吉咖智能机器人有限公司 Monocular 3D target detection method, device, equipment and medium based on fusion output
CN117315402A (en) * 2023-11-02 2023-12-29 北京百度网讯科技有限公司 Training method of three-dimensional object detection model and three-dimensional object detection method

Also Published As

Publication number Publication date
WO2023050810A1 (en) 2023-04-06

Similar Documents

Publication Publication Date Title
CN114119991A (en) Target detection method and device, electronic equipment and storage medium
US10643103B2 (en) Method and apparatus for representing a map element and method and apparatus for locating a vehicle/robot
US11567496B2 (en) Method and apparatus for optimizing scan data and method and apparatus for correcting trajectory
CN111860227B (en) Method, apparatus and computer storage medium for training trajectory planning model
CN113344998B (en) Depth detection method and device, computer equipment and storage medium
CN112926461B (en) Neural network training and driving control method and device
CN113011364B (en) Neural network training, target object detection and driving control method and device
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN114495026A (en) Laser radar identification method and device, electronic equipment and storage medium
CN111627001A (en) Image detection method and device
CN112926395A (en) Target detection method and device, computer equipment and storage medium
CN112990200A (en) Data labeling method and device, computer equipment and storage medium
CN111681172A (en) Method, equipment and system for cooperatively constructing point cloud map
CN110751722B (en) Method and device for simultaneously positioning and establishing image
CN112580438A (en) Point cloud identification method in three-dimensional scene
Rahaman et al. Lane detection for autonomous vehicle management: PHT approach
CN111709377B (en) Feature extraction method, target re-identification method and device and electronic equipment
CN114612667A (en) Image segmentation method, vehicle control method and device
US11657506B2 (en) Systems and methods for autonomous robot navigation
CN112946603A (en) Road maintenance detection system based on laser radar and detection method thereof
CN109711363A (en) Vehicle positioning method, device, equipment and storage medium
KR101312306B1 (en) Apparatus for recognizing signs, Method thereof, and Method for recognizing image
CN112949827B (en) Neural network generation, data processing and intelligent driving control method and device
CN116681884B (en) Object detection method and related device
Ma et al. Fast, accurate vehicle detection and distance estimation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40062590

Country of ref document: HK