WO2023050810A1 - Target detection method and apparatus, electronic device, storage medium, and computer program product

Info

Publication number: WO2023050810A1
Application number: PCT/CN2022/090957
Authority: WO (WIPO, PCT)
Other languages: French (fr), Chinese (zh)
Prior art keywords: dimensional, target object, information, detection, depth
Inventors: 刘配, 杨国润, 王哲, 石建萍
Applicant: 上海商汤智能科技有限公司

Classifications

    • G06F 18/23: Pattern recognition; Analysing; Clustering techniques
    • G06F 18/23213: Clustering techniques; Non-hierarchical techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06N 3/045: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/08: Computing arrangements based on biological models; Neural networks; Learning methods

Definitions

  • The present disclosure relates to the technical field of image processing, and in particular to a target detection method, device, electronic equipment, storage medium, and computer program product.
  • Compared with two-dimensional (2D) target detection tasks, three-dimensional (3D) target detection tasks are more difficult and more complex: they require detecting, from the 3D scene, the 3D geometric information and semantic information of the target, mainly including the target's length, width, height, center point, and orientation angle.
  • Because it is economical and practical, monocular-image 3D target detection is widely used in various fields, such as the field of autonomous driving.
  • However, 3D object detection techniques based on monocular images mainly rely on external subtasks responsible for tasks such as 2D object detection and depth map estimation. Because these subtasks are trained separately, they introduce accuracy losses that limit the performance upper bound of the network model and cannot meet the accuracy requirements of 3D detection.
  • Embodiments of the present disclosure provide at least one object detection method, device, electronic device, storage medium, and computer program product, so as to improve the accuracy of 3D object detection.
  • In a first aspect, an embodiment of the present disclosure provides a target detection method, the method comprising: performing feature extraction on an image to be detected to obtain a feature map of the image to be detected; based on the feature map, generating a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground; and, based on the depth map and the feature map, determining the three-dimensional detection information of the target object.
  • With the above target detection method, not only can feature extraction be performed on the image to be detected, but a depth map corresponding to the projection area where the target object projects onto the ground can also be generated from the extracted feature map, and the 3D detection information of the target object then determined based on the depth map and the feature map.
  • Because the generated depth map points to the target object in the image to be detected and corresponds to the projection area where the target object is projected onto the ground, the projection area is associated with the target object to a certain extent; the depth map of this local ground can therefore serve as a guide when three-dimensional detection is performed on the feature map of the target object on that local ground, improving detection accuracy.
  • In a possible implementation, after the feature map of the image to be detected is obtained, the method further includes: detecting the feature map to obtain two-dimensional detection information for the target object.
  • Determining the three-dimensional detection information of the target object based on the depth map and the feature map then includes: determining three-dimensional prior frame information for the target object based on the two-dimensional detection information; and, based on the three-dimensional prior frame information, the depth map, and the feature map, determining the 3D detection frame information of the target object.
  • The 3D detection here can incorporate the information of a 3D prior frame. The 3D prior frame constrains the starting position of the 3D detection to a certain extent, so that the 3D detection frame information is searched for near that starting position, further improving the accuracy of the 3D detection.
  • In a possible implementation, the two-dimensional detection information includes the two-dimensional detection frame information of the target object and the category information of the target object, and determining the 3D prior frame information for the target object based on the two-dimensional detection information includes: based on the category information, determining the clustering information of each subcategory included in the category to which the target object belongs; and, according to the clustering information of each subcategory and the two-dimensional detection frame information of the target object, determining the three-dimensional prior frame information for the target object.
  • The three-dimensional prior frame here can be determined in combination with the category information of the target object. The size and position of the corresponding three-dimensional prior frame may differ for different categories of target objects, so the category information can assist in determining the position of the three-dimensional prior frame with high accuracy.
  • In a possible implementation, determining the three-dimensional prior frame information for the target object according to the clustering information of each subcategory and the two-dimensional detection frame information of the target object includes: for each subcategory, determining the depth value corresponding to that subcategory based on the cluster height value included in its clustering information and the width value included in the two-dimensional detection frame information; and, based on the clustering information of the subcategory and the corresponding depth value, determining one piece of three-dimensional prior frame information for the target object.
  • In a possible implementation, determining the 3D detection frame information of the target object based on the 3D prior frame information, the depth map, and the feature map includes: determining a 3D detection frame offset according to the depth map and the feature map; and determining the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset.
  • In a possible implementation, determining the offset of the 3D detection frame according to the depth map and the feature map includes: based on the position range included in the two-dimensional detection frame information, extracting from the depth map and the feature map, respectively, a depth map and a feature map matching that position range; the three-dimensional detection frame offset is then determined based on the depth map and feature map matched with the position range.
  • In a possible implementation, determining the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset includes: determining the 3D detection frame information of the target object based on each piece of 3D prior frame information, the weight corresponding to each piece of 3D prior frame information, and the 3D detection frame offset.
  • In a possible implementation, the method further includes determining the weight corresponding to each piece of three-dimensional prior frame information; this determination includes: based on the prediction probability of each subcategory, determining the weight of the three-dimensional prior frame information corresponding to that subcategory.
  • The greater the prediction probability, the more likely the target object belongs to the corresponding subcategory, and the higher the weight that can be given to the corresponding three-dimensional prior frame information; in this way, the prediction accuracy of the final 3D detection frame is further improved.
  • In a possible implementation, detecting the feature map to obtain the two-dimensional detection information for the target object includes: determining a two-dimensional detection frame offset according to the feature map; and determining the two-dimensional detection information of the target object based on the preset two-dimensional prior frame information and the two-dimensional detection frame offset.
  • In a possible implementation, the depth map is determined by a trained depth map generation network; the depth map generation network is obtained by training with image samples and with annotated depth maps determined from the three-dimensional annotation frame information of the target objects marked in those image samples.
  • In a possible implementation, the three-dimensional annotation frame information of the target object includes the position coordinates and depth value of the center point of the bottom surface of the annotation frame, and the annotated depth map is acquired according to the following steps: based on the correspondence between the 3D coordinate system of the annotation frame and the ground coordinate system of the bottom-center point, the annotated three-dimensional frame information is projected onto the ground to obtain the projection area of the target object on the ground and the extended area in which that projection area lies; the depth value of each three-dimensional annotation point on the extended area is determined from the position coordinates and depth value of the bottom-center point; each three-dimensional annotation point on the extended area, expressed in the camera coordinate system, is projected onto the pixel plane in the image coordinate system to obtain projection points in the pixel plane; and the annotated depth map is obtained based on the depth values of the three-dimensional annotation points on the extended area and the projection points in the pixel plane.
  • The annotated depth map here is obtained by combining ground projection operations with conversion operations between coordinate systems. The extended area can completely cover the local ground area that contains the target object, and the annotated depth map obtained from the 3D projection result of the extended area can reflect the depth information of the extended area, including the local ground area, which can specifically assist the three-dimensional detection of the target object on the corresponding local ground area.
  • In a possible implementation, determining the depth value of each three-dimensional annotation point on the extended area based on the position coordinates and depth value of the bottom-center point included in the three-dimensional annotation frame information includes: determining the depth value and position coordinates of the bottom-center point as the depth value and position coordinates of the center point of the extended area, respectively; and, with the depth value of the center point of the extended area as the initial depth value, determining the depth value of each three-dimensional annotation point in the extended area at a preset depth interval.
  • An embodiment of the present disclosure also provides a target detection device, the device comprising: an extraction module configured to perform feature extraction on an image to be detected to obtain a feature map of the image to be detected; a generation module configured to generate, based on the feature map, a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground; and a first detection module configured to determine the three-dimensional detection information of the target object based on the depth map and the feature map.
  • An embodiment of the present disclosure further provides an electronic device, including a processor, a memory, and a bus; the memory stores machine-readable instructions executable by the processor, and when the electronic device runs, the processor communicates with the memory through the bus. When the machine-readable instructions are executed by the processor, the steps of the target detection method described in the first aspect, or in any one of its implementation manners, are performed.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the target detection method described in the first aspect, or in any one of its implementation manners, are executed.
  • Embodiments of the present disclosure further provide a computer program product, including a computer-readable storage medium storing program code; when the instructions included in the program code are executed by the processor of a computer device, the steps of the target detection method described in the first aspect can be implemented.
  • FIG. 1 shows a flow chart of a method for target detection provided by an embodiment of the present disclosure
  • FIG. 2A shows a flowchart of a method for determining two-dimensional detection information of a target object provided by an embodiment of the present disclosure
  • FIG. 2B shows a flow chart of a method for determining 3D prior frame information for a target object provided by an embodiment of the present disclosure
  • FIG. 2C shows a flow chart of a method for determining one piece of three-dimensional prior frame information for a target object provided by an embodiment of the present disclosure
  • FIG. 2D shows a flow chart of a method for determining the three-dimensional detection frame information of a target object provided by an embodiment of the present disclosure
  • FIG. 2E shows a flow chart of a method for determining the offset of a three-dimensional detection frame of a target object provided by an embodiment of the present disclosure
  • FIG. 2F shows a flowchart of a method for training a depth map generation network provided by an embodiment of the present disclosure
  • FIG. 2G shows a flow chart of a method for obtaining a marked depth map provided by an embodiment of the present disclosure
  • FIG. 2H shows a flowchart of a method for determining the depth value of each three-dimensional label point on the extended area provided by an embodiment of the present disclosure
  • FIG. 2I shows a schematic diagram of the application of a method for generating local ground depth labels provided by an embodiment of the present disclosure
  • FIG. 2J shows a schematic diagram of the application of a target detection method provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of a target detection device provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
  • In the related art, detection accuracy has reached a very high level, and the commonly used 3D target detection methods are based on LiDAR data; however, the expensive data-collection equipment makes large-scale application and deployment difficult.
  • In contrast, a monocular-image 3D target detection method can use an ordinary vehicle-mounted camera, which is economical and readily available. The task of monocular 3D detection is to detect the 3D geometric information and semantic information of the target object from the 3D scene, including the target object's length, width, height, center point, and orientation angle.
  • Monocular image-based 3D object detection technology relies on external subtasks responsible for tasks such as 2D object detection and depth map estimation. Because the subtasks are trained separately, they introduce accuracy losses that limit the performance upper bound of the network model and cannot meet the accuracy requirements of 3D detection, so this technology is difficult to use in practical applications.
  • The difficulty of current 3D object detection methods lies in the depth prediction of the 3D detection frames. The labels for 3D target detection provide depth information only for the center point or corners of the target frame, which is difficult for the network to learn from and cannot yield more, or more accurate, depth information.
  • Some 3D object detection methods in the related art guide the learning of the 3D detection frame through subtasks such as depth estimation, pseudo point cloud generation, and semantic segmentation. However, these subtasks require a large number of accurate depth labels, which is difficult to satisfy in practical applications; moreover, the accuracy of the subtasks limits the performance upper bound of the 3D target detection, making it unreliable.
  • To this end, the present disclosure provides a target detection method, device, electronic device, storage medium, and computer program product, so as to improve the accuracy of 3D object detection.
  • The execution subject of the target detection method provided in the embodiments of the present disclosure is generally a computer device with certain computing capabilities. The computer device may be, for example, a terminal device, a server, or other processing device; the terminal device may be user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the target detection method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • FIG. 1 is a flow chart of a target detection method provided by an embodiment of the present disclosure. The method is executed by an electronic device and includes steps S101 to S103, wherein:
  • Step S101: perform feature extraction on the image to be detected to obtain a feature map of the image to be detected;
  • Step S102: based on the feature map, generate a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground;
  • Step S103: based on the depth map and the feature map, determine the three-dimensional detection information of the target object.
  • The above target detection method can be applied in the field of computer vision, for example to vehicle detection in autonomous driving and to drone detection. Considering the wide application of autonomous driving, vehicle detection is taken as the example below.
  • The 3D object detection technology in the related art relies on external subtasks responsible for tasks such as 2D object detection and depth map estimation; because the subtasks are trained separately, they introduce accuracy losses, resulting in low final 3D detection accuracy. In view of this, the embodiments of the present disclosure provide a solution that combines a local depth map with a feature map for three-dimensional detection, achieving high detection accuracy.
  • The image to be detected in the embodiments of the present disclosure may be an image collected in a target scene; different application scenes correspond to different collected images. Taking autonomous driving as an example, the image to be detected can be an image collected by the camera device installed on the driverless vehicle while it is driving, and the image can include any target objects within the camera device's field of view. The target object here may be a vehicle ahead or a pedestrian ahead, which is not limited here.
  • The embodiments of the present disclosure may use various feature extraction methods to extract the feature map of the image to be detected. For example, the feature map can be extracted from the image to be detected through image processing, or by using a trained feature extraction network. The following description takes the feature extraction network as an example.
  • The feature extraction network here can be a convolutional neural network (CNN). In practical applications, it can be implemented with a CNN model composed of convolutional blocks, dense blocks, and transition blocks. A convolutional block can consist of a convolutional layer, a batch normalization layer, and a rectified linear unit (ReLU) layer; a dense block can consist of multiple convolutional blocks with multiple skip connections; and a transition block generally consists of a convolutional block and an average pooling layer. The exact composition of these blocks, for example how many convolutional layers and average pooling layers are used, can be determined according to the application scenario and is not limited here.
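  • As an illustration of the block structure described above, the following is a minimal PyTorch sketch; the class names, channel counts, and layer counts are placeholders chosen for the example, not values taken from the patent:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Sequential):
    # Convolutional layer + batch normalization + linear rectification (ReLU).
    def __init__(self, c_in, c_out, stride=1):
        super().__init__(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

class DenseBlock(nn.Module):
    # Multiple convolutional blocks joined by skip connections (concatenation).
    def __init__(self, c_in, growth, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            ConvBlock(c_in + i * growth, growth) for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # skip connection
        return x

class TransitionBlock(nn.Sequential):
    # Convolutional block followed by an average pooling layer.
    def __init__(self, c_in, c_out):
        super().__init__(ConvBlock(c_in, c_out), nn.AvgPool2d(kernel_size=2))
```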
  • After the feature map is extracted, the embodiments of the present disclosure can generate a local depth map from it. The local depth map corresponds to the area where the target object in the image to be detected projects onto the ground, and therefore carries the local ground depth information associated with the target object. Because the local ground is bound to the target object to a certain extent, combining this depth map with the feature map extracted above allows the target object to be detected more accurately.
  • The above local depth map may be determined by using a trained depth map generation network. During training, the network learns the correspondence between the features of an image sample and the per-pixel depths of its annotated depth map; once trained, feeding it the extracted feature map yields the depth map corresponding to the ground projection area associated with the target object.
  • In implementation, the feature map and the depth map can be cropped using ROI-Align (region-of-interest alignment), and the three-dimensional detection of the target object can then be performed on the cropped depth map and feature map corresponding to the target object.
  • The 3D detection in the embodiments of the present disclosure may be based on residual prediction against a 3D prior frame. In residual prediction, the information of the original 3D prior frame guides the subsequent detection: the prior frame serves as the initial position, and the 3D detection frame is searched for near that position. Especially when the accuracy of the 3D prior frame is relatively high, this significantly improves detection efficiency compared with direct 3D detection.
  • Here, the above-mentioned 3D prior frame may be determined based on the 2D detection information, so that 3D detection can be realized based on the 3D prior frame information, the depth map, and the feature map.
  • In the embodiments of the present disclosure, the two-dimensional detection information of the target object can be determined according to the steps shown in FIG. 2A, wherein the steps include:
  • Step S201: determine the offset of the two-dimensional detection frame according to the feature map;
  • Step S202: based on the preset two-dimensional prior frame information and the offset of the two-dimensional detection frame, determine the two-dimensional detection information of the target object.
  • In implementation, the two-dimensional detection information can be determined from the result of combining the two-dimensional detection frame offset with the preset two-dimensional prior frame information.
  • In the embodiments of the present disclosure, the two-dimensional detection information may be obtained by applying a trained first target detection network to the feature map.
  • The first target detection network can be trained on either of two correspondences: between the feature map of an image sample and its two-dimensional label information, or between the feature map of an image sample and an offset (the difference between the two-dimensional label frame and the two-dimensional prior frame). With the former, the two-dimensional detection information of the target object in the image to be detected can be determined directly; with the latter, the offset is determined first, and the sum of the offset and the two-dimensional prior frame then gives the two-dimensional detection information of the target object.
  • In implementation, the determined two-dimensional detection information may include the two-dimensional detection frame position information (x_2d, y_2d, w_2d, h_2d), the center point position information (x_p, y_p), the orientation angle (θ_3d), the category information (cls) to which the target object belongs, and other information related to two-dimensional detection, which is not limited here.
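  • A minimal sketch of this residual decoding, assuming the common convention that the detection frame is the element-wise sum of the prior frame parameters and the predicted offset (the patent states only that the sum of the offset and the 2D prior frame gives the detection information); all numbers are hypothetical:

```python
import numpy as np

def decode_2d(prior, offset):
    # (x, y, w, h) of the preset 2D prior frame plus the predicted residual
    # gives the 2D detection frame (x_2d, y_2d, w_2d, h_2d).
    return np.asarray(prior) + np.asarray(offset)

box_2d = decode_2d(prior=[120.0, 80.0, 64.0, 48.0], offset=[3.2, -1.5, 4.0, 2.1])
print(box_2d)  # [123.2  78.5  68.   50.1]
```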
  • The first target detection network here can perform two-dimensional residual prediction. It can first reduce dimensionality through a convolutional layer and a linear rectification layer, and then perform residual prediction of the two-dimensional detection frame through multiple convolutional layers, which yields higher prediction accuracy.
  • In the embodiments of the present disclosure, the three-dimensional prior frame information for the target object can be determined according to the steps shown in FIG. 2B, wherein the steps include:
  • Step S301: based on the category information of the target object, determine the clustering information of each subcategory included in the category to which the target object belongs;
  • Step S302: according to the clustering information of each subcategory and the two-dimensional detection frame information where the target object is located, determine the three-dimensional prior frame information for the target object.
  • Here, the three-dimensional prior frame information is determined by combining the clustering information of each subcategory under the category to which the target object belongs with the two-dimensional detection frame information where the target object is located.
  • This is because targets in the same category can differ greatly in their 3D detection results across subcategories: for example, among targets that all belong to the vehicle category, the 3D detection frame of the car subcategory differs greatly in size from that of the large-truck subcategory.
  • Therefore, subcategories can be divided in advance, and the corresponding three-dimensional prior frame information determined from the clustering information of each divided subcategory. In implementation, once the category information of the target object is determined, the clustering result corresponding to that category information can be determined.
  • Taking vehicles as an example, vehicle image samples covering various subcategories may be collected in advance, and information such as the length, width, and height of each vehicle determined from them.
  • Clustering can then be performed on the height values, so that vehicle image samples falling within the same height range are grouped into one subcategory, and the clustering information of that subcategory is determined.
  • In implementation, clustering methods such as the K-means algorithm can be used to realize the above clustering process.
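  • For example, the height-based subcategory clustering could be realized as follows; the three clusters and the sample dimensions are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (length, width, height) statistics of collected vehicle samples, in meters.
dims = np.array([
    [4.5, 1.8, 1.5], [4.3, 1.7, 1.4],    # car-like samples
    [6.8, 2.2, 2.6], [7.1, 2.3, 2.7],    # van-like samples
    [12.0, 2.5, 3.8], [11.5, 2.5, 3.9],  # truck-like samples
])

# Cluster on the height value, so samples in the same height range form one subcategory.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dims[:, 2:3])

# Clustering information per subcategory, e.g. the mean dimensions of its members.
for k in range(3):
    members = dims[km.labels_ == k]
    print(f"subcategory {k}: mean (l, w, h) = {members.mean(axis=0)}")
```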
  • In the embodiments of the present disclosure, the process of determining the three-dimensional prior frame information from the clustering information and the two-dimensional detection frame information where the target object is located can follow the steps shown in FIG. 2C, wherein the steps include:
  • Step S3021: for each subcategory, determine the depth value corresponding to that subcategory based on the cluster height value included in its clustering information and the width value included in the two-dimensional detection frame information where the target object is located;
  • Step S3022: based on the clustering information of the subcategory and the depth value corresponding to the subcategory, determine one piece of 3D prior frame information for the target object.
  • Each subcategory can correspond to one piece of 3D prior frame information; information such as the size of the 3D prior frame is determined from the clustering information of the corresponding subcategory, while the associated depth information is determined from the cluster height value and the width value included in the two-dimensional detection frame information.
  • In implementation, this can be realized by first taking the ratio of the cluster height value to the width value, and then multiplying by the focal length of the camera device.
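  • Read literally, the depth of a subcategory's prior frame is the camera focal length times the ratio of the clustered (real-world) height to the width value of the 2D detection frame. A pinhole-camera sketch of that computation, with hypothetical numbers:

```python
def subcategory_depth(cluster_height_m, box_width_px, focal_px):
    # Similar-triangles estimate: depth = focal length * (real size / image size).
    return focal_px * cluster_height_m / box_width_px

z_prior = subcategory_depth(cluster_height_m=1.5, box_width_px=64.0, focal_px=720.0)
# z_prior = 16.875 (meters), used as the depth of this subcategory's 3D prior frame.
```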
  • After the 3D prior frame information is determined, the embodiments of the present disclosure can combine this information with the depth map and the feature map to determine the three-dimensional detection frame information, as shown in FIG. 2D, through the following steps:
  • Step S1031: determine the offset of the three-dimensional detection frame according to the depth map and the feature map;
  • Step S1032: determine the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset.
  • In implementation, a second target detection network can be used to realize the three-dimensional detection: the 3D detection frame offset output by the second target detection network is obtained, and the 3D detection frame information of the target object is then determined based on the 3D prior frame information and that offset.
  • Here, the above three-dimensional detection frame information may include the shape information (w_3d, h_3d, l_3d) and the depth information (z_3d) of the detection frame.
  • In addition, the three-dimensional prediction can determine further information about the target object; for example, it can determine which subcategory of the target's category the object belongs to, such as whether a vehicle-category target is a car or a truck.
  • In implementation, a 3D detection frame offset can be predicted based on each 3D prior frame. Considering that different 3D prior frames correspond to different subcategories, and that the prediction probabilities of those subcategories also differ, weights can be assigned to the 3D prior frame information of each subcategory according to the subcategory prediction probabilities; the 3D detection frame information of the target object is then determined based on each piece of 3D prior frame information, the weight corresponding to each piece, and the 3D detection frame offset.
  • Subcategories with higher predicted probabilities can be given higher weights to strengthen the role of their 3D prior frames in the subsequent 3D detection, while subcategories with lower predicted probabilities can be given smaller weights to weaken the role of their prior frames, making the determined 3D detection frame information more accurate.
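  • One plausible reading of this weighting scheme is sketched below: the subcategory probabilities weight the per-subcategory prior frames, and the predicted residual is added on top. The exact combination rule is not spelled out in the patent, so the formula and all numbers here are assumptions:

```python
import numpy as np

def decode_3d(priors, probs, offset):
    """priors: (K, 4) array, one (w_3d, h_3d, l_3d, z_3d) prior frame per subcategory;
    probs: (K,) predicted subcategory probabilities, used as weights;
    offset: (4,) predicted 3D detection frame residual."""
    weights = probs / probs.sum()                  # higher probability -> higher weight
    weighted_prior = (weights[:, None] * priors).sum(axis=0)
    return weighted_prior + offset                 # residual prediction

priors = np.array([[1.8, 1.5, 4.5, 15.0],    # hypothetical car-like prior
                   [2.3, 2.7, 7.0, 14.0],    # hypothetical van-like prior
                   [2.5, 3.8, 12.0, 13.5]])  # hypothetical truck-like prior
box_3d = decode_3d(priors, probs=np.array([0.7, 0.2, 0.1]),
                   offset=np.array([0.05, -0.02, 0.1, 0.4]))
```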
  • In the embodiments of the present disclosure, the depth map and the feature map can be cropped first and the 3D detection then performed, as shown in FIG. 2E, through the following steps:
  • Step S1031a: based on the position range included in the two-dimensional detection frame information where the target object is located, extract from the depth map and the feature map, respectively, a depth map and a feature map matching that position range;
  • Step S1031b: determine the offset of the 3D detection frame based on the depth map and feature map matched with the position range.
  • Based on the position range included in the two-dimensional detection frame information, the corresponding portions of the depth map and the feature map can be cropped out, that is, a local depth map and a local feature map pointing to the target object are obtained. In this way, the corresponding 3D detection frame offset can be determined based on the local depth map and the local feature map; the offset here can likewise be determined using the second target detection network, and because the input focuses on the target object, the prediction accuracy is higher.
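  • A sketch of this cropping step using torchvision's ROI-Align operator; the map sizes, the 2D box, and the 1/4 feature stride are placeholder assumptions:

```python
import torch
from torchvision.ops import roi_align

feat  = torch.randn(1, 256, 96, 320)  # feature map of the image to be detected
depth = torch.randn(1, 1, 96, 320)    # generated local ground depth map

# Position range of the 2D detection frame as (batch_index, x1, y1, x2, y2)
# in input-image pixels; spatial_scale maps it onto the 1/4-resolution maps.
rois = torch.tensor([[0.0, 480.0, 160.0, 736.0, 352.0]])

feat_crop  = roi_align(feat,  rois, output_size=(7, 7), spatial_scale=0.25, aligned=True)
depth_crop = roi_align(depth, rois, output_size=(7, 7), spatial_scale=0.25, aligned=True)
# Both crops (here 1 x C x 7 x 7) are fed to the second target detection network.
```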
  • Before the above detection is performed, the first target detection network and the second target detection network need to be trained. In combination with the corresponding supervisory signals (that is, the labeled frame information), the corresponding loss function values can be determined; based on these loss function values, network training can be guided by backpropagation, which is not limited here.
  • Similarly, the embodiments of the present disclosure set a corresponding supervisory signal (that is, the annotated depth map) for the depth map generated by the depth map generation network. In implementation, the training process of the depth map generation network is shown in FIG. 2F and includes the following steps:
  • Step S401: acquire an image sample and an annotated depth map determined based on the three-dimensional annotation frame information of the target object annotated in the image sample;
  • Step S402: perform feature extraction on the image sample to obtain a feature map of the image sample;
  • Step S403: input the feature map of the image sample into the depth map generation network to be trained, obtain the depth map output by the network, and determine the loss function value based on the similarity between the output depth map and the annotated depth map;
  • Step S404: when the loss function value is greater than a preset threshold, adjust the network parameter values of the depth map generation network and input the feature map of the image sample into the adjusted network again, until the loss function value is less than or equal to the preset threshold.
  • The image samples here are acquired in a manner similar to that of the image to be detected. For the extraction of the feature map of an image sample, refer to the feature map extraction process for the image to be detected described above.
  • In this way, the embodiments of the present disclosure can determine the loss function value based on the similarity between the depth map output by the depth map generation network and the annotated depth map, and adjust the network parameter values of the depth map generation network according to the loss function value, so that the network output tends to become the same as, or closer to, the annotated depth map.
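  • A minimal training-loop sketch following steps S401 to S404; the one-layer network, the random data, the mean-squared-error similarity loss, and the threshold are all stand-ins for details the patent leaves unspecified:

```python
import torch
import torch.nn as nn

depth_net = nn.Conv2d(256, 1, kernel_size=3, padding=1)  # stand-in depth map generation network
optimizer = torch.optim.Adam(depth_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()   # similarity between output and annotated depth map
threshold = 0.01         # hypothetical preset loss threshold

feat_sample = torch.randn(1, 256, 96, 320)  # feature map of an image sample (S402)
label_depth = torch.rand(1, 1, 96, 320)     # annotated depth map (S401)

loss = torch.tensor(float("inf"))
while loss.item() > threshold:              # S404: repeat until loss <= threshold
    optimizer.zero_grad()
    pred_depth = depth_net(feat_sample)     # S403: depth map output by the network
    loss = loss_fn(pred_depth, label_depth)
    loss.backward()                         # adjust the network parameter values
    optimizer.step()
```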
  • The annotated depth map described in step S401 above can be obtained according to the steps shown in FIG. 2G:
  • Step S4011: based on the correspondence between the 3D coordinate system where the 3D annotation frame is located and the ground coordinate system where the bottom-center point of the frame is located, project the annotated 3D frame information onto the ground to obtain the projection area of the target object on the ground and the extended area in which the projection area lies;
  • Step S4012: determine the depth value of each three-dimensional annotation point on the extended area based on the position coordinates and depth value of the bottom-center point included in the 3D annotation frame information;
  • Step S4013: based on the correspondence between the camera coordinate system and the image coordinate system, project the three-dimensional annotation points on the extended area from the camera coordinate system onto the pixel plane in the image coordinate system to obtain the projection points in the pixel plane;
  • Step S4014: obtain the annotated depth map based on the depth values of the three-dimensional annotation points on the extended area and the projection points in the pixel plane.
  • In implementation, determining the depth value of each three-dimensional annotation point on the extended area based on the position coordinates and depth value of the bottom-center point included in the 3D annotation frame information, as shown in FIG. 2H, includes the following steps:
  • Step S41: determine the depth value and position coordinates of the bottom-center point of the annotation frame as the depth value and position coordinates of the center point of the extended area, respectively;
  • Step S42: with the position coordinates of the center point of the extended area determined, take the depth value of that center point as the initial depth value and determine the depth value of each three-dimensional annotation point in the extended area at a preset depth interval.
  • An embodiment of the present disclosure provides a method for generating local ground depth labels, illustrated in FIG. 2I. The depth information of the surrounding ground (corresponding to the extended area) can be derived from the position of the bottom-center point of the three-dimensional annotation frame included in the target object's annotation information, since that center point falls on the ground.
  • Considering that the bottom-center point of the three-dimensional annotation frame is at the same height as the surrounding ground, as shown in panel (b) of FIG. 2I, a large number of three-dimensional annotation points 23 can be generated within an extended area 22 around the center point. These three-dimensional points include the center point itself and carry the initial depth value of the center of the extended area 22, with each annotation point 23 on the extended area determined at the preset depth interval.
  • Also in panel (b) of FIG. 2I, the projective relationship is used to project each three-dimensional annotation point 23 onto the pixel plane, obtaining the projection point corresponding to each annotation point and recording the depth value of the annotation point 23. Since several three-dimensional points may project to the same pixel, the average depth value of the at least one corresponding three-dimensional annotation point can be computed for each projection point, yielding the annotated depth map 24 shown in panel (c) of FIG. 2I.
  • From the annotated depth map 24, the depth information of the surrounding ground (corresponding to the extended area) can be obtained.
  • In implementation, the projection from the camera coordinate system to the pixel plane can be written as [x_p, y_p, 1]^T ∝ P_rect · R_rect · [x_3d, y_3d, z_3d, 1]^T, where (x_3d, y_3d, z_3d) are the camera coordinates of a three-dimensional annotation point, (x_p, y_p) is the projection point of that annotation point on the pixel plane, and P_rect and R_rect are the projection matrix and the rotation correction matrix, respectively.
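  • A numpy sketch of steps S4011 to S4014 under the projection above, assuming KITTI-style matrices (P_rect of shape 3x4, R_rect of shape 4x4); the extent of the extended area and the preset interval are hypothetical:

```python
import numpy as np
from collections import defaultdict

def annotated_depth_map(center_xyz, P_rect, R_rect, img_hw, half_extent=3.0, step=0.5):
    """center_xyz: camera coordinates of the annotation frame's bottom-center point."""
    cx, cy, cz = center_xyz
    # S41/S42: lay out 3D annotation points over the extended area at the same
    # height as the center point, spaced at the preset interval; depths start
    # from the center point's depth cz.
    xs = np.arange(cx - half_extent, cx + half_extent + step, step)
    zs = np.arange(cz - half_extent, cz + half_extent + step, step)
    pts = np.array([[x, cy, z, 1.0] for x in xs for z in zs])  # homogeneous coords

    # S4013: project the camera-frame points onto the pixel plane.
    uvw = (P_rect @ R_rect @ pts.T).T
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)

    # S4014: average the depths of all 3D points that hit the same pixel.
    buckets = defaultdict(list)
    for (u, v), z in zip(uv, pts[:, 2]):
        if 0 <= u < img_hw[1] and 0 <= v < img_hw[0]:
            buckets[(v, u)].append(z)
    depth_map = np.zeros(img_hw, dtype=np.float32)
    for (v, u), z_list in buckets.items():
        depth_map[v, u] = np.mean(z_list)
    return depth_map
```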
  • After training is completed, inputting the feature map of the image to be detected into the trained depth map generation network yields the depth map corresponding to the projection area where the target object in the image to be detected projects onto the ground; the three-dimensional prediction of the target object can then be realized in combination with the feature map and the 3D prior frame information.
  • In implementation, as shown in FIG. 2J, the feature map of the image to be detected is first extracted by the feature extraction network 32. Then, on the one hand, two-dimensional detection is performed by the first target detection network 33 to obtain the two-dimensional detection information of the target object; on the other hand, the depth map generation network produces the depth map 35.
  • Based on the two-dimensional detection information of the target object, the three-dimensional prior frame information can be determined; FIG. 2J shows, as an example, three pieces of 3D prior frame information determined for the corresponding three subcategories 36.
  • Based on the two-dimensional detection information, the depth map and the feature map are cropped in the ROI-Align manner, and the cropped depth map and feature map are input into the second target detection network 37 to obtain the corresponding 3D detection frame offsets, such as the Δ(w,h,l)_3d and Δz_3d shown in FIG. 2J.
  • Combining the above 3D detection frame offsets with the 3D prior frame information, the 3D detection information can be determined. In practical applications, this three-dimensional detection information can be presented on the image to be detected.
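  • Putting the FIG. 2J pipeline together as a runnable toy sketch; every module below is a random or hard-coded stand-in for the corresponding trained network, and all shapes and numbers are hypothetical:

```python
import numpy as np

feature_extractor = lambda img: np.random.rand(256, 96, 320)        # network 32
first_detector = lambda f: [((480, 160, 736, 352), "vehicle",       # network 33:
                             np.array([0.7, 0.2, 0.1]))]            # box, category, subcat probs
depth_generator = lambda f: np.random.rand(1, 96, 320)              # depth map 35
second_detector = lambda fc, dc: np.array([0.05, -0.02, 0.1, 0.4])  # network 37: offsets

priors_by_category = {"vehicle": np.array([[1.8, 1.5, 4.5, 15.0],   # three subcategory
                                           [2.3, 2.7, 7.0, 14.0],   # prior frames 36
                                           [2.5, 3.8, 12.0, 13.5]])}

def crop(m, box, scale=0.25):  # crude stand-in for the ROI-Align cropping
    x1, y1, x2, y2 = (int(v * scale) for v in box)
    return m[:, y1:y2, x1:x2]

def detect_3d(image):
    feat = feature_extractor(image)
    depth_map = depth_generator(feat)
    results = []
    for box2d, category, probs in first_detector(feat):
        offset = second_detector(crop(feat, box2d), crop(depth_map, box2d))
        priors = priors_by_category[category]
        weighted = (probs / probs.sum()) @ priors   # probability-weighted priors
        results.append(weighted + offset)           # add the residual offsets
    return results

print(detect_3d(np.zeros((384, 1280, 3))))
```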
  • An embodiment of the present disclosure also provides a target detection device corresponding to the target detection method. Since the device in the embodiments of the present disclosure solves the problem on a principle similar to that of the above target detection method, its implementation can refer to the implementation of the method.
  • FIG. 3 is a schematic diagram of a target detection device provided by an embodiment of the present disclosure. The device includes an extraction module 301, a generation module 302, and a first detection module 303, wherein:
  • the extraction module 301 is configured to perform feature extraction on the image to be detected to obtain a feature map of the image to be detected;
  • the generation module 302 is configured to generate a depth map corresponding to a projection area where the target object in the image to be detected is projected to the ground based on the feature map;
  • the first detection module 303 is configured to determine three-dimensional detection information of the target object based on the depth map and the feature map.
  • With the above target detection device, not only can feature extraction be performed on the image to be detected, but a depth map corresponding to the projection area where the target object projects onto the ground can also be generated from the extracted feature map, and the 3D detection information of the target object then determined based on the depth map and the feature map. Because the generated depth map points to the target object in the image to be detected and corresponds to that object's projection area on the ground, the projection area is associated with the target object to a certain extent; the depth map of this local ground can therefore serve as a guide for the three-dimensional detection, improving detection accuracy.
  • In a possible implementation, the above device further includes: a second detection module 304, configured to detect the feature map after the feature map of the image to be detected is obtained, to obtain two-dimensional detection information for the target object.
  • The first detection module 303 is configured to determine the three-dimensional detection information of the target object based on the depth map and the feature map according to the following steps: determining three-dimensional prior frame information for the target object based on the two-dimensional detection information; and, based on the three-dimensional prior frame information, the depth map, and the feature map, determining the 3D detection frame information of the target object.
  • In a possible implementation, the two-dimensional detection information includes the two-dimensional detection frame information where the target object is located and the category information of the target object; the first detection module 303 is configured to determine the three-dimensional prior frame information for the target object based on the two-dimensional detection information according to the following steps: based on the category information, determining the clustering information of each subcategory included in the category to which the target object belongs; and, according to the clustering information of each subcategory and the two-dimensional detection frame information, determining the three-dimensional prior frame information for the target object.
  • In a possible implementation, the first detection module 303 is configured to determine the three-dimensional prior frame information for the target object according to the clustering information of each subcategory and the two-dimensional detection frame information where the target object is located according to the following steps: for each subcategory, determining the depth value corresponding to that subcategory based on the cluster height value included in its clustering information and the width value included in the two-dimensional detection frame information; and, based on the clustering information of the subcategory and the corresponding depth value, determining one piece of three-dimensional prior frame information for the target object.
  • In a possible implementation, the first detection module 303 is configured to determine the 3D detection frame information of the target object based on the 3D prior frame information, the depth map, and the feature map according to the following steps: determining a 3D detection frame offset according to the depth map and the feature map; and determining the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset.
  • In a possible implementation, the first detection module 303 is configured to determine the offset of the three-dimensional detection frame according to the depth map and the feature map according to the following steps: based on the position range included in the two-dimensional detection frame information, extracting from the depth map and the feature map, respectively, a depth map and a feature map matching that position range; and determining the 3D detection frame offset based on the matched depth map and feature map.
  • In a possible implementation, the first detection module 303 is configured to determine the 3D detection frame information of the target object based on the 3D prior frame information and the offset of the 3D detection frame according to the following steps: determining the 3D detection frame information based on each piece of 3D prior frame information, the weight corresponding to each piece of 3D prior frame information, and the 3D detection frame offset.
  • In a possible implementation, the first detection module 303 is configured to determine the weight corresponding to each piece of three-dimensional prior frame information according to the following steps: based on the prediction probability of each subcategory, determining the weight of the three-dimensional prior frame information corresponding to that subcategory.
  • In a possible implementation, the second detection module 304 is configured to detect the feature map to obtain the two-dimensional detection information for the target object according to the following steps: determining a two-dimensional detection frame offset according to the feature map; and determining the two-dimensional detection information of the target object based on the preset two-dimensional prior frame information and the two-dimensional detection frame offset.
  • In a possible implementation, the depth map is determined by a trained depth map generation network; the depth map generation network is obtained by training with image samples and with annotated depth maps determined from the three-dimensional annotation frame information of the target objects marked in the image samples.
  • In a possible implementation, the three-dimensional annotation frame information of the target object includes the position coordinates and depth value of the bottom-center point of the annotation frame; the generation module 302 is configured to obtain the annotated depth map according to the following steps: projecting the annotated three-dimensional frame information onto the ground based on the correspondence between the 3D coordinate system of the annotation frame and the ground coordinate system of the bottom-center point, obtaining the projection area of the target object on the ground and the extended area where the projection area lies; determining the depth value of each three-dimensional annotation point on the extended area; projecting each three-dimensional annotation point on the extended area from the camera coordinate system onto the pixel plane in the image coordinate system to obtain the projection points in the pixel plane; and obtaining the annotated depth map based on the depth values of the three-dimensional annotation points on the extended area and the projection points in the pixel plane.
  • In a possible implementation, the generation module 302 is configured to determine the depth value of each three-dimensional annotation point on the extended area based on the position coordinates and depth value of the bottom-center point included in the three-dimensional annotation frame information according to the following steps: determining the depth value and position coordinates of the bottom-center point as the depth value and position coordinates of the center point of the extended area, respectively; and, with the depth value of the center point of the extended area as the initial depth value, determining the depth values of the three-dimensional annotation points in the extended area at preset depth intervals.
  • For the electronic device provided by an embodiment of the present disclosure, FIG. 4 is a schematic structural diagram; the device includes a processor 401, a memory 402, and a bus 403. The memory 402 stores machine-readable instructions executable by the processor 401 (for example, the execution instructions corresponding to the extraction module 301, the generation module 302, and the first detection module 303 of the device in FIG. 3). When the electronic device runs, the processor 401 communicates with the memory 402 through the bus 403, and when the machine-readable instructions are executed by the processor 401, the following processes are performed: performing feature extraction on the image to be detected to obtain a feature map of the image to be detected; based on the feature map, generating a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground; and, based on the depth map and the feature map, determining the 3D detection information of the target object.
  • Embodiments of the present disclosure further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the steps of the method for object detection described in the foregoing method embodiments are executed.
  • The storage medium may be a volatile or non-volatile computer-readable storage medium.
  • An embodiment of the present disclosure also provides a computer program product carrying program code; the instructions included in the program code can be used to execute the steps of the target detection method described in the above method embodiments, to which reference may be made for details.
  • The above-mentioned computer program product may be realized by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a software development kit (SDK).
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
  • If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a non-volatile computer-readable storage medium executable by a processor. Based on this understanding, the technical solution of the present disclosure, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • In summary, the target detection method includes: performing feature extraction on the image to be detected to obtain a feature map of the image to be detected; based on the feature map, generating a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground; and, based on the depth map and the feature map, determining the three-dimensional detection information of the target object.
  • With the above target detection method, not only can feature extraction be performed on the image to be detected, but a depth map corresponding to the projection area where the target object projects onto the ground can also be generated from the extracted feature map, and the 3D detection information of the target object then determined based on the depth map and the feature map. Because the generated depth map points to the target object in the image to be detected and corresponds to that object's projection area on the ground, the projection area is associated with the target object to a certain extent; the depth map of this local ground can therefore serve as a guide when three-dimensional detection is performed on the feature map of the target object on that local ground, improving detection accuracy.

Abstract

The present disclosure provides a target detection method and apparatus, an electronic device, a storage medium, and a computer program product. The method comprises: performing feature extraction on an image to be detected to obtain a feature map of said image; on the basis of the feature map, generating a depth map corresponding to a projection region in which a target object in said image is projected to the ground; and, on the basis of the depth map and the feature map, determining three-dimensional detection information of the target object. The projection region in the present disclosure is associated with the target object to a certain degree; in this way, the depth map corresponding to the local ground can guide, in a targeted manner, the three-dimensional detection performed on the feature map of the target object on the local ground, thereby improving the accuracy of detection.

Description

一种目标检测的方法、装置、电子设备及存储介质、计算机程序产品A target detection method, device, electronic equipment, storage medium, and computer program product
相关申请的交叉引用Cross References to Related Applications
本公开基于申请号为202111164729.9、申请日为2021年9月30日、发明名称为“一种目标检测的方法、装置、电子设备及存储介质”的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本公开作为参考。This disclosure is based on a Chinese patent application with the application number 202111164729.9, the filing date is September 30, 2021, and the title of the invention is "a method, device, electronic device and storage medium for target detection", and requires the Chinese patent application Priority, the entire content of the Chinese patent application is hereby incorporated by reference into this disclosure.
技术领域technical field
本公开涉及图像处理技术领域,尤其涉及一种目标检测的方法、装置、电子设备及存储介质、计算机程序产品。The present disclosure relates to the technical field of image processing, and in particular to a target detection method, device, electronic equipment, storage medium, and computer program product.
背景技术Background technique
相较于二维(2D,2-Dimension)目标检测任务,三维(3D,3-Dimension)目标检测任务难度更大,复杂度更高,往往需要从3D场景中检测出目标的3D几何信息和语义信息,主要包括目标的长宽高、中心点和朝向角信息。其中,由于单目图像的3D目标检测具有经济适用的优良特性,被广泛应用于各种领域(如无人驾驶领域)。Compared with two-dimensional (2D, 2-Dimension) target detection tasks, three-dimensional (3D, 3-Dimension) target detection tasks are more difficult and more complex, and often need to detect the 3D geometric information and Semantic information mainly includes the length, width, height, center point and orientation angle information of the target. Among them, the 3D target detection of monocular images is widely used in various fields (such as the field of unmanned driving) due to its economical and practical characteristics.
然而,基于单目图像的3D目标检测技术主要依赖于一些外部的子任务,这些子任务负责执行2D目标检测、深度图估计等任务。由于子任务单独训练,本身存在精度损失,限制了网络模型的性能上限,无法满足3D检测的精度需求。However, 3D object detection techniques based on monocular images mainly rely on some external subtasks, which are responsible for performing tasks such as 2D object detection, depth map estimation, etc. Since the sub-tasks are trained separately, there is a loss of accuracy, which limits the upper limit of the performance of the network model and cannot meet the accuracy requirements of 3D detection.
发明内容Contents of the invention
本公开实施例至少提供一种目标检测的方法、装置、电子设备及存储介质、计算机程序产品,以提升3D目标检测的精度。Embodiments of the present disclosure provide at least one object detection method, device, electronic device, storage medium, and computer program product, so as to improve the accuracy of 3D object detection.
第一方面,本公开实施例提供了一种目标检测的方法,所述方法包括:In a first aspect, an embodiment of the present disclosure provides a method for target detection, the method comprising:
对待检测图像进行特征提取,得到所述待检测图像的特征图;基于所述特征图,生成所述待检测图像中的目标对象投影至地面的投影区域对应的深度图;基于所述深度图以及所述特征图,确定所述目标对象的三维检测信息。performing feature extraction on the image to be detected to obtain a feature map of the image to be detected; based on the feature map, generating a depth map corresponding to a projection area where the target object in the image to be detected is projected onto the ground; based on the depth map and The feature map determines the three-dimensional detection information of the target object.
采用上述目标检测的方法,不仅可以对待检测图像进行特征提取,还可以基于提取得到的特征图,生成待检测图像中的目标对象投影至地面的投影区域对应的深度图,进而基于深度图和特征图确定目标对象的三维检测信息。由于生成的深度图是指向待检测图像中的目标对象的,且对应的是目标对象投影到地面的投影区域,该投影区域一定程度上与目标对象关联,这样,在利用局部地面上的目标对象的特征图进行三维检测时可以以该局部地面对应的深度图作为指导,从而提升检测的精度。Using the above target detection method, not only can feature extraction be performed on the image to be detected, but also based on the extracted feature map, a depth map corresponding to the projection area of the target object in the image to be detected can be generated to the ground, and then based on the depth map and features The map determines the 3D detection information of the target object. Since the generated depth map points to the target object in the image to be detected, and corresponds to the projection area where the target object is projected onto the ground, the projection area is associated with the target object to a certain extent. In this way, using the target object on the local ground The depth map corresponding to the local ground can be used as a guide when performing three-dimensional detection on the feature map of the local ground, thereby improving the accuracy of detection.
在一种可能的实施方式中,在得到所述待检测图像的特征图之后,所述方法还包括:In a possible implementation manner, after obtaining the feature map of the image to be detected, the method further includes:
对所述特征图进行检测,得到针对所述目标对象的二维检测信息;Detecting the feature map to obtain two-dimensional detection information for the target object;
所述基于所述深度图以及所述特征图,确定所述目标对象的三维检测信息,包括:The determining the three-dimensional detection information of the target object based on the depth map and the feature map includes:
基于所述二维检测信息,确定针对所述目标对象的三维先验框信息;Based on the two-dimensional detection information, determine three-dimensional prior frame information for the target object;
基于所述三维先验框信息、所述深度图以及所述特征图,确定所述目标对象的三维检测框信息。Based on the 3D priori frame information, the depth map and the feature map, determine the 3D detection frame information of the target object.
这里的三维检测可以是结合三维先验框信息的检测,三维先验框一定程度上可以约束三维检测的起始位置,以在起始位置附近搜索三维检测框信息,从而进一步提升三维检测的精度。The 3D detection here may be performed in combination with 3D prior frame information. The 3D prior frame can, to a certain extent, constrain the starting position of the 3D detection, so that the 3D detection frame information is searched for near the starting position, thereby further improving the accuracy of the 3D detection.
在一种可能的实施方式中,所述二维检测信息包括所述目标对象所在二维检测框信息和所述目标对象所属类别信息;所述基于所述二维检测信息,确定针对所述目标对象的三维先验框信息,包括:In a possible implementation manner, the two-dimensional detection information includes the two-dimensional detection frame information of the target object and the category information of the target object; and the determining, based on the two-dimensional detection information, the three-dimensional prior frame information for the target object includes:
基于所述目标对象所属类别信息,确定所述目标对象所属类别包括的各个子类别的聚类信息;Based on the category information of the target object, determine the clustering information of each subcategory included in the category of the target object;
根据所述各个子类别的聚类信息以及所述目标对象所在二维检测框信息,确定针对所述目标对象的三维先验框信息。According to the clustering information of each subcategory and the two-dimensional detection frame information where the target object is located, the three-dimensional prior frame information for the target object is determined.
这里的三维先验框可以是结合目标对象所属类别信息来确定的。目标对象所属类别不同,所对应的三维先验框的大小、位置等可能都不同,利用类别信息可以辅助确定三维先验框的位置,准确度较高。The three-dimensional prior frame here can be determined in combination with the category information of the target object. The size and position of the corresponding three-dimensional prior frame may be different for different categories of target objects. The use of category information can assist in determining the position of the three-dimensional prior frame with high accuracy.
在一种可能的实施方式中,所述根据所述各个子类别的聚类信息以及所述目标对象所在二维检测框信息,确定针对所述目标对象的三维先验框信息,包括:In a possible implementation manner, the determining the three-dimensional prior frame information for the target object according to the clustering information of each subcategory and the two-dimensional detection frame information of the target object includes:
针对所述各个子类别中的每个子类别,基于该子类别的聚类信息包括的聚类高度值、所述目标对象所在二维检测框信息包括的宽度值,确定与该子类别对应的深度值;For each of the subcategories, determining a depth value corresponding to the subcategory based on the cluster height value included in the clustering information of the subcategory and the width value included in the two-dimensional detection frame information of the target object;
基于该子类别的聚类信息以及与该子类别对应的深度值,确定针对所述目标对象的一个三维先验框信息。Based on the clustering information of the subcategory and the depth value corresponding to the subcategory, determine a 3D prior frame information for the target object.
在一种可能的实施方式中,所述基于所述三维先验框信息、所述深度图以及所述特征图,确定所述目标对象的三维检测框信息,包括:In a possible implementation manner, the determining the 3D detection frame information of the target object based on the 3D prior frame information, the depth map, and the feature map includes:
根据所述深度图以及所述特征图确定三维检测框偏移量;determining a three-dimensional detection frame offset according to the depth map and the feature map;
基于所述三维先验框信息以及所述三维检测框偏移量,确定所述目标对象的三维检测框信息。Determine the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset.
这里可以是针对偏移量的预测,结合偏移量以及三维先验框可以得到更为准确的三维检测框。Here it can be the prediction for the offset. Combining the offset and the 3D prior frame can get a more accurate 3D detection frame.
在一种可能的实施方式中,所述根据所述深度图以及所述特征图确定三维检测框偏移量,包括:In a possible implementation manner, the determining the offset of the 3D detection frame according to the depth map and the feature map includes:
基于所述目标对象所在二维检测框信息包括的位置范围,分别从所述深度图以及所述特征图中提取与所述位置范围匹配的深度图和特征图;Based on the position range included in the two-dimensional detection frame information where the target object is located, extracting a depth map and a feature map matching the position range from the depth map and the feature map respectively;
基于与所述位置范围匹配的深度图和特征图确定所述三维检测框偏移量。The three-dimensional detection frame offset is determined based on a depth map and a feature map matched with the position range.
这里,利用目标对象所在二维检测框信息包括的位置范围可以实现深度图和特征图在对应位置处的裁剪,这将使得所预测得到的偏移量是针对目标对象的,而不包含其他干扰区域的相关信息,提升预测精度。Here, the position range included in the two-dimensional detection frame information of the target object can be used to crop the depth map and the feature map at the corresponding positions. As a result, the predicted offset is specific to the target object and does not contain information from other interfering regions, which improves the prediction accuracy.
在一种可能的实施方式中,所述三维先验框信息为多个;所述基于所述三维先验框信息以及所述三维检测框偏移量,确定所述目标对象的三维检测框信息,包括:In a possible implementation manner, there are multiple pieces of three-dimensional prior frame information; and the determining the three-dimensional detection frame information of the target object based on the three-dimensional prior frame information and the three-dimensional detection frame offset includes:
确定与每个所述三维先验框信息对应的权重;Determining the weight corresponding to each of the three-dimensional prior frame information;
基于各个所述三维先验框信息、每个所述三维先验框信息对应的权重、以及所述三维检测框偏移量,确定所述目标对象的三维检测框信息。The 3D detection frame information of the target object is determined based on each of the 3D prior frame information, the weight corresponding to each of the 3D prior frame information, and the 3D detection frame offset.
在一种可能的实施方式中,所述方法还包括:In a possible implementation manner, the method also includes:
根据所述深度图以及所述特征图确定所述目标对象所在类别信息包括的各个子类别的预测概率;determining the predicted probability of each subcategory included in the category information of the target object according to the depth map and the feature map;
所述确定与每个所述三维先验框信息对应的权重,包括:The determination of the weight corresponding to each of the three-dimensional prior frame information includes:
基于各个子类别的预测概率,确定与每个所述子类别对应的三维先验框信息的权重。Based on the prediction probability of each subcategory, the weight of the three-dimensional prior frame information corresponding to each of the subcategories is determined.
考虑到针对不同的子类别的预测概率并不相同,概率越大,说明目标对象指向对应子类别的可能性也就越高,进而可以为对应的三维先验框信息赋予更高的权重,这将进一步提升最终所得到的三维检测框的预测精度。Considering that the predicted probabilities of different subcategories are not the same, a greater probability indicates a higher possibility that the target object belongs to the corresponding subcategory, so the corresponding three-dimensional prior frame information can be given a higher weight, which will further improve the prediction accuracy of the final three-dimensional detection frame.
在一种可能的实施方式中,所述对所述特征图进行检测,得到针对所述目标对象的二维检测信息,包括:In a possible implementation manner, detecting the feature map to obtain the two-dimensional detection information for the target object includes:
根据所述特征图确定二维检测框偏移量;determining a two-dimensional detection frame offset according to the feature map;
基于预设的二维先验框信息以及所述二维检测框偏移量,确定所述目标对象的二维检测信息。The two-dimensional detection information of the target object is determined based on the preset two-dimensional prior frame information and the offset of the two-dimensional detection frame.
在一种可能的实施方式中,所述深度图为通过训练好的深度图生成网络确定的;所述深度图生成网络是由图像样本、以及基于所述图像样本中标注的目标对象的三维标注框信息确定的标注深度图训练得到的。In a possible implementation manner, the depth map is determined by a trained depth map generation network; the depth map generation network is trained with image samples and labeled depth maps determined based on the three-dimensional annotation frame information of the target objects annotated in the image samples.
在一种可能的实施方式中,所述目标对象的三维标注框信息包括标注框底面中心点的位置坐标和深度值;按照如下步骤获取所述标注深度图:In a possible implementation manner, the three-dimensional annotation frame information of the target object includes position coordinates and depth values of the center point of the bottom surface of the annotation frame; the annotation depth map is acquired according to the following steps:
基于三维标注框所在三维坐标系与标注框底面中心点所在地面坐标系之间的对应关系,将所述目标对象标注的三维标注框信息投影至地面,得到所述目标对象在地面的投影区域及所述投影区域所在延展区域;Based on the correspondence between the three-dimensional coordinate system of the three-dimensional annotation frame and the ground coordinate system of the center point of the bottom surface of the annotation frame, projecting the three-dimensional annotation frame information annotated for the target object onto the ground to obtain the projection area of the target object on the ground and the extended area in which the projection area is located;
基于所述三维标注框信息包括的标注框底面中心点的位置坐标和深度值确定所述延展区域上各三维标注点的深度值;Determining the depth value of each three-dimensional label point on the extended area based on the position coordinates and depth values of the center point of the bottom surface of the label box included in the three-dimensional label frame information;
基于相机坐标系与图像坐标系之间的对应关系,将所述相机坐标系下所述延展区域上的各三维标注点投影至所述图像坐标系下的像素平面,得到所述像素平面中的投影点;Based on the correspondence between the camera coordinate system and the image coordinate system, projecting each three-dimensional annotation point on the extended area in the camera coordinate system to the pixel plane in the image coordinate system, to obtain projection points in the pixel plane;
基于所述延展区域上各三维标注点的深度值以及所述像素平面中的投影点,得到所述标注深度图。The marked depth map is obtained based on the depth value of each three-dimensional marked point on the extended area and the projected point in the pixel plane.
这里的标注深度图可以是结合地面投影操作以及坐标系之间的转换操作实现的。通过延展区域的构建可以使得目标对象在内的局部地面区域得以完整的被覆盖,利用延展区域的三维投影结果可以得到对应的标注深度图,该标注深度图可以反映包括局部地面区域在内的延展区域的深度信息,该深度信息可以针对性的辅助对应局部地面区域上的目标对象进行三维检测。The labeled depth map here can be obtained by combining a ground projection operation with conversions between coordinate systems. Constructing the extended area ensures that the local ground area containing the target object is completely covered; the corresponding labeled depth map can then be obtained from the three-dimensional projection result of the extended area. This labeled depth map reflects the depth information of the extended area, including the local ground area, and this depth information can specifically assist the three-dimensional detection of the target object on the corresponding local ground area.
在一种可能的实施方式中,所述基于所述三维标注框信息包括的标注框底面中心点的位置坐标和深度值确定所述延展区域上各三维标注点的深度值,包括:In a possible implementation manner, the determining the depth value of each three-dimensional label point on the extended area based on the position coordinates and depth values of the center point of the bottom surface of the label box included in the three-dimensional label box information includes:
将所述标注框底面中心点的深度值和位置坐标分别确定为所述延展区域的中心点的深度值和位置坐标;Determining the depth value and position coordinates of the center point of the bottom surface of the label frame as the depth value and position coordinates of the center point of the extended area, respectively;
在确定所述延展区域的中心点的位置坐标的情况下,以所述延展区域的中心点的深度值为起始深度值,以预设深度间隔确定所述延展区域中各三维标注点的深度值。With the position coordinates of the center point of the extended area determined, taking the depth value of the center point of the extended area as the starting depth value, and determining the depth value of each three-dimensional annotation point in the extended area at a preset depth interval.
第二方面,本公开实施例还提供了一种目标检测的装置,所述装置包括:In the second aspect, the embodiment of the present disclosure also provides a target detection device, the device comprising:
提取模块,配置为对待检测图像进行特征提取,得到所述待检测图像的特征图;生成模块,配置为基于所述特征图,生成所述待检测图像中的目标对象投影至地面的投影区域对应的深度图;第一检测模块,配置为基于所述深度图以及所述特征图,确定所述目标对象的三维检测信息。The extraction module is configured to perform feature extraction on the image to be detected to obtain a feature map of the image to be detected; the generation module is configured to generate a projection area corresponding to the projection of the target object in the image to be detected to the ground based on the feature map The depth map; the first detection module is configured to determine the three-dimensional detection information of the target object based on the depth map and the feature map.
第三方面,本公开实施例还提供了一种电子设备,包括:处理器、存储器和总线,所述存储器存储有所述处理器可执行的机器可读指令,当电子设备运行时,所述处理器与所述存储器之间通过总线通信,所述机器可读指令被所述处理器执行时执行如第一方面及其各种实施方式任一所述的目标检测的方法的步骤。In a third aspect, an embodiment of the present disclosure further provides an electronic device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device runs, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the target detection method according to the first aspect or any of its implementation manners are performed.
第四方面,本公开实施例还提供了一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器运行时执行如第一方面及其各种实施方式任一所述的目标检测的方法的步骤。In the fourth aspect, the embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, and the computer program is executed when the processor runs, as in the first aspect and its various implementation modes The steps of any one of the methods for target detection.
第五方面,本公开实施例还提供了一种计算机程序产品,包括存储了程序代码的计算机可读存储介质,所述程序代码包括的指令被计算机设备的处理器运行时,实现如第一方面及其各种实施方式任一所述的目标检测的方法的步骤。In a fifth aspect, an embodiment of the present disclosure further provides a computer program product, including a computer-readable storage medium storing program code; when the instructions included in the program code are executed by a processor of a computer device, the steps of the target detection method according to the first aspect or any of its implementation manners are implemented.
关于上述目标检测的装置、电子设备、及计算机可读存储介质的效果描述参见上述目标检测的方法的说明。For the effect description of the above-mentioned object detection device, electronic equipment, and computer-readable storage medium, refer to the description of the above-mentioned object detection method.
为使本公开的上述目的、特征和优点能更明显易懂,下文特举较佳实施例,并配合所附 附图,作详细说明如下。In order to make the above-mentioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments are specifically cited below, together with the accompanying drawings, and described in detail as follows.
附图说明Description of drawings
为了更清楚地说明本公开实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,此处的附图被并入说明书中并构成本说明书中的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。应当理解,以下附图仅示出了本公开的某些实施例,因此不应被看作是对范围的限定,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他相关的附图。In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings used in the embodiments are briefly introduced below. The drawings here are incorporated into and constitute a part of the specification; they show embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings show only some embodiments of the present disclosure and therefore should not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.
图1示出了本公开实施例所提供的一种目标检测的方法的流程图;FIG. 1 shows a flow chart of a method for target detection provided by an embodiment of the present disclosure;
图2A示出了本公开实施例所提供的一种确定目标对象的二维检测信息的方法的流程图;FIG. 2A shows a flowchart of a method for determining two-dimensional detection information of a target object provided by an embodiment of the present disclosure;
图2B示出了本公开实施例所提供的一种确定针对目标对象的三维先验框信息的方法的流程图;FIG. 2B shows a flow chart of a method for determining 3D prior frame information for a target object provided by an embodiment of the present disclosure;
图2C示出了本公开实施例所提供的一种确定针对目标对象的一个三维先验框信息的方法的流程图;FIG. 2C shows a flow chart of a method for determining a three-dimensional prior frame information for a target object provided by an embodiment of the present disclosure;
图2D示出了本公开实施例所提供的一种确定目标对象的三维检测框信息的方法的流程图;FIG. 2D shows a flow chart of a method for determining the three-dimensional detection frame information of a target object provided by an embodiment of the present disclosure;
图2E示出了本公开实施例所提供的一种确定目标对象的三维检测框偏移量的方法的流程图;FIG. 2E shows a flow chart of a method for determining the offset of a three-dimensional detection frame of a target object provided by an embodiment of the present disclosure;
图2F示出了本公开实施例所提供的一种深度图生成网络的训练过程的方法的流程图;FIG. 2F shows a flowchart of a method for training a depth map generation network provided by an embodiment of the present disclosure;
图2G示出了本公开实施例所提供的一种获取标注深度图的方法的流程图;FIG. 2G shows a flow chart of a method for obtaining a marked depth map provided by an embodiment of the present disclosure;
图2H示出了本公开实施例所提供的一种确定延展区域上各三维标注点的深度值的方法的流程图;FIG. 2H shows a flowchart of a method for determining the depth value of each three-dimensional label point on the extended area provided by an embodiment of the present disclosure;
图2I示出了本公开实施例所提供的一种局部地面深度标签生成的方法的应用示意图;FIG. 2I shows a schematic diagram of the application of a method for generating local ground depth labels provided by an embodiment of the present disclosure;
图2J示出了本公开实施例所提供的一种目标检测的方法的应用示意图;Fig. 2J shows a schematic diagram of the application of a target detection method provided by an embodiment of the present disclosure;
图3示出了本公开实施例所提供的一种目标检测的装置的示意图;FIG. 3 shows a schematic diagram of a target detection device provided by an embodiment of the present disclosure;
图4示出了本公开实施例所提供的一种电子设备的示意图。Fig. 4 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
具体实施方式Detailed Description of Embodiments
为使本公开实施例的目的、技术方案和优点更加清楚,下面将结合本公开实施例中附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本公开一部分实施例,而不是全部的实施例。通常在此处附图中描述和示出的本公开实施例的组件可以以各种不同的配置来布置和设计。因此,以下对在附图中提供的本公开的实施例的详细描述并非旨在限制要求保护的本公开的范围,而是仅仅表示本公开的选定实施例。基于本公开的实施例,本领域技术人员在没有做出创造性劳动的前提下所获得的所有其他实施例,都属于本公开保护的范围。In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part, rather than all, of the embodiments of the present disclosure. The components of the embodiments of the present disclosure, as generally described and illustrated in the figures herein, may be arranged and designed in a variety of different configurations. Accordingly, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。It should be noted that like numerals and letters denote similar items in the following figures, therefore, once an item is defined in one figure, it does not require further definition and explanation in subsequent figures.
本文中术语“和/或”,仅仅是描述一种关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中术语“至少一种”表示多种中的任意一种或多种中的至少两种的任意组合,例如,包括A、B、C中的至少一种,可以表示包括从A、B和C构成的集合中选择的任意一个或多个元素。The term "and/or" herein merely describes an association relationship and indicates that three relationships may exist; for example, A and/or B may mean three cases: A alone, both A and B, or B alone. In addition, the term "at least one" herein means any one of multiple items or any combination of at least two of multiple items; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C.
经研究发现,随着深度学习在目标检测领域,特别是3D目标检测方面上的成功应用,检测精度已经到了非常高的地步。常用的3D目标检测方法是基于LiDAR数据,但是由于采集数据设备昂贵,难以满足大规模的应用和部署。而单目图像的3D目标检测方式可以是采用汽车的车载摄像头,经济可用。对于单个视角的图像,单目3D检测的任务是从3D场景中检测出目标对象的3D几何信息和语义信息,包括目标对象的长宽高、中心点和朝向角信息。Research has found that with the successful application of deep learning in the field of target detection, especially 3D target detection, detection accuracy has reached a very high level. Commonly used 3D target detection methods are based on LiDAR data, but the data acquisition equipment is expensive, making large-scale application and deployment difficult. In contrast, 3D target detection on monocular images can use a vehicle-mounted camera, which is economical and readily available. For a single-view image, the task of monocular 3D detection is to detect the 3D geometric and semantic information of the target object from the 3D scene, including the length, width, height, center point, and orientation angle of the target object.
相关技术中,基于单目图像的3D目标检测技术依赖于一些外部的子任务,这些子任务负责执行2D目标检测、深度图估计等任务。由于子任务单独训练,本身存在精度损失,限制了网络模型的性能上限,无法满足3D检测的精度需求,很难用于实际的应用中。In related technologies, the monocular image-based 3D object detection technology relies on some external subtasks, and these subtasks are responsible for performing tasks such as 2D object detection and depth map estimation. Since the subtasks are trained separately, there is a loss of accuracy, which limits the upper limit of the performance of the network model and cannot meet the accuracy requirements of 3D detection, so it is difficult to be used in practical applications.
目前3D目标检测方法的困难在于3D检测框的深度预测。3D目标检测的标签只提供了目标框中心点或者角点的深度信息,网络难以进行学习,无法生成更多数量、更为准确的深度信息。这是由于相关技术中的3D目标检测方法是通过其子任务,比如深度估计、伪点云生成和语义分割的预测结果来指导3D检测框的学习,但是其子任务需要大量精准的深度标签,很难用于实际应用当中,且子任务的精度限制了3D目标检测的性能上限,在3D目标检测中并不是非常可靠。The difficulty of current 3D target detection methods lies in the depth prediction of the 3D detection frame. The labels for 3D target detection only provide the depth information of the center point or corner points of the target frame, which makes it difficult for the network to learn and prevents it from generating a larger quantity of more accurate depth information. This is because the 3D target detection methods in the related art guide the learning of the 3D detection frame through the prediction results of subtasks such as depth estimation, pseudo point cloud generation, and semantic segmentation; however, these subtasks require a large number of accurate depth labels, which are difficult to obtain in practical applications, and the accuracy of the subtasks limits the performance upper bound of 3D target detection, making them not very reliable for 3D target detection.
基于上述研究,本公开提供了一种目标检测的方法、装置、电子设备及存储介质、计算机程序产品,以提升3D目标检测的精度。Based on the above research, the present disclosure provides a method, device, electronic device, storage medium, and computer program product for object detection, so as to improve the accuracy of 3D object detection.
为便于对本实施例进行理解,首先对本公开实施例所公开的一种目标检测的方法进行详细介绍,本公开实施例所提供的目标检测的方法的执行主体一般为具有一定计算能力的计算机设备,该计算机设备例如包括:终端设备或服务器或其它处理设备,终端设备可以为用户设备(User Equipment,UE)、移动设备、用户终端、终端、蜂窝电话、无绳电话、个人数字助理(Personal Digital Assistant,PDA)、手持设备、计算设备、车载设备、可穿戴设备等。在一些可能的实现方式中,该目标检测的方法可以通过处理器调用存储器中存储的计算机可读指令的方式来实现。In order to facilitate the understanding of this embodiment, a method for target detection disclosed in the embodiments of the present disclosure is firstly introduced in detail. The execution subject of the method for target detection provided in the embodiments of the present disclosure is generally a computer device with certain computing capabilities. The computer equipment includes, for example: terminal equipment or server or other processing equipment, and the terminal equipment can be user equipment (User Equipment, UE), mobile equipment, user terminal, terminal, cellular phone, cordless phone, personal digital assistant (Personal Digital Assistant, PDA), handheld devices, computing devices, vehicle-mounted devices, wearable devices, etc. In some possible implementation manners, the object detection method may be implemented by a processor invoking computer-readable instructions stored in a memory.
图1为本公开实施例提供的一种目标检测的方法的流程图,所述方法由电子设备执行,所述方法包括步骤S101至步骤S103,其中:FIG. 1 is a flow chart of a method for target detection provided by an embodiment of the present disclosure, the method is executed by an electronic device, and the method includes steps S101 to S103, wherein:
步骤S101、对待检测图像进行特征提取,得到待检测图像的特征图;Step S101, performing feature extraction on the image to be detected to obtain a feature map of the image to be detected;
步骤S102、基于特征图,生成待检测图像中的目标对象投影至地面的投影区域对应的深度图;Step S102, based on the feature map, generate a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground;
步骤S103、基于深度图以及特征图,确定目标对象的三维检测信息。Step S103, based on the depth map and the feature map, determine the three-dimensional detection information of the target object.
为了便于理解本公开实施例提供的目标检测的方法,接下来首先对该方法的应用场景进行详细介绍。上述目标检测的方法可以应用于计算机视觉领域,例如可以应用于无人驾驶中的车辆检测、无人机探测等场景。考虑到无人驾驶的广泛应用,接下来多以车辆检测进行示例说明。In order to facilitate the understanding of the target detection method provided by the embodiment of the present disclosure, the application scenario of the method will firstly be introduced in detail. The above target detection method can be applied to the field of computer vision, for example, it can be applied to scenarios such as vehicle detection in unmanned driving, and drone detection. Considering the wide application of unmanned driving, the following is an example of vehicle detection.
相关技术中的3D目标检测技术依赖于一些外部的子任务,这些子任务负责执行2D目标检测、深度图估计等任务。由于子任务单独训练,本身存在精度损失,导致最终的3D检测精度也不高。The 3D object detection technology in the related art relies on some external subtasks, and these subtasks are responsible for performing tasks such as 2D object detection and depth map estimation. Due to the separate training of the subtasks, there is a loss of accuracy in itself, resulting in a low final 3D detection accuracy.
正是为了解决上述问题,本公开实施例才提供了一种结合局部深度图以及特征图进行三维检测的方案,检测的精度较高。Just to solve the above problems, the embodiments of the present disclosure provide a solution for three-dimensional detection in combination with a local depth map and a feature map, and the detection accuracy is high.
本公开实施例中的待检测图像可以是在目标场景下采集到的图像,不同的应用场景所对应采集的图像也不同。以无人驾驶为例,这里的待检测图像可以是无人驾驶汽车上安装的摄像装置在车辆行进的过程中所采集的图像,该图像中可以包括摄像装置的拍摄视野内所有的目标对象,这里的目标对象可以是前方车辆,还可以是前方行人,在此不做限制。The image to be detected in the embodiments of the present disclosure may be an image collected in a target scene, and different application scenes correspond to different collected images. Taking unmanned driving as an example, the image to be detected here can be the image collected by the camera device installed on the driverless car during the driving process of the vehicle, and the image can include all target objects within the shooting field of view of the camera device. The target object here may be a vehicle in front or a pedestrian in front, which is not limited here.
在进行三维检测之前,本公开实施例可以针对待检测图像,利用各种特征提取方法进行特征图的提取。例如,可以通过图像处理,从待检测图像中提取出特征图,再如可以利用训练好的特征提取网络进行特征图的提取。Before performing three-dimensional detection, the embodiments of the present disclosure may use various feature extraction methods to extract feature maps for the image to be detected. For example, the feature map can be extracted from the image to be detected through image processing, and for example, the feature map can be extracted by using a trained feature extraction network.
考虑到特征提取网络可以挖掘更深层次的图像特征,本公开实施例中可以采用特征提取网络提取特征图。这里的特征提取网络可以是卷积神经网络(Convolutional Neural Networks,CNN)。在实际应用中,可以采用包括卷积块、密集块和过渡块的CNN模型来实现,这里的卷积块可以由卷积层、批规范化(Batch normalization)层和线性整流层(Rectified Linear Unit, ReLU)组成,密集块可以由多个卷积块和多个跳转连接(Skip connection)组成,过渡块一般由卷积块和平均池化层组成。有关卷积块、密集块、过渡块的组成方式,例如,应用时包括几个卷积层、几个平均池化层等可以结合应用场景来确定,在此不做限制。Considering that the feature extraction network can mine deeper image features, the feature extraction network can be used to extract the feature map in the embodiments of the present disclosure. The feature extraction network here can be a convolutional neural network (Convolutional Neural Networks, CNN). In practical applications, it can be implemented by using a CNN model including convolutional blocks, dense blocks, and transition blocks. The convolutional blocks here can be composed of convolutional layers, batch normalization (Batch normalization) layers, and linear rectification layers (Rectified Linear Unit, ReLU), the dense block can be composed of multiple convolutional blocks and multiple skip connections (Skip connection), and the transitional block is generally composed of convolutional blocks and average pooling layers. The composition of convolutional blocks, dense blocks, and transitional blocks, for example, including several convolutional layers and average pooling layers during application, can be determined in conjunction with the application scenario, and there is no limitation here.
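For illustration only, the following is a minimal PyTorch-style sketch of the convolution block, dense block, and transition block described above; the channel counts, kernel sizes, and number of layers are assumptions, since the disclosure leaves these choices to the application scenario.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Sequential):
    # Convolution block: convolutional layer + batch normalization + ReLU.
    def __init__(self, in_ch, out_ch):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

class DenseBlock(nn.Module):
    # Dense block: several convolution blocks with skip connections.
    def __init__(self, ch, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(ConvBlock(ch, ch) for _ in range(num_layers))

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)  # skip connection
        return x

class TransitionBlock(nn.Sequential):
    # Transition block: convolution block + average pooling layer.
    def __init__(self, in_ch, out_ch):
        super().__init__(ConvBlock(in_ch, out_ch), nn.AvgPool2d(kernel_size=2))
```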
为了进行三维检测,本公开实施例可以基于提取出的特征图生成局部深度图,该局部深度图是与待检测图像中的目标对象投影至地面的投影区域相对应的,指向的是与目标对象关联的局部地面的深度信息。由于局部地面一定程度上与目标对象存在绑定关系,这样,再结合上述提取出的特征图可以更为准确的检测出目标对象。In order to perform three-dimensional detection, the embodiments of the present disclosure can generate a local depth map based on the extracted feature map. The local depth map corresponds to the projection area where the target object in the image to be detected is projected onto the ground, and points to the depth information of the local ground associated with the target object. Since the local ground is, to a certain extent, bound to the target object, the target object can be detected more accurately when this depth map is combined with the extracted feature map.
其中,上述局部深度图可以是利用训练好的深度图生成网络确定的。该深度图生成网络训练的是图像样本与标注深度图中对应像素点的特征与深度之间的对应关系,这样,在将提取出的特征图输入到训练好的深度图生成网络的情况下,可以输出指向目标对象在地面投影区域所对应的深度图。The local depth map may be determined by a trained depth map generation network. This network is trained on the correspondence between the features of image samples and the depths of the corresponding pixels in the labeled depth maps. In this way, when the extracted feature map is input into the trained depth map generation network, the depth map corresponding to the ground projection area of the target object can be output.
在实际应用中,可以结合区域特征聚类方法(ROI-align)方式对特征图和深度图进行裁剪,根据裁剪得到的与目标对象对应的深度图和特征图实现有关目标对象的三维检测。In practical applications, the feature map and the depth map can be cropped by means of the ROI-align operation, and the three-dimensional detection of the target object can be realized according to the cropped depth map and feature map corresponding to the target object.
本公开实施例中的三维检测可以是基于三维先验框的残差预测,这是考虑到在残差预测中,可以基于原始三维先验框的信息来指导后续的三维检测,例如,可以以原始三维先验框为初始位置,在该初始位置附近进行三维检测框的搜索,特别是在三维先验框的准确度比较高的情况下,这相比直接进行三维检测将显著提升检测的效率。The 3D detection in the embodiments of the present disclosure may be residual prediction based on a 3D prior frame. This is because, in residual prediction, the information of the original 3D prior frame can guide the subsequent 3D detection; for example, the original 3D prior frame can be taken as an initial position, and the 3D detection frame can be searched for near this initial position. Especially when the accuracy of the 3D prior frame is relatively high, this significantly improves detection efficiency compared with direct 3D detection.
上述三维先验框可以是基于二维检测信息确定的,这样即可以基于三维先验框信息、深度图以及特征图实现三维检测。The above-mentioned 3D prior frame may be determined based on 2D detection information, so that 3D detection may be realized based on 3D prior frame information, a depth map, and a feature map.
本公开实施例可以按照图2A所示的步骤确定目标对象的二维检测信息,所述步骤包括:In the embodiment of the present disclosure, the two-dimensional detection information of the target object can be determined according to the steps shown in FIG. 2A, and the steps include:
步骤S201、根据特征图确定二维检测框偏移量;Step S201, determining the offset of the two-dimensional detection frame according to the feature map;
步骤S202、基于预设的二维先验框信息以及二维检测框偏移量,确定目标对象的二维检测信息。Step S202, based on the preset 2D prior frame information and the offset of the 2D detection frame, determine the 2D detection information of the target object.
这里可以基于二维检测框偏移量以及预设的二维先验框信息之间的运算结果,确定二维检测信息。Here, the two-dimensional detection information can be determined based on the calculation result between the offset of the two-dimensional detection frame and the preset two-dimensional prior frame information.
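As a sketch of the arithmetic described here, the code below assumes the common anchor-based parameterization (center offsets scaled by the prior size, log-scale width/height residuals); the disclosure does not fix the exact form of the operation, so this is one plausible realization.

```python
import numpy as np

def decode_2d_box(prior, offset):
    # prior and offset are (cx, cy, w, h)-style vectors; the offset is the
    # residual predicted by the first target detection network.
    cx = prior[0] + offset[0] * prior[2]  # shift the center by a scaled offset
    cy = prior[1] + offset[1] * prior[3]
    w = prior[2] * np.exp(offset[2])      # rescale width/height exponentially
    h = prior[3] * np.exp(offset[3])
    return np.array([cx, cy, w, h])

# Hypothetical example: a 100x80 prior box at (200, 150) with a small residual.
box_2d = decode_2d_box(np.array([200.0, 150.0, 100.0, 80.0]),
                       np.array([0.1, -0.05, 0.2, 0.0]))
```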
其中,本公开实施例中的二维检测信息可以是利用训练好的第一目标检测网络对特征图进行二维检测得到的。这里的第一目标检测网络训练的可以是图像样本的特征图与二维标注信息之间的对应关系,也可以训练的是图像样本的特征图与偏移量(对应二维标注框与二维先验框之间的差值)之间的对应关系,利用前一对应关系可以直接确定待检测图像中目标对象的二维检测信息,利用后一对应关系可以先确定偏移量而后将偏移量与二维先验框求和来确定目标对象的二维检测信息。The two-dimensional detection information in the embodiments of the present disclosure may be obtained by performing two-dimensional detection on the feature map with a trained first target detection network. The first target detection network may be trained on the correspondence between the feature maps of image samples and two-dimensional annotation information, or on the correspondence between the feature maps of image samples and offsets (the differences between two-dimensional annotation frames and two-dimensional prior frames). With the former correspondence, the two-dimensional detection information of the target object in the image to be detected can be determined directly; with the latter, the offset is determined first and then summed with the two-dimensional prior frame to determine the two-dimensional detection information of the target object.
不管是采用上述哪种对应关系,所确定的二维检测信息均可以包括二维检测框的位置信息(x_2d, y_2d, w_2d, h_2d)、中心点位置信息(x_p, y_p),朝向角(α_3d),目标对象所属类别信息(cls),还可以包括其它与二维检测相关的信息,在此不做限制。Regardless of which of the above correspondences is adopted, the determined two-dimensional detection information may include the position information of the two-dimensional detection frame (x_2d, y_2d, w_2d, h_2d), the center point position information (x_p, y_p), the orientation angle (α_3d), and the category information (cls) of the target object, and may also include other information related to two-dimensional detection, which is not limited here.
考虑到残差预测所具有的优良特性,这里的第一目标检测网络可以是二维残差预测。在实际应用中,这里的第一目标检测网络可以先通过一个卷积层和一个线性整流层进行降维,然后通过多个卷积层分别进行二维检测框的残差预测,预测的准确度较高。Considering the favorable properties of residual prediction, the first target detection network here may perform two-dimensional residual prediction. In practical applications, the first target detection network may first perform dimensionality reduction through one convolutional layer and one rectified linear layer, and then perform residual prediction of the two-dimensional detection frame through multiple convolutional layers, which yields high prediction accuracy.
本公开实施例中,基于上述第一目标检测网络所确定的二维检测信息可以确定三维先验框信息,如图2B所示,所述步骤包括:In the embodiment of the present disclosure, based on the two-dimensional detection information determined by the above-mentioned first target detection network, the three-dimensional prior frame information can be determined, as shown in FIG. 2B, the steps include:
步骤S301、基于目标对象所属类别信息,确定目标对象所属类别包括的各个子类别的聚类信息;Step S301, based on the category information of the target object, determine the clustering information of each subcategory included in the category to which the target object belongs;
步骤S302、根据各个子类别的聚类信息以及目标对象所在二维检测框信息,确定针对目标对象的三维先验框信息。Step S302, according to the clustering information of each subcategory and the two-dimensional detection frame information where the target object is located, determine the three-dimensional prior frame information for the target object.
这里,可以结合目标对象所属类别包括的各个子类别的聚类信息以及目标对象所在二维检测框信息来确定三维先验框信息,这是考虑到对于同属一个类别的目标对象而言,不同子类别所对应的三维检测结果将存在一定的区别,例如,针对同属车辆这一类的各个目标而言,对于小汽车这一子类别而言,其三维检测框的大小与大卡车这一子类别的三维检测框大小即存在较大的区别。为了兼顾各个子类别被三维预测到的可能性,这里可以预先进行子类别的划分,并基于每一个划分子类别的聚类信息实现对应三维先验框信息的确定。Here, the three-dimensional prior frame information can be determined by combining the clustering information of each subcategory included in the category of the target object with the two-dimensional detection frame information of the target object. This is because, for target objects of the same category, the three-dimensional detection results corresponding to different subcategories will differ to some extent. For example, among targets that all belong to the vehicle category, the size of the three-dimensional detection frame for the car subcategory differs considerably from that for the truck subcategory. To account for the possibility of each subcategory being predicted in three dimensions, subcategories can be divided in advance, and the corresponding three-dimensional prior frame information can be determined based on the clustering information of each subcategory.
本公开实施例中在确定目标对象所属类别信息的情况下,可以确定这一类别信息所对应的聚类结果。这里仍以车辆作为目标对象为例,预先可以收集包括各种子类别的车辆图像样本,该车辆图像样本确定有车辆的长宽高等信息。针对车辆图像样本而言,可以基于高度值进行聚类,这样,同属一个高度范围的车辆图像样本可以对应划分到一个子类别,进而可以确定出这一子类别的聚类信息。在实际应用中,可以采用K均值聚类算法(k-means clustering algorithm,K-means)等聚类方法来实现上述聚类过程。In the embodiment of the present disclosure, in the case of determining the category information of the target object, the clustering result corresponding to this category information may be determined. Still taking the vehicle as the target object here, vehicle image samples including various subcategories may be collected in advance, and the vehicle image samples are determined to have information such as the length, width, and height of the vehicle. For vehicle image samples, clustering can be performed based on height values, so that vehicle image samples belonging to the same height range can be correspondingly divided into a subcategory, and then the clustering information of this subcategory can be determined. In practical applications, clustering methods such as K-means clustering algorithm (K-means) can be used to realize the above clustering process.
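For illustration, a minimal sketch of this subcategory clustering with scikit-learn's K-means, run on hypothetical annotated vehicle heights; the number of clusters and the sample values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical annotated 3D heights (in meters) of collected vehicle samples.
heights = np.array([[1.5], [1.6], [1.4], [3.2], [3.5], [2.0], [1.55], [3.4]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(heights)
# Each cluster center serves as the cluster height value of one subcategory
# (e.g., car / van / truck).
cluster_heights = sorted(c[0] for c in kmeans.cluster_centers_)
```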
有关结合聚类信息以及目标对象所在二维检测框信息确定三维先验框信息的过程可以按照图2C所示的步骤来实现,所述步骤包括:The process of determining the three-dimensional a priori frame information by combining the clustering information and the two-dimensional detection frame information where the target object is located can be implemented according to the steps shown in Figure 2C, and the steps include:
步骤S3021、针对各个子类别中的每个子类别,基于该子类别的聚类信息包括的聚类高度值、目标对象所在二维检测框信息包括的宽度值,确定与该子类别对应的深度值;Step S3021, for each subcategory in each subcategory, based on the cluster height value included in the cluster information of the subcategory and the width value included in the two-dimensional detection frame information where the target object is located, determine the depth value corresponding to the subcategory ;
步骤S3022、基于该子类别的聚类信息以及与该子类别对应的深度值,确定针对目标对象的一个三维先验框信息。Step S3022, based on the clustering information of the sub-category and the depth value corresponding to the sub-category, determine a 3D prior frame information for the target object.
这里,每个子类别可以对应一个三维先验框信息,有关三维先验框的尺寸等信息可以是由对应子类别的聚类信息来确定的;有关深度信息则可以是由聚类高度值以及二维检测框信息包括的宽度值来确定的,在实际应用中,可以先进行聚类高度值与宽度值之间的比值运算,而后进行摄像装置的焦距的乘法运算来实现。Here, each subcategory can correspond to one piece of three-dimensional prior frame information. Information such as the size of the three-dimensional prior frame can be determined from the clustering information of the corresponding subcategory, while the depth information can be determined from the cluster height value and the width value included in the two-dimensional detection frame information. In practical applications, this can be realized by first computing the ratio between the cluster height value and the width value, and then multiplying by the focal length of the camera.
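A minimal numeric sketch of this ratio-then-multiply rule under a pinhole-camera assumption; the focal length, cluster height, and box size below are hypothetical.

```python
def prior_depth(focal_length_px, cluster_height_m, box_size_px):
    # Depth of the prior box: focal length times the ratio between the
    # subcategory's cluster height and the 2D box size in pixels.
    return focal_length_px * cluster_height_m / box_size_px

# Hypothetical numbers: f = 720 px, a 1.5 m car subcategory, a 90 px box.
z_prior = prior_depth(720.0, 1.5, 90.0)  # -> 12.0 m
```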
在确定三维先验框信息的情况下,本公开实施例可以结合这一信息、以及深度图和特征图来确定三维检测框信息,如图2D所示,包括如下步骤:In the case of determining the three-dimensional prior frame information, the embodiment of the present disclosure may combine this information, as well as the depth map and the feature map to determine the three-dimensional detection frame information, as shown in Figure 2D, including the following steps:
步骤S1031、根据深度图以及特征图确定三维检测框偏移量;Step S1031, determining the offset of the three-dimensional detection frame according to the depth map and the feature map;
步骤S1032、基于三维先验框信息以及三维检测框偏移量,确定目标对象的三维检测框信息。Step S1032: Determine the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset.
这里,可以利用第二目标检测网络实现三维检测,得到第二目标检测网络输出的三维检测框偏移量,而后基于三维先验框信息以及三维检测框偏移量,确定目标对象的三维检测框信息。Here, the second target detection network can be used to realize three-dimensional detection, and the offset of the three-dimensional detection frame output by the second target detection network can be obtained, and then the three-dimensional detection frame of the target object can be determined based on the three-dimensional prior frame information and the offset of the three-dimensional detection frame information.
其中,上述三维检测框信息可以包括检测框的形状信息(w_3d, h_3d, l_3d)和深度信息(z_3d)。Wherein, the above three-dimensional detection frame information may include the shape information (w_3d, h_3d, l_3d) and the depth information (z_3d) of the detection frame.
需要说明的是,相比二维预测而言,三维预测可以确定出目标对象更多维度的信息,例如,这里还可以确定目标对象所在类别信息包括的各个子类别,例如,可以确定属于车辆这一类别的目标对象是小汽车还是大卡车。It should be noted that, compared with two-dimensional prediction, three-dimensional prediction can determine information of the target object in more dimensions. For example, it can also determine the subcategories included in the category information of the target object, such as whether a target object belonging to the vehicle category is a car or a truck.
考虑到本公开实施例中的三维先验框可以有多个,基于每个三维先验框均可以对应预测一个三维检测框偏移量,又考虑到不同三维先验框所对应的子类别也不同,而不同的子类别的预测概率也不同,因而,这里可以首先基于各个子类别的预测概率,为与每个子类别对应的三维先验框信息赋予对应的权重,然后再基于各个三维先验框信息、每个三维先验框信息对应的权重、以及三维检测框偏移量,确定目标对象的三维检测框信息。Considering that there may be multiple 3D priori frames in the embodiments of the present disclosure, a 3D detection frame offset can be predicted based on each 3D priori frame, and considering that the subcategories corresponding to different 3D priori frames are also different, and the prediction probabilities of different subcategories are also different. Therefore, based on the prediction probabilities of each subcategory, the corresponding weights can be given to the three-dimensional prior frame information corresponding to each subcategory, and then based on each three-dimensional priori The frame information, the weight corresponding to each 3D prior frame information, and the offset of the 3D detection frame determine the 3D detection frame information of the target object.
这里,预测概率越高的子类别可以赋予更高的权重,以彰显对应三维先验框在后续三维检测中的作用,同理,预测概率越低的子类别可以赋予更小的权重,以弱化对应三维先验框在后续三维检测中的作用,从而使得所确定的三维检测框信息更为精确。Here, a subcategory with a higher predicted probability can be given a higher weight to emphasize the role of the corresponding 3D prior frame in the subsequent 3D detection; likewise, a subcategory with a lower predicted probability can be given a smaller weight to weaken the role of the corresponding 3D prior frame in the subsequent 3D detection, so that the determined 3D detection frame information is more accurate.
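For illustration, one plausible realization of this probability-weighted fusion of prior boxes is sketched below; the (w, h, l, z) parameterization and all numbers are assumptions.

```python
import numpy as np

def fuse_prior_boxes(priors, probs, offset):
    # Weight each subcategory's 3D prior box by its predicted probability,
    # then apply the predicted 3D residual to the fused prior.
    weights = probs / probs.sum()                  # normalize to sum to 1
    fused_prior = (weights[:, None] * priors).sum(axis=0)
    return fused_prior + offset

# Hypothetical (w, h, l, z) priors for three vehicle subcategories.
priors = np.array([[1.6, 1.5, 3.9, 12.0],
                   [1.9, 2.0, 5.0, 11.0],
                   [2.5, 3.4, 8.0, 10.0]])
probs = np.array([0.7, 0.2, 0.1])           # predicted subcategory probabilities
offset = np.array([0.05, -0.02, 0.1, 0.4])  # predicted 3D residual
box_3d = fuse_prior_boxes(priors, probs, offset)
```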
在一些实施例中,为了提升三维检测的精度,这里可以先进行深度图和特征图的裁剪,而后再进行三维检测,如图2E所示,可以通过如下步骤来实现:In some embodiments, in order to improve the accuracy of the 3D detection, the depth map and the feature map can be clipped first, and then the 3D detection can be performed, as shown in FIG. 2E , which can be achieved by the following steps:
步骤S1031a、基于目标对象所在二维检测框信息包括的位置范围,分别从深度图以及特征图中提取与位置范围匹配的深度图和特征图;Step S1031a, based on the location range included in the two-dimensional detection frame information where the target object is located, extracting a depth map and a feature map that match the location range from the depth map and the feature map, respectively;
步骤S1031b、基于与位置范围匹配的深度图和特征图确定三维检测框偏移量。Step S1031b. Determine the offset of the 3D detection frame based on the depth map and feature map matched with the location range.
这里,可以基于二维检测框信息包括的位置范围实现这一位置范围所对应的深度图和特征图的裁剪,也即,可以得到指向目标对象的局部深度图和局部特征图。基于局部深度图和 局部特征图可以确定对应的三维检测框偏移量,这里的三维检测框偏移量也可以是利用第二目标检测网络确定的。Here, the depth map and feature map corresponding to the position range can be clipped based on the position range included in the two-dimensional detection frame information, that is, the local depth map and local feature map pointing to the target object can be obtained. The corresponding 3D detection frame offset can be determined based on the local depth map and the local feature map, and the 3D detection frame offset here can also be determined by using the second target detection network.
在进行三维检测框偏移量预测的过程中,由于所采用的局部深度图和局部特征图可以排除其它无关特征的影响,使得预测的准确度更高。In the process of predicting the offset of the three-dimensional detection frame, since the local depth map and local feature map adopted can exclude the influence of other irrelevant features, the prediction accuracy is higher.
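A minimal sketch of this cropping step using torchvision's ROI-Align operator; the tensor sizes, box coordinates, and output resolution are hypothetical.

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 64, 96, 320)   # backbone feature map (assumed shape)
depth = torch.randn(1, 1, 96, 320)   # generated local ground depth map
# One 2D detection box: (batch_index, x1, y1, x2, y2) in feature-map coordinates.
rois = torch.tensor([[0.0, 100.0, 30.0, 180.0, 80.0]])

roi_feat = roi_align(feat, rois, output_size=(7, 7))    # (1, 64, 7, 7)
roi_depth = roi_align(depth, rois, output_size=(7, 7))  # (1, 1, 7, 7)
# The cropped depth map and feature map are then fed to the second target
# detection network to regress the 3D detection frame offsets.
```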
为了实现本公开实施例提供的目标检测的方法,需要进行第一目标检测网络和第二目标检测网络的训练。针对不同的目标检测网络可以设置对应的监督信号(即先验框信息),进而可以确定出对应的损失函数值,基于这些损失函数值可以反向传播指导网络训练,在此不做限制。In order to implement the object detection method provided by the embodiments of the present disclosure, training of the first object detection network and the second object detection network is required. For different target detection networks, corresponding supervisory signals (that is, prior frame information) can be set, and then the corresponding loss function values can be determined. Based on these loss function values, network training can be guided by backpropagation, and there is no limitation here.
考虑到目标对象在地面的投影区域所对应的深度图对于上述目标检测过程中的关键作用,本公开实施例还针对深度图设置了对应的监督信号(即标注深度图),可以通过深度图生成网络来实现。该深度图生成网络的训练过程如图2F所示,所述训练过程包括如下步骤:Considering the key role, in the above target detection process, of the depth map corresponding to the projection area of the target object on the ground, the embodiments of the present disclosure also set a corresponding supervisory signal (namely, the labeled depth map) for the depth map, which can be realized through a depth map generation network. The training process of the depth map generation network is shown in FIG. 2F and includes the following steps:
步骤S401、获取图像样本、以及基于图像样本中标注的目标对象的三维标注框信息确定的标注深度图;Step S401, acquiring an image sample and an annotated depth map determined based on the three-dimensional annotation frame information of the target object annotated in the image sample;
步骤S402、对图像样本进行特征提取,得到图像样本的特征图;Step S402, performing feature extraction on the image sample to obtain a feature map of the image sample;
步骤S403、将图像样本的特征图输入到待训练的深度图生成网络,得到深度图生成网络输出的深度图,并基于输出的深度图与标注深度图之间的相似度确定损失函数值;Step S403, input the feature map of the image sample into the depth map generation network to be trained, obtain the depth map output by the depth map generation network, and determine the loss function value based on the similarity between the output depth map and the marked depth map;
步骤S404、在损失函数值大于预设阈值的情况下,调整深度图生成网络的网络参数值,并将图像样本的特征图输入到调整后的深度图生成网络中,直至损失函数值小于或等于预设阈值。Step S404. When the loss function value is greater than the preset threshold, adjust the network parameter value of the depth map generation network, and input the feature map of the image sample into the adjusted depth map generation network until the loss function value is less than or equal to preset threshold.
这里所获取的图像样本与待检测图像的获取方式类似。除此之外,有关图像样本的特征图的提取可以参见上述待检测图像的特征图的提取过程。The image samples here are acquired in a similar way to the image to be detected. In addition, for the extraction of the feature maps of the image samples, reference may be made to the extraction process of the feature map of the image to be detected described above.
本公开实施例可以基于深度图生成网络输出的深度图与标注深度图之间的相似度确定损失函数值,并根据该损失函数值调整深度图生成网络的网络参数值,使得网络输出结果与标注结果趋于一致或者更为接近。The embodiments of the present disclosure can determine the loss function value based on the similarity between the depth map output by the depth map generation network and the labeled depth map, and adjust the network parameter values of the depth map generation network according to this loss function value, so that the network output converges toward, or comes closer to, the annotation result.
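For illustration, a minimal sketch of the depth supervision in steps S403 and S404, assuming an L1 loss restricted to pixels that actually received a label (unlabeled pixels are assumed to be encoded as 0); the network and optimizer objects are placeholders.

```python
import torch
import torch.nn.functional as F

def depth_loss(pred_depth, label_depth):
    # Supervise only where the projected extended regions produced labels.
    mask = label_depth > 0
    return F.l1_loss(pred_depth[mask], label_depth[mask])

# One training step (depth_net, feature_map, label_depth, optimizer assumed):
# loss = depth_loss(depth_net(feature_map), label_depth)
# if loss.item() > threshold:
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```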
其中,上述步骤401中所述标注深度图可以按照如图2G所示的步骤来获取:Wherein, the marked depth map described in the above step 401 can be obtained according to the steps shown in FIG. 2G:
步骤S4011、基于三维标注框所在三维坐标系与标注框底面中心点所在地面坐标系之间的对应关系,将目标对象标注的三维标注框信息投影至地面,得到目标对象在地面的投影区域及投影区域所在延展区域;Step S4011: based on the correspondence between the three-dimensional coordinate system of the three-dimensional annotation frame and the ground coordinate system of the center point of the bottom surface of the annotation frame, project the three-dimensional annotation frame information annotated for the target object onto the ground, to obtain the projection area of the target object on the ground and the extended area in which the projection area is located;
步骤S4012、基于三维标注框信息包括的标注框底面中心点的位置坐标和深度值确定延展区域上各三维标注点的深度值;Step S4012. Determine the depth value of each three-dimensional label point on the extended area based on the position coordinates and depth values of the center point of the bottom surface of the label box included in the three-dimensional label box information;
步骤S4013、基于相机坐标系与图像坐标系之间的对应关系,将相机坐标系下延展区域上的各三维标注点投影至图像坐标系下的像素平面,得到像素平面中的投影点;Step S4013, based on the corresponding relationship between the camera coordinate system and the image coordinate system, project the three-dimensional label points on the extended area under the camera coordinate system to the pixel plane under the image coordinate system to obtain the projection points in the pixel plane;
步骤S4014、基于延展区域上各三维标注点的深度值以及像素平面中的投影点,得到标注深度图。Step S4014, based on the depth values of the three-dimensional marked points on the extended area and the projected points in the pixel plane, the marked depth map is obtained.
在实施时,如图2H所示,基于三维标注框信息包括的标注框底面中心点的位置坐标和深度值确定延展区域上各三维标注点的深度值,包括如下步骤:During implementation, as shown in Figure 2H, the depth value of each three-dimensional label point on the extended area is determined based on the position coordinates and depth values of the center point of the bottom surface of the label box included in the three-dimensional label box information, including the following steps:
步骤S41、将所述标注框底面中心点的深度值和位置坐标分别确定为所述延展区域的中心点的深度值和位置坐标;Step S41, determining the depth value and position coordinates of the center point of the bottom surface of the label frame as the depth value and position coordinates of the center point of the extended area, respectively;
步骤S42、在确定所述延展区域的中心点的位置坐标的情况下,以所述延展区域的中心点的深度值为起始深度值,以预设深度间隔确定所述延展区域中各三维标注点的深度值。Step S42: with the position coordinates of the center point of the extended area determined, take the depth value of the center point of the extended area as the starting depth value, and determine the depth value of each three-dimensional annotation point in the extended area at a preset depth interval.
本公开实施例提供的是一种局部地面深度标签生成的方法。这里,可以利用目标对象的三维标注框信息包括的标注框底面中心点(该中心点落在地面上)的位置,得到周围地面(对应延展区域)的深度信息。在实施时,如图2I所示,图(a)中示出了标注的目标对象20以及目标对象的三维标注框21,这里,三维标注框底面中心点与周围地面在同一高度,在图(b)中,在中心点周围的一个延展区域22中可以生成大量的三维标注点23,这里的三维点包括中心点,还包括以延展区域22的中心点的深度值为起始深度值,以预设深度间隔确定的在延展区域上的各三维标注点23。An embodiment of the present disclosure provides a method for generating local ground depth labels. Here, the position of the center point of the bottom surface of the annotation frame (this center point falls on the ground), included in the three-dimensional annotation frame information of the target object, can be used to obtain the depth information of the surrounding ground (corresponding to the extended area). In implementation, as shown in FIG. 2I, diagram (a) shows the annotated target object 20 and the three-dimensional annotation frame 21 of the target object, where the center point of the bottom surface of the three-dimensional annotation frame is at the same height as the surrounding ground. In diagram (b), a large number of three-dimensional annotation points 23 can be generated in an extended area 22 around the center point; these three-dimensional points include the center point, as well as the three-dimensional annotation points 23 on the extended area determined at a preset depth interval, starting from the depth value of the center point of the extended area 22.
这样,如图2I中的图(b)所示,利用投影关系可以将三维标注点23投影到像素平面上,得到三维标注点在像素平面上对应的投影点,记录三维标注点23的深度值与其投影的投影点之间的对应关系,对每一个投影得到的投影点求出所对应的至少一个三维标注点的平均值深度值即可以得到图2I中的图(c)所示的标注深度图24,从标注深度图24即可得到周围地面(对应延展区域)的深度信息。In this way, as shown in diagram (b) of FIG. 2I, the three-dimensional annotation points 23 can be projected onto the pixel plane using the projection relationship, to obtain the corresponding projection points on the pixel plane. The correspondence between the depth values of the three-dimensional annotation points 23 and their projection points is recorded, and for each projection point, the average depth value of the at least one corresponding three-dimensional annotation point is computed, yielding the labeled depth map 24 shown in diagram (c) of FIG. 2I. The depth information of the surrounding ground (corresponding to the extended area) can then be obtained from the labeled depth map 24.
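For illustration, a minimal sketch of generating the three-dimensional annotation points 23 on the extended area 22; the patch extent and the lateral step are assumptions, since the disclosure only fixes the starting depth value and a preset depth interval.

```python
import numpy as np

def extended_region_points(center, half_width=3.0, depth_range=3.0,
                           lateral_step=0.1, depth_step=0.1):
    # center = (x, y, z): camera coordinates of the bottom-face center point,
    # which lies on the ground. Depths are laid out from the center depth at
    # the preset interval; the ground patch keeps the center point's height y.
    x0, y0, z0 = center
    xs = np.arange(x0 - half_width, x0 + half_width, lateral_step)
    zs = np.arange(z0 - depth_range, z0 + depth_range, depth_step)
    X, Z = np.meshgrid(xs, zs)
    Y = np.full_like(X, y0)  # flat local ground at the center point's height
    return np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

points = extended_region_points((2.0, 1.65, 15.0))  # hypothetical center
```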
其中,上述投影关系可以采用如下公式(1)来实现:Wherein, the above-mentioned projection relationship can be realized by using the following formula (1):
z_3d · [x_p, y_p, 1]^T = P_rect · R_rect · [x_3d, y_3d, z_3d, 1]^T    (1)

其中,(x_3d, y_3d, z_3d)表征的是三维标注点的相机坐标,(x_p, y_p)表征的是三维标注点投影的投影点,P_rect和R_rect分别表征的是旋转校正矩阵和投影矩阵。Here, (x_3d, y_3d, z_3d) denotes the camera coordinates of a three-dimensional annotation point, (x_p, y_p) denotes its projection point on the pixel plane, and P_rect and R_rect denote the rotation rectification matrix and the projection matrix, respectively.
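A minimal sketch of applying formula (1) and then the per-pixel averaging described above; the homogeneous-matrix shapes assumed here (P_rect of size 3x4, R_rect of size 4x4) follow one common convention and may differ from the patent's internal definition.

```python
import numpy as np

def labeled_depth_map(points, P_rect, R_rect, image_hw):
    # Project 3D annotation points to the pixel plane with formula (1) and
    # average the depths of all points that fall on the same pixel.
    h, w = image_hw
    hom = np.hstack([points, np.ones((len(points), 1))])  # (N, 4) homogeneous
    uvz = (P_rect @ R_rect @ hom.T).T                     # (N, 3)
    u = (uvz[:, 0] / uvz[:, 2]).astype(int)
    v = (uvz[:, 1] / uvz[:, 2]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth_sum, count = np.zeros((h, w)), np.zeros((h, w))
    for ui, vi, zi in zip(u[valid], v[valid], points[valid][:, 2]):
        depth_sum[vi, ui] += zi  # accumulate depth per projected pixel
        count[vi, ui] += 1
    return np.divide(depth_sum, count,
                     out=np.zeros((h, w)), where=count > 0)
```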
这样,将待检测图像的特征图输入到训练好的深度图生成网络即可以确定出待检测图像中的目标对象投影至地面的投影区域对应的深度图,进而可以结合特征图、三维先验框信息实现目标对象的三维预测。In this way, inputting the feature map of the image to be detected into the trained depth map generation network determines the depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground; three-dimensional prediction of the target object can then be realized in combination with the feature map and the three-dimensional prior frame information.
为了便于进一步理解上述三维预测的过程,接下来可以结合图2J进行说明。In order to facilitate a further understanding of the above three-dimensional prediction process, it can be explained in conjunction with FIG. 2J next.
如图2J所示,对于包含车辆30这一目标对象的待检测图像31而言,可以先通过特征提取网络32提取出待检测图像的特征图。之后一方面通过第一目标检测网络33进行二维检测,得到针对目标对象的二维检测信息,另一方面利用深度图生成网络34生成待检测图像中的目标对象投影至地面的投影区域对应的深度图35。As shown in FIG. 2J, for an image to be detected 31 containing the target object vehicle 30, the feature map of the image to be detected can first be extracted through the feature extraction network 32. Then, on the one hand, two-dimensional detection is performed through the first target detection network 33 to obtain two-dimensional detection information for the target object; on the other hand, the depth map generation network 34 is used to generate the depth map 35 corresponding to the projection area where the target object in the image to be detected is projected onto the ground.
本公开实施例中,基于上述二维检测信息,可以确定针对目标对象的三维先验框信息。如图2J所示为对应的三个子类别36所确定的三个三维先验框信息的示例性展示。In the embodiments of the present disclosure, based on the above two-dimensional detection information, the three-dimensional prior frame information for the target object may be determined. As shown in FIG. 2J , it is an exemplary display of three 3D a priori frame information determined for the corresponding three subcategories 36 .
这里,在将深度图以及特征图输入到训练好的第二目标检测网络37之前,可以先基于二维检测信息进行ROI-align方式下的裁剪,而后将裁剪得到的深度图和特征图输入到第二目标检测网络37,可以得到对应的三维检测框偏移量,如图2J所示的Δ(w,h,l)_3d、Δz_3d等信息。Here, before the depth map and the feature map are input into the trained second target detection network 37, ROI-align cropping can first be performed based on the two-dimensional detection information; the cropped depth map and feature map are then input into the second target detection network 37 to obtain the corresponding three-dimensional detection frame offsets, such as Δ(w,h,l)_3d and Δz_3d shown in FIG. 2J.
结合上述三维检测框偏移量以及三维先验框信息,可以确定三维检测信息。在实际应用中,可以将上述三维检测信息呈现在待检测图像上。The 3D detection information can be determined by combining the above 3D detection frame offset and 3D prior frame information. In practical applications, the above three-dimensional detection information can be presented on the image to be detected.
本领域技术人员可以理解,在具体实施方式的上述方法中,各步骤的撰写顺序并不意味着严格的执行顺序而对实施过程构成任何限定,各步骤的执行顺序应当以其功能和可能的内在逻辑确定。Those skilled in the art can understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the execution order of the steps should be determined by their functions and possible internal logic.
基于同一发明构思,本公开实施例中还提供了与目标检测的方法对应的目标检测的装置,由于本公开实施例中的装置解决问题的原理与本公开实施例上述目标检测的方法相似,因此装置的实施可以参见方法的实施。Based on the same inventive concept, the embodiments of the present disclosure further provide a target detection apparatus corresponding to the target detection method. Since the problem-solving principle of the apparatus in the embodiments of the present disclosure is similar to that of the above target detection method, the implementation of the apparatus may refer to the implementation of the method.
图3为本公开实施例提供的一种目标检测的装置的示意图,装置包括:提取模块301、生成模块302、第一检测模块303;其中,Fig. 3 is a schematic diagram of a target detection device provided by an embodiment of the present disclosure, the device includes: an extraction module 301, a generation module 302, and a first detection module 303; wherein,
提取模块301,配置为对待检测图像进行特征提取,得到待检测图像的特征图;The extraction module 301 is configured to perform feature extraction on the image to be detected to obtain a feature map of the image to be detected;
生成模块302,配置为基于特征图,生成待检测图像中的目标对象投影至地面的投影区域对应的深度图;The generation module 302 is configured to generate a depth map corresponding to a projection area where the target object in the image to be detected is projected to the ground based on the feature map;
第一检测模块303,配置为基于深度图以及特征图,确定目标对象的三维检测信息。The first detection module 303 is configured to determine three-dimensional detection information of the target object based on the depth map and the feature map.
采用上述目标检测的装置,不仅可以对待检测图像进行特征提取,还可以基于提取得到的特征图,生成待检测图像中的目标对象投影至地面的投影区域对应的深度图,进而基于深度图和特征图确定目标对象的三维检测信息。由于生成的深度图是指向待检测图像中的目标对象的,且对应的是目标对象投影到地面的投影区域,该投影区域一定程度上与目标对象关联,这样,在利用局部地面上的目标对象的特征图进行三维检测时可以以该局面地面对应的 深度图作为指导,提升检测的精度。Using the above target detection device, not only can feature extraction be performed on the image to be detected, but also based on the extracted feature map, a depth map corresponding to the projection area of the target object in the image to be detected can be generated to the ground, and then based on the depth map and features The map determines the 3D detection information of the target object. Since the generated depth map points to the target object in the image to be detected, and corresponds to the projection area where the target object is projected onto the ground, the projection area is associated with the target object to a certain extent. In this way, using the target object on the local ground The depth map corresponding to the ground of the situation can be used as a guide for three-dimensional detection to improve the accuracy of detection.
In a possible implementation, the apparatus further includes:
a second detection module 304, configured to detect the feature map, after the feature map of the image to be detected is obtained, to obtain two-dimensional detection information for the target object;
the first detection module 303 is configured to determine the three-dimensional detection information of the target object based on the depth map and the feature map by the following steps:
determining three-dimensional prior box information for the target object based on the two-dimensional detection information;
determining three-dimensional detection box information of the target object based on the three-dimensional prior box information, the depth map, and the feature map.
In a possible implementation, the two-dimensional detection information includes information on the two-dimensional detection box in which the target object is located and information on the category to which the target object belongs; the first detection module 303 is configured to determine the three-dimensional prior box information for the target object based on the two-dimensional detection information by the following steps:
determining, based on the category information of the target object, clustering information of each subcategory included in the category to which the target object belongs;
determining the three-dimensional prior box information for the target object according to the clustering information of each subcategory and the information on the two-dimensional detection box in which the target object is located.
In a possible implementation, the first detection module 303 is configured to determine the three-dimensional prior box information for the target object, according to the clustering information of each subcategory and the information on the two-dimensional detection box in which the target object is located, by the following steps:
for each of the subcategories, determining a depth value corresponding to that subcategory based on the cluster height value included in the clustering information of that subcategory and the width value included in the information on the two-dimensional detection box in which the target object is located;
determining one piece of three-dimensional prior box information for the target object based on the clustering information of that subcategory and the depth value corresponding to that subcategory.
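For illustration, such a subcategory depth could be obtained by pinhole similar triangles (the disclosure pairs the cluster height value with the width value of the two-dimensional detection box; the exact ratio below is an editorial assumption, not a formula given in the disclosure):

    def subcategory_depth(focal_px, cluster_height_m, box_width_px):
        """Similar-triangles depth estimate for one subcategory.

        focal_px:         camera focal length in pixels (assumed known).
        cluster_height_m: clustered real-world height of the subcategory.
        box_width_px:     width of the 2D detection box, in pixels.
        """
        # Larger boxes imply closer objects; depth scales inversely
        # with the box size for a fixed real-world extent.
        return focal_px * cluster_height_m / box_width_px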
In a possible implementation, the first detection module 303 is configured to determine the three-dimensional detection box information of the target object, based on the three-dimensional prior box information, the depth map, and the feature map, by the following steps:
determining three-dimensional detection box offsets according to the depth map and the feature map;
determining the three-dimensional detection box information of the target object based on the three-dimensional prior box information and the three-dimensional detection box offsets.
In a possible implementation, the first detection module 303 is configured to determine the three-dimensional detection box offsets according to the depth map and the feature map by the following steps:
extracting, based on the position range included in the information on the two-dimensional detection box in which the target object is located, a depth map and a feature map matching the position range from the depth map and the feature map, respectively;
determining the three-dimensional detection box offsets based on the depth map and the feature map matching the position range.
In a possible implementation, there are multiple pieces of three-dimensional prior box information; the first detection module 303 is configured to determine the three-dimensional detection box information of the target object, based on the three-dimensional prior box information and the three-dimensional detection box offsets, by the following steps:
determining a weight corresponding to each piece of three-dimensional prior box information;
determining the three-dimensional detection box information of the target object based on each piece of three-dimensional prior box information, the weight corresponding to each piece of three-dimensional prior box information, and the three-dimensional detection box offsets.
In a possible implementation, the first detection module 303 is configured to determine the weight corresponding to each piece of three-dimensional prior box information by the following steps:
determining, according to the depth map and the feature map, the predicted probability of each subcategory included in the category information of the target object;
determining, based on the predicted probability of each subcategory, the weight of the three-dimensional prior box information corresponding to each subcategory.
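One plausible reading of these two steps is sketched below (the softmax weighting and the weighted averaging of priors are editorial assumptions consistent with, but not mandated by, the disclosure):

    import numpy as np

    def fuse_prior_boxes(priors_whl, priors_z, subcat_logits):
        """Weight per-subcategory 3D prior boxes by predicted probabilities.

        priors_whl:    (S, 3) array, one (w, h, l) prior per subcategory.
        priors_z:      (S,) array of per-subcategory prior depth values.
        subcat_logits: (S,) raw subcategory scores predicted from the
                       depth map and the feature map.
        """
        w = np.exp(subcat_logits - subcat_logits.max())
        w /= w.sum()                       # softmax -> per-subcategory weights
        whl = (w[:, None] * priors_whl).sum(axis=0)
        z = float((w * priors_z).sum())
        # The 3D detection box offsets are then applied to (whl, z).
        return whl, z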
In a possible implementation, the second detection module 304 is configured to detect the feature map, to obtain the two-dimensional detection information for the target object, by the following steps:
determining two-dimensional detection box offsets according to the feature map;
determining the two-dimensional detection information of the target object based on preset two-dimensional prior box information and the two-dimensional detection box offsets.
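As an illustrative sketch, a Faster R-CNN-style parameterization could realize this decoding (the disclosure states only preset prior box plus offset; this specific parameterization is an assumption):

    import numpy as np

    def decode_2d_box(anchor, deltas):
        """Decode a 2D detection box from a preset prior box and offsets.

        anchor: (x_c, y_c, w, h) of the preset 2D prior box.
        deltas: (dx, dy, dw, dh) offsets predicted from the feature map.
        """
        xa, ya, wa, ha = anchor
        dx, dy, dw, dh = deltas
        x_c = xa + dx * wa                 # shift the center
        y_c = ya + dy * ha
        w = wa * np.exp(dw)                # scale the size
        h = ha * np.exp(dh)
        return x_c - w / 2, y_c - h / 2, x_c + w / 2, y_c + h / 2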
In a possible implementation, the depth map is determined by a trained depth map generation network; the depth map generation network is trained with image samples and with annotated depth maps determined based on the three-dimensional annotation box information of the target objects annotated in the image samples.
In a possible implementation, the three-dimensional annotation box information of the target object includes the position coordinates and the depth value of the center point of the bottom surface of the annotation box; the generation module 302 is configured to obtain the annotated depth map by the following steps:
projecting the three-dimensional annotation box information annotated for the target object onto the ground, based on the correspondence between the three-dimensional coordinate system in which the three-dimensional annotation box is located and the ground coordinate system in which the center point of the bottom surface of the annotation box is located, to obtain the projection region of the target object on the ground and the extended region in which the projection region is located;
determining the depth value of each three-dimensional label point in the extended region based on the position coordinates and the depth value of the center point of the bottom surface of the annotation box included in the three-dimensional annotation box information;
projecting, based on the correspondence between the camera coordinate system and the image coordinate system, each three-dimensional label point in the extended region under the camera coordinate system onto the pixel plane under the image coordinate system, to obtain the projection points in the pixel plane;
obtaining the annotated depth map based on the depth values of the three-dimensional label points in the extended region and the projection points in the pixel plane.
In a possible implementation, the generation module 302 is configured to determine the depth value of each three-dimensional label point in the extended region, based on the position coordinates and the depth value of the center point of the bottom surface of the annotation box included in the three-dimensional annotation box information, by the following steps:
determining the depth value and the position coordinates of the center point of the bottom surface of the annotation box as the depth value and the position coordinates of the center point of the extended region, respectively;
with the position coordinates of the center point of the extended region determined, taking the depth value of the center point of the extended region as the starting depth value and determining the depth values of the three-dimensional label points in the extended region at preset depth intervals.
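For illustration only, one way to realize these two steps is sketched below (the grid extent, the 0.5 m interval, and treating the ground plane as y = 0 are editorial placeholders; the disclosure specifies only a starting depth at the center point and preset depth intervals):

    import numpy as np

    def extended_region_depths(center_x, center_z, center_depth,
                               half_extent_m=3.0, step_m=0.5):
        """Assign depth values to 3D label points over the extended region.

        The box-bottom center supplies the center coordinates and the
        starting depth value; points are laid out on the ground plane
        (taken as y = 0 here) and their depth values change with the
        preset interval along the depth axis.
        """
        offs = np.arange(-half_extent_m, half_extent_m + step_m, step_m)
        points, depths = [], []
        for dz in offs:
            for dx in offs:
                points.append((center_x + dx, 0.0, center_z + dz))
                depths.append(center_depth + dz)  # start at center, step out
        return np.array(points), np.array(depths)

Each generated point would then be projected onto the pixel plane, as in the projection sketch earlier in this description, so that its depth value can be written into the annotated depth map.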
For descriptions of the processing flow of each module in the apparatus and of the interaction flow between the modules, reference may be made to the relevant descriptions in the above method embodiments, which are not repeated here.
An embodiment of the present disclosure further provides an electronic device. As shown in FIG. 4, a schematic structural diagram of the electronic device provided by an embodiment of the present disclosure, the device includes a processor 401, a memory 402, and a bus 403. The memory 402 stores machine-readable instructions executable by the processor 401 (for example, execution instructions corresponding to the extraction module 301, the generation module 302, and the first detection module 303 in the apparatus of FIG. 3). When the electronic device runs, the processor 401 and the memory 402 communicate through the bus 403, and when the machine-readable instructions are executed by the processor 401, the following processing is performed:
performing feature extraction on an image to be detected to obtain a feature map of the image to be detected;
generating, based on the feature map, a depth map corresponding to the region where the target object in the image to be detected is projected onto the ground;
determining three-dimensional detection information of the target object based on the depth map and the feature map.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the target detection method described in the above method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a computer program product carrying program code; the instructions included in the program code can be used to execute the steps of the target detection method described in the above method embodiments, to which reference may be made.
The above computer program product may be implemented in hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a Software Development Kit (SDK).
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the working processes of the systems and apparatuses described above, reference may be made to the corresponding processes in the foregoing method embodiments. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and there may be other divisions in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solutions of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only specific implementations of the present disclosure, used to illustrate the technical solutions of the present disclosure rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person familiar with this technical field may still, within the technical scope disclosed by the present disclosure, modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of the technical features; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Industrial Applicability
In the embodiments of the present disclosure, the target detection method includes: performing feature extraction on an image to be detected to obtain a feature map of the image to be detected; generating, based on the feature map, a depth map corresponding to the region where the target object in the image to be detected is projected onto the ground; and determining three-dimensional detection information of the target object based on the depth map and the feature map. With this target detection method, not only can feature extraction be performed on the image to be detected, but a depth map corresponding to the region where the target object is projected onto the ground can also be generated from the extracted feature map, and the three-dimensional detection information of the target object can then be determined based on the depth map and the feature map. Since the generated depth map is specific to the target object in the image to be detected and corresponds to the region where the target object is projected onto the ground, that region is to some extent associated with the target object; thus, when three-dimensional detection is performed with the feature map of the target object on the local ground, the depth map corresponding to that local ground can serve as a guide, thereby improving detection accuracy.

Claims (16)

1. A target detection method, the method comprising:
    performing feature extraction on an image to be detected to obtain a feature map of the image to be detected;
    generating, based on the feature map, a depth map corresponding to a region where a target object in the image to be detected is projected onto the ground;
    determining three-dimensional detection information of the target object based on the depth map and the feature map.
2. The method according to claim 1, wherein, after the feature map of the image to be detected is obtained, the method further comprises:
    detecting the feature map to obtain two-dimensional detection information for the target object;
    wherein determining the three-dimensional detection information of the target object based on the depth map and the feature map comprises:
    determining three-dimensional prior box information for the target object based on the two-dimensional detection information;
    determining three-dimensional detection box information of the target object based on the three-dimensional prior box information, the depth map, and the feature map.
3. The method according to claim 2, wherein the two-dimensional detection information comprises information on a two-dimensional detection box in which the target object is located and information on a category to which the target object belongs; and determining the three-dimensional prior box information for the target object based on the two-dimensional detection information comprises:
    determining, based on the category information of the target object, clustering information of each subcategory included in the category to which the target object belongs;
    determining the three-dimensional prior box information for the target object according to the clustering information of each subcategory and the information on the two-dimensional detection box in which the target object is located.
4. The method according to claim 3, wherein determining the three-dimensional prior box information for the target object according to the clustering information of each subcategory and the information on the two-dimensional detection box in which the target object is located comprises:
    for each of the subcategories, determining a depth value corresponding to the subcategory based on a cluster height value included in the clustering information of the subcategory and a width value included in the information on the two-dimensional detection box in which the target object is located;
    determining one piece of three-dimensional prior box information for the target object based on the clustering information of the subcategory and the depth value corresponding to the subcategory.
5. The method according to claim 3 or 4, wherein determining the three-dimensional detection box information of the target object based on the three-dimensional prior box information, the depth map, and the feature map comprises:
    determining three-dimensional detection box offsets according to the depth map and the feature map;
    determining the three-dimensional detection box information of the target object based on the three-dimensional prior box information and the three-dimensional detection box offsets.
6. The method according to claim 5, wherein determining the three-dimensional detection box offsets according to the depth map and the feature map comprises:
    extracting, based on a position range included in the information on the two-dimensional detection box in which the target object is located, a depth map and a feature map matching the position range from the depth map and the feature map, respectively;
    determining the three-dimensional detection box offsets based on the depth map and the feature map matching the position range.
7. The method according to claim 5 or 6, wherein there are multiple pieces of three-dimensional prior box information; and determining the three-dimensional detection box information of the target object based on the three-dimensional prior box information and the three-dimensional detection box offsets comprises:
    determining a weight corresponding to each piece of three-dimensional prior box information;
    determining the three-dimensional detection box information of the target object based on each piece of three-dimensional prior box information, the weight corresponding to each piece of three-dimensional prior box information, and the three-dimensional detection box offsets.
8. The method according to claim 7, wherein the method further comprises:
    determining, according to the depth map and the feature map, a predicted probability of each subcategory included in the category information of the target object;
    wherein determining the weight corresponding to each piece of three-dimensional prior box information comprises:
    determining, based on the predicted probability of each subcategory, the weight of the three-dimensional prior box information corresponding to each subcategory.
9. The method according to any one of claims 2 to 8, wherein detecting the feature map to obtain the two-dimensional detection information for the target object comprises:
    determining two-dimensional detection box offsets according to the feature map;
    determining the two-dimensional detection information of the target object based on preset two-dimensional prior box information and the two-dimensional detection box offsets.
10. The method according to any one of claims 1 to 9, wherein the depth map is determined by a trained depth map generation network; and the depth map generation network is trained with image samples and with annotated depth maps determined based on three-dimensional annotation box information of target objects annotated in the image samples.
11. The method according to claim 10, wherein the three-dimensional annotation box information of the target object comprises position coordinates and a depth value of a center point of a bottom surface of the annotation box; and the annotated depth map is obtained by the following steps:
    projecting the three-dimensional annotation box information annotated for the target object onto the ground, based on a correspondence between a three-dimensional coordinate system in which the three-dimensional annotation box is located and a ground coordinate system in which the center point of the bottom surface of the annotation box is located, to obtain a projection region of the target object on the ground and an extended region in which the projection region is located;
    determining a depth value of each three-dimensional label point in the extended region based on the position coordinates and the depth value of the center point of the bottom surface of the annotation box included in the three-dimensional annotation box information;
    projecting, based on a correspondence between a camera coordinate system and an image coordinate system, each three-dimensional label point in the extended region under the camera coordinate system onto a pixel plane under the image coordinate system, to obtain projection points in the pixel plane;
    obtaining the annotated depth map based on the depth values of the three-dimensional label points in the extended region and the projection points in the pixel plane.
12. The method according to claim 11, wherein determining the depth value of each three-dimensional label point in the extended region based on the position coordinates and the depth value of the center point of the bottom surface of the annotation box included in the three-dimensional annotation box information comprises:
    determining the depth value and the position coordinates of the center point of the bottom surface of the annotation box as the depth value and the position coordinates of a center point of the extended region, respectively;
    with the position coordinates of the center point of the extended region determined, taking the depth value of the center point of the extended region as a starting depth value and determining the depth values of the three-dimensional label points in the extended region at preset depth intervals.
13. A target detection apparatus, the apparatus comprising:
    an extraction module, configured to perform feature extraction on an image to be detected to obtain a feature map of the image to be detected;
    a generation module, configured to generate, based on the feature map, a depth map corresponding to a region where a target object in the image to be detected is projected onto the ground;
    a first detection module, configured to determine three-dimensional detection information of the target object based on the depth map and the feature map.
14. An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate through the bus; and when the machine-readable instructions are executed by the processor, the steps of the target detection method according to any one of claims 1 to 12 are executed.
15. A computer-readable storage medium on which a computer program is stored, wherein, when the computer program is run by a processor, the steps of the target detection method according to any one of claims 1 to 12 are executed.
16. A computer program product, comprising a computer-readable storage medium storing program code, wherein, when instructions included in the program code are run by a processor of a computer device, the steps of the target detection method according to any one of claims 1 to 12 are implemented.
PCT/CN2022/090957 2021-09-30 2022-05-05 Target detection method and apparatus, electronic device, storage medium, and computer program product WO2023050810A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111164729.9A CN114119991A (en) 2021-09-30 2021-09-30 Target detection method and device, electronic equipment and storage medium
CN202111164729.9 2021-09-30

Publications (1)

Publication Number Publication Date
WO2023050810A1 (en) 2023-04-06

Family

ID=80441823

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090957 WO2023050810A1 (en) 2021-09-30 2022-05-05 Target detection method and apparatus, electronic device, storage medium, and computer program product

Country Status (2)

Country Link
CN (1) CN114119991A (en)
WO (1) WO2023050810A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119991A (en) * 2021-09-30 2022-03-01 深圳市商汤科技有限公司 Target detection method and device, electronic equipment and storage medium
CN116189150A (en) * 2023-03-02 2023-05-30 吉咖智能机器人有限公司 Monocular 3D target detection method, device, equipment and medium based on fusion output

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046767A (en) * 2019-12-04 2020-04-21 武汉大学 3D target detection method based on monocular image
US20200160033A1 (en) * 2018-11-15 2020-05-21 Toyota Research Institute, Inc. System and method for lifting 3d representations from monocular images
CN111832338A (en) * 2019-04-16 2020-10-27 北京市商汤科技开发有限公司 Object detection method and device, electronic equipment and storage medium
CN112733672A (en) * 2020-12-31 2021-04-30 深圳一清创新科技有限公司 Monocular camera-based three-dimensional target detection method and device and computer equipment
CN114119991A (en) * 2021-09-30 2022-03-01 深圳市商汤科技有限公司 Target detection method and device, electronic equipment and storage medium


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116681660A (en) * 2023-05-18 2023-09-01 中国长江三峡集团有限公司 Target object defect detection method and device, electronic equipment and storage medium
CN116681660B (en) * 2023-05-18 2024-04-19 中国长江三峡集团有限公司 Target object defect detection method and device, electronic equipment and storage medium
CN117315402A (en) * 2023-11-02 2023-12-29 北京百度网讯科技有限公司 Training method of three-dimensional object detection model and three-dimensional object detection method

Also Published As

Publication number Publication date
CN114119991A (en) 2022-03-01


Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22874208
Country of ref document: EP
Kind code of ref document: A1