WO2023050810A1 - Procédé et appareil de détection de cible, dispositif électronique, support d'enregistrement et produit programme d'ordinateur - Google Patents

Procédé et appareil de détection de cible, dispositif électronique, support d'enregistrement et produit programme d'ordinateur

Info

Publication number
WO2023050810A1
Authority
WO
WIPO (PCT)
Prior art keywords
dimensional
target object
information
detection
depth
Prior art date
Application number
PCT/CN2022/090957
Other languages
English (en)
Chinese (zh)
Inventor
刘配
杨国润
王哲
石建萍
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023050810A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Definitions

  • the present disclosure relates to the technical field of image processing, and in particular to a target detection method, device, electronic equipment, storage medium, and computer program product.
  • Three-dimensional (3D) target detection tasks are more difficult and more complex, and often need to detect the 3D geometric information and semantic information of the target, which mainly include the length, width, height, center point, and orientation angle of the target.
  • 3D target detection of monocular images is widely used in various fields (such as the field of unmanned driving) due to its economical and practical characteristics.
  • 3D object detection techniques based on monocular images mainly rely on external subtasks responsible for 2D object detection, depth map estimation, and the like. Since these subtasks are trained separately, accuracy is lost, which limits the performance upper bound of the network model and cannot meet the accuracy requirements of 3D detection.
  • Embodiments of the present disclosure provide at least one object detection method, device, electronic device, storage medium, and computer program product, so as to improve the accuracy of 3D object detection.
  • an embodiment of the present disclosure provides a method for target detection, the method comprising: performing feature extraction on an image to be detected to obtain a feature map of the image to be detected; generating, based on the feature map, a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground; and determining, based on the depth map and the feature map, the three-dimensional detection information of the target object.
  • In this way, not only can feature extraction be performed on the image to be detected, but a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground can also be generated based on the extracted feature map, and the 3D detection information of the target object can then be determined based on the depth map and the feature map. Since the generated depth map points to the target object in the image to be detected and corresponds to the projection area where the target object is projected onto the ground, the projection area is associated with the target object to a certain extent. In this way, the depth map corresponding to the local ground can be used as a targeted guide when performing three-dimensional detection on the feature map of the target object on the local ground, thereby improving the accuracy of detection.
  • the method further includes:
  • the determining the three-dimensional detection information of the target object based on the depth map and the feature map includes:
  • determining 3D prior frame information of the target object based on the two-dimensional detection information; and determining the 3D detection frame information of the target object based on the 3D prior frame information, the depth map, and the feature map.
  • the 3D detection here may be detection performed in combination with the information of the 3D prior frame. The 3D prior frame can constrain the starting position of the 3D detection to a certain extent, so that the 3D detection frame information is searched for near the starting position, thereby further improving the accuracy of the 3D detection.
  • the two-dimensional detection information includes the two-dimensional detection frame information of the target object and the category information of the target object;
  • determining the 3D prior frame information of the target object based on the two-dimensional detection information includes: determining, based on the category information of the target object, the clustering information of each subcategory included in the category to which the target object belongs; and determining the three-dimensional prior frame information for the target object according to the clustering information of each subcategory and the two-dimensional detection frame information of the target object.
  • the three-dimensional prior frame here can be determined in combination with the category information of the target object.
  • the size and position of the corresponding three-dimensional prior frame may be different for different categories of target objects.
  • the use of category information can assist in determining the position of the three-dimensional prior frame with high accuracy.
  • the determining the three-dimensional prior frame information for the target object according to the clustering information of each subcategory and the two-dimensional detection frame information of the target object includes:
  • the determining the 3D detection frame information of the target object based on the 3D prior frame information, the depth map, and the feature map includes:
  • the determining the offset of the 3D detection frame according to the depth map and the feature map includes:
  • the three-dimensional detection frame offset is determined based on a depth map and a feature map matched with the position range.
  • determining the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset includes:
  • the 3D detection frame information of the target object is determined based on each of the 3D prior frame information, the weight corresponding to each of the 3D prior frame information, and the 3D detection frame offset.
  • the method also includes:
  • the determination of the weight corresponding to each of the three-dimensional prior frame information includes:
  • the weight of the three-dimensional prior frame information corresponding to each of the subcategories is determined.
  • the greater the probability, the higher the possibility that the target object belongs to the corresponding subcategory, and the corresponding three-dimensional prior frame information can therefore be given a higher weight. In this way, the prediction accuracy of the final 3D detection frame is further improved.
  • the detecting the feature map to obtain the two-dimensional detection information for the target object includes:
  • the two-dimensional detection information of the target object is determined based on the preset two-dimensional prior frame information and the offset of the two-dimensional detection frame.
  • the depth map is determined by a trained depth map generation network; the depth map generation network is obtained by training with image samples and an annotated depth map determined based on the three-dimensional annotation frame information of the target objects marked in the image samples.
  • the three-dimensional annotation frame information of the target object includes position coordinates and depth values of the center point of the bottom surface of the annotation frame; the annotation depth map is acquired according to the following steps:
  • the three-dimensional annotation frame information marked for the target object is projected onto the ground, and the projection area of the target object on the ground and the extended area where the projection area is located are obtained;
  • each three-dimensional label point on the extended area under the camera coordinate system is projected onto the pixel plane under the image coordinate system to obtain the projection points in the pixel plane;
  • the annotated depth map is obtained based on the depth values of the three-dimensional label points on the extended area and the projection points in the pixel plane.
  • the annotated depth map here can be obtained by combining ground projection operations and conversion operations between coordinate systems.
  • In this way, the local ground area including the target object can be completely covered, and the corresponding annotated depth map obtained from the 3D projection result of the extended area can reflect the depth information of the extended area including the local ground area, which can specifically assist the three-dimensional detection of the target object on the corresponding local ground area.
  • the determining the depth value of each three-dimensional label point on the extended area based on the position coordinates and depth values of the center point of the bottom surface of the label box included in the three-dimensional label box information includes:
  • the depth value of the central point of the extended area is used as the initial depth value, and the depth values of the three-dimensional label points in the extended area are determined at a preset depth interval.
  • the embodiment of the present disclosure also provides a target detection device, the device comprising:
  • the extraction module is configured to perform feature extraction on the image to be detected to obtain a feature map of the image to be detected; the generation module is configured to generate a projection area corresponding to the projection of the target object in the image to be detected to the ground based on the feature map The depth map; the first detection module is configured to determine the three-dimensional detection information of the target object based on the depth map and the feature map.
  • an embodiment of the present disclosure further provides an electronic device, including: a processor, a memory, and a bus, where the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the method for object detection described in any one of the first aspect and its various implementation manners are executed.
  • the embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the method for target detection described in any one of the first aspect and its various implementation modes are executed.
  • the embodiments of the present disclosure further provide a computer program product, including a computer-readable storage medium storing program codes; when the instructions included in the program codes are executed by the processor of a computer device, the steps of the method described in the first aspect can be implemented.
  • FIG. 1 shows a flow chart of a method for target detection provided by an embodiment of the present disclosure
  • FIG. 2A shows a flowchart of a method for determining two-dimensional detection information of a target object provided by an embodiment of the present disclosure
  • FIG. 2B shows a flow chart of a method for determining 3D prior frame information for a target object provided by an embodiment of the present disclosure
  • FIG. 2C shows a flow chart of a method for determining a three-dimensional prior frame information for a target object provided by an embodiment of the present disclosure
  • FIG. 2D shows a flow chart of a method for determining the three-dimensional detection frame information of a target object provided by an embodiment of the present disclosure
  • FIG. 2E shows a flow chart of a method for determining the offset of a three-dimensional detection frame of a target object provided by an embodiment of the present disclosure
  • FIG. 2F shows a flowchart of a method for training a depth map generation network provided by an embodiment of the present disclosure
  • FIG. 2G shows a flow chart of a method for obtaining a marked depth map provided by an embodiment of the present disclosure
  • FIG. 2H shows a flowchart of a method for determining the depth value of each three-dimensional label point on the extended area provided by an embodiment of the present disclosure
  • FIG. 2I shows a schematic diagram of the application of a method for generating local ground depth labels provided by an embodiment of the present disclosure
  • Fig. 2J shows a schematic diagram of the application of a target detection method provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of a target detection device provided by an embodiment of the present disclosure
  • Fig. 4 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
  • the detection accuracy has reached a very high level.
  • the commonly used 3D target detection method is based on LiDAR data, but the expensive data-collection equipment makes it difficult to support large-scale application and deployment.
  • in contrast, the 3D target detection method based on a monocular image can use the vehicle-mounted camera of the car, which is economical and readily available.
  • the task of monocular 3D detection is to detect the 3D geometric information and semantic information of the target object from the 3D scene, including the length, width, height, center point and orientation angle information of the target object.
  • the monocular image-based 3D object detection technology relies on some external subtasks, and these subtasks are responsible for performing tasks such as 2D object detection and depth map estimation. Since the subtasks are trained separately, there is a loss of accuracy, which limits the upper limit of the performance of the network model and cannot meet the accuracy requirements of 3D detection, so it is difficult to be used in practical applications.
  • the difficulty of current 3D object detection methods lies in the depth prediction of 3D detection frames.
  • the label of 3D target detection only provides the depth information of the center point or corners of the target frame, which is difficult for the network to learn from, and cannot generate more, or more accurate, depth information.
  • the 3D object detection methods in the related art guide the learning of the 3D detection frame through subtasks such as depth estimation, pseudo point cloud generation, and semantic segmentation prediction, but these subtasks require a large number of accurate depth labels, which are difficult to obtain in practical applications; moreover, the accuracy of the subtasks limits the performance upper bound of 3D target detection, so such methods are not very reliable for 3D target detection.
  • the present disclosure provides a method, device, electronic device, storage medium, and computer program product for object detection, so as to improve the accuracy of 3D object detection.
  • the execution subject of the method for target detection provided in the embodiments of the present disclosure is generally a computer device with certain computing capabilities.
  • the computer equipment includes, for example: terminal equipment or server or other processing equipment, and the terminal equipment can be user equipment (User Equipment, UE), mobile equipment, user terminal, terminal, cellular phone, cordless phone, personal digital assistant (Personal Digital Assistant, PDA), handheld devices, computing devices, vehicle-mounted devices, wearable devices, etc.
  • the object detection method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • FIG. 1 is a flow chart of a method for target detection provided by an embodiment of the present disclosure, the method is executed by an electronic device, and the method includes steps S101 to S103, wherein:
  • Step S101 performing feature extraction on the image to be detected to obtain a feature map of the image to be detected
  • Step S102 based on the feature map, generate a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground;
  • Step S103 based on the depth map and the feature map, determine the three-dimensional detection information of the target object.
  • the above target detection method can be applied to the field of computer vision, for example, it can be applied to scenarios such as vehicle detection in unmanned driving, and drone detection. Considering the wide application of unmanned driving, the following is an example of vehicle detection.
  • the 3D object detection technology in the related art relies on some external subtasks, and these subtasks are responsible for performing tasks such as 2D object detection and depth map estimation. Due to the separate training of the subtasks, there is a loss of accuracy in itself, resulting in a low final 3D detection accuracy.
  • the embodiments of the present disclosure provide a solution for three-dimensional detection in combination with a local depth map and a feature map, and the detection accuracy is high.
  • the image to be detected in the embodiments of the present disclosure may be an image collected in a target scene, and different application scenes correspond to different collected images.
  • the image to be detected here can be the image collected by the camera device installed on the driverless car during the driving process of the vehicle, and the image can include all target objects within the shooting field of view of the camera device.
  • the target object here may be a vehicle in front or a pedestrian in front, which is not limited here.
  • the embodiments of the present disclosure may use various feature extraction methods to extract feature maps for the image to be detected.
  • for example, the feature map can be extracted from the image to be detected through image processing, or by using a trained feature extraction network.
  • the feature extraction network can be used to extract the feature map in the embodiments of the present disclosure.
  • the feature extraction network here can be a convolutional neural network (Convolutional Neural Networks, CNN). In practical applications, it can be implemented by using a CNN model including convolutional blocks, dense blocks, and transition blocks.
  • the convolutional blocks here can be composed of convolutional layers, batch normalization (Batch normalization) layers, and linear rectification layers (Rectified Linear Unit, ReLU), the dense block can be composed of multiple convolutional blocks and multiple skip connections (Skip connection), and the transitional block is generally composed of convolutional blocks and average pooling layers.
  • the composition of the convolutional blocks, dense blocks, and transition blocks, for example how many convolutional layers and average pooling layers are included in an application, can be determined in conjunction with the application scenario, and is not limited here.
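  • The following is an illustrative sketch only (not the patented network definition) of a feature-extraction backbone built from the convolutional block / dense block / transition block structure described above; the module names, channel counts, and the use of PyTorch are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution + batch normalization + ReLU, as described above."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class DenseBlock(nn.Module):
    """Several convolutional blocks joined by skip (concatenation) connections."""
    def __init__(self, in_ch, growth, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [ConvBlock(in_ch + i * growth, growth) for i in range(num_layers)]
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

class TransitionBlock(nn.Module):
    """Convolutional block followed by average pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(ConvBlock(in_ch, out_ch), nn.AvgPool2d(2))

    def forward(self, x):
        return self.block(x)

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = ConvBlock(3, 32, stride=2)
        self.dense1 = DenseBlock(32, growth=16)   # 32 + 3*16 = 80 channels
        self.trans1 = TransitionBlock(80, 64)
        self.dense2 = DenseBlock(64, growth=32)   # 64 + 3*32 = 160 channels
        self.trans2 = TransitionBlock(160, 128)

    def forward(self, image):
        return self.trans2(self.dense2(self.trans1(self.dense1(self.stem(image)))))

# feature_map = FeatureExtractor()(torch.randn(1, 3, 384, 1280))  # e.g. a road-scene image
```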
  • the embodiments of the present disclosure can generate a local depth map based on the extracted feature map.
  • the local depth map corresponds to the projection area where the target object in the image to be detected is projected onto the ground, and thus points to the local ground depth information associated with the target object. Since the local ground is bound to the target object to a certain extent, the target object can be detected more accurately in combination with the feature map extracted above.
  • the above partial depth map may be determined by using a trained depth map generation network.
  • the depth map generation network is trained on the correspondence between the features of the image sample and the depths of the corresponding pixels in the annotated depth map. In this way, when the extracted feature map is input into the trained depth map generation network, the depth map corresponding to the projection area of the target object on the ground can be output.
  • the feature map and the depth map can be cropped with a region feature aggregation method (ROI-Align), and the three-dimensional detection of the target object can be realized according to the depth map and feature map corresponding to the target object obtained by cropping.
  • the 3D detection in the embodiments of the present disclosure may be based on residual prediction relative to the 3D prior frame, considering that in residual prediction the information of the original 3D prior frame can guide the subsequent 3D detection; for example, the original 3D prior frame can be used as the initial position, and the 3D detection frame can be searched for near this initial position. Especially when the accuracy of the 3D prior frame is relatively high, this significantly improves detection efficiency compared with direct 3D detection.
  • the above-mentioned 3D prior frame may be determined based on 2D detection information, so that 3D detection may be realized based on 3D prior frame information, a depth map, and a feature map.
  • the two-dimensional detection information of the target object can be determined according to the steps shown in FIG. 2A, and the steps include:
  • Step S201 determining the offset of the two-dimensional detection frame according to the feature map
  • Step S202 based on the preset 2D prior frame information and the offset of the 2D detection frame, determine the 2D detection information of the target object.
  • the two-dimensional detection information can be determined based on the calculation result between the offset of the two-dimensional detection frame and the preset two-dimensional prior frame information.
  • the two-dimensional detection information in the embodiment of the present disclosure may be obtained by using the trained first target detection network to perform two-dimensional detection on the feature map.
  • the first target detection network may be trained on the correspondence between the feature map of the image sample and the two-dimensional annotation information, or on the correspondence between the feature map of the image sample and the offset (the difference between the two-dimensional annotation frame and the two-dimensional prior frame). With the former correspondence, the two-dimensional detection information of the target object in the image to be detected can be determined directly; with the latter, the offset is determined first, and the sum of the offset and the two-dimensional prior frame is then used to determine the two-dimensional detection information of the target object.
  • the determined two-dimensional detection information may include the position information of the two-dimensional detection frame (x2d, y2d, w2d, h2d), the center point position information (xp, yp), the orientation angle (θ3d), the category information (cls) to which the target object belongs, and other information related to two-dimensional detection, which is not limited here.
  • the first target detection network here can perform two-dimensional residual prediction. It can first perform dimensionality reduction through a convolutional layer and a linear rectification layer, and then perform residual prediction of the two-dimensional detection frame through multiple convolutional layers, so the prediction accuracy is higher.
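  • Below is an illustrative sketch only of the structure just described (a dimensionality-reduction convolution plus ReLU, followed by convolutions that regress 2D-box residuals relative to preset 2D prior frames and predict category scores); the channel counts, anchor count, and the residual decode rule are assumptions, not the patented definition.

```python
import math
import torch.nn as nn

class Detector2D(nn.Module):
    def __init__(self, in_ch=128, num_anchors=3, num_classes=4):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, 64, 1), nn.ReLU(inplace=True))
        self.offset_head = nn.Conv2d(64, num_anchors * 4, 3, padding=1)         # (dx, dy, dw, dh)
        self.cls_head = nn.Conv2d(64, num_anchors * num_classes, 3, padding=1)  # category scores

    def forward(self, feature_map):
        x = self.reduce(feature_map)
        return self.offset_head(x), self.cls_head(x)

def decode_2d(prior, offset):
    """Combine a preset 2D prior frame (x, y, w, h) with a predicted residual.
    One common parameterization; the disclosure does not fix the exact form."""
    x, y, w, h = prior
    dx, dy, dw, dh = offset
    return (x + dx * w, y + dy * h, w * math.exp(dw), h * math.exp(dh))
```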
  • the three-dimensional prior frame information for the target object can be determined according to the steps shown in FIG. 2B, and the steps include:
  • Step S301 based on the category information of the target object, determine the clustering information of each subcategory included in the category to which the target object belongs;
  • Step S302 according to the clustering information of each subcategory and the two-dimensional detection frame information where the target object is located, determine the three-dimensional prior frame information for the target object.
  • the three-dimensional prior frame information can be determined by combining the clustering information of each sub-category included in the category to which the target object belongs and the two-dimensional detection frame information where the target object is located.
  • different subcategories within the same category may correspond to very different 3D detection results. For example, among targets that all belong to the vehicle category, the 3D detection frame of the car subcategory differs greatly in size from the 3D detection frame of the large-truck subcategory.
  • the subcategory can be divided in advance, and the corresponding three-dimensional prior frame information can be determined based on the clustering information of each divided subcategory.
  • the clustering result corresponding to this category information may be determined.
  • vehicle image samples covering various subcategories may be collected in advance, and information such as the length, width, and height of the vehicle is determined for each vehicle image sample.
  • clustering can be performed based on height values, so that vehicle image samples belonging to the same height range can be correspondingly divided into a subcategory, and then the clustering information of this subcategory can be determined.
  • clustering methods such as K-means clustering algorithm (K-means) can be used to realize the above clustering process.
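  • A minimal sketch of this clustering step is shown below, assuming the samples are clustered on their height values and that each subcategory's clustering information is taken as the mean length, width, and height of its members; the function name, k=3, and the feature choice are assumptions for illustration.

```python
import numpy as np

def cluster_subcategories(sample_lwh, k=3, iters=50, seed=0):
    """sample_lwh: (N, 3) array of per-sample (length, width, height) in meters."""
    rng = np.random.default_rng(seed)
    heights = sample_lwh[:, 2]                              # cluster on height values
    centers = heights[rng.choice(len(heights), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(np.abs(heights[:, None] - centers[None, :]), axis=1)
        centers = np.array([heights[assign == j].mean() if np.any(assign == j) else centers[j]
                            for j in range(k)])
    # Clustering information per subcategory: mean (l, w, h) of its member samples.
    return {j: sample_lwh[assign == j].mean(axis=0) for j in range(k) if np.any(assign == j)}

# Example: car-like, van-like and truck-like samples separate into subcategories by height.
samples = np.array([[4.2, 1.8, 1.5], [4.5, 1.8, 1.6],
                    [5.0, 2.0, 2.0], [5.2, 2.0, 2.1],
                    [9.0, 2.5, 3.0], [10.0, 2.5, 3.2]])
subcategory_lwh = cluster_subcategories(samples)
```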
  • the process of determining the three-dimensional a priori frame information by combining the clustering information and the two-dimensional detection frame information where the target object is located can be implemented according to the steps shown in Figure 2C, and the steps include:
  • Step S3021 for each of the subcategories, determine the depth value corresponding to the subcategory based on the cluster height value included in the clustering information of the subcategory and the width value included in the two-dimensional detection frame information where the target object is located;
  • Step S3022 based on the clustering information of the sub-category and the depth value corresponding to the sub-category, determine a 3D prior frame information for the target object.
  • each subcategory can correspond to one piece of 3D prior frame information, and information such as the size of the 3D prior frame can be determined from the clustering information of the corresponding subcategory; the related depth information can be determined from the cluster height value and the width value included in the two-dimensional detection frame information.
  • In implementation, this can be realized by first computing the ratio between the cluster height value and the width value, and then multiplying by the focal length of the camera device.
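  • A literal reading of that ratio-then-multiply step is sketched below; the function name, units, and the exact pairing of the clustered height with the 2D frame width are assumptions, since the disclosure only fixes the overall structure of the computation.

```python
def subcategory_depth(cluster_height, box_width_2d, focal_length):
    """Depth value for one subcategory: ratio of the clustered height to the
    2D detection-frame width, scaled by the camera focal length."""
    return focal_length * cluster_height / box_width_2d

# e.g. a 1.5 m car-height cluster, a 120-pixel-wide 2D frame, a 721-pixel focal length
depth = subcategory_depth(1.5, 120.0, 721.0)   # ≈ 9.0 meters
```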
  • the embodiment of the present disclosure may combine this information, as well as the depth map and the feature map to determine the three-dimensional detection frame information, as shown in Figure 2D, including the following steps:
  • Step S1031 determining the offset of the three-dimensional detection frame according to the depth map and the feature map;
  • Step S1032 Determine the 3D detection frame information of the target object based on the 3D prior frame information and the 3D detection frame offset.
  • the second target detection network can be used to realize three-dimensional detection, and the offset of the three-dimensional detection frame output by the second target detection network can be obtained, and then the three-dimensional detection frame of the target object can be determined based on the three-dimensional prior frame information and the offset of the three-dimensional detection frame information.
  • the above three-dimensional detection frame information may include the shape information (w3d, h3d, l3d) and the depth information (z3d) of the detection frame.
  • three-dimensional prediction can determine information about the target object in more dimensions; for example, it can also determine which subcategory, within the category information of the target object, the target belongs to, such as whether a target in the vehicle category is a car or a truck.
  • a 3D detection frame offset can be predicted based on each 3D prior frame. Considering that different 3D prior frames correspond to different subcategories, and that the prediction probabilities of the subcategories also differ, corresponding weights can be assigned to the 3D prior frame information of each subcategory based on the prediction probabilities, and the 3D detection frame information of the target object can then be determined based on each piece of 3D prior frame information, the weight corresponding to each piece of 3D prior frame information, and the 3D detection frame offset.
  • subcategories with higher predicted probabilities can be given higher weights to highlight the role of the corresponding 3D prior frames in the subsequent 3D detection, while subcategories with lower predicted probabilities can be given smaller weights to weaken the role of the corresponding 3D prior frames, so that the determined 3D detection frame information is more accurate.
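  • The weighting just described can be sketched as follows, assuming the subcategory prediction probabilities are obtained with a softmax and that the weighted prior is combined with the predicted residual additively; the softmax weighting, the additive decode, and the numeric priors are assumptions, not the patented arithmetic.

```python
import numpy as np

def decode_3d(priors, subcat_logits, offset):
    """priors: (K, 4) rows of (w3d, h3d, l3d, z3d); offset: (4,) predicted residual."""
    weights = np.exp(subcat_logits - subcat_logits.max())
    weights /= weights.sum()                          # higher probability -> higher weight
    weighted_prior = (weights[:, None] * priors).sum(axis=0)
    return weighted_prior + offset                    # final (w3d, h3d, l3d, z3d)

priors = np.array([[1.8, 1.5, 4.3, 22.0],             # car-like prior (illustrative values)
                   [2.0, 2.0, 5.1, 21.0],             # van-like prior
                   [2.5, 3.2, 9.5, 20.0]])            # truck-like prior
box_3d = decode_3d(priors, np.array([2.0, 0.3, -1.0]), np.array([0.05, -0.02, 0.10, 0.8]))
```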
  • the depth map and the feature map can be clipped first, and then the 3D detection can be performed, as shown in FIG. 2E , which can be achieved by the following steps:
  • Step S1031a based on the location range included in the two-dimensional detection frame information where the target object is located, extracting a depth map and a feature map that match the location range from the depth map and the feature map, respectively;
  • Step S1031b Determine the offset of the 3D detection frame based on the depth map and feature map matched with the location range.
  • the depth map and feature map corresponding to the position range can be clipped based on the position range included in the two-dimensional detection frame information, that is, the local depth map and local feature map pointing to the target object can be obtained.
  • the corresponding 3D detection frame offset can then be determined based on the local depth map and the local feature map; the 3D detection frame offset here can also be determined by using the second target detection network, so that the prediction accuracy is higher.
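  • A minimal sketch of this cropping step is given below, assuming PyTorch and torchvision's ROI-Align operator; the tensor sizes, the 7x7 output, and the stride are placeholders for illustration.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 128, 48, 160)        # (N, C, H, W) backbone output
depth_map = torch.randn(1, 1, 48, 160)            # local ground depth map
# 2D detection frame in image coordinates (batch_index, x1, y1, x2, y2).
boxes = torch.tensor([[0.0, 400.0, 150.0, 600.0, 280.0]])
scale = 48.0 / 384.0                               # feature-map resolution relative to the image

roi_feat = roi_align(feature_map, boxes, output_size=(7, 7), spatial_scale=scale)
roi_depth = roi_align(depth_map, boxes, output_size=(7, 7), spatial_scale=scale)
roi_input = torch.cat([roi_feat, roi_depth], dim=1)   # fed to the second target detection network
```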
  • training of the first object detection network and the second object detection network is required.
  • corresponding supervisory signals (that is, the annotated frame information) can be set, so that the corresponding loss function values can be determined. Based on these loss function values, network training can be guided through backpropagation, which is not limited here.
  • the embodiments of the present disclosure also set a corresponding supervisory signal for the depth map (that is, the annotated depth map), which is used to train the depth map generation network.
  • the training process of the depth map generation network is as shown in Figure 2F, and the training process includes the following steps:
  • Step S401 acquiring an image sample and an annotated depth map determined based on the three-dimensional annotation frame information of the target object annotated in the image sample;
  • Step S402 performing feature extraction on the image sample to obtain a feature map of the image sample
  • Step S403 input the feature map of the image sample into the depth map generation network to be trained, obtain the depth map output by the depth map generation network, and determine the loss function value based on the similarity between the output depth map and the marked depth map;
  • Step S404 When the loss function value is greater than the preset threshold, adjust the network parameter values of the depth map generation network, and input the feature map of the image sample into the adjusted depth map generation network, until the loss function value is less than or equal to the preset threshold.
  • the image sample here is acquired in a manner similar to that of the image to be detected.
  • for the extraction of the feature map of the image sample, refer to the extraction process of the feature map of the image to be detected described above.
  • the embodiments of the present disclosure can determine the loss function value based on the similarity between the depth map output by the depth map generation network and the annotated depth map, and adjust the network parameter values of the depth map generation network according to the loss function value, so that the network output tends to be the same as, or closer to, the annotated depth map.
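  • A minimal training-loop sketch for steps S401 to S404 is given below, assuming an L1 similarity loss and a small placeholder network; the architecture, loss choice, optimizer, and threshold value are all assumptions for illustration.

```python
import torch
import torch.nn as nn

depth_net = nn.Sequential(                         # placeholder depth map generation network
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, 1),
)
optimizer = torch.optim.Adam(depth_net.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()                              # one possible similarity measure
preset_threshold = 0.05

def train_step(sample_feature_map, annotated_depth):
    optimizer.zero_grad()
    predicted_depth = depth_net(sample_feature_map)
    loss = loss_fn(predicted_depth, annotated_depth)
    loss.backward()
    optimizer.step()
    return loss.item()

# for sample_feature_map, annotated_depth in loader:   # loader over image-sample features
#     if train_step(sample_feature_map, annotated_depth) <= preset_threshold:
#         break
```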
  • the marked depth map described in the above step 401 can be obtained according to the steps shown in FIG. 2G:
  • Step S4011 based on the corresponding relationship between the 3D coordinate system where the 3D label frame is located and the ground coordinate system where the center point of the bottom surface of the label frame is located, project the 3D label frame information marked by the target object to the ground, and obtain the projection area and projection of the target object on the ground The extended area where the area is located;
  • Step S4012 Determine the depth value of each three-dimensional label point on the extended area based on the position coordinates and depth values of the center point of the bottom surface of the label box included in the three-dimensional label box information;
  • Step S4013 based on the corresponding relationship between the camera coordinate system and the image coordinate system, project the three-dimensional label points on the extended area under the camera coordinate system to the pixel plane under the image coordinate system to obtain the projection points in the pixel plane;
  • Step S4014 based on the depth values of the three-dimensional marked points on the extended area and the projected points in the pixel plane, the marked depth map is obtained.
  • the depth value of each three-dimensional label point on the extended area is determined based on the position coordinates and depth values of the center point of the bottom surface of the label box included in the three-dimensional label box information, including the following steps:
  • Step S41 determining the depth value and position coordinates of the center point of the bottom surface of the label frame as the depth value and position coordinates of the center point of the extended area, respectively;
  • Step S42 in the case of determining the position coordinates of the central point of the extended area, use the depth value of the central point of the extended area as the initial depth value, and determine each three-dimensional annotation in the extended area at a preset depth interval The depth value of the point.
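  • Steps S41 and S42 can be sketched as follows, assuming a rectangular extended area sampled on a regular grid around the bottom-face center; the grid size, step, and depth interval are placeholder values.

```python
import numpy as np

def extended_area_points(center_xyz, center_depth, rows=15, cols=15, step=0.2, depth_interval=0.2):
    """Generate 3D label points on the extended ground area and their depth values."""
    cx, cy, cz = center_xyz                        # bottom-face center, camera coordinates
    points, depths = [], []
    for i in range(-rows, rows + 1):               # along the depth direction
        for j in range(-cols, cols + 1):           # along the lateral direction
            points.append((cx + j * step, cy, cz + i * step))   # same ground height as the center
            depths.append(center_depth + i * depth_interval)    # preset depth interval per row
    return np.array(points), np.array(depths)
```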
  • An embodiment of the present disclosure provides a method for generating local ground depth labels.
  • the depth information of the surrounding ground (corresponding to the extended area) can be obtained by using the position of the center point of the bottom surface of the annotation frame (a point that falls on the ground) included in the three-dimensional annotation frame information of the target object.
  • since the center point of the bottom surface of the three-dimensional annotation frame is at the same height as the surrounding ground, as shown in figure (b) of FIG. 2I, a large number of three-dimensional label points 23 can be generated in an extended area 22 around this center point; these points include the center point of the extended area 22 with its initial depth value, as well as the other three-dimensional label points 23 on the extended area whose depth values are determined at the preset depth interval.
  • as shown in figure (b) of FIG. 2I, the projection relationship can be used to project the three-dimensional label points 23 onto the pixel plane to obtain the projection points corresponding to the three-dimensional label points, and the depth values of the three-dimensional label points 23 are recorded. For each projection point, the average depth value of the at least one corresponding three-dimensional label point can be taken, so that the annotated depth map 24 shown in figure (c) of FIG. 2I is obtained, from which the depth information of the surrounding ground (corresponding to the extended area) can be read.
  • in the projection relationship, (x3d, y3d, z3d) represent the camera coordinates of a three-dimensional label point, (xp, yp) represent its projection point on the pixel plane, and P_rect and R_rect represent the projection matrix and the rotation correction matrix, respectively.
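  • The projection of the three-dimensional label points onto the pixel plane can be sketched as below, assuming KITTI-style calibration matrices; the numeric values of P_rect are placeholders, and the real calibration comes from the camera that captured the image samples.

```python
import numpy as np

def project_to_pixel(points_3d, P_rect, R_rect):
    """points_3d: (N, 3) camera coordinates -> (N, 2) pixel coordinates (xp, yp)."""
    homog = np.hstack([points_3d, np.ones((len(points_3d), 1))])   # (N, 4) homogeneous points
    R4 = np.eye(4)
    R4[:3, :3] = R_rect                                            # rotation correction matrix
    proj = (P_rect @ R4 @ homog.T).T                               # (N, 3) image-plane coordinates
    return proj[:, :2] / proj[:, 2:3]                              # divide by depth

P_rect = np.array([[721.5, 0.0, 609.6, 44.9],                      # placeholder calibration values
                   [0.0, 721.5, 172.9, 0.2],
                   [0.0, 0.0, 1.0, 0.003]])
R_rect = np.eye(3)
pixels = project_to_pixel(np.array([[1.0, 1.6, 20.0]]), P_rect, R_rect)
# The annotated depth map then stores, at each projection point, the average depth
# value of the three-dimensional label points that project onto it.
```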
  • inputting the feature map of the image to be detected into the trained depth map generation network can determine the depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground, and the three-dimensional prediction of the target object can then be realized by combining the feature map and the three-dimensional prior frame information.
  • the feature map of the image to be detected can first be extracted through the feature extraction network 32. Then, on the one hand, two-dimensional detection is performed through the first target detection network 33 to obtain two-dimensional detection information for the target object; on the other hand, the depth map 35 is generated based on the feature map.
  • the three-dimensional prior frame information for the target object may then be determined; FIG. 2J shows, as an example, three pieces of 3D prior frame information determined for the corresponding three subcategories 36.
  • the cropping in ROI-Align mode can be performed based on the two-dimensional detection information, and the depth map and feature map obtained by cropping can then be input into the second target detection network 37 to obtain the corresponding three-dimensional detection frame offsets, such as Δ(w, h, l)3d and Δz3d shown in FIG. 2J.
  • the 3D detection information can be determined by combining the above 3D detection frame offset and 3D prior frame information. In practical applications, the above three-dimensional detection information can be presented on the image to be detected.
  • the embodiments of the present disclosure also provide a target detection device corresponding to the target detection method. Since the problem-solving principle of the device is similar to that of the above-mentioned target detection method of the embodiments of the present disclosure, the implementation of the device can refer to the implementation of the method.
  • Fig. 3 is a schematic diagram of a target detection device provided by an embodiment of the present disclosure, the device includes: an extraction module 301, a generation module 302, and a first detection module 303; wherein,
  • the extraction module 301 is configured to perform feature extraction on the image to be detected to obtain a feature map of the image to be detected;
  • the generation module 302 is configured to generate a depth map corresponding to a projection area where the target object in the image to be detected is projected to the ground based on the feature map;
  • the first detection module 303 is configured to determine three-dimensional detection information of the target object based on the depth map and the feature map.
  • In this way, not only can feature extraction be performed on the image to be detected, but a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground can also be generated based on the extracted feature map, and the 3D detection information of the target object can then be determined based on the depth map and the feature map. Since the generated depth map points to the target object in the image to be detected and corresponds to the projection area where the target object is projected onto the ground, the projection area is associated with the target object to a certain extent. In this way, the depth map corresponding to the local ground can be used as a guide for the three-dimensional detection, thereby improving the accuracy of detection.
  • the above-mentioned device also includes:
  • the second detection module 304 is configured to detect the feature map after obtaining the feature map of the image to be detected, and obtain two-dimensional detection information for the target object;
  • the first detection module 303 is configured to determine the three-dimensional detection information of the target object based on the depth map and the feature map according to the following steps:
  • the 3D detection frame information of the target object is determined.
  • the two-dimensional detection information includes the two-dimensional detection frame information where the target object is located and the category information of the target object; the first detection module 303 is configured to determine the 3D prior frame information of the target object based on the two-dimensional detection information according to the following steps:
  • the three-dimensional prior frame information for the target object is determined.
  • the first detection module 303 is configured to determine the three-dimensional prior frame information for the target object according to the clustering information of each subcategory and the two-dimensional detection frame information where the target object is located according to the following steps:
  • for each of the subcategories, determining the depth value corresponding to the subcategory based on the cluster height value included in the clustering information of the subcategory and the width value included in the two-dimensional detection frame information where the target object is located;
  • a three-dimensional prior frame information for the target object is determined.
  • the first detection module 303 is configured to determine the 3D detection frame information of the target object based on the 3D prior frame information, the depth map and the feature map according to the following steps:
  • the 3D detection frame information of the target object is determined.
  • the first detection module 303 is configured to determine the offset of the three-dimensional detection frame according to the depth map and the feature map according to the following steps:
  • the first detection module 303 is configured to determine the 3D detection frame of the target object based on the 3D prior frame information and the offset of the 3D detection frame according to the following steps information:
  • the three-dimensional detection frame information of the target object is determined based on each three-dimensional prior frame information, the weight corresponding to each three-dimensional prior frame information, and the three-dimensional detection frame offset.
  • the first detection module 303 is configured to determine the weight corresponding to each three-dimensional prior frame information according to the following steps:
  • the weight of the three-dimensional prior frame information corresponding to each subcategory is determined.
  • the second detection module 304 is configured to detect the feature map according to the following steps to obtain two-dimensional detection information for the target object:
  • the two-dimensional detection information of the target object is determined based on the preset two-dimensional prior frame information and the offset of the two-dimensional detection frame.
  • the depth map is determined by a trained depth map generation network; the depth map generation network is obtained by training with image samples and an annotated depth map determined based on the three-dimensional annotation frame information of the target object marked in the image samples.
  • the three-dimensional annotation frame information of the target object includes position coordinates and depth values of the central point of the bottom surface of the annotation frame; the generation module 302 is configured to obtain the annotation depth map according to the following steps:
  • the information of the three-dimensional annotation frame marked for the target object is projected onto the ground, and the projection area of the target object on the ground and the extended area where the projection area is located are obtained;
  • each three-dimensional label point on the extended area under the camera coordinate system is projected to the pixel plane under the image coordinate system to obtain the projection point in the pixel plane;
  • a marked depth map is obtained based on the depth values of each three-dimensional marked point on the extended area and the projected point in the pixel plane.
  • the generation module 302 is configured to determine the depth value of each three-dimensional label point on the extended area based on the position coordinates and depth values of the center point of the bottom surface of the label box included in the three-dimensional label box information according to the following steps:
  • the depth value of the central point of the extended area is used as the initial depth value, and the depth values of each three-dimensional label point in the extended area are determined at preset depth intervals.
  • FIG. 4 is a schematic structural diagram of the electronic device provided by the embodiment of the present disclosure, including: a processor 401 , a memory 402 , and a bus 403 .
  • the memory 402 stores machine-readable instructions executable by the processor 401 (for example, the execution instructions corresponding to the extraction module 301, the generation module 302, and the first detection module 303 in the device in FIG. 3). When the electronic device is running, the processor 401 communicates with the memory 402 through the bus 403, and when the machine-readable instructions are executed by the processor 401, the following processes are performed:
  • performing feature extraction on the image to be detected to obtain a feature map of the image to be detected; generating, based on the feature map, a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground; and determining, based on the depth map and the feature map, the 3D detection information of the target object.
  • Embodiments of the present disclosure further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the steps of the method for object detection described in the foregoing method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • the embodiments of the present disclosure also provide a computer program product; the computer program product carries program code, and the instructions included in the program code can be used to execute the steps of the method for target detection described in the above method embodiments; refer to the above method embodiments for details.
  • the above-mentioned computer program product may be realized by hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and the like.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the functions are implemented in the form of software function units and sold or used as independent products, they can be stored in a non-volatile computer-readable storage medium executable by a processor.
  • the technical solution of the present disclosure is essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in various embodiments of the present disclosure.
  • the aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disc, and other media that can store program codes.
  • the method for target detection includes: performing feature extraction on the image to be detected to obtain a feature map of the image to be detected; based on the feature map, generating a projection of the target object in the image to be detected projected to the ground A depth map corresponding to the region; based on the depth map and the feature map, determine the three-dimensional detection information of the target object.
  • with the above target detection method, not only can feature extraction be performed on the image to be detected, but a depth map corresponding to the projection area where the target object in the image to be detected is projected onto the ground can also be generated based on the extracted feature map, and the 3D detection information of the target object can then be determined based on the depth map and the feature map. Since the generated depth map points to the target object in the image to be detected and corresponds to the projection area where the target object is projected onto the ground, the projection area is associated with the target object to a certain extent. In this way, the depth map corresponding to the local ground can be used as a targeted guide when performing three-dimensional detection on the feature map of the target object on the local ground, thereby improving the accuracy of detection.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a target detection method and apparatus, an electronic device, a storage medium, and a computer program product. The method comprises: performing feature extraction on an image to be detected to obtain a feature map of said image; on the basis of the feature map, generating a depth map corresponding to a projection region in which a target object in said image is projected onto the ground; and, on the basis of the depth map and the feature map, determining three-dimensional detection information of the target object. The projection region of the present disclosure is associated with the target object to a certain extent. In this way, the depth map corresponding to the local ground can guide, in a targeted manner, the feature map of the target object on the local ground so as to perform three-dimensional detection, thereby improving detection accuracy.
PCT/CN2022/090957 2021-09-30 2022-05-05 Procédé et appareil de détection de cible, dispositif électronique, support d'enregistrement et produit programme d'ordinateur WO2023050810A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111164729.9 2021-09-30
CN202111164729.9A CN114119991A (zh) 2021-09-30 2021-09-30 一种目标检测的方法、装置、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023050810A1 true WO2023050810A1 (fr) 2023-04-06

Family

ID=80441823

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090957 WO2023050810A1 (fr) 2021-09-30 2022-05-05 Procédé et appareil de détection de cible, dispositif électronique, support d'enregistrement et produit programme d'ordinateur

Country Status (2)

Country Link
CN (1) CN114119991A (fr)
WO (1) WO2023050810A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116681660A (zh) * 2023-05-18 2023-09-01 中国长江三峡集团有限公司 一种目标对象缺陷检测方法、装置、电子设备及存储介质
CN117315402A (zh) * 2023-11-02 2023-12-29 北京百度网讯科技有限公司 三维对象检测模型的训练方法及三维对象检测方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119991A (zh) * 2021-09-30 2022-03-01 深圳市商汤科技有限公司 一种目标检测的方法、装置、电子设备及存储介质
CN116189150B (zh) * 2023-03-02 2024-05-17 吉咖智能机器人有限公司 基于融合输出的单目3d目标检测方法、装置、设备和介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046767A (zh) * 2019-12-04 2020-04-21 武汉大学 一种基于单目图像的3d目标检测方法
US20200160033A1 (en) * 2018-11-15 2020-05-21 Toyota Research Institute, Inc. System and method for lifting 3d representations from monocular images
CN111832338A (zh) * 2019-04-16 2020-10-27 北京市商汤科技开发有限公司 对象检测方法及装置、电子设备和存储介质
CN112733672A (zh) * 2020-12-31 2021-04-30 深圳一清创新科技有限公司 基于单目相机的三维目标检测方法、装置和计算机设备
CN114119991A (zh) * 2021-09-30 2022-03-01 深圳市商汤科技有限公司 一种目标检测的方法、装置、电子设备及存储介质


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116681660A (zh) * 2023-05-18 2023-09-01 中国长江三峡集团有限公司 一种目标对象缺陷检测方法、装置、电子设备及存储介质
CN116681660B (zh) * 2023-05-18 2024-04-19 中国长江三峡集团有限公司 一种目标对象缺陷检测方法、装置、电子设备及存储介质
CN117315402A (zh) * 2023-11-02 2023-12-29 北京百度网讯科技有限公司 三维对象检测模型的训练方法及三维对象检测方法

Also Published As

Publication number Publication date
CN114119991A (zh) 2022-03-01

Similar Documents

Publication Publication Date Title
WO2023050810A1 (fr) Procédé et appareil de détection de cible, dispositif électronique, support d'enregistrement et produit programme d'ordinateur
CN111060115B (zh) 一种基于图像边缘特征的视觉slam方法及系统
KR102225093B1 (ko) 카메라 포즈 추정 장치 및 방법
CN108288088B (zh) 一种基于端到端全卷积神经网络的场景文本检测方法
CN110807350B (zh) 用于面向扫描匹配的视觉slam的系统和方法
CN110226186B (zh) 表示地图元素的方法和装置以及定位的方法和装置
US20180365504A1 (en) Obstacle type recognizing method and apparatus, device and storage medium
US10606824B1 (en) Update service in a distributed environment
CN108734058B (zh) 障碍物类型识别方法、装置、设备及存储介质
Zhou et al. Moving object detection and segmentation in urban environments from a moving platform
WO2016201670A1 (fr) Procédé et appareil pour représenter un élément de carte et procédé et appareil pour localiser un véhicule/robot
WO2022161140A1 (fr) Procédé et appareil de détection cible, et dispositif informatique et support de stockage
JP2019132664A (ja) 自車位置推定装置、自車位置推定方法、及び自車位置推定プログラム
CN114495026A (zh) 一种激光雷达识别方法、装置、电子设备和存储介质
WO2024077935A1 (fr) Procédé et appareil de positionnement de véhicule à base de slam visuel
CN110992424B (zh) 基于双目视觉的定位方法和系统
Mazoul et al. Fast spatio-temporal stereo for intelligent transportation systems
CN114140527A (zh) 一种基于语义分割的动态环境双目视觉slam方法
CN113409340A (zh) 语义分割模型训练方法、语义分割方法、装置及电子设备
US11657506B2 (en) Systems and methods for autonomous robot navigation
CN113763468A (zh) 一种定位方法、装置、系统及存储介质
Yamazaki et al. Discovering correspondence among image sets with projection view preservation for 3D object detection in point clouds
US20190102885A1 (en) Image processing method and image processing apparatus
Wang et al. Integrated pedestrian detection and localization using stereo cameras
WO2018120932A1 (fr) Appareil et procédé d'optimisation de données de balayage et appareil et procédé de correction de trajectoire

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874208

Country of ref document: EP

Kind code of ref document: A1