WO2023221566A1 - A 3D target detection method and device based on multi-view fusion - Google Patents

A 3D target detection method and device based on multi-view fusion

Info

Publication number
WO2023221566A1
WO2023221566A1 (PCT/CN2023/074861)
Authority
WO
WIPO (PCT)
Prior art keywords
bird
camera
eye view
image
target object
Prior art date
Application number
PCT/CN2023/074861
Other languages
English (en)
French (fr)
Inventor
李翔宇
朱红梅
张骞
任伟强
Original Assignee
北京地平线机器人技术研发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京地平线机器人技术研发有限公司
Publication of WO2023221566A1 publication Critical patent/WO2023221566A1/zh

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Definitions

  • the present disclosure relates to the field of computer vision, and specifically to a 3D target detection method and device based on multi-view fusion.
  • the autonomous driving carrier can perform 3D detection of target objects (vehicles, pedestrians, cyclists, etc.) within a certain distance around it to obtain the three-dimensional spatial information of the target objects. Based on the three-dimensional spatial information of the target object, the distance and speed of the target object are measured to achieve better driving control.
  • autonomous driving carriers can collect multiple images with different viewing angles, then perform 3D detection on each image, and finally fuse the 3D detection results of each image to generate three-dimensional spatial information of target objects in the surrounding environment of the carrier.
  • the existing technical solution requires 3D detection of each image collected by the autonomous driving carrier, and then fusion of the 3D detection results of each image to obtain information about other vehicles in the 360-degree environment around the carrier, resulting in low detection efficiency.
  • Embodiments of the present disclosure provide a 3D target detection method and device based on multi-view fusion.
  • a 3D target detection method based on multi-view fusion including:
  • Target prediction is performed on the target object in the bird's-eye view fusion feature to obtain three-dimensional spatial information of the target object.
  • a 3D target detection device based on multi-view fusion including:
  • An image receiving module used to obtain at least one image collected from multiple camera perspectives
  • a feature extraction module configured to perform feature extraction on the at least one image obtained by the image receiving module, and obtain corresponding feature data containing target object features corresponding to the at least one image in a multi-camera perspective space;
  • the image feature mapping module is used to map the corresponding feature data of the at least one image obtained by the feature extraction module in the multi-camera perspective space to the same bird's-eye view space based on the internal parameters and vehicle parameters of the multi-camera system, to obtain the corresponding feature data of the at least one image in the bird's-eye view space;
  • An image fusion module configured to perform feature fusion on the corresponding feature data of the at least one image obtained by the image feature mapping module in a bird's-eye view space, to obtain a bird's-eye view fusion feature
  • the 3D detection module is used to perform target prediction on the target object in the bird's-eye view fusion feature obtained by the image fusion module, and obtain the three-dimensional spatial information of the target object.
  • a computer-readable storage medium stores a computer program.
  • the computer program is used to execute the above-mentioned 3D target detection method based on multi-view fusion.
  • an electronic device including:
  • memory for storing instructions executable by the processor
  • the processor is configured to read the executable instructions from the memory and execute the instructions to implement the above-mentioned 3D target detection method based on multi-view fusion.
  • feature extraction is performed on at least one image from multi-camera viewpoints collected by the multi-camera system, and based on the internal parameters of the multi-camera system, the extracted feature data containing target object features in the multi-camera perspective space is mapped to the same bird's-eye view space to obtain the corresponding feature data of the at least one image in the bird's-eye view space; feature fusion is then performed on the corresponding feature data of the at least one image in the bird's-eye view space to obtain the bird's-eye view fusion feature.
  • multi-viewpoint feature fusion is first performed and then 3D target detection is performed.
  • the 3D target detection of scene objects from a bird's-eye view is completed end-to-end, avoiding the post-processing stage of conventional multi-view 3D target detection and improving detection efficiency.
  • Figure 1 is a scene diagram to which the present disclosure is applicable.
  • Figure 2 is a system block diagram of a vehicle-mounted automatic driving system provided by an embodiment of the present disclosure.
  • Figure 3 is a flow chart of a 3D target detection method based on multi-view fusion provided by an exemplary embodiment of the present disclosure.
  • FIG. 4 is a schematic block diagram of a multi-camera system collecting images according to an exemplary embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of images from multiple camera perspectives provided by an exemplary embodiment of the present disclosure.
  • Figure 6 is a schematic block diagram of feature extraction provided by an exemplary embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of generating a bird's-eye view image from images collected by a multi-camera system according to an exemplary embodiment of the present disclosure.
  • Figure 8 is a schematic block diagram of target detection provided by an exemplary embodiment of the present disclosure.
  • FIG. 9 is a flowchart for determining feature data in a bird's-eye view space provided by an exemplary embodiment of the present disclosure.
  • FIG. 10 is a schematic block diagram of performing step S303 and step S304 provided by an exemplary embodiment of the present disclosure.
  • Figure 11 is a flow chart of target detection provided by an exemplary embodiment of the present disclosure.
  • Figure 12 is a schematic diagram of the output results of a prediction network provided by an exemplary embodiment of the present disclosure.
  • Figure 13 is another flowchart of target detection provided by an exemplary embodiment of the present disclosure.
  • Figure 14 is a schematic diagram of a Gaussian kernel provided by an exemplary embodiment of the present disclosure.
  • Figure 15 is a schematic diagram of a heat map provided by an exemplary embodiment of the present disclosure.
  • Figure 16 is another flowchart of target detection provided by an exemplary embodiment of the present disclosure.
  • Figure 17 is a structural diagram of a 3D target detection device based on multi-view fusion provided by an exemplary embodiment of the present disclosure.
  • Figure 18 is another structural diagram of a 3D target detection device based on multi-view fusion provided by an exemplary embodiment of the present disclosure.
  • Figure 19 is a structural block diagram of an electronic device provided by an exemplary embodiment of the present disclosure.
  • the autonomous driving carrier can perform real-time detection of target objects (such as vehicles, pedestrians, cyclists, etc.) within a certain distance around the carrier to obtain the three-dimensional spatial information of the target objects (such as attributes like location, size, orientation, and category). Based on the three-dimensional spatial information of the target object, the distance and speed of the target object are measured to achieve better driving control.
  • the autonomous driving carrier can be a vehicle, an airplane, etc.
  • Autonomous driving carriers can use a multi-camera system to collect multiple images with different viewing angles, and then perform 3D target detection on each image separately, for example filtering and deduplicating target objects across the multiple images collected by cameras with different viewing angles. Finally, the 3D detection results of each image are fused to generate the three-dimensional spatial information of the target objects in the surrounding environment of the carrier. It can be seen that the existing technical solution requires 3D detection of each image collected by the autonomous driving carrier, and then fusion of the 3D detection results of each image, resulting in low detection efficiency.
  • embodiments of the present disclosure provide a 3D target detection method and device based on multi-view fusion.
  • the autonomous driving carrier can perform feature extraction on at least one image from the multi-camera perspective collected by the multi-camera system to obtain feature data containing the characteristics of the target object in the multi-camera perspective space.
  • the feature data in the multi-camera perspective space is mapped to the same bird's-eye perspective space, and the corresponding feature data of at least one image in the bird's-eye perspective space is obtained.
  • When the solution of the embodiment of the present disclosure performs 3D target detection based on multi-view fusion, it simultaneously maps the feature data of at least one image under multiple camera views to the same bird's-eye view space, enabling more reasonable and effective fusion.
  • Based on the fused bird's-eye view fusion feature, the three-dimensional spatial information of each target object in the environment surrounding the carrier is directly detected in the bird's-eye view space.
  • In this way, multi-view feature fusion is first performed and then 3D target detection is performed, and the 3D target detection of scene objects from a bird's-eye view is completed end-to-end, avoiding the post-processing stage of conventional multi-view 3D target detection and improving detection efficiency.
  • the embodiments of the present disclosure can be applied to application scenarios that require 3D target detection, such as autonomous driving application scenarios.
  • a multi-camera system is configured on an autonomous driving carrier (hereinafter referred to as "carrier"), images from different perspectives are collected through the multi-camera system, and 3D target detection based on multi-view fusion is performed to obtain the three-dimensional spatial information of target objects in the surrounding environment of the carrier.
  • Figure 1 is a scene diagram to which the present disclosure is applicable.
  • the carrier 100 for assisted driving or automatic driving is configured with a vehicle-mounted automatic driving system 200 and a multi-camera system 300.
  • The vehicle-mounted automatic driving system 200 is electrically connected to the multi-camera system 300.
  • the multi-camera system 300 is used to collect images of the environment surrounding the carrier.
  • the vehicle-mounted automatic driving system 200 is used to obtain the images collected by the multi-camera system 300 and perform 3D target detection based on multi-view fusion to obtain the three-dimensional spatial information of the target objects in the environment surrounding the carrier.
  • Figure 2 is a system block diagram of a vehicle-mounted automatic driving system provided by an embodiment of the present disclosure.
  • the vehicle-mounted automatic driving system 200 includes an image receiving module 201 , a feature extraction module 202 , an image feature mapping module 203 , an image fusion module 204 and a 3D detection module 205 .
  • the image receiving module 201 is used to obtain at least one image collected by the multi-camera system 300;
  • the feature extraction module 202 is used to perform feature extraction on at least one image obtained by the image receiving module 201 to obtain feature data;
  • the image feature mapping module 203 is used to map the feature data of the at least one image from the multi-camera perspective space to the same bird's-eye view space;
  • the image fusion module 204 is used to perform feature fusion on the corresponding feature data of the at least one image in the bird's-eye view space to obtain a bird's-eye view fusion feature;
  • the 3D detection module 205 is used to perform target prediction on the target objects in the bird's-eye view fusion feature obtained by the image fusion module 204, and obtain the three-dimensional spatial information of the target objects.
  • the multi-camera system 300 includes multiple cameras with different viewing angles, each camera is used to collect an environmental image from one viewing angle, and the multiple cameras cover a 360-degree environmental range around the carrier.
  • Each camera defines its own camera perspective coordinate system, and forms its own camera perspective space through its respective camera perspective coordinate system.
  • the environment image collected by each camera is an image in the corresponding camera perspective space.
  • Figure 3 is a flow chart of a 3D target detection method based on multi-view fusion provided by an exemplary embodiment of the present disclosure.
  • This embodiment can be applied to the vehicle-mounted automatic driving system 200, as shown in Figure 3, including the following steps:
  • Step S301 Obtain at least one image collected from multiple camera perspectives.
  • At least one image may be collected by at least one camera of the multi-camera system.
  • the at least one image may be an image collected in real time by a multi-camera system or an image collected in advance by a multi-camera system.
  • FIG. 4 is a schematic block diagram of a multi-camera system collecting images according to an exemplary embodiment of the present disclosure.
  • the multi-camera system can collect multiple images from different viewing angles in real time, such as images 1, 2...N, and send the collected images to the vehicle-mounted automatic driving system in real time.
  • images 1, 2...N can represent the real situation of the environment surrounding the carrier at the current moment.
  • FIG. 5 is a schematic diagram of images from multiple camera perspectives provided by an exemplary embodiment of the present disclosure.
  • the multi-camera system may include 6 cameras. Six cameras are respectively set at the front end, left front end, right front end, rear end, left rear end and right rear end of the carrier. In this way, at any time, the multi-camera system can collect images from 6 different viewing angles, such as front view image (I front ), left front view image (I frontleft ), right front view image (I frontright ), rear view image ( I rear ), left rear view image (I rearleft ) and right rear view image (I rearright ).
  • Each image includes, but is not limited to, various types of target objects such as roads, traffic lights, street signs, vehicles (small cars, buses, trucks, etc.), pedestrians, and cyclists.
  • the categories, positions, etc. of target objects in the surrounding environment of the carrier are different, the categories, positions, etc. of the target objects included in each image also differ.
  • Step S302 Perform feature extraction on at least one image to obtain corresponding feature data containing target object features corresponding to at least one image in the multi-camera perspective space.
  • the vehicle-mounted autonomous driving system can extract feature data in the corresponding camera perspective space from each image.
  • the feature data may include target object features used to describe the target object in the image.
  • the target object features include but are not limited to image texture information, edge contour information, semantic information, etc.
  • the image texture information is used to characterize the image texture of the target object
  • the edge contour information is used to characterize the edge contour of the target object
  • the semantic information is used to characterize the category of the target object.
  • the categories of target objects include but are not limited to: roads, traffic lights, street signs, vehicles (small cars, buses, trucks, etc.), pedestrians, cyclists, etc.
  • Figure 6 is a schematic block diagram of feature extraction provided by an exemplary embodiment of the present disclosure.
  • the vehicle-mounted autonomous driving system can use neural networks to extract features from at least one image (image 1-N), and obtain corresponding feature data 1-N for each image in the multi-camera perspective space.
  • For example, the vehicle-mounted autonomous driving system performs feature extraction on the front-view image (I front) to obtain the feature data f front of the front-view image in the front camera perspective space; performs feature extraction on the left front-view image (I frontleft) to obtain the feature data f frontleft of the left front-view image in the left front camera perspective space; performs feature extraction on the right front-view image (I frontright) to obtain the feature data f frontright of the right front-view image in the right front camera perspective space; performs feature extraction on the rear-view image (I rear) to obtain the feature data f rear of the rear-view image in the rear camera perspective space; performs feature extraction on the left rear-view image (I rearleft) to obtain the feature data f rearleft of the left rear-view image in the left rear camera perspective space; and performs feature extraction on the right rear-view image (I rearright) to obtain the feature data f rearright of the right rear-view image in the right rear camera perspective space.
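  • The per-view feature extraction above can be pictured with a minimal sketch. The snippet below is illustrative only: it assumes a shared torchvision ResNet-18 backbone truncated before the classification head, dummy 512×512 inputs, and view names matching I front, I frontleft, etc.; the disclosure does not prescribe a specific network.
```python
import torch
import torchvision

# Shared 2D backbone applied to every camera view (assumption: ResNet-18 trunk;
# the disclosure only requires "a neural network" such as resnet/densenet/mobilenet).
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet18(weights=None).children())[:-2]
)
backbone.eval()

view_names = ["front", "frontleft", "frontright", "rear", "rearleft", "rearright"]

# images: dict view_name -> tensor of shape (3, H, W); dummy data stands in for the
# frames collected by the multi-camera system.
images = {name: torch.rand(3, 512, 512) for name in view_names}

camera_view_features = {}
with torch.no_grad():
    for name in view_names:
        # f_<view>: feature data of image I_<view> in that camera's perspective space
        camera_view_features[name] = backbone(images[name].unsqueeze(0))  # (1, C, H/32, W/32)

print({k: tuple(v.shape) for k, v in camera_view_features.items()})
```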
  • Step S303 Based on the internal parameters of the multi-camera system and the vehicle parameters, map the corresponding feature data of the at least one image in the multi-camera perspective space to the same bird's-eye view space, to obtain the corresponding feature data of the at least one image in the bird's-eye view space.
  • the internal parameters of the multi-camera system include the internal parameters of each camera and the external parameters of the camera.
  • the internal parameters of the camera are parameters related to the camera's own characteristics, such as the focal length of the camera, pixel size, etc.; the external parameters of the camera are its parameters in the world coordinate system, such as the camera position, rotation direction, etc.
  • the vehicle parameters refer to the transformation matrix from the vehicle coordinate system (Vehicle Coordinate System, VCS) to the bird's-eye view coordinate system (BEV).
  • the vehicle coordinate system is the coordinate system where the vehicle is located.
  • the vehicle-mounted autonomous driving system maps the feature data f front of the front-view image (I front ) in the front-end camera perspective space to the same bird's-eye perspective space, and obtains the feature data F front of the front-view image (I front ) in the bird's-eye perspective space.
  • Step S304 Feature fusion is performed on corresponding feature data of at least one image in the bird's-eye view space to obtain bird's-eye view fusion features.
  • the bird's-eye view fusion feature is used to represent the characteristic data of the target object around the carrier in the bird's-eye view space.
  • the characteristic data of the target object in the bird's-eye view space can include, but is not limited to, attributes such as the shape, size, category, orientation angle, and relative position of the target object.
  • the vehicle-mounted autonomous driving system can perform additive feature fusion on corresponding feature data of at least one image in a bird's-eye view space to obtain a bird's-eye view fusion feature. Specifically, it can be expressed as the following formula:
  • F' = Add(F front , F frontleft , F frontright , F rear , F rearleft , F rearright )
  • F' represents the bird's-eye view fusion feature
  • Add represents the additive feature fusion calculation of the corresponding feature data of at least one image in the bird's-eye view space.
  • However, the feature fusion method in step S304 is not limited to this. Multiplication, superposition, etc. may also be used to perform feature fusion on the corresponding feature data of the images from different camera perspectives in the bird's-eye view space.
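  • As a concrete illustration of the additive fusion F' = Add(...) and of the alternatives just mentioned, the sketch below combines per-view feature maps that have already been warped into the same bird's-eye view grid. Tensor shapes and the fuse_bev_features helper are illustrative assumptions, not part of the disclosure.
```python
import torch

def fuse_bev_features(bev_features, mode="add"):
    """Fuse per-view feature maps that already live in the same bird's-eye view space.

    bev_features: list of tensors F_front, F_frontleft, ..., each of shape (C, H_bev, W_bev).
    mode: "add" reproduces F' = Add(F_front, ..., F_rearright); "mul" and "concat"
          illustrate the alternative fusion strategies mentioned in the text.
    """
    stacked = torch.stack(bev_features, dim=0)      # (num_views, C, H_bev, W_bev)
    if mode == "add":
        return stacked.sum(dim=0)                   # element-wise additive fusion
    if mode == "mul":
        return stacked.prod(dim=0)                  # element-wise multiplicative fusion
    if mode == "concat":
        return torch.cat(bev_features, dim=0)       # channel-wise superposition
    raise ValueError(f"unknown fusion mode: {mode}")

# Example: six dummy BEV feature maps with 64 channels on a 512x512 BEV grid.
views = [torch.rand(64, 512, 512) for _ in range(6)]
bev_fused = fuse_bev_features(views, mode="add")    # F'
print(bev_fused.shape)                              # torch.Size([64, 512, 512])
```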
  • FIG. 7 is a schematic diagram of generating a bird's-eye view image from images collected by a multi-camera system according to an exemplary embodiment of the present disclosure.
  • the size of the bird's-eye view image may be the same as the size of at least one image collected by the multi-camera system.
  • the bird's-eye view image can reflect the three-dimensional spatial information of the target object.
  • the three-dimensional spatial information includes at least one attribute information of the target object.
  • the attributes include but are not limited to 3D position information (i.e., coordinate information of the X-axis, Y-axis, and Z-axis), size information (i.e., length, width, and height information), orientation angle information, etc.
  • the coordinate information of the X-axis, Y-axis, and Z-axis refers to the coordinate position (x, y, z) of the target object in the bird's-eye view space.
  • the origin of the coordinate system in the bird's-eye view space is located at any position such as the chassis of the carrier or the center of the carrier.
  • the X-axis direction is from front to back
  • the Y-axis direction is from left to right
  • the Z-axis direction is vertically up and down.
  • the heading angle refers to the angle formed by the front direction or the traveling direction of the target object in the bird's-eye perspective space.
  • the heading angle refers to the angle formed by the pedestrian's traveling direction in the bird's-eye perspective space.
  • the heading angle refers to the angle formed by the vehicle's front direction in the bird's-eye perspective space.
  • the bird's-eye view image may include bird's-eye view fusion features of target objects of different categories.
  • Step S305 Perform target prediction on the target object in the bird's-eye view fusion feature to obtain the three-dimensional spatial information of the target object.
  • the three-dimensional spatial information may include: at least one of the position, size, orientation angle and other attributes of the target object in a bird's-eye view coordinate system.
  • the position refers to the coordinate position (x, y, z) of the target object relative to the carrier in the bird's-eye view space
  • the size refers to the length, width, and height (Height, Width, Length) of the target object in the bird's-eye view space
  • the orientation angle refers to the orientation angle (rotation yaw) of the target object in the bird's-eye view space.
  • Figure 8 is a schematic block diagram of target detection provided by an exemplary embodiment of the present disclosure.
  • the vehicle-mounted autonomous driving system can use one or more prediction networks to perform 3D target prediction on the target objects in the bird's-eye view fusion feature to obtain the three-dimensional spatial information of each target object in the surrounding environment of the carrier.
  • each prediction network can output one or more attributes of the target object, and the attributes output by different prediction networks are also different.
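  • The multiple prediction networks ("heads") on top of the bird's-eye view fusion feature could look roughly like the sketch below. The channel counts, the particular set of heads (heat map, z, size, orientation), and the small convolutional structure are assumptions for illustration; the disclosure only states that different prediction networks output different attributes of the target object.
```python
import torch
import torch.nn as nn

class BEVDetectionHeads(nn.Module):
    """Per-attribute prediction heads over the BEV fusion feature (illustrative sketch)."""

    def __init__(self, in_channels=64):
        super().__init__()

        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, out_channels, 1),
            )

        self.heatmap_head = head(1)   # peaks encode the (x, y) "first preset coordinate value"
        self.z_head = head(1)         # second preset coordinate value (z)
        self.size_head = head(3)      # length, width, height
        self.yaw_head = head(1)       # orientation angle (rotation yaw)

    def forward(self, bev_fused):     # bev_fused: (B, C, H_bev, W_bev)
        return {
            "heatmap": torch.sigmoid(self.heatmap_head(bev_fused)),
            "z": self.z_head(bev_fused),
            "size": self.size_head(bev_fused),
            "yaw": self.yaw_head(bev_fused),
        }

heads = BEVDetectionHeads(in_channels=64)
outputs = heads(torch.rand(1, 64, 512, 512))
print({k: tuple(v.shape) for k, v in outputs.items()})
```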
  • Since the solution of the embodiment of the present disclosure performs 3D target detection based on multi-view fusion, it can first perform multi-view feature fusion and then perform 3D target detection, completing end-to-end 3D target detection of scene objects from a bird's-eye view, avoiding the post-processing stage of conventional multi-view 3D target detection and improving detection efficiency.
  • FIG. 9 is a flowchart for determining feature data in a bird's-eye view space provided by an exemplary embodiment of the present disclosure.
  • step S303 may include the following steps:
  • Step S3031 Based on the internal parameters and vehicle parameters of the multi-camera system, determine the transformation matrix from the camera coordinate system of the multi-camera system to the bird's-eye view coordinate system.
  • the internal parameters of the multi-camera system include the internal parameters of each camera and the external parameters of the camera.
  • the external parameters of the multi-camera system refer to the transformation matrix from the camera coordinate system to the vehicle coordinate system.
  • the vehicle parameter refers to the transformation matrix from the vehicle coordinate system (Vehicle Coordinate System, VCS) to the bird's-eye view coordinate system (BEV).
  • the vehicle coordinate system is the coordinate system where the vehicle is located.
  • step S3031 includes:
  • Based on the camera external parameters, the camera internal parameters, and the transformation matrix from the vehicle coordinate system to the bird's-eye view coordinate system, the transformation matrix from the camera coordinate system of each camera in the multi-camera system to the bird's-eye view coordinate system is determined.
  • the vehicle-mounted autonomous driving system can determine the transformation matrix H from the camera coordinate system of each camera in the multi-camera system to the bird's-eye view coordinate system through matrix multiplication of the following terms:
  • @ represents matrix multiplication
  • T camera→vcs represents the transformation matrix from the camera coordinate system to the vehicle coordinate system, that is, the camera external parameters
  • T vcs→bev represents the transformation matrix from the vehicle coordinate system to the bird's-eye view coordinate system
  • K represents the camera internal parameters.
  • the transformation matrix from the vehicle coordinate system to the bird's-eye view coordinate system can be calculated from the artificially set range of the bird's-eye view (for example, a range of 100 meters in each of the front, rear, left, and right directions) and the resolution of the bird's-eye view image (for example, 512×512).
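  • The vcs→bev matrix described above can be written down directly from the chosen range and resolution, and the per-camera transformation H then follows by matrix multiplication. In the sketch below, the exact composition order and the use of the inverse intrinsics K⁻¹ (to map pixel coordinates back to camera rays) are assumptions made for illustration, as are the dummy calibration values; the disclosure only states that H is obtained from the camera external parameters, the camera internal parameters K, and T vcs→bev.
```python
import numpy as np

# Bird's-eye view covering 100 m to the front/rear/left/right at 512 x 512 resolution,
# as in the example in the text: a 200 m span mapped onto 512 pixels.
bev_range_m = 100.0
bev_size_px = 512
px_per_m = bev_size_px / (2 * bev_range_m)           # 2.56 pixels per meter

# T_vcs->bev: vehicle coordinates (meters, origin at the carrier) -> BEV pixel coordinates.
T_vcs_to_bev = np.array([
    [px_per_m, 0.0,      bev_size_px / 2.0],
    [0.0,      px_per_m, bev_size_px / 2.0],
    [0.0,      0.0,      1.0],
])

# Camera intrinsics K and camera->vehicle extrinsics (dummy calibration values; reducing
# the extrinsics to a 3x3 ground-plane matrix is also an assumption of this sketch).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
T_camera_to_vcs = np.eye(3)

# H maps the camera image plane to the BEV grid; the composition order is an assumption.
H_front_to_bev = T_vcs_to_bev @ T_camera_to_vcs @ np.linalg.inv(K)
print(H_front_to_bev)
```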
  • each camera in the multi-camera system can determine the corresponding transformation matrix.
  • For example, the vehicle-mounted autonomous driving system determines the transformation matrix H front→bev from the camera coordinate system of the front camera to the bird's-eye view coordinate system based on the transformation matrix from the camera coordinate system of the front camera to the vehicle coordinate system, the transformation matrix from the vehicle coordinate system to the bird's-eye view coordinate system, and the camera internal parameters of the front camera. In the same way, the transformation matrices H frontleft→bev, H frontright→bev, H rear→bev, H rearleft→bev and H rearright→bev are determined for the left front, right front, rear, left rear and right rear cameras, respectively.
  • In this way, the prediction network used for 3D target detection in this embodiment of the application is suitable for different multi-camera systems, and there is no need to train the prediction network from scratch, improving detection efficiency.
  • Step S3032 Based on the transformation matrix from the camera coordinate system of each camera to the bird's-eye view coordinate system, convert the corresponding feature data of the at least one image in the multi-camera perspective space from the multi-camera perspective space to the bird's-eye view space, to obtain the corresponding feature data of the at least one image in the bird's-eye view space.
  • the vehicle-mounted autonomous driving system can perform matrix multiplication between the transformation matrix of each camera and the feature data in the respective camera perspective space to obtain the corresponding feature data of the at least one image in the bird's-eye view space. Specifically, it can be expressed as the formula F = H @ f, where:
  • F represents the corresponding feature data F front , F frontleft , F frontright , F rear , F rearleft and F rearright of at least one image in the bird's-eye view space
  • H represents the transformation matrix H corresponding to each camera in the multi-camera system.
  • f represents the feature data f front , f frontleft of at least one image in the multi-camera perspective space , f frontright , f rear , f rearleft and f rearright .
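  • In practice, F = H @ f amounts to an image-warping operation: every cell of the camera-view feature map is re-sampled onto the BEV grid through the transformation H. The sketch below uses OpenCV's warpPerspective channel by channel; the feature-map size, channel count, and the placeholder H are illustrative assumptions.
```python
import numpy as np
import cv2

def warp_feature_to_bev(feature_cam, H, bev_size=(512, 512)):
    """Warp a camera-perspective feature map f (C, H_f, W_f) into the BEV grid via H.

    Implements F = H @ f as a per-pixel perspective warp (illustrative sketch).
    """
    channels, _, _ = feature_cam.shape
    bev_w, bev_h = bev_size
    bev_feature = np.zeros((channels, bev_h, bev_w), dtype=np.float32)
    for c in range(channels):
        # warpPerspective expects dsize as (width, height) and a 3x3 transform.
        bev_feature[c] = cv2.warpPerspective(
            feature_cam[c].astype(np.float32), H.astype(np.float64), (bev_w, bev_h)
        )
    return bev_feature

# Dummy example: a 64-channel front-view feature map and a placeholder transformation.
f_front = np.random.rand(64, 90, 160).astype(np.float32)
H_front_to_bev = np.eye(3)
F_front = warp_feature_to_bev(f_front, H_front_to_bev)
print(F_front.shape)   # (64, 512, 512)
```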
  • In this way, the embodiment of the present disclosure calculates a respective transformation matrix (homography) for each camera in the multi-camera system, and then maps the respective feature data to the bird's-eye view space based on the respective transformation matrix of each camera to obtain the corresponding feature data of each image in the bird's-eye view space. This can not only be applied to different models of multi-camera systems, but also enables more reasonable feature fusion.
  • step S302 and step S3031 can be executed synchronously or asynchronously, depending on the actual application situation.
  • FIG. 10 is a schematic block diagram of performing step S303 and step S304 provided by an exemplary embodiment of the present disclosure.
  • After step S302 is executed, the feature space transformation described in step S3032 is performed based on the transformation matrix from the camera coordinate system of each camera to the bird's-eye view coordinate system obtained in step S3031 and the feature data in the corresponding camera perspective space obtained in step S302, to obtain the feature data in the bird's-eye view space.
  • step S304 is executed to perform feature fusion on the feature data of the multi-camera perspective in the bird's-eye view space to obtain the bird's-eye view fusion feature.
  • Figure 11 is a flow chart of target detection provided by an exemplary embodiment of the present disclosure.
  • step S305 may include the following steps:
  • Step S3051 Use the prediction network to obtain, from the bird's-eye view fusion feature, a heat map corresponding to the first preset coordinate value of the target object in the bird's-eye view coordinate system, and to obtain other attribute maps used to determine the second preset coordinate value, size, and orientation angle of the target object in the bird's-eye view coordinate system.
  • the prediction network may be a neural network used for target prediction of the target object. Since the target object needs to predict three-dimensional spatial information with different attributes, there can be multiple types of prediction networks. Different prediction networks are used to predict three-dimensional spatial information of different attributes.
  • the prediction network corresponding to the first preset coordinate value can be used to process the bird's-eye view fusion feature in the bird's-eye view image to obtain a heat map, and the heat map is used to determine the first preset coordinate value of the target object in the bird's-eye view coordinate system.
  • the heat map can be the same size as the bird's-eye view image.
  • the prediction network corresponding to the second preset coordinate value, size, and orientation angle can be used to process the bird's-eye view fusion feature in the bird's-eye view image to obtain other attribute maps, and the other attribute maps are used to determine the second preset coordinate value, size, and orientation angle of the target object in the bird's-eye view coordinate system.
  • the first preset coordinate value is the (x, y) position in the bird's-eye view coordinate system
  • the second preset coordinate value is the z position in the bird's-eye view coordinate system
  • the dimensions are length, width and height
  • the orientation angle is the orientation angle (rotation yaw).
  • Step S3052 Determine the first preset coordinate value of the target object in the bird's-eye view coordinate system based on the peak information in the heat map, and determine the second preset coordinate value, size, and orientation angle of the target object in the bird's-eye view coordinate system from the other attribute maps based on the first preset coordinate value of the target object in the bird's-eye view coordinate system.
  • the peak information refers to the center value of the Gaussian kernel, that is, the center point of the target object.
  • After the first preset coordinate value of the target object in the bird's-eye view space is predicted, since the other attribute maps can use the output result of the heat map to output their own attribute information, the second preset coordinate value, size, and orientation angle of the target object in the bird's-eye view coordinate system can be predicted from the other attribute maps based on the first preset coordinate value of the target object in the bird's-eye view coordinate system.
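  • One common way to read out the peak information and the attribute maps is sketched below: local maxima of the heat map give the (x, y) cells (the first preset coordinate value), and z, size, and orientation are read from the other attribute maps at those same cells. The max-pooling peak extraction, the score threshold, and the attribute layout are assumptions of this sketch, not a decoding procedure prescribed by the disclosure.
```python
import torch
import torch.nn.functional as F

def decode_bev_predictions(heatmap, z_map, size_map, yaw_map, score_thresh=0.3):
    """heatmap, z_map, yaw_map: (1, H, W); size_map: (3, H, W)."""
    # A cell is a peak if it equals its 3x3 neighborhood maximum and exceeds the threshold.
    pooled = F.max_pool2d(heatmap.unsqueeze(0), kernel_size=3, stride=1, padding=1).squeeze(0)
    peaks = (heatmap == pooled) & (heatmap > score_thresh)

    ys, xs = torch.nonzero(peaks[0], as_tuple=True)     # first preset coordinate value (x, y)
    detections = []
    for x, y in zip(xs.tolist(), ys.tolist()):
        detections.append({
            "xy_bev": (x, y),                            # position on the BEV grid
            "z": z_map[0, y, x].item(),                  # second preset coordinate value
            "lwh": size_map[:, y, x].tolist(),           # length, width, height
            "yaw": yaw_map[0, y, x].item(),              # orientation angle
            "score": heatmap[0, y, x].item(),
        })
    return detections

# Dummy maps on a 512x512 BEV grid.
dets = decode_bev_predictions(torch.rand(1, 512, 512), torch.rand(1, 512, 512),
                              torch.rand(3, 512, 512), torch.rand(1, 512, 512))
print(len(dets))
```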
  • Step S3053 Determine the three-dimensional spatial information of the target object based on the first preset coordinate value, the second preset coordinate value, the size and the orientation angle of the target object in the bird's-eye view coordinate system.
  • For example, the vehicle-mounted automatic driving system can determine the first preset coordinate value and the second preset coordinate value as the (x, y, z) position of the target object in the bird's-eye view space, determine the size as the length, width, and height of the target object in the bird's-eye view space, and determine the orientation angle as the orientation angle of the target object in the bird's-eye view space. The three-dimensional spatial information of the target object in the surrounding environment of the carrier is then determined based on the (x, y, z) position, length, width, height, and orientation angle.
  • Figure 12 is a schematic diagram of the output results of a prediction network provided by an exemplary embodiment of the present disclosure.
  • the center A of the smallest circle is the position of the carrier, and each box B around the center is a target object around the carrier.
  • the vehicle-mounted autonomous driving system can also project the three-dimensional spatial information of the target object onto the images from the multi-camera perspectives collected by the multi-camera system, so that the user can intuitively understand the three-dimensional spatial information of the target object on the vehicle-mounted display.
  • In this way, the embodiments of the present disclosure can input the bird's-eye view fusion feature obtained through feature fusion into the prediction network to obtain the heat map and the other attribute maps, from which the three-dimensional spatial information of the target object can be directly predicted, improving the efficiency of 3D target detection.
  • Figure 13 is another flowchart of target detection provided by an exemplary embodiment of the present disclosure.
  • step S305 may also include the following steps:
  • Step S3054 During the training phase of the prediction network, construct a first loss function between the heat map output by the prediction network and the true value heat map, and construct a second loss function between the other attribute maps predicted by the prediction network and the other true value attribute maps.
  • the vehicle-mounted autonomous driving system can construct a Gaussian kernel for each target object based on the position of each target object in the bird's-eye view fusion feature.
  • Figure 14 is a schematic diagram of a Gaussian kernel provided by an exemplary embodiment of the present disclosure.
  • a Gaussian kernel of N ⁇ N size can be generated with the position (i, j) of the target object as the center.
  • the value in the center of the Gaussian kernel is 1, and the values around it decay downward to 0.
  • the color change from white to black means that the value decays from 1 to 0.
  • FIG 15 is a schematic diagram of a heat map provided by an exemplary embodiment of the present disclosure. As shown in Figure 15, the Gaussian kernel of each target object can be placed on the heat map to obtain the true value heat map.
  • each white area represents a Gaussian kernel, that is, a target object, such as target objects 1-6.
  • generation method of other true value attribute maps can refer to the generation method of the true value heat map, which will not be described again here.
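  • A minimal sketch of building the true value heat map described above: an N × N Gaussian kernel (value 1 at the object position, decaying toward 0 at the edges) is stamped onto an empty map at every target object's (i, j) position. The kernel size and sigma are illustrative assumptions.
```python
import numpy as np

def gaussian_kernel(n=9, sigma=2.0):
    """N x N kernel with value 1 at the center decaying toward 0 at the edges."""
    ax = np.arange(n) - (n - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))

def build_gt_heatmap(object_centers, map_hw=(512, 512), n=9, sigma=2.0):
    """Place one Gaussian kernel per target object center (i, j) on the heat map."""
    h, w = map_hw
    heatmap = np.zeros((h, w), dtype=np.float32)
    kernel = gaussian_kernel(n, sigma)
    r = n // 2
    for i, j in object_centers:
        # Clip the kernel at the map borders.
        top, bottom = max(0, i - r), min(h, i + r + 1)
        left, right = max(0, j - r), min(w, j + r + 1)
        k_top, k_left = top - (i - r), left - (j - r)
        patch = kernel[k_top:k_top + (bottom - top), k_left:k_left + (right - left)]
        # Overlapping objects keep the larger value.
        heatmap[top:bottom, left:right] = np.maximum(heatmap[top:bottom, left:right], patch)
    return heatmap

gt = build_gt_heatmap([(100, 200), (300, 310), (40, 500)])
print(gt.shape, gt.max())   # (512, 512) 1.0
```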
  • a first loss function can be constructed based on the true value heat map and the heat map of the prediction network output.
  • the first loss function can measure the gap distribution between the output prediction value of the prediction network and the true value, and is used to supervise the training process of the prediction network.
  • the first loss function L cls can be constructed from the predicted heat map and the true value heat map, where y′ i,j represents the value at position (i, j) in the true value heat map, 1 represents a peak in the heat map, y i,j represents the value at position (i, j) in the heat map predicted by the prediction network, α and β are adjustable hyperparameters whose values range between 0 and 1, N represents the number of target objects in the bird's-eye view fusion feature, and h and w represent the size of the bird's-eye view fusion feature.
  • the second loss function L reg can be constructed from the true values and the predicted values of the remaining attributes, where B′ represents the true values of the second preset coordinate value, size, and orientation angle of the target object in the bird's-eye view coordinate system, B represents the predicted values of the second preset coordinate value, size, and orientation angle of the target object in the bird's-eye view coordinate system output by the prediction network, and N represents the number of target objects in the bird's-eye view fusion feature.
  • Step S3055 Determine the total loss function of the prediction network in the training stage based on the first loss function and the second loss function to supervise the training process of the prediction network.
  • the total loss function of the prediction network in the training phase can be determined by the following steps:
  • the total loss function of the prediction network in the training phase is determined.
  • the total loss function L 3d of the prediction network in the training stage can be determined by the following formula:
  • L 3d = λ1 · L cls + λ2 · L reg ;
  • where L cls is the first loss function, L reg is the second loss function, λ1 is the weight value of the first loss function, and λ2 is the weight value of the second loss function.
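  • The training supervision above could be sketched as follows. The exact form of L cls and L reg in the disclosure is not reproduced here: the snippet assumes a CenterNet-style focal loss on the heat map and an L1 loss for the remaining attributes, then forms the weighted total L 3d = λ1·L cls + λ2·L reg as stated in the text. All hyperparameter values and tensor shapes are placeholders.
```python
import torch
import torch.nn.functional as F

def heatmap_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Focal-style heat map loss (an assumption; the disclosure defines its own L_cls)."""
    pred = pred.clamp(eps, 1.0 - eps)
    pos = gt.eq(1.0).float()                     # peaks (value 1) in the true value heat map
    neg = 1.0 - pos
    pos_loss = -((1.0 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1.0 - gt) ** beta) * (pred ** alpha) * torch.log(1.0 - pred) * neg
    num_objects = pos.sum().clamp(min=1.0)       # N: number of target objects
    return (pos_loss.sum() + neg_loss.sum()) / num_objects

def regression_loss(pred_attrs, gt_attrs, mask):
    """L1 loss between predicted and true z / size / yaw at object locations (assumption)."""
    num_objects = mask.sum().clamp(min=1.0)
    return (F.l1_loss(pred_attrs, gt_attrs, reduction="none") * mask).sum() / num_objects

def total_loss(l_cls, l_reg, lambda1=1.0, lambda2=1.0):
    """L_3d = lambda1 * L_cls + lambda2 * L_reg, as stated in the disclosure."""
    return lambda1 * l_cls + lambda2 * l_reg

# Dummy example on a 512x512 BEV grid with 5-channel attribute maps (z, l, w, h, yaw).
pred_hm, gt_hm = torch.rand(1, 512, 512), torch.zeros(1, 512, 512)
gt_hm[0, 100, 200] = 1.0
pred_attr, gt_attr = torch.rand(5, 512, 512), torch.rand(5, 512, 512)
mask = (gt_hm == 1.0).float()                    # supervise attributes only at object centers
loss = total_loss(heatmap_loss(pred_hm, gt_hm), regression_loss(pred_attr, gt_attr, mask))
print(loss.item())
```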
  • the embodiment of the present disclosure constructs a total loss function to supervise the overall training process to ensure that the output of various attributes of the prediction network becomes more accurate, thereby ensuring higher 3D target detection efficiency.
  • Figure 16 is another flowchart of target detection provided by an exemplary embodiment of the present disclosure.
  • step S305 may also include the following steps:
  • Step S3056 Use a neural network to extract features from the bird's-eye view fusion features, and obtain bird's-eye view fusion feature data containing the characteristics of the target object.
  • the vehicle-mounted autonomous driving system can use a neural network to perform convolution and other calculations on the bird's-eye view fusion features to implement feature extraction and obtain bird's-eye view fusion feature data.
  • the bird's-eye view fusion feature data includes target object features of different dimensions used to characterize the target object, that is, the target object's scene information from different dimensions in the bird's-eye view space.
  • the neural network can be a pre-trained neural network used for feature extraction.
  • the neural network used for feature extraction is not limited to a specific network structure, such as resnet, densenet, mobilenet, etc.
  • Step S3057 Use the prediction network to perform target prediction on the target object in the bird's-eye view fusion feature data containing the characteristics of the target object, and obtain the three-dimensional spatial information of the target object.
  • In the embodiment of the present disclosure, before target prediction is performed on the bird's-eye view fusion feature through the prediction network, feature extraction is performed on the bird's-eye view fusion feature to obtain the bird's-eye view fusion feature data.
  • the prediction network is then used to predict the bird's-eye view fusion feature data containing the characteristics of the target object, so that the prediction results are more accurate, that is, the three-dimensional spatial information of the determined target object is more accurate.
  • step S302 may include the following steps:
  • the deep neural network is used to perform convolution calculations on the images corresponding to each viewing angle, and the corresponding feature data of multiple different resolutions containing the characteristics of the target object in the multi-camera viewing space are obtained.
  • the deep neural network can be a pre-trained neural network used for feature extraction.
  • the neural network used for feature extraction is not limited to a specific network structure, such as resnet, densenet, mobilenet, etc.
  • For example, the size of image A from a certain viewing angle is H × W × 3, where H is the height of image A, W is the width of image A, and 3 indicates that there are three channels. For example, if it is an RGB image, 3 represents the three channels of RGB (R red, G green, and B blue); if it is a YUV image, 3 represents the three channels of YUV (Y brightness signal, U blue component signal, V red component signal).
  • Image A is input into the deep neural network, and after convolution and other calculations are performed through the deep neural network, a feature matrix of H1 × W1 × N dimensions is output, where H1 and W1 are the height and width of the feature (usually smaller than H and W) and N is the number of channels (N is greater than 3).
  • In this way, feature data of multiple different resolutions containing the target object features of the input image can be obtained, such as low-level image texture and edge contour information and higher-level semantic information corresponding to different resolutions. After the feature data of the image from each perspective is obtained, the subsequent spatial transformation, multi-view feature fusion, and target prediction steps are performed to obtain the three-dimensional spatial information of the target object.
  • the embodiments of the present disclosure perform convolution, pooling and other calculations on the images corresponding to each viewing angle through a deep neural network to obtain multiple different resolution feature data of the images from each viewing angle.
  • Feature data of different resolutions can better reflect the image features collected by the corresponding viewing angle camera and improve the efficiency of subsequent 3D target detection.
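  • A brief sketch of the H × W × 3 → H1 × W1 × N convolutional feature extraction at multiple resolutions, using torchvision's intermediate-layer extractor; the choice of ResNet-18, the tapped layers, and the input size are illustrative assumptions.
```python
import torch
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor

# Backbone returning feature maps at several resolutions (strides 8, 16, 32).
backbone = create_feature_extractor(
    torchvision.models.resnet18(weights=None),
    return_nodes={"layer2": "stride8", "layer3": "stride16", "layer4": "stride32"},
)
backbone.eval()

image_a = torch.rand(1, 3, 720, 1280)            # one H x W x 3 image from a camera view
with torch.no_grad():
    multi_res_features = backbone(image_a)

# Each output is an H1 x W1 x N feature map with H1 < H, W1 < W and N > 3 channels.
print({name: tuple(f.shape) for name, f in multi_res_features.items()})
# e.g. stride8: (1, 128, 90, 160), stride16: (1, 256, 45, 80), stride32: (1, 512, 23, 40)
```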
  • FIG 17 is a structural diagram of a 3D target detection device based on multi-view fusion provided by an exemplary embodiment of the present disclosure.
  • the 3D target detection device based on multi-view fusion can be installed in electronic equipment such as terminal equipment and servers, or on a carrier for assisted driving or automatic driving. For example, it can be installed in a vehicle-mounted automatic driving system to perform any of the above-mentioned tasks of the present disclosure.
  • the image receiving module 201 is used to obtain at least one image collected from the perspective of multiple cameras.
  • the feature extraction module 202 is configured to perform feature extraction on the at least one image acquired by the image receiving module, and obtain corresponding feature data containing target object features corresponding to the at least one image in the multi-camera perspective space.
  • the image feature mapping module 203 is configured to map the corresponding feature data of the at least one image obtained by the feature extraction module in the multi-camera perspective space to the same bird's-eye view space based on the internal parameters and vehicle parameters of the multi-camera system, to obtain the corresponding feature data of the at least one image in the bird's-eye view space.
  • the image fusion module 204 is used to perform feature fusion on the corresponding feature data of the at least one image obtained by the image feature mapping module in the bird's-eye view space to obtain the bird's-eye view fusion feature.
  • the 3D detection module 205 is used to perform target prediction on the target object in the bird's-eye view fusion feature obtained by the image fusion module, and obtain the three-dimensional spatial information of the target object.
  • When the device of the embodiment of the present disclosure performs 3D target detection based on multi-view fusion, it simultaneously maps the feature data of the at least one image under multiple camera views to the same bird's-eye view space through middle fusion, enabling more reasonable and effective fusion.
  • Based on the fused bird's-eye view fusion feature, the three-dimensional spatial information of each target object in the environment surrounding the carrier is directly detected in the bird's-eye view space. Therefore, when 3D target detection based on multi-view fusion is performed through the device of the embodiment of the present disclosure, the 3D detection of scene objects from a bird's-eye view is completed end-to-end, avoiding the post-processing stage in conventional multi-view 3D target detection and improving detection efficiency.
  • Figure 18 is another structural diagram of a 3D target detection device based on multi-view fusion provided by an exemplary embodiment of the present disclosure.
  • the image feature mapping module 203 includes:
  • the transformation matrix determination unit 2031 is configured to determine the transformation matrix from the camera coordinate system of the multi-camera system to the bird's-eye view coordinate system based on the internal parameters and vehicle parameters of the multi-camera system;
  • the spatial conversion unit 2032 is configured to convert the corresponding feature data of the at least one image in the multi-camera perspective space from the multi-camera perspective space to the bird's-eye view space based on the transformation matrix from the camera coordinate system of each camera to the bird's-eye view coordinate system determined by the transformation matrix determination unit 2031, to obtain the corresponding feature data of the at least one image in the bird's-eye view space.
  • the conversion matrix determination unit 2031 includes:
  • the conversion matrix acquisition subunit is used to respectively obtain the camera internal parameters and camera external parameters of the multiple cameras in the multi-camera system, and to obtain the conversion matrix from the vehicle coordinate system to the bird's-eye view coordinate system;
  • the conversion matrix determination subunit is used to determine the conversion matrix from the camera coordinate system of each camera to the bird's-eye view coordinate system based on the camera external parameters, the camera internal parameters, and the conversion matrix from the vehicle coordinate system to the bird's-eye view coordinate system obtained by the conversion matrix acquisition subunit.
  • the 3D detection module 205 includes:
  • the detection network acquisition unit 2051 is configured to use the prediction network to obtain, from the bird's-eye view fusion feature, the heat map corresponding to the first preset coordinate value of the target object in the bird's-eye view coordinate system, and to obtain the other attribute maps used to determine the second preset coordinate value, size, and orientation angle of the target object in the bird's-eye view coordinate system;
  • the information detection unit 2052 is configured to determine the first preset coordinate value of the target object in the bird's-eye view coordinate system according to the peak information in the heat map obtained by the detection network acquisition unit 2051, and to determine the second preset coordinate value, size, and orientation angle of the target object in the bird's-eye view coordinate system from the other attribute maps according to the first preset coordinate value of the target object in the bird's-eye view coordinate system;
  • the information determination unit 2053 is configured to determine the three-dimensional spatial information of the target object based on the first preset coordinate value, the second preset coordinate value, the size, and the orientation angle of the target object in the bird's-eye view coordinate system detected by the information detection unit 2052.
  • the 3D detection module 205 also includes:
  • the loss function construction unit 2054 is used to construct a first loss function between the heat map predicted by the prediction network and the true value heat map during the training phase of the prediction network, and to construct a second loss function between the other attribute maps predicted by the prediction network and the other true value attribute maps;
  • the total loss function determination unit 2055 is configured to determine the total loss function of the prediction network in the training phase according to the first loss function and the second loss function constructed by the loss function construction unit 2054, so as to supervise the training process of the prediction network.
  • the total loss function determination unit 2055 includes:
  • the weight value acquisition subunit is used to obtain the weight value of the first loss function and the weight value of the second loss function
  • the total loss function determination subunit is used to determine the total loss function of the prediction network in the training phase based on the first loss function and the second loss function constructed by the loss function construction unit 2054, and the weight value of the first loss function and the weight value of the second loss function obtained by the weight value acquisition subunit.
  • the 3D detection module 205 also includes:
  • the fusion feature extraction unit 2056 is configured to use a neural network to perform feature extraction on the bird's-eye view fusion feature, and obtain bird's-eye view fusion feature data containing target object features;
  • the target prediction unit 2057 is configured to use the prediction network to perform target prediction on the target object in the bird's-eye view fusion feature data containing the target object characteristics obtained by the feature extraction unit 2056, and obtain the three-dimensional spatial information of the target object.
  • the feature extraction module 202 includes:
  • the feature extraction unit 2021 is used to use a deep neural network to perform convolution calculations on images corresponding to each viewing angle, and obtain multiple different resolution features containing target object features corresponding to the images corresponding to each viewing angle in the multi-camera viewing space. data.
  • Figure 19 is a structural block diagram of an electronic device provided by an exemplary embodiment of the present disclosure.
  • the electronic device 11 includes one or more processors 111 and memories 112 .
  • the processor 111 may be a central processing unit (CPU) or other form of processing unit with data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 11 to perform desired functions.
  • Memory 112 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache).
  • the non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 111 may execute the program instructions to implement the 3D object detection method based on multi-view fusion of the various embodiments of the present disclosure described above and/or other desired functionality.
  • Various contents such as input signals, signal components, noise components, etc. may also be stored in the computer-readable storage medium.
  • the electronic device 11 may further include an input device 113 and an output device 114, and these components are interconnected through a bus system and/or other forms of connection mechanisms (not shown).
  • the input device 113 may also include, for example, a keyboard, a mouse, and the like.
  • the output device 114 can output various information to the outside, including determined distance information, direction information, etc.
  • the output device 114 may include, for example, a display, a speaker, a printer, a communication network and remote output devices connected thereto, and the like.
  • the electronic device 11 may also include any other appropriate components depending on the specific application.
  • embodiments of the present disclosure may also be a computer program product, which includes computer program instructions that, when executed by a processor, cause the processor to perform the steps of the 3D object detection method based on multi-view fusion according to the various embodiments of the present disclosure described in the "exemplary method" section of this specification.
  • embodiments of the present disclosure may also be a computer-readable storage medium having computer program instructions stored thereon.
  • the computer program instructions, when executed by a processor, cause the processor to perform the steps of the 3D object detection method based on multi-view fusion according to the various embodiments of the present disclosure described in the "exemplary method" section of this specification.
  • in the apparatus, devices and methods of the present disclosure, each component or each step can be decomposed and/or recombined. These decompositions and/or recombinations should be regarded as equivalent solutions of the present disclosure.


Abstract

Embodiments of the present disclosure disclose a 3D object detection method and apparatus based on multi-view fusion. In the method, feature extraction is performed on at least one image captured by a multi-camera system from multi-camera viewing angles, and, based on the internal parameters of the multi-camera system and the vehicle parameters, the extracted feature data containing target object features in the multi-camera view space are mapped into the same bird's-eye view space to obtain the feature data corresponding to each of the at least one image in the bird's-eye view space; a bird's-eye view fusion feature is then obtained through feature fusion. Target prediction is performed on the target objects in the bird's-eye view fusion feature to obtain the three-dimensional spatial information of the target objects. When 3D object detection based on multi-view fusion is performed with the solution of the embodiments of the present disclosure, multi-view feature fusion is performed first and 3D object detection afterwards, so that 3D detection of scene objects under the bird's-eye view is completed end to end and detection efficiency is improved.

Description

一种基于多视角融合的3D目标检测方法及装置
本公开要求在2022年5月18日提交的、申请号为202210544237.0、发明名称为“一种基于多视角融合的3D目标检测方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本公开中。
技术领域
本公开涉及计算机视觉领域,具体涉及一种基于多视角融合的3D目标检测方法及装置。
背景技术
随着科技的发展,自动驾驶技术在人们生活中的应用越来越广泛。自动驾驶载体可以对周围一定距离内的目标物体(车辆、行人、骑车人等)进行3D检测,以获得目标物体的三维空间信息。基于目标物体的三维空间信息对目标物体进行测距、测速,以实现更好的驾驶控制。
目前,自动驾驶载体可以采集视角不同的多幅图像,然后分别对每一幅图像进行3D检测,最后对各幅图像的3D检测结果进行融合,以生成载体周围环境的目标物体的三维空间信息。
发明内容
现有的技术方案需要对自动驾驶载体采集的每一幅图像分别进行3D检测,然后再对各幅图像的3D检测结果进行融合,以获取载体周围360度环境的他车信息,导致检测效率较低。
为了解决上述技术问题,提出了本公开。本公开的实施例提供了一种基于多视角融合的3D目标检测方法及装置。
根据本公开的一个方面,提供了一种基于多视角融合的3D目标检测方法,包括:
获取采集的来自多摄相机视角的至少一幅图像;
对所述至少一幅图像进行特征提取,得到所述至少一幅图像在多摄相机视角空间下各自对应的包含目标物体特征的特征数据;
基于多摄相机系统的内部参数和载具参数,将所述至少一幅图像在多摄相机视角空间下各自对应的特征数据映射至同一个鸟瞰视角空间,得到所述至少一幅图像在鸟瞰视角空间下各自对应的特征数据;
将所述至少一幅图像在鸟瞰视角空间下各自对应的特征数据进行特征融合,得到的鸟瞰视角融合特征;
对所述鸟瞰视角融合特征中的目标物体进行目标预测,得到所述目标物体的三维空间信息。
根据本公开的另一个方面,提供了一种基于多视角融合的3D目标检测装置,包括:
图像接收模块,用于获取采集的来自多摄相机视角的至少一幅图像;
特征提取模块,用于对所述图像接收模块获取的所述至少一幅图像进行特征提取,得到所述至少一幅图像在多摄相机视角空间下各自对应的包含目标物体特征的特征数据;
图像特征映射模块,用于基于多摄相机系统的内部参数和载具参数,将所述特征提取模块获得的所述至少一幅图像在多摄相机视角空间下各自对应的特征数据映射至同一个鸟瞰视角空间,得到所述至少一幅图像在鸟瞰视角空间下各自对应的特征数据;
图像融合模块,用于将所述图像特征映射模块得到的所述至少一幅图像在鸟瞰视角空间下各自对应的特征数据进行特征融合,得到鸟瞰视角融合特征;
3D检测模块,用于对所述图像融合模块得到的所述鸟瞰视角融合特征中的目标物体进行目标预测,得到目标物体的三维空间信息。
根据本公开的又一个方面,提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序用于执行上述的基于多视角融合的3D目标检测方法。
根据本公开的再一个方面,提供了一种电子设备,所述电子设备包括:
处理器;
用于存储所述处理器可执行指令的存储器;
所述处理器,用于从所述存储器中读取所述可执行指令,并执行所述指令以实现上述的基于多视角融合的3D目标检测方法。
基于本公开上述实施例提供的一种基于多视角融合的3D目标检测方法及装置,对多摄相机系统采集的多摄相机视角的至少一幅图像进行特征提取,并基于多摄相机系统的内部参数,将提取到的在多摄相机视角空间下包含目标物体特征的特征数据映射至同一个鸟瞰视角空间,得到至少一幅图像在鸟瞰视角空间下各自对应的特征数据,并将至少一幅图像在鸟瞰视角空间下各自对应的特征数据进行特征融合,得到鸟瞰视角融合特征。再对鸟瞰视角融合特征中的目标物体进行目标预测,得到目标物体的三维空间信息。通过本公开实施例的方案进行基于多视角融合的3D目标检测时,先进行多视角的特征融合再进行3D目标检测,端到端的完成鸟瞰视角下的场景物体3D目标检测,避免在常规多视角3D检测上的后处理阶段,提高检测效率。
附图说明
通过结合附图对本公开实施例进行更详细的描述,本公开的上述以及其他目的、特征和优势将变得更加明显。附图用来提供对本公开实施例的进一步理解,并且构成说明书的一部分,与本公开实施例一起用于解释本公开,并不构成对本公开的限制。在附图中,相同的参考标号通常代表相同部件或步骤。
图1是本公开所适用的场景图。
图2是本公开实施例提供的车载自动驾驶系统的系统框图。
图3是本公开一示例性实施例提供的基于多视角融合的3D目标检测方法的流程图。
图4是本公开一示例性实施例提供的多摄相机系统采集图像的示意框图。
图5是本公开一示例性实施例提供的来自多摄相机视角的图像的示意图。
图6是本公开一示例性实施例提供的特征提取的示意框图。
图7是本公开一示例性实施例提供的从多摄相机系统采集的图像生成鸟瞰视角图像的示意图。
图8是本公开一示例性实施例提供的目标检测的示意框图。
图9是本公开一示例性实施例提供的确定鸟瞰视角空间下特征数据的流程图。
图10是本公开一示例性实施例提供的执行步骤S303和步骤S304的示意框图。
图11是本公开一示例性实施例提供的目标检测的流程图。
图12是本公开一示例性实施例提供的预测网络的输出结果示意图。
图13是本公开一示例性实施例提供的目标检测的另一流程图。
图14是本公开一示例性实施例提供的高斯核的示意图。
图15是本公开一示例性实施例提供的热力图的示意图。
图16是本公开一示例性实施例提供的目标检测的又一流程图。
图17是本公开一示例性实施例提供的基于多视角融合的3D目标检测装置的结构图。
图18是本公开一示例性实施例提供的基于多视角融合的3D目标检测装置的另一结构图。
图19是本公开一示例性实施例提供的电子设备的结构框图。
具体实施方式
下面,将参考附图详细地描述根据本公开的示例实施例。显然,所描述的实施例仅仅是本公开的一部分实施例,而不是本公开的全部实施例,应理解,本公开不受这里描述的示例实施例的限制。
申请概述
为保证自动驾驶过程中的安全,自动驾驶载体可以对载体周围一定距离内的目标物体(例如:车辆、行人、骑车人等)进行实时检测,以获得3D目标物体的三维空间信息(例如:位置、尺寸、朝向角和类别等属性)。基于目标物体的三维空间信息对目标物体进行测距、测速,以实现更好的驾驶控制。其中,自动驾驶载体可以为车辆、飞机等。
自动驾驶载体可以利用多摄相机系统采集视角不同的多幅图像,然后分别对每一幅图像进行3D目标检测,如对不同视角相机采集的多幅图像分别进行目标物体的过滤、去重等操作。最后对各幅图像的3D检测结果进行融合,以生成载体周围环境的目标物体的三维空间信息。可见,现有的技术方案需要对自动驾驶载体采集的每一幅图像分别进行3D检测,然后再对各幅图像的3D检测结果进行融合,导致检测效率较低。
有鉴于此,本公开实施例提供一种基于多视角融合的3D目标检测方法及装置。通过本公开的方案进行3D目标检测时,自动驾驶载体可以对多摄相机系统采集的多摄相机视角的至少一幅图像进行特征提取,得到在多摄相机视角空间下包含目标物体特征的特征数据。并基于多摄相机系统的内部参数和载具参数,将在多摄相机视角空间下的特征数据映射至同一个鸟瞰视角空间,得到至少一幅图像在鸟瞰视角空间下各自对应的特征数据。再将至少一幅图像在鸟瞰视角空间下各自对应的特征数据进行特征融合,得到鸟瞰视角 融合特征;对鸟瞰视角融合特征中的目标物体进行目标预测,得到载体周围环境的目标物体的三维空间信息。
本公开实施例的方案进行基于多视角融合的3D目标检测时,将至少一幅图像在多摄相机视角下的特征数据同时映射至同一个鸟瞰视角空间,能够进行更合理,效果更好的融合。同时,通过融合的鸟瞰视角融合特征直接在鸟瞰视角空间检测出车载环境周围内各个目标物体的三维空间信息。因此,通过本公开实施例的方案进行基于多视角融合的3D目标检测时,先进行多视角的特征融合再进行3D目标检测,端到端的完成鸟瞰视角下的场景物体3D目标检测,避免在常规多视角3D目标检测上的后处理阶段,提高检测效率。
示例性系统
本公开实施例可应用于需要进行3D目标检测的应用场景中,例如自动驾驶应用场景。
例如,在自动驾驶的应用场景中,在自动驾驶载体(下文简称“载体”)上配置多摄相机系统,通过多摄相机系统采集不同视角的图像,然后通过本公开实施例的方案基于多视角融合的3D目标检测,获得载体周围环境的目标物体的三维空间信息。
图1是本公开所适用的场景图。
如图1所示,本公开实施例应用在辅助驾驶或自动驾驶的应用场景中,辅助驾驶或自动驾驶的载体100上配置车载自动驾驶系统200和多摄相机系统300,车载自动驾驶系统200和多摄相机系统300电连接。多摄相机系统300用于采集载体周围环境的图像,车载自动驾驶系统200用于获取多摄相机系统300采集的图像,并进行基于多视角融合的3D目标检测,获得载体周围环境的目标物体的三维空间信息。
图2是本公开实施例提供的车载自动驾驶系统的系统框图。
如图2所示,车载自动驾驶系统200包括图像接收模块201、特征提取模块202、图像特征映射模块203,图像融合模块204和3D检测模块205。图像接收模块201用于获取多摄相机系统300采集的至少一幅图像;特征提取模块202用于对图像接收模块201获取的至少一幅图像进行特征提取,获得特征数据;图像特征映射模块203用于将至少一幅图像的特征数据从多摄相机视角空间映射至同一个鸟瞰视角空间;图像融合模块204用于将至少一幅图像在鸟瞰视角空间下各自对应的特征数据进行特征融合,得到鸟瞰视角融合特征;3D检测模块205用于对图像融合模块204得到的鸟瞰视角融合特征中的目标物体进行目标预测,得到载体周围环境的目标物体在的三维空间信息。
多摄相机系统300包括视角不同的多个相机,每个相机用于采集一个视角的环境图像,多个相机覆盖载体周围360度的环境范围。每个相机定义自己的相机视角坐标系,通过各自的相机视角坐标系形成各自的相机视角空间,每个相机采集的环境图像为在对应的相机视角空间下的图像。
示例性方法
图3是本公开一示例性实施例提供的基于多视角融合的3D目标检测方法的流程图。
本实施例可应用在车载自动驾驶系统200,如图3所示,包括如下步骤:
步骤S301,获取采集的来自多摄相机视角的至少一幅图像。
其中,至少一幅图像可以是多摄相机系统的至少一个相机采集到的。示例性的,该至少一幅图像可以是多摄相机系统实时采集的图像,也可以是多摄相机系统预先采集的图像。
图4是本公开一示例性实施例提供的多摄相机系统采集图像的示意框图。
如图4所示,在一个实施例中,多摄相机系统可以实时采集不同视角的多幅图像,如图像1、2……N,并实时将采集到的图像发送给车载自动驾驶系统。这样,车载自动驾驶系统获取到的图像能够表征当前时刻载体周围环境的真实情况。
图5是本公开一示例性实施例提供的来自多摄相机视角的图像的示意图。
如图5中(1)-(6)所示,在一个实施例中,多摄相机系统可以包括6个相机。6个相机分别设置在载体的前端、左前端、右前端、后端、左后端和右后端。这样,在任意时刻,多摄相机系统均可以采集到6个不同视角的图像,如前视图像(Ifront)、左前视图像(Ifrontleft)、右前视图像(Ifrontright)、后视图像(Irear)、左后视图像(Irearleft)和右后视图像(Irearright)。
其中,每一幅图像中包括但不限于呈现道路、交通信号灯、路牌、车辆(小型车、大巴、卡车等)、行人、骑车人等各类别的目标物体。随着载体周围环境中的目标物体的类别位置等不同,各个图像中包含的目标物体的类别、位置等也不同。
步骤S302,对至少一幅图像进行特征提取,得到至少一幅图像在多摄相机视角空间下各自对应的包含目标物体特征的特征数据。
在一个实施例中,车载自动驾驶系统可以分别从每幅图像中提取出在对应相机视角空间下的特征数据。特征数据中可以包含用于描述图像中目标物体的目标物体特征,目标物体特征包括但不限于图像纹理信息、边缘轮廓信息、语义信息等。
其中,图像纹理信息用于表征目标物体的图像纹理,边缘轮廓信息用于表征目标物体的边缘轮廓,语义信息用于表征目标物体的类别。其中,目标物体的类别包括但不限于:道路、交通信号灯、路牌、车辆(小型车、大巴、卡车等)、行人、骑车人等。
图6是本公开一示例性实施例提供的特征提取的示意框图。
如图6所示,车载自动驾驶系统可以采用神经网络对至少一幅图像(图像1-N)进行特征提取,得到每幅图像在多摄相机视角空间下各自对应的特征数据1-N。
例如,车载自动驾驶系统对前视图像(Ifront)进行特征提取,可以得到前视图像(Ifront)在前端相机视角空间下的特征数据ffront;对左前视图像(Ifrontleft)进行特征提取,可以得到左前视图像(Ifrontleft)在左前端相机视角空间下的特征数据ffrontleft;对右前视图像(Ifrontright)进行特征提取,可以得到右前 视图像(Ifrontright)在右前端相机视角空间下的特征数据ffrontright;对后视图像(Irear)进行特征提取,可以得到后视图像(Irear)在后端相机视角空间下的特征数据frear;对左后视图像(Irearleft)进行特征提取,可以得到左后视图像(Irearleft)在左后端相机视角空间下的特征数据frearleft;对右后视图像(Irearright)进行特征提取,可以得到右后视图像(Irearright)在右后端相机视角空间下的特征数据frearright
步骤S303,基于多摄相机系统的内部参数和载具参数,将至少一幅图像在多摄相机视角空间下各自对应的特征数据映射至同一个鸟瞰视角空间,得到至少一幅图像在鸟瞰视角空间下各自对应的特征数据。
其中,多摄相机系统的内部参数包括每个相机的相机内参数和相机外参数,相机内参数是与相机自身特性相关的参数,比如相机的焦距、像素大小等;相机外参数是在世界坐标系中的参数,比如相机的位置、旋转方向等。载具参数是指载具坐标系(Vehicle Coordinate System,VCS)到鸟瞰视角坐标系(BEV)的转换矩阵,载具坐标系是载体所在坐标系。
例如,车载自动驾驶系统将前视图像(Ifront)在前端相机视角空间下的特征数据ffront映射至同一个鸟瞰视角空间,得到前视图像(Ifront)在鸟瞰视角空间下的特征数据Ffront;将左前视图像(Ifrontleft)在左前端相机视角空间下的特征数据ffrontleft映射至同一个鸟瞰视角空间,得到左前视图像(Ifrontleft)在鸟瞰视角空间下的特征数据Ffrontleft;将右前视图像(Ifrontright)在右前端相机视角空间下的特征数据ffrontright映射至同一个鸟瞰视角空间,得到右前视图像(Ifrontright)在鸟瞰视角空间下的特征数据Ffrontright;将后视图像(Irear)在后端相机视角空间下的特征数据frear映射至同一个鸟瞰视角空间,得到后视图像(Irear)在鸟瞰视角空间下的特征数据Frear;将左后视图像(Irearleft)在左后端相机视角空间下的特征数据frearleft映射至同一个鸟瞰视角空间,得到左后视图像(Irearleft)在鸟瞰视角空间下的特征数据Frearleft;将右后视图像(Irearright)在右后端相机视角空间下的特征数据frearright映射至同一个鸟瞰视角空间,得到右后视图像(Irearright)在鸟瞰视角空间下的特征数据Frearright
步骤S304,将至少一幅图像在鸟瞰视角空间下各自对应的特征数据进行特征融合,得到鸟瞰视角融合特征。
其中,鸟瞰视角融合特征用于表征载体周围的目标物体在鸟瞰视角空间下的特征数据,目标物体在鸟瞰视角空间下的特征数据可以包括但不限定于目标物体的形状、尺寸大小、类别、朝向角、相对位置等属性。
在一个实施例中,车载自动驾驶系统可以将至少一幅图像在鸟瞰视角空间下各自对应的特征数据进行加法特征融合,得到鸟瞰视角融合特征。具体可以表示为以下公式:
F‘=Add(∑F_i)，其中 i∈{front, frontleft, frontright, rear, rearleft, rearright}
其中,F‘表示鸟瞰视角融合特征,Add表示对至少一幅图像在鸟瞰视角空间下各自对应的特征数据进行的加法特征融合计算。
需要指出的是,步骤S304的实施方式并不局限于此,例如,也可以采用乘法、叠加等方式对不同相机 视角的图像在鸟瞰视角空间下各自对应的特征数据进行特征融合。
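As an illustrative sketch of the additive fusion just described (array shapes, channel counts and the six-view example are assumptions for illustration, not taken from the original filing), the element-wise summation over the per-camera BEV feature maps could look like the following Python code:

```python
import numpy as np

def fuse_bev_features(bev_features):
    """Additive fusion of per-camera feature maps already warped into the
    shared bird's-eye-view (BEV) space.

    bev_features: list of arrays, each of shape (C, H_bev, W_bev), one per camera.
    Returns the fused BEV feature of the same shape.
    """
    fused = np.zeros_like(bev_features[0])
    for feat in bev_features:
        fused += feat          # element-wise addition, as in F' = Add(sum of F_i)
    return fused

# Example with six camera views and an assumed 64-channel, 512x512 BEV grid:
# views = [np.random.rand(64, 512, 512).astype(np.float32) for _ in range(6)]
# bev_fused = fuse_bev_features(views)
```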
图7是本公开一示例性实施例提供的从多摄相机系统采集的图像生成鸟瞰视角图像的示意图。
如图7所示,示例性的,鸟瞰视角图像的大小可以与多摄相机系统采集的至少一幅图像的大小相同。鸟瞰视角图像可以体现目标物体的三维空间信息,三维空间信息包括目标物体的至少一种属性信息,该属性包括但不限于3D位置信息(即X轴、Y轴、Z轴的坐标信息)、尺寸信息(即长、宽、高信息)、朝向角信息等。
其中,X轴、Y轴、Z轴的坐标信息是指目标物体在鸟瞰视角空间的坐标位置(x,y,z),鸟瞰视角空间的坐标系原点位于载体的底盘或者载体中心等任一位置,X轴方向为从前到后的方向,Y轴方向为从左到右的方向,Z轴方向为垂直上下的方向。朝向角是指目标物体的正面方向或行进方向在鸟瞰视角空间下形成的角度,例如,在目标物体为行进的行人时,朝向角是指行人的行进方向在鸟瞰视角空间下形成的角度。在目标物体为静止的车辆时,朝向角是指车辆的车头方向在鸟瞰视角空间下形成的角度。
需要说明的是,由于多摄相机系统采集的至少一幅图像中可能包括不同类别的目标物体,因此,鸟瞰视角图像中可能包括不同类别的目标物体的鸟瞰视角融合特征。
步骤S305,对鸟瞰视角融合特征中的目标物体进行目标预测,得到目标物体的三维空间信息。
其中,三维空间信息可以包括:目标物体在鸟瞰视角坐标系下的位置、尺寸和朝向角等属性中的至少一种。位置是指目标物体在鸟瞰视角空间中相对于载体的坐标位置(x,y,z),尺寸是指目标物体在鸟瞰视角空间中的长宽高(Height、Width、Length),朝向角是指目标物体在鸟瞰视角空间中的朝向角度(rotation yaw)。
图8是本公开一示例性实施例提供的目标检测的示意框图。
如图8所示,在一个实施例中,车载自动驾驶系统可以利用一个或者多个预测网络对鸟瞰视角融合特征中的目标物体进行3D目标预测,得到载体周围环境的每个目标物体的三维空间信息。
如果车载自动驾驶系统利用多个预测网络进行3D目标预测时,每个预测网络可以输出目标物体的一个或者多个属性,不同的预测网络输出的属性也不同。
本公开实施例的方案进行基于多视角融合的3D目标检测时,可以先进行多视角的特征融合再进行3D目标检测,端到端的完成鸟瞰视角下的场景物体3D目标检测,避免在常规多视角3D目标检测上的后处理阶段,提高检测效率。
图9是本公开一示例性实施例提供的确定鸟瞰视角空间下特征数据的流程图。
如图9所示,在上述图3所示实施例的基础上,步骤S303可包括如下步骤:
步骤S3031,基于多摄相机系统的内部参数和载具参数,确定多摄相机系统的多摄相机的相机坐标系到鸟瞰视角坐标系的转换矩阵。
其中,多摄相机系统的内部参数包括每个相机的相机内参数和相机外参数,相机外参数是指多摄相机 的相机坐标系到载具坐标系的转换矩阵,载具参数是指载具坐标系(Vehicle Coordinate System,VCS)到鸟瞰视角坐标系(BEV)的转换矩阵,载具坐标系是载体所在坐标系。
在一种具体实施方式中,步骤S3031包括:
分别获取多摄相机系统中多摄相机的相机内参数和相机外参数,以及,获取载具坐标系到鸟瞰视角坐标系的转换矩阵;
基于多摄相机的相机外参数、相机内参数与载具坐标系到鸟瞰视角坐标系的转换矩阵,确定多摄相机的相机坐标系到鸟瞰视角坐标系的转换矩阵。
在一个实施例中,车载自动驾驶系统可以通过以下公式确定多摄相机的相机坐标系到鸟瞰视角坐标系的转换矩阵H:
H = Tvcs→bev @ Tcamera→vcs @ K⁻¹
其中,@表示矩阵乘法;Tcamera→vcs表示相机坐标系到载具坐标系的转换矩阵,Tcamera→vcs表征相机外参数;Tvcs→bev表示载具坐标系到鸟瞰视角坐标系的转换矩阵;K表示相机内参数。
需要说明的是,相机外参数,即相机坐标系到载具坐标系的转换矩阵可以通过多摄相机系统的标定得到,一旦标定完成,通常不会变动。载具坐标系到鸟瞰视角坐标系的转换矩阵可以由人为设定的鸟瞰视角的范围(例如前、后、左、右各100米围成的范围),以及鸟瞰视角图像的分辨率(例如512×512)计算得到。
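The following Python sketch shows one way the transformation matrix H could be assembled from the quantities described above. The construction of Tvcs→bev from the example BEV range (100 m in each direction) and resolution (512×512) reflects an assumed coordinate convention, and the camera extrinsics are assumed to have been reduced to a 3×3 plane-induced homography beforehand; none of these conventions are specified by the original filing.

```python
import numpy as np

def vcs_to_bev(bev_range_m=100.0, bev_size_px=512):
    """Build the vehicle-coordinate-system -> BEV-image transform from the
    chosen BEV range (meters in each direction) and BEV image resolution."""
    scale = bev_size_px / (2.0 * bev_range_m)        # pixels per meter
    return np.array([[scale, 0.0,   bev_size_px / 2.0],
                     [0.0,   scale, bev_size_px / 2.0],
                     [0.0,   0.0,   1.0]])

def camera_to_bev_homography(K, T_camera_to_vcs, T_vcs_to_bev):
    """H = Tvcs->bev @ Tcamera->vcs @ K^-1, as in the formula above.
    All three inputs are assumed to be 3x3 matrices."""
    return T_vcs_to_bev @ T_camera_to_vcs @ np.linalg.inv(K)
```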
这样,多摄相机系统中的每个相机均可以确定对应的转换矩阵。例如,车载自动驾驶系统基于前端相机的相机坐标系到载具坐标系的转换矩阵、载具坐标系到鸟瞰视角坐标系的转换矩阵和前端相机的相机内参数,确定前端相机的相机坐标系到鸟瞰视角坐标系的转换矩阵Hfront→bev;基于左前端相机的相机坐标系到载具坐标系的转换矩阵、载具坐标系到鸟瞰视角坐标系的转换矩阵和左前端相机的相机内参数,确定左前端相机的相机坐标系到鸟瞰视角坐标系的转换矩阵Hfrontleft→bev;基于右前端相机的相机坐标系到载具坐标系的转换矩阵、载具坐标系到鸟瞰视角坐标系的转换矩阵和右前端相机的相机内参数,确定右前端相机的相机坐标系到鸟瞰视角坐标系的转换矩阵Hfrontright→bev;基于后端相机的相机坐标系到载具坐标系的转换矩阵、载具坐标系到鸟瞰视角坐标系的转换矩阵和后端相机的相机内参数,确定后端相机的相机坐标系到鸟瞰视角坐标系的转换矩阵Hrear→bev;基于左后端相机的相机坐标系到载具坐标系的转换矩阵、载具坐标系到鸟瞰视角坐标系的转换矩阵和左后端相机的相机内参数,确定左后端相机的相机坐标系到鸟瞰视角坐标系的转换矩阵Hrearleft→bev;基于右后端相机的相机坐标系到载具坐标系的转换矩阵、载具坐标系到鸟瞰视角坐标系的转换矩阵和右后端相机的相机内参数,确定右后端相机的相机坐标系到鸟瞰视角坐标系的转换矩阵Hrearright→bev
本实施方式中,由于每个相机都具有从自身的相机视角坐标系到鸟瞰视角坐标系的转换矩阵,所以本申请实施例在进行3D目标检测时所采用的预测网络适用于多摄相机系统,无需从头训练预测网络,提高检 测效率。
步骤S3032,基于多摄相机的相机坐标系到鸟瞰视角坐标系的转换矩阵,将至少一幅图像在多摄相机视角空间下各自对应的特征数据从多摄相机视角空间转换至鸟瞰视角空间下,得到至少一幅图像在鸟瞰视角空间下各自对应的特征数据。
在一个实施例中,车载自动驾驶系统可以将各个相机的转换矩阵与各自相机视角空间下的特征数据通过矩阵乘法得到至少一幅图像在鸟瞰视角空间下各自对应的特征数据。具体可以表示为以下公式:
F=H@f。
其中,F表示至少一幅图像在鸟瞰视角空间下各自对应的特征数据Ffront、Ffrontleft、Ffrontright、Frear、Frearleft和Frearright;H表示多摄相机系统中各个相机对应的转换矩阵Hfront→bev、Hfrontleft→bev、Hfrontright→bev、Hrear→bev、Hrearleft→bev和Hrearright→bev;f表示至少一幅图像在多摄相机视角空间下的特征数据ffront、ffrontleft、ffrontright、frear、frearleft和frearright
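The matrix product F = H@f denotes a homography warp of each camera-view feature map into the BEV grid; a rough sketch of such a warp (nearest-neighbour sampling, bounds checks only, and all names illustrative rather than taken from the disclosure) might be:

```python
import numpy as np

def warp_feature_to_bev(feat, H_cam_to_bev, bev_h=512, bev_w=512):
    """Warp a (C, h, w) camera-view feature map into a (C, bev_h, bev_w) BEV grid by
    sampling each BEV cell from the source location given by the inverse homography."""
    C, h, w = feat.shape
    bev = np.zeros((C, bev_h, bev_w), dtype=feat.dtype)
    H_inv = np.linalg.inv(H_cam_to_bev)            # BEV pixel -> camera feature pixel
    for v in range(bev_h):
        for u in range(bev_w):
            x, y, z = H_inv @ np.array([u, v, 1.0])
            if abs(z) < 1e-9:
                continue
            src_u, src_v = x / z, y / z
            iu, iv = int(round(src_u)), int(round(src_v))
            if 0 <= iu < w and 0 <= iv < h:
                bev[:, v, u] = feat[:, iv, iu]     # nearest-neighbour sample
    return bev
```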
可见,本公开实施例通过对多摄相机系统中的不同相机计算各自的转换矩阵(homography),进而基于每个相机各自的转换矩阵将各自的特征数据映射至鸟瞰视角空间,得到每幅图像在鸟瞰视角空间下各自对应的特征数据,不仅可以适用于不同型号的多摄相机系统,还可进行更加合理的特征融合。
需要说明的是,步骤S302和步骤S3031这两个步骤可同步执行也可异步执行,可基于实际应用情况而定。
图10是本公开一示例性实施例提供的执行步骤S303和步骤S304的示意框图。
如图10所示,在步骤S302和步骤S3031全部执行完成后,基于步骤S3031得到的每个相机的相机坐标系到鸟瞰视角坐标系的转换矩阵和和步骤S302得到的对应相机视角空间的特征数据进行步骤S3032所述的特征空间转换,得到鸟瞰视角空间的特征数据。最后,执行步骤S304将多摄相机视角的在鸟瞰视角空间的特征数据进行特征融合,得到鸟瞰视角融合特征。
图11是本公开一示例性实施例提供的目标检测的流程图。
如图11所示,在上述图3所示实施例的基础上,步骤S305可包括如下步骤:
步骤S3051、利用预测网络从鸟瞰视角融合特征中获取用于确定目标物体在鸟瞰视角坐标系下的第一预设坐标值对应的热力图,以及,获取用于确定目标物体在鸟瞰视角坐标系下的第二预设坐标值、尺寸和朝向角的其他属性图。
其中,预测网络可以为用于对目标物体进行目标预测的神经网络。由于目标物体需进行不同属性的三维空间信息预测,因此,预测网络也可为多种。不同的预测网络用于预测不同属性的三维空间信息。
例如,在需要预测的属性为目标物体的第一预设坐标值时,可以利用第一预设坐标值对应的预测网络对鸟瞰视角图像中的鸟瞰视角融合特征进行处理,获得热力图,以利用热力图确定目标物体在鸟瞰视角坐标系下的第一预设坐标值。热力图的大小可与鸟瞰视角图像的大小相同。
又例如,在需要预测的属性为目标物体的第二预设坐标值、尺寸和朝向角时,可以利用第二预设坐标值、尺寸和朝向角对应的预测网络对鸟瞰视角图像中的鸟瞰视角融合特征进行处理,获得其他属性图,以利用其他属性图确定目标物体在鸟瞰视角坐标系下的第二预设坐标值、尺寸和朝向角。
其中,第一预设坐标值为鸟瞰视角坐标系下的(x,y)位置,第二预设坐标值为鸟瞰视角坐标系下的z位置,尺寸为长宽高,朝向角为朝向角度。
步骤S3052、根据热力图中的峰值信息确定目标物体在鸟瞰视角坐标系下的第一预设坐标值,并且根据目标物体在鸟瞰视角坐标系下的第一预设坐标值从其他属性图中确定目标物体的在鸟瞰视角坐标下的第二预设坐标值、尺寸和朝向角。
其中,峰值信息是指高斯核的中心值,即目标物体的中心点。
在预测出目标物体在鸟瞰视角空间下的第一预设坐标值后,由于其他属性图可利用热力图的属性输出结果来输出各自的属性信息,因此,可根据目标物体在鸟瞰视角坐标系下的第一预设坐标值从其他属性图中预测目标物体的在鸟瞰视角坐标下的第二预设坐标值、尺寸和朝向角。
步骤S3053、根据目标物体在鸟瞰视角坐标系下的第一预设坐标值、第二预设坐标值、尺寸和朝向角,确定目标物体的三维空间信息。
在一个实施例中,车载自动驾驶系统可以将第一预设坐标值和第二预设坐标值确定为目标物体在鸟瞰视角空间中的(x,y,z)位置,将尺寸确定为目标物体在鸟瞰视角空间中的长宽高,将朝向角确定为目标物体在鸟瞰视角空间中的朝向角度。最后,基于(x,y,z)位置、长宽高和朝向角度确定载体周围环境的目标物体的三维空间信息。
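A minimal sketch of the decoding described in steps S3051–S3053, i.e. picking peaks on the heat map and reading the other attribute maps at the same locations; the score threshold, window size and attribute-map layout below are assumptions, not specified in the text:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def decode_bev_predictions(heatmap, z_map, size_map, yaw_map, score_thresh=0.3):
    """heatmap: (H, W) peak scores; z_map: (H, W); size_map: (3, H, W) for (l, w, h);
    yaw_map: (H, W). Returns a list of dicts with (x, y), z, size and yaw per object."""
    # A peak is a local maximum above the score threshold.
    peaks = (heatmap == maximum_filter(heatmap, size=3)) & (heatmap > score_thresh)
    objects = []
    for y, x in zip(*np.nonzero(peaks)):
        objects.append({
            "xy": (int(x), int(y)),               # first preset coordinate (BEV grid cell)
            "z": float(z_map[y, x]),              # second preset coordinate
            "size_lwh": tuple(float(s) for s in size_map[:, y, x]),
            "yaw": float(yaw_map[y, x]),
            "score": float(heatmap[y, x]),
        })
    return objects
```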
图12是本公开一示例性实施例提供的预测网络的输出结果示意图。在图12中,最小圆圈的中心A为载体位置,中心周围的方框位置B为载体周围的目标物体。
另外,车载自动驾驶系统还可以将目标物体的三维空间投影显示到多摄相机系统采集的来自多摄相机视角的图像上,以便与用户从车载显示屏中直观地了解目标物体的三维空间信息。
可见,本公开实施例可以根据预测网络对鸟瞰视角图像进行处理,以获得热力图和其他属性图。将通过特征融合得到的鸟瞰视角融合特征输入热力图和其他属性图可直接预测出目标物体的三维空间信息,提高3D目标检测效率。
图13是本公开一示例性实施例提供的目标检测的另一流程图。
如图13所示,在上述图11所示实施例的基础上,步骤S305还可包括如下步骤:
步骤S3054、在预测网络的训练阶段,构建预测网络输出的热力图与真值热力图之间的第一损失函数,以及,构建预测网络预测的其他属性图与其他真值属性图之间的第二损失函数。
在一个实施例中,车载自动驾驶系统可以根据鸟瞰视角融合特征中的每一个目标物体的位置,分别为每一个目标物体构建高斯核。
图14是本公开一示例性实施例提供的高斯核的示意图。如图14所示,在构建高斯核时,可以以目标物体的位置(i,j)为中心,生成1个N×N大小的高斯核。其中,高斯核中心的值为1,四周的值向下衰减至0,颜色由白色到黑色表示值由1衰减到0。
图15是本公开一示例性实施例提供的热力图的示意图。如图15所示,可以将各个目标物体的高斯核置于热力图上,即可得到真值热力图。在图15中,每个白色区域都表示一个高斯核,即一个目标物体,如目标物体1-6。
需要说明的是,其他真值属性图的生成方式可参照真值热力图的生成方式,这里不再赘述。
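A sketch of building the true value heat map from per-object Gaussian kernels as described above; the kernel size N and the decay rate (standard deviation) are assumptions, since the text only specifies a centre value of 1 decaying to 0:

```python
import numpy as np

def gaussian_kernel(n=9, sigma=2.0):
    """N x N kernel with value 1 at the centre, decaying towards 0 at the edges."""
    ax = np.arange(n) - (n - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))

def build_gt_heatmap(centers, hm_h=512, hm_w=512, n=9, sigma=2.0):
    """Place one Gaussian kernel per object centre (i, j) on the heat map,
    keeping the element-wise maximum where kernels overlap."""
    hm = np.zeros((hm_h, hm_w), dtype=np.float32)
    k = gaussian_kernel(n, sigma)
    r = n // 2
    for i, j in centers:                                  # (row, col) of each object
        top, left = max(i - r, 0), max(j - r, 0)
        bottom, right = min(i + r + 1, hm_h), min(j + r + 1, hm_w)
        k_top, k_left = top - (i - r), left - (j - r)
        hm[top:bottom, left:right] = np.maximum(
            hm[top:bottom, left:right],
            k[k_top:k_top + (bottom - top), k_left:k_left + (right - left)])
    return hm
```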
在确定真值热力图之后,可以基于真值热力图和预测网络输出的热力图构建第一损失函数。其中,第一损失函数可以衡量预测网络的输出预测值与真值之间的差距分布,用于对预测网络的训练过程进行监督。
在一个实施例中,第一损失函数Lcls具体可以通过以下公式构建:
其中,y′i,j表示(i,j)位置在真值热力图中第一预设坐标值,1表示热力图中的峰值,yi,j表示(i,j)位置在预测网络预测的热力图中的第一预设坐标值,α和β为可调整的超参数,α和β的范围均在0-1之间,N表示鸟瞰视角融合特征中目标物体的数量和,h,w表示鸟瞰视角融合特征的尺寸。
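The formula referred to above does not survive in this text-only rendering. Based on the variable definitions given here, a penalty-reduced focal loss of the form popularised by CenterNet is one plausible reconstruction; the exact form and exponent conventions in the original filing may differ:

```latex
L_{\mathrm{cls}} = -\frac{1}{N}\sum_{i=1}^{h}\sum_{j=1}^{w}
\begin{cases}
\left(1 - y_{i,j}\right)^{\alpha}\,\log\!\left(y_{i,j}\right), & y'_{i,j} = 1,\\[4pt]
\left(1 - y'_{i,j}\right)^{\beta}\left(y_{i,j}\right)^{\alpha}\,\log\!\left(1 - y_{i,j}\right), & \text{otherwise.}
\end{cases}
```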
在一个实施例中,第二损失函数Lreg具体可以通过以下公式构建:
其中,B′为目标物体在鸟瞰视角坐标系下的第二预设坐标值、尺寸和朝向角的真值,B为预测网络预测的目标物体在鸟瞰视角坐标系下的第二预设坐标值、尺寸和朝向角的预测值,N表示鸟瞰视角融合特征中目标物体的数量。
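The corresponding formula is likewise missing from this rendering; one common choice consistent with the variables above is a plain L1 regression loss over the predicted and true value attribute vectors (the original filing may use a different norm):

```latex
L_{\mathrm{reg}} = \frac{1}{N}\sum_{k=1}^{N}\bigl\lVert B_{k} - B'_{k} \bigr\rVert_{1}
```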
步骤S3055、根据第一损失函数和第二损失函数确定预测网络在训练阶段的总损失函数,以监督预测网络的训练过程。
在一个实施例中,预测网络在训练阶段的总损失函数可以通过以下步骤确定:
获取第一损失函数的权重值和第二损失函数的权重值;
基于第一损失函数、第一损失函数的权重值、第二损失函数和第二损失函数的权重值,确定预测网络在训练阶段的总损失函数。
这里,在利用预测网络预测目标物体的三维空间信息时,不同的属性在训练过程的重要程度不同,使得对应的损失函数的重要程度也不同。因此,根据每个属性的训练过程的重要程度,为不同属性对应的损失函数配置不同的权重值。
其中,预测网络在训练阶段的总损失函数L3d可以通过以下公式确定:
L3d = λ1·Lcls + λ2·Lreg
其中，Lcls为第一损失函数，Lreg为第二损失函数，λ1为第一损失函数的权重值，λ2为第二损失函数的权重值。λ1和λ2均在0-1之间，λ1>λ2，λ1+λ2=1。
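As a minimal illustration of this weighted combination (the function names and default weight values below are placeholders, not taken from the disclosure), the total loss could be assembled as:

```python
import torch

def total_loss(l_cls: torch.Tensor, l_reg: torch.Tensor,
               lambda1: float = 0.6, lambda2: float = 0.4) -> torch.Tensor:
    """Total training loss L3d = lambda1 * Lcls + lambda2 * Lreg.
    Both weights lie in (0, 1) and sum to 1; the concrete values here
    are assumptions for illustration only."""
    return lambda1 * l_cls + lambda2 * l_reg

# Usage during training (sketch):
# loss = total_loss(heatmap_focal_loss, attribute_l1_loss)
# loss.backward()
```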
可见,本公开实施例在对预测网络进行训练时,构建总损失函数对总训练过程进行监督,以保证预测网络的各种属性的输出愈加准确,进而保证3D目标检测效率更高。
图16是本公开一示例性实施例提供的目标检测的又一流程图。
如图16所示,在上述图3所示实施例的基础上,步骤S305还可包括如下步骤:
步骤S3056、利用神经网络对鸟瞰视角融合特征进行特征提取,获得包含目标物体特征的鸟瞰视角融合特征数据。
在一个实施例中,车载自动驾驶系统可以利用神经网络对鸟瞰视角融合特征进行卷积等计算,以实现特征提取,获得鸟瞰视角融合特征数据。鸟瞰视角融合特征数据中包括用于表征目标物体的不同维度的目标物体特征,即目标物体在鸟瞰视角空间中来自不同维度的场景信息。
其中,神经网络可以为预先训练好的、用于特征提取的神经网络。可选地,用于特征提取的神经网络不仅限于某一种特定的网络结果,如:resnet、densenet、mobilenet等。
步骤S3057、利用预测网络对包含目标物体特征的鸟瞰视角融合特征数据中的目标物体进行目标预测,得到目标物体的三维空间信息。
可见,本公开实施例通过预测网络对鸟瞰视角融合特征进行训练之前,对鸟瞰视角融合特征进行特征提取,得到鸟瞰视角融合特征数据。再利用预测网络对包含目标物体特征的鸟瞰视角融合特征数据进行预测,使得预测结果更准确,即确定的目标物体的三维空间信息更准确。
在上述图3所示实施例的基础上,步骤S302可包括如下步骤:
利用深度神经网络对各个视角对应的图像进行卷积计算,获得各个视角对应的图像在多摄相机视角空间下各自对应的包含目标物体特征的多个不同分辨率的特征数据。
这里,深度神经网络可以为预先训练好的、用于特征提取的神经网络。可选地,用于特征提取的神经网络不仅限于某一种特定的网络结果,如:resnet、densenet、mobilenet等。利用深度神经网络对目标视角的图像进行卷积、池化等计算,可以获取到目标视角的图像对应的多个不同分辨率(尺度)的特征数据。
例如,某视角的图像A的尺寸为H×W×3,其中,H为图像A的高度,W为图像A的宽度,3表示通道数有3个。例如,如果为RGB图像,则3表示RGB(R红、G绿、B蓝)3个通道;如果为YUV图像,则3表示YUV(Y亮度信号、U蓝分量信号、V红分量信号)3个通道。将图像A输入深度神经网络,通过深度神经网络进行卷积等计算后会输出H1×W1×N维度的特征矩阵,其中,H1,W1为特征的高度和宽度(通常比H和W小,N是通道数,N大于3)。通过神经网络对输入数据的拟合训练,可以获得输入图像的包含目标物体特征的多个不同分辨率的特征数据,例如不同分辨率对应的低级的图像纹理,边缘轮廓信息,以及高级的语义信息等。在获得每个视角的图像的特征数据后,即可进行后续的空间转换、多视 角特征融合和目标预测步骤,以获得目标物体的三维空间信息。
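A compact PyTorch sketch of such a per-view backbone; the channel counts, number of stages and pooling scheme are assumptions, since the text only requires that an H×W×3 image be mapped by convolution and pooling to several H1×W1×N feature maps (e.g. with a resnet-, densenet- or mobilenet-style network):

```python
import torch
import torch.nn as nn

class MultiScaleBackbone(nn.Module):
    """Maps a (B, 3, H, W) image to a list of feature maps at different
    resolutions, each with more than 3 channels, as described above."""
    def __init__(self, channels=(32, 64, 128)):
        super().__init__()
        stages, in_ch = [], 3
        for out_ch in channels:
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2)))          # halve the resolution at every stage
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                # one feature map per resolution
        return feats

# Example: a 512x512 front-view image yields 256x256, 128x128 and 64x64 features.
# feats = MultiScaleBackbone()(torch.randn(1, 3, 512, 512))
```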
可见,本公开实施例通过深度神经网络对各个视角对应的图像进行卷积、池化等计算,以获得每个视角图像的多个不同分辨率的特征数据。通过不同分辨率的特征数据可更好地反应对应视角相机所采集的图像特征,提高后续3D目标检测的效率。
示例性装置
图17是本公开一示例性实施例提供的基于多视角融合的3D目标检测装置的结构图。该基于多视角融合的3D目标检测装置可以设置于终端设备、服务器等电子设备中,或者辅助驾驶或自动驾驶的载体上,示例性的,可设置在车载自动驾驶系统中,执行本公开上述任一实施例的基于多视角融合的3D目标检测方法。如图17所示,该实施例的基于多视角融合的3D目标检测装置包括:图像接收模块201、特征提取模块202、图像特征映射模块203,图像融合模块204和3D检测模块205。
其中,图像接收模块201,用于获取采集的来自多摄相机视角的至少一幅图像。
特征提取模块202,用于对所述图像接收模块获取的所述至少一幅图像进行特征提取,得到所述至少一幅图像在多摄相机视角空间下各自对应的包含目标物体特征的特征数据。
图像特征映射模块203,用于基于多摄相机系统的内部参数和载具参数,将所述特征提取模块获得的所述至少一幅图像在多摄相机视角空间下各自对应的特征数据映射至同一个鸟瞰视角空间,得到所述至少一幅图像在鸟瞰视角空间下各自对应的特征数据。
图像融合模块204,用于将所述图像特征映射模块得到的所述至少一幅图像在鸟瞰视角空间下各自对应的特征数据进行特征融合,得到鸟瞰视角融合特征。
3D检测模块205,用于对所述图像融合模块得到的所述鸟瞰视角融合特征中的目标物体进行目标预测,得到目标物体的三维空间信息。
可见,本公开实施例的装置在进行基于多视角融合的3D目标检测时,通过中融合(middle fusion)将至少一幅图像在多摄相机视角下的特征数据同时映射至同一个鸟瞰视角空间,能够进行更合理,效果更好的融合。同时,通过融合的鸟瞰视角融合特征直接在鸟瞰视角空间检测出车载环境周围内各个目标物体的三维空间信息。因此,通过本公开实施例的装置进行基于多视角融合的3D目标检测时,端到端的完成鸟瞰视角下的场景物体3D检测,避免在常规多视角3D目标检测上的后处理阶段,提高检测效率。
图18是本公开一示例性实施例提供的基于多视角融合的3D目标检测装置的另一结构图。
进一步的,如图18所示的结构图,该图像特征映射模块203包括:
转换矩阵确定单元2031,用于基于所述多摄相机系统的内部参数和载具参数,确定所述多摄相机系统的多摄相机的相机坐标系到鸟瞰视角坐标系的转换矩阵;
空间转换单元2032,用于基于转换矩阵确定单元2031确定的多摄相机的相机坐标系到鸟瞰视角坐标系 的转换矩阵,将所述至少一幅图像在多摄相机视角空间下各自对应的特征数据从多摄相机视角空间转换至鸟瞰视角空间下,得到所述至少一幅图像在鸟瞰视角空间下各自对应的特征数据。
在一种可行的实施方式中,该转换矩阵确定单元2031包括:
转换矩阵获取子单元,用于分别获取所述多摄相机系统中多摄相机的相机内参数和相机外参数,以及,获取载具坐标系到鸟瞰视角坐标系的转换矩阵;
转换矩阵确定子单元,用于基于所述转换矩阵获取子单元获取的多摄相机的相机外参数、相机内参数与载具坐标系到鸟瞰视角坐标系的转换矩阵,确定多摄相机的相机坐标系到鸟瞰视角坐标系的转换矩阵。
进一步的,该3D检测模块205包括:
检测网络获取单元2051,用于利用预测网络从所述鸟瞰视角融合特征中获取用于确定目标物体在鸟瞰视角坐标系下的第一预设坐标值对应的热力图,以及,获取用于确定目标物体在鸟瞰视角坐标系下的第二预设坐标值、尺寸和朝向角的其他属性图;
信息检测单元2052,用于根据所述检测网络获取单元2051获取的热力图中的峰值信息确定目标物体在鸟瞰视角坐标系下的第一预设坐标值,并且根据目标物体在鸟瞰视角坐标系下的第一预设坐标值从所述其他属性图中确定目标物体的在鸟瞰视角坐标下的第二预设坐标值、尺寸和朝向角;
信息确定单元2053,用于根据所述信息检测单元2052检测的目标物体在鸟瞰视角坐标系下的第一预设坐标值、第二预设坐标值、尺寸和朝向角,确定目标物体的三维空间信息。
在一种可行的实施方式中,该3D检测模块205还包括:
损失函数构建单元2054,用于在预测网络的训练阶段,构建预测网络预测的热力图与真值热力图之间的第一损失函数,以及,构建预测网络预测的其他属性图与其他真值属性图之间的第二损失函数;
总损失函数确定单元2055,用于根据所述损失函数构建单元2054构建的第一损失函数和所述第二损失函数确定预测网络在训练阶段的总损失函数,以监督预测网络的训练过程。
在一种可行的实施方式中,总损失函数确定单元2055包括:
权重值获取子单元,用于获取第一损失函数的权重值和第二损失函数的权重值;
总损失函数确定子单元,用于基于所述损失函数构建单元2054构建的第一损失函数、第二损失函数,以及,所述权重值获取子单元获取的第一损失函数的权重值和所述第二损失函数的权重值,确定预测网络在训练阶段的总损失函数。
在一种可行的实施方式中,该3D检测模块205还包括:
融合特征提取单元2056,用于利用神经网络对所述鸟瞰视角融合特征进行特征提取,获得包含目标物体特征的鸟瞰视角融合特征数据;
目标预测单元2057,用于利用预测网络对所述特征提取单元2056得到的包含目标物体特征的鸟瞰视角融合特征数据中的目标物体进行目标预测,得到目标物体的三维空间信息。
进一步的,该特征提取模块202包括:
特征提取单元2021,用于利用深度神经网络对各个视角对应的图像进行卷积计算,获得各个视角对应的图像在多摄相机视角空间下各自对应的包含目标物体特征的多个不同分辨率的特征数据。
示例性电子设备
下面,参考图19来描述根据本公开实施例的电子设备。
图19是本公开一示例性实施例提供的电子设备的结构框图。
如图19所示,电子设备11包括一个或多个处理器111和存储器112。
处理器111可以是中央处理单元(CPU)或者具有数据处理能力和/或指令执行能力的其他形式的处理单元,并且可以控制电子设备11中的其他组件以执行期望的功能。
存储器112可以包括一个或多个计算机程序产品,所述计算机程序产品可以包括各种形式的计算机可读存储介质,例如易失性存储器和/或非易失性存储器。所述易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。所述非易失性存储器例如可以包括只读存储器(ROM)、硬盘、闪存等。在所述计算机可读存储介质上可以存储一个或多个计算机程序指令,处理器111可以运行所述程序指令,以实现上文所述的本公开的各个实施例的基于多视角融合的3D目标检测方法以及/或者其他期望的功能。在所述计算机可读存储介质中还可以存储诸如输入信号、信号分量、噪声分量等各种内容。
在一个示例中,电子设备11还可以包括:输入装置113和输出装置114,这些组件通过总线系统和/或其他形式的连接机构(未示出)互连。
此外,该输入装置113还可以包括例如键盘、鼠标等等。
该输出装置114可以向外部输出各种信息,包括确定出的距离信息、方向信息等。该输出装置114可以包括例如显示器、扬声器、打印机、以及通信网络及其所连接的远程输出设备等等。
当然,为了简化,图20中仅示出了该电子设备11中与本公开有关的组件中的一些,省略了诸如总线、输入/输出接口等等的组件。除此之外,根据具体应用情况,电子设备11还可以包括任何其他适当的组件。
示例性计算机程序产品和计算机可读存储介质
除了上述方法和设备以外,本公开的实施例还可以是计算机程序产品,其包括计算机程序指令,所述计算机程序指令在被处理器运行时使得所述处理器执行本说明书上述“示例性方法”部分中描述的根据本公开各种实施例的基于多视角融合的3D目标检测方法中的步骤。
此外,本公开的实施例还可以是计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令在被处理器运行时使得所述处理器执行本说明书上述“示例性方法”部分中描述的根据本公开各种实施例的基于多视角融合的3D目标检测方法中的步骤。
以上结合具体实施例描述了本公开的基本原理,但是,需要指出的是,在本公开中提及的优点、优势、效果等仅是示例而非限制,不能认为这些优点、优势、效果等是本公开的各个实施例必须具备的。另外,上述公开的具体细节仅是为了示例的作用和便于理解的作用,而非限制,上述细节并不限制本公开为必须采用上述具体的细节来实现。
本公开中涉及的器件、装置、设备、系统的方框图仅作为例示性的例子并且不意图要求或暗示必须按照方框图示出的方式进行连接、布置、配置。如本领域技术人员将认识到的,可以按任意方式连接、布置、配置这些器件、装置、设备、系统。诸如“包括”、“包含”、“具有”等等的词语是开放性词汇,指“包括但不限于”,且可与其互换使用。这里所使用的词汇“或”和“和”指词汇“和/或”,且可与其互换使用,除非上下文明确指示不是如此。这里所使用的词汇“诸如”指词组“诸如但不限于”,且可与其互换使用。
还需要指出的是,在本公开的装置、设备和方法中,各部件或各步骤是可以分解和/或重新组合的。这些分解和/或重新组合应视为本公开的等效方案。

Claims (11)

  1. 一种基于多视角融合的3D目标检测方法,包括:
    获取采集的来自多摄相机视角的至少一幅图像;
    对所述至少一幅图像进行特征提取,得到所述至少一幅图像在多摄相机视角空间下各自对应的包含目标物体特征的特征数据;
    基于多摄相机系统的内部参数和载具参数,将所述至少一幅图像在所述多摄相机视角空间下各自对应的特征数据映射至同一个鸟瞰视角空间,得到所述至少一幅图像在所述鸟瞰视角空间下各自对应的特征数据;
    将所述至少一幅图像在所述鸟瞰视角空间下各自对应的特征数据进行特征融合,得到鸟瞰视角融合特征;
    对所述鸟瞰视角融合特征中的目标物体进行目标预测,得到所述目标物体的三维空间信息。
  2. 根据权利要求1所述的方法,其中,所述基于多摄相机系统的内部参数和载具参数,将所述至少一幅图像在所述多摄相机视角空间下各自对应的特征数据映射至同一个鸟瞰视角空间,得到所述至少一幅图像在所述鸟瞰视角空间下各自对应的特征数据,包括:
    基于所述多摄相机系统的内部参数和载具参数,确定所述多摄相机系统的多摄相机的相机坐标系到鸟瞰视角坐标系的转换矩阵;
    基于所述多摄相机的相机坐标系到所述鸟瞰视角坐标系的转换矩阵,将所述至少一幅图像在所述多摄相机视角空间下各自对应的特征数据从所述多摄相机视角空间转换至所述鸟瞰视角空间下,得到所述至少一幅图像在所述鸟瞰视角空间下各自对应的特征数据。
  3. 根据权利要求2所述的方法,其中,所述基于所述多摄相机系统的内部参数和载具参数,确定所述多摄相机系统的多摄相机的相机坐标系到鸟瞰视角坐标系的转换矩阵,包括:
    分别获取所述多摄相机系统中多摄相机的相机内参数和相机外参数,以及,获取载具坐标系到所述鸟瞰视角坐标系的转换矩阵;
    基于所述多摄相机的所述相机外参数、所述相机内参数与所述载具坐标系到鸟瞰视角坐标系的转换矩阵,确定所述多摄相机的相机坐标系到所述鸟瞰视角坐标系的转换矩阵。
  4. 根据权利要求1所述的方法,其中,所述对所述鸟瞰视角融合特征中的目标物体进行目标预测,得到所述目标物体的三维空间信息,包括:
    利用预测网络从所述鸟瞰视角融合特征中获取用于确定所述目标物体在鸟瞰视角坐标系下的第一预设坐标值对应的热力图,以及,获取用于确定所述目标物体在所述鸟瞰视角坐标系下的第二预设坐标值、尺寸和朝向角的其他属性图;
    根据所述热力图中的峰值信息确定所述目标物体在所述鸟瞰视角坐标系下的所述第一预设坐标值,并且根据所述目标物体在所述鸟瞰视角坐标系下的所述第一预设坐标值从所述其他属性图中确定所述目标物 体在所述鸟瞰视角坐标下的所述第二预设坐标值、尺寸和朝向角;
    根据所述目标物体在鸟瞰视角坐标系下的所述第一预设坐标值、所述第二预设坐标值、所述尺寸和所述朝向角,确定所述目标物体的三维空间信息。
  5. 根据权利要求4所述的方法,其中,还包括:
    在所述预测网络的训练阶段,构建所述预测网络预测的热力图与真值热力图之间的第一损失函数,以及,构建所述预测网络预测的其他属性图与其他真值属性图之间的第二损失函数;
    根据所述第一损失函数和所述第二损失函数确定所述预测网络在训练阶段的总损失函数,以监督所述预测网络的训练过程。
  6. 根据权利要求5所述的方法,其中,所述根据所述第一损失函数和所述第二损失函数确定所述预测网络在训练阶段的总损失函数,包括:
    获取所述第一损失函数的权重值和所述第二损失函数的权重值;
    基于所述第一损失函数、所述第一损失函数的权重值、所述第二损失函数和所述第二损失函数的权重值,确定所述预测网络在训练阶段的总损失函数。
  7. 根据权利要求1或4所述的方法,其中,所述对所述鸟瞰视角融合特征中的目标物体进行目标预测,得到所述目标物体的三维空间信息,包括:
    利用神经网络对所述鸟瞰视角融合特征进行特征提取,获得包含所述目标物体特征的鸟瞰视角融合特征数据;
    利用预测网络对所述包含所述目标物体特征的鸟瞰视角融合特征数据中的所述目标物体进行目标预测,得到所述目标物体的三维空间信息。
  8. 根据权利要求1所述的方法,其中,所述对所述至少一幅图像进行特征提取,得到所述至少一幅图像在多摄相机视角空间下各自对应的包含目标物体特征的特征数据,包括:
    利用深度神经网络对各个视角对应的图像进行卷积计算,获得所述各个视角对应的图像在多摄相机视角空间下各自对应的包含所述目标物体特征的多个不同分辨率的特征数据。
  9. 一种基于多视角融合的3D目标检测装置,包括:
    图像接收模块,用于获取采集的来自多摄相机视角的至少一幅图像;
    特征提取模块,用于对所述图像接收模块获取的所述至少一幅图像进行特征提取,得到所述至少一幅图像在多摄相机视角空间下各自对应的包含目标物体特征的特征数据;
    图像特征映射模块,用于基于多摄相机系统的内部参数和载具参数,将所述特征提取模块获得的所述至少一幅图像在所述多摄相机视角空间下各自对应的特征数据映射至同一个鸟瞰视角空间,得到所述至少一幅图像在所述鸟瞰视角空间下各自对应的特征数据;
    图像融合模块,用于将所述图像特征映射模块得到的所述至少一幅图像在所述鸟瞰视角空间下各自对 应的特征数据进行特征融合,得到鸟瞰视角融合特征;
    3D检测模块,用于对所述图像融合模块得到的所述鸟瞰视角融合特征中的目标物体进行目标预测,得到所述目标物体的三维空间信息。
  10. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序用于执行上述权利要求1-8任一所述的基于多视角融合的3D目标检测方法。
  11. 一种电子设备,所述电子设备包括:
    处理器;
    用于存储所述处理器可执行指令的存储器;
    所述处理器,用于从所述存储器中读取所述可执行指令,并执行所述指令以实现上述权利要求1-8任一所述的基于多视角融合的3D目标检测方法。
PCT/CN2023/074861 2022-05-18 2023-02-08 一种基于多视角融合的3d目标检测方法及装置 WO2023221566A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210544237.0 2022-05-18
CN202210544237.0A CN114913506A (zh) 2022-05-18 2022-05-18 一种基于多视角融合的3d目标检测方法及装置

Publications (1)

Publication Number Publication Date
WO2023221566A1 true WO2023221566A1 (zh) 2023-11-23

Family

ID=82768370

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/074861 WO2023221566A1 (zh) 2022-05-18 2023-02-08 一种基于多视角融合的3d目标检测方法及装置

Country Status (2)

Country Link
CN (1) CN114913506A (zh)
WO (1) WO2023221566A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118154854A (zh) * 2024-05-09 2024-06-07 中国科学技术大学 多视角特征聚合的目标检测方法

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913506A (zh) * 2022-05-18 2022-08-16 北京地平线机器人技术研发有限公司 一种基于多视角融合的3d目标检测方法及装置
CN115457084A (zh) * 2022-09-13 2022-12-09 上海高德威智能交通系统有限公司 一种多相机目标检测跟踪方法、装置
CN115797455B (zh) * 2023-01-18 2023-05-02 北京百度网讯科技有限公司 目标检测方法、装置、电子设备和存储介质
CN116012805B (zh) * 2023-03-24 2023-08-29 深圳佑驾创新科技有限公司 目标感知方法、装置、计算机设备、存储介质
CN117315152B (zh) * 2023-09-27 2024-03-29 杭州一隅千象科技有限公司 双目立体成像方法及其系统
CN118155063A (zh) * 2024-02-22 2024-06-07 中国科学院空天信息创新研究院 多视角三维目标检测方法、装置、设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476822A (zh) * 2020-04-08 2020-07-31 浙江大学 一种基于场景流的激光雷达目标检测与运动跟踪方法
CN113673425A (zh) * 2021-08-19 2021-11-19 清华大学 一种基于Transformer的多视角目标检测方法及系统
CN114419568A (zh) * 2022-01-18 2022-04-29 东北大学 一种基于特征融合的多视角行人检测方法
CN114913506A (zh) * 2022-05-18 2022-08-16 北京地平线机器人技术研发有限公司 一种基于多视角融合的3d目标检测方法及装置

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378605B (zh) * 2020-03-10 2024-04-09 北京京东乾石科技有限公司 多源信息融合方法及装置、电子设备和存储介质
CN113673444B (zh) * 2021-08-19 2022-03-11 清华大学 一种基于角点池化的路口多视角目标检测方法及系统
CN113902897B (zh) * 2021-09-29 2022-08-23 北京百度网讯科技有限公司 目标检测模型的训练、目标检测方法、装置、设备和介质
CN114119748A (zh) * 2021-11-19 2022-03-01 上海汽车集团股份有限公司 一种车载环视相机的安装位姿确定方法和装置


Also Published As

Publication number Publication date
CN114913506A (zh) 2022-08-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23806528

Country of ref document: EP

Kind code of ref document: A1