CN115965847A - Three-dimensional target detection method and system based on multi-modal feature fusion under cross view angle - Google Patents

Three-dimensional target detection method and system based on multi-modal feature fusion under cross view angle

Info

Publication number
CN115965847A
CN115965847A (Application CN202310076916.4A)
Authority
CN
China
Prior art keywords
view
under
feature
angle
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310076916.4A
Other languages
Chinese (zh)
Inventor
江昆
杨殿阁
周韬华
杨蒙蒙
陈俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202310076916.4A
Publication of CN115965847A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a three-dimensional target detection method and system based on multi-modal feature fusion under cross view angles, which comprises the following steps: extracting features from camera image data and millimeter wave radar data under different viewing angles, and performing cross-view conversion to obtain feature information under the cross views; constructing a fusion network based on the cross-view multi-modal data, performing deep fusion and further feature extraction on the obtained cross-view feature information, and regressing the target category and three-dimensional position information to obtain complete three-dimensional target detection information. The spatial characteristics of the camera image information under the front view and of the millimeter wave radar point cloud information under the bird's-eye view are fully considered, so the spatial characteristics of different sensors can be accommodated and effectively fused, which improves the fusion performance, effectively improves the accuracy, and facilitates subsequent algorithm processing. The invention can be widely applied in the field of environment perception for intelligent vehicles.

Description

Three-dimensional target detection method and system based on multi-modal feature fusion under cross view angle
Technical Field
The invention relates to the field of environment perception for intelligent vehicles, and in particular to a three-dimensional target detection method and system based on multi-modal feature fusion under cross view angles.
Background
An intelligent vehicle needs to perceive and understand the driving environment using observations provided by on-board sensors; the perception results support driving tasks such as path planning and obstacle avoidance of risk targets through algorithms such as target detection and tracking, semantic segmentation, and scene understanding. Because the driving environment of an intelligent vehicle is generally highly complex and dynamic, high requirements are placed on the accuracy, stability, and reliability of the vehicle perception system. A single sensor is limited in sensing range, sensing precision, and richness of sensed information and can hardly meet the perception requirements of advanced automated driving, so fusion perception using multi-sensor information has become an effective means of perception enhancement.
Cameras and millimeter wave radars are two common on-board perception sensors. Images collected by a camera carry dense semantic information, while a millimeter wave radar directly observes the relative position and velocity of targets and, owing to its all-weather operating characteristics, is robust to severe weather conditions; information fusion perception algorithms for these two sensors are therefore widely applied in mass-produced driving assistance systems. However, camera images lack depth information, millimeter wave radar point clouds lack height information, and the radar point clouds are sparse and contain much clutter, so the information from the camera and the millimeter wave radar lacks a complete description of the three-dimensional environment and is difficult to apply directly to three-dimensional target perception tasks.
Existing fusion perception methods for radar point clouds and camera images generally project the millimeter wave radar point cloud onto the camera image using a spatial calibration matrix, design corresponding feature expression rules so that the radar point cloud is expressed consistently in image space, and then perform subsequent fusion feature extraction to realize target detection or target tracking. Such methods do not treat millimeter wave radar point clouds and camera images as two kinds of spatially heterogeneous information with large spatial differences between the data; forcibly unifying them into the front view of camera space during multi-modal data fusion adapts poorly to the characteristics of millimeter wave radar data, limits the fusion effect, and leaves the algorithm performance in need of improvement.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a method and a system for three-dimensional target detection based on multi-modal feature fusion under cross view angles, which can be applied to perception algorithms that perform data-level fusion of multi-modal spatially heterogeneous information from multiple sensors, can fully adapt to the spatial characteristics of different modal data, improve fusion performance, obtain more accurate and reliable target detection information, and improve the safety of intelligent vehicle systems.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a method for detecting a three-dimensional target by multi-modal feature fusion under a cross-view angle, including the following steps:
extracting features of the camera image data and the millimeter wave radar data under different viewing angles, and performing cross viewing angle conversion to obtain feature information under the cross viewing angle;
and constructing a fusion network based on cross view angle multi-modal data, performing deep fusion on the obtained feature information under the cross view angle, extracting features, and performing regression on the target category and the three-dimensional position information to obtain complete three-dimensional target detection information.
Further, the method for extracting features of the camera image data and the millimeter wave radar data at different viewing angles and performing cross viewing angle conversion to obtain feature information at a cross viewing angle includes the following steps:
constructing a feature extractor under a camera image view angle, and performing feature extraction on the camera image to obtain two-dimensional multi-scale convolution features and a corresponding 2D target detection position thereof under a front view angle;
constructing a feature extractor under the viewing angle of the millimeter wave radar, performing multi-frame point cloud accumulation processing on the millimeter wave radar point cloud data, and obtaining a radar point cloud feature distribution map under the viewing angle of the aerial view;
and constructing a cross visual angle feature converter, and performing visual angle conversion on the two-dimensional multi-scale convolution features under the visual angle of the front view and the radar point cloud feature distribution map under the visual angle of the aerial view to obtain feature information under the cross visual angle.
Further, the method for constructing the feature extractor under the camera image view angle and extracting the features of the camera image to obtain the two-dimensional multi-scale convolution features and the corresponding 2D target detection position under the front view angle comprises the following steps:
performing feature extraction on the camera image by using a convolutional neural network to obtain two-dimensional multi-scale convolution features under a front view visual angle;
based on the obtained two-dimensional multi-scale convolution features, for each pixel F_(h,w) of the camera image at coordinate (h, w), estimating its depth distribution D and class distribution C, and simultaneously performing a preliminary regression of the target's 2D position to obtain the 2D target detection position of the camera image.
Further, the method for constructing the feature extractor under the view angle of the millimeter wave radar, performing multi-frame point cloud accumulation processing on millimeter wave radar data, and obtaining the radar point cloud feature distribution map under the view angle of the aerial view comprises the following steps:
performing multi-frame point cloud accumulation processing on the millimeter wave radar point cloud data to obtain a current frame millimeter wave radar point cloud observation result;
and constructing a radar point cloud characteristic distribution map under the view angle of the aerial view by utilizing a Gaussian probability distribution model based on the current frame millimeter wave radar point cloud observation result.
Further, the current frame millimeter wave radar point cloud observation result is as follows:
Z_radar(t) = T_c_from_r · T_c_from_g(t) · T_g_from_c(t-k) · T_r_from_c · Z_radar(t-k)
where Z_radar(t) is the current frame millimeter wave radar point cloud observation result; T_c_from_r is the transfer matrix from the millimeter wave radar coordinate system to the host vehicle coordinate system; T_c_from_g(t) is the transfer matrix of the current frame from the global coordinate system to the host vehicle coordinate system; T_g_from_c(t-k) is the transfer matrix from the host vehicle coordinate system to the global coordinate system at frame t-k; T_r_from_c is the transfer matrix from the host vehicle coordinate system to the millimeter wave radar coordinate system; and Z_radar(t) and Z_radar(t-k) are the millimeter wave radar point cloud observation results of the current frame and of k frames earlier, respectively.
Further, the method for constructing the cross view angle feature converter and converting the view angle of the two-dimensional multi-scale convolution feature under the view angle of the front view and the radar point cloud feature distribution map under the view angle of the aerial view to obtain the feature information under the cross view angle comprises the following steps:
constructing a front view feature converter based on the intrinsic and extrinsic parameter information of the image, and converting the radar point cloud feature distribution map from the bird's-eye view to the front view to obtain the radar Gaussian feature fusion result under the front view;
and constructing a bird's-eye view feature converter based on the intrinsic and extrinsic parameter information of the image, and converting the two-dimensional multi-scale convolution features from the front view to the bird's-eye view to obtain the image convolution feature fusion result under the bird's-eye view.
Further, the method of constructing the front view feature converter based on the intrinsic and extrinsic parameter information of the image and converting the radar point cloud feature distribution map from the bird's-eye view to the front view to obtain the radar Gaussian feature fusion result under the front view comprises the following steps:
firstly, projecting the radar point cloud onto the front view using the spatial transformation relation T_f_from_b from the bird's-eye view to the front view coordinate system;
and secondly, according to the 2D target detection position of the camera image, retaining the radar points that fall within the two-dimensional bounding box of a target in the front view and filling the corresponding pixel positions to obtain the radar Gaussian feature fusion result under the front view.
Further, the method of constructing the bird's-eye view feature converter based on the calibration information between the millimeter wave radar and the image and converting the two-dimensional multi-scale convolution features from the front view to the bird's-eye view to obtain the image convolution feature fusion result under the bird's-eye view comprises the following steps:
firstly, determining the spatial size (d_x, d_y, d_z) represented by each pixel in three-dimensional space and constructing the corresponding viewing frustum;
secondly, using the intrinsic parameters from the camera to the image coordinate system and the extrinsic parameters from the camera space to the bird's-eye view space coordinate system, obtaining the position distribution in bird's-eye view space of each pixel at image coordinate (u, v), and denoting the corresponding transition matrix as T_b_from_f;
thirdly, projecting the corresponding image viewing frustums onto the bird's-eye view using the position conversion relation T_b_from_f;
and finally, filling the two-dimensional multi-scale image convolution features into the corresponding viewing frustums under the bird's-eye view using an interpolation function to obtain the image convolution feature fusion result under the bird's-eye view.
Further, the constructing a fusion network based on cross-view multi-modal data, performing deep fusion on the obtained feature information under the cross view, extracting features, and performing regression on the target category and the three-dimensional position information to obtain complete three-dimensional target detection information includes:
constructing a multi-modal data fusion network based on a cross view, connecting two-dimensional multi-scale convolution features under a view angle of a front view and projected radar Gaussian features together, and connecting a radar point cloud feature distribution map under a view angle of a bird view and projected image convolution features together;
performing deep fusion feature extraction on the obtained feature fusion information, connecting all feature information together after unifying the scale, and realizing corresponding information weight distribution by using the convolutional layer;
and (3) performing regression of target type and target pose information by using the three-dimensional target detection head, calculating a corresponding loss function, and performing training and optimization of a network to obtain complete three-dimensional target detection information.
In a second aspect, the present invention provides a system for detecting a three-dimensional target by multi-modal feature fusion under cross-view angles, including:
the cross visual angle feature extraction module is used for extracting features of the camera image data and the millimeter wave radar data under different visual angles, and performing cross visual angle conversion to obtain feature information under the cross visual angles;
and the three-dimensional target detection module is used for constructing a fusion network based on the cross view angle multi-modal data, further extracting the characteristics of the obtained characteristic information under the cross view angle, and simultaneously performing regression of the target type and the three-dimensional position information to obtain complete three-dimensional target detection information.
Due to the adoption of the technical scheme, the invention has the following advantages: the spatial characteristics of the camera image information under the front view viewing angle and the spatial characteristics of the millimeter wave radar point cloud information under the aerial view viewing angle are fully considered, the spatial characteristics of different sensors can be adapted to further perform effective fusion, the fusion performance is improved, the accuracy is effectively improved, and the subsequent algorithm processing is facilitated.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Like reference numerals refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic flow chart of a method for detecting a three-dimensional target based on multi-modal feature fusion under a cross-view in an embodiment of the present invention;
fig. 2 is a schematic diagram of a cross-view feature converter in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention, are within the scope of the invention.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
In some embodiments of the present invention, a three-dimensional target detection method based on multi-modal feature fusion under cross view angles is provided. Features of the camera image are extracted by a convolutional neural network; millimeter wave radar point cloud features are extracted based on an understanding of the characteristics of millimeter wave radar; a cross-view feature conversion module is designed to fuse the camera image data features and the millimeter wave radar data features under the front view and the bird's-eye view (the bird's-eye view coordinate system may be chosen directly as the coordinate system of the millimeter wave radar, or, with subsequent applications in mind, as the ego-vehicle coordinate system; in the present invention both are unified into a three-dimensional space coordinate system under the bird's-eye view); further feature extraction and regression of the target category and three-dimensional position information are then performed using the fused multi-modal features under the cross views, and complete three-dimensional target detection information is finally output. The method can be applied to perception algorithms that perform data-level fusion of multi-modal spatially heterogeneous information from multiple sensors, can fully adapt to the spatial characteristics of different modal data, improves fusion performance, obtains more accurate and reliable target detection information, and improves the safety of intelligent vehicle systems.
In accordance with other embodiments of the present invention, there are provided a system, apparatus and medium for three-dimensional object detection with multi-modal feature fusion at cross-views.
Example 1
As shown in fig. 1, the present embodiment provides a method for detecting a three-dimensional target by multi-modal feature fusion under cross-view angles, which includes the following steps:
1) Extracting features of the camera image data and the millimeter wave radar data under different visual angles, and performing cross visual angle conversion to obtain feature information under cross visual angles;
2) And constructing a fusion network based on cross view angle multi-modal data, performing deep fusion on the obtained feature information under the cross view angle, extracting features, and performing regression on the target category and the three-dimensional position information to obtain complete three-dimensional target detection information.
Preferably, in step 1), the method for extracting features of the camera image data and the millimeter wave radar data at different viewing angles and performing cross viewing angle conversion to obtain feature information at a cross viewing angle includes the following steps:
1.1 Constructing a feature extractor under a camera image view angle, and performing feature extraction on the camera image to obtain two-dimensional multi-scale convolution features and a corresponding 2D target detection position thereof under a front view angle;
1.2 Constructing a feature extractor under the viewing angle of the millimeter wave radar, performing multi-frame point cloud accumulation processing on the millimeter wave radar point cloud data, and obtaining a radar point cloud feature distribution map under the viewing angle of the aerial view;
1.3 Constructing a cross visual angle feature converter, and performing visual angle conversion on the two-dimensional multi-scale convolution features under the visual angle of the front view and the radar point cloud feature distribution map under the visual angle of the aerial view to obtain feature information under the cross visual angle.
Preferably, in the step 1.1), a method for constructing a feature extractor under a camera image viewing angle and performing feature extraction on the camera image to obtain two-dimensional multi-scale convolution features and corresponding 2D target detection positions thereof under a front view viewing angle includes the following steps:
1.1.1 Carrying out feature extraction on the camera image by utilizing a convolutional neural network to obtain two-dimensional multi-scale convolution features under the view angle of the front view.
Specifically, in this embodiment, a deep residual convolutional neural network (ResNet-101) is used as the backbone to extract deep multi-channel convolution features of the camera image, which are further processed by a Feature Pyramid Network (FPN) to obtain two-dimensional multi-scale convolution feature information.
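The following is a minimal Python/PyTorch sketch of such an image branch: a ResNet-101 backbone whose stage outputs are fused by a feature pyramid network into multi-scale front-view features. The class name, channel widths, input resolution, and the torchvision >= 0.13 API are assumptions for illustration, not the exact network of this embodiment.

```python
from collections import OrderedDict

import torch
import torch.nn as nn
import torchvision
from torchvision.ops import FeaturePyramidNetwork


class ImageFeatureExtractor(nn.Module):
    """Extracts multi-scale 2D convolution features from a front-view camera image."""

    def __init__(self, fpn_channels: int = 256):
        super().__init__()
        resnet = torchvision.models.resnet101(weights=None)  # torchvision >= 0.13 API assumed
        # Stem plus the four residual stages (outputs C2..C5).
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4
        # FPN fuses the stage outputs into same-width multi-scale feature maps.
        self.fpn = FeaturePyramidNetwork([256, 512, 1024, 2048], fpn_channels)

    def forward(self, image: torch.Tensor) -> "OrderedDict[str, torch.Tensor]":
        x = self.stem(image)
        c2 = self.layer1(x)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        feats = OrderedDict([("p2", c2), ("p3", c3), ("p4", c4), ("p5", c5)])
        return self.fpn(feats)  # multi-scale front-view features


if __name__ == "__main__":
    extractor = ImageFeatureExtractor()
    pyramid = extractor(torch.randn(1, 3, 448, 800))
    print({k: tuple(v.shape) for k, v in pyramid.items()})
```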
1.1.2 ) Based on the obtained two-dimensional multi-scale convolution features, for each pixel F_(h,w) of the camera image at coordinate (h, w), its depth distribution D and class distribution C are estimated, and a preliminary regression of the target's 2D position is performed simultaneously to obtain the 2D target detection position of the camera image.
Specifically, in this embodiment, the classification probability of each discrete depth is predicted from the two-dimensional multi-scale convolution features using a softmax function, giving the depth distribution D = {d_0, d_0 + Δ, ..., d_0 + kΔ}; D contains k + 1 discrete points, d_0 is the minimum value of the predicted depth distribution, in m, and Δ is the spacing of the discrete depth distribution, in m.
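As a sketch of this step, the per-pixel depth distribution D = {d_0, d_0+Δ, ..., d_0+kΔ} and class distribution can be predicted with 1x1 convolutions and a softmax over the k+1 depth bins. The head layout and the bin parameters d0 and delta below are assumed values, not the patent's exact design.

```python
import torch
import torch.nn as nn


class DepthClassHead(nn.Module):
    """Per-pixel discrete depth distribution and class distribution from front-view features."""

    def __init__(self, in_channels: int, num_depth_bins: int, num_classes: int,
                 d0: float = 2.0, delta: float = 1.0):
        super().__init__()
        self.depth_logits = nn.Conv2d(in_channels, num_depth_bins, kernel_size=1)
        self.class_logits = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        # Discrete depth bin centres d0, d0+delta, ..., d0+k*delta (metres); assumed values.
        self.register_buffer("bins", d0 + delta * torch.arange(num_depth_bins).float())

    def forward(self, feat: torch.Tensor):
        depth_dist = self.depth_logits(feat).softmax(dim=1)   # D: (B, k+1, H, W)
        class_dist = self.class_logits(feat).softmax(dim=1)   # per-pixel class distribution
        expected_depth = (depth_dist * self.bins.view(1, -1, 1, 1)).sum(dim=1)
        return depth_dist, class_dist, expected_depth
```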
Preferably, in the step 1.2), a feature extractor under the view angle of the millimeter wave radar is constructed, multi-frame point cloud accumulation processing is performed on the millimeter wave radar data, and a radar point cloud feature distribution map under the view angle of the aerial view is obtained, including the following steps:
1.2.1 Multi-frame point cloud accumulation processing is carried out on the millimeter wave radar point cloud data, and a current frame millimeter wave radar point cloud observation result is obtained.
In this embodiment, the radar point cloud information from the past 5 frames (taken as an example; the present invention is not limited thereto) up to the current frame is accumulated using the positioning information and the timestamp information, so as to further increase the point cloud density. The calculation formula is as follows:
Z_radar(t) = T_c_from_r · T_c_from_g(t) · T_g_from_c(t-k) · T_r_from_c · Z_radar(t-k)    (1)
where Z_radar(t) is the current frame millimeter wave radar point cloud observation result; T_c_from_r is the transfer matrix from the millimeter wave radar coordinate system to the host vehicle coordinate system; T_c_from_g(t) is the transfer matrix of the current frame from the global coordinate system to the host vehicle coordinate system; T_g_from_c(t-k) is the transfer matrix from the host vehicle coordinate system to the global coordinate system at frame t-k; T_r_from_c is the transfer matrix from the host vehicle coordinate system to the millimeter wave radar coordinate system; and Z_radar(t) and Z_radar(t-k) are the millimeter wave radar point cloud observation results of the current frame and of k frames earlier, respectively.
As can be seen from equation (1), to project the millimeter wave radar point cloud observation Z_radar(t-k) from k frames earlier onto the current frame Z_radar(t), Z_radar(t-k) is first projected into the ego-vehicle coordinate system via T_r_from_c; then, using the relation T_g_from_c(t-k) between the ego vehicle k frames earlier and the global coordinate system provided by the positioning system, and the relation T_c_from_g(t) between the current time and the global coordinate system, the motion of the ego-vehicle coordinate system over the k frames is accounted for; finally, the result is projected via T_c_from_r into the radar space of the current frame Z_radar(t).
Wherein Z_radar(t) is expressed as:

Z_radar = (dis_lat, dis_long, vel_lat, vel_long, RCS)

where dis_lat is the lateral position of the target measured by the radar; dis_long is the longitudinal position measured by the radar; vel_lat is the relative lateral velocity measured by the radar; vel_long is the longitudinal velocity measured by the radar; and RCS is the radar cross section of the measured signal, reflecting the reflection intensity of the radar echo.

Each transfer matrix is composed of a rotation matrix R and a translation vector t and satisfies:

T = [ R  t ; 0  1 ]

i.e. a homogeneous transformation whose upper-left 3x3 block is the rotation matrix R, whose upper-right column is the translation vector t, and whose bottom row is [0 0 0 1].
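A small NumPy sketch of the multi-frame accumulation in equation (1) follows: past radar points are carried into the current frame by composing the four homogeneous transforms in the order written in the equation. The function names, the 4x4 homogeneous-matrix representation, and the (N, 3) point layout are assumptions of this sketch.

```python
import numpy as np


def make_transform(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transfer matrix from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def accumulate_radar(points_tk: np.ndarray,
                     T_c_from_r: np.ndarray,
                     T_c_from_g_t: np.ndarray,
                     T_g_from_c_tk: np.ndarray,
                     T_r_from_c: np.ndarray) -> np.ndarray:
    """Map radar points observed k frames ago, shape (N, 3), into the current radar frame.

    The matrices are composed exactly in the order of equation (1); the product is
    applied right-to-left to the homogeneous point coordinates.
    """
    T = T_c_from_r @ T_c_from_g_t @ T_g_from_c_tk @ T_r_from_c
    homo = np.hstack([points_tk, np.ones((points_tk.shape[0], 1))])
    return (T @ homo.T).T[:, :3]
```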
1.2.2 Based on the current frame millimeter wave radar point cloud observation result, a radar point cloud characteristic distribution diagram under the perspective of the bird's-eye view is constructed by utilizing a Gaussian probability distribution model.
A radar point cloud feature distribution map under the bird's-eye view is constructed using a Gaussian probability distribution model, based on the relative position, relative velocity, and radar cross section (RCS, which reflects the radar reflection intensity) provided by the radar point cloud information.
Specifically, in this embodiment, the radar point cloud feature distribution map contains 5 channels, respectively representing the lateral position (dis_lat), longitudinal position (dis_long), lateral velocity (vel_lat), longitudinal velocity (vel_long), and reflection intensity (RCS) of the targets; under the bird's-eye view, the radar point cloud feature distribution map is designed using a first-order bivariate Gaussian normal distribution:
f(x, y) = (1 / (2π σ_1 σ_2)) · exp( -((x - μ_1)^2 / (2 σ_1^2) + (y - μ_2)^2 / (2 σ_2^2)) )
where (x, y) is the position coordinate of any cell after the space under the bird's-eye view is rasterized, determined from the lateral and radial positions at which the millimeter wave radar observes the target under the bird's-eye view; μ = [μ_1 = dis_long, μ_2 = dis_lat] is the position of the target observed laterally and radially by the radar, determined from the millimeter wave radar measurements of each variable; and σ denotes the uncertainty of the target position measured by the radar, determined from the measurement accuracy of the millimeter wave radar for each variable.
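A hedged sketch of this step: each accumulated radar return spreads its five measured quantities over a rasterized bird's-eye-view grid with a bivariate Gaussian centred at its measured position. The grid extent, resolution, σ values, and the rule of keeping the strongest Gaussian weight per cell are assumptions for illustration.

```python
import numpy as np


def radar_bev_map(points, grid_range=(0.0, 100.0, -50.0, 50.0),
                  resolution=0.5, sigma=(1.5, 1.0)):
    """points: (N, 5) array of [dis_long, dis_lat, vel_long, vel_lat, rcs] per radar return.

    Returns a (5, H, W) BEV map whose channels hold Gaussian-weighted lateral/longitudinal
    position, lateral/longitudinal velocity and RCS values.
    """
    x_min, x_max, y_min, y_max = grid_range
    H = int((x_max - x_min) / resolution)
    W = int((y_max - y_min) / resolution)
    bev = np.zeros((5, H, W), dtype=np.float32)
    best_w = np.zeros((H, W), dtype=np.float32)

    # Cell-centre coordinates of the rasterized BEV grid.
    xs = x_min + (np.arange(H) + 0.5) * resolution          # longitudinal axis
    ys = y_min + (np.arange(W) + 0.5) * resolution          # lateral axis
    gx, gy = np.meshgrid(xs, ys, indexing="ij")

    for dis_long, dis_lat, vel_long, vel_lat, rcs in points:
        # Bivariate Gaussian weight centred at the measured target position (mu_1, mu_2).
        w = np.exp(-((gx - dis_long) ** 2 / (2 * sigma[0] ** 2)
                     + (gy - dis_lat) ** 2 / (2 * sigma[1] ** 2)))
        mask = w > best_w
        values = (dis_lat, dis_long, vel_lat, vel_long, rcs)
        for c, v in enumerate(values):
            bev[c][mask] = (w * v)[mask]                     # keep the dominant return per cell
        best_w = np.maximum(best_w, w)
    return bev
```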
Preferably, in step 1.3), as shown in fig. 2, the method of constructing the cross-view feature converter and performing view conversion on the two-dimensional multi-scale convolution features under the front view and the radar point cloud feature distribution map under the bird's-eye view to obtain feature information under the cross views comprises the following steps:
1.3.1 ) A front view feature converter is constructed based on the intrinsic and extrinsic parameter information of the image, and the radar point cloud feature distribution map under the bird's-eye view is converted to the front view to obtain the radar Gaussian feature fusion result under the front view.
The intrinsic parameter information of the image comprises the conversion matrix from the camera coordinate system to the image plane pixel coordinate system; the extrinsic parameter information of the image comprises the spatial transformation matrix from the camera coordinate system to the reference coordinate system. Specifically, the feature transformation under the front view comprises the following steps:
Firstly, the radar point cloud is projected onto the front view using the spatial transformation relation T_f_from_b from the bird's-eye view to the front view coordinate system. Secondly, according to the 2D target detection positions of the camera image predicted in step 1.1.2), the radar points falling within the two-dimensional bounding box of a target in the front view are retained (in practice, to allow for radar points affected by measurement errors, points that may lie near the boundary of a target bounding box are also retained, so the threshold is relaxed flexibly by a factor α with α ≥ 1), and the corresponding pixel positions are filled to obtain the radar Gaussian feature fusion result under the front view. Here, (u_j, v_j) denotes the pixel position obtained by projecting the radar target position (dis_long, dis_lat) onto the image plane through the intrinsic and extrinsic parameters; (w_j, h_j) denotes the width and height of the two-dimensional rectangle formed after the radar target is projected onto the image plane; and the filled pixel values are those of the radar features after projection onto the image plane.
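The front-view branch of the converter can be sketched as below: radar targets are projected onto the image plane with an assumed 3x4 projection matrix, and only returns that fall inside an α-scaled detected 2D box (α ≥ 1, as in the text) are written into an image-aligned radar feature map. The channel layout, point format, and projection handling are assumptions, not the patent's exact formulation.

```python
import numpy as np


def radar_to_front_view(radar_points, boxes_2d, P, image_size, alpha=1.2):
    """radar_points: (N, 5) [x, y, z, vel, rcs] in radar/BEV coordinates.
    boxes_2d: (M, 4) detected 2D boxes [u1, v1, u2, v2]. P: 3x4 projection to pixels."""
    H, W = image_size
    radar_map = np.zeros((2, H, W), dtype=np.float32)   # channels: velocity, RCS

    homo = np.hstack([radar_points[:, :3], np.ones((len(radar_points), 1))])
    uvw = (P @ homo.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)   # perspective divide

    for (u, v), (_, _, _, vel, rcs) in zip(uv, radar_points):
        for u1, v1, u2, v2 in boxes_2d:
            cu, cv = (u1 + u2) / 2, (v1 + v2) / 2
            bw, bh = (u2 - u1) * alpha, (v2 - v1) * alpha   # alpha >= 1 relaxes the box
            if abs(u - cu) <= bw / 2 and abs(v - cv) <= bh / 2:
                ui, vi = int(np.clip(u, 0, W - 1)), int(np.clip(v, 0, H - 1))
                radar_map[0, vi, ui] = vel
                radar_map[1, vi, ui] = rcs
                break
    return radar_map
```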
1.3.2 ) Based on the intrinsic and extrinsic parameter information of the image (including the spatial conversion relation from the radar to the camera and the conversion relation from the camera to the image plane), a bird's-eye view feature converter is constructed, and the two-dimensional multi-scale convolution features under the front view are converted to the bird's-eye view to obtain the image convolution feature fusion result under the bird's-eye view.
Specifically, the image feature conversion under the bird's-eye view comprises:
First, the spatial size (d_x, d_y, d_z) represented by each pixel in three-dimensional space is determined and the corresponding viewing frustum is constructed. Secondly, using the intrinsic parameters from the camera to the image coordinate system and the extrinsic parameters from the camera space to the bird's-eye view space coordinate system, the position distribution in bird's-eye view space of each pixel at image coordinate (u, v) is obtained, and the corresponding transition matrix is denoted T_b_from_f. Thirdly, the corresponding image viewing frustums are projected onto the bird's-eye view using the position conversion relation T_b_from_f. Finally, the previously obtained image convolution features are filled into the corresponding viewing frustums under the bird's-eye view using an interpolation function, giving the image convolution feature fusion result under the bird's-eye view.
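Below is a hedged sketch of this image-to-BEV conversion: each pixel is lifted along its camera ray to the discrete depth bins, transformed into BEV space with T_b_from_f, and its depth-weighted features are scattered into the BEV grid. The bin depths, grid extent, and the simple nearest-cell scattering (instead of the interpolation-based frustum filling described above) are assumptions of this sketch.

```python
import torch


def image_features_to_bev(feat, depth_dist, K_inv, T_b_from_f,
                          bev_size=(200, 200), bev_range=(0.0, 100.0, -50.0, 50.0)):
    """feat: (C, H, W) front-view features; depth_dist: (D, H, W) softmax depth bins;
    K_inv: (3, 3) inverse camera intrinsics; T_b_from_f: (4, 4) front-view -> BEV transform."""
    C, H, W = feat.shape
    D = depth_dist.shape[0]
    depths = torch.arange(1, D + 1, dtype=feat.dtype)            # assumed 1 m bin spacing

    # Back-project every pixel (u, v) to a unit-depth ray in camera space.
    vs, us = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([us, vs, torch.ones_like(us)], dim=0).to(feat.dtype)   # (3, H, W)
    rays = (K_inv @ pix.reshape(3, -1)).reshape(3, H, W)

    x_min, x_max, y_min, y_max = bev_range
    Hb, Wb = bev_size
    bev = torch.zeros(C, Hb, Wb, dtype=feat.dtype)

    for d in range(D):
        pts = rays * depths[d]                                               # 3D points of this bin
        pts_h = torch.cat([pts, torch.ones(1, H, W, dtype=feat.dtype)], dim=0).reshape(4, -1)
        xy = (T_b_from_f @ pts_h)[:2]                                        # BEV x (long.), y (lat.)
        ix = ((xy[0] - x_min) / (x_max - x_min) * Hb).long()
        iy = ((xy[1] - y_min) / (y_max - y_min) * Wb).long()
        valid = (ix >= 0) & (ix < Hb) & (iy >= 0) & (iy < Wb)
        weighted = (feat * depth_dist[d]).reshape(C, -1)                     # depth-weighted features
        flat = ix[valid] * Wb + iy[valid]
        bev.view(C, -1).index_add_(1, flat, weighted[:, valid])              # scatter-add into BEV
    return bev
```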
Preferably, in step 2), a fusion network based on cross-view multi-modal data is constructed, further feature extraction is performed on the cross-view feature information obtained in step 1), and regression of the target category information and three-dimensional position information is performed, which comprises the following steps:
2.1 Constructing a multi-modal data fusion network based on a cross view, connecting two-dimensional multi-scale convolution characteristics under a view angle of a front view and projected radar Gaussian characteristics together, and simultaneously connecting a radar point cloud characteristic distribution diagram under a view angle of a bird view and projected image convolution characteristics together;
2.2 Performing deep fusion feature extraction on the feature fusion information under the cross viewing angle obtained in the step 2.1) by using a Multilayer Perceptron (MLP) and a convolutional neural network, connecting all feature information together after unifying the scale, and realizing corresponding information weight distribution by using a convolutional layer;
the convolutional neural network of the present embodiment employs a deep residual neural network (ResNet 18) and a feature pyramid network, but is not limited thereto;
2.3 The three-dimensional target detection head is used for regression of target category and target pose information, corresponding loss functions are calculated, training and optimization of a network are carried out, and therefore complete three-dimensional target detection information is obtained.
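The fusion stage of steps 2.1)-2.3) can be sketched as follows: the per-view feature maps are concatenated, fused with convolutions (a 1x1 convolution plays the role of the information weight distribution), the front-view branch is resampled to the BEV resolution as a stand-in for the scale-unification step, and simple convolutional heads regress the class scores and 3D box parameters. Channel counts, the head layout, and the resampling choice are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossViewFusionNet(nn.Module):
    """Fuses front-view and BEV feature maps from camera and radar, then predicts 3D targets."""

    def __init__(self, img_c=256, radar_fv_c=2, radar_bev_c=5, img_bev_c=256,
                 fused_c=256, num_classes=10, box_dim=7):
        super().__init__()
        self.fv_fuse = nn.Conv2d(img_c + radar_fv_c, fused_c, 3, padding=1)
        self.bev_fuse = nn.Conv2d(radar_bev_c + img_bev_c, fused_c, 3, padding=1)
        # 1x1 convolution realises the per-channel information weight distribution.
        self.mix = nn.Sequential(nn.Conv2d(2 * fused_c, fused_c, 1), nn.ReLU(inplace=True))
        self.cls_head = nn.Conv2d(fused_c, num_classes, 1)
        self.box_head = nn.Conv2d(fused_c, box_dim, 1)   # e.g. x, y, z, w, l, h, yaw

    def forward(self, img_fv, radar_fv, radar_bev, img_bev):
        fv = F.relu(self.fv_fuse(torch.cat([img_fv, radar_fv], dim=1)))
        bev = F.relu(self.bev_fuse(torch.cat([radar_bev, img_bev], dim=1)))
        # Unify scale: resample the front-view branch onto the BEV grid before mixing.
        fv_on_bev = F.interpolate(fv, size=bev.shape[-2:], mode="bilinear", align_corners=False)
        fused = self.mix(torch.cat([fv_on_bev, bev], dim=1))
        return self.cls_head(fused), self.box_head(fused)
```

In training, the classification output would typically be supervised with a classification loss and the box output with a regression loss, mirroring the loss computation mentioned in step 2.3); the exact losses are not specified here and would be chosen per implementation.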
In conclusion, the present invention realizes cross-view feature expression of image information and millimeter wave radar information by designing a cross-view feature converter, and completes the fusion and regression of multi-modal data through a deep fusion neural network. The method can fully adapt to the spatial expression forms of data under different views, realizes three-dimensional target detection with multi-source information fusion, and can further be applied to single-vehicle multi-sensor fusion perception tasks and multi-vehicle joint perception tasks.
Example 2
Embodiment 1 provides a three-dimensional target detection method based on multi-modal feature fusion under cross view angles; correspondingly, this embodiment provides a three-dimensional target detection system based on multi-modal feature fusion under cross view angles. The system provided by this embodiment can implement the method of Embodiment 1, and the system can be implemented by software, hardware, or a combination of software and hardware. For example, the system may comprise integrated or separate functional modules or functional units to perform the corresponding steps of the method of Embodiment 1. Since the system of this embodiment is substantially similar to the method embodiment, the description of this embodiment is relatively brief, and reference may be made to the relevant parts of the description of Embodiment 1.
The system for detecting a three-dimensional target by multi-modal feature fusion under a cross view angle provided by the embodiment comprises:
the cross visual angle feature extraction module is used for extracting features of the camera image data and the millimeter wave radar data under different visual angles, and performing cross visual angle conversion to obtain feature information under the cross visual angles;
and the three-dimensional target detection module is used for constructing a fusion network based on cross view angle multi-modal data, further extracting the characteristics of the obtained characteristic information under the cross view angle, and regressing the target type and the three-dimensional position information to obtain complete three-dimensional target detection information.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A three-dimensional target detection method based on multi-modal feature fusion under a cross view angle is characterized by comprising the following steps:
extracting features of the camera image data and the millimeter wave radar data under different viewing angles, and performing cross viewing angle conversion to obtain feature information under the cross viewing angle;
and constructing a fusion network based on cross view angle multi-modal data, performing deep fusion on the obtained feature information under the cross view angle, extracting features, and performing regression on the target category and the three-dimensional position information to obtain complete three-dimensional target detection information.
2. The method for detecting the three-dimensional target through the multi-modal feature fusion under the cross view angle as claimed in claim 1, wherein the method for extracting the features of the camera image data and the millimeter wave radar data under different view angles and performing the cross view angle conversion to obtain the feature information under the cross view angle comprises the following steps:
constructing a feature extractor under a camera image view angle, and performing feature extraction on the camera image to obtain two-dimensional multi-scale convolution features and a corresponding 2D target detection position thereof under a front view angle;
constructing a feature extractor under the viewing angle of the millimeter wave radar, performing multi-frame point cloud accumulation processing on the millimeter wave radar point cloud data, and obtaining a radar point cloud feature distribution map under the viewing angle of the aerial view;
and constructing a cross visual angle feature converter, and performing visual angle conversion on the two-dimensional multi-scale convolution features under the visual angle of the front view and the radar point cloud feature distribution map under the visual angle of the aerial view to obtain feature information under the cross visual angle.
3. The method for detecting the three-dimensional target through the multi-modal feature fusion under the cross view angle as claimed in claim 2, wherein the method for constructing the feature extractor under the camera image view angle, and performing the feature extraction on the camera image to obtain the two-dimensional multi-scale convolution feature and the corresponding 2D target detection position under the front view angle comprises the following steps:
performing feature extraction on the camera image by using a convolutional neural network to obtain two-dimensional multi-scale convolution features under a front view visual angle;
based on the obtained two-dimensional multi-scale convolution features, for each pixel F_(h,w) of the camera image at coordinate (h, w), estimating its depth distribution D and class distribution C, and simultaneously performing a preliminary regression of the target's 2D position to obtain the 2D target detection position of the camera image.
4. The method for detecting the three-dimensional target through the multi-modal feature fusion under the cross view angle as claimed in claim 2, wherein the method for constructing the feature extractor under the view angle of the millimeter wave radar, performing multi-frame point cloud accumulation processing on the millimeter wave radar data, and obtaining the radar point cloud feature distribution map under the view angle of the bird's eye view comprises the following steps:
performing multi-frame point cloud accumulation processing on the millimeter wave radar point cloud data to obtain a current frame millimeter wave radar point cloud observation result;
and constructing a radar point cloud characteristic distribution map under the view angle of the aerial view by utilizing a Gaussian probability distribution model based on the current frame millimeter wave radar point cloud observation result.
5. The method for detecting the three-dimensional target through the multi-modal feature fusion under the cross-view angle as claimed in claim 4, wherein the observation result of the current frame millimeter wave radar point cloud is as follows:
Z_radar(t) = T_c_from_r · T_c_from_g(t) · T_g_from_c(t-k) · T_r_from_c · Z_radar(t-k)
where Z_radar(t) is the current frame millimeter wave radar point cloud observation result; T_c_from_r is the transfer matrix from the millimeter wave radar coordinate system to the host vehicle coordinate system; T_c_from_g(t) is the transfer matrix of the current frame from the global coordinate system to the host vehicle coordinate system; T_g_from_c(t-k) is the transfer matrix from the host vehicle coordinate system to the global coordinate system at frame t-k; T_r_from_c is the transfer matrix from the host vehicle coordinate system to the millimeter wave radar coordinate system; and Z_radar(t) and Z_radar(t-k) are the millimeter wave radar point cloud observation results of the current frame and of k frames earlier, respectively.
6. The method for detecting the three-dimensional target through the multi-modal feature fusion under the cross view angle as claimed in claim 2, wherein the method for constructing the cross view angle feature converter to perform view angle conversion on the two-dimensional multi-scale convolution features under the view angle of the front view and the radar point cloud feature distribution map under the view angle of the bird's eye view to obtain the feature information under the cross view angle comprises the following steps:
constructing a front view feature converter based on the intrinsic and extrinsic parameter information of the image, and converting the radar point cloud feature distribution map from the bird's-eye view to the front view to obtain the radar Gaussian feature fusion result under the front view;
and constructing a bird's-eye view feature converter based on the intrinsic and extrinsic parameter information of the image, and converting the two-dimensional multi-scale convolution features from the front view to the bird's-eye view to obtain the image convolution feature fusion result under the bird's-eye view.
7. The method for detecting a three-dimensional target through multi-modal feature fusion under cross view angles as claimed in claim 6, wherein the method of constructing the front view feature converter based on the intrinsic and extrinsic parameter information of the image and converting the radar point cloud feature distribution map from the bird's-eye view to the front view to obtain the radar Gaussian feature fusion result under the front view comprises the following steps:
firstly, projecting the radar point cloud onto the front view using the spatial transformation relation T_f_from_b from the bird's-eye view to the front view coordinate system;
and secondly, according to the 2D target detection position of the camera image, reserving the radar point cloud falling in the two-dimensional space surrounding frame of the target in the front view, filling the corresponding pixel position, and obtaining a radar Gaussian feature fusion result under the front view.
8. The method for detecting the three-dimensional target through the multi-modal feature fusion under the cross viewing angle as claimed in claim 6, wherein the method for constructing the bird's-eye view feature converter based on the calibration information between the millimeter wave radar and the image, and converting the two-dimensional multi-scale convolution feature under the front view viewing angle into the bird's-eye view viewing angle to obtain the image convolution feature fusion result under the bird's-eye view viewing angle comprises the following steps:
first, determining the spatial size (d_x, d_y, d_z) represented by each pixel in three-dimensional space, and constructing the corresponding viewing frustum;
secondly, using the intrinsic parameters from the camera to the image coordinate system and the extrinsic parameters from the camera space to the bird's-eye view space coordinate system, obtaining the position distribution in bird's-eye view space of each pixel at image coordinate (u, v), and denoting the position conversion relation as T_b_from_f;
thirdly, projecting the corresponding image viewing frustums onto the bird's-eye view using the position conversion relation T_b_from_f;
and finally, filling the two-dimensional multi-scale image convolution features into the corresponding viewing frustums under the bird's-eye view using an interpolation function to obtain the image convolution feature fusion result under the bird's-eye view.
9. The method for detecting the three-dimensional target through the multi-modal feature fusion under the cross view angle as claimed in claim 6, wherein the constructing of the fusion network based on the cross view angle multi-modal data, the deep fusion of the obtained feature information under the cross view angle and the feature extraction, and the regression of the target category and the three-dimensional position information to obtain the complete three-dimensional target detection information comprises:
constructing a multi-modal data fusion network based on cross visual angles, connecting two-dimensional multi-scale convolution features under a front view visual angle and projected radar Gaussian features together, and simultaneously connecting a radar point cloud feature distribution map under a bird view visual angle and projected image convolution features together;
performing deep fusion feature extraction on the obtained feature fusion information, connecting all feature information together after unifying the scale, and realizing corresponding information weight distribution by using the convolutional layer;
and (3) performing regression of target type and target pose information by using the three-dimensional target detection head, calculating a corresponding loss function, and performing training and optimization of a network to obtain complete three-dimensional target detection information.
10. A system for detecting a three-dimensional target by multi-modal feature fusion under cross-view angles is characterized by comprising:
the cross visual angle feature extraction module is used for extracting features of the camera image data and the millimeter wave radar data under different visual angles, and performing cross visual angle conversion to obtain feature information under the cross visual angles;
and the three-dimensional target detection module is used for constructing a fusion network based on the cross view angle multi-modal data, further extracting the characteristics of the obtained characteristic information under the cross view angle, and simultaneously performing regression of the target type and the three-dimensional position information to obtain complete three-dimensional target detection information.
CN202310076916.4A 2023-01-17 2023-01-17 Three-dimensional target detection method and system based on multi-modal feature fusion under cross view angle Pending CN115965847A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310076916.4A CN115965847A (en) 2023-01-17 2023-01-17 Three-dimensional target detection method and system based on multi-modal feature fusion under cross view angle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310076916.4A CN115965847A (en) 2023-01-17 2023-01-17 Three-dimensional target detection method and system based on multi-modal feature fusion under cross view angle

Publications (1)

Publication Number Publication Date
CN115965847A true CN115965847A (en) 2023-04-14

Family

ID=87363778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310076916.4A Pending CN115965847A (en) 2023-01-17 2023-01-17 Three-dimensional target detection method and system based on multi-modal feature fusion under cross view angle

Country Status (1)

Country Link
CN (1) CN115965847A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116386016A (en) * 2023-05-22 2023-07-04 杭州睿影科技有限公司 Foreign matter treatment method and device, electronic equipment and storage medium
CN116386016B (en) * 2023-05-22 2023-10-10 杭州睿影科技有限公司 Foreign matter treatment method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination