CN117422971A - Bimodal target detection method and system based on cross-modal attention mechanism fusion - Google Patents

Bimodal target detection method and system based on cross-modal attention mechanism fusion

Info

Publication number
CN117422971A
CN117422971A (Application CN202311262346.4A)
Authority
CN
China
Prior art keywords
radar
features
feature
image
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311262346.4A
Other languages
Chinese (zh)
Inventor
任坤
李盼
任福荣
张天阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202311262346.4A priority Critical patent/CN117422971A/en
Publication of CN117422971A publication Critical patent/CN117422971A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S13/00Systems using the reflection or reradiation of radio waves, e.g. radar systems; Analogous systems using reflection or reradiation of waves whose nature or wavelength is irrelevant or unspecified
    • G01S13/86Combinations of radar systems with non-radar systems, e.g. sonar, direction finder
    • G01S13/867Combination of radar systems with cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Remote Sensing (AREA)
  • Software Systems (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a bimodal target detection method and system based on cross-modal attention mechanism fusion, comprising the following steps: acquiring an image to be detected and millimeter wave radar data; preprocessing the millimeter wave radar data; and inputting the image to be detected and the preprocessed millimeter wave radar data into a trained bimodal target detection model based on cross-modal attention mechanism fusion to obtain a detection result. The bimodal target detection model based on cross-modal attention mechanism fusion extracts radar features and image features with a radar feature extraction network based on a point cloud Transformer and sparse coding convolution and a CSPDarkNet53 image feature extraction network respectively, fuses the radar features and image features through a cross-modal attention composite feature fusion module at different scale levels of the PANet input and output ends, and finally inputs the fused features into the detection network of YOLOv5-X for detection and NMS processing to obtain detection results. The validity of the invention was verified on the nuScenes dataset.

Description

Bimodal target detection method and system based on cross-modal attention mechanism fusion
Technical Field
The invention relates to a bimodal target detection method and a bimodal target detection system based on cross-modal attention mechanism fusion, and relates to the technical field of deep learning and target detection.
Background
In recent years, thanks to the rapid development of deep learning, research on target detection has matured and the performance of target detection algorithms based on visual images has improved greatly. Under good conditions, such as sufficient light and high visibility, visual image detection algorithms achieve high detection precision. However, under adverse conditions such as low-visibility weather (rain, fog, snowstorm) or weak illumination at night, detection accuracy drops sharply, large numbers of false and missed detections occur, and the detector may even fail completely. In road traffic or outdoor operations, when human vision has difficulty recognizing targets in bad weather or poor lighting, environment-sensing equipment is needed to assist and to ensure traffic or production safety. Purely visual target detection algorithms are therefore unsuitable for detection tasks in such harsh environments.
In order to solve the above problems, introducing other types of sensors and fusing their information to achieve target detection under a wider range of conditions has become an inevitable trend. Compared with camera sensors and lidar, millimeter wave radar is less affected by extreme weather and has strong anti-interference capability, but it cannot provide detailed information such as target shape that is contained in visual images. Millimeter wave radar and images therefore have highly complementary detection capabilities, and fusing them to improve detection performance in harsh environments has become a more advantageous solution.
At present, 2D target detection algorithms based on the fusion of millimeter wave radar and visual images are still at the research stage. Conventional fusion of millimeter wave radar and images is generally divided into data-level fusion, feature-level fusion and decision-level fusion. Data-level fusion first generates ROIs from radar points, extracts the corresponding image regions according to the ROIs, and then performs detection, but its performance is limited by the number of valid radar points. Decision-level fusion combines the detection results of different sensors to produce the final output, but modeling a joint density function of the sensors is very difficult. Feature-level fusion has attracted the most attention because it can learn radar and visual features simultaneously, fully mine feature information, and achieve better detection performance.
However, feature fusion of millimeter wave radar data with visual images still presents challenges. Firstly, due to the heterogeneity of millimeter wave radar data and visual images, directly applying a generic convolution model cannot capture the effective features of the radar data well, nor fully mine the semantic relations contained in the data. Secondly, existing feature fusion methods often fuse the deep features of radar and image by concatenation, addition or spatial attention; such single fusion modes have difficulty accounting for both the differences and the information relevance between modalities at the same time. How to design an effective feature fusion mode according to the characteristics of different modalities therefore still needs to be studied.
Disclosure of Invention
The invention provides a bimodal target detection method and system based on cross-modal attention mechanism fusion, which are used to realize high-precision two-dimensional target detection in severe weather and low-illumination environments. On the basis of the YOLOv5-X detection network, the invention extracts multi-scale radar features through a radar feature extraction network based on a point cloud Transformer and sparse coding convolution, and performs multi-scale fusion of radar features and visual features through a composite feature fusion module based on cross-modal attention, finally realizing visibility-robust target detection. In the radar feature extraction network (GLRFENet) based on the point cloud Transformer and sparse coding convolution, the permutation invariance of the Transformer is used to learn the long-range dependencies of the millimeter wave radar point cloud data, and 3D sparse convolution is used to aggregate the local features of the point cloud, so as to obtain multi-scale radar features suitable for fusion with image features. A composite feature fusion module based on cross-modal attention is designed: first, a radar-feature-guided feature fusion network is used to learn image features and obtain pseudo radar features; then the information relevance between the image features and the pseudo radar features is learned through a cross-attention mechanism, so as to improve cross-modal feature fusion and further improve the robustness and generalization capability of the model.
Specifically, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a bimodal target detection method based on cross-modal attention mechanism fusion, comprising the steps of:
acquiring an image to be detected and millimeter wave radar data;
preprocessing millimeter wave radar data;
inputting the image to be detected and the preprocessed millimeter wave radar data into a trained bimodal target detection model based on cross-modal attention mechanism fusion, so as to obtain a detection result output by the bimodal target detection model based on cross-modal attention mechanism fusion;
the bimodal target detection model based on cross-modal attention mechanism fusion extracts millimeter wave radar features and image features with a radar feature extraction network based on a point cloud Transformer and sparse coding convolution and a CSPDarkNet53 image feature extraction network respectively, fuses the radar features and the image features through a cross-modal attention composite feature fusion module at different scale stages of the PANet input and output ends, and finally inputs the fused features into the detection network of YOLOv5-X for detection and NMS processing to obtain detection results;
further, in the bimodal target detection method based on cross-modal attention mechanism fusion, extracting millimeter wave radar features with the radar feature extraction network based on the point cloud Transformer and sparse coding convolution comprises the following steps:
for the preprocessed two-dimensional millimeter wave radar point cloud projected onto the image plane, depth information is introduced and the two-dimensional point cloud data are reconstructed into three-dimensional point cloud data P ∈ ℝ^{H×W×D}, where ℝ denotes the set of real numbers, H and W denote the height and width of the initial point cloud image, and D denotes the depth of the radar points; the point cloud is then uniformly downsampled to N points, and if fewer than N points exist the remainder are padded with 0;
the initialization feature of each point in the point cloud is set to a feature containing 6 dimensions: α_j, β_j, γ_j, rcs_j, v_j^1, v_j^2, where α_j, β_j, γ_j denote the spatial positions of the j-th radar point in the W, H and D dimensions respectively, and rcs_j, v_j^1, v_j^2 denote the radar cross-section, lateral velocity and radial velocity of the j-th radar point, thereby obtaining the initial point cloud features F_in ∈ ℝ^{N×6}; at the same time the spatial position coordinates of each point are recorded, then the long-range dependencies of the radar point cloud data are captured through a point cloud Transformer sub-network to enhance the global features of the data, giving the output point cloud features F_out ∈ ℝ^{N×C};
through a sparse coding layer, the features of each point in F_out are encoded according to the recorded spatial position coordinates, with the positions of non-existing points filled with 0, so as to construct a four-dimensional radar feature sparse tensor G ∈ ℝ^{C×D×H×W}, where C denotes the number of feature channels of the radar points;
further learning local features of the point cloud by using a 3D sparse convolution sub-network, and combining downsampling to obtain global-local multi-scale radar features;
in the bimodal target detection method based on cross-modal attention mechanism fusion, the operation of uniformly downsampling the millimeter wave radar point cloud is specifically as follows:
judging whether the number of points in the point cloud is greater than N; if it is not greater than N, no sampling is performed and all points are retained; if it is greater than N, the three-dimensional point cloud is partitioned using voxel grids;
dividing point clouds by using voxel grids, distributing points in the point clouds into corresponding voxel grids according to the positions of the points, wherein each voxel grid comprises a plurality of points;
sampling each voxel grid, taking the point in each voxel grid as a point set, firstly calculating the mass center of the voxel grid, and then selecting the point closest to the mass center of the voxel as a first round of sampling points by using a Kd-Tree neighbor search algorithm until the number of the sampling points is N;
if the number of the sampling points is less than N, randomly sampling the appointed number of points in the rest radar points to be used as supplement;
in the bimodal target detection method based on cross-modal attention mechanism fusion, the 3D sparse convolution sub-network further comprises 5 feature extraction stages Stage 1–Stage 5, whose downsampling steps are S_1, S_2, S_3, S_4 and S_5 respectively, the step of each stage being chosen according to the radar feature scale required at the output; the 5 feature extraction stages Stage 1–Stage 5 have the same fixed network structure, each consisting of 1 conventional sparse convolution layer and 2 identical sub-manifold sparse convolution layers, where the conventional sparse convolution layer comprises a 3×3 conventional sparse convolution, a BatchNorm1d normalization function and a ReLU activation function, and the sub-manifold sparse convolution layer comprises a 3×3 sub-manifold sparse convolution, a BatchNorm1d normalization function and a ReLU activation function; the stride of the conventional sparse convolution in each stage is consistent with the downsampling step of that stage, the stride of the sub-manifold sparse convolution is 1, and the convolution kernel size of 3×3 is a fixed value; after processing by the 3D sparse convolution sub-network, Stage 3–Stage 5 output radar features of different scales F_R^{L3} ∈ ℝ^{C_3×D_3×H_3×W_3}, F_R^{L4} ∈ ℝ^{C_4×D_4×H_4×W_4} and F_R^{L5} ∈ ℝ^{C_5×D_5×H_5×W_5} respectively, and the depth dimension is then merged into the channel dimension to obtain the final multi-scale radar features F_R'^{L3} ∈ ℝ^{C_3D_3×H_3×W_3}, F_R'^{L4} ∈ ℝ^{C_4D_4×H_4×W_4} and F_R'^{L5} ∈ ℝ^{C_5D_5×H_5×W_5} for fusion with the multi-scale image features extracted by the CSPDarkNet53 image feature extraction network, where Li denotes stage i, C_i, D_i, H_i and W_i denote the number of channels, depth, height and width of the radar feature map of the i-th stage respectively, and C_iD_i denotes the number of radar feature map channels after the depth dimension is merged into the channel dimension in the i-th stage, i = 3, 4, 5;
in the bimodal target detection method based on cross-modal attention mechanism fusion, the process of fusing the radar features and the image features through the cross-modal attention composite feature fusion module further comprises the following steps:
the radar feature F_1^{Li} and the image feature F_2^{Li} of the same stage are concatenated, where F_1^{Li} and F_2^{Li} denote the radar feature and the image feature of the i-th stage, which have the same feature scale at the corresponding stage; 1×1 convolution is then used to perform channel compression and inter-channel information interaction on the concatenated features with a compression ratio e, giving the intermediate feature F_M^{Li}; spatial attention is then applied to F_M^{Li} for spatial feature optimization to obtain the pseudo radar feature F_P^{Li}; e takes the value 2 so that the number of channels of the resulting pseudo radar feature is consistent with that of the image feature, which is suitable for subsequent fusion;
a multi-head cross-attention mechanism is used to compute the correlation between the image feature F_2^{Li} and the pseudo radar feature F_P^{Li}, and the pseudo radar feature and the image feature are fused on the basis of this correlation, the number of heads being a fixed value of 8;
the bimodal target detection method based on cross-modal attention mechanism fusion further uses spatial attention to perform spatial feature optimization, and the method comprises the following steps:
global max pooling and global average pooling are used to aggregate the channel information of the intermediate feature F_M^{Li} into scalars, giving F_max^{Li} and F_avg^{Li}, which are then concatenated along the channel dimension to obtain F_cat^{Li};
a 3×3 convolution and a 7×7 convolution with stride 1 capture spatial information of F_cat^{Li} over different ranges and reduce the number of channels to 1, where the stride, convolution kernel sizes and channel number are fixed values, after which the features are concatenated again;
global average pooling is applied along the channel dimension and the spatial attention weight is obtained through a Sigmoid function; the intermediate feature F_M^{Li} is multiplied by the spatial attention weight to obtain the output feature F_P^{Li};
In the bimodal target detection method based on cross-modal attention mechanism fusion, calculating the correlation between the image features and the pseudo radar features using the cross-modal attention mechanism, and fusing the features of different modalities based on this correlation, further comprises the following steps:
the pseudo radar feature F_P^{Li} and the image feature F_2^{Li} are flattened spatially to obtain the pseudo radar feature sequence X_P^{Li} ∈ ℝ^{H_iW_i×C_iD_i} and the image feature sequence X_I^{Li} ∈ ℝ^{H_iW_i×C_iD_i} respectively, where H_iW_i denotes the length of the feature sequences after the pseudo radar features and the image features of the i-th stage are spatially unrolled;
X_P^{Li} is then linearly transformed to obtain Q_2 as the Query vector, and X_I^{Li} is linearly transformed to obtain K_2 and V_2 as the Key and Value vectors respectively; for each radar feature point and the corresponding image region, Q_2 and K_2 are used to compute the correlation and weights between the pseudo radar features and the image features, determining the importance of each pseudo radar feature point to the image features, and the computed weights are then used to weight V_2, giving the attention feature sequence X_Attn;
X_Attn is concatenated with the pseudo radar feature sequence X_P^{Li} and folded to restore the original feature shape, obtaining the fusion feature F_F^{Li}.
In a second aspect, the present invention also provides a bimodal target detection system based on cross-modal attention mechanism fusion, the system comprising:
the data acquisition module is used for acquiring the image to be detected and millimeter wave radar data;
the data preprocessing module is used for preprocessing millimeter wave radar data;
the target detection module is used for inputting the image to be detected and the preprocessed millimeter wave radar data into a trained bimodal target detection model based on cross-modal attention mechanism fusion, so as to obtain the detection result output by the bimodal target detection model based on cross-modal attention mechanism fusion; the bimodal target detection model based on cross-modal attention mechanism fusion extracts millimeter wave radar features and image features with a radar feature extraction network based on a point cloud Transformer and sparse coding convolution and a CSPDarkNet53 image feature extraction network respectively, fuses the radar features and the image features through a cross-modal attention composite feature fusion module at different stages of the PANet input and output ends, and finally inputs the fused features into the detection network of YOLOv5-X for detection and NMS processing to obtain the detection result.
The invention mainly comprises the following steps:
(1) The invention provides a radar feature extraction network based on a point cloud Transformer and sparse coding convolution, which can learn the global features of the millimeter wave radar point cloud by using the permutation invariance of the Transformer, and further aggregate the local features of the point cloud with 3D sparse convolution to obtain multi-scale radar features suitable for fusion with image features;
(2) The invention provides a cross-modal attention composite feature fusion module, which first uses a radar-feature-guided feature fusion network to learn image features and obtain pseudo radar features, and then learns the information relevance between modalities through a cross-attention mechanism applied to the image features and the pseudo radar features, thereby improving the effectiveness of cross-modal feature fusion and further improving the robustness and generalization capability of the model;
(3) The invention combines the radar feature extraction network based on the point cloud Transformer and sparse coding convolution with the cross-modal attention composite feature fusion module to construct a bimodal target detection network based on cross-modal attention mechanism fusion, improving target detection accuracy in complex environments;
in summary, the method is suitable for target detection in bad weather and low illumination environments such as rainy days, nights and the like, and has wide application prospect.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Like parts are designated with like reference numerals throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a diagram of a radar feature extraction network based on a point cloud Transformer and sparse coding convolution in accordance with an embodiment of the present invention;
FIG. 3 is a block diagram of an offset attention module of an embodiment of the present invention;
FIG. 4 is a cross-modal attention composite feature fusion module according to an embodiment of the invention, wherein SA is a spatial attention module;
FIG. 5 is a block diagram of a spatial attention module according to an embodiment of the present invention;
FIG. 6 is a block diagram of a bimodal target detection network based on cross-modal attention mechanism fusion in accordance with an embodiment of the present invention;
FIG. 7 is a graph comparing detection results of a bimodal target detection method based on cross-modal attention mechanism fusion with a pure visual image target detection method YOLOv5-X according to an embodiment of the present invention;
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
It is to be understood that the terminology used herein is for the purpose of describing particular example embodiments only, and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "includes," "including," and "having" are inclusive and therefore specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order described or illustrated, unless an order of performance is explicitly stated. It should also be appreciated that additional or alternative steps may be used.
For ease of description, spatially relative terms, such as "inner," "outer," "lower," "upper," and the like, may be used herein to describe one element or feature's relationship to another element or feature as illustrated in the figures. Such spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures.
Example 1
In order to solve the problem of poor detection performance of a pure visual image target detection network in a bad weather and low light environment, the embodiment provides a bimodal target detection method based on cross-modal attention mechanism fusion, which comprises the following steps:
s1, acquiring an image to be detected and millimeter wave radar data;
s2, preprocessing millimeter wave radar data;
specifically, the pretreatment includes the following three steps:
firstly, radar signals with multiple periods are combined, and noise points are filtered through a three-dimensional boundary box, so that the quality of radar data is improved;
then, each point is extended into a 3-meter vertical line to compensate for the lack of height information in the radar data;
finally, the radar data are mapped to an image plane, so that the spatial alignment of the radar and the image is realized;
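The following is an illustrative Python sketch of the projection step just described (extending each radar point vertically and projecting it onto the image plane with a pinhole camera model). The function and variable names are assumptions for illustration, and the radar points are assumed to already be expressed in the camera coordinate frame; it is not the exact code of the embodiment.

```python
import numpy as np

def project_radar_to_image(points_cam, intrinsics, height_ext=3.0, n_samples=8):
    """points_cam: (N, 3) radar points in camera coordinates (x right, y down, z forward).
    intrinsics: (3, 3) camera matrix K. Returns an (M, 3) array of (u, v, depth)."""
    extended = []
    for p in points_cam:
        # Extend each point into a vertical line of height_ext metres to compensate
        # for the missing elevation information of the radar.
        for dy in np.linspace(0.0, height_ext, n_samples):
            extended.append([p[0], p[1] - dy, p[2]])
    extended = np.asarray(extended)
    extended = extended[extended[:, 2] > 0]            # keep points in front of the camera
    uv = (intrinsics @ extended.T).T                   # pinhole projection
    uv[:, :2] /= uv[:, 2:3]
    return np.concatenate([uv[:, :2], extended[:, 2:3]], axis=1)  # (u, v, depth)
```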
s3, inputting the image to be detected and the preprocessed millimeter wave radar data into a trained bimodal target detection model based on cross-modal attention mechanism fusion, and obtaining a detection result output by the bimodal target detection model based on cross-modal attention mechanism fusion;
the bimodal target detection model based on cross-modal attention mechanism fusion extracts millimeter wave radar features and image features with a radar feature extraction network based on a point cloud Transformer and sparse coding convolution and a CSPDarkNet53 image feature extraction network respectively, fuses the radar features and the image features through a cross-modal attention composite feature fusion module at different stages of the PANet input and output ends, and finally inputs the fused features into the detection network of YOLOv5-X for detection and NMS processing to obtain the detection result;
S31, specifically, the structure of the radar feature extraction network based on the point cloud Transformer and sparse coding convolution is shown in fig. 2, and the specific process of radar feature extraction comprises the following steps:
the network takes the preprocessed two-dimensional radar point cloud mapped onto the image plane as input; first, the depth information of each radar point (i.e. the radial distance between the target and the sensor) is used as the expanded third spatial coordinate in the direction perpendicular to the image, giving three-dimensional radar point cloud data P ∈ ℝ^{H×W×D}, where ℝ denotes the set of real numbers, H and W denote the height and width of the initial point cloud image, H = 384, W = 640, and D denotes the depth of the radar points, D = 64; the point cloud is then uniformly downsampled to 2048 points, 2048 being an empirical value, and if fewer than 2048 points exist the remainder are padded with 0;
the operation of uniformly downsampling the point cloud is specifically as follows: firstly, judging whether the number of points in the point cloud is larger or smaller than N, if so, not carrying out sampling operation, and reserving all points; if the number is greater than N, dividing the three-dimensional point cloud by using voxel grids, and distributing the points in the point cloud into corresponding voxel grids according to the positions of the points, wherein each voxel grid comprises a plurality of points; then sampling each voxel grid, taking the point in each voxel grid as a point set, firstly calculating the mass center of the voxel grid, then selecting the point closest to the mass center of the voxel as a first round of sampling points by using a Kd-Tree neighbor search algorithm until the number of the sampling points is N, and randomly sampling the appointed number of points in the rest radar points as supplementary sampling points if the number of the sampling points is less than N;
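A minimal sketch of this voxel-grid uniform downsampling procedure is given below, assuming a SciPy KD-tree for the nearest-to-centroid search; the voxel size and helper names are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np
from scipy.spatial import cKDTree

def uniform_downsample(points, n_target=2048, voxel_size=1.0):
    """points: (M, 3) radar point coordinates. Returns indices of up to n_target sampled points."""
    if len(points) <= n_target:
        return np.arange(len(points))                 # keep all points; pad with zeros later
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)   # assign points to voxel cells
    _, inverse = np.unique(voxel_idx, axis=0, return_inverse=True)
    tree = cKDTree(points)
    chosen = []
    for v in np.unique(inverse):
        in_voxel = np.where(inverse == v)[0]
        centroid = points[in_voxel].mean(axis=0)
        _, nearest = tree.query(centroid)             # point closest to the voxel centroid
        chosen.append(nearest)
        if len(chosen) == n_target:
            break
    chosen = np.unique(chosen)
    if len(chosen) < n_target:                        # supplement with random remaining points
        rest = np.setdiff1d(np.arange(len(points)), chosen)
        extra = np.random.choice(rest, n_target - len(chosen), replace=False)
        chosen = np.concatenate([chosen, extra])
    return chosen[:n_target]
```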
Further, the long-range dependencies of the radar point cloud data are captured through the point cloud Transformer sub-network to enhance the global features of the data; the point cloud Transformer sub-network comprises an input embedding layer, an attention layer and a linear output layer. The features of the radar points are first initialized, and the initial features of each radar point can be expressed as:
Z_j = [α_j, β_j, γ_j, rcs_j, v_j^1, v_j^2], j = 1, ..., N
where Z_j denotes the initial features of the j-th radar point, α_j, β_j, γ_j denote the spatial position coordinates of the j-th radar point corresponding to W, H and D respectively, rcs_j, v_j^1, v_j^2 denote the radar cross-section, lateral velocity and radial velocity of the j-th radar point, and N denotes the number of radar points; the input embedding layer comprises 2 linear layers that map the initial radar features F_in ∈ ℝ^{N×6}, N = 2048, into a new high-dimensional feature space, giving the embedded features F_emb;
The embedded features are then input into an attention layer consisting of 4 stacked offset attention modules, 4 being a fixed value, to learn the long-range dependencies between points. The offset attention module can be expressed as:
Q_1 = F_emb W_q^1, K_1 = F_emb W_k^1, V_1 = F_emb W_v^1
A(Q_1, K_1, V_1) = l_1norm(softmax(Q_1 K_1^T)) V_1
F_Attention = LBR(F_emb − A(Q_1, K_1, V_1)) + F_emb
where F_emb denotes the embedded features, Q_1, K_1 and V_1 denote the Query, Key and Value matrices respectively, W_q^1, W_k^1 and W_v^1 are learnable parameter matrices, A(·) denotes the computed attention feature, l_1norm denotes the l_1 norm, T denotes the matrix transpose, LBR denotes a linear layer followed by BatchNorm and ReLU, and F_Attention denotes the output feature of the attention module. The offset attention module normalizes the attention map with the l_1 norm, as shown in fig. 3, and then uses the offset between the attention feature and F_emb in place of the original attention feature, which enhances the attention weights and reduces noise disturbance. The outputs of the multi-level attention modules are then concatenated to enrich the feature expression and reduce information loss. Finally, a linear layer performs further fusion and outputs the point cloud features F_out ∈ ℝ^{N×C};
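A minimal single-head PyTorch sketch of one offset-attention block as described above is given below; the layer sizes (e.g. the C/4 Query/Key dimension) are illustrative assumptions rather than values stated in the patent.

```python
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.w_q = nn.Linear(channels, channels // 4, bias=False)
        self.w_k = nn.Linear(channels, channels // 4, bias=False)
        self.w_v = nn.Linear(channels, channels, bias=False)
        # LBR: Linear + BatchNorm + ReLU applied to the offset between input and attention feature.
        self.lbr = nn.Sequential(nn.Linear(channels, channels),
                                 nn.BatchNorm1d(channels), nn.ReLU(inplace=True))

    def forward(self, f_emb):                          # f_emb: (N, C) embedded point features
        q, k, v = self.w_q(f_emb), self.w_k(f_emb), self.w_v(f_emb)
        attn = torch.softmax(q @ k.t(), dim=-1)        # (N, N) attention map
        attn = attn / (attn.sum(dim=0, keepdim=True) + 1e-9)   # l1 normalisation
        a = attn @ v                                   # attention feature A(Q, K, V)
        return self.lbr(f_emb - a) + f_emb             # offset attention output
```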
Through the sparse coding layer, feature encoding is performed according to the spatial distribution of the three-dimensional point cloud P using the global point cloud features, so as to construct the four-dimensional radar feature tensor G ∈ ℝ^{C×D×H×W}, where C = 160, D = 64, H = 384, W = 640;
g is input into a multi-scale 3D sparse convolution sub-network to perform local efficient aggregation, local characteristics of Lei Dadian cloud are learned, and a large amount of unnecessary computing resources are saved; the 3D sparse convolution sub-network is similar to the image feature extraction network and comprises 5 feature extraction stages Stage 1-Stage 5, and main parameters are summarized in a table 1;
TABLE 1 3D sparse convolution layer structure and parameters
According to the table, the downsampling steps of the radar features through the stages are S_1, S_2, S_3, S_4 and S_5 respectively, the step of each stage being chosen according to the radar feature scale required at the output, with the specific values given in Table 1; the 5 feature extraction stages Stage 1–Stage 5 have the same fixed network structure, each consisting of 1 conventional sparse convolution layer and 2 identical sub-manifold sparse convolution layers, where the conventional sparse convolution layer comprises a 3×3 conventional sparse convolution, a BatchNorm1d normalization function and a ReLU activation function, and the sub-manifold sparse convolution layer comprises a 3×3 sub-manifold sparse convolution, a BatchNorm1d normalization function and a ReLU activation function; the stride of the conventional sparse convolution in each stage is consistent with the downsampling step of that stage, the stride of the sub-manifold sparse convolution is 1, and the convolution kernel size of 3×3 is a fixed value; after processing by the 3D sparse convolution sub-network, Stage 3–Stage 5 output radar features of different scales F_R^{L3} ∈ ℝ^{C_3×D_3×H_3×W_3}, F_R^{L4} ∈ ℝ^{C_4×D_4×H_4×W_4} and F_R^{L5} ∈ ℝ^{C_5×D_5×H_5×W_5} respectively; the depth dimension is then merged into the channel dimension to obtain the final multi-scale radar features F_R'^{L3} ∈ ℝ^{C_3D_3×H_3×W_3}, F_R'^{L4} ∈ ℝ^{C_4D_4×H_4×W_4} and F_R'^{L5} ∈ ℝ^{C_5D_5×H_5×W_5} for fusion with the multi-scale image features extracted by the CSPDarkNet53 image feature extraction network, where Li denotes stage i, C_i, D_i, H_i and W_i denote the number of channels, depth, height and width of the radar feature map of the i-th stage respectively, and C_iD_i denotes the number of radar feature map channels after the depth dimension is merged into the channel dimension in the i-th stage, i = 3, 4, 5; specifically C_3, C_4, C_5 = 40, 80, 160; D_3, D_4, D_5 = 8, 8, 8; H_3, H_4, H_5 = 48, 24, 12; W_3, W_4, W_5 = 80, 40, 20; and C_3D_3, C_4D_4, C_5D_5 = 320, 640, 1280;
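Below is a sketch of one feature-extraction stage of such a 3D sparse convolution sub-network, assuming the third-party spconv library (spconv.pytorch); the channel counts and strides passed in are illustrative, with the actual per-stage values following Table 1 of the embodiment.

```python
import torch.nn as nn
import spconv.pytorch as spconv

def make_stage(c_in, c_out, stride):
    """One stage = 1 conventional (downsampling) sparse conv layer + 2 sub-manifold sparse conv layers."""
    return spconv.SparseSequential(
        spconv.SparseConv3d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm1d(c_out), nn.ReLU(inplace=True),
        spconv.SubMConv3d(c_out, c_out, kernel_size=3, padding=1),
        nn.BatchNorm1d(c_out), nn.ReLU(inplace=True),
        spconv.SubMConv3d(c_out, c_out, kernel_size=3, padding=1),
        nn.BatchNorm1d(c_out), nn.ReLU(inplace=True),
    )

# Stages 3-5 output radar features at decreasing resolutions; merging the depth dimension
# into the channel dimension then yields, e.g., a (C3*D3, H3, W3) = (320, 48, 80) map
# for fusion with the image features of the same stage.
```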
S32, specifically, a cross-modal attention composite feature fusion module is embedded at the input and output ends of the PANet to fuse the radar features and image features of different stages, with the structure shown in fig. 4; the process of fusing the radar features and the image features through the cross-modal attention composite feature fusion module comprises:
the radar feature F_1^{Li} and the image feature F_2^{Li} of the same stage are concatenated, where F_1^{Li} and F_2^{Li} denote the radar feature and the image feature of the i-th stage, which have the same feature scale at the corresponding stage; 1×1 convolution is then used to perform channel compression and inter-channel information interaction on the concatenated features with compression ratio e = 2, giving the intermediate feature F_M^{Li}; spatial attention is then applied to F_M^{Li} for spatial feature optimization to obtain the pseudo radar feature F_P^{Li}; e takes the value 2 so that the number of channels of the resulting pseudo radar feature is consistent with that of the image feature, which is suitable for subsequent fusion;
The spatial attention is shown in fig. 5. First, global max pooling and global average pooling are used to aggregate the channel information of the intermediate feature F_M^{Li} into scalars, giving F_max^{Li} and F_avg^{Li}, which are concatenated along the channel dimension to obtain F_cat^{Li};
then a 3×3 convolution and a 7×7 convolution with stride 1 capture spatial information of F_cat^{Li} over different ranges and reduce the number of channels to 1, where the stride, convolution kernel sizes and channel number are fixed values, after which the features are concatenated again;
finally, global average pooling is applied along the channel dimension and the spatial attention weight is obtained through a Sigmoid function; the intermediate feature F_M^{Li} is multiplied by the spatial attention weight to obtain the output feature F_P^{Li}. The spatial attention can be expressed as:
F_3 = Conv_{3×3}(Concat(GMP(F_M^{Li}), GAP(F_M^{Li})))
F_7 = Conv_{7×7}(Concat(GMP(F_M^{Li}), GAP(F_M^{Li})))
SA(F_M^{Li}) = Sigmoid(GAP(Concat(F_3, F_7)))
F_P^{Li} = F_M^{Li} ⊗ SA(F_M^{Li})
where SA(·) denotes the computed spatial attention, F_M^{Li} denotes the intermediate feature, * denotes the convolution kernel size, F_* denotes the feature obtained by convolving the concatenated pooling features, Conv_{*×*} denotes a convolution layer with kernel size *×*, GMP and GAP denote global max pooling and global average pooling respectively, and Concat denotes the concatenation operation;
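A PyTorch sketch of this spatial attention (SA) module follows; tensor names mirror the text above and everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv3 = nn.Conv2d(2, 1, kernel_size=3, stride=1, padding=1)
        self.conv7 = nn.Conv2d(2, 1, kernel_size=7, stride=1, padding=3)

    def forward(self, f_m):                              # f_m: (B, C, H, W) intermediate feature
        f_max = f_m.max(dim=1, keepdim=True).values      # global max pooling over channels
        f_avg = f_m.mean(dim=1, keepdim=True)            # global average pooling over channels
        f_cat = torch.cat([f_max, f_avg], dim=1)         # (B, 2, H, W)
        f_multi = torch.cat([self.conv3(f_cat), self.conv7(f_cat)], dim=1)  # two receptive fields
        weight = torch.sigmoid(f_multi.mean(dim=1, keepdim=True))           # channel GAP + Sigmoid
        return f_m * weight                              # spatially optimised pseudo-radar feature
```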
Further, a multi-head cross-attention mechanism is used to compute the correlation between the image feature F_2^{Li} and the pseudo radar feature F_P^{Li}, and the pseudo radar feature and the image feature are fused on the basis of this correlation, the number of heads being a fixed value of 8;
specifically, the pseudo radar feature F_P^{Li} and the image feature F_2^{Li} are flattened spatially to obtain the corresponding feature sequences X_P^{Li} ∈ ℝ^{H_iW_i×C_iD_i} and X_I^{Li} ∈ ℝ^{H_iW_i×C_iD_i}, where H_iW_i denotes the sequence length after spatial unrolling, H_3W_3 = 3840, H_4W_4 = 960, H_5W_5 = 240;
then X_P^{Li} is linearly transformed to obtain Q_2 as the Query vector, and X_I^{Li} is linearly transformed to obtain K_2 and V_2 as the Key and Value vectors respectively; for each radar feature point and the corresponding image region, Q_2 and K_2 are used to compute the correlation and weights between the pseudo radar features and the image features, determining the importance of each pseudo radar feature point to the image features, and the computed weights are then used to weight V_2, giving the attention feature sequence X_Attn;
Finally, X_Attn is concatenated with the pseudo radar feature sequence X_P^{Li} and folded to restore the original feature shape, obtaining the fusion feature F_F^{Li}. The proposed cross-modal attention composite feature fusion module can thus be expressed as:
X_P^{Li} = Flat(F_P^{Li}), X_I^{Li} = Flat(F_2^{Li})
Q_2 = (X_P^{Li} + PE) W_q^2, K_2 = (X_I^{Li} + PE) W_k^2, V_2 = X_I^{Li} W_v^2
X_Attn = softmax(Q_2 K_2^T / √(dim_i / nh)) V_2 (computed per head and concatenated over the nh heads)
F_F^{Li} = Reshape(Concat(X_Attn, X_P^{Li}))
where Flat denotes the flattening operation on a feature, Q_2, K_2 and V_2 denote the Query, Key and Value matrices respectively, W_q^2, W_k^2 and W_v^2 are learnable parameter matrices, dim_i denotes the dimension of Query/Value at the different stages, taking the values 320, 640 and 1280 according to the stage, nh denotes the number of heads in the multi-head attention, nh = 8 being a fixed value, PE denotes a learnable relative position code, and Reshape denotes folding to restore the original feature shape;
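Below is a sketch of this cross-modal attention fusion step using PyTorch's built-in multi-head attention; the placement of the learnable positional encoding and the 1×1 output projection back to the stage channel count are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, dim, seq_len, num_heads=8):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))   # learnable position encoding PE
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Conv2d(2 * dim, dim, kernel_size=1)       # project concatenated features (assumed)

    def forward(self, f_pseudo, f_img):                          # both: (B, C, H, W), same stage
        b, c, h, w = f_img.shape
        x_p = f_pseudo.flatten(2).transpose(1, 2)                # pseudo-radar sequence (B, HW, C)
        x_i = f_img.flatten(2).transpose(1, 2)                   # image sequence (B, HW, C)
        # Query from the pseudo-radar features, Key/Value from the image features.
        x_attn, _ = self.attn(query=x_p + self.pos, key=x_i + self.pos, value=x_i)
        fused = torch.cat([x_attn, x_p], dim=-1)                 # concatenate along channels
        fused = fused.transpose(1, 2).reshape(b, 2 * c, h, w)    # fold back to feature-map shape
        return self.proj(fused)                                  # fusion feature fed to the PANet
```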
S33, the fusion features output by each stage of the PANet are further input into the corresponding stage of the YOLOv5-X detection network for processing, giving the detection boxes and category parameters of each target in the image; redundant detection boxes are then filtered out by NMS (non-maximum suppression) post-processing to obtain the final detection result;
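A sketch of this NMS post-processing step is given below, assuming torchvision's standard NMS operator; the score and IoU thresholds are illustrative values, not thresholds fixed by the patent.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, score_thr=0.25, iou_thr=0.45):
    """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,). Returns the kept detections."""
    keep = scores > score_thr                 # drop low-confidence detections first
    boxes, scores = boxes[keep], scores[keep]
    kept_idx = nms(boxes, scores, iou_thr)    # suppress overlapping redundant boxes
    return boxes[kept_idx], scores[kept_idx]
```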
the experimental process and experimental result of the bimodal target detection method based on the cross-modal attention mechanism fusion are described in detail by specific embodiments;
1. experimental platform configuration
This embodiment is implemented under the PyTorch framework; training and testing are performed on a platform with an Intel i7-12700 CPU, an NVIDIA GeForce RTX 4090 GPU and the Ubuntu 20.04.3 LTS operating system;
2. training bimodal target detection network model based on cross-modal attention mechanism fusion
2.1, data set and evaluation index
In the experiments, the nuScenes dataset is adopted as the training dataset; nuScenes contains millimeter wave radar and image data captured from 1000 different real-world scenes, and the 850 annotated scenes in the dataset are used as experimental data; the dataset is preprocessed and divided into training, validation and test sets in a ratio of 6:2:2 (an empirical ratio), giving 20480 pairs of radar and image data as the training set and 6830 pairs as the test set; to suit network training, the resolution of the input images and radar data is adjusted to 384×640;
in this embodiment, mAP (IoU =0.5) is used as an evaluation index of a bimodal target detection method based on cross-modal attention mechanism fusion, and in general, the higher the value of this index, the stronger the detection performance of the representative algorithm;
2.2 training parameter settings
The network is implemented in the PyTorch framework and trained on a GeForce RTX 4090 GPU with an SGD optimizer; the batch size is set to 4 (limited by the GPU memory), the number of epochs is set to 100 and the initial learning rate to 0.01 (both empirical values, adaptively adjusted during training); the network is trained starting from a YOLOv5-X pre-trained model;
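The training configuration stated above can be sketched as follows; the model constructor, dataloader and scheduler choice are assumptions for illustration (the patent only specifies SGD, batch size 4, 100 epochs and an initial learning rate of 0.01).

```python
import torch

model = build_bimodal_detector()              # hypothetical constructor for the fusion network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # assumed decay schedule

for epoch in range(100):
    for images, radar, targets in train_loader:   # hypothetical dataloader yielding both modalities
        loss = model(images, radar, targets)      # assumed to return the training loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```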
2.3 analysis of results
Testing the bimodal target detection model based on the cross-modal attention mechanism fusion on a nuScenes data set, and comparing the performance of the bimodal target detection model with that of a pure visual image target detection model by using mAP evaluation indexes;
the results of the performance comparison are shown in Table 2, and under mAP evaluation index, the best results are shown in bold;
on the aspect of algorithm performance comparison based on mAP evaluation indexes, the bimodal target detection method based on cross-modal attention mechanism fusion is superior to a pure visual image target detection method;
table 2 comparison of experimental results
FIG. 7 shows the visualized detection results of the pure visual image target detection algorithm YOLOv5-X and of the bimodal target detection algorithm of the present invention; the detection accuracy and recall of the present invention are higher than those of the pure visual image algorithm. As shown in fig. 7, YOLOv5-X cannot completely and accurately identify people and vehicles under dim light, whereas the present invention can accurately identify targets under dim light;
therefore, the experimental result shows that the bimodal target detection method of the invention realizes the strong robust target detection in complex environment.
Example two
The first embodiment provides a bimodal target detection method based on cross-modal attention mechanism fusion, and correspondingly, the present embodiment provides a bimodal target detection system based on cross-modal attention mechanism fusion; the bimodal target detection system based on the cross-modal attention mechanism fusion provided in the embodiment can implement the bimodal target detection method based on the cross-modal attention mechanism fusion in the first embodiment, and the system can be realized by software, hardware or a combination of software and hardware; for example, the system may include integrated or separate functional modules or functional units to perform the corresponding steps in the methods of embodiment one; because the bimodal target detection system based on the cross-modal attention mechanism fusion of the embodiment is basically similar to the method embodiment, the description process of the embodiment is simpler, and the relevant points can be seen from the part of the description of the first embodiment, and the bimodal target detection system based on the cross-modal attention mechanism fusion of the embodiment is only schematic;
the embodiment provides a bimodal target detection system based on cross-modal attention mechanism fusion, which comprises:
the data acquisition module is used for acquiring the image to be detected and millimeter wave radar data;
the data preprocessing module is used for preprocessing millimeter wave radar data;
the target detection module is used for inputting the image to be detected and the preprocessed millimeter wave radar data into a trained bimodal target detection model based on the cross-modal attention mechanism fusion, so as to obtain a detection result output by the bimodal target detection model based on the cross-modal attention mechanism fusion; the method comprises the steps that a bimodal target detection model based on cross-modal attention mechanism fusion respectively utilizes a radar feature extraction network based on point cloud Transformer and sparse coding convolution and a CSPDarkNet53 image feature extraction network to extract millimeter wave radar features and image features, the radar features and the image features are fused through a cross-modal attention composite feature fusion module at different stages of a PANet input end and a PANet output end, and finally the fused features are input into a detection network of YOLOv5-X for detection and NMS processing is carried out to obtain detection results;
finally, it should be noted that the above embodiments are merely illustrative of the technical solution of the present invention, and not limiting thereof; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may be modified or some technical features may be replaced with other technical solutions, which may not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A bimodal target detection method based on cross-modal attention mechanism fusion is characterized by comprising the following steps:
acquiring an image to be detected and millimeter wave radar data;
preprocessing millimeter wave radar data;
inputting the image to be detected and the preprocessed millimeter wave radar data into a trained bimodal target detection model based on cross-modal attention mechanism fusion, so as to obtain a detection result output by the bimodal target detection model based on cross-modal attention mechanism fusion;
the bimodal target detection model based on cross-modal attention mechanism fusion extracts millimeter wave radar features and image features with a radar feature extraction network based on a point cloud Transformer and sparse coding convolution and a CSPDarkNet53 image feature extraction network respectively, fuses the radar features and the image features through a cross-modal attention composite feature fusion module at different stages of the PANet input and output ends, and finally inputs the fused features into the detection network of YOLOv5-X for detection and NMS processing to obtain the detection result.
2. The bimodal target detection method based on cross-modal attention mechanism fusion according to claim 1, wherein the method for extracting millimeter wave radar features with the radar feature extraction network based on the point cloud Transformer and sparse coding convolution comprises the following steps:
for the preprocessed two-dimensional millimeter wave radar point cloud projected onto the image plane, depth information is introduced and the two-dimensional point cloud data are reconstructed into three-dimensional point cloud data P ∈ ℝ^{H×W×D}, where ℝ denotes the set of real numbers, H and W denote the height and width of the initial point cloud image, and D denotes the depth of the radar points; the point cloud is uniformly downsampled to N points, and if fewer than N points exist the remainder are padded with 0;
the initialization feature of each point in the point cloud is set to a feature containing 6 dimensions: α_j, β_j, γ_j, rcs_j, v_j^1, v_j^2, where α_j, β_j, γ_j denote the spatial positions of the j-th radar point in the W, H and D dimensions respectively, and rcs_j, v_j^1, v_j^2 denote the radar cross-section, lateral velocity and radial velocity of the j-th radar point, thereby obtaining the initial point cloud features F_in ∈ ℝ^{N×6}; at the same time the spatial position coordinates of each point are recorded, then the long-range dependencies of the radar point cloud data are captured through a point cloud Transformer sub-network to enhance the global features of the data, giving the output point cloud features F_out ∈ ℝ^{N×C};
through a sparse coding layer, the features of each point in F_out are encoded according to the recorded spatial position coordinates, with the positions of non-existing points filled with 0, so as to construct a four-dimensional radar feature sparse tensor G ∈ ℝ^{C×D×H×W}, where C denotes the number of feature channels of the radar points;
and further learning local features of the point cloud by using the 3D sparse convolution sub-network, and combining downsampling to obtain global-local multi-scale radar features.
3. The bimodal target detection method based on cross-modal attention mechanism fusion according to claim 2, wherein the operation of uniformly downsampling the millimeter wave radar point cloud is specifically:
judging whether the number of points in the point cloud is greater than N; if it is not greater than N, no sampling is performed and all points are retained; if it is greater than N, the three-dimensional point cloud is partitioned using voxel grids;
dividing point clouds by using voxel grids, distributing points in the point clouds into corresponding voxel grids according to the positions of the points, wherein each voxel grid comprises a plurality of points;
sampling each voxel grid, taking the point in each voxel grid as a point set, firstly calculating the mass center of the voxel grid, and then selecting the point closest to the mass center of the voxel as a first round of sampling points by using a Kd-Tree neighbor search algorithm until the number of the sampling points is N;
if the number of sampling points is less than N, randomly sampling a specified number of points in the rest radar points to serve as supplementary sampling points.
4. The bimodal target detection method based on cross-modal attention mechanism fusion according to claim 2, wherein the 3D sparse convolution sub-network comprises 5 feature extraction stages Stage 1–Stage 5, whose downsampling steps are S_1, S_2, S_3, S_4 and S_5 respectively, the step of each stage being chosen according to the radar feature scale required at the output; the 5 feature extraction stages Stage 1–Stage 5 have the same fixed network structure, each consisting of 1 conventional sparse convolution layer and 2 identical sub-manifold sparse convolution layers, where the conventional sparse convolution layer comprises a 3×3 conventional sparse convolution, a BatchNorm1d normalization function and a ReLU activation function, and the sub-manifold sparse convolution layer comprises a 3×3 sub-manifold sparse convolution, a BatchNorm1d normalization function and a ReLU activation function; the stride of the conventional sparse convolution in each stage is consistent with the downsampling step of that stage, the stride of the sub-manifold sparse convolution is 1, and the convolution kernel size of 3×3 is a fixed value; after processing by the 3D sparse convolution sub-network, Stage 3–Stage 5 output radar features of different scales F_R^{L3} ∈ ℝ^{C_3×D_3×H_3×W_3}, F_R^{L4} ∈ ℝ^{C_4×D_4×H_4×W_4} and F_R^{L5} ∈ ℝ^{C_5×D_5×H_5×W_5} respectively, and the depth dimension is then merged into the channel dimension to obtain the final multi-scale radar features F_R'^{L3} ∈ ℝ^{C_3D_3×H_3×W_3}, F_R'^{L4} ∈ ℝ^{C_4D_4×H_4×W_4} and F_R'^{L5} ∈ ℝ^{C_5D_5×H_5×W_5} for fusion with the multi-scale image features extracted by the CSPDarkNet53 image feature extraction network, where Li denotes stage i, C_i, D_i, H_i and W_i denote the number of channels, depth, height and width of the radar feature map of the i-th stage respectively, and C_iD_i denotes the number of radar feature map channels after the depth dimension is merged into the channel dimension in the i-th stage, i = 3, 4, 5.
5. The bimodal target detection method based on cross-modal attention mechanism fusion according to claim 1, wherein the process of fusing radar features and image features through a cross-modal attention composite feature fusion module comprises:
splicing the radar feature and the image feature of the same stage, i.e. the radar feature and the image feature of the i-th stage, which share the same feature scale at that stage; then using a 1×1 convolution to perform channel compression and inter-channel information interaction on the spliced feature with a compression ratio e, obtaining an intermediate feature; then applying spatial attention to the intermediate feature for spatial feature optimization, obtaining the pseudo-radar feature; wherein e takes the value 2 so that the channel number of the obtained pseudo-radar feature is consistent with that of the image feature, making it suitable for the subsequent fusion;
computing the correlation between the image features and the pseudo-radar features using a multi-head cross-attention mechanism, and fusing the pseudo-radar features and the image features based on this correlation, wherein the number of attention heads is fixed at 8 (an illustrative sketch of the splice-and-compress step is given below; the spatial attention and cross-attention steps are sketched after claims 6 and 7).
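A minimal sketch of the splice-and-compress step is given below; the channel counts are assumptions, and equal radar and image channel widths are assumed so that the e=2 compression leaves the pseudo-radar branch with the same channel number as the image branch. The spatial attention applied to the compressed feature is sketched after claim 6, and the multi-head cross-attention after claim 7.

```python
import torch
import torch.nn as nn

class ChannelCompress(nn.Module):
    """Splice same-stage radar and image features, then compress channels with a
    1x1 convolution (compression ratio e=2), which also mixes information across channels."""
    def __init__(self, radar_ch, image_ch, e=2):
        super().__init__()
        self.proj = nn.Conv2d(radar_ch + image_ch, (radar_ch + image_ch) // e, kernel_size=1)

    def forward(self, radar_feat, image_feat):          # both (B, C, H, W) at the same scale
        return self.proj(torch.cat([radar_feat, image_feat], dim=1))

m = ChannelCompress(256, 256)
print(m(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32)).shape)  # (1, 256, 32, 32)
```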
6. The bimodal target detection method based on cross-modal attention mechanism fusion of claim 5, wherein spatial feature optimization using spatial attention comprises:
using global max pooling and global average pooling along the channel dimension to aggregate the channel information of the intermediate feature at each spatial position into a scalar, obtaining a max-pooled spatial map and an average-pooled spatial map, and then concatenating the two maps along the channel dimension;
capturing spatial information over different ranges with a 3×3 convolution and a 7×7 convolution of stride 1, respectively, each reducing the number of channels to 1, wherein the stride, the convolution kernel sizes and the channel number are fixed values, and then concatenating the two resulting features;
performing global average pooling along the channel dimension, obtaining the spatial attention weight through a Sigmoid function, and multiplying the intermediate feature by the spatial attention weight to obtain the output feature (a minimal sketch of this spatial attention follows this claim).
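A minimal PyTorch sketch of this spatial attention is given below; the padding values (chosen to preserve the spatial size) and the class name are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention over an intermediate feature map (illustrative sketch of claim 6)."""
    def __init__(self):
        super().__init__()
        self.conv3 = nn.Conv2d(2, 1, kernel_size=3, stride=1, padding=1)   # short-range cues
        self.conv7 = nn.Conv2d(2, 1, kernel_size=7, stride=1, padding=3)   # long-range cues

    def forward(self, x):                                     # x: (B, C, H, W)
        max_map = x.max(dim=1, keepdim=True).values            # channel-wise max pooling  -> (B,1,H,W)
        avg_map = x.mean(dim=1, keepdim=True)                   # channel-wise average pooling
        pooled = torch.cat([max_map, avg_map], dim=1)            # (B, 2, H, W)
        mixed = torch.cat([self.conv3(pooled), self.conv7(pooled)], dim=1)   # (B, 2, H, W)
        weight = torch.sigmoid(mixed.mean(dim=1, keepdim=True))              # channel avg + Sigmoid
        return x * weight                                        # spatially re-weighted feature

print(SpatialAttention()(torch.randn(1, 256, 32, 32)).shape)    # torch.Size([1, 256, 32, 32])
```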
7. The bimodal target detection method based on cross-modal attention mechanism fusion according to claim 5, wherein calculating a correlation between image features and pseudo-radar features using a cross-modal attention mechanism, and fusing features of different modalities based on the correlation, comprises:
flattening the pseudo-radar feature and the image feature spatially to obtain a pseudo-radar feature sequence and an image feature sequence, wherein H_iW_i denotes the length of the pseudo-radar and image feature sequences of the i-th stage after spatial flattening;
then obtaining the Query vector Q_2 by a linear transformation of the image feature sequence, and obtaining the Key vector K_2 and the Value vector V_2 by two separate linear transformations of the pseudo-radar feature sequence; for each radar feature point and its corresponding image region, computing the correlation and the attention weights between the pseudo-radar features and the image features with Q_2 and K_2, thereby determining the importance of each pseudo-radar feature point to the image features, and then weighting V_2 with the computed weights to obtain the attention feature sequence X_Attn;
splicing X_Attn with the pseudo-radar feature sequence, and folding the result back to the original feature shape to obtain the fused feature (a minimal sketch of this cross-attention fusion follows this claim).
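A minimal PyTorch sketch of the cross-attention fusion is given below, using nn.MultiheadAttention with 8 heads. The assignment of the Query to the image sequence and of the Key/Value to the pseudo-radar sequence is inferred from the wording of this claim, and the channel count is an assumption; any projection applied after the splice-and-fold step is not shown.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Cross-attention fusion of same-scale image and pseudo-radar feature maps
    (illustrative sketch; Q from the image branch, K/V from the pseudo-radar branch)."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads=heads, batch_first=True)

    def forward(self, image_feat, pseudo_radar_feat):           # both (B, C, H, W)
        b, c, h, w = image_feat.shape
        img_seq = image_feat.flatten(2).transpose(1, 2)          # (B, HW, C) image sequence
        rad_seq = pseudo_radar_feat.flatten(2).transpose(1, 2)   # (B, HW, C) pseudo-radar sequence
        x_attn, _ = self.attn(query=img_seq, key=rad_seq, value=rad_seq)
        fused = torch.cat([x_attn, rad_seq], dim=-1)             # splice with the radar sequence
        return fused.transpose(1, 2).reshape(b, 2 * c, h, w)     # fold back to (B, 2C, H, W)

m = CrossModalAttentionFusion(256)
out = m(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
print(out.shape)   # torch.Size([1, 512, 32, 32])
```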
8. A bimodal target detection system based on cross-modal attention mechanism fusion, the system comprising:
the data acquisition module is used for acquiring the image to be detected and millimeter wave radar data;
the data preprocessing module is used for preprocessing millimeter wave radar data;
the target detection module is used for inputting the image to be detected and the preprocessed millimeter wave radar data into a trained bimodal target detection model based on cross-modal attention mechanism fusion, so as to obtain the detection result output by the bimodal target detection model based on cross-modal attention mechanism fusion; the bimodal target detection model based on cross-modal attention mechanism fusion extracts millimeter wave radar features and image features with a radar feature extraction network based on point cloud transformation and sparse coding convolution and with the CSPDarkNet53 image feature extraction network respectively, fuses the radar features and the image features through the cross-modal attention composite feature fusion module at different stages of the PANet input end and the PANet output end, and finally inputs the fused features into the detection network of YOLOv5-X for detection, with NMS processing performed to obtain the detection result (an illustrative NMS post-processing sketch follows this claim).
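The final NMS step of the detection pipeline can be illustrated with torchvision's NMS operator; the confidence and IoU thresholds below are assumptions, and a per-class NMS would normally be applied in practice.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, score_thr=0.25, iou_thr=0.45):
    """NMS post-processing of detector outputs (illustrative; thresholds assumed).

    boxes  : (N, 4) xyxy candidate boxes from the detection network
    scores : (N,)   confidence scores
    """
    keep = scores > score_thr                      # drop low-confidence candidates first
    boxes, scores = boxes[keep], scores[keep]
    idx = nms(boxes, scores, iou_thr)              # suppress overlapping boxes
    return boxes[idx], scores[idx]

b = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [50., 50., 60., 60.]])
s = torch.tensor([0.9, 0.8, 0.7])
print(postprocess(b, s))
```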
CN202311262346.4A 2023-09-27 2023-09-27 Bimodal target detection method and system based on cross-modal attention mechanism fusion Pending CN117422971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311262346.4A CN117422971A (en) 2023-09-27 2023-09-27 Bimodal target detection method and system based on cross-modal attention mechanism fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311262346.4A CN117422971A (en) 2023-09-27 2023-09-27 Bimodal target detection method and system based on cross-modal attention mechanism fusion

Publications (1)

Publication Number Publication Date
CN117422971A true CN117422971A (en) 2024-01-19

Family

ID=89521891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311262346.4A Pending CN117422971A (en) 2023-09-27 2023-09-27 Bimodal target detection method and system based on cross-modal attention mechanism fusion

Country Status (1)

Country Link
CN (1) CN117422971A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118135364A (en) * 2024-05-08 2024-06-04 北京数慧时空信息技术有限公司 Fusion method and system of multi-source remote sensing images based on deep learning

Similar Documents

Publication Publication Date Title
CN110675418B (en) Target track optimization method based on DS evidence theory
Yin et al. Hot region selection based on selective search and modified fuzzy C-means in remote sensing images
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN111797716A (en) Single target tracking method based on Siamese network
Venugopal Automatic semantic segmentation with DeepLab dilated learning network for change detection in remote sensing images
Tan et al. 3-D object detection for multiframe 4-D automotive millimeter-wave radar point cloud
CN112149591B (en) SSD-AEFF automatic bridge detection method and system for SAR image
CN105160649A (en) Multi-target tracking method and system based on kernel function unsupervised clustering
Xia et al. PANDA: Parallel asymmetric network with double attention for cloud and its shadow detection
CN112115871B (en) High-low frequency interweaving edge characteristic enhancement method suitable for pedestrian target detection
CN111860651B (en) Monocular vision-based semi-dense map construction method for mobile robot
CN117422971A (en) Bimodal target detection method and system based on cross-modal attention mechanism fusion
CN114299405A (en) Unmanned aerial vehicle image real-time target detection method
Mansourifar et al. GAN-based satellite imaging: A survey on techniques and applications
CN113160210A (en) Drainage pipeline defect detection method and device based on depth camera
CN116503602A (en) Unstructured environment three-dimensional point cloud semantic segmentation method based on multi-level edge enhancement
CN116704304A (en) Multi-mode fusion target detection method of mixed attention mechanism
CN116703996A (en) Monocular three-dimensional target detection algorithm based on instance-level self-adaptive depth estimation
CN116682105A (en) Millimeter wave radar and visual feature attention fusion target detection method
CN104851090A (en) Image change detection method and image change detection device
CN112749662B (en) Method for extracting travelable area in unstructured environment based on laser radar
CN113963270A (en) High resolution remote sensing image building detection method
Li et al. IMD-Net: Interpretable multi-scale detection network for infrared dim and small objects
CN117095033B (en) Multi-mode point cloud registration method based on image and geometric information guidance
Yuan et al. Fusion Network for Building Extraction Using High-Resolution Aerial Images and LiDAR Data. Remote Sens. 2021, 13, 2473

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination