CN116797894A - Radar and video fusion target detection method for enhancing characteristic information - Google Patents
- Publication number: CN116797894A (application CN202310696483.2A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Abstract
The invention relates to the technical field of computer vision and pattern recognition, and in particular to a radar and video fusion target detection method for enhancing feature information. The method comprises: acquiring radar data and image data, and preprocessing them to obtain a radar point cloud mapping image; constructing a multi-modal fusion model, and inputting the radar point cloud mapping image and the original image into the model to obtain a feature map; inputting the feature map into an enhanced-feature-information network for transformation to obtain an enhanced feature map; embedding a depth separable convolution module into the feature extraction network to sequentially perform depthwise convolution and pointwise convolution, obtaining an output feature map; and classifying and regressing the output feature map to obtain a target detection result. This solves the problem that existing feature fusion methods can produce biased results due to inaccurate feature extraction.
Description
Technical Field
The invention relates to the technical field of computer vision and pattern recognition, and in particular to a radar and video fusion target detection method for enhancing feature information.
Background
Currently, in the intelligent transportation field, fused data from multiple sensors such as lidar, millimeter-wave radar, cameras, BeiDou positioning, and infrared cameras is applied to automatic or assisted driving, road detection, and traffic monitoring safety. With the development of multi-sensor fusion technology, millimeter-wave radar and camera fusion is gradually being applied to intelligent roadside and traffic monitoring scenarios: vehicle-road coordination is realized through the acquisition, analysis, and application of data from multiple sensors, improving traffic safety and reducing traffic accidents as far as possible.
Millimeter-wave radar detects target information by transmitting modulated electromagnetic waves and receiving the observed echoes, from which the distance, position, and speed of a target can be acquired quickly; it also works regardless of illumination and weather. However, because the resulting radar point cloud is sparse, the acquired target shape, texture, and height information is poor. Multi-sensor fusion is an effective way to overcome such single-sensor limitations. A camera, applied complementarily, can acquire dense feature information such as rich object shapes and contours, but its detection of distant objects becomes blurry and inaccurate under heavy rain, heavy fog, or low-light conditions. The cooperative application of millimeter-wave radar and camera therefore yields rich target information while compensating for the limitations and shortcomings of each single sensor.
In general, millimeter-wave radar and vision fusion can be classified into data-level, feature-level, and decision-level fusion. Data-level fusion obtains the most comprehensive, rich, and accurate information, reduces mismatches between different sensors, and improves data reliability and accuracy; however, storing and transmitting the large volume of data is costly and complex, and the processing time is high for systems with large amounts and many types of data. Decision-level fusion combines the decisions of multiple sensors into a final result with higher accuracy and certainty; however, it requires advanced decision algorithms and processing techniques, its cost and difficulty are high, and the problem of decision consistency must be solved when the sensors disagree. Feature-level fusion, by applying deep learning, can fuse features thoroughly while reducing storage and transmission requirements, lowering cost and complexity, handling heterogeneous data well, and improving processing efficiency; but it may produce biased results when feature extraction is inaccurate.
Disclosure of Invention
The invention aims to provide a radar and video fusion target detection method for enhancing feature information, so as to solve the problem that existing feature fusion methods can produce biased results due to inaccurate feature extraction.
In order to achieve the above object, the present invention provides a radar and video fusion target detection method for enhancing feature information, comprising the following steps:
acquiring radar data and image data, and preprocessing the radar data and the image data to obtain a radar point cloud mapping image;
constructing a multi-mode fusion model, and inputting the radar point cloud mapping image and the original image into the multi-mode fusion model to obtain a feature map;
inputting the feature map into an enhanced feature information network for conversion to obtain an enhanced feature map;
embedding a depth separable convolution module into the feature extraction network to sequentially perform depthwise convolution and pointwise convolution, obtaining an output feature map;
and carrying out classification and regression processing on the output feature map to obtain a target detection result.
The acquiring radar data and image data, and preprocessing the radar data and the image data to obtain a radar point cloud mapping image, includes:
acquiring radar data and image data;
performing spatial alignment and time alignment on the radar data and the image data to obtain alignment data;
mapping radar information to the corresponding positions of a radar point cloud mapping image, using the distance and RCS (radar cross-section) of the radar data as information gain, to obtain a point cloud mapping image with radar features;
and, based on the point cloud mapping image, taking the two-channel radar feature image and the three-channel camera image as inputs, associating the radar feature image with the feature information of the image, setting pixel values where no target exists to 0 through a filter, and filling the pixel values at the remaining positions according to the radar channel information, to obtain the radar point cloud mapping image.
The multi-modal fusion model is formed by combining a visual geometry group (VGG) network and a feature pyramid network (FPN) backbone.
Wherein the enhanced feature information network includes a Self-Transformer, a Grounding Transformer, and a Rendering Transformer.
The step of inputting the feature map into an enhanced feature information network for conversion to obtain an enhanced feature map comprises the following steps:
respectively inputting the feature maps into a Self-Transformer, a Grounding Transformer, and a Rendering Transformer to obtain three hierarchical transformation feature maps;
and rearranging the three hierarchical transformation feature maps to the sizes of the corresponding original feature maps, concatenating them with the original pyramid features, and reducing the dimensionality through convolution to obtain the enhanced feature map.
The depth separable convolution module comprises a depthwise convolution module and a pointwise convolution module.
According to the radar and video fusion target detection method for enhancing feature information, radar data and image data are acquired and preprocessed to obtain a radar point cloud mapping image; a multi-modal fusion model is constructed, and the radar point cloud mapping image and the original image are input into it to obtain a feature map; the feature map is input into an enhanced-feature-information network for transformation to obtain an enhanced feature map; a depth separable convolution module is embedded into the feature extraction network to sequentially perform depthwise and pointwise convolution, obtaining an output feature map; and the output feature map is classified and regressed to obtain a target detection result. Through this scheme, cross-scale and cross-space exchange of feature information is realized, improving the recognition of large and small targets as well as of similar targets; the depth separable convolution reduces the number of model training parameters and the amount of computation, lowering the operation cost. The method thereby solves the problem that existing feature fusion methods can produce biased results due to inaccurate feature extraction.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a radar and vision feature fusion overall framework.
Fig. 2 is a specific structure of radar and visual feature fusion.
Fig. 3 is an enhanced feature information network.
Fig. 4 is a flowchart of a radar and video fusion target detection method for enhancing feature information provided by the invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to fig. 1 to 4, the present invention provides a radar and video fusion target detection method for enhancing feature information, which includes the following steps:
S1, acquiring radar data and image data, and preprocessing the radar data and the image data to obtain a radar point cloud mapping image;
the specific method is as follows:
s11, radar data and image data are acquired;
specifically, nuscenes data sets are downloaded on Nuscenes websites, the data sets are stored according to a certain directory format, and the data set preparation task is completed to obtain radar data and image data.
S12, carrying out space alignment and time alignment on the radar data and the image data to obtain alignment data;
specifically, through space-time alignment of radar data and image data, the coordinates of the radar point cloud are converted into coordinates under the image, and target matching association can be performed only when the two are unified in time space. By mapping the radar point cloud onto the image, the point cloud is expanded to a vertical line of fixed height in order to solve the point cloud sparseness problem. Clipping of both Lei Dadian cloud mapped images and images translates from 1600 x 900 image size to 360 x 640 as input.
S13, mapping radar information to the corresponding positions of a radar point cloud mapping image, using the distance and RCS of the radar data as information gain, to obtain a point cloud mapping image with radar features;
S14, based on the point cloud mapping image, taking the two-channel radar feature image and the three-channel camera image as inputs, associating the radar feature image with the feature information of the image, setting pixel values where no target exists to 0 through a filter, and filling the pixel values at the remaining positions according to the radar channel information, to obtain the radar point cloud mapping image.
This in particular reduces the interference of non-target features.
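A minimal sketch of assembling the five-channel network input described in S14; the toy dimensions and radar returns are hypothetical:

```python
import numpy as np

H, W = 8, 8  # toy image size; the method uses 360 x 640

# three-channel camera image and two radar channels (distance, RCS)
image = np.random.rand(3, H, W).astype(np.float32)
radar = np.zeros((2, H, W), dtype=np.float32)

# hypothetical radar returns: (row, col, distance_m, rcs_dbsm)
returns = [(2, 3, 20.0, 5.0), (5, 6, 35.0, -2.0)]
for r, c, dist, rcs in returns:
    radar[0, r, c] = dist
    radar[1, r, c] = rcs

# filter: a boolean target mask zeroes radar pixels with no detection
target_mask = radar.any(axis=0)             # True where a return exists
radar *= target_mask                        # non-target pixels stay 0

# the fused network input: 3 image channels + 2 radar channels
fused_input = np.concatenate([image, radar], axis=0)
```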
S2, constructing a multi-mode fusion model, and inputting the radar point cloud mapping image and the original image into the multi-mode fusion model to obtain a feature map;
Specifically, in order to fully fuse millimeter-wave radar features and image features, the invention adopts a multi-level feature fusion method. Because the feature pyramid model can effectively locate and identify target objects at different scales, an FPN is selected as the backbone of the fusion model; that is, a model combining VGG16 and FPN backbones is used. Conventional CNNs typically detect from the features of the last feature map, so small targets are easily missed when the input is a low-resolution feature map. The feature pyramid instead provides multi-layer feature extraction: larger targets are predicted on deep feature maps, and smaller targets on shallow feature maps. The two backbones are used together because they have good feature extraction capability and strong multi-scale feature representation, effectively supporting the fusion of multi-modal images. In addition, VGG16 and FPN are relatively simple in structure, easy to implement, and computationally efficient.
As can be seen from fig. 2, in the specific network structure, the radar feature image is concatenated and fused on one side and max-pooled to obtain the next stage. After concatenation and fusion, the original image passes through the five module layers of the VGG16 network for feature extraction, producing feature maps of different sizes and channel counts. The outputs of modules 3, 4, and 5 are connected to the feature pyramid network (FPN), which is processed to obtain the feature pyramid maps; the FPN outputs are then processed by the enhanced-feature-information network, and the detection results are completed by the regression and classification modules.
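The spatial sizes implied by a VGG16-style backbone with five 2x2/stride-2 pooling stages can be checked with a small helper; the exact stage layout and channel counts of the patent's network are assumptions here:

```python
def vgg_stage_sizes(h, w, num_stages=5):
    """Spatial sizes after each 2x2/stride-2 max-pool stage of a
    VGG16-style backbone (floor division, as in standard pooling).
    Illustrative helper; the patent's exact stage layout is assumed."""
    sizes = []
    for _ in range(num_stages):
        h, w = h // 2, w // 2
        sizes.append((h, w))
    return sizes

stages = vgg_stage_sizes(360, 640)   # the 360 x 640 cropped input
fpn_inputs = stages[2:]              # modules 3-5 feed the FPN
```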
S3, inputting the feature map into an enhanced feature information network for conversion to obtain an enhanced feature map;
Specifically, inspired by the Transformer concept, the enhanced-feature-information network captures context information through queries, keys, and values, generating new feature maps that encode non-local or local interactions across space and across scale, so that feature map information at different scales can be used to complement one another. The network converts any feature pyramid output feature map into an enhanced feature map of the same size but with richer information, using three interaction methods: self-level, top-down, and bottom-up. The three levels of feature maps output by the feature pyramid network (FPN) are input into the enhanced-feature-information network; the converted maps have the same scale as the feature pyramid but richer context information than the original.
Feature maps at different levels are used to recognize objects at different scales: small objects should be recognized at relatively low levels, and large objects at higher levels. It is therefore necessary to combine local information with global information from higher levels. Objects of similar scale, such as buses and vans, have similar contours, so the accuracy of target classification must be improved by refining the feature information. Furthermore, since, for example, people should appear on zebra crossings rather than in traffic lanes, combining local and global feature information helps the mutual recognition of different objects.
The enhanced feature information network includes a Self-Transformer, a Grounding Transformer, and a Rendering Transformer.
As shown in fig. 3, to describe the enhanced feature network in detail, the following processes are performed on the different scales of the feature pyramid network (FPN) output:
S31, respectively inputting the feature maps into the Self-Transformer (ST), the Grounding Transformer (GT), and the Rendering Transformer (RT) to obtain three hierarchical transformation feature maps;
Self-Transformer (ST): a non-local interaction within the same-level feature map; the output has the same scale as the input.

The ST computation is expressed as follows:

Input parameters: $f_q(P_i)$, $f_k(P_j)$, $f_v(P_j)$, $N$

Similarity: $s_{i,j}^{n} = F_{sim}\big(f_q(P_{i,n}), f_k(P_{j,n})\big)$

Weight: $w_{i,j}^{n} = F_{mos}\big(s_{i,j}^{n}\big)$

Output: $P_i' = F_{mul}(w_{i,j}, v_j)$

where $f_q(P_i)$, $f_k(P_j)$, $f_v(P_j)$ are the query, key, and value transformation functions, respectively; $P$ is the feature map, $P'$ is the transformed feature map, and $P_i'$ is the $i$-th transformed feature position in $P'$.

$F_{sim}$ is the similarity function, implemented as a dot product. $F_{mos}$ is a normalization function based on a mixture of softmaxes (MoS): $F_{mos}(s^{n}) = \sum_{n=1}^{N} \sigma_n \, \mathrm{softmax}(s^{n})$, where $\sigma_n = \mathrm{softmax}(w_n^{\top}\bar{k})$ is the $n$-th aggregation weight, $w_n$ is a learnable linear vector used for normalization, and $\bar{k}$ is the arithmetic mean of $k_j$ over all positions. $F_{mul}$ is the weight aggregation function, implemented as matrix multiplication.
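The ST step can be sketched in NumPy as a toy illustration of dot-product similarity with Mixture-of-Softmaxes normalization; this is not the patented implementation, and for brevity the N parts share the full score matrix rather than being split per part:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_transformer(P, Wq, Wk, Wv, Wn):
    """Toy Self-Transformer over one feature map P of shape (L, d),
    flattened over spatial positions. Wn holds the N learnable
    vectors of the Mixture-of-Softmaxes normalisation."""
    q, k, v = P @ Wq, P @ Wk, P @ Wv
    s = q @ k.T                              # dot-product similarity F_sim
    k_bar = k.mean(axis=0)                   # arithmetic mean of the keys
    sigma = softmax(Wn @ k_bar)              # N aggregation weights
    # MoS normalisation: mixture of N softmaxes over the scores
    w = sum(sigma[n] * softmax(s, axis=-1) for n in range(len(sigma)))
    return w @ v                             # weight aggregation F_mul

L, d, N = 6, 4, 2
rng = np.random.default_rng(0)
P = rng.standard_normal((L, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Wn = rng.standard_normal((N, d))
P_out = self_transformer(P, Wq, Wk, Wv, Wn)
```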
Grounding Transformer (GT): a top-down interaction; the output has the same scale as the low-level feature map. It grounds the "concept" in the high-level feature map $P^c$ into the "pixels" of the low-level feature map $P^f$.

The GT computation is expressed as follows:

Input parameters: $f_q(P_i^f)$, $f_k(P_j^c)$, $f_v(P_j^c)$, $N$

Similarity: $s_{i,j}^{n} = F_{eud}\big(f_q(P_{i,n}^f), f_k(P_{j,n}^c)\big)$

Weight: $w_{i,j}^{n} = F_{mos}\big(s_{i,j}^{n}\big)$

Output: $P_i'^{f} = F_{mul}(w_{i,j}, v_j)$

where the similarity is computed as the negative squared Euclidean distance, $F_{eud}(q_i, k_j) = -\lVert q_i - k_j \rVert^2$; $P_i^f$ is the $i$-th feature position of $P^f$ and $P_j^c$ is the $j$-th feature position of $P^c$; $F_{mul}$ is the weight aggregation function. Compared with ST, only the similarity function used for the top-down interaction is changed.
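The piece that distinguishes GT from ST, the negative squared Euclidean similarity, can be illustrated with a small helper; the shapes and values below are hypothetical:

```python
import numpy as np

def grounding_similarity(q, k):
    """Negative squared Euclidean distance used by the Grounding
    Transformer instead of the dot product: closer query/key pairs
    score higher. q: (Lq, d) low-level queries, k: (Lk, d) high-level
    keys; returns an (Lq, Lk) similarity matrix."""
    diff = q[:, None, :] - k[None, :, :]
    return -np.sum(diff ** 2, axis=-1)

q = np.array([[0.0, 0.0], [1.0, 1.0]])
k = np.array([[0.0, 0.0], [3.0, 4.0]])
S = grounding_similarity(q, k)
```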
Rendering Transformer (RT): a bottom-up interaction; the output has the same scale as the high-level feature map. It renders the high-level "concept" with the visual attributes contained in the low-level "pixels"; the rendering is performed not per pixel but over the entire feature map.

The RT computation is expressed as follows:

Input parameters: $Q$, $K$, $V$

Weight: $w = \mathrm{GAP}(K)$

Weighted query: $Q_w = F_w(Q, w)$

Downsampling: $V_{dow} = F_{sconv}(V)$

Output: $P' = F_{add}\big(F_{conv}(Q_w), V_{dow}\big)$

where $F_w$ is the channel-wise weighting function, $F_{conv}$ is a 3 x 3 convolution with stride = 1, $F_{sconv}$ is a 3 x 3 strided convolution, and $F_{add}$ is a feature map summation followed by a 3 x 3 convolution.

The high-level feature map is $Q$, and the low-level feature maps are $K$ and $V$. To highlight the rendering target, the interaction between $Q$ and $K$ is performed in a channel-attention manner: $K$ first computes the weight $w$ of $Q$ by global average pooling (GAP); the weighted query $Q_w$ is then refined by a 3 x 3 convolution; $V$ is downsampled by a 3 x 3 strided convolution to reduce the feature map size; finally, the refined $Q_w$ and the downsampled $V_{dow}$ are summed, and the result is refined by another 3 x 3 convolution.
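A toy sketch of the RT data flow, with 2x average pooling standing in for the strided 3 x 3 convolution and the refinement convolutions omitted; an illustration under those simplifications, not the patented network:

```python
import numpy as np

def rendering_transformer(Q, K, V):
    """Toy Rendering Transformer: Q is a high-level map (C, Hq, Wq);
    K and V are low-level maps (C, 2*Hq, 2*Wq). Channel attention via
    global average pooling, then 2x average downsampling of V."""
    w = K.mean(axis=(1, 2))                  # GAP over spatial dims
    Q_w = Q * w[:, None, None]               # channel-wise re-weighting
    C, H, W = V.shape
    V_dow = V.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))
    return Q_w + V_dow                       # summation F_add

C, Hq, Wq = 3, 4, 4
rng = np.random.default_rng(1)
Q = rng.standard_normal((C, Hq, Wq))
K = rng.standard_normal((C, 2 * Hq, 2 * Wq))
V = rng.standard_normal((C, 2 * Hq, 2 * Wq))
out = rendering_transformer(Q, K, V)
```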
S32, rearranging the three hierarchical transformation feature maps to the sizes of the corresponding original feature maps, concatenating them with the original pyramid features, and reducing the dimensionality through convolution to obtain the enhanced feature maps.
Specifically, each level-transformed feature map is rearranged to the size of its corresponding original feature map and concatenated with the original pyramid features, and the dimensionality is reduced through convolution; this finally yields a group of feature maps with the same size as the input feature maps but richer in information. That is, each transformed map is concatenated with the original feature map and then passed through a convolution that reduces it back to the original number of channels.
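The concatenate-then-reduce step can be sketched as a channel concatenation followed by a 1 x 1 convolution, written here as an einsum over channels; the shapes and weights are hypothetical toys:

```python
import numpy as np

def concat_and_reduce(feat_a, feat_b, W):
    """Concatenate two feature maps along channels, then apply a 1x1
    convolution (an einsum over the channel axis) to restore the
    original channel count. feat_a, feat_b: (C, H, W_img); W: (C, 2C)."""
    stacked = np.concatenate([feat_a, feat_b], axis=0)   # (2C, H, W_img)
    return np.einsum('oc,chw->ohw', W, stacked)          # back to (C, H, W_img)

C, H, Wi = 4, 5, 5
rng = np.random.default_rng(2)
a, b = rng.standard_normal((C, H, Wi)), rng.standard_normal((C, H, Wi))
W1x1 = rng.standard_normal((C, 2 * C))
reduced = concat_and_reduce(a, b, W1x1)
```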
Through this scheme, features can interact across space and across scale. In the feature pyramid, the high- and low-level feature maps contain a large amount of global and local image information, and feature maps of the same scale are concatenated and superimposed after the feature information interaction to obtain information-enhanced feature maps. For example, RT achieves feature information enhancement through a bottom-up design, incorporating the visual attributes of the low levels to render the high-level concept.
S4, embedding a depth separable convolution module into the feature extraction network to sequentially perform depthwise convolution and pointwise convolution, obtaining an output feature map;
Specifically, the depth separable convolution module comprises a depthwise convolution module and a pointwise convolution module. As a lightweight, pluggable network, it extracts features by combining these two modules. The specific implementation is as follows:
1) Depthwise convolution module: each convolution kernel processes only one channel, i.e. its own corresponding channel, so the number of kernels equals the number of input channels. The feature maps produced by the kernels are stacked together, and the input and output feature maps have the same number of channels. Because only the spatial information (in the length-width directions) is processed, cross-channel information is lost; pointwise convolution is therefore required to restore it.
2) Pointwise convolution module: a 1 x 1 convolution processes the cross-channel dimension, with multiple kernels generating multiple feature maps. Since the depthwise module handles only the spatial information and the pointwise convolution handles only the cross-channel information, the number of parameters and the amount of computation of the network are greatly reduced, improving the efficiency and speed of the model.
In a general convolution, one kernel processes all channels, so when the feature map has multiple channels each kernel also spans all channels; every additional extracted attribute requires one additional kernel, i.e. the total parameters of a general convolution grow as (number of attributes) x (kernel size) x (number of input channels). With a depth separable convolution, extracting more attributes only requires more pointwise kernels after the single per-channel depthwise pass, and the output matches that of a general convolution. The more attributes are extracted, the more parameters the depth separable convolution saves; its parameter count and computational cost are low compared with a general convolution.
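The parameter comparison can be made concrete with a short calculation; bias terms are omitted, and the 256-channel, 3 x 3 setting is an arbitrary example rather than a configuration from the patent:

```python
def standard_conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise (one k x k kernel per input channel) followed by a
    pointwise 1 x 1 convolution across channels (bias omitted)."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

std = standard_conv_params(256, 256, 3)        # 589,824 parameters
sep = depthwise_separable_params(256, 256, 3)  # 67,840 parameters
ratio = sep / std                              # about 1/c_out + 1/k**2
```

For this setting the depth separable version uses roughly 11.5% of the standard convolution's parameters, matching the well-known 1/c_out + 1/k^2 reduction factor.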
S5, classifying and regressing the output feature map to obtain a target detection result.
Specifically, in order to obtain the target detection result, classification and regression are performed, and the performance of the model is evaluated through indicators such as mAP, per-class average precision, precision, and recall. Finally, the output of the enhanced-feature-information network is processed through the bounding box regression and classification blocks to obtain the target detection result. Before training the model, the hyperparameters are set: a learning rate of 2e-5, a batch size of 1, 25 epochs, and a fixed height of 3.
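A minimal helper for the precision and recall indicators mentioned above; illustrative only, with hypothetical detection counts:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, and
    false-negative detection counts. Illustrative helper for the
    evaluation step, not the patent's evaluation code."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# hypothetical counts for one class at one IoU threshold
p, r = precision_recall(tp=80, fp=20, fn=20)
```

Averaging the per-class average precisions computed from such precision-recall curves yields the mAP indicator.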
Advantageous effects
1. By introducing the enhanced-feature-information network, the detection and recognition of tiny objects, such as small bottles and other small items thrown onto a highway, can be addressed. Objects of different scales are recognized with feature maps of different levels: small objects at lower levels and large objects at higher levels. In this way, the context information available for tiny objects is greatly enriched, improving the robustness of the multi-modal fusion model.
2. Because different scales are fused in the FPN, a large amount of redundant and conflicting information is easily generated, reducing the multi-scale expression capability. Top-down and bottom-up feature information interaction is therefore performed on the feature maps of different levels, enabling the detection of richer feature information and improving feature utilization.
3. Since a CNN generally detects from the features of the last feature map, small objects are easily missed when the input is a low-resolution feature map; the feature pyramid therefore provides multi-layer feature extraction, with large objects predicted on deep feature maps and small objects on shallow feature maps.
4. Adding the enhanced-feature-information network inevitably increases the number of parameters and the amount of computation; replacing the general convolution with the depth separable convolution reduces, to a certain extent, the computation and parameter consumption during model training, lowering the cost.
5. The enhanced feature network is an end-to-end design that avoids the excessive redundancy caused by a complex network structure and improves the stability of the fusion network.
The foregoing disclosure is only a preferred embodiment of the radar and video fusion target detection method for enhancing feature information, but the scope of the invention is not limited thereto; those skilled in the art will understand that all or part of the procedures implementing the foregoing embodiment, and equivalents thereof, still fall within the scope of the invention.
Claims (6)
1. The radar and video fusion target detection method for enhancing the characteristic information is characterized by comprising the following steps of:
acquiring radar data and image data, and preprocessing the radar data and the image data to obtain a radar point cloud mapping image;
constructing a multi-mode fusion model, and inputting the radar point cloud mapping image and the original image into the multi-mode fusion model to obtain a feature map;
inputting the feature map into an enhanced feature information network for conversion to obtain an enhanced feature map;
embedding a depth separable convolution module into the feature extraction network to sequentially perform depthwise convolution and pointwise convolution, obtaining an output feature map;
and carrying out classification and regression processing on the output feature map to obtain a target detection result.
2. The method for radar and video fusion target detection of enhanced feature information of claim 1,
the obtaining radar data and image data, and preprocessing the radar data and the image data to obtain Lei Dadian cloud mapping images, includes:
acquiring radar data and image data;
performing spatial alignment and time alignment on the radar data and the image data to obtain alignment data;
using the distance and the radar cross-section (RCS) from the radar data as information gain, mapping the radar information to the corresponding positions of a radar point cloud mapping image to obtain a point cloud mapping image with radar features;
and based on the point cloud mapping image, taking the two-channel radar feature image of the radar data and the three-channel image of the image data as inputs, associating the radar feature image with the feature information of the image, setting the pixel values at positions where no target exists to 0 through a filter, and filling the pixel values at the other positions according to the radar channel information, to obtain the radar point cloud mapping image.
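The mapping step of claim 2 can be sketched as projecting each radar return into the image plane and filling a two-channel (distance, RCS) map, with zeros wherever no target exists. The pinhole intrinsics `K` and the helper name are assumptions for illustration, not the patented implementation:

```python
import numpy as np

def render_radar_channels(points, rcs, K, image_shape):
    """points: (N, 3) radar detections in camera coordinates (already
    spatially and temporally aligned); rcs: (N,) radar cross-section values;
    K: 3x3 pinhole intrinsics; returns a (2, H, W) radar feature map."""
    h, w = image_shape
    radar_map = np.zeros((2, h, w), dtype=np.float32)  # no-target pixels stay 0
    uvw = (K @ points.T).T                  # project to homogeneous pixel coords
    uv = uvw[:, :2] / uvw[:, 2:3]
    for (u, v), depth, r in zip(uv, points[:, 2], rcs):
        ui, vi = int(round(u)), int(round(v))
        if depth > 0 and 0 <= ui < w and 0 <= vi < h:
            radar_map[0, vi, ui] = depth    # distance channel
            radar_map[1, vi, ui] = r        # RCS channel
    return radar_map
```

The resulting two-channel map is then stacked with the three-channel camera image to form the five-channel input described above.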
3. The radar and video fusion target detection method with enhanced feature information according to claim 1, wherein
the multi-mode fusion model is formed by combining a selected visual geometry group (VGG) network with a feature pyramid backbone network.
4. The radar and video fusion target detection method with enhanced feature information according to claim 1, wherein
the enhanced feature information network comprises a self-transformer, a grounding transformer, and a rendering transformer.
5. The radar and video fusion target detection method with enhanced feature information according to claim 4, wherein
the inputting the feature map into the enhanced feature information network for transformation to obtain the enhanced feature map comprises:
respectively inputting the feature map into the self-transformer, the grounding transformer, and the rendering transformer to obtain three hierarchically transformed feature maps;
and rearranging the three hierarchically transformed feature maps to the sizes of the corresponding original feature maps, splicing them with the original pyramid features, and reducing the dimension through convolution to obtain the enhanced feature map.
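The splice-and-reduce step of claim 5 amounts to channel concatenation followed by a 1 x 1 convolution, which on a (C, H, W) tensor is simply a matrix multiply over the channel axis. The shapes and weight names below are illustrative assumptions:

```python
import numpy as np

def fuse_and_reduce(original, transformed, w):
    """original: (C, H, W) pyramid feature; transformed: list of (Ci, H, W)
    maps already rearranged back to the original resolution;
    w: (C_out, C_total) 1x1-convolution weights.
    Returns the dimension-reduced (C_out, H, W) enhanced feature map."""
    stacked = np.concatenate([original] + transformed, axis=0)  # splice channels
    c_total, h, width = stacked.shape
    # a 1x1 convolution is a linear map applied independently at every pixel
    return (w @ stacked.reshape(c_total, -1)).reshape(-1, h, width)
```

In practice `w` would be learned; the point of the sketch is that dimension reduction happens per pixel, so the spatial layout of the pyramid level is preserved.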
6. The radar and video fusion target detection method with enhanced feature information according to claim 1, wherein
the depthwise separable convolution module comprises a depthwise convolution module and a pointwise convolution module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310696483.2A CN116797894A (en) | 2023-06-13 | 2023-06-13 | Radar and video fusion target detection method for enhancing characteristic information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116797894A true CN116797894A (en) | 2023-09-22 |
Family
ID=88039555
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||