CN115393684A - Anti-interference target detection method based on automatic driving scene multi-mode fusion - Google Patents

Anti-interference target detection method based on automatic driving scene multi-mode fusion

Info

Publication number
CN115393684A
CN115393684A
Authority
CN
China
Prior art keywords
different
target
target detection
features
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211321720.9A
Other languages
Chinese (zh)
Other versions
CN115393684B (en)
Inventor
刘寒松
王永
王国强
刘瑞
焦安健
李贤超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sonli Holdings Group Co Ltd
Original Assignee
Sonli Holdings Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sonli Holdings Group Co Ltd filed Critical Sonli Holdings Group Co Ltd
Priority to CN202211321720.9A priority Critical patent/CN115393684B/en
Publication of CN115393684A publication Critical patent/CN115393684A/en
Application granted granted Critical
Publication of CN115393684B publication Critical patent/CN115393684B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

The invention belongs to the technical field of target detection and relates to an anti-interference target detection method based on multi-modal fusion for automatic driving scenes. The method adopts a complementary multi-modal detection scheme combining visible-light and near-infrared images and exploits the multi-modal data at the feature level: different backbone networks are designed to extract the basic features of the visible-light and near-infrared images according to their different apparent characteristics; because the two groups of images differ in information richness and learning difficulty, a learning rate constraint module is applied in each branch to control the update rate of the different modal branches; the two groups of features are then fused by a modal fusion module so that both similar and complementary features are retained, i.e., common ground is sought while differences are preserved. The information of the two modalities is thus fully utilized, the detection accuracy is high, and the missed-detection rate in complex interference scenes is reduced.

Description

Anti-interference target detection method based on automatic driving scene multi-mode fusion
Technical Field
The invention belongs to the technical field of target detection, and relates to an anti-interference target detection method based on automatic driving scene multi-mode fusion.
Background
With the acceleration of urbanization, the number of vehicles in cities has increased sharply, and the accompanying traffic problems have become more pronounced. With the development of artificial intelligence and big-data technology, automatic driving can free the driver from heavy, mechanical driving, reduce traffic accidents caused by human factors such as driver error, and improve road traffic efficiency. The current difficulty of automatic driving lies in perception: although automatic driving performs well in clear scenes, many limitations remain in complex scenes; in particular, under low light at night and under the interference of complex weather such as rain, snow and fog, visibility drops sharply and the contrast between target and background degrades.
Automatic driving needs to rely on multiple sensors to compensate for the information loss of single-modal data; lidar, visible-light cameras, millimeter-wave radar and the like are usually used to perceive the surrounding traffic conditions. In the face of complex weather interference such as rain, snow, fog and haze, the visible-light image, although rich in texture information, suffers reduced image quality, which affects the detection result, so most existing solutions use lidar. However, long-distance targets are small in size and poor in detail, and their appearance is similar to particle interference such as rain and snow, which has its own structural morphology and motion patterns, making them difficult to distinguish; moreover, non-uniform rain and fog cause the lidar to produce false returns in the resulting clumps of fog, which are then judged to be obstacles and greatly affect the system. Complex weather therefore affects the accuracy and completeness of the laser point cloud.
The near-infrared sensor has good transmissivity and images well under adverse conditions such as rain, snow, fog and low-light night scenes; rain and fog, which appear sparse and highlighted in the visible-light image, are almost invisible in the near infrared. A conventional low-cost lens can be used, as for visible light, and the near-infrared sensor can be integrated into the same sensor as the visible-light one, so that a common optical axis with the visible-light image is achieved; registration and alignment are therefore unnecessary, which reduces unnecessary workload and the cost of equipment acquisition. The computational cost of processing near-infrared images is also far lower than that of the 3D point-cloud data of a lidar, which facilitates deployment on resource-limited vehicle-mounted platforms. At the same time, the near infrared lacks rich texture and color information and is at a disadvantage for target identification, so combining it with visible light can compensate for the information loss of single-modal data; a major problem is then how to organically combine images of two modalities with different apparent characteristics for target detection and identification.
Target detection in automatic driving scenes therefore faces outstanding challenges under severe weather interference, and a more effective method for improving the anti-interference capability of detection is urgently needed.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an anti-interference target detection method based on multi-modal fusion for automatic driving scenes. The method is designed to cope with the interference of low light at night and of complex weather such as rain, snow and fog in target detection for automatic driving scenes, and can also be used for target detection tasks under other complex interference conditions; it breaks through the detection bottleneck of a single modality and improves target detection performance.
In order to achieve the above purpose, the specific process of the anti-interference target detection of the present invention is as follows:
(1) Data set construction:
the method comprises the steps of collecting image data of an automatic driving scene under different scenes by using a visible light near-infrared co-optical axis camera, marking the data, constructing a data set, and enabling the data set to be 6:2:2, dividing the quantity ratio into a training set, a verification set and a test set;
(2) Differential feature extraction:
inputting the visible-light image and the near-infrared image into different differentiated feature extraction networks respectively to extract features and obtain two groups of basic features;
(3) Learning rate constraint:
inputting the two groups of basic features obtained in step (2) into a learning rate constraint module, which controls the update rates of the different modal branches;
(4) Modal feature fusion:
inputting the basic features output by the learning rate constraint module in step (3) into a modal feature fusion module, and fusing the features of the two modalities to obtain modal fusion features for subsequent target detection;
(5) Target detection:
inputting the modal fusion features obtained in step (4) into a target detection module, which appends two convolution layers with 3 × 3 kernels to the extracted features, sets three anchor boxes with different aspect ratios at each position of the output feature map, and then learns the target-box position information and classification information with two fully connected layers whose parameters are not shared, where the position information is the deviation of the real target position from the anchor boxes and the classification information is the category of the target box, such as background, pedestrian or vehicle.
(6) Training and testing the network, and outputting a detection result:
training with the training data obtained in step (1): near-infrared and visible-light images of size 640 × 512 are fed in turn into the differentiated feature extraction networks, the learning rate constraint module and the target detection module, and the predicted target positions and categories are output; the errors between the predictions and the real target positions and categories are computed, using Focal loss for the category error and smooth L1 loss for the error between the predicted and real target positions; the parameters are updated by back-propagation, and after 200 complete iterations the best model parameters are saved as the final trained parameters, giving the trained anti-interference target detection network; the test data obtained in step (1) are then tested to obtain the positions and categories of the targets.
As a further technical solution of the present invention, the different scenes in step (1) include low light at night and rainy, snowy and foggy weather scenes, with 1500 images collected in the different modalities; the annotation of the data includes the position and category of each target, where the position comprises the target's center point, length and width, and the category comprises two classes, vehicle and pedestrian.
As a further technical solution of the present invention, the differentiated feature extraction network for extracting visible-light image features in step (2) comprises five convolution modules, each comprising two convolution layers, three activation layers and one max-pooling layer; the differentiated feature extraction network for extracting near-infrared image features comprises five convolution modules, each comprising one convolution layer, two activation layers and one average-pooling layer.
As a further technical solution of the present invention, the learning rate constraint module in step (3) consists of one fully connected layer; during back-propagation of the gradient, the gradients of the different modalities are multiplied by different coefficients to control the learning rates of the different modalities, while the fully connected layer increases the nonlinear fitting capability of the network.
As a further technical solution of the present invention, the modal feature fusion module in step (4) consists of two branches: one branch multiplies the features of the two modalities element-wise, increasing the saliency of similar features ('seeking common ground'); the other branch adds the features of the two modalities, preserving the differing features ('preserving differences').
Compared with the prior art, the invention adopts a complementary multi-modal detection scheme combining visible-light and near-infrared images in order to cope with low light at night and complex weather interference such as rain, snow and fog in automatic driving scenes while addressing the information loss of a single modality. The multi-modal data are fully exploited at the feature level: different backbone networks are designed to extract the basic features of the visible-light and near-infrared images according to their different apparent characteristics; because the two groups of images differ in information richness and learning difficulty, a learning rate constraint module is used in each branch to control the update rate of the different modal branches; the two groups of features are then fused by the modal fusion module so that both similar and complementary features are retained, i.e., common ground is sought while differences are preserved. The information of the two modalities is thus fully utilized, the detection accuracy is high, and the missed-detection rate in complex interference scenes is reduced.
Drawings
Fig. 1 is a schematic diagram of a network architecture framework for implementing anti-interference target detection according to the present invention.
Fig. 2 is a block diagram of a process for implementing anti-interference target detection according to the present invention.
Fig. 3 is a first detection example of embodiment 2 of the present invention, wherein (a) is a detection result diagram of a Yolov3 target detection method, and (b) is a detection result diagram of the method of the present invention.
Fig. 4 is a second detection example of embodiment 2 of the present invention, in which (a) is a detection result diagram of the Yolov3 target detection method, and (b) is a detection result diagram of the method of the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
Example 1:
In this embodiment, the network structure shown in fig. 1 and the process shown in fig. 2 are used to implement anti-interference target detection, which specifically includes the following steps:
(1) And (3) data set construction:
collecting image data of automatic driving scenes under different conditions, including low light at night and complex weather such as rain, snow and fog, by using a visible-light/near-infrared co-optical-axis camera, with 1500 images collected in the different modalities; since the visible-light and near-infrared images share a common optical axis, no registration or alignment is needed, and the data are annotated directly to construct the data set, the annotation comprising the position and category of each target, where the position comprises the target's center point, length and width, and the category comprises two classes, vehicle and pedestrian; the data set is divided into a training set, a validation set and a test set in a 6:2:2 ratio;
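A minimal Python sketch of the 6:2:2 split described in this step is given below; the function name split_dataset and the variable pairs (a list of annotated co-optical-axis visible-light/near-infrared image pairs) are illustrative assumptions, not part of the patent.

```python
import random

def split_dataset(pairs, seed=0):
    """Split annotated visible/NIR image pairs into train/val/test at 6:2:2."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return (pairs[:n_train],                  # training set
            pairs[n_train:n_train + n_val],   # validation set
            pairs[n_train + n_val:])          # test set
```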
(2) Differential feature extraction:
in view of the different apparent characteristics of the near-infrared and visible-light images, different differentiated feature extraction networks are designed to extract the basic features of the visible-light and near-infrared images respectively; the visible-light image has complex and variable textures and high information complexity, so its differentiated feature extraction network uses five convolution modules, each comprising two convolution layers, three activation layers and one max-pooling layer, the max-pooling layer being chosen because it is more sensitive to texture; the texture of the near-infrared image is relatively smooth and background interference is relatively low, so its differentiated feature extraction network also uses five convolution modules, each comprising one convolution layer, two activation layers and one average-pooling layer, the average-pooling layer being better suited to extracting features from the smooth, low-texture near-infrared image; two groups of basic features, visible-light and near-infrared, are thus obtained;
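The following PyTorch sketch shows one possible form of the two differentiated feature extraction networks described above. The layer ordering inside each convolution module, the channel widths and the single-channel near-infrared input are assumptions; the patent only specifies the number of convolution, activation and pooling layers per module.

```python
import torch.nn as nn

def _visible_block(c_in, c_out):
    # Visible-light convolution module: two 3x3 convolutions, three activations,
    # one max-pooling layer (ordering is an assumption).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2), nn.ReLU(inplace=True),
    )

def _nir_block(c_in, c_out):
    # Near-infrared convolution module: one convolution, two activations,
    # one average-pooling layer.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.AvgPool2d(2), nn.ReLU(inplace=True),
    )

class DualBackbone(nn.Module):
    """Differentiated feature extraction: a separate trunk per modality."""
    def __init__(self, channels=(3, 32, 64, 128, 256, 256)):
        super().__init__()
        self.visible = nn.Sequential(*[
            _visible_block(channels[i], channels[i + 1]) for i in range(5)])
        nir_channels = (1,) + channels[1:]   # single-channel NIR input (assumption)
        self.nir = nn.Sequential(*[
            _nir_block(nir_channels[i], nir_channels[i + 1]) for i in range(5)])

    def forward(self, rgb, nir):
        # Returns the two groups of basic features, one per modality.
        return self.visible(rgb), self.nir(nir)
```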
(3) Learning rate constraints:
in view of the different information richness and learning difficulty of the two groups of images, a learning rate constraint module is used in each branch to control the update rate of the different modal branches; the learning rate constraint module takes the basic features of the different modalities as input and consists of one fully connected layer; during back-propagation of the gradient, the gradients of the different modalities are multiplied by different coefficients so as to control the learning rates of the different modalities, while the fully connected layer also increases the nonlinear fitting capability of the network;
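A minimal sketch of such a learning rate constraint module is given below, under the assumptions that the fully connected layer is applied per spatial position and that the per-modality coefficient is a fixed hyperparameter; the patent only states that the module consists of one fully connected layer and that the gradients of the different modalities are multiplied by different coefficients during back-propagation.

```python
import torch
import torch.nn as nn

class _GradScale(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by a coefficient
    in the backward pass."""
    @staticmethod
    def forward(ctx, x, coeff):
        ctx.coeff = coeff
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * ctx.coeff, None

class LearningRateConstraint(nn.Module):
    """One fully connected layer followed by per-modality gradient scaling,
    so that each modal branch updates at its own rate."""
    def __init__(self, channels, coeff):
        super().__init__()
        self.fc = nn.Linear(channels, channels)
        self.coeff = coeff  # hypothetical fixed coefficient for this modality

    def forward(self, feat):                    # feat: (B, C, H, W)
        x = feat.permute(0, 2, 3, 1)            # channels last for the FC layer
        x = torch.relu(self.fc(x))              # FC plus activation for nonlinearity
        x = x.permute(0, 3, 1, 2)
        return _GradScale.apply(x, self.coeff)  # gradients into this branch are scaled
```

One instance of the module would be created per modality, for example with a smaller coefficient for the branch that is easier to fit, so that the two backbones update at different effective rates.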
(4) Modal feature fusion:
the basic features output by the learning rate constraint module in step (3) are input into a modal feature fusion module for fusing the features of the two modalities; the modal feature fusion module consists of two branches, the first of which multiplies the features of the two modalities element-wise to increase the saliency of similar features ('seeking common ground'), while the other adds the features of the two modalities to preserve the differing features ('preserving differences'); the output features of the two branches are combined to obtain the modal fusion features used for subsequent target detection;
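One possible realisation of the modal feature fusion module is sketched below; the element-wise product and sum follow the description, while merging the two branch outputs by channel concatenation and a 1 × 1 convolution is an assumption, since the patent only states that the branch outputs are combined.

```python
import torch
import torch.nn as nn

class ModalFusion(nn.Module):
    """Two-branch fusion: element-wise product emphasises shared responses,
    element-wise sum preserves complementary responses."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution to merge the concatenated branch outputs (assumption).
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_rgb, feat_nir):
        common = feat_rgb * feat_nir     # "seeking common ground": similar features reinforced
        different = feat_rgb + feat_nir  # "preserving differences": complementary features kept
        return self.merge(torch.cat([common, different], dim=1))
```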
(5) Target detection:
the modal fusion features are input into a target detection module, which appends two convolution layers with 3 × 3 kernels to the extracted features, sets three anchor boxes with different aspect ratios at each position of the output feature map, and then learns the target-box position information and classification information with two fully connected layers whose parameters are not shared, where the position information is the deviation of the real target position from the anchor box and the classification information is the category of the target box, such as background, pedestrian or vehicle;
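A sketch of the target detection module is given below, under the assumption that the two parameter-unshared fully connected layers are applied position-wise (i.e. as 1 × 1 convolutions), so that each of the three anchors at every feature-map position receives four box offsets and scores for the three classes background, pedestrian and vehicle.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Two 3x3 convolutions over the fused features, then two unshared heads
    for box offsets and class scores, three anchors per position."""
    def __init__(self, channels=256, num_anchors=3, num_classes=3):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.box_head = nn.Conv2d(channels, num_anchors * 4, 1)            # offsets w.r.t. anchors
        self.cls_head = nn.Conv2d(channels, num_anchors * num_classes, 1)  # class scores

    def forward(self, fused):
        x = self.trunk(fused)
        return self.box_head(x), self.cls_head(x)
```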
(6) Training and testing, and outputting a detection result:
training is performed with the training data obtained in step (1): near-infrared and visible-light images of size 640 × 512 are fed in turn into the differentiated feature extraction networks, the learning rate constraint module and the target detection module, and the predicted target positions and categories are output; the errors between the predictions and the real target positions and categories are computed, using Focal loss for the category error and smooth L1 loss for the error between the predicted and real target positions; the parameters are updated by back-propagation, and after 200 epochs the best model parameters are saved as the final trained parameters, giving the trained anti-interference target detection network; the test data obtained in step (1) are then tested to obtain the positions and categories of the targets.
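A minimal sketch of one training step with the stated losses is given below, assuming a hypothetical wrapper model that chains the backbones, the learning rate constraint, the fusion module and the detection head and returns predictions shaped like the targets; sigmoid_focal_loss from torchvision is used as a stand-in for the Focal loss, and cls_targets are assumed to be one-hot float tensors matching cls_pred.

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def training_step(model, optimizer, rgb, nir, box_targets, cls_targets):
    """One optimisation step: forward over both modalities, Focal loss for the
    classes, smooth L1 loss for the box offsets, then back-propagation."""
    box_pred, cls_pred = model(rgb, nir)          # assumed wrapper over all modules
    cls_loss = sigmoid_focal_loss(cls_pred, cls_targets, reduction="mean")
    box_loss = F.smooth_l1_loss(box_pred, box_targets)
    loss = cls_loss + box_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```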
Example 2:
In this embodiment, the technical scheme of embodiment 1 is tested on the acquired data set. The test data set includes 150 pairs of vehicle infrared/visible-light images and 150 pairs of pedestrian infrared/visible-light images; the Yolov3 target detection method is tested with the visible-light data only, while this embodiment fuses the infrared and visible-light data. The experimental results show that, compared with the existing Yolov3 target detection method, the detection accuracy of this embodiment is improved from 86.2% to 92.4%. Two groups of real detection results are shown in fig. 3 and fig. 4, where (a) is the detection result of Yolov3 (using only the visible-light image) and (b) is the detection result of this embodiment using the infrared/visible-light multi-modal data, finally displayed on the infrared image. As the result figures show, this embodiment greatly reduces the missed-detection rate in complex interference scenes.
Network structures and algorithms not described in detail herein are all common in the art.
It is noted that the disclosed embodiments are intended to aid further understanding of the invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments, but should be defined by the scope of the appended claims.

Claims (5)

1. An anti-interference target detection method based on automatic driving scene multi-mode fusion is characterized by comprising the following specific processes:
(1) Data set construction:
collecting image data of automatic driving scenes under different conditions by using a visible-light/near-infrared co-optical-axis camera, annotating the data to construct a data set, and dividing the data set into a training set, a validation set and a test set in a 6:2:2 ratio;
(2) Differential feature extraction:
inputting the visible-light image and the near-infrared image into different differentiated feature extraction networks respectively to extract features and obtain two groups of basic features;
(3) Learning rate constraint:
inputting the two groups of basic features obtained in step (2) into a learning rate constraint module, which controls the update rates of the different modal branches;
(4) Modal feature fusion:
inputting the basic features output by the learning rate constraint module in step (3) into a modal feature fusion module, and fusing the features of the two modalities to obtain modal fusion features for subsequent target detection;
(5) Target detection:
inputting the modal fusion features obtained in step (4) into a target detection module, which appends two convolution layers with 3 × 3 kernels to the extracted features, sets three anchor boxes with different aspect ratios at each position of the output feature map, and then learns the target-box position information and classification information with two fully connected layers whose parameters are not shared, where the position information is the deviation of the real target position from the anchor box and the classification information is the category of the target box;
(6) Training and testing the network, and outputting a detection result:
training with the training data obtained in step (1): near-infrared and visible-light images of size 640 × 512 are fed in turn into the differentiated feature extraction networks, the learning rate constraint module and the target detection module, and the predicted target positions and categories are output; the errors between the predictions and the real target positions and categories are computed, using Focal loss for the category error and smooth L1 loss for the error between the predicted and real target positions; the parameters are updated by back-propagation, and after 200 complete iterations the best model parameters are saved as the final trained parameters, giving the trained anti-interference target detection network; the test data obtained in step (1) are then tested to obtain the positions and categories of the targets.
2. The anti-interference target detection method based on multi-modal fusion of automatic driving scenes according to claim 1, wherein the different scenes in step (1) comprise low-light scenes at night and rain, snow and fog scenes, 1500 images are collected in the different modalities, and the annotation of the data comprises the position and category of each target, wherein the position comprises the target's center point, length and width, and the category comprises two classes, vehicle and pedestrian.
3. The anti-interference target detection method based on multi-modal fusion of automatic driving scenes according to claim 2, characterized in that the differentiated feature extraction network for extracting visible-light image features in step (2) comprises five convolution modules, each comprising two convolution layers, three activation layers and one max-pooling layer; the differentiated feature extraction network for extracting near-infrared image features comprises five convolution modules, each comprising one convolution layer, two activation layers and one average-pooling layer.
4. The anti-interference target detection method based on multi-modal fusion of automatic driving scenes according to claim 3, wherein the learning rate constraint module in step (3) is composed of one fully connected layer; during back-propagation of the gradient, the gradients of the different modalities are multiplied by different coefficients to control the learning rates of the different modalities, and the fully connected layer meanwhile increases the nonlinear fitting capability of the network.
5. The anti-interference target detection method based on multi-modal fusion of automatic driving scenes according to claim 4, wherein the modal feature fusion module in step (4) is composed of two branches: one branch multiplies the features of the two modalities element-wise to increase the saliency of similar features ('seeking common ground'); the other branch adds the features of the two modalities and retains the differing features ('preserving differences').
CN202211321720.9A 2022-10-27 2022-10-27 Anti-interference target detection method based on automatic driving scene multi-mode fusion Active CN115393684B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211321720.9A CN115393684B (en) 2022-10-27 2022-10-27 Anti-interference target detection method based on automatic driving scene multi-mode fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211321720.9A CN115393684B (en) 2022-10-27 2022-10-27 Anti-interference target detection method based on automatic driving scene multi-mode fusion

Publications (2)

Publication Number Publication Date
CN115393684A (en) 2022-11-25
CN115393684B (en) 2023-01-24

Family

ID=84128353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211321720.9A Active CN115393684B (en) 2022-10-27 2022-10-27 Anti-interference target detection method based on automatic driving scene multi-mode fusion

Country Status (1)

Country Link
CN (1) CN115393684B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116626670A (en) * 2023-07-18 2023-08-22 小米汽车科技有限公司 Automatic driving model generation method and device, vehicle and storage medium
CN116861262A (en) * 2023-09-04 2023-10-10 苏州浪潮智能科技有限公司 Perception model training method and device, electronic equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209810A (en) * 2018-12-26 2020-05-29 浙江大学 Bounding box segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time in visible light and infrared images
WO2020181414A1 (en) * 2019-03-08 2020-09-17 北京数字精准医疗科技有限公司 Multi-spectral imaging system, apparatus and method, and storage medium
CN112418163A (en) * 2020-12-09 2021-02-26 北京深睿博联科技有限责任公司 Multispectral target detection blind guiding system
CN113920066A (en) * 2021-09-24 2022-01-11 国网冀北电力有限公司信息通信分公司 Multispectral infrared inspection hardware detection method based on decoupling attention mechanism
CN113963240A (en) * 2021-09-30 2022-01-21 西南电子技术研究所(中国电子科技集团公司第十研究所) Comprehensive detection method for multi-source remote sensing image fusion target
CN114219824A (en) * 2021-12-17 2022-03-22 南京理工大学 Visible light-infrared target tracking method and system based on deep network
CN114254696A (en) * 2021-11-30 2022-03-29 上海西虹桥导航技术有限公司 Visible light, infrared and radar fusion target detection method based on deep learning
CN114612937A (en) * 2022-03-15 2022-06-10 西安电子科技大学 Single-mode enhancement-based infrared and visible light fusion pedestrian detection method
US20220253639A1 (en) * 2021-02-01 2022-08-11 Inception Institute of Artificial Intelligence Ltd Complementary learning for multi-modal saliency detection
CN115131640A (en) * 2022-06-27 2022-09-30 武汉大学 Target detection method and system utilizing illumination guide and attention mechanism

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209810A (en) * 2018-12-26 2020-05-29 浙江大学 Bounding box segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time in visible light and infrared images
WO2020181414A1 (en) * 2019-03-08 2020-09-17 北京数字精准医疗科技有限公司 Multi-spectral imaging system, apparatus and method, and storage medium
CN112418163A (en) * 2020-12-09 2021-02-26 北京深睿博联科技有限责任公司 Multispectral target detection blind guiding system
US20220253639A1 (en) * 2021-02-01 2022-08-11 Inception Institute of Artificial Intelligence Ltd Complementary learning for multi-modal saliency detection
CN113920066A (en) * 2021-09-24 2022-01-11 国网冀北电力有限公司信息通信分公司 Multispectral infrared inspection hardware detection method based on decoupling attention mechanism
CN113963240A (en) * 2021-09-30 2022-01-21 西南电子技术研究所(中国电子科技集团公司第十研究所) Comprehensive detection method for multi-source remote sensing image fusion target
CN114254696A (en) * 2021-11-30 2022-03-29 上海西虹桥导航技术有限公司 Visible light, infrared and radar fusion target detection method based on deep learning
CN114219824A (en) * 2021-12-17 2022-03-22 南京理工大学 Visible light-infrared target tracking method and system based on deep network
CN114612937A (en) * 2022-03-15 2022-06-10 西安电子科技大学 Single-mode enhancement-based infrared and visible light fusion pedestrian detection method
CN115131640A (en) * 2022-06-27 2022-09-30 武汉大学 Target detection method and system utilizing illumination guide and attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHENGFANG ZHANG et al.: "Convolutional Dictionary Learning Using Global Matching Tracking (CDL-GMT): Application to Visible-Infrared Image Fusion", IEEE *
AN Haonan et al.: "Infrared target fusion detection algorithm based on pseudo-modality conversion", Acta Photonica Sinica *
ZHANG Dian et al.: "Heterogeneous face recognition based on near-infrared and visible light fusion with a lightweight network", Journal of Chinese Computer Systems *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116626670A (en) * 2023-07-18 2023-08-22 小米汽车科技有限公司 Automatic driving model generation method and device, vehicle and storage medium
CN116626670B (en) * 2023-07-18 2023-11-03 小米汽车科技有限公司 Automatic driving model generation method and device, vehicle and storage medium
CN116861262A (en) * 2023-09-04 2023-10-10 苏州浪潮智能科技有限公司 Perception model training method and device, electronic equipment and storage medium
CN116861262B (en) * 2023-09-04 2024-01-19 苏州浪潮智能科技有限公司 Perception model training method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115393684B (en) 2023-01-24

Similar Documents

Publication Publication Date Title
CN115393684B (en) Anti-interference target detection method based on automatic driving scene multi-mode fusion
Han et al. Research on road environmental sense method of intelligent vehicle based on tracking check
WO2022206942A1 (en) Laser radar point cloud dynamic segmentation and fusion method based on driving safety risk field
CN113820714B (en) Dust fog weather road environment sensing system based on multi-sensor fusion
CN111369541A (en) Vehicle detection method for intelligent automobile under severe weather condition
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN110599497A (en) Drivable region segmentation method based on deep neural network
Li et al. A feature pyramid fusion detection algorithm based on radar and camera sensor
CN116484971A (en) Automatic driving perception self-learning method and device for vehicle and electronic equipment
Jiang et al. Target detection algorithm based on MMW radar and camera fusion
CN115830265A (en) Automatic driving movement obstacle segmentation method based on laser radar
CN115876198A (en) Target detection and early warning method, device, system and medium based on data fusion
Cheng et al. Semantic segmentation of road profiles for efficient sensing in autonomous driving
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
CN110909656A (en) Pedestrian detection method and system with integration of radar and camera
Liu et al. A survey on autonomous driving datasets
CN117372991A (en) Automatic driving method and system based on multi-view multi-mode fusion
CN116486359A (en) All-weather-oriented intelligent vehicle environment sensing network self-adaptive selection method
Zhang et al. Smart-rain: A degradation evaluation dataset for autonomous driving in rain
CN116403186A (en) Automatic driving three-dimensional target detection method based on FPN Swin Transformer and Pointernet++
CN113611008B (en) Vehicle driving scene acquisition method, device, equipment and medium
CN115359067A (en) Continuous convolution network-based point-by-point fusion point cloud semantic segmentation method
CN112513876B (en) Road surface extraction method and device for map
CN115170467A (en) Traffic indication method and system based on multispectral pedestrian detection and vehicle speed detection
CN212990128U (en) Small target intelligent identification system based on remote video monitoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant