CN113326735B - YOLOv5-based multi-mode small target detection method - Google Patents

YOLOv5-based multi-mode small target detection method

Info

Publication number
CN113326735B
CN113326735B (application CN202110475048.8A)
Authority
CN
China
Prior art keywords
mode
network
illumination
loss
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110475048.8A
Other languages
Chinese (zh)
Other versions
CN113326735A (en)
Inventor
霍静
孙宏伟
李文斌
高阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Wanwei Aisi Network Intelligent Industry Innovation Center Co ltd
Nanjing University
Original Assignee
Jiangsu Wanwei Aisi Network Intelligent Industry Innovation Center Co ltd
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Wanwei Aisi Network Intelligent Industry Innovation Center Co ltd and Nanjing University
Priority to CN202110475048.8A
Publication of CN113326735A
Application granted
Publication of CN113326735B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection


Abstract

The invention discloses a YOLOv5-based multi-modal small target detection method, which mainly addresses the problem of jointly using infrared and visible-light images for target detection. The main steps are: constructing a lightweight illumination-aware network and using it to compute a perception coefficient for the visible-light modality image; and, based on the designed illumination-aware network, performing multi-modal fusion of the infrared-modality and visible-light-modality data under the YOLOv5 architecture. The method estimates the illumination perception coefficient of the visible-light modality image with the illumination-aware network and applies perception-weighted fusion to the outputs of the trained dual-modality detection network within the NMS algorithm. The method achieves good detection results on multi-modal datasets, and the model is robust in complex environments such as night scenes.

Description

YOLOv5-based multi-mode small target detection method
Technical Field
The invention discloses a YOLOv5-based multi-modal small target detection method and belongs to the field of computer vision.
Background
More and more researchers are focusing on improving the accuracy of target detection models by using multiple sensors. In complex environments, researchers usually exploit the complementarity of multi-modal data to improve model performance: because different sensors record information in different ways, the information across modalities is complementary. Commonly used sensors include infrared cameras, lidar, and depth cameras, which are less susceptible to the external environment.
In 2015, Hwang et al. published a multi-modal dataset at CVPR, named KAIST, with pedestrian detection as the application background; it provides aligned visible-light and infrared images. The KAIST dataset served as a benchmark that opened up the field of multi-modal object detection. Based on the KAIST dataset, Li et al. proposed an illumination-aware gated-fusion technique for multi-modal complementarity; the authors verified it experimentally on Faster R-CNN and analyzed fusion structures such as Input Fusion, Early Fusion, Halfway Fusion, and Late Fusion. Input Fusion operates at the data input layer: the visible-light image consists of red, green, and blue channels, while the infrared image is usually a single-channel grayscale image, so the two modalities are stacked into four channels, making the fusion simple to implement. Early Fusion fuses features at the bottom layers of the backbone network; it generally fuses only low-level semantic features and lacks fusion of high-level semantics. Halfway Fusion fuses features in the middle layers of the backbone, where features are better suited to fusion, but the model is harder to train. Late Fusion operates at the network output layer; it focuses on fusing detection results and is easy to implement for both training and deployment.
After Hwang, Lu et al. analyzed multi-modal fusion in further detail on the basis of Li's work. The authors argued that the drift of object coordinates between modalities must be considered during multi-modal fusion; for a trained model, they simulated drift in the visible-light modality at the inference stage to verify the influence of coordinate drift on model accuracy. The authors first manually corrected the object coordinates of the two modalities in the KAIST dataset and then proposed an RFA module to further correct the drift algorithmically and promote effective multi-modal fusion, although introducing the RFA module reduces inference speed. Yang et al. took SSD as the research framework, proposed a GFU-based multi-modal fusion unit, and applied multi-modal fusion to a one-stage detection framework. Heng et al. proposed a cyclic refinement fusion module and introduced a semantic supervision loss as an auxiliary strategy to make feature fusion more balanced. Zhou et al. further analyzed the problem on the basis of Lu's work, arguing that multi-modal fusion is affected by two imbalance factors, illumination and features, and proposed a feature-fusion module and an illumination-aware fusion module based on a differential-circuit idea on top of the SSD detection model.
From the research above, it can be seen that most works adopt Halfway Fusion for multi-modal target detection. This mode is complex to implement, the inconsistent feature distributions across modalities make model training more difficult, and it brings considerable difficulty to the deployment and application of target detection models.
Disclosure of Invention
The invention provides an algorithm specifically for multi-modal target detection in complex environments. The algorithm is based on a lightweight illumination-aware network and fuses the detection results of the visible-light and infrared modalities; that is, an illumination perception coefficient computed on the visible-light modality is introduced at the Late Fusion stage to weight the visible-light results.
The YOLOv5-based multi-modal small target detection method comprises the following steps:
step (1), collecting data for the scene to be applied and dividing it into a training set and a validation set;
step (2), scaling the illumination-aware network dataset and applying data augmentation to the multi-modal dataset;
step (3), designing an illumination-aware network and training it independently with a binary cross-entropy loss;
step (4), on the multi-modal dataset, training the visible-light modality and the infrared modality independently under the YOLOv5 detection framework;
step (5), integrating the independently trained illumination-aware model, visible-light model, and infrared model into the defined multi-modal network;
and step (6), computing the perception coefficient of the visible-light modality image with the illumination-aware network, weighting the output of the visible-light model by this perception coefficient, and finally fusing the dual-modality outputs and feeding them into the non-maximum suppression (NMS) algorithm.
The beneficial effects are as follows: the method estimates the illumination perception coefficient of the visible-light modality image with the illumination-aware network and applies perception-weighted fusion to the outputs of the trained dual-modality detection network within the NMS algorithm; it achieves good detection results on multi-modal datasets, and the model is robust in complex environments such as night scenes.
Drawings
FIG. 1 illustrates multi-modal target detection based on illumination-aware network fusion.
FIG. 2 is a multimodal fusion pseudocode based on a lighting aware network.
Detailed Description
The invention is described in further detail below with reference to the drawings and specific embodiments, so as to show its objects, features, and advantages in detail.
1. Illumination perception network based on Focus structure
The visible-light modality image is strongly affected by environmental illumination, especially at night. From the perspective of the algorithm, targets detected in the visible-light modality are therefore not fully reliable and may be missed or falsely detected, so a weighting coefficient is needed to evaluate the visible-light image.
The method borrows the Focus convolution structure from the YOLOv5 model and applies it to the definition of the illumination-aware network. Specifically, the Focus structure consists of a Conv layer with a 1×1 kernel; inside the Focus module, a 128×128 input image is sampled at intervals along both the horizontal and vertical directions, forming four 64×64 downsampled maps that are stacked into a 12-channel input. The result is then downsampled by a 2×2 pooling layer, a Dropout layer discards neurons with probability 0.2, and the resulting feature vector is fed into a Linear layer for prediction, with a softmax function applied at the tail of the network.
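For illustration, a minimal PyTorch sketch of such an illumination-aware network follows, assuming a 128×128 three-channel visible-light input; the number of convolution output channels, the class names, and the interpretation of the first output as the daytime score are illustrative assumptions not fixed by the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Focus(nn.Module):
    """Space-to-depth slicing as in YOLOv5: 3x128x128 -> 12x64x64, then a 1x1 Conv."""
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * 4, out_ch, kernel_size=1)

    def forward(self, x):
        # Interval sampling along height and width produces four half-resolution maps.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)

class IlluminationNet(nn.Module):
    """Lightweight illumination-aware classifier (two softmax outputs w1, w2)."""
    def __init__(self):
        super().__init__()
        self.focus = Focus(3, 16)
        self.pool = nn.MaxPool2d(2)           # 64x64 -> 32x32
        self.dropout = nn.Dropout(p=0.2)      # drop neurons with probability 0.2
        self.fc = nn.Linear(16 * 32 * 32, 2)  # assumed: w1 = daytime score, w2 = nighttime score

    def forward(self, x):
        x = self.pool(self.focus(x))
        x = self.dropout(torch.flatten(x, 1))
        return F.softmax(self.fc(x), dim=1)

# Usage: w = IlluminationNet()(torch.randn(1, 3, 128, 128))  # shape (1, 2)
```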
The illumination perception coefficient of the visible-light modality is computed as

w' = (w + μ) / (1 + k·μ),    ε = w'_1

where w is the output vector of the illumination-aware network, consisting of the two elements w_1 and w_2; μ is a smoothing factor; k is the number of label categories; w' is the smoothed vector; and ε is the resulting perception coefficient, i.e. the first element of w' is taken for the assignment.
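As a sketch, the coefficient could be computed from the network output as follows, under the assumption that the smoothing takes the additive form written above; the default values of mu and k are illustrative.

```python
import torch

def perception_coefficient(w: torch.Tensor, mu: float = 0.1, k: int = 2) -> float:
    """w: softmax output of the illumination-aware network, shape (2,).
    Returns the smoothed first element, used as the weighting coefficient epsilon."""
    w_smooth = (w + mu) / (w.sum() + k * mu)  # additive smoothing; w.sum() == 1 after softmax
    return float(w_smooth[0])
```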
2. Multi-mode fusion based on illumination perception coefficient
The invention realizes multi-modal information fusion on top of the YOLOv5 target detection architecture. As shown in FIG. 1, the illumination-aware multi-modal detection fusion architecture consists of an illumination-aware network and a dual-modality fusion network.
First, the overall loss function of the illumination-aware multi-modal detection algorithm is defined as

L = Σ_{m ∈ {visible, lwir}} ( γ_0·L_obj^m + γ_1·L_cls^m + γ_2·L_box^m ) + L_aware

where visible denotes the visible-light modality, lwir the infrared modality, and L_aware the training loss of the illumination-aware network. The loss of each modality consists of three parts, L_obj, L_cls, and L_box, and γ_0, γ_1, γ_2 are hyperparameters balancing the three losses. The loss of the illumination-aware network is defined as follows:
L_aware = -x'_d · log(x_d) - x'_n · log(x_n)
where x_d and x_n are the true labels for daytime and nighttime, respectively, and x'_d and x'_n are the corresponding outputs of the illumination-aware network.
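A minimal training sketch for the illumination-aware network of step (3); it assumes the IlluminationNet class from the sketch in Section 1, and the optimizer, learning rate, label convention, and dummy data loader are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = IlluminationNet()                                   # from the sketch in Section 1
criterion = nn.NLLLoss()                                    # cross-entropy on log-probabilities
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # assumed optimizer and learning rate

# Dummy loader: batches of 128x128 visible-light crops with labels 0 = day, 1 = night.
loader = [(torch.randn(8, 3, 128, 128), torch.randint(0, 2, (8,)))]

for images, labels in loader:
    probs = model(images)                                   # softmax outputs (B, 2)
    loss = criterion(torch.log(probs + 1e-8), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```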
The objectness (foreground/background) loss and the class-classification loss both use the same cross-entropy formulation, similar to the illumination-aware loss, and are defined as

L = -(1/n) · Σ_{i=1}^{n} w_i · [ y_i·log σ(x_i) + (1 - y_i)·log(1 - σ(x_i)) ]

where n is the number of samples, w_i is the loss weight coefficient of the i-th sample, x_i is the network output at the i-th sample point, y_i is the true label of the i-th sample point, and σ(·) is the Sigmoid activation function.
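A short sketch of this weighted binary cross-entropy, with the 1/n averaging assumed as written above:

```python
import torch

def weighted_bce(x: torch.Tensor, y: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x: raw network outputs, y: 0/1 targets, w: per-sample weights, all of shape (n,)."""
    p = torch.sigmoid(x)
    return -(w * (y * torch.log(p + 1e-8) + (1 - y) * torch.log(1 - p + 1e-8))).mean()
```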
The position regression loss uses the CIoU loss, defined as

L_box = 1 - IoU + ρ²(b, b^gt) / c² + α·v

where ρ(·) is the Euclidean distance, b and b^gt are the center points of the predicted box and the ground-truth box, c is the diagonal length of the smallest rectangle enclosing both boxes, α is a trade-off parameter, and v measures the consistency of the aspect ratios.
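For reference, a sketch of the CIoU box loss for axis-aligned boxes in (x1, y1, x2, y2) format; it follows the standard CIoU definition and is not claimed to match any patent-specific variant.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """pred, target: tensors of shape (N, 4) in (x1, y1, x2, y2) format."""
    # Intersection and IoU
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # rho^2: squared distance between box centers
    cxp = (pred[:, 0] + pred[:, 2]) / 2; cyp = (pred[:, 1] + pred[:, 3]) / 2
    cxt = (target[:, 0] + target[:, 2]) / 2; cyt = (target[:, 1] + target[:, 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2

    # c^2: squared diagonal of the smallest enclosing box
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # Aspect-ratio consistency term v and trade-off parameter alpha
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return (1 - iou + rho2 / c2 + alpha * v).mean()
```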
As shown in FIG. 2, the multi-modal fusion based on the illumination-aware network proceeds as follows: for the result sets A and B output by the two modalities and the corresponding confidence sets R and S, the perception coefficient ε of the current visible-light image is first obtained from the illumination-aware network and the perception-coefficient formula; before fusion, the confidences output by the visible-light modality are multiplied by ε, and the combined results are then fed into the non-maximum suppression algorithm for fusion.
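A minimal sketch of this fusion step, assuming detections are given as (x1, y1, x2, y2, score, class) rows and using torchvision's batched NMS; the function name and tensor layout are illustrative assumptions.

```python
import torch
from torchvision.ops import batched_nms

def fuse_detections(det_visible: torch.Tensor, det_infrared: torch.Tensor,
                    epsilon: float, iou_thr: float = 0.5) -> torch.Tensor:
    """det_*: tensors of shape (N, 6) = (x1, y1, x2, y2, score, class).
    epsilon: illumination perception coefficient of the current visible-light image."""
    det_visible = det_visible.clone()
    det_visible[:, 4] *= epsilon                 # down-weight visible-light confidences
    dets = torch.cat([det_visible, det_infrared], dim=0)
    keep = batched_nms(dets[:, :4], dets[:, 4], dets[:, 5].long(), iou_thr)
    return dets[keep]
```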
The preferred embodiments of the present invention have been described in detail above, but the invention is not limited to the specific details of these embodiments; various equivalent modifications can be made to the technical solution within the scope of the technical concept of the invention, and all such equivalent modifications fall within the protection scope of the invention.

Claims (5)

1. A YOLOv5-based multi-modal small target detection method, specifically comprising the following steps:
step (1), collecting data for the scene to be applied and dividing it into a training set and a validation set;
step (2), scaling the illumination-aware network dataset and applying data augmentation to the multi-modal dataset;
step (3), designing an illumination-aware network and training it independently with a binary cross-entropy loss;
step (4), on the multi-modal dataset, training the visible-light modality and the infrared modality independently under the YOLOv5 detection framework;
step (5), integrating the independently trained illumination-aware model, visible-light model, and infrared model into the defined multi-modal network; in the training strategy of the multi-modal illumination-aware fusion model, the YOLOv5 target detection algorithm serves as the multi-modal fusion architecture, the illumination-aware network is introduced, and the overall loss function of the illumination-aware multi-modal detection algorithm is defined as

L = Σ_{m ∈ M} ( γ_0·L_obj^m + γ_1·L_cls^m + γ_2·L_box^m ) + L_aware

where M is the modality set comprising two elements, the visible-light modality and the infrared modality, L_aware is the training loss of the illumination-aware network, the loss of each modality consists of the three parts L_obj, L_cls, and L_box, and γ_0, γ_1, γ_2 are hyperparameters balancing the three losses; L_aware is defined as follows:
L_aware = -x'_d · log(x_d) - x'_n · log(x_n)
where x_d and x_n are the true labels for daytime and nighttime, respectively, and x'_d and x'_n are the corresponding outputs of the illumination-aware network;
the cross entropy loss architecture definition is uniformly used for the front background loss and the back background loss and the category classification loss, and is similar to the illumination perception loss, and the specific definition is as follows:
where n represents the number of samples, w i A loss weight coefficient, x, representing the ith sample i Network output representing the ith sample point, y i The true label value representing the ith sample point, σ (·) is the Sigmoid activation function;
the loss function was defined as follows using CIoU loss for position regression loss calculation:
wherein ρ is 2 (. Cndot.) is the Euclidean distance calculation, b gt Respectively representing the coordinates of the central points of the object BBox, and c represents BBox and BBox gt The diagonal distance of the minimum circumscribed rectangle, alpha is used as a track-off parameter, and v is used for measuring an aspect ratio consistency parameter;
and step (6), computing the perception coefficient of the visible-light modality image with the illumination-aware network, weighting the output of the visible-light model by this perception coefficient, and finally fusing the dual-modality outputs and feeding them into the non-maximum suppression algorithm.
2. The YOLOv5-based multi-modal small target detection method of claim 1, wherein the dataset division in step (1) involves two types of datasets: the first is the illumination-aware network dataset and the second is the multi-modal detection dataset.
3. The YOLOv5-based multi-modal small target detection method of claim 1, wherein in step (3) a lightweight illumination-aware network is designed; in addition to the Conv and Linear structures, a Focus structure is introduced at the head of the illumination-aware network, and the Focus structure samples the input image at intervals in both directions, increasing the number of input channels while reducing the image size, which effectively reduces the computation of the network.
4. The YOLOv5-based multi-modal small target detection method of claim 1, wherein in the multi-modal model training of step (4), compared with single-modality target detection, an infrared modality is introduced as a complementary modality so as to improve target detection in complex environments.
5. The YOLOv5-based multi-modal small target detection method of claim 1, wherein in the multi-modal fusion based on the illumination-aware network in step (6), the result sets output under the visible-light and infrared modalities are finally weighted and fused according to the illumination perception coefficient of the visible-light image.
CN202110475048.8A 2021-04-29 2021-04-29 YOLOv 5-based multi-mode small target detection method Active CN113326735B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110475048.8A CN113326735B (en) 2021-04-29 2021-04-29 YOLOv 5-based multi-mode small target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110475048.8A CN113326735B (en) 2021-04-29 2021-04-29 YOLOv 5-based multi-mode small target detection method

Publications (2)

Publication Number Publication Date
CN113326735A CN113326735A (en) 2021-08-31
CN113326735B (en) 2023-11-28

Family

ID=77413991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110475048.8A Active CN113326735B (en) 2021-04-29 2021-04-29 YOLOv 5-based multi-mode small target detection method

Country Status (1)

Country Link
CN (1) CN113326735B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332778B (en) * 2022-03-08 2022-06-21 深圳市万物云科技有限公司 Intelligent alarm work order generation method and device based on people stream density and related medium
CN115205651A (en) * 2022-09-16 2022-10-18 南京工业大学 Low visibility road target detection method based on bimodal fusion
CN115631510B (en) * 2022-10-24 2023-07-04 智慧眼科技股份有限公司 Pedestrian re-identification method and device, computer equipment and storage medium
CN116012825A (en) * 2023-01-13 2023-04-25 上海赫立智能机器有限公司 Electronic component intelligent identification method based on multiple modes
CN117079245B (en) * 2023-07-05 2024-09-17 浙江工业大学 Traffic road target identification method based on wireless signals


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569697A (en) * 2018-08-31 2019-12-13 阿里巴巴集团控股有限公司 Method, device and equipment for detecting components of vehicle

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564097A (en) * 2017-12-05 2018-09-21 华南理工大学 A kind of multiscale target detection method based on depth convolutional neural networks
CN111209810A (en) * 2018-12-26 2020-05-29 浙江大学 Bounding box segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time in visible light and infrared images
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN111260594A (en) * 2019-12-22 2020-06-09 天津大学 Unsupervised multi-modal image fusion method
CN112203122A (en) * 2020-10-10 2021-01-08 腾讯科技(深圳)有限公司 Artificial intelligence-based similar video processing method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Improved YOLOv3 real-time vehicle detection algorithm combined with FPN; Li Gang et al.; Journal of Heilongjiang University of Technology (黑龙江工业学院学报); Vol. 20, No. 3, pp. 106-112 *

Also Published As

Publication number Publication date
CN113326735A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN111582092B (en) Pedestrian abnormal behavior detection method based on human skeleton
Wan et al. AFSar: An anchor-free SAR target detection algorithm based on multiscale enhancement representation learning
CN109543632A (en) A kind of deep layer network pedestrian detection method based on the guidance of shallow-layer Fusion Features
CN114612937B (en) Pedestrian detection method based on single-mode enhancement by combining infrared light and visible light
CN116452937A (en) Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism
CN110222718A (en) The method and device of image procossing
CN113361466B (en) Multispectral target detection method based on multi-mode cross guidance learning
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN109919246A (en) Pedestrian's recognition methods again based on self-adaptive features cluster and multiple risks fusion
CN110909656B (en) Pedestrian detection method and system integrating radar and camera
CN115631397A (en) Target detection method and device based on bimodal image
CN115527098A (en) Infrared small target detection method based on global mean contrast space attention
CN111898427A (en) Multispectral pedestrian detection method based on feature fusion deep neural network
CN115527159A (en) Counting system and method based on cross-modal scale attention aggregation features
CN112069997B (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net
CN113361475A (en) Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing
CN117689995A (en) Unknown spacecraft level detection method based on monocular image
CN116935356A (en) Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method
CN117173595A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7
CN116524314A (en) Unmanned aerial vehicle small target detection method based on anchor-free frame algorithm
Li et al. MASNet: Road semantic segmentation based on multi-scale modality fusion perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant