CN113326735B - YOLOv5-based multi-mode small target detection method
- Publication number
- CN113326735B (application CN202110475048.8A)
- Authority
- CN
- China
- Prior art keywords
- mode
- network
- illumination
- loss
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses a YOLOv5-based multi-modal small target detection method, which addresses the problem of jointly using infrared and visible-light images for target detection. The method mainly comprises the following steps: constructing a lightweight illumination perception network and using it to compute a perception coefficient for the visible-light mode image; and, based on the designed illumination perception network, fusing infrared-mode and visible-light-mode data under the YOLOv5 architecture. The illumination perception network estimates the illumination perception coefficient of the visible-light image, and the outputs of the trained bimodal target detection networks are fused with perception-weighted confidences in the NMS algorithm. The method achieves good detection results on a multi-modal dataset, and the model is robust in complex environments such as night scenes.
Description
Technical Field
The invention discloses a YOLOv5-based multi-modal small target detection method and belongs to the field of computer vision.
Background
More and more researchers focus on improving the accuracy of target detection models by using multiple sensors. In complex environments, researchers typically exploit the complementary nature of multi-modal data to improve model performance: different sensors record information in different modalities, and because the sensors differ, the information across modalities is complementary. Common sensors include infrared cameras, lidar and depth cameras, which are less easily affected by the external environment.
In 2015, Hwang et al. published a multi-modal dataset at CVPR with pedestrian detection as the application background, providing aligned visible-light and infrared image pairs; the dataset is named KAIST. The KAIST dataset, proposed as a benchmark, opened the door to the field of multi-modal object detection. Based on the KAIST dataset, Li et al. proposed an illumination-aware gated fusion technique for multi-modal complementarity, verified it experimentally on Faster R-CNN, and analyzed fusion structures such as Input Fusion, Early Fusion, Halfway Fusion and Late Fusion. Input Fusion operates at the data input layer: the visible-light image consists of red, green and blue channels, the infrared image is generally a single-channel grayscale image, and the two are concatenated into a four-channel input, so this fusion is simple to implement (see the sketch below). Early Fusion operates in the lower layers of the backbone network and fuses low-level semantic features, but lacks fusion of high-level semantic features. Halfway Fusion operates in the middle layers of the backbone, where features are better suited to fusion, but it is difficult to train. Late Fusion operates at the network output layer; it focuses on fusing results and is easy to implement in both model training and deployment.
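For illustration only, a minimal sketch of the Input Fusion concatenation described in this review (not the method of the present invention); the tensor sizes are illustrative assumptions:

```python
import torch

# Visible-light image: three channels (R, G, B); infrared image: one grayscale channel.
rgb = torch.rand(1, 3, 512, 640)   # illustrative image size
ir = torch.rand(1, 1, 512, 640)
# Input Fusion: stack the two modalities into a single four-channel input tensor.
fused_input = torch.cat([rgb, ir], dim=1)   # shape: (1, 4, 512, 640)
```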
Following Hwang, Lu et al. analyzed multi-modal fusion in more detail on the basis of Li's work. The authors argue that object-coordinate drift between modalities must be considered during multi-modal fusion; for a trained model, they simulate drift in the visible-light modality at inference time to verify its influence on model accuracy. They first manually corrected object coordinates across the two modalities of the KAIST dataset, and proposed an RFA module to further correct the algorithm and promote effective multi-modal fusion, although introducing the RFA module reduces model inference speed. Yang et al. used SSD as the research framework and proposed a GFU-based multi-modal fusion unit, applying multi-modal fusion to a one-stage target detection framework. Heng et al. proposed a cyclic refinement fusion module and introduced a semantic supervision loss as an auxiliary strategy to make feature fusion more balanced. Zhou et al. built on Lu's analysis, arguing that multi-modal fusion is affected by two imbalance factors, illumination and features, and proposed a feature fusion module and an illumination-aware fusion module, based on a differential-circuit idea, on top of the SSD detection model.
From the above work it can be seen that most researchers adopt Halfway Fusion for multi-modal target detection. This approach is complex to implement, and the inconsistent feature distributions across modalities make model training more difficult, which in turn complicates the deployment and application of the target detection model.
Disclosure of Invention
The invention provides an algorithm specifically aimed at multi-modal target detection in complex environments. The algorithm is based on a lightweight illumination perception network and fuses the detection results of the visible-light and infrared modalities: an illumination perception coefficient for the visible-light modality is introduced at the late fusion stage to weight the visible-light outputs.
The YOLOv5-based multi-modal small target detection method comprises the following steps:
step (1), acquiring data from the scene to be applied and dividing it into a training set and a validation set;
step (2), scaling the illumination perception network dataset and performing data augmentation on the multi-modal dataset;
step (3), designing the illumination perception network and training it independently with a binary cross-entropy loss;
step (4), on the multi-modal dataset, independently training a visible-light-mode model and an infrared-mode model based on the YOLOv5 detection framework;
step (5), integrating the independently trained illumination perception model, visible-light model and infrared model into the defined multi-modal network;
and step (6), computing the perception coefficient of the visible-light image through the illumination perception network, weighting the final outputs of the visible-light model with this coefficient, and finally fusing the bimodal outputs and feeding them into the non-maximum suppression algorithm.
The beneficial effects are that: the illumination perception network estimates the illumination perception coefficient of the visible-light image, and the outputs of the trained bimodal target detection networks are fused with perception-weighted confidences in the NMS algorithm. The method achieves good detection results on a multi-modal dataset, and the model is robust in complex environments such as night scenes.
Drawings
FIG. 1 illustrates multi-modal target detection based on illumination-aware network fusion.
FIG. 2 is a multimodal fusion pseudocode based on a lighting aware network.
Detailed Description
The invention will be described in further detail with reference to the drawings and specific embodiments thereof, for the purpose of showing in detail the objects, features, and advantages of the present invention.
1. Illumination perception network based on Focus structure
The visible-light image is strongly affected by environmental conditions such as illumination, especially at night. From the model's perspective, targets detected in the visible-light modality are therefore not completely reliable and suffer from missed or false detections, so a weighting coefficient is needed to evaluate the visible-light image.
The method borrows the Focus convolution structure from the YOLOv5 model and applies it in the definition of the illumination perception network. Specifically, the Focus structure consists of a Conv convolution with a 1×1 kernel. Inside the Focus module, a 128×128 input image is sampled at intervals along both the horizontal and vertical directions, forming four 64×64 downsampled maps, which are stacked together into an input with 12 channels. The result is then downsampled by a 2×2 pooling layer, a Dropout layer discards neuron nodes with probability 0.2, and the resulting feature vector is fed into a Linear layer for prediction; the tail of the network is processed with a softmax function. A sketch of this structure is given below.
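A minimal PyTorch sketch of the structure described above. The Focus slicing, the 1×1 convolution, the 2×2 pooling, the Dropout rate of 0.2, the Linear prediction head and the tail softmax follow the text; the hidden channel width, the choice of max pooling and the ReLU activation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IlluminationAwareNet(nn.Module):
    """Sketch of the lightweight illumination perception network (assumed layer sizes)."""
    def __init__(self, hidden=32):
        super().__init__()
        # Focus-style slicing turns a 3x128x128 image into 12x64x64,
        # then a 1x1 convolution mixes the stacked channels.
        self.conv = nn.Conv2d(12, hidden, kernel_size=1)
        self.pool = nn.MaxPool2d(2)                 # 64x64 -> 32x32
        self.drop = nn.Dropout(p=0.2)
        self.fc = nn.Linear(hidden * 32 * 32, 2)    # two outputs: daytime / night

    def forward(self, x):                           # x: (N, 3, 128, 128)
        # Interval (every-other-pixel) sampling along both axes -> four 64x64 maps.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)  # (N, 12, 64, 64)
        x = self.pool(torch.relu(self.conv(x)))
        x = self.drop(torch.flatten(x, 1))
        w = torch.softmax(self.fc(x), dim=1)        # w = (w1, w2)
        return w
```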
The illumination perception coefficient of the visible-light image is computed as follows:
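A plausible form, reconstructed from the variable definitions below under the assumption of a standard label-smoothing step over the softmax output (the original formula is not reproduced here):

$$ w' = (1 - \mu)\,w + \frac{\mu}{k}, \qquad \varepsilon = w'_{1} $$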
where w denotes the output vector of the illumination perception network, composed of the two elements w₁ and w₂; μ is a smoothing factor; k is the number of label categories; w′ is the smoothed vector; and ε is the resulting perception coefficient, obtained by taking the first element of w′.
2. Multi-mode fusion based on illumination perception coefficient
The invention realizes multi-modal information fusion on top of the YOLOv5 target detection architecture. As shown in FIG. 1, the illumination-perception-based multi-modal target detection fusion architecture consists of an illumination perception network and a bimodal fusion network.
First, the overall loss function of the illumination-perception-fusion-based multi-modal detection algorithm is defined as follows:
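A plausible form, reconstructed from the definitions that follow (the original formula is not reproduced here):

$$ L_{total} = \sum_{m \in \{visible,\ lwir\}} \left( \gamma_{0} L_{obj}^{m} + \gamma_{1} L_{cls}^{m} + \gamma_{2} L_{box}^{m} \right) + L_{aware} $$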
where visible denotes the training loss of the visible-light modality, lwir denotes the training loss of the infrared modality, and L_aware is the training loss of the illumination perception network. Each modality's loss is composed of the three parts L_obj, L_cls and L_box, balanced by the hyperparameters γ₀, γ₁ and γ₂, respectively. The loss of the illumination perception network is defined as follows:
$$ L_{aware} = -x'_{d} \log(x_{d}) - x'_{n} \log(x_{n}) $$
where x_d and x_n are the ground-truth labels for daytime and nighttime, respectively, and x′_d and x′_n are the corresponding output values of the illumination perception network.
The objectness (foreground-background) loss and the category classification loss both use a cross-entropy formulation similar to the illumination perception loss, specifically defined as follows:
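A plausible weighted binary cross-entropy consistent with the symbols defined below (the original formula is not reproduced here):

$$ L = -\frac{1}{n} \sum_{i=1}^{n} w_{i} \left[ y_{i} \log \sigma(x_{i}) + (1 - y_{i}) \log\bigl(1 - \sigma(x_{i})\bigr) \right] $$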
where n is the number of samples, w_i is the loss weight coefficient of the i-th sample, x_i is the network output for the i-th sample, y_i is the ground-truth label of the i-th sample, and σ(·) is the Sigmoid activation function.
The position regression loss uses the CIoU loss, defined as follows:
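The original formula is not reproduced here; the standard CIoU loss it references is:

$$ L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v, \qquad v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}, \qquad \alpha = \frac{v}{(1 - IoU) + v} $$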
where ρ²(·) denotes the squared Euclidean distance, b and b^gt denote the center points of the predicted box (BBox) and the ground-truth box (BBox^gt), respectively, and c is the diagonal length of the smallest rectangle enclosing BBox and BBox^gt. α is a trade-off parameter and v measures aspect-ratio consistency.
FIG. 2 shows the multi-modal fusion pseudocode based on the illumination perception network. Given the result sets A and B output by the two modalities and their corresponding confidence sets R and S, the current perception coefficient ε of the visible-light image is first obtained through the illumination perception network and the perception-coefficient formula; before fusion, the confidences output by the visible-light modality are multiplied by ε, and the combined results are then fed into the non-maximum suppression algorithm for fusion. A sketch of this procedure is given below.
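A minimal PyTorch sketch of this weighted late-fusion step, assuming torchvision's NMS implementation; the IoU threshold, per-image (rather than per-class) NMS and tensor layouts are illustrative assumptions not specified above.

```python
import torch
from torchvision.ops import nms

def fuse_bimodal_detections(boxes_vis, scores_vis, boxes_ir, scores_ir,
                            epsilon, iou_thresh=0.5):
    """Illumination-aware late fusion of bimodal detections (cf. FIG. 2).

    boxes_*  : (N, 4) tensors in (x1, y1, x2, y2) format
    scores_* : (N,) confidence tensors
    epsilon  : perception coefficient of the current visible-light image
    """
    # Weight the visible-light confidences by the perception coefficient.
    scores_vis = scores_vis * epsilon
    # Merge the result sets of both modalities and resolve duplicates with NMS.
    boxes = torch.cat([boxes_vis, boxes_ir], dim=0)
    scores = torch.cat([scores_vis, scores_ir], dim=0)
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]
```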
The preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to the specific details of the above embodiments, and various equivalent changes can be made to the technical solution of the present invention within the scope of the technical concept of the present invention, and all the equivalent changes belong to the protection scope of the present invention.
Claims (5)
1. A method for detecting a multi-mode small target based on YOLOv5 specifically comprises the following steps:
step (1), acquiring data from the scene to be applied and dividing it into a training set and a validation set;
step (2), scaling the illumination perception network dataset and performing data augmentation on the multi-modal dataset;
step (3), designing the illumination perception network and training it independently with a binary cross-entropy loss;
step (4), on the multi-modal dataset, independently training a visible-light-mode model and an infrared-mode model based on the YOLOv5 detection framework;
step (5), integrating the independently trained illumination perception model, visible-light model and infrared model into the defined multi-modal network; in the training strategy of the multi-modal illumination-perception fusion model, the YOLOv5 target detection algorithm is taken as the multi-modal fusion architecture, the illumination perception network is introduced, and the total loss function of the illumination-perception-fusion-based multi-modal detection algorithm is defined by the formula:
where M is the set of modalities, comprising two elements, the visible-light mode and the infrared mode; L_aware is the training loss of the illumination perception network; each modality's loss is composed of the three parts L_obj, L_cls and L_box, balanced by the hyperparameters γ₀, γ₁ and γ₂, respectively; the illumination perception loss is defined as follows:
$$ L_{aware} = -x'_{d} \log(x_{d}) - x'_{n} \log(x_{n}) $$
where x_d and x_n are the ground-truth labels for daytime and nighttime, respectively, and x′_d and x′_n are the corresponding output values of the illumination perception network;
the cross entropy loss architecture definition is uniformly used for the front background loss and the back background loss and the category classification loss, and is similar to the illumination perception loss, and the specific definition is as follows:
where n is the number of samples, w_i is the loss weight coefficient of the i-th sample, x_i is the network output for the i-th sample, y_i is the ground-truth label of the i-th sample, and σ(·) is the Sigmoid activation function;
the loss function was defined as follows using CIoU loss for position regression loss calculation:
where ρ²(·) denotes the squared Euclidean distance, b and b^gt denote the center points of the predicted box (BBox) and the ground-truth box (BBox^gt), respectively, and c is the diagonal length of the smallest rectangle enclosing BBox and BBox^gt; α is a trade-off parameter and v measures aspect-ratio consistency;
and step (6), computing the perception coefficient of the visible-light image through the illumination perception network, weighting the final outputs of the visible-light model with this coefficient, and finally fusing the bimodal outputs and feeding them into the non-maximum suppression algorithm.
2. The YOLOv 5-based multi-modal small target detection method of claim 1, wherein: in the data set dividing process in the step (1), two types of data sets are involved; the first is an illumination-aware network dataset and the second is a multi-modal detection dataset.
3. The YOLOv 5-based multi-modal small target detection method of claim 1, wherein: and (3) designing a light illumination sensing network, and introducing a Focus structure into the head of the illumination sensing network when the illumination sensing network has Conv and Linear structures, wherein the Focus structure samples an input image at intervals up and down, increases an input channel and reduces the image size at the same time, so that the network calculation amount is effectively reduced.
4. The YOLOv 5-based multi-modal small target detection method of claim 1, wherein: and (3) training the multi-mode model in the step (4), compared with single-mode target detection, introducing an infrared mode as a complementary mode so as to promote target detection in a complex environment.
5. The YOLOv 5-based multi-modal small target detection method of claim 1, wherein: and (6) based on multi-mode fusion of the illumination sensing network, finally, the result set output under the visible light mode and the infrared mode is subjected to weighted fusion according to the illumination sensing coefficient under the visible light image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110475048.8A CN113326735B (en) | 2021-04-29 | 2021-04-29 | YOLOv5-based multi-mode small target detection method
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110475048.8A CN113326735B (en) | 2021-04-29 | 2021-04-29 | YOLOv5-based multi-mode small target detection method
Publications (2)
Publication Number | Publication Date |
---|---|
CN113326735A CN113326735A (en) | 2021-08-31 |
CN113326735B (en) | 2023-11-28
Family
ID=77413991
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110475048.8A Active CN113326735B (en) | 2021-04-29 | 2021-04-29 | YOLOv 5-based multi-mode small target detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113326735B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114332778B (en) * | 2022-03-08 | 2022-06-21 | 深圳市万物云科技有限公司 | Intelligent alarm work order generation method and device based on people stream density and related medium |
CN115205651A (en) * | 2022-09-16 | 2022-10-18 | 南京工业大学 | Low visibility road target detection method based on bimodal fusion |
CN115631510B (en) * | 2022-10-24 | 2023-07-04 | 智慧眼科技股份有限公司 | Pedestrian re-identification method and device, computer equipment and storage medium |
CN116012825A (en) * | 2023-01-13 | 2023-04-25 | 上海赫立智能机器有限公司 | Electronic component intelligent identification method based on multiple modes |
CN117079245B (en) * | 2023-07-05 | 2024-09-17 | 浙江工业大学 | Traffic road target identification method based on wireless signals |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108564097A (en) * | 2017-12-05 | 2018-09-21 | 华南理工大学 | A kind of multiscale target detection method based on depth convolutional neural networks |
CN110322423A (en) * | 2019-04-29 | 2019-10-11 | 天津大学 | A kind of multi-modality images object detection method based on image co-registration |
CN111209810A (en) * | 2018-12-26 | 2020-05-29 | 浙江大学 | Bounding box segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time in visible light and infrared images |
CN111260594A (en) * | 2019-12-22 | 2020-06-09 | 天津大学 | Unsupervised multi-modal image fusion method |
CN112203122A (en) * | 2020-10-10 | 2021-01-08 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based similar video processing method and device and electronic equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569697A (en) * | 2018-08-31 | 2019-12-13 | 阿里巴巴集团控股有限公司 | Method, device and equipment for detecting components of vehicle |
- 2021-04-29: application CN202110475048.8A granted as CN113326735B (CN, active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108564097A (en) * | 2017-12-05 | 2018-09-21 | 华南理工大学 | A kind of multiscale target detection method based on depth convolutional neural networks |
CN111209810A (en) * | 2018-12-26 | 2020-05-29 | 浙江大学 | Bounding box segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time in visible light and infrared images |
CN110322423A (en) * | 2019-04-29 | 2019-10-11 | 天津大学 | A kind of multi-modality images object detection method based on image co-registration |
CN111260594A (en) * | 2019-12-22 | 2020-06-09 | 天津大学 | Unsupervised multi-modal image fusion method |
CN112203122A (en) * | 2020-10-10 | 2021-01-08 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based similar video processing method and device and electronic equipment |
Non-Patent Citations (1)
Title |
---|
Improved YOLOv3 real-time vehicle detection algorithm combining FPN; Li Gang et al.; 《黑龙江工业学院学报》; Vol. 20, No. 3; pp. 106-112 *
Also Published As
Publication number | Publication date |
---|---|
CN113326735A (en) | 2021-08-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113326735B (en) | YOLOv5-based multi-mode small target detection method | |
CN113065558B (en) | Lightweight small target detection method combined with attention mechanism | |
CN109584248B (en) | Infrared target instance segmentation method based on feature fusion and dense connection network | |
CN113420607A (en) | Multi-scale target detection and identification method for unmanned aerial vehicle | |
CN111582092B (en) | Pedestrian abnormal behavior detection method based on human skeleton | |
Wan et al. | AFSar: An anchor-free SAR target detection algorithm based on multiscale enhancement representation learning | |
CN109543632A (en) | A kind of deep layer network pedestrian detection method based on the guidance of shallow-layer Fusion Features | |
CN114612937B (en) | Pedestrian detection method based on single-mode enhancement by combining infrared light and visible light | |
CN116452937A (en) | Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism | |
CN110222718A (en) | The method and device of image procossing | |
CN113361466B (en) | Multispectral target detection method based on multi-mode cross guidance learning | |
CN116612468A (en) | Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism | |
CN109919246A (en) | Pedestrian's recognition methods again based on self-adaptive features cluster and multiple risks fusion | |
CN110909656B (en) | Pedestrian detection method and system integrating radar and camera | |
CN115631397A (en) | Target detection method and device based on bimodal image | |
CN115527098A (en) | Infrared small target detection method based on global mean contrast space attention | |
CN111898427A (en) | Multispectral pedestrian detection method based on feature fusion deep neural network | |
CN115527159A (en) | Counting system and method based on cross-modal scale attention aggregation features | |
CN112069997B (en) | Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net | |
CN113361475A (en) | Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing | |
CN117689995A (en) | Unknown spacecraft level detection method based on monocular image | |
CN116935356A (en) | Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method | |
CN117173595A (en) | Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7 | |
CN116524314A (en) | Unmanned aerial vehicle small target detection method based on anchor-free frame algorithm | |
Li et al. | MASNet: Road semantic segmentation based on multi-scale modality fusion perception |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |