CN113326735B - YOLOv5-based multi-mode small target detection method - Google Patents

YOLOv5-based multi-mode small target detection method

Info

Publication number
CN113326735B
CN113326735B (application CN202110475048.8A)
Authority
CN
China
Prior art keywords
mode
network
illumination
loss
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110475048.8A
Other languages
Chinese (zh)
Other versions
CN113326735A (en)
Inventor
霍静
孙宏伟
李文斌
高阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Wanwei Aisi Network Intelligent Industry Innovation Center Co ltd
Nanjing University
Original Assignee
Jiangsu Wanwei Aisi Network Intelligent Industry Innovation Center Co ltd
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Wanwei Aisi Network Intelligent Industry Innovation Center Co ltd and Nanjing University
Priority to CN202110475048.8A
Publication of CN113326735A
Application granted
Publication of CN113326735B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection


Abstract

The invention discloses a YOLOv5-based multi-modal small target detection method, which mainly addresses the problem of jointly using infrared and visible-light images for target detection. The main steps are: constructing a lightweight illumination-aware network and using it to compute a perception coefficient for the visible-light modality image; and, based on the designed illumination-aware network, performing multi-modal fusion of the infrared-modality and visible-light-modality data under the YOLOv5 architecture. The method estimates the illumination perception coefficient of the visible-light modality image with the illumination-aware network and applies perception-weighted fusion to the outputs of the trained dual-modality detection network within the NMS algorithm. The method achieves good detection results on multi-modal datasets, and the model is robust in complex environments such as night scenes.

Description

YOLOv5-based multi-mode small target detection method
Technical Field
The invention discloses a YOLOv5-based multi-modal small target detection method and belongs to the field of computer vision.
Background
More and more researchers are focusing on improving the accuracy of target detection models by using multiple sensors. In complex environments, researchers usually exploit the complementarity of multi-modal data to improve model performance: because different sensors record information in different ways, the information across modalities is complementary. Commonly used sensors include infrared cameras, lidar, and depth cameras, which are less susceptible to the external environment.
In 2015, Hwang et al. published a multi-modal dataset at CVPR, named KAIST, with pedestrian detection as the application background; it provides aligned visible-light and infrared images. The KAIST dataset served as a benchmark that opened up the field of multi-modal object detection. Based on the KAIST dataset, Li et al. proposed an illumination-aware gated-fusion technique for multi-modal complementarity; the authors verified it experimentally on Faster R-CNN and analyzed fusion structures such as Input Fusion, Early Fusion, Halfway Fusion, and Late Fusion. Input Fusion operates at the data input layer: the visible-light image consists of red, green, and blue channels, while the infrared image is usually a single-channel grayscale image, so the two modalities are stacked into four channels, making the fusion simple to implement. Early Fusion fuses features at the bottom layers of the backbone network; it generally fuses only low-level semantic features and lacks fusion of high-level semantics. Halfway Fusion fuses features in the middle layers of the backbone, where features are better suited to fusion, but the model is harder to train. Late Fusion operates at the network output layer; it focuses on fusing detection results and is easy to implement for both training and deployment.
After Hwang, Lu et al. analyzed multi-modal fusion in further detail on the basis of Li's work. The authors argued that the drift of object coordinates between modalities must be considered during multi-modal fusion; for a trained model, they simulated drift in the visible-light modality at the inference stage to verify the influence of coordinate drift on model accuracy. The authors first manually corrected the object coordinates of the two modalities in the KAIST dataset and then proposed an RFA module to further correct the drift algorithmically and promote effective multi-modal fusion, although introducing the RFA module reduces inference speed. Yang et al. took SSD as the research framework, proposed a GFU-based multi-modal fusion unit, and applied multi-modal fusion to a one-stage detection framework. Heng et al. proposed a cyclic refinement fusion module and introduced a semantic supervision loss as an auxiliary strategy to make feature fusion more balanced. Zhou et al. further analyzed the problem on the basis of Lu's work, arguing that multi-modal fusion is affected by two imbalance factors, illumination and features, and proposed a feature-fusion module and an illumination-aware fusion module based on a differential-circuit idea on top of the SSD detection model.
From the research above, it can be seen that most works adopt Halfway Fusion for multi-modal target detection. This mode is complex to implement, the inconsistent feature distributions across modalities make model training more difficult, and it brings considerable difficulty to the deployment and application of target detection models.
Disclosure of Invention
The invention provides an algorithm specifically for multi-modal target detection in complex environments. The algorithm is based on a lightweight illumination-aware network and fuses the detection results of the visible-light and infrared modalities; that is, an illumination perception coefficient computed on the visible-light modality is introduced at the Late Fusion stage to weight the visible-light results.
The YOLOv5-based multi-modal small target detection method comprises the following steps:
step (1), collecting data for the scene to be applied and dividing it into a training set and a validation set;
step (2), scaling the illumination-aware network dataset and applying data augmentation to the multi-modal dataset;
step (3), designing an illumination-aware network and training it independently with a binary cross-entropy loss;
step (4), on the multi-modal dataset, training the visible-light modality and the infrared modality independently under the YOLOv5 detection framework;
step (5), integrating the independently trained illumination-aware model, visible-light model, and infrared model into the defined multi-modal network;
and step (6), computing the perception coefficient of the visible-light modality image with the illumination-aware network, weighting the output of the visible-light model by this perception coefficient, and finally fusing the dual-modality outputs and feeding them into the non-maximum suppression (NMS) algorithm.
The beneficial effects are as follows: the method estimates the illumination perception coefficient of the visible-light modality image with the illumination-aware network and applies perception-weighted fusion to the outputs of the trained dual-modality detection network within the NMS algorithm; it achieves good detection results on multi-modal datasets, and the model is robust in complex environments such as night scenes.
Drawings
FIG. 1 illustrates multi-modal target detection based on illumination-aware network fusion.
FIG. 2 is a multimodal fusion pseudocode based on a lighting aware network.
Detailed Description
The invention is described in further detail below with reference to the drawings and specific embodiments, so as to show its objects, features, and advantages in detail.
1. Illumination perception network based on Focus structure
The visible-light modality image is strongly affected by environmental illumination, especially at night. From the perspective of the algorithm, targets detected in the visible-light modality are therefore not fully reliable and may be missed or falsely detected, so a weighting coefficient is needed to evaluate the visible-light image.
The method borrows the Focus convolution structure from the YOLOv5 model and applies it to the definition of the illumination-aware network. Specifically, the Focus structure consists of a Conv layer with a 1×1 kernel; inside the Focus module, a 128×128 input image is sampled at intervals along both the horizontal and vertical directions, forming four 64×64 downsampled maps that are stacked into a 12-channel input. The result is then downsampled by a 2×2 pooling layer, a Dropout layer discards neurons with probability 0.2, and the resulting feature vector is fed into a Linear layer for prediction, with a softmax function applied at the tail of the network.
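For illustration, a minimal PyTorch sketch of such an illumination-aware network follows, assuming a 128×128 three-channel visible-light input; the number of convolution output channels, the class names, and the interpretation of the first output as the daytime score are illustrative assumptions not fixed by the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Focus(nn.Module):
    """Space-to-depth slicing as in YOLOv5: 3x128x128 -> 12x64x64, then a 1x1 Conv."""
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * 4, out_ch, kernel_size=1)

    def forward(self, x):
        # Interval sampling along height and width produces four half-resolution maps.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)

class IlluminationNet(nn.Module):
    """Lightweight illumination-aware classifier (two softmax outputs w1, w2)."""
    def __init__(self):
        super().__init__()
        self.focus = Focus(3, 16)
        self.pool = nn.MaxPool2d(2)           # 64x64 -> 32x32
        self.dropout = nn.Dropout(p=0.2)      # drop neurons with probability 0.2
        self.fc = nn.Linear(16 * 32 * 32, 2)  # assumed: w1 = daytime score, w2 = nighttime score

    def forward(self, x):
        x = self.pool(self.focus(x))
        x = self.dropout(torch.flatten(x, 1))
        return F.softmax(self.fc(x), dim=1)

# Usage: w = IlluminationNet()(torch.randn(1, 3, 128, 128))  # shape (1, 2)
```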
The illumination perception coefficient of the visible-light modality is computed as

w' = (w + μ) / (1 + k·μ),    ε = w'_1

where w is the output vector of the illumination-aware network, consisting of the two elements w_1 and w_2; μ is a smoothing factor; k is the number of label categories; w' is the smoothed vector; and ε is the resulting perception coefficient, i.e. the first element of w' is taken for the assignment.
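As a sketch, the coefficient could be computed from the network output as follows, under the assumption that the smoothing takes the additive form written above; the default values of mu and k are illustrative.

```python
import torch

def perception_coefficient(w: torch.Tensor, mu: float = 0.1, k: int = 2) -> float:
    """w: softmax output of the illumination-aware network, shape (2,).
    Returns the smoothed first element, used as the weighting coefficient epsilon."""
    w_smooth = (w + mu) / (w.sum() + k * mu)  # additive smoothing; w.sum() == 1 after softmax
    return float(w_smooth[0])
```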
2. Multi-mode fusion based on illumination perception coefficient
The invention realizes multi-modal information fusion on top of the YOLOv5 target detection architecture. As shown in FIG. 1, the illumination-aware multi-modal detection fusion architecture consists of an illumination-aware network and a dual-modality fusion network.
First, the overall loss function of the illumination-aware multi-modal detection algorithm is defined as

L = Σ_{m ∈ {visible, lwir}} ( γ_0·L_obj^m + γ_1·L_cls^m + γ_2·L_box^m ) + L_aware

where visible denotes the visible-light modality, lwir the infrared modality, and L_aware the training loss of the illumination-aware network. The loss of each modality consists of three parts, L_obj, L_cls, and L_box, and γ_0, γ_1, γ_2 are hyperparameters balancing the three losses. The loss of the illumination-aware network is defined as follows:
L_aware = -x'_d · log(x_d) - x'_n · log(x_n)
where x_d and x_n are the true labels for daytime and nighttime, respectively, and x'_d and x'_n are the corresponding outputs of the illumination-aware network.
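A minimal training sketch for the illumination-aware network of step (3); it assumes the IlluminationNet class from the sketch in Section 1, and the optimizer, learning rate, label convention, and dummy data loader are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = IlluminationNet()                                   # from the sketch in Section 1
criterion = nn.NLLLoss()                                    # cross-entropy on log-probabilities
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # assumed optimizer and learning rate

# Dummy loader: batches of 128x128 visible-light crops with labels 0 = day, 1 = night.
loader = [(torch.randn(8, 3, 128, 128), torch.randint(0, 2, (8,)))]

for images, labels in loader:
    probs = model(images)                                   # softmax outputs (B, 2)
    loss = criterion(torch.log(probs + 1e-8), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```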
The objectness (foreground/background) loss and the class-classification loss both use the same cross-entropy formulation, similar to the illumination-aware loss, and are defined as

L = -(1/n) · Σ_{i=1}^{n} w_i · [ y_i·log σ(x_i) + (1 - y_i)·log(1 - σ(x_i)) ]

where n is the number of samples, w_i is the loss weight coefficient of the i-th sample, x_i is the network output at the i-th sample point, y_i is the true label of the i-th sample point, and σ(·) is the Sigmoid activation function.
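A short sketch of this weighted binary cross-entropy, with the 1/n averaging assumed as written above:

```python
import torch

def weighted_bce(x: torch.Tensor, y: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x: raw network outputs, y: 0/1 targets, w: per-sample weights, all of shape (n,)."""
    p = torch.sigmoid(x)
    return -(w * (y * torch.log(p + 1e-8) + (1 - y) * torch.log(1 - p + 1e-8))).mean()
```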
The position regression loss uses the CIoU loss, defined as

L_box = 1 - IoU + ρ²(b, b^gt) / c² + α·v

where ρ(·) is the Euclidean distance, b and b^gt are the center points of the predicted box and the ground-truth box, c is the diagonal length of the smallest rectangle enclosing both boxes, α is a trade-off parameter, and v measures the consistency of the aspect ratios.
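For reference, a sketch of the CIoU box loss for axis-aligned boxes in (x1, y1, x2, y2) format; it follows the standard CIoU definition and is not claimed to match any patent-specific variant.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """pred, target: tensors of shape (N, 4) in (x1, y1, x2, y2) format."""
    # Intersection and IoU
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # rho^2: squared distance between box centers
    cxp = (pred[:, 0] + pred[:, 2]) / 2; cyp = (pred[:, 1] + pred[:, 3]) / 2
    cxt = (target[:, 0] + target[:, 2]) / 2; cyt = (target[:, 1] + target[:, 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2

    # c^2: squared diagonal of the smallest enclosing box
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # Aspect-ratio consistency term v and trade-off parameter alpha
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return (1 - iou + rho2 / c2 + alpha * v).mean()
```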
As shown in FIG. 2, the multi-modal fusion based on the illumination-aware network proceeds as follows: for the result sets A and B output by the two modalities and the corresponding confidence sets R and S, the perception coefficient ε of the current visible-light image is first obtained from the illumination-aware network and the perception-coefficient formula; before fusion, the confidences output by the visible-light modality are multiplied by ε, and the combined results are then fed into the non-maximum suppression algorithm for fusion.
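A minimal sketch of this fusion step, assuming detections are given as (x1, y1, x2, y2, score, class) rows and using torchvision's batched NMS; the function name and tensor layout are illustrative assumptions.

```python
import torch
from torchvision.ops import batched_nms

def fuse_detections(det_visible: torch.Tensor, det_infrared: torch.Tensor,
                    epsilon: float, iou_thr: float = 0.5) -> torch.Tensor:
    """det_*: tensors of shape (N, 6) = (x1, y1, x2, y2, score, class).
    epsilon: illumination perception coefficient of the current visible-light image."""
    det_visible = det_visible.clone()
    det_visible[:, 4] *= epsilon                 # down-weight visible-light confidences
    dets = torch.cat([det_visible, det_infrared], dim=0)
    keep = batched_nms(dets[:, :4], dets[:, 4], dets[:, 5].long(), iou_thr)
    return dets[keep]
```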
The preferred embodiments of the present invention have been described in detail above, but the invention is not limited to the specific details of these embodiments; various equivalent modifications can be made to the technical solution within the scope of the technical concept of the invention, and all such equivalent modifications fall within the protection scope of the invention.

Claims (5)

1. A YOLOv5-based multi-modal small target detection method, specifically comprising the following steps:
step (1), collecting data for the scene to be applied and dividing it into a training set and a validation set;
step (2), scaling the illumination-aware network dataset and applying data augmentation to the multi-modal dataset;
step (3), designing an illumination-aware network and training it independently with a binary cross-entropy loss;
step (4), on the multi-modal dataset, training the visible-light modality and the infrared modality independently under the YOLOv5 detection framework;
step (5), integrating the independently trained illumination-aware model, visible-light model, and infrared model into the defined multi-modal network; in the training strategy of the multi-modal illumination-aware fusion model, the YOLOv5 target detection algorithm serves as the multi-modal fusion architecture, the illumination-aware network is introduced, and the overall loss function of the illumination-aware multi-modal detection algorithm is defined as

L = Σ_{m ∈ M} ( γ_0·L_obj^m + γ_1·L_cls^m + γ_2·L_box^m ) + L_aware

where M is the modality set comprising two elements, the visible-light modality and the infrared modality, L_aware is the training loss of the illumination-aware network, the loss of each modality consists of the three parts L_obj, L_cls, and L_box, and γ_0, γ_1, γ_2 are hyperparameters balancing the three losses; L_aware is defined as follows:
L_aware = -x'_d · log(x_d) - x'_n · log(x_n)
where x_d and x_n are the true labels for daytime and nighttime, respectively, and x'_d and x'_n are the corresponding outputs of the illumination-aware network;
the cross entropy loss architecture definition is uniformly used for the front background loss and the back background loss and the category classification loss, and is similar to the illumination perception loss, and the specific definition is as follows:
where n represents the number of samples, w i A loss weight coefficient, x, representing the ith sample i Network output representing the ith sample point, y i The true label value representing the ith sample point, σ (·) is the Sigmoid activation function;
the loss function was defined as follows using CIoU loss for position regression loss calculation:
wherein ρ is 2 (. Cndot.) is the Euclidean distance calculation, b gt Respectively representing the coordinates of the central points of the object BBox, and c represents BBox and BBox gt The diagonal distance of the minimum circumscribed rectangle, alpha is used as a track-off parameter, and v is used for measuring an aspect ratio consistency parameter;
and step (6), computing the perception coefficient of the visible-light modality image with the illumination-aware network, weighting the output of the visible-light model by this perception coefficient, and finally fusing the dual-modality outputs and feeding them into the non-maximum suppression algorithm.
2. The YOLOv5-based multi-modal small target detection method of claim 1, wherein the dataset division in step (1) involves two types of datasets: the first is the illumination-aware network dataset and the second is the multi-modal detection dataset.
3. The YOLOv5-based multi-modal small target detection method of claim 1, wherein in step (3) a lightweight illumination-aware network is designed; in addition to the Conv and Linear structures, a Focus structure is introduced at the head of the illumination-aware network, and the Focus structure samples the input image at intervals in both directions, increasing the number of input channels while reducing the image size, which effectively reduces the computation of the network.
4. The YOLOv5-based multi-modal small target detection method of claim 1, wherein in the multi-modal model training of step (4), compared with single-modality target detection, an infrared modality is introduced as a complementary modality so as to improve target detection in complex environments.
5. The YOLOv5-based multi-modal small target detection method of claim 1, wherein in the multi-modal fusion based on the illumination-aware network in step (6), the result sets output under the visible-light and infrared modalities are finally weighted and fused according to the illumination perception coefficient of the visible-light image.
CN202110475048.8A 2021-04-29 2021-04-29 YOLOv 5-based multi-mode small target detection method Active CN113326735B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110475048.8A CN113326735B (en) 2021-04-29 2021-04-29 YOLOv 5-based multi-mode small target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110475048.8A CN113326735B (en) 2021-04-29 2021-04-29 YOLOv 5-based multi-mode small target detection method

Publications (2)

Publication Number Publication Date
CN113326735A CN113326735A (en) 2021-08-31
CN113326735B (en) 2023-11-28

Family

ID=77413991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110475048.8A Active CN113326735B (en) 2021-04-29 2021-04-29 YOLOv 5-based multi-mode small target detection method

Country Status (1)

Country Link
CN (1) CN113326735B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332778B (en) * 2022-03-08 2022-06-21 深圳市万物云科技有限公司 Intelligent alarm work order generation method and device based on people stream density and related medium
CN115205651A (en) * 2022-09-16 2022-10-18 南京工业大学 Low visibility road target detection method based on bimodal fusion
CN115631510B (en) * 2022-10-24 2023-07-04 智慧眼科技股份有限公司 Pedestrian re-identification method and device, computer equipment and storage medium
CN116012825A (en) * 2023-01-13 2023-04-25 上海赫立智能机器有限公司 Electronic component intelligent identification method based on multiple modes
CN117079245B (en) * 2023-07-05 2024-09-17 浙江工业大学 Traffic road target identification method based on wireless signals


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569697A (en) * 2018-08-31 2019-12-13 阿里巴巴集团控股有限公司 Method, device and equipment for detecting components of vehicle

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564097A (en) * 2017-12-05 2018-09-21 华南理工大学 A kind of multiscale target detection method based on depth convolutional neural networks
CN111209810A (en) * 2018-12-26 2020-05-29 浙江大学 Bounding box segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time in visible light and infrared images
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN111260594A (en) * 2019-12-22 2020-06-09 天津大学 Unsupervised multi-modal image fusion method
CN112203122A (en) * 2020-10-10 2021-01-08 腾讯科技(深圳)有限公司 Artificial intelligence-based similar video processing method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Improved YOLOv3 real-time vehicle detection algorithm combined with FPN; Li Gang et al.; Journal of Heilongjiang University of Technology (黑龙江工业学院学报); Vol. 20, No. 3, pp. 106-112 *

Also Published As

Publication number Publication date
CN113326735A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN111582092B (en) Pedestrian abnormal behavior detection method based on human skeleton
Wan et al. AFSar: An anchor-free SAR target detection algorithm based on multiscale enhancement representation learning
CN109543632A (en) A kind of deep layer network pedestrian detection method based on the guidance of shallow-layer Fusion Features
CN114612937B (en) Pedestrian detection method based on single-mode enhancement by combining infrared light and visible light
CN116452937A (en) Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism
CN110222718A (en) The method and device of image procossing
CN113361466B (en) Multispectral target detection method based on multi-mode cross guidance learning
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN109919246A (en) Pedestrian's recognition methods again based on self-adaptive features cluster and multiple risks fusion
CN110909656B (en) Pedestrian detection method and system integrating radar and camera
CN115631397A (en) Target detection method and device based on bimodal image
CN115527098A (en) Infrared small target detection method based on global mean contrast space attention
CN111898427A (en) Multispectral pedestrian detection method based on feature fusion deep neural network
CN115527159A (en) Counting system and method based on cross-modal scale attention aggregation features
CN112069997B (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net
CN113361475A (en) Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing
CN117689995A (en) Unknown spacecraft level detection method based on monocular image
CN116935356A (en) Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method
CN117173595A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7
CN116524314A (en) Unmanned aerial vehicle small target detection method based on anchor-free frame algorithm
Li et al. MASNet: Road semantic segmentation based on multi-scale modality fusion perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant