CN116229217A - Infrared target detection method applied to complex environment - Google Patents
Infrared target detection method applied to complex environment
- Publication number
- CN116229217A (application CN202310369678.6A)
- Authority
- CN
- China
- Prior art keywords
- feature
- module
- infrared
- feature map
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4038—Image mosaicing, e.g. composing plane images from plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10048—Infrared image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention belongs to the field of infrared target detection, and particularly relates to an infrared target detection method applied to complex environments, comprising the following steps: acquiring an infrared image to be detected and preprocessing it; extracting features of different scales from the infrared image with a backbone feature extraction network; performing enhanced fusion processing on the multi-scale features with a neck enhanced feature extraction network to obtain fused feature maps; and inputting the fused feature maps into a prediction output network to obtain the target detection result. The method effectively improves infrared target detection accuracy, performs well on infrared targets that are easily occluded in complex scenes, significantly reduces the parameter count, and meets real-time detection requirements.
Description
Technical Field
The invention belongs to the field of infrared target detection, and particularly relates to an infrared target detection method applied to a complex environment.
Background
Infrared images are formed from thermal radiation and offer outstanding advantages such as long target detection range, strong concealment, and availability both day and night. As the range of infrared imaging expands, the demand for intelligent target detection in infrared images keeps growing. Conventional infrared target detection methods include threshold-based and edge-detection-based approaches, but such methods are only suitable for a single, simple scene. The complexity of real environments and the weak features of infrared targets make accurate detection difficult: important features are hard for a detection model to extract, especially for targets occluded by obstacles, so practicality is poor. Detection methods based on convolutional neural networks can learn features automatically from the input data, are robust to changes in complex environments, and adapt well.
Among existing infrared target detection methods, patent application CN202210207336.X describes a method for detecting infrared targets in complex scenes. It applies Mosaic data augmentation to the input infrared image; optimizes the structure of the CSPDarknet53 feature extraction network and adds an ECA attention module to it; slices the input image with a Focus structure, applies several convolutions, and extracts feature maps of different scales with the optimized CSPDarknet53 network; appends an SPP module after the feature extraction network to counter the accuracy loss caused by target scale changes; fuses high-level strong semantic information and low-level strong localization features into the smallest feature map through a feature pyramid network combined with a path aggregation network, yielding detection layers of different scales that carry both strong semantic and strong localization features; and uses Varifocal Loss as the loss function for the confidence and class probability of detected objects, realizing multi-scale detection and producing different prediction boxes.
That method extracts features from the input infrared image with an improved backbone feature extraction network, fuses feature information of different scales by combining a feature pyramid network with a path aggregation network, optimizes the loss function of the network, predicts on feature maps of different scales, and improves the detection of densely occluded objects with Distance-IoU (DIoU) based non-maximum suppression, so it can be widely applied to fields such as autonomous driving and night-time security.
However, the above method has the following problems: 1. the CSPDarknet53 feature extraction module has too many parameters and produces redundant feature maps during feature extraction; 2. its multi-scale feature fusion needs strengthening, and its resistance to background interference is weak.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an infrared target detection method applied to complex environments, which comprises the following steps: acquiring an infrared image to be detected and preprocessing it; inputting the preprocessed infrared image into a trained infrared target detection model to obtain the detection result. The infrared target detection model comprises a backbone feature extraction network, a neck enhanced feature extraction network, and a prediction output feature layer;
the process of training the infrared target detection model comprises the following steps:
s1: acquiring a training data set, wherein the training data set comprises an infrared image and a category label corresponding to the infrared image;
s2: preprocessing an infrared image in the training data set, and inputting the preprocessed infrared image into an infrared target detection model for training;
s3: extracting different scale features of the infrared image by adopting a trunk feature extraction network;
s4: performing enhanced fusion processing on the features of different scales with the neck enhanced feature extraction network to obtain fused feature maps;
s5: inputting the fusion feature map into a prediction output feature layer to obtain a target detection result;
s6: and calculating a loss function of the model according to the target detection result, continuously adjusting model parameters, determining the accuracy of the target detection result by adopting a performance evaluation index, and completing training of the model when the accuracy of the target detection result meets the requirement.
Preferably, the backbone feature extraction network comprises a GSeConv module, a C3Ghost module, and an SPPF module, wherein the GSeConv module extracts shallow features of the infrared image, the C3Ghost module reduces redundant information in the shallow features, and the SPPF module enlarges the receptive field of the network and captures context information after the redundant information has been removed.
Further, the GSeConv module extracts shallow features of the infrared image as follows: the channels of the infrared image are compressed with a 1×1 convolution whose number of kernels is half the number of input channels; the output of the 1×1 convolution is reconstructed with a 3×3 layer-by-layer (depthwise) convolution to obtain a mixed feature map; the mixed feature map is split along the channel dimension into two groups; the first group is superimposed along the channel direction with the condensed features produced by a point-by-point (pointwise) convolution; and the superimposed features are spliced with the second group to give the output.
Preferably, the enhanced fusion processing of the multi-scale features by the neck enhanced feature extraction network EPANet comprises: passing the 32× downsampled feature map extracted by the backbone through the SPPF module to obtain an output feature map of size 20×20; transforming the size of the 32× downsampled feature map with a 1×1 convolution module and an upsampling module, and splicing the transformed map with the 16× downsampled backbone feature map to obtain feature map A; feeding feature map A into a C3GS module and an upsampling module to further extract features and superimposing it with the 8× downsampled backbone feature map to obtain feature map B, realizing the bottom-up fusion process; and processing feature map B with a C3GS module and a downsampling module to obtain 40×40 and 80×80 fused feature maps.
Preferably, the C3GS feature extraction module of the neck network reduces the number of parameters without losing accuracy. After channel splitting, the module divides into two branches: one branch performs no operation, while the other first extracts features through the GSeConv module and then weights the corresponding feature information with the SimAM attention mechanism, highlighting effective feature details in the image; the superimposed feature maps are then shuffled to promote information interaction between channels and enhance the learning capacity of the network. The neck multi-scale feature fusion network EPANet realizes bottom-up and top-down enhanced information fusion by upsampling and downsampling the features. To obtain a better fusion effect, nodes that contribute little to the network are removed, reducing the depth of the network and making the neck lighter; a fusion edge is then added between the original shallow network and the bottom output node of the neck to fuse higher-level features and shorten the transmission path of context information, so that features with richer semantic information are extracted.
Preferably, the model loss function is expressed as:
L_CIoU = 1 − IoU + ρ²(b, b_gt)/c² + αv, with v = (4/π²)(arctan(w_gt/h_gt) − arctan(w/h))² and α = v/((1 − IoU) + v),
where IoU denotes the intersection-over-union of the predicted box and the ground-truth box, ρ²(b, b_gt) denotes the squared distance between the centers of the predicted and ground-truth boxes, ρ denotes the Euclidean distance between the two center points, b denotes the predicted center coordinates, b_gt denotes the center coordinates of the ground-truth box, c denotes the diagonal length of the smallest rectangle enclosing both boxes, α is a balancing parameter, v measures the consistency of the aspect ratios, w_gt and h_gt denote the width and height of the ground-truth box, and w and h denote the width and height of the predicted box.
Preferably, the performance evaluation indexes are calculated as:
precision = TP/(TP + FP), recall = TP/(TP + FN),
where precision denotes the detection precision, TP denotes the number of detection boxes whose IoU with the ground truth is greater than 0.5, FP denotes the number of detection boxes whose IoU with the ground truth is less than or equal to 0.5, recall denotes the detection recall, and FN denotes the number of ground-truth targets that are not detected.
The beneficial effects are that:
the method effectively improves the detection precision of the infrared targets, has better detection effect on the infrared targets which are easy to be blocked in complex scenes, simultaneously remarkably reduces the quantity of parameters, meets the real-time detection requirement, and is friendly to the deployment of edge equipment.
Drawings
FIG. 1 is a flowchart of an infrared target detection algorithm in an embodiment of the invention;
FIG. 2 is a block diagram of a GSeConv module in an embodiment of the invention;
FIG. 3 is a block diagram of a C3Ghost module in an embodiment of the invention;
FIG. 4 is a block diagram of a C3GS module according to an embodiment of the invention;
FIG. 5 is a block diagram of a SimAM module in an embodiment of the invention;
FIG. 6 is a diagram of an EPANet network in an embodiment of the present invention;
FIG. 7 is a diagram of an infrared detection model overall framework in an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
An embodiment of an infrared target detection method applied to complex environments comprises the following steps: acquiring an infrared image to be detected and preprocessing it; inputting the preprocessed infrared image into a trained infrared target detection model to obtain the detection result. The infrared target detection model comprises a backbone feature extraction network, a neck enhanced feature extraction network, and a prediction output feature layer. Specifically, as shown in fig. 1, given an input infrared image, the shallow part of the backbone first extracts fine-grained feature information such as basic contours and textures; the backbone mainly extracts features with the GSeConv and C3Ghost modules. The multi-scale features extracted by the backbone then undergo multi-scale information fusion through the neck EPANet network, further enhancing target feature extraction; the main feature extraction module of the neck network is the C3GS module. Finally, the prediction output feature layer uses the three effective feature layers produced by the fusion network to identify and predict the target category and to regress the bounding-box position, giving the output result.
In this embodiment, a method for training an infrared target detection model is provided, the method including:
s1: acquiring a training data set, wherein the training data set comprises an infrared image and a category label corresponding to the infrared image;
s2: preprocessing an infrared image in the training data set, and inputting the preprocessed infrared image into an infrared target detection model for training;
s3: extracting different scale features of the infrared image by adopting a trunk feature extraction network;
s4: performing enhanced fusion processing on the features of different scales with the neck enhanced feature extraction network to obtain fused feature maps;
s5: inputting the fusion feature map into a prediction output feature layer to obtain a target detection result;
s6: and calculating a loss function of the model according to the target detection result, continuously adjusting model parameters, determining the accuracy of the target detection result by adopting a performance evaluation index, and completing training of the model when the accuracy of the target detection result meets the requirement.
As shown in fig. 7, the backbone feature extraction network includes a GSeConv module, a C3Ghost module, and an SPPF module, where the GSeConv module extracts shallow features of the infrared image, the C3Ghost module reduces redundant information in the shallow features, and the SPPF module enlarges the receptive field of the network and captures context information after the redundant information has been removed.
Specifically, the shallow layers contain fine-grained feature information such as basic contours and textures. Insufficient extraction of this information in the shallow network causes partial loss of information about the target to be detected and blurs the expression of global features, which directly degrades the feature extraction quality of the deep network and thus the detection accuracy of the model. As shown in fig. 2, the GSeConv module is the core feature extraction module of the shallow part of the backbone; it effectively suppresses interference from irrelevant information, reduces the loss of shallow features, and strengthens the expression of target feature information. The module operates as follows: the channels of the input feature map are first compressed with a 1×1 convolution, where the number of kernels is set to half the number of output channels, and the features are then reconstructed with few parameters by a 3×3 layer-by-layer (depthwise) convolution. The resulting mixed feature map is split along the channel dimension into two groups: one group is superimposed along the channel direction with the condensed features generated by a point-by-point (pointwise) convolution, and the other group is mapped directly to the next layer, where the superimposed maps undergo a convolution operation to give the output. As shown in fig. 3, C3Ghost adopts a CSP architecture, which addresses the problem of repeated gradient information in the network: by routing gradients through different paths, the propagated gradient information acquires correlation differences, improving accuracy while reducing inference cost.
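The patent gives no code for GSeConv, and its channel bookkeeping is ambiguous (the claims say the 1×1 kernel count is half the input channels, this paragraph says half the output channels). A minimal PyTorch sketch under the claims reading — compress, depthwise mixing, channel split, pointwise branch, concatenation — with illustrative module names:

```python
import torch
import torch.nn as nn

class GSeConv(nn.Module):
    """Illustrative sketch of the GSeConv block described in the text:
    1x1 compression -> 3x3 depthwise mixing -> channel split ->
    one group fused with a pointwise-conv branch -> splice with the other group.
    Channel counts are one plausible reading, not a definitive implementation."""
    def __init__(self, in_ch: int):
        super().__init__()
        mid = in_ch // 2  # kernel count: half the input channels (claims wording)
        # 1x1 channel compression
        self.compress = nn.Conv2d(in_ch, mid, 1, bias=False)
        # 3x3 layer-by-layer (depthwise) convolution reconstructs features cheaply
        self.depthwise = nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False)
        # point-by-point (pointwise) convolution produces the condensed features
        self.pointwise = nn.Conv2d(mid // 2, mid // 2, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mixed = self.depthwise(self.compress(x))      # mixed feature map
        a, b = mixed.chunk(2, dim=1)                  # channel split into two groups
        a = torch.cat([a, self.pointwise(a)], dim=1)  # superimpose with pointwise features
        return torch.cat([a, b], dim=1)               # splice with the second group
```

Under this reading an 8-channel input yields 6 output channels (4 from the fused first group plus 2 from the pass-through group); the real module may resolve the ambiguity differently.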
The neck enhanced feature extraction network performs enhanced fusion of the features of different scales. The effective feature layers obtained from the backbone are fused with the neck features after upsampling and downsampling, realizing the bottom-up and top-down fusion process. To obtain a better fusion effect, nodes that contribute little to the network are removed, reducing the network depth and making the neck lighter; a fusion edge is then added between the original shallow network and the bottom output node of the neck to fuse higher-level features and shorten the transmission path of context information, so that features with richer semantic information are extracted. The C3GS feature extraction module of the neck network reduces the number of parameters without losing accuracy: the feature map is split along the channel dimension, and one group of the split feature maps is fed into the GSeConv module to extract deep features; the deep feature information is weighted with the SimAM attention mechanism; and the superimposed output feature maps are shuffled along the channels to promote information interaction and enhance the learning ability of the network.
The GSeConv module extracts shallow features of the infrared image as follows: the channels of the infrared image are compressed with a 1×1 convolution whose number of kernels is half the number of input channels; the output of the 1×1 convolution is reconstructed with a 3×3 layer-by-layer (depthwise) convolution to obtain a mixed feature map; the mixed feature map is split along the channel dimension into two groups; the first group is superimposed along the channel direction with the condensed features produced by a point-by-point (pointwise) convolution; and the superimposed features are spliced with the second group to give the output.
The C3Ghost module comprises a residual network branch and a convolution branch, the residual branch consisting of several residual structures. The input is fed into both branches: the residual branch extracts deep features and increases the back-propagated gradient values between layers, while the convolution branch extracts features from the input directly. The feature maps from the two branches are superimposed along the channel direction to produce the output feature map.
In this embodiment, the C3GS feature extraction module of the neck network reduces the number of parameters without losing accuracy. As shown in fig. 4, after channel splitting the module divides into two branches: one branch performs no operation, while the other first extracts features through the GSeConv module and then weights the corresponding feature information with the SimAM attention mechanism, highlighting effective feature details in the image; the superimposed feature maps are then shuffled to promote information interaction between channels and enhance the learning capability of the network. The structure of the SimAM attention mechanism is shown in fig. 5. As shown in fig. 6, the neck multi-scale feature fusion network EPANet realizes bottom-up and top-down enhanced information fusion by upsampling and downsampling the features. To obtain a better fusion effect, nodes that contribute little to the network are removed, reducing the network depth and making the neck lighter; a fusion edge is then added between the original shallow network and the bottom output node of the neck to fuse higher-level features and shorten the transmission path of context information, so that features with richer semantic information are extracted.
The enhanced fusion processing of the multi-scale features with the neck enhanced feature extraction network EPANet comprises the following steps: the 32× downsampled feature map extracted by the backbone is passed through the SPPF module to obtain an output feature map of size 20×20; the size of the 32× downsampled feature map is transformed with a 1×1 convolution module and an upsampling module, and the transformed map is spliced with the 16× downsampled backbone feature map to obtain feature map A; feature map A is fed into a C3GS module and an upsampling module to further extract features and is superimposed with the 8× downsampled backbone feature map to obtain feature map B, realizing the bottom-up fusion process; feature map B is then processed with a C3GS module and a downsampling module to obtain 40×40 and 80×80 fused feature maps.
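The three head sizes follow from simple downsampling arithmetic; a 640×640 input is an assumption on our part, implied by the 20×20, 40×40, and 80×80 maps but not stated in the text:

```python
def head_sizes(input_size: int = 640, strides=(8, 16, 32)):
    """Spatial size of each detection head for a square input:
    640/32 = 20 (SPPF output), 640/16 = 40, 640/8 = 80, matching the
    20x20, 40x40 and 80x80 fused feature maps described in the text."""
    return {s: input_size // s for s in strides}

sizes = head_sizes()  # {8: 80, 16: 40, 32: 20}
```

The same arithmetic gives the head sizes for any other input resolution, e.g. `head_sizes(416)` yields 52, 26, and 13.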
The C3GS module comprises a convolution branch and a special processing branch. The convolution branch extracts the input channel information directly. The special processing branch splits the feature map along the channel dimension and feeds one group of the split feature maps into the GSeConv module to extract deep features; the deep feature information is weighted with the SimAM attention mechanism; and the superimposed output feature maps are shuffled along the channels to promote information interaction. The feature maps from the two branches are superimposed to give the output.
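The channel shuffle that C3GS uses to promote inter-channel information interaction can be sketched with NumPy; the group count of 2 in the example is illustrative, not specified by the text:

```python
import numpy as np

def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """Shuffle the channels of an NCHW tensor so information mixes across
    branch groups: reshape to (N, g, C//g, H, W), swap the two group axes,
    and flatten back to (N, C, H, W)."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))
```

With 8 channels and 2 groups, channels [0..7] are interleaved to [0, 4, 1, 5, 2, 6, 3, 7], so each output neighborhood sees channels from both branches.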
The post-processing stage of the model's prediction output includes two parts: confirming the target class and regressing the target position. The regression localization loss for the detection box position is CIoU, which considers not only the overlap area and center-point distance between the predicted box and the ground-truth box but also introduces an aspect-ratio factor between the two bounding boxes.
Specifically, processing the fused feature maps with the prediction output feature layer comprises: feeding the three fused feature maps output by the neck enhanced feature extraction network EPANet into the three corresponding prediction output feature layers; assigning preset prior boxes of three scales in each prediction feature layer. Taking the detection head for small targets as an example, the 80×80 fused feature map is divided into 80×80 grid cells, each corresponding to three preset prior boxes. A predictor slides over the divided feature map and, at each grid cell, predicts the confidence and regression parameters of each prior box; the predicted boxes are fine-tuned by regression with the CIoU loss function, and the class information and coordinate information of the predicted box position are decoded to localize the target box; finally, redundant predicted boxes are removed by non-maximum suppression and the best predicted box is retained to give the final detection result.
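The non-maximum suppression step can be sketched in plain Python as greedy NMS; the (x1, y1, x2, y2) box format and the 0.5 IoU threshold here are illustrative assumptions:

```python
def iou(a, b):
    """IoU of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it
    above the threshold, repeat; returns the kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thr]
    return keep
```

For example, two heavily overlapping boxes collapse to the higher-scoring one while a distant box survives.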
The model loss function is expressed as:

L_CIOU = 1 − IOU + ρ²(b, b^gt)/c² + αv

α = v / ((1 − IOU) + v), v = (4/π²)(arctan(w^gt/h^gt) − arctan(w/h))²

wherein IOU represents the intersection-over-union ratio of the predicted frame and the real target frame; ρ²(b, b^gt) represents the squared distance between the center points of the predicted frame and the real frame, with ρ denoting the Euclidean distance between the two center points; b represents the coordinates of the predicted center point and b^gt the coordinates of the center point of the real target frame; c represents the diagonal length of the smallest rectangular region containing the two bounding boxes; α is a parameter for balancing the scale; v measures the consistency of the aspect ratios; w^gt and h^gt represent the width and height of the real target frame, and w and h the width and height of the predicted frame.
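The CIOU loss described above can be sketched in plain Python. Boxes are assumed to be in (x1, y1, x2, y2) corner format; this is an illustrative reimplementation of the standard CIOU formulation, not the patented code.

```python
import math

def ciou_loss(pred, gt):
    """CIOU loss: 1 - IOU + rho^2/c^2 + alpha*v, for corner-format boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter)

    # rho^2: squared distance between the two box centers
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) ** 2
            + ((pred[1] + pred[3]) - (gt[1] + gt[3])) ** 2) / 4.0
    # c^2: squared diagonal of the smallest enclosing rectangle
    cx = max(pred[2], gt[2]) - min(pred[0], gt[0])
    cy = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cx ** 2 + cy ** 2

    # v penalises aspect-ratio mismatch; alpha balances its weight
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4.0 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1.0 - iou) + v) if v else 0.0
    return 1.0 - iou + rho2 / c2 + alpha * v

# identical boxes: IOU = 1, center distance 0, identical aspect ratio
print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # → 0.0
```

Unlike plain IOU loss, the extra terms keep the gradient informative even when the predicted and real frames do not overlap at all.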
The commonly used performance evaluation indexes for the target detection task are as follows:

Precision (P): the proportion of samples predicted as positive that are truly positive, P = tp / (tp + fp).

Recall (R): the proportion of all positive samples in the test set that are correctly identified, R = tp / (tp + fn).

Mean average precision (mAP): the mean of the average precision (AP) over all categories.

Wherein tp is the number of detection frames whose intersection-over-union of the predicted value and the true value is greater than 0.5; fp is the number of detection frames whose intersection-over-union of the predicted value and the true value is less than or equal to 0.5; fn is the number of true targets that are not detected.
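Precision and recall as defined above can be computed with a minimal Python sketch; the counts tp, fp and fn are assumed to have already been accumulated over the test set at the 0.5 IOU threshold.

```python
def precision_recall(tp, fp, fn):
    """Precision P = tp/(tp+fp) and recall R = tp/(tp+fn)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0  # guard against zero detections
    r = tp / (tp + fn) if (tp + fn) else 0.0  # guard against zero positives
    return p, r

# e.g. 8 correct detections, 2 false alarms, 2 missed targets
p, r = precision_recall(tp=8, fp=2, fn=2)
print(p, r)  # → 0.8 0.8
```

mAP additionally requires sweeping the confidence threshold to trace a precision-recall curve per category, averaging the area under each curve, and then averaging across categories, so it is omitted from this sketch.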
Simulation implementation conditions: the environment is configured with PyTorch 1.12.0, CUDA 11.7 and Python 3.7. The model was trained on an NVIDIA GeForce RTX 3060 GPU. During the training phase, the optimizer weight decay is set to 0.005, the SGD momentum to 0.9, and the model is trained for 100 iterations.
Compared with state-of-the-art infrared target detection models, the detection precision is improved by 2 percentage points, and the detection effect on small infrared targets occluded in complex scenes is better.
While the foregoing describes embodiments, aspects and advantages of the present invention, it will be understood that these embodiments are merely exemplary of the invention, and that any changes, substitutions and alterations made without departing from the spirit and principles of the invention fall within its scope.
Claims (9)
1. The infrared target detection method applied to the complex environment is characterized by comprising the following steps of: acquiring an infrared image to be detected, and preprocessing the infrared image; inputting the preprocessed infrared image into a trained infrared target detection model to obtain a detection result; the infrared target detection model comprises a trunk feature extraction network, a neck reinforcing feature extraction network and a prediction output feature layer;
the process of training the infrared target detection model comprises the following steps:
s1: acquiring a training data set, wherein the training data set comprises an infrared image and a category label corresponding to the infrared image;
s2: preprocessing an infrared image in the training data set, and inputting the preprocessed infrared image into an infrared target detection model for training;
s3: extracting different scale features of the infrared image by adopting a trunk feature extraction network;
s4: adopting a neck reinforcing feature extraction network to carry out reinforcing treatment on the features with different scales to obtain a fusion feature map;
s5: inputting the fusion feature map into a prediction output feature layer to obtain a target detection result;
s6: and calculating a loss function of the model according to the target detection result, continuously adjusting model parameters, determining the accuracy of the target detection result by adopting a performance evaluation index, and completing training of the model when the accuracy of the target detection result meets the requirement.
2. The method for detecting the infrared target in the complex environment according to claim 1, wherein the trunk feature extraction network comprises a GSeConv module, a C3Ghost module and an SPPF module, wherein the GSeConv module is used for extracting shallow features of an infrared image, the C3Ghost module is used for reducing redundant information of the shallow features, and the SPPF module is used for increasing a receptive field of the network to obtain the context information after the redundant information is removed.
3. The method for detecting an infrared target applied to a complex environment according to claim 2, wherein the GSeConv module extracting shallow features of an infrared image comprises: compressing the channels of the infrared image with a 1×1 convolution, wherein the number of convolution kernels of the convolution layer is one half of the number of channels of the input image; performing feature reconstruction on the feature map output by the 1×1 convolution with a 3×3 layer-by-layer (depthwise) convolution to obtain a mixed feature map; splitting the mixed feature map by channel to obtain two groups of feature maps; superimposing the first group of feature maps, along the channel direction, with the concentrated features generated by point-by-point convolution; and splicing the superimposed feature maps with the second group of feature maps to obtain the output result.
4. The method for detecting an infrared target in a complex environment according to claim 2, wherein the C3Ghost module comprises a residual network branch and a convolution branch, wherein the residual network branch is composed of a plurality of residual structures; inputting input information into a residual network branch and a convolution branch respectively, wherein the residual network branch is used for extracting deep features of the input information, increasing gradient values of back propagation between layers, and the convolution branch is used for directly extracting features of the input information; and superposing the feature images extracted by the two branches according to the channel direction to obtain an output feature image.
5. The method for detecting an infrared target in a complex environment according to claim 1, wherein performing the enhanced fusion processing on features of different scales with the neck enhanced feature extraction network EPANet comprises: passing the 32× downsampled feature map extracted by the backbone network through the SPPF module to obtain an output feature map of size 20×20; transforming the 32× downsampled feature map through a 1×1 convolution module and an upsampling module, and splicing the transformed feature map with the 16× downsampled feature map of the backbone network to obtain feature map A; inputting feature map A into a C3GS module and an upsampling module to further extract features, and superimposing the result with the 8× downsampled feature map of the backbone network to obtain feature map B; and inputting feature map B into a C3GS module and a downsampling module for processing to obtain the 40×40 fusion feature map and the 80×80 fusion feature map.
6. The method for detecting the infrared target in the complex environment according to claim 5, wherein the C3GS module comprises a convolution branch and a special processing branch; the convolution branch directly extracts input channel information; the special processing branch splits the feature map along the channel dimension and feeds one group of the split feature maps into the GSeConv module to extract deep features, which are then weighted by a SimAM attention mechanism; the feature maps obtained by the two branches are superimposed, and the superimposed output is shuffled by channel to promote information interaction.
7. The method for detecting an infrared target in a complex environment according to claim 1, wherein the process of processing the fusion feature map by using the predicted output feature layer comprises: respectively inputting the three fusion feature graphs output by the neck reinforcement feature extraction network EPAnet into three corresponding prediction output feature layers; distributing preset priori frames of three scales in each prediction feature layer; in the process of small target detection, dividing an 80×80 fusion feature map into 80×80 grids, wherein each grid corresponds to three preset priori frames; sliding the divided fusion feature map by adopting a predictor, and predicting confidence coefficient parameters and regression parameters of the fusion feature map through a priori frame when the predictor slides to a corresponding grid; regression fine tuning is carried out on the prediction frame by adopting a CIOU loss function, and all kinds of information and coordinate information of the position of the prediction frame are decoded; removing redundant prediction frames through non-maximum suppression, and reserving optimal prediction frames to obtain a final detection result; the other object detection process is the same as the small object detection process.
8. The method for detecting an infrared target in a complex environment according to claim 1, wherein the expression of the model loss function is:

L_CIOU = 1 − IOU + ρ²(b, b^gt)/c² + αv

α = v / ((1 − IOU) + v), v = (4/π²)(arctan(w^gt/h^gt) − arctan(w/h))²

wherein IOU represents the intersection-over-union ratio of the predicted frame and the real target frame, ρ²(b, b^gt) represents the squared distance between the center points of the predicted frame and the real frame, ρ represents the Euclidean distance between the two center points, b represents the coordinates of the predicted center point, b^gt represents the coordinates of the center point of the real target frame, c represents the diagonal length of the smallest rectangular region containing the two bounding boxes, α is a parameter for balancing the scale, v measures the consistency of the aspect ratios, w^gt and h^gt represent the width and height of the real target frame, and w and h represent the width and height of the predicted frame.
9. The method for detecting an infrared target applied to a complex environment according to claim 1, wherein the performance evaluation indexes are calculated as:

precision = tp / (tp + fp), recall = tp / (tp + fn)

wherein precision represents the detection precision, tp represents the number of detection frames whose intersection-over-union of the predicted value and the true value is greater than 0.5, fp represents the number of detection frames whose intersection-over-union of the predicted value and the true value is less than or equal to 0.5, recall represents the recall rate, and fn represents the number of true targets that are not detected.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310369678.6A CN116229217A (en) | 2023-04-07 | 2023-04-07 | Infrared target detection method applied to complex environment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310369678.6A CN116229217A (en) | 2023-04-07 | 2023-04-07 | Infrared target detection method applied to complex environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116229217A true CN116229217A (en) | 2023-06-06 |
Family
ID=86577056
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310369678.6A Pending CN116229217A (en) | 2023-04-07 | 2023-04-07 | Infrared target detection method applied to complex environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116229217A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116665016A (en) * | 2023-06-26 | 2023-08-29 | 中国科学院长春光学精密机械与物理研究所 | Single-frame infrared dim target detection method based on improved YOLOv5 |
CN116665016B (en) * | 2023-06-26 | 2024-02-23 | 中国科学院长春光学精密机械与物理研究所 | Single-frame infrared dim target detection method based on improved YOLOv5 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111738110A (en) | Remote sensing image vehicle target detection method based on multi-scale attention mechanism | |
CN112861729B (en) | Real-time depth completion method based on pseudo-depth map guidance | |
CN113076871B (en) | Fish shoal automatic detection method based on target shielding compensation | |
CN110119728A (en) | Remote sensing images cloud detection method of optic based on Multiscale Fusion semantic segmentation network | |
CN109784283A (en) | Based on the Remote Sensing Target extracting method under scene Recognition task | |
CN112528896A (en) | SAR image-oriented automatic airplane target detection method and system | |
CN109919059B (en) | Salient object detection method based on deep network layering and multi-task training | |
CN111126278A (en) | Target detection model optimization and acceleration method for few-category scene | |
CN113610905B (en) | Deep learning remote sensing image registration method based on sub-image matching and application | |
CN112529065A (en) | Target detection method based on feature alignment and key point auxiliary excitation | |
CN116229217A (en) | Infrared target detection method applied to complex environment | |
CN112668672A (en) | TensorRT-based target detection model acceleration method and device | |
CN117274774A (en) | Yolov 7-based X-ray security inspection image dangerous goods detection algorithm | |
CN115661767A (en) | Image front vehicle target identification method based on convolutional neural network | |
CN115223009A (en) | Small target detection method and device based on improved YOLOv5 | |
CN116758263A (en) | Remote sensing image target detection method based on multi-level feature fusion and joint positioning | |
CN114494893B (en) | Remote sensing image feature extraction method based on semantic reuse context feature pyramid | |
CN116385876A (en) | Optical remote sensing image ground object detection method based on YOLOX | |
CN114782983A (en) | Road scene pedestrian detection method based on improved feature pyramid and boundary loss | |
CN117746066B (en) | Diffusion model guided high-speed vehicle detection integrated learning method and device | |
CN116469014B (en) | Small sample satellite radar image sailboard identification and segmentation method based on optimized Mask R-CNN | |
CN117553807B (en) | Automatic driving navigation method and system based on laser radar | |
CN117557774A (en) | Unmanned aerial vehicle image small target detection method based on improved YOLOv8 | |
CN117994493A (en) | Aerial photography rotation small target detection method based on feature enhancement | |
CN115471762A (en) | Light fast RCNN remote sensing image airplane detection method |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||