CN111860637B - Single-shot multi-frame infrared target detection method - Google Patents

Single-shot multi-frame infrared target detection method

Info

Publication number
CN111860637B
CN111860637B (application CN202010689129.3A)
Authority
CN
China
Prior art keywords
feature
network
scale
fusion
target
Prior art date
Legal status
Active
Application number
CN202010689129.3A
Other languages
Chinese (zh)
Other versions
CN111860637A (en)
Inventor
Liu Gang (刘刚)
Liu Sen (刘森)
Liu Zhonghua (刘中华)
Xiao Chunbao (肖春宝)
Cao Zixuan (曹紫绚)
Zhang Wenbo (张文波)
Zhang Peigen (张培根)
Xu Laixiang (许来祥)
Current Assignee
Henan University of Science and Technology
Original Assignee
Henan University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Henan University of Science and Technology
Priority to CN202010689129.3A
Publication of CN111860637A
Application granted
Publication of CN111860637B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks


Abstract

The single-shot multi-frame infrared target detection method starts from the feature pyramid network and, based on the unequal contribution of each feature layer to the fused output, realizes bidirectional multi-scale weighted feature fusion between low-resolution, strong-semantic feature layers and high-resolution, weak-semantic feature layers, thereby constructing an auxiliary network. Starting from the intersection over union (IoU), it considers the influence of both the overlapping and non-overlapping regions on the objective function, constructs a detector localization loss that remains invariant to target scale change, builds the target detector, and improves the detection model's sensitivity to small-target localization errors. A VGG16 convolutional neural network serves as the feature extraction network and is integrated with the auxiliary network and the target detector to form a single-shot multi-frame infrared target detection model that fuses multi-scale weighted feature fusion with a scale-invariant localization loss. The invention has autonomous learning capability and a high detection rate, and offers an effective route to infrared imaging guidance target detection in complex environments.

Description

Single-shot multi-frame infrared target detection method
Technical Field
The invention belongs to the technical field of infrared target detection, and particularly relates to a single-shot multi-frame infrared target detection method.
Background
At present, target detection is the basis on which an infrared imaging guidance automatic target recognition system completes subsequent tasks such as recognition and tracking. Existing systems lack the ability to learn target features autonomously and cannot cope once the task environment exceeds pre-planned conditions. Single-stage target detection based on deep learning offers autonomous learning capability and high computational efficiency, making it an effective route to infrared imaging guidance target detection in complex environments. The Single Shot MultiBox Detector (SSD) is a classical single-stage detection model. The SSD target detection model can be decomposed into two modules: a feature extractor and a target detector. The feature extractor extracts features from the input image, and the target detector predicts target locations and classes from those features. The feature extractor comprises two parts: a feature extraction network and an auxiliary network. The feature extraction network is generally a modified image classification network, so a transfer learning effect can be realized with weights pre-trained on an image classification dataset. The auxiliary network transforms and fuses the features output by the feature extraction network. The target detector is made up of several fully connected or convolutional layers, each of which can be regarded as a collection of detectors. Each detector outputs only one detection result, so the number of detectors sets the upper limit on the number of detectable targets. Each detector consists of one locator and one classifier: the locator maps input features to target location information, and the classifier maps them to target category information.
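To make this decomposition concrete, the following is a minimal sketch in PyTorch; the class names, channel counts and anchor numbers are illustrative assumptions rather than the patent's exact configuration:

```python
# Minimal sketch of the SSD decomposition described above (PyTorch).
# Channel counts, anchor numbers and module interfaces are illustrative
# assumptions, not the patent's exact configuration.
import torch
import torch.nn as nn

class DetectorHead(nn.Module):
    """One detection head: a locator (4 offsets per anchor) plus a classifier."""
    def __init__(self, in_ch: int, num_anchors: int, num_classes: int):
        super().__init__()
        self.locator = nn.Conv2d(in_ch, num_anchors * 4, 3, padding=1)
        self.classifier = nn.Conv2d(in_ch, num_anchors * num_classes, 3, padding=1)

    def forward(self, feat: torch.Tensor):
        return self.locator(feat), self.classifier(feat)

class SSDLike(nn.Module):
    """Feature extractor (backbone + auxiliary network) feeding per-scale heads."""
    def __init__(self, backbone: nn.Module, auxiliary: nn.Module, heads: nn.ModuleList):
        super().__init__()
        self.backbone, self.auxiliary, self.heads = backbone, auxiliary, heads

    def forward(self, x: torch.Tensor):
        feats = self.auxiliary(self.backbone(x))  # list of multi-scale feature maps
        return [head(f) for head, f in zip(self.heads, feats)]
```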
At present, infrared imaging guidance target detection based on the SSD model faces two main challenges: (1) Effectively representing and processing multi-scale features is one of the major difficulties in SSD feature extractor design. Although existing methods improve the detection of targets, particularly small targets, through better representation and fusion of multi-scale features, considerable room for improvement remains because the connection forms between feature layers of different scales are simplistic and the fusion treats all layers equally. Studying more effective feature selection and fusion mechanisms that fuse low-resolution, strong-semantic feature layers with high-resolution, weak-semantic feature layers is a key route to improving infrared small-target detection; (2) The localization loss of the SSD target detector is generally computed with L1 or L2 loss functions, which do not account for the effect of target scale change on the loss; for the same absolute error, small targets are clearly more sensitive than large targets, which reduces the model's sensitivity to small-target localization errors.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention aims to provide a single-shot multi-frame infrared target detection method that fuses multi-scale weighted feature fusion with a scale-invariant localization loss, has autonomous learning capability and a high detection rate, and offers an effective route to infrared imaging guidance target detection in complex environments.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a single-shot multi-frame infrared target detection method comprises the following steps:
s1: starting from the feature pyramid network, describing inequality of contribution of each feature layer to fusion output based on a learnable weight, realizing bidirectional multi-scale feature weighted fusion between the feature layer with low resolution and strong semantics and the feature layer with high resolution and weak semantics, and constructing an auxiliary network;
s2: from the cross-over ratio, simultaneously considering the influence of the overlapping area and the non-overlapping area on the objective function, constructing the detector positioning loss which keeps invariance to the objective scale change, and constructing the objective detector;
s3: taking a VGG16 convolutional neural network as a feature extraction network and integrating the VGG16 convolutional neural network with an auxiliary network and a target detector to form a single-shot multi-frame infrared target detection model integrating multi-scale feature weighted fusion and scale invariance positioning loss;
s31: adding a weight for representing the importance of each input feature in the feature fusion process, and learning the importance of the input feature through network training;
s32: integrating multi-scale bidirectional jump connection and rapid normalization weighting feature fusion, serving as a functional network layer, and repeating for a plurality of times to construct a bidirectional feature weighting fusion pyramid network serving as an auxiliary network;
s33: focusing on the influence of the overlapping area and the non-overlapping area on the objective function, integrating the factors which keep invariance to the objective scale change, and constructing the positioning loss of the detector;
s34: the feature extraction network, the auxiliary network and the target detector are combined to form a single-shot multi-frame infrared target detection model for realizing fusion of multi-scale feature weighted fusion and scale invariance positioning loss.
Further, step S1 specifically includes: starting from the top-down multi-scale feature fusion idea of the feature pyramid network (Feature Pyramid Network, FPN), a bottom-up path is added to the FPN to further fuse features, forming a bidirectional path; on this basis, a node with only one input edge and no feature fusion contributes little to the feature fusion network and is removed; a skip connection is added between non-adjacent nodes at the same level, fusing more features at little extra cost; in addition, to achieve higher-level feature fusion, the bidirectional path is treated as a functional network layer and repeated multiple times.
Further, the bidirectional multi-scale weighted feature fusion in step S1 specifically includes the following steps:
In the feature fusion process, a weight representing the importance of each input feature is added, and the importance is learned through network training; the weighted fusion equation is:

$O = \sum_i \frac{\omega_i}{\epsilon + \sum_j \omega_j} F_i$

where $\omega_i$ is a learnable weight and $F_i$ is the i-th input feature of the current layer; a ReLU is applied after each $\omega_i$ to ensure $\omega_i \geq 0$, $\sum_j \omega_j$ is the sum of the weights of the input features of the current layer, and $\epsilon$ is a small constant.
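A minimal sketch of this fast-normalized weighted fusion, assuming a PyTorch module (the class name and the choice of one scalar weight per input are illustrative):

```python
# Hedged sketch of the fast-normalized weighted fusion equation above (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # one learnable w_i per input
        self.eps = eps

    def forward(self, feats):  # feats: list of tensors with identical shapes
        w = F.relu(self.weights)      # ReLU keeps every w_i >= 0
        w = w / (self.eps + w.sum())  # fast normalization: weights sum to ~1
        return sum(wi * fi for wi, fi in zip(w, feats))
```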
Further, in step S2, a detector localization loss that remains invariant to target scale change is constructed, specifically as follows:

Starting from the intersection over union (Intersection over Union, IoU), a locator loss that remains invariant to target scale change is constructed:

$IoU = \frac{|A \cap B|}{|A \cup B|}, \quad GIoC = IoU - \frac{|C \setminus (A \cup B)|}{|C|}, \quad L_{loc} = 1 - GIoC;$

where A and B are the predicted and ground-truth anchor boxes respectively, and C is their closure (smallest enclosing box); the dominant term of the overlapping region is the intersection of the predicted and ground-truth boxes, while the dominant term of the non-overlapping region is the area of the closure C minus the union of the two boxes, with the remainder of the union exerting a secondary influence; GIoC thus attends simultaneously to the influence of the overlapping and non-overlapping regions on the objective function, at low computational cost.
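As a hedged sketch, the GIoC localization loss could be computed as below for axis-aligned boxes in (x1, y1, x2, y2) form; the function name and box layout are assumptions, not the patent's notation:

```python
# Hedged sketch of the GIoC localization loss (L_loc = 1 - GIoC) in PyTorch.
import torch

def gioc_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Intersection A ∩ B
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_a = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_b = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_a + area_b - inter
    iou = inter / (union + eps)

    # Smallest enclosing box C (the "closure" of A and B)
    cx1 = torch.min(pred[..., 0], target[..., 0])
    cy1 = torch.min(pred[..., 1], target[..., 1])
    cx2 = torch.max(pred[..., 2], target[..., 2])
    cy2 = torch.max(pred[..., 3], target[..., 3])
    area_c = (cx2 - cx1) * (cy2 - cy1)

    gioc = iou - (area_c - union) / (area_c + eps)
    return (1.0 - gioc).mean()
```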
Further, the VGG16 convolutional neural network in step S3 comprises 13 convolutional layers, 5 pooling layers and 2 fully connected layers; this VGG16 network forms the feature extraction network of the single-shot multi-frame detector.
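For illustration, such a feature extraction network could be assembled from torchvision's ImageNet-pretrained VGG16 as sketched below; converting the fully connected layers fc6/fc7 into convolutions follows common SSD practice and is an assumption here, since the patent does not spell out that detail:

```python
# Hedged sketch: VGG16 backbone for the feature extraction network.
import torch.nn as nn
from torchvision.models import vgg16

def build_backbone() -> nn.Sequential:
    # The 13 convolutional + 5 pooling layers live in vgg16().features;
    # replacing fc6/fc7 with convolutions is common SSD practice (an assumption).
    base = vgg16(weights="IMAGENET1K_V1").features
    fc_as_conv = nn.Sequential(
        nn.Conv2d(512, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # fc6
        nn.Conv2d(1024, 1024, kernel_size=1), nn.ReLU(inplace=True),            # fc7
    )
    return nn.Sequential(base, fc_as_conv)
```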
The beneficial effects of the invention are as follows:
the single-shot multi-frame infrared target detection method disclosed by the invention is integrated with multi-scale feature weighted fusion and scale invariance positioning loss, has autonomous learning capability and high detection rate, and is an effective way for solving the problem of infrared imaging guidance target detection in a complex environment;
the invention starts from a feature pyramid network, and describes the inequality of the contribution of each feature layer to fusion output based on the learnable weight, so as to realize the bidirectional multi-scale feature weighted fusion between the feature layer with low resolution and strong semantics and the feature layer with high resolution and weak semantics; and from the cross-over ratio, simultaneously considering the influence of the overlapping area and the non-overlapping area on the objective function, constructing the detector positioning loss which keeps invariance to the target scale change, and improving the sensitivity of the detection model to small target positioning errors. The model provided by the invention has autonomous learning capability and high detection rate, and is an effective way for solving the problem of detection of the infrared imaging guidance target in a complex environment.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the embodiments or the prior-art description are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the invention, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of a multi-scale feature layer connection strategy of the present invention;
FIG. 2 shows the training-set and validation-set loss curves of the present invention;
FIG. 3 shows the precision-recall (P-R) curves and AP values for the infrared targets of the present invention.
Detailed Description
Specific examples are given below to further clarify and fully describe the technical scheme of the invention. The present embodiment is a preferred embodiment based on the technical scheme of the invention, but the scope of the invention is not limited to the following embodiments.
A single-shot multi-frame infrared target detection method comprises the following steps:
S1: starting from the feature pyramid network, describe the unequal contribution of each feature layer to the fused output with learnable weights, realize bidirectional multi-scale weighted feature fusion between low-resolution, strong-semantic feature layers and high-resolution, weak-semantic feature layers, and construct the auxiliary network;
S2: starting from the intersection over union, consider the influence of both the overlapping and non-overlapping regions on the objective function, construct a detector localization loss that remains invariant to target scale change, and construct the target detector, improving the detection model's sensitivity to small-target localization errors;
S3: take a VGG16 convolutional neural network as the feature extraction network and integrate it with the auxiliary network and the target detector to form a single-shot multi-frame infrared target detection model fusing multi-scale weighted feature fusion and scale-invariant localization loss;
S31: during feature fusion, add a weight representing the importance of each input feature, and learn that importance through network training;
S32: integrate multi-scale bidirectional skip connections with fast-normalized weighted feature fusion into a functional network layer, and repeat it several times to construct a bidirectional weighted-fusion feature pyramid network that serves as the auxiliary network;
S33: attend simultaneously to the influence of the overlapping and non-overlapping regions on the objective function, integrate a factor that remains invariant to target scale change, and construct the detector localization loss;
to improve the small target detection capability of the single-stage model, the factor that keeps invariance to the target scale change should be integrated into the detector positioning loss function, and the intersection ratio IoU has invariance to the target scale change:
a and B are respectively a prediction anchor frame and a real anchor frame;
however, ioU only focuses on the overlapping area of the predicted anchor frame and the real anchor frame, and other non-overlapping areas are also required to be focused on in order to better reflect the overlapping ratio of the predicted frame and the real frame. Based on this, the invention improves IoU:
c is the closure of the predicted anchor frame and the real anchor frame. The most important of the overlapping area is the intersection area of the predicted frame and the real frame, the most important of the non-overlapping area is the area of the closure C where the predicted frame and the real frame are combined, and the combined area of the removed intersection area is the secondary influence non-overlapping area, and the GIoC reflects the overlapping area and the non-overlapping area at the same time;
the GIoC-based detector positioning penalty for maintaining invariance to target dimensional changes can be designed as:
L loc =1-GIoC;
s34: the feature extraction network, the auxiliary network and the target detector are combined to form a single-shot multi-frame infrared target detection model for realizing fusion of multi-scale feature weighted fusion and scale invariance positioning loss. The batch size was 32, iterating 20 ten thousand times. The initial learning rate was set to 0.001, divided by 10 at iterations to 5 ten thousand, 10 ten thousand and 15 ten thousand, respectively, with a random gradient descent with a magnitude of 0.9 and a weight decay parameter of 0.0005 for network optimization.
Furthermore, in the classical SSD single-stage target detection network, features near the bottom layers carry little semantic information but rich positional information, while features near the top layers are semantically rich but positionally coarse. To improve the detection rate of small infrared targets, the invention starts from the FPN and fuses feature layers of different scales. As an optimization, step S1 constructs a bidirectionally skip-connected FPN: starting from the top-down multi-scale feature fusion idea of the FPN, a bottom-up path is added to further fuse features, forming a bidirectional path; a node with only one input edge and no feature fusion contributes little to the feature fusion network and can be removed; a skip connection is added between non-adjacent nodes at the same level, fusing more features at little extra cost; and to achieve higher-level feature fusion, the bidirectional path is treated as a functional network layer and repeated multiple times. The structure of the bidirectionally skip-connected FPN is shown in Fig. 1(d), and a sketch of one such layer follows.
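A hedged sketch of one such bidirectional skip-connected layer over five pyramid levels; the level range P3 to P7 and the treatment of the end levels are assumptions based on the description and Fig. 1(d), with `fuse` standing for the weighted fusion plus convolution step and `up`/`down` for the Resize operations:

```python
# Hedged sketch of one bidirectional skip-connected pyramid layer.
def bifpn_layer(p3, p4, p5, p6, p7, fuse, up, down):
    # Top-down pass produces intermediate features (single-input nodes removed).
    p6_td = fuse([p6, up(p7)])
    p5_td = fuse([p5, up(p6_td)])
    p4_td = fuse([p4, up(p5_td)])
    p3_out = fuse([p3, up(p4_td)])          # lowest level: output directly
    # Bottom-up pass adds skip connections from the original inputs.
    p4_out = fuse([p4, p4_td, down(p3_out)])
    p5_out = fuse([p5, p5_td, down(p4_out)])
    p6_out = fuse([p6, p6_td, down(p5_out)])
    p7_out = fuse([p7, down(p6_out)])       # highest level: input plus path below
    return p3_out, p4_out, p5_out, p6_out, p7_out
```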
Further, the bidirectional multi-scale weighted feature fusion in step S1 specifically includes the following steps:
Considering that different input features have different resolutions, their contributions to the output features are often unequal; in the feature fusion process, a weight representing the importance of each input feature is therefore added and learned through network training. The weighted fusion equation is:

$O = \sum_i \frac{\omega_i}{\epsilon + \sum_j \omega_j} F_i$

where $\omega_i$ is a learnable weight and $F_i$ is the i-th input feature of the current layer; a ReLU is applied after each $\omega_i$ to ensure $\omega_i \geq 0$, $\sum_j \omega_j$ is the sum of the weights of the input features of the current layer, and $\epsilon$ is a small constant (e.g., 0.0001) that avoids numerical instability. With reference to Fig. 1(d), the level-6 fusion process can be described as:

$P_6^{td} = \mathrm{Conv}\left(\frac{\omega_1 P_6^{in} + \omega_2 \mathrm{Resize}(P_7^{in})}{\omega_1 + \omega_2 + \epsilon}\right), \quad P_6^{out} = \mathrm{Conv}\left(\frac{\omega_1' P_6^{in} + \omega_2' P_6^{td} + \omega_3' \mathrm{Resize}(P_5^{out})}{\omega_1' + \omega_2' + \omega_3' + \epsilon}\right)$

where $P_6^{in}$ is the initial level-6 feature, $P_6^{td}$ is the level-6 intermediate feature on the top-down path, and $P_6^{out}$ is the level-6 output feature on the bottom-up path; Resize is an upsampling or downsampling operation that matches resolutions, and Conv is a convolution for feature processing. All other feature layers are fused in a similar manner.
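The level-6 fusion above, written as a hedged sketch; the weights are assumed to be ReLU-clamped learnable scalars, and `conv` and `resize` stand in for the Conv and Resize operations:

```python
# Hedged sketch of the two-stage level-6 fusion described above.
def fuse_level6(p6_in, p7_in, p5_out, w, wp, conv, resize, eps=1e-4):
    # Top-down intermediate feature: P6_in fused with the upsampled P7_in.
    p6_td = conv((w[0] * p6_in + w[1] * resize(p7_in)) / (w[0] + w[1] + eps))
    # Bottom-up output: P6_in, P6_td and the downsampled P5_out fused together.
    p6_out = conv((wp[0] * p6_in + wp[1] * p6_td + wp[2] * resize(p5_out))
                  / (wp[0] + wp[1] + wp[2] + eps))
    return p6_td, p6_out
```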
Further, in step S2, a detector localization loss that remains invariant to target scale change is constructed, specifically as follows:

Starting from the intersection over union (Intersection over Union, IoU), a locator loss that remains invariant to target scale change is constructed:

$IoU = \frac{|A \cap B|}{|A \cup B|}, \quad GIoC = IoU - \frac{|C \setminus (A \cup B)|}{|C|}, \quad L_{loc} = 1 - GIoC;$

where A and B are the predicted and ground-truth anchor boxes respectively, and C is their closure (smallest enclosing box); the dominant term of the overlapping region is the intersection of the predicted and ground-truth boxes, while the dominant term of the non-overlapping region is the area of the closure C minus the union of the two boxes, with the remainder of the union exerting a secondary influence; GIoC thus attends simultaneously to the influence of the overlapping and non-overlapping regions on the objective function, at low computational cost.
Further, the VGG16 convolutional neural network in step S3 comprises 13 convolutional layers, 5 pooling layers and 2 fully connected layers; this VGG16 network forms the feature extraction network of the single-shot multi-frame detector.
Further, after step S34, the following experiments in this embodiment demonstrate the effects of the invention:
1. PASCAL VOC dataset
The training portions of VOC2007 and VOC2012 in the PASCAL VOC dataset were selected as the training set, and testing was performed on the VOC2007 test set. VGG16 pre-trained on the ImageNet dataset serves as the backbone network of the invention. The model of the invention, named WFSSD, is compared with SSD, DSSD, RSSD and FSSD at input image sizes of 300×300 and 512×512; the results are shown in Table 1;
TABLE 1 PASCAL VOC2007 test set detection results
As can be seen from Table 1, the mAP of WFSSD300 reaches 84.7%, an increase of 7.2 percentage points over SSD300; the mAP of WFSSD512 reaches 86.6%, an increase of 7.1 percentage points over SSD512. Although the feature extraction network of DSSD adopts the stronger ResNet-101, the mAP of the proposed model still exceeds that of DSSD. The two improved SSD models RSSD and FSSD do not consider the differing contributions of the feature layers to the fused output when fusing high and low feature layers, performing simple unweighted superposition; the invention instead adopts learned weighted fusion across feature layers and introduces a loss that is invariant to target scale change into the detector, thereby surpassing them in detection performance. In terms of speed, the WFSSD model differs little from SSD; although YOLOv3 detects faster, its accuracy gap with the proposed model is clear.
2. Ablation experiments
The bidirectional weighted-fusion feature pyramid network and the scale-invariant GIoC localization loss are the two innovations of the invention; to analyse their respective contributions to the performance gain of the WFSSD model, ablation results are shown in Table 2;
TABLE 2 influence of different Components on the detection Performance of the inventive model
As can be seen from Table 2, the bidirectional weighted-fusion feature pyramid network and the scale-invariant GIoC localization loss influence the model's performance through the feature extractor's auxiliary network and the locator loss respectively, with the pyramid network having the greater effect.
3. Self-built infrared dataset
The experimental data come from infrared aircraft videos, stored frame by frame for a total of 5,582 frames (352×240). Targets are divided into three attitude categories: lateral, backward and back; fuselage and tail flame are distinguished during detection, giving 6 manually annotated target categories: back fuselage (BAF), back tail flame (BAT), lateral fuselage (LAF), lateral tail flame (LAT), backward fuselage (BWF) and backward tail flame (BWT). The dataset contains 19,936 manually annotated targets in total, distributed as BAF 3,385, BAT 2,730, LAF 6,438, LAT 4,904, BWF 352 and BWT 2,127. Because this distribution is unbalanced, a geometric-transformation-based data augmentation method is used during training to enlarge the sample size and compensate, as sketched below.
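A minimal sketch of such geometric-transformation augmentation with torchvision; the specific transform set is an assumption, and in detection training the box coordinates must be transformed consistently with the image, which this image-level sketch omits:

```python
# Hedged sketch: geometric transforms used to enlarge under-represented classes.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
])
```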
The training and validation loss curves of the WFSSD300 model are shown in Fig. 2: the training loss decreases continuously, and the validation loss decreases gradually before converging to a stable state. The training parameters match those used for the PASCAL VOC dataset; the AP and mAP results on the test set are shown in Table 3, where WFSSD exceeds SSD, DSSD, RSSD and FSSD in detection accuracy.
TABLE 3 Infrared dataset detection results
The precision-recall (P-R) curves and AP values for the six-class test are shown in Fig. 3.
The 19,936 manually annotated targets were tested on the trained model; the results fall into three cases: True (detected and correctly classified), False (detected but misclassified) and Miss (not detected), with percentages shown in Table 4. In the original images, 13,785 targets (69.6% of the total) have annotation boxes whose width and height are both below 50 pixels; their detection results are shown in Table 5. 7,666 targets (38.45% of the total) have annotation boxes below 25 pixels; their results are shown in Table 6. 1,402 targets (7.03% of the total) have annotation boxes below 12 pixels; their results are shown in Table 7;
TABLE 4 Small-target detection capability comparison (all)
TABLE 5 Small-target detection capability comparison (< 50)
TABLE 6 Small-target detection capability comparison (< 25)
TABLE 7 Small-target detection capability comparison (< 12)
As can be seen from Tables 4 to 7, as target size decreases, the proportion of correctly detected targets (True) falls and the missed-detection proportion (Miss) rises for both the conventional SSD and WFSSD. At every size, WFSSD outperforms the conventional SSD and the other improved SSDs on both indices, and its advantage grows markedly as size decreases. Benefiting from weighted feature fusion, WFSSD detects more small targets.
In summary, the invention starts from the feature pyramid network and, with learnable weights describing the unequal contribution of each feature layer to the fused output, realizes bidirectional multi-scale weighted feature fusion between low-resolution, strong-semantic feature layers and high-resolution, weak-semantic feature layers; starting from the intersection over union, it considers the influence of both the overlapping and non-overlapping regions on the objective function, constructs a detector localization loss that remains invariant to target scale change, and improves the detection model's sensitivity to small-target localization errors. The proposed model has autonomous learning capability and a high detection rate, and offers an effective route to infrared imaging guidance target detection in complex environments.
The foregoing has outlined and described the features, principles and advantages of the present invention. It will be understood by those skilled in the art that the invention is not limited to the above embodiments, which, together with the descriptions, merely illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (5)

1. A single-shot multi-frame infrared target detection method, characterized in that the method comprises the following steps:
S1: starting from the feature pyramid network, describe the unequal contribution of each feature layer to the fused output with learnable weights, realize bidirectional multi-scale weighted feature fusion between low-resolution, strong-semantic feature layers and high-resolution, weak-semantic feature layers, and construct the auxiliary network;
S2: starting from the intersection over union, consider the influence of both the overlapping and non-overlapping regions on the objective function, construct a detector localization loss that remains invariant to target scale change, and construct the target detector;
S3: take a VGG16 convolutional neural network as the feature extraction network and integrate it with the auxiliary network and the target detector to form a single-shot multi-frame infrared target detection model fusing multi-scale weighted feature fusion and scale-invariant localization loss;
S31: during feature fusion, add a weight representing the importance of each input feature, and learn that importance through network training;
S32: integrate multi-scale bidirectional skip connections with fast-normalized weighted feature fusion into a functional network layer, and repeat it several times to construct a bidirectional weighted-fusion feature pyramid network that serves as the auxiliary network;
S33: attend simultaneously to the influence of the overlapping and non-overlapping regions on the objective function, integrate a factor that remains invariant to target scale change, and construct the detector localization loss;
S34: combine the feature extraction network, the auxiliary network and the target detector into the single-shot multi-frame infrared target detection model fusing multi-scale weighted feature fusion and scale-invariant localization loss.
2. The single-shot multi-frame infrared target detection method according to claim 1, characterized in that step S1 specifically includes: starting from the top-down multi-scale feature fusion idea of the feature pyramid network (Feature Pyramid Network, FPN), a bottom-up path is added to the FPN to further fuse features, forming a bidirectional path; on this basis, a node with only one input edge and no feature fusion contributes little to the feature fusion network and is removed; a skip connection is added between non-adjacent nodes at the same level, fusing more features at little extra cost; in addition, to achieve higher-level feature fusion, the bidirectional path is treated as a functional network layer and repeated multiple times.
3. The single-shot multi-frame infrared target detection method according to claim 1, characterized in that the bidirectional multi-scale weighted feature fusion in step S1 specifically includes the following steps:
In the feature fusion process, a weight representing the importance of each input feature is added, and the importance is learned through network training; the weighted fusion equation is:

$O = \sum_i \frac{\omega_i}{\epsilon + \sum_j \omega_j} F_i$

where $\omega_i$ is a learnable weight and $F_i$ is the i-th input feature of the current layer; a ReLU is applied after each $\omega_i$ to ensure $\omega_i \geq 0$, $\sum_j \omega_j$ is the sum of the weights of the input features of the current layer, and $\epsilon$ is a small constant.
4. The single-shot multi-frame infrared target detection method according to claim 1, characterized in that in step S2 a detector localization loss that remains invariant to target scale change is constructed, specifically as follows:
Starting from the intersection over union (Intersection over Union, IoU), a locator loss that remains invariant to target scale change is constructed:

$IoU = \frac{|A \cap B|}{|A \cup B|}, \quad GIoC = IoU - \frac{|C \setminus (A \cup B)|}{|C|}, \quad L_{loc} = 1 - GIoC;$

where A and B are the predicted and ground-truth anchor boxes respectively, and C is their closure (smallest enclosing box); the dominant term of the overlapping region is the intersection of the predicted and ground-truth boxes, while the dominant term of the non-overlapping region is the area of the closure C minus the union of the two boxes, with the remainder of the union exerting a secondary influence; GIoC thus attends simultaneously to the influence of the overlapping and non-overlapping regions on the objective function, at low computational cost.
5. The single-shot multi-frame infrared target detection method according to claim 1, characterized in that the VGG16 convolutional neural network in step S3 comprises 13 convolutional layers, 5 pooling layers and 2 fully connected layers; this VGG16 network forms the feature extraction network of the single-shot multi-frame detector.
CN202010689129.3A 2020-07-17 2020-07-17 Single-shot multi-frame infrared target detection method Active CN111860637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010689129.3A CN111860637B (en) 2020-07-17 2020-07-17 Single-shot multi-frame infrared target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010689129.3A CN111860637B (en) 2020-07-17 2020-07-17 Single-shot multi-frame infrared target detection method

Publications (2)

Publication Number Publication Date
CN111860637A CN111860637A (en) 2020-10-30
CN111860637B (en) 2023-11-21

Family

ID=72983197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010689129.3A Active CN111860637B (en) 2020-07-17 2020-07-17 Single-shot multi-frame infrared target detection method

Country Status (1)

Country Link
CN (1) CN111860637B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270366B (en) * 2020-11-02 2022-08-26 重庆邮电大学 Micro target detection method based on self-adaptive multi-feature fusion
CN112560853B (en) * 2020-12-14 2024-06-11 中科云谷科技有限公司 Image processing method, device and storage medium
CN113012228B (en) * 2021-03-23 2023-06-20 华南理工大学 Workpiece positioning system and workpiece positioning method based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563433A * 2017-08-29 2018-01-09 University of Electronic Science and Technology of China Infrared small target detection method based on convolutional neural networks
CN109034210A * 2018-07-04 2018-12-18 Academy of Broadcasting Science, SAPPRFT Object detection method based on super feature fusion and multi-scale pyramid network
CN109344821A * 2018-08-30 2019-02-15 Xidian University Small target detection method based on feature fusion and deep learning
WO2019144575A1 * 2018-01-24 2019-08-01 Sun Yat-sen University Fast pedestrian detection method and device
CN110826554A * 2018-08-10 2020-02-21 Xidian University Infrared target detection method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Single-stage object detection using convolution kernel pyramid and dilated convolution; Liu Tao; Wang Xili; Journal of Image and Graphics, No. 01; full text *

Also Published As

Publication number Publication date
CN111860637A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111860637B (en) Single-shot multi-frame infrared target detection method
CN108052911A (en) Multi-modal remote sensing image high-level characteristic integrated classification method based on deep learning
CN107909027A (en) It is a kind of that there is the quick human body target detection method for blocking processing
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN112364931A (en) Low-sample target detection method based on meta-feature and weight adjustment and network model
CN113313706B (en) Power equipment defect image detection method based on detection reference point offset analysis
CN116863539A (en) Fall figure target detection method based on optimized YOLOv8s network structure
CN109344720B (en) Emotional state detection method based on self-adaptive feature selection
CN110347853B (en) Image hash code generation method based on recurrent neural network
CN114897802A (en) Metal surface defect detection method based on improved fast RCNN algorithm
CN109558803B (en) SAR target identification method based on convolutional neural network and NP criterion
Liu et al. An advanced YOLOv3 method for small object detection
Yu et al. Remote sensing image classification based on RBF neural network based on fuzzy C-means clustering algorithm
CN113139549A (en) Parameter self-adaptive panorama segmentation method based on multitask learning
Su et al. Segmented handwritten text recognition with recurrent neural network classifiers
CN115018884B (en) Visible light infrared visual tracking method based on multi-strategy fusion tree
CN116363507A (en) XGBoost and deep neural network fusion remote sensing image classification method based on snake optimization algorithm
CN108427957B (en) Image classification method and system
CN109635254A (en) Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model
CN112633323B (en) Gesture detection method and system for classroom
CN114998567A (en) Infrared point group target identification method based on multi-mode feature discrimination
Duan et al. Improved YOLOv5 object detection algorithm for remote sensing images
CN113377985A (en) Pyramid network-based traditional Chinese medicine image classification and retrieval method
CN107563418A (en) A kind of picture attribute detection method based on area sensitive score collection of illustrative plates and more case-based learnings
Chu et al. EfficientFCOS: An efficient one-stage object detection model based on FCOS

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant