CN117934820B

CN117934820B - Infrared target identification method based on difficult sample enhancement loss

Info

Publication number: CN117934820B
Application number: CN202410332193.4A
Authority: CN
Inventors: 徐从安; 吴俊峰; 高龙; 孙显; 史骏; 孙炜玮; 周伟; 宿南; 艾加秋
Original assignee: Naval Aeronautical University
Current assignee: Naval Aeronautical University
Priority date: 2024-03-22
Filing date: 2024-03-22
Publication date: 2024-06-14
Anticipated expiration: 2044-03-22
Also published as: CN117934820A

Abstract

The invention discloses an infrared target identification method based on difficult sample enhancement loss, and belongs to the field of data identification. The infrared target recognition detection model adopted by the method comprises a backbone network, a neck and a head, wherein the backbone network is based on ResNet structure and comprises a convolution layer and four stage parts which are connected in sequence; the infrared image to be identified is used as an input image of a backbone network to be input into a convolution layer for extracting image characteristics; the backbone network also comprises three attention enhancement modules which are correspondingly and parallelly arranged with the last three stage parts respectively; the neck includes a feature pyramid network and the head includes a region suggestion network and a result prediction network. The invention introduces a attention enhancement module and a sample enhancement loss function based on infrared image physical characteristics, and obviously improves the recognition and detection performance of infrared targets.

Description

Infrared target identification method based on difficult sample enhancement loss

Technical Field

The invention belongs to the field of data identification, and relates to an infrared target identification method.

Background

Infrared images have a prominent advantage in a severe illumination environment compared to visible light images (RGB images). Infrared target detection plays a very important role in many applications such as early warning systems and marine monitoring systems.

Infrared images tend to be lower in resolution than visible light images and lack color information of objects. In addition, there are often strong background clutter, noise, low contrast, etc. problems in infrared images. This can result in some low difficulty samples in the visible image being difficult to detect in the infrared image, such as a red football on green grass. The above problems make infrared target detection a difficult task.

In recent years, a deep learning-based method is also used for infrared target detection recognition. In the deep learning-based method, the overall model can be largely divided into a feature extraction part and a result prediction part. For example, RISTDnet improves the feature extraction capability of the infrared image target by integrating the multi-scale convolution result; THERMALDET is used for infrared target detection, and competition performance is obtained, and the structure can adaptively re-weight each channel of the features after fusing the features of different layers; the Deep-IR Target establishes an information path between each feature channel through the self-care network, and further extracts spatial information of the image. However, these methods mainly focus on information loss of the infrared image during feature extraction, and do not take into account the context information between the target region and the surrounding background region in the image, resulting in poor recognition and detection performance of the low-contrast target.

On the other hand, due to the wide application of optical images, the object detection algorithm has been developed faster in the optical image field than in the infrared image field. The widely applied YOLO series algorithm designs a large number of residual errors and branch structures in a main network, and improves the characteristic extraction capability of the network. Although the YOLO series algorithm can be applied to infrared images, the method also does not improve the attention degree of useful characteristic information, and cannot improve the recognition and detection performance of low-contrast targets.

Disclosure of Invention

The invention provides an infrared target identification method based on difficult sample enhancement loss, which aims at: the problem of low-contrast infrared target identification detection performance is poor is solved.

The technical scheme of the invention is as follows:

an infrared target recognition method based on difficult sample enhancement loss adopts an infrared target recognition detection model comprising a backbone network, a neck and a head;

The backbone network is based on ResNet structure and comprises a convolution layer and four stage parts which are connected in sequence; the infrared image to be identified is used as an input image of a backbone network to be input into a convolution layer for extracting image characteristics; the backbone network also comprises three attention enhancement modules which are correspondingly and parallelly arranged with the last three stage parts respectively;

The neck comprises a feature pyramid network, and the head comprises a region suggestion network and a result prediction network; the feature pyramid network fuses the image feature information extracted by the backbone network and then inputs the image feature information into the regional suggestion network to obtain a region of interest; the result prediction network predicts the location and class of the target in combination with the region of interest and the image characteristic information.

As a further improvement of the method for infrared target identification based on difficult sample enhancement loss: each stage part comprises a plurality of convolution blocks; the structure of the convolution block uses the structure in Res2Net, and carries out convolution operation on input characteristics, then divides the obtained characteristic tensor into four groups along a channel, and then sequentially carries out addition and convolution operation.

As a further improvement of the method for infrared target identification based on difficult sample enhancement loss: the method comprises the steps that among four stage parts, a first stage part obtains a group of characteristic tensors based on image characteristics output by a convolution layer, and then the group of characteristic tensors are input into a second stage part;

For each group of corresponding attention enhancement modules and phase sections: the input of the attention enhancement module is the same as the input characteristic tensor of the corresponding stage part, and the output of the attention enhancement module is multiplied by the output of the stage part to be used as the output of the group of the corresponding attention enhancement module and the stage part;

The outputs of the first group of attention enhancement modules and the phase section are the input feature tensors of the second group of attention enhancement modules and the phase section, and the outputs of the second group of attention enhancement modules and the phase section are the input feature tensors of the third group of attention enhancement modules and the phase section; the outputs of the first, second and third sets of attention enhancement modules and the phase section are output as image characteristic information of the backbone network.

As a further improvement of the method for infrared target identification based on difficult sample enhancement loss: the attention enhancing module includes a first portion, a second portion, and a third portion connected in sequence.

As a further improvement of the method for infrared target identification based on difficult sample enhancement loss: in the first part, the attention enhancement module respectively carries out maximum pooling and average pooling on the input characteristic tensor along the channel direction, then carries out splicing convolution on the maximum pooling result and the average pooling result, and takes the splicing convolution result as the input of the second part;

The calculation process of the first part is as follows:

；

Wherein, Input feature tensor for attention enhancing module,/>For convolution kernel size/>Splicing convolution operation with step length of 1 and filling value of 0,/>Representing a maximum pooling operation along the channel direction,/>Representing an average pooling operation along the channel direction,/>The convolution results are stitched for the first portion.

As a further improvement of the method for infrared target identification based on difficult sample enhancement loss: in the second part, obtaining a weight matrix through the overall attention structure;

Specifically, the overall attention structure includes The mutual attention structure is calculated as follows:

；

Wherein, For convolution kernel size/>Performing splicing convolution operation with step length of 1 and filling value of 0; /(I)For/>Collections composed of features obtained by multiscale pooling,/>，/>Represents the/>Calculating a mutual attention structure; /(I)And (3) a weight matrix obtained for the second part.

As a further improvement of the method for infrared target identification based on difficult sample enhancement loss: the mutual attention structureThe calculation mode of (a) is as follows:

；

Wherein, The representation means that the kernel size after splicing two variables is/>Convolution operation with step size 1 and fill value 1,/>Representation means that the kernel size after splicing the two variables together is/>Convolution operation with step size of 1 and padding value of 7, the width and height of input and output of convolution are the same; /(I)、/>And/>Respectively refer to/>First/>, abstracted by convolution operationA query, a key, and a response value.

As a further improvement of the method for infrared target identification based on difficult sample enhancement loss: in the third part, an input feature tensor is usedAnd the element product of the final weight matrix as output:

；

Wherein, For convolution kernel size/>Splicing convolution operation with step length of 1 and filling value of 0,/>For the output of the first part,/>For the output of the second part,/>For the final weight matrix,/>Is the output of the attention enhancing module.

As a further improvement of the method for infrared target identification based on difficult sample enhancement loss: training the infrared target identification detection model by adopting a sample enhancement loss function; the sample enhancing loss functionIncluding regression loss/>And classification loss/>：

。

As a further improvement of the method for infrared target identification based on difficult sample enhancement loss:

Regression loss The calculation mode of (a) is as follows:

First, the contrast of the object is calculated:

；

Wherein, Representing a target area manually marked in an input image before training,/>Which represents the contrast of the object,Representing the background areas above, below, to the left and to the right of the target area respectively,Set of background regions representing four directions,/>Representing pixel values in the target area as/>Probability of occurrence of pixel,/>Representing background area/>The middle pixel value is/>Probability of occurrence of pixels of (2);

then calculate additional enhancement coefficients ：

；

Wherein,、/>、/>Is a super parameter;

Then for the first Sample regression loss of individual training samples/>The calculation mode of (a) is as follows:

；

Wherein, The intersection ratio between the target area in the training sample predicted for the infrared target recognition detection model and the artificially marked target area,/>And/>The predicted target area width and the actual target area width in the training sample,/>, respectivelyAnd/>The predicted target area height and the actual target area height in the training sample are respectively;

then calculate regression loss Wherein/>Training the total number of samples;

Classification loss The calculation mode of (a) is as follows:

；

Wherein, For the total number of target categories,/>Is a sign function if/>The training samples belong to the/>The number of categories is 1, otherwise 0,/>Is the predicted/>, of the infrared target recognition detection modelTargets in the individual training samples belong to the/>Probability of individual categories.

Compared with the prior art, the invention has the following beneficial effects:

1. the invention uses the attention enhancement module in the last three stages of the attention backbone network, enhances the feature points of the interested region and the target while integrating the image context information, and improves the attention degree of the network to the useful information, thereby improving the feature extraction capability of the low-contrast target in the infrared image and the recognition and detection performance of the low-contrast infrared target.

2. The invention provides a sample enhancement loss function based on infrared image physical characteristics, the difficulty of a sample is evaluated by using the proposed contrast measurement, and loss values of a low-contrast target with high difficulty and other targets are weighted respectively so as to balance the counter-propagation gradient of the low-contrast sample and other samples in a network, thereby further improving the overall performance.

Drawings

FIG. 1 is a schematic diagram of an infrared target recognition detection model in the present invention;

FIG. 2 is a schematic diagram of a phase portion of a hot residual backbone network;

FIG. 3 is a schematic diagram of the structure of the attention enhancing module;

FIG. 4 is a schematic diagram of the overall attention structure in the second portion of the attention enhancement module, with 3*3 branches in the diagram being linear mappings based on different scale convolution kernels for the input features;

FIG. 5 is a schematic diagram of a mutual attention structure in the overall attention structure, the mutual attention structure performing mutual attention operations on a plurality of features input to the module, the schematic diagram Representation of/>Performing a mutual attention operation on the inputs, specifically, performing an attention operation on the query vector (Q) and the value vector (V) of the branch and the query vector (K) of the other branch;

Fig. 6 is a schematic diagram of 4 divisions of a target area and a background area, with a center square portion being the target area and a black portion sideways being the selected background area.

Detailed Description

The technical scheme of the invention is described in detail below with reference to the accompanying drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention.

An infrared target recognition method based on difficult sample enhancement loss adopts an infrared target recognition detection model comprising a backbone network, a neck and a head.

As in fig. 1, the backbone network is based on ResNet architecture, comprising a convolutional layer and four phase sections connected in sequence.

The infrared image to be identified is input to the convolution layer as an input image of the backbone network for extracting image features.

Each of the phase portions includes a plurality of convolution blocks, respectively. Moreover, the number of convolutions of each phase section is not necessarily the same. M in fig. 2 represents the number of convolutions blocks stacked in a certain stage section, i.e. the depth of the network. In this embodiment, the structure of the convolution block uses the structure in Res2Net, which performs convolution operation on the input features, then divides the obtained feature tensors into four groups along the channel, and then sequentially performs addition and convolution operation.

The first stage part obtains a set of feature tensors based on the image features output by the convolution layer and then inputs the feature tensors to the second stage part.

In order to supplement the characteristic information of the target in the infrared image by integrating the image context characteristics, the backbone network further comprises three attention enhancement modules which are respectively and correspondingly arranged in parallel with the last three stage parts. For each set of attention enhancement modules and phase sections: the input of the attention enhancement module is identical to the input characteristic tensor of the corresponding stage part, and the output of the attention enhancement module is multiplied by the output of the stage part to be used as the output of the group of the corresponding attention enhancement module and the stage part. The outputs of the first set of attention enhancement modules and phase sections are the input feature tensors of the second set of attention enhancement modules and phase sections, and the outputs of the second set of attention enhancement modules and phase sections are the input feature tensors of the third set of attention enhancement modules and phase sections. The outputs of the first, second and third sets of attention enhancement modules and the phase section are output as image characteristic information of the backbone network.

Further, as shown in fig. 3, the attention enhancing module includes a first portion, a second portion, and a third portion.

In order for the attention enhancement module to focus on spatial information of image features, the present invention employs a joint expression of two different pooling operations on the tensor of input features as input to the proposed overall attention structure. Finally, the output is mapped between 0 and 1 as a spatial weight matrix of the input feature tensor.

In the first part, the attention enhancement module respectively carries out maximum pooling and average pooling on the input characteristic tensor along the channel direction, then carries out splicing convolution on the maximum pooling result and the average pooling result, and then takes the splicing convolution result as the input of the second part.

The calculation process of the first part is as follows:

；

In the second part, a weight matrix is obtained by the global attention structure.

In general, the relevance of a local area in an image is inversely proportional to distance, so that the characteristic information of an object is only relevant to the context of a limited range of local areas. The invention designs a new general attention structure in the backbone network. As shown in fig. 4, the overall attention structure calculates the cross-attention of the feature response at different scales for each local region in the feature space response map, while establishing an information path between the different scale local regions for each location of the image.

Specifically, the overall attention structure includesThe mutual attention structure is calculated as follows:

；

Wherein, For convolution kernel size/>And performing splicing convolution operation with the step length of 1 and the filling value of 0. /(I)For/>Collections composed of features obtained by multiscale pooling,/>. The above calculation of attention to each input feature is a fusion of the calculation results of a plurality of mutual attention structures having different parameters in order to pay attention to information in different subspaces.

Further, as shown in FIG. 5, a mutual attention structureThe calculation mode of (a) is as follows:

；

Wherein, The representation means that the kernel size after splicing two variables is/>Convolution operation with step size 1 and fill value 1,/>Representation means that the kernel size after splicing the two variables together is/>The convolution operation with step size 1 and fill value 7, the width and height of the convolved input and output are the same. /(I)、/>And/>Respectively refer to/>First/>, abstracted by convolution operationA query, a key, and a response value.

And (3) a weight matrix obtained for the second part.

In the third part, as in fig. 3, an input feature tensor is usedAnd the element product of the final weight matrix as output:

；

The neck includes a Feature Pyramid Network (FPN). The header includes a regional recommendation network (RPN) and a result prediction network.

The Feature Pyramid Network (FPN) fuses the image feature information extracted by the backbone network, and then inputs the image feature information into the region suggestion network (RPN) to obtain a region of interest (ROI). Finally, the result prediction network combines the region of interest (ROI) and the image feature information to predict the location and class of the object.

Further, for the infrared target recognition detection model, aiming at unbalance between a low-contrast target and other targets, training is performed based on the proposed sample enhancement loss function, and the detection performance of the low-contrast target is improved by optimizing the weight of the low-contrast target and other target loss function values.

The sample enhancing loss functionIncluding regression loss/>And classification loss/>：

。

Wherein regression lossBuild on the basis of CIoU loss functions. Compared with the traditional IoU loss function, CIoU reflects the regression effect of the detection frame more accurately by considering the overlapping area, the relative distance and the aspect ratio of the prediction box and the real box. The sample enhancement loss function optimizes the target loss function weight according to the contrast on the basis of CIoU so as to balance the learning effect of the low-contrast sample and other samples.

To evaluate the difficulty of a sample, the method designs a metric to represent the difference between the target region and the background region. First, the contrast is calculated using the target region and a background region of a certain scale around it. Next, as shown in fig. 6, four different background areas may be selected in four directions around the target area. And finally, calculating the gray value distribution difference between the target area and the background area in each direction, and taking the information entropy as a mathematical model for calculating the local gray value and the distribution dispersion. At the same time, the gray value of the region is multiplied by its corresponding information entropy, so that our measurement can take into account the center of the distribution.

If the extent of flooding of the target by the background areas in each direction is different, the difference between the gray value distribution of the target area and the background areas in each direction is small. Based on this theory, the maximum difference between the distribution of the target region and the background region in each direction is used as the contrast value of the target for evaluating the difficulty thereof.

The contrast of the target is calculated by the following steps:

；

Wherein, Representing a target area manually marked in an input image before training,/>Which represents the contrast of the object,Representing the background areas above, below, to the left and to the right of the target area respectively,Set of background regions representing four directions,/>Representing pixel values in the target area as/>Probability of occurrence of pixel,/>Representing background area/>The middle pixel value is/>Is used for the probability of occurrence of a pixel of (c).

The method applies a gaussian function as a basic mathematical model of a mapping function of sample enhancement coefficients. In order to improve the learning effect of the difficult sample and accelerate the convergence speed of the simple sample, the sum of the mapping of the contrast coefficient and the super-parameters is adopted as the final additional enhancement coefficient.

The target-based contrast ratio mapping calculation process is as follows:

；

Wherein, Is the final additional enhancement coefficient involved in the calculation in the loss function,/>、/>、/>The default values of the super parameters are 0.75,100 and 0.1.

In the above formula, the square of the contrast of the target is used as the natural baseAnd multiplying it with a linear coefficient, and then adding a hyper-parameter/>, to the resultAdditional enhancement coefficients as a function of loss.

For the firstSample regression loss of individual training samples/>The calculation mode of (a) is as follows:

；

Wherein, The intersection ratio between the target area in the training sample predicted for the infrared target recognition detection model and the artificially marked target area,/>And/>The predicted target area width and the actual target area width in the training sample,/>, respectivelyAnd/>The predicted target area and the actual target area in the training sample are high, respectively.

Then calculate regression lossWherein/>For the total number of training samples.

Classification lossThe calculation mode of (a) is as follows:

；

It should be noted that it will be apparent to those skilled in the art that the present invention is not limited to the details of the above-described exemplary embodiments, but may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The scope of the invention is indicated by the appended claims rather than by the foregoing description.

Claims

1. The infrared target recognition method based on difficult sample enhancement loss adopts an infrared target recognition detection model comprising a backbone network, a neck and a head, and is characterized in that:

The neck comprises a feature pyramid network, and the head comprises a region suggestion network and a result prediction network; the feature pyramid network fuses the image feature information extracted by the backbone network and then inputs the image feature information into the regional suggestion network to obtain a region of interest; the result prediction network predicts the position and the category of the target by combining the region of interest and the image characteristic information;

the method comprises the steps that among four stage parts, a first stage part obtains a group of characteristic tensors based on image characteristics output by a convolution layer, and then the group of characteristic tensors are input into a second stage part;

The outputs of the first group of attention enhancement modules and the phase section are the input feature tensors of the second group of attention enhancement modules and the phase section, and the outputs of the second group of attention enhancement modules and the phase section are the input feature tensors of the third group of attention enhancement modules and the phase section; the outputs of the first group, the second group and the third group of attention enhancement modules and the stage part are used as image characteristic information output by a backbone network;

The attention enhancing module comprises a first part, a second part and a third part which are sequentially connected;

in the first part, the attention enhancement module respectively carries out maximum pooling and average pooling on the input characteristic tensor along the channel direction, then carries out splicing convolution on the maximum pooling result and the average pooling result, and takes the splicing convolution result as the input of the second part;

The calculation process of the first part is as follows:

；

Wherein, Input feature tensor for attention enhancing module,/>For convolution kernel size/>Splicing convolution operation with step length of 1 and filling value of 0,/>Representing a maximum pooling operation along the channel direction,/>Representing an average pooling operation along the channel direction,/>Splicing convolution results for the first part;

in the second part, obtaining a weight matrix through the overall attention structure;

；

Wherein, For convolution kernel size/>Performing splicing convolution operation with step length of 1 and filling value of 0; /(I)For/>Collections composed of features obtained by multiscale pooling,/>，/>Represents the/>Calculating a mutual attention structure;

a weight matrix obtained for the second part;

The mutual attention structure The calculation mode of (a) is as follows:

；

Wherein, The representation means that the kernel size after splicing two variables is/>Convolution operation with step size 1 and fill value 1,/>The representation means that the core size after splicing the two variables isConvolution operation with step size of 1 and padding value of 7, the width and height of input and output of convolution are the same; /(I)、And/>Respectively refer to/>First/>, abstracted by convolution operationIndividual query, key, and response values;

in the third part, an input feature tensor is used And the element product of the final weight matrix as output:

；

Wherein, For convolution kernel size/>Splicing convolution operation with step length of 1 and filling value of 0,/>For the output of the first part,/>For the output of the second part,/>For the final weight matrix,/>An output that is an attention enhancing module;

Training the infrared target identification detection model by adopting a sample enhancement loss function; the sample enhancing loss function Including regression loss/>And classification loss/>：

；

Regression lossThe calculation mode of (a) is as follows:

First, the contrast of the object is calculated:

；

Wherein, Representing a target area manually marked in an input image before training,/>Which represents the contrast of the object,Representing background regions above, below, to the left and to the right, respectively, of the target region,/>Set of background regions representing four directions,/>Representing pixel values in the target area as/>Probability of occurrence of pixel,/>Representing background area/>The middle pixel value is/>Probability of occurrence of pixels of (2);

then calculate additional enhancement coefficients ：

；

Wherein,、/>、/>Is a super parameter;

；

then calculate regression loss Wherein/>Training the total number of samples;

Classification loss The calculation mode of (a) is as follows:

；

2. The method for infrared target identification based on difficult sample enhancement loss of claim 1, wherein: each stage part comprises a plurality of convolution blocks; the structure of the convolution block uses the structure in Res2Net, and carries out convolution operation on input characteristics, then divides the obtained characteristic tensor into four groups along a channel, and then sequentially carries out addition and convolution operation.