CN116402851A - Infrared dim target tracking method under complex background - Google Patents
Infrared dim target tracking method under complex background
- Publication number: CN116402851A (application CN202310268997.8A)
- Authority: CN (China)
- Prior art keywords: target, area, image, tracking, tracked
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V10/42 — Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
- G06V10/761 — Proximity, similarity or dissimilarity measures
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06T2207/10048 — Infrared image
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
Abstract
To address the difficulties that effective features are hard to extract from infrared dim small targets against complex backgrounds, and that such targets are easily affected by surrounding distractors, the invention provides an infrared dim small target tracking method for complex backgrounds. First, the network model passes the input reference region and region to be tracked through a dual-feature extraction module to obtain a fused feature map for each. A similarity calculation module then computes the similarity between the two fused feature maps; the output similarity map contains the classification and regression information of the target. Finally, a refinement module and a head network output the predicted position and bounding box of the target in the current frame, achieving stable tracking of infrared dim small targets against complex backgrounds. The method robustly tracks the real target in complex scenes, reduces the influence of distractors around the target, improves tracking performance, and provides accurate position information for subsequent target feature extraction and key-event judgment.
Description
Technical Field
The invention belongs to the field of infrared image target tracking, and in particular relates to a method for tracking infrared dim small targets against various complex backgrounds using an end-to-end deep network model.
Background
Infrared dim small target tracking technology is mainly applied to early warning against hostile targets, long-range guided weapons, and similar applications. Accurately tracking infrared dim small targets is an important open problem, with the following main challenges: 1) The distance between the target and the infrared sensor is very large, so the target occupies few image pixels, usually between 2×2 and 9×9, and lacks edge, contour, and texture information, making key features difficult to extract. 2) During tracking, sensor jitter may break the target trajectory, so that the target suddenly jumps from one image position to another and is lost. 3) During attitude adjustment and engine ignition or flameout, the gray value of the target changes; against a brighter background the target can be submerged in the background, causing tracking failure. 4) Distractors may appear around the target whose gray values also follow a Gaussian distribution, similar to the real target, so the tracker may drift during discrimination and lose the real target. Proposing an infrared dim small target tracking method for complex backgrounds is therefore an urgent and challenging task.
Existing infrared dim small target tracking methods fall into two classes: model-driven mathematical modeling methods (mathematical modeling methods for short) and data-driven deep learning methods (deep learning methods for short). Mathematical modeling methods learn a correlation filter online to build a target appearance model and track the target by computing a similarity map over the candidate region; the position of maximum response in the similarity map is taken as the true target center in the current frame. These methods can update the target model in real time during tracking, effectively reduce the influence of target attitude adjustment and brightness changes on the tracker, and gather foreground and background information around the target to improve tracking performance. They also have the following drawbacks: 1) Scale estimation is usually done with a multi-scale pyramid, which requires presetting different scale parameters to resize the candidate region; the process is tedious and computationally expensive, reduces tracking speed, handles only uniform scale changes, and is difficult to apply in real tracking scenarios. 2) Circular shifting, introduced to mitigate the boundary effect that degrades tracker performance, places real target information into negative samples when the filter is trained on the cyclically shifted images, severely degrading the tracker and weakening its ability to discriminate the real target from similar distractors around it.
Current deep learning methods commonly use a Siamese network architecture: depth features of the reference region and the region to be tracked are extracted by a parameter-sharing backbone network, a cross-correlation operation produces a similarity map, and classification and regression head networks output the predicted target center and bounding box. The architecture is simple, requires few hyperparameters, and achieves an effective balance of accuracy and speed. The invention therefore builds on the Siamese network architecture to achieve accurate tracking of infrared dim small targets in complex scenes.
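The cross-correlation step at the heart of such a Siamese tracker can be sketched in a few lines of NumPy. This is a toy illustration, not the invention's implementation: the backbone feature extraction is replaced by random arrays, and the template is simply planted inside an empty search region so the response peak is known in advance.

```python
import numpy as np

def xcorr_response(search, template):
    """Slide a template feature map over a search feature map and
    return the cross-correlation response (similarity) map.
    search: (C, Hs, Ws), template: (C, Ht, Wt), with Hs >= Ht, Ws >= Wt."""
    C, Hs, Ws = search.shape
    _, Ht, Wt = template.shape
    out = np.zeros((Hs - Ht + 1, Ws - Wt + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(search[:, y:y+Ht, x:x+Wt] * template)
    return out

# Toy check: plant the template inside an empty search region;
# the response peak should sit exactly at the planting offset.
rng = np.random.default_rng(0)
template = rng.standard_normal((4, 3, 3))
search = np.zeros((4, 9, 9))
search[:, 2:5, 4:7] = template
resp = xcorr_response(search, template)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(resp), resp.shape))
print(peak)  # (2, 4)
```

The peak of the response map is the predicted target center, which is exactly how the maximum-responsivity position is read off in the methods described above.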
Disclosure of Invention
Aiming at the problems that effective features of infrared dim small targets are hard to extract under complex backgrounds and that such targets are easily affected by surrounding distractors, the invention provides an infrared dim small target tracking method suited to complex backgrounds such as forests, plains, and ridges, which achieves high accuracy and precision and meets real-time requirements.
The invention adopts the following technical scheme: an infrared dim small target tracking method under complex backgrounds, used for robust tracking of infrared dim small targets in different background environments, comprising the following steps:
Step 1: input an infrared image sequence Z containing an infrared dim small target;
Step 2: select the target region in the first frame of Z as the reference region T;
Step 3: feed T into the dual-feature extraction module to obtain the fused feature map cat(T);
Step 4: for each subsequent frame of Z, take the target center of the previous frame as the origin and crop the region to be tracked X_i, i ∈ 2–n, where n is the total number of frames in Z;
Step 5: feed X_i into the dual-feature extraction module to obtain the fused feature map cat(X_i);
Step 6: feed the two fused feature maps into the similarity calculation module to obtain a similarity map R;
Step 7: pass R through the refinement module to obtain a similarity map up(R) of the same size as the region to be tracked;
Step 8: pass up(R) through the head network to output the target center position and bounding-box size, giving the tracking result for the current frame i;
Step 9: replace the reference region T with the target region tracked in frame i, and repeat steps 3–9 until the sequence ends.
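The control flow of the steps above can be sketched as follows. The functions `dem`, `scm`, `rm`, and `head` are hypothetical stand-ins reduced to trivial array operations; only the loop structure, including the per-frame reference-region update, mirrors the description.

```python
import numpy as np

# Hypothetical stand-ins for the four components (DEM, SCM, RM, head
# network); each is reduced to a trivial array operation so the step
# 1-9 control flow can be shown end to end.
def dem(region):         return region.mean(axis=-1)           # fused feature map
def scm(feat_t, feat_x): return feat_x - feat_t.mean()         # similarity map
def rm(sim):             return np.kron(sim, np.ones((2, 2)))  # upsample
def head(sim_up):        # predicted center = response peak
    return np.unravel_index(np.argmax(sim_up), sim_up.shape)

def track(sequence, first_box):
    y, x, h, w = first_box
    reference = sequence[0][y:y+h, x:x+w]      # step 2: reference region T
    results = []
    for frame in sequence[1:]:                 # steps 4-9 for each frame
        cat_t = dem(reference)                 # step 3
        cat_x = dem(frame)                     # step 5
        sim = scm(cat_t, cat_x)                # step 6
        sim_up = rm(sim)                       # step 7
        results.append(head(sim_up))           # step 8
        reference = frame[y:y+h, x:x+w]        # step 9: region update
    return results

seq = [np.ones((8, 8, 3)) for _ in range(4)]
print(len(track(seq, (0, 0, 4, 4))))  # 3: one result per frame after the first
```

In the real method the region update (step 9) re-crops around the predicted center; here a fixed crop is used purely to keep the sketch short.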
The method constructs a Dual-feature Extraction Module (DEM), a Similarity Calculation Module (SCM), and a Refinement Module (RM) to form the infrared dim small target tracking network; at the test stage, a Region Update Module (RUM) is added to adapt to changes in the target and its surrounding background. The DEM extracts features from the infrared dim small target and part of its background environment, so that key features are captured effectively. The SCM measures the similarity between the feature maps of the reference region and the region to be tracked, producing a similarity map that contains the classification and regression information of the target. The RM addresses the loss of tracking precision caused by the large amount of background introduced when the similarity map is mapped back to the region to be tracked: a neural network enlarges the map to the size of the region to be tracked so that pixels correspond one-to-one, improving tracking precision. The RUM, applied at the test stage, always uses the image region of the previous frame as the reference region, keeping the feature-map information up to date. Combining these modules yields the infrared dim small target tracking method for complex backgrounds. Network robustness is improved by training on a publicly available dataset augmented with cropping, rotation, blurring, and mirroring, and the network is optimized with a multi-loss joint training scheme. The input of the method is an infrared image sequence; the output is the top-left and bottom-right coordinates of the predicted target position in each frame.
In the method, the dual-feature extraction module comprises a depth feature extractor and a histogram of oriented gradients (HOG) feature extractor. Given an input infrared image, the depth feature extractor obtains shallow detail features and deep semantic features. The HOG feature extractor divides the image uniformly into blocks, computes the gray-gradient direction and magnitude of the pixels in each block, and finally assembles the blocks into the gradient-histogram feature of the whole image.
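A minimal HOG-style cell descriptor, sketched in NumPy under simplifying assumptions (no block grouping or normalization, which the full extractor adds), illustrates the per-cell gradient histograms described above:

```python
import numpy as np

def hog_cells(img, cell=8, bins=8):
    """Minimal HOG-style descriptor: per-cell histogram of gradient
    orientations, weighted by gradient magnitude (no block normalization)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    H, W = img.shape
    ch, cw = H // cell, W // cell
    out = np.zeros((ch, cw, bins))
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    for i in range(ch):
        for j in range(cw):
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            out[i, j] = np.bincount(b, weights=m, minlength=bins)
    return out

feat = hog_cells(np.random.default_rng(1).random((120, 120)))
print(feat.shape)  # (15, 15, 8)
```

A 120×120 input with 8×8 cells and 8 orientation bins yields a 15×15×8 map, matching the shape of the hog(T) feature map used elsewhere in the description.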
The similarity calculation module adapts the Transformer network: a self-attention mechanism enhances the key information of the target and background in the feature maps of the reference region and the region to be tracked, and a cross-attention mechanism searches the enhanced feature map of the region to be tracked for the most similar area. This benefits from the Transformer's ability to extract global context, adaptively attending to dependencies between similar parts of the two feature maps over the whole spatial extent.
The refinement module truncates the U-Net network, deleting the left (downsampling) half and keeping the right (upsampling) half; its input is the similarity map produced by the similarity calculation module, and its output is a refined similarity map of the same size as the region to be tracked. The purpose of the module is as follows: in the dual-feature extraction module, convolution and pooling downsample the image region and continually shrink the feature map, so the receptive field of each pixel, mapped back onto the region to be tracked, covers a large amount of background and degrades tracking performance. With the refinement module added, the classification and regression information in the similarity map becomes richer and the real target is easier to localize.
The region update module continuously updates the reference region during testing. When tracking an infrared target, part of the background around the target is usually included to improve tracking accuracy; in a complex background environment, however, this local background changes continually, and using only the first frame of the infrared sequence as the target region degrades the tracker in subsequent frames. Adding the region update module at test time therefore improves tracking accuracy.
The depth feature extractor in the dual-feature extraction module adopts a ResNet-18 model with the final average-pooling and fully connected layers removed. A 127×127×3 reference region T and a 255×255×3 region to be tracked X are passed through the 5-layer residual network, yielding a 15×15×512 depth feature map res(T) and a 31×31×512 depth feature map res(X). The histogram of oriented gradients (HOG) feature extractor receives the same inputs and outputs a 15×15×8 HOG map hog(T) and a 31×31×8 HOG map hog(X). Finally, res(T) and hog(T) are concatenated to obtain cat(T); the same operation on the region to be tracked yields cat(X).
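The channel-wise fusion and the resulting tensor shapes can be checked directly; in this NumPy sketch the ResNet and HOG outputs are stand-in zero arrays of the stated sizes.

```python
import numpy as np

res_t = np.zeros((15, 15, 512))   # depth feature map res(T) from ResNet-18
hog_t = np.zeros((15, 15, 8))     # HOG feature map hog(T)
cat_t = np.concatenate([res_t, hog_t], axis=-1)   # channel-wise concat
print(cat_t.shape)   # (15, 15, 520)

res_x = np.zeros((31, 31, 512))   # depth feature map res(X)
hog_x = np.zeros((31, 31, 8))     # HOG feature map hog(X)
cat_x = np.concatenate([res_x, hog_x], axis=-1)
print(cat_x.shape)   # (31, 31, 520)
```

The 512 + 8 = 520 channel count of cat(T) and cat(X) is the feature dimension carried into the similarity calculation module.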
The similarity calculation module adapts the Transformer network structure from natural language processing to the target tracking domain; it comprises an encoder and a decoder. In the encoder stage, the feature map cat(T) is first given a spatial position encoding, computed with the nn.Embedding function of the PyTorch deep learning library, to obtain an encoding map P(T); P(T) is added to cat(T), and each channel of the sum is then flattened with the view function into a 520×225 multi-dimensional feature vector f(T), which is the input of the Transformer encoder. In the first encoder layer, a multi-head attention mechanism enhances the target feature information in f(T); the calculation is:

MultiHead(Q, K, V) = Concat(head_1, …, head_8) W^O, head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

where Q, K, and V are identical and equal to f(T); W_i^Q, W_i^K, W_i^V are weight matrices attending to different information; and i ∈ 1–8 means the attention mechanism attends to information at 8 positions, mapping the encoded content into 8 subspaces and strengthening the representational capacity of the model. Passing f(T) through the multi-head attention mechanism gives the encoding feature enc(T); f(T) and enc(T) are added directly and passed through a normalization layer and a feed-forward network (FFN) consisting of two linear layers and one normalization layer, finally yielding the first encoding feature enc′(T):

enc′(T) = FFN(Norm(f(T) + enc(T))).

The reference-region features are then enhanced again by a second encoder layer identical in operation to the first; cat(T) thus passes through the encoder of the similarity calculation module to produce the encoding feature enc″(T). In the decoder stage, the fused feature map cat(X) of the region to be tracked is input to the decoder of the Similarity Calculation Module (SCM). The decoder likewise has two decoder layers; spatial position encoding is added to the feature map and a flattening operation is applied before the first decoder layer, which contains an attention module composed of a multi-head attention mechanism and a normalization layer, enhances the target and key background information in the feature map of the region to be tracked, and outputs the first decoding feature dec′(X).
The second decoder layer contains an attention module and a feed-forward network; it takes dec′(X) and the encoding feature enc″(T) as inputs. When entering the multi-head attention mechanism of the attention module, the Q, K, V variables are no longer identical: Q = dec′(X) and K = V = enc″(T). The second decoding feature dec″(X) is finally output through the normalization layer and the feed-forward network, and the dec″(X) feature map is then reshaped with the view function of the PyTorch deep learning library into a similarity map R of size 31×31×520.
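The cross-attention pattern of this second decoder layer — queries from the search-region tokens, keys and values from the encoded reference tokens — can be sketched with a single-head NumPy attention. Toy 64-dimensional embeddings stand in for the 520-dimensional features; the 961 and 225 token counts correspond to 31×31 and 15×15 spatial positions.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
dec_x = rng.standard_normal((961, 64))   # decoder tokens: region to be tracked
enc_t = rng.standard_normal((225, 64))   # encoder tokens: reference region

# Cross-attention as in the second decoder layer: Q comes from the
# region to be tracked, K and V from the encoded reference features.
out = attention(Q=dec_x, K=enc_t, V=enc_t)
print(out.shape)  # (961, 64): one updated vector per search-region token
```

Each search-region token is rewritten as a weighted mixture of reference-region features, which is what lets the similarity map localize the reference target inside the region to be tracked.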
The refinement module adopts the upsampling half of U-Net and comprises five layers. The first and second layers each contain a transposed convolution with a 3×3 kernel and stride 2, followed by a double convolution block; the double convolution block consists of two convolution blocks, each containing a convolution layer with a 3×3 kernel, padding 1, and stride 1, a normalization layer, and an activation layer. The remaining three layers differ from the first two mainly in the size and stride of their transposed convolutions: the third layer uses a 2×2 kernel with stride 2, the fourth a 2×2 kernel with stride 1, and the fifth a 1×1 kernel with stride 1. Finally a 2-dimensional convolution with a 1×1 kernel is applied. The similarity map R passes through the refinement module to give the 1×255×255 refined similarity map up(R).
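Assuming zero padding for the transposed convolutions (the text does not state their padding), the layer settings above enlarge the 31×31 map to exactly 255×255 under the standard transposed-convolution size formula, which the following sketch verifies:

```python
def tconv_out(n, kernel, stride, pad=0):
    """Output length of a 1-D transposed convolution (PyTorch convention):
    (n - 1) * stride - 2 * pad + kernel."""
    return (n - 1) * stride - 2 * pad + kernel

# The five refinement layers: (kernel, stride) of each transposed conv.
layers = [(3, 2), (3, 2), (2, 2), (2, 1), (1, 1)]
n = 31                       # spatial size of the similarity map R
for k, s in layers:
    n = tconv_out(n, k, s)   # 31 -> 63 -> 127 -> 254 -> 255 -> 255
print(n)  # 255
```

The double convolution blocks (3×3 kernel, padding 1, stride 1) preserve spatial size, so only the transposed convolutions change it; the 1×1 final convolution reduces the channel count to 1, giving the 1×255×255 up(R).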
The region update module is applied during testing of the invention: when computing the similarity between the fused feature maps of the region to be tracked and the reference region, the reference region is always taken from the previous frame as the infrared image of the target plus part of its background, giving a new reference region T′; this is passed through the dual-feature extraction module and the encoder of the similarity calculation module to obtain the new encoding feature enc″(T′).
In the infrared dim small target tracking method under complex backgrounds, the training-set image sequences are obtained as follows. The LaTOT dataset is selected as the base dataset for model training. First, with the target center of each image in a sequence as the origin, the image is cropped outward with a crop width of 10 times the diagonal length of the bounding box, giving a new 511×511 image. The new image then undergoes translation, scaling, blurring, and mirroring so that the target deviates from its original position. The reference image for training is drawn at random from the first to the last frame of the sequence, and the image to be tracked is drawn at random from within 30 frames before or after the reference image. The training set used in the method contains 104,726 images.
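The frame-pair sampling rule described above (random reference frame, frame to be tracked within ±30 frames of it) can be sketched as follows; the helper name and 0-based indices are illustrative, not from the patent.

```python
import random

def sample_pair(num_frames, max_gap=30, rng=random.Random(0)):
    """Pick a random reference frame, then a frame to be tracked
    within +/- max_gap frames of it (clamped to the sequence)."""
    ref = rng.randrange(num_frames)
    lo, hi = max(0, ref - max_gap), min(num_frames - 1, ref + max_gap)
    search = rng.randrange(lo, hi + 1)
    return ref, search

# Every sampled pair stays within the 30-frame window by construction.
pairs = [sample_pair(500) for _ in range(1000)]
print(all(abs(r - s) <= 30 for r, s in pairs))  # True
```

Keeping the pair temporally close makes the reference and search crops depict nearly the same target appearance, which is what the augmented training pairs are meant to capture.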
The test-set image sequences of the infrared dim small target tracking method under complex backgrounds are obtained as follows. In some infrared sequences the target is only 1×1 pixel, from which no useful depth or HOG feature maps can be extracted; where the target is too small, a simulated infrared dim target of 5×5 to 7×7 pixels with Gaussian-distributed gray values is therefore superimposed on the real target, and the targets are annotated with the DarkLabel software.
To solve the problem of robust tracking of infrared dim small targets in different background environments, the invention provides an infrared dim small target tracking method for complex backgrounds. By designing an end-to-end deep learning network, the tracking problem is decomposed into target classification and regression tasks, so that the real target in a complex scene can be tracked robustly, the influence of distractors around the target is reduced, tracking performance is improved, and accurate position information is provided for subsequent target feature extraction and key-event judgment.
Drawings
Fig. 1 is a structure diagram of the network model of the present invention.
Fig. 2 is a structure diagram of the dual-feature extraction module of the present invention.
Fig. 3 is a structure diagram of the similarity calculation module of the present invention.
Fig. 4 is a structure diagram of the refinement module of the present invention.
Fig. 5 is a schematic diagram of an embodiment of the present invention, where (a) shows the reference region of the previous frame input to the tracking network, (b) shows the region to be tracked of the current frame input to the tracking network, and (c) shows the bounding box output by the tracking network drawn on the current source image.
Detailed Description
The invention will be further described with reference to the accompanying drawings and detailed description below:
referring to fig. 1, the method for tracking the infrared dim target under the complex background in the embodiment includes the following steps:
Step 1: input an infrared image sequence Z to be tracked, containing n frames;
Step 2: manually calibrate the target region to be tracked in the first frame of Z, and pad it with surrounding background out to 2 times the diagonal length of the target region, obtaining the reference region T ∈ R^(127×127×3).
Step 2.1: feeding the reference area T into a dual feature extraction module (DEM) for feature extraction, and referring to a model framework in FIG. 2, the reference area T comprises two parts of depth feature extraction and directional gradient histogram feature extraction;
step 2.2: depth feature extraction in DEM dual feature extraction: the ResNet-18 architecture is adopted, and consists of 5 network blocks, and the reference region T sequentially passes through the 5 network blocks to obtain a reference region depth feature map res (T). Wherein, the 1 st network block is composed of a convolution layer with a convolution kernel of 7×7, a step length of 2 and a filling of 1; the 2 nd network block consists of a maximum pooling layer with a convolution kernel of 3 multiplied by 3 and a step length of 1 and filled with 1 and two residual blocks, wherein the residual block structure is a convolution layer with the convolution kernel of 3 multiplied by 3 and the step length of 1 and filled with 1, but the output phase Concat of the output phase and the input phase of the residual block are connected in a residual way at the output position; the 3 rd to 5 th network blocks are respectively composed of two residual blocks. The structure of the residual block is similar to that of the residual block in the 2 nd network block, but the step size of the first convolution layer in the first residual block in each network block is set to 2, and the image is downsampled. And finally deleting the final average pooling layer and the linear layer to obtain a reference region depth feature map res (T) with the size of 15 multiplied by 512.
Step 2.3: directional gradient histogram feature extraction in DEM dual feature extraction: firstly, equally dividing a reference area T into a plurality of cell units, wherein the size of each cell unit is 8 multiplied by 8 pixels; then calculating the gradient magnitude and direction of each pixel in each cell unit, and adopting gradient information of pixels in 8 azimuth statistics units in the cell units; finally, synthesizing a larger block by using 2 multiplied by 2 cell units, and carrying out serial connection and normalization on gradient information of each cell unit in the block to form a gradient histogram characteristic in the block; finally, all the blocks are combined to obtain a directional gradient histogram feature map hog (T) with the size of 15 multiplied by 8.
Step 2.4: fusion of dual features in DEM dual feature extraction: and performing Concat operation of channel dimension on res (T) output in the depth feature extractor and hog (T) output in the directional gradient histogram feature extractor to obtain a fusion feature map cat (T).
Step 3: the fusion feature map cat (T) of the reference region is input to an encoder in a Similarity Calculation Module (SCM), referring to the encoder structure in fig. 3. Firstly, carrying out space position coding with 256 dimensions on pixel points of each row in a feature map cat (T) by using an nn.coding function in a pyrach deep learning library, carrying out the operation on pixels of each column, and then carrying out Concat operation on a dimension layer by using the space position coding of each row and the space position coding of each column to obtain a coding map P (T) with the size of 15 multiplied by 512. Because the number of wide and high channels of P (T) and cat (T) are consistent, corresponding pixels are directly added, each pixel information of cat (T) contains spatial information of the position, and then each channel of the added feature map is leveled by using a view function, so that a multi-dimensional feature vector f (T) of 520 multiplied by 255 is obtained and is used as an input of an encoder. In the encoder, the encoder has two encoder layers, and after f (T) is input into the first encoder layer, target characteristic information in f (T) is enhanced by utilizing a multi-head attention mechanism, and the specific calculation process is as follows:MultiHead(Q,K,V)=ConCat(head 1 ,...,head n )W O ,head i =Attention(QW i Q ,KW i K ,VW i v), wherein Q, K, V is identical and equal to f (T), W i Q 、W i K 、W i V And the i epsilon 1-8 represents information of 8 positions focused by a multi-head attention mechanism, the coded content is mapped to 8 spaces, and the characterization capability of the model is stronger. 
f(T) passes through the multi-head attention mechanism to obtain the feature enc(T); f(T) and enc(T) are then added directly, after which a normalization layer and a feed-forward network (FFN) are applied, the feed-forward network consisting of two linear layers and a normalization layer. After these operations the first coding feature enc'(T) is obtained, which can be expressed as: enc'(T) = FFN(Norm(f(T) + enc(T))). Thereafter, the reference region features are enhanced again by a second encoder layer whose operation is identical to the first. Finally, having passed through the encoder of the similarity calculation module, cat(T) yields the coding feature enc''(T).
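The multi-head attention formula above can be sketched in numpy. This is a minimal sketch only: the token count (225 = 15×15) and model width (520) follow the text, while the random weight matrices are placeholders standing in for the trained parameters W_i^Q, W_i^K, W_i^V and W^O.

```python
# Minimal numpy sketch of the encoder's multi-head attention:
# MultiHead(Q,K,V) = Concat(head_1,...,head_8) W_O,
# head_i = softmax(Q W_i^Q (K W_i^K)^T / sqrt(d_k)) V W_i^V.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, n_heads=8):
    d_model = Q.shape[-1]
    d_k = d_model // n_heads              # 520 / 8 = 65 dims per head
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) * 0.05
                      for _ in range(3))  # placeholder projections
        q, k, v = Q @ Wq, K @ Wk, V @ Wv
        attn = softmax(q @ k.T / np.sqrt(d_k))
        heads.append(attn @ v)
    Wo = rng.standard_normal((n_heads * d_k, d_model)) * 0.05
    return np.concatenate(heads, axis=-1) @ Wo

f_T = rng.standard_normal((225, 520))      # flattened 15x15 reference tokens
enc = multi_head_attention(f_T, f_T, f_T)  # self-attention: Q = K = V = f(T)
print(enc.shape)                           # (225, 520)
```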
Step 4: frames 2 to n of the infrared image sequence are input. Taking the target center position of the previous frame as the origin, a square region with side length 4.5 times the diagonal of the previous frame's bounding box is cropped, padding with background where it exceeds the image, to form the region to be tracked X_i, i ∈ 2-n.
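The search-region crop described in this step can be sketched as follows. The constant-mean padding value and the rounding of the side length are assumptions not stated in the text.

```python
# Sketch of building the search region X_i: a square crop centered on the
# previous target center with side 4.5x the previous bounding-box
# diagonal, padded with the image mean where the crop leaves the frame.
import numpy as np

def crop_search_region(img, cx, cy, bw, bh, scale=4.5):
    side = int(round(scale * np.hypot(bw, bh)))
    half = side // 2
    pad_val = float(img.mean())            # assumed padding value
    out = np.full((side, side), pad_val, dtype=img.dtype)
    x0, y0 = cx - half, cy - half
    # overlap between the crop window and the image
    sx0, sy0 = max(0, x0), max(0, y0)
    sx1 = min(img.shape[1], x0 + side)
    sy1 = min(img.shape[0], y0 + side)
    if sx1 > sx0 and sy1 > sy0:
        out[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = img[sy0:sy1, sx0:sx1]
    return out

frame = np.random.rand(256, 320).astype(np.float32)
patch = crop_search_region(frame, cx=300, cy=20, bw=6, bh=8)
print(patch.shape)   # (45, 45): 4.5 * sqrt(6^2 + 8^2) = 45
```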
Step 5: the region to be tracked X_i is sent into the dual-feature extraction module (DEM) for feature extraction, yielding the fusion feature map cat(X_i) of the region to be tracked.
Step 6: fusion characteristic map cat (X) i ) Input into the decoder in the Similarity Calculation Module (SCM), refer to the decoder structure in fig. 3. The decoder also has two decoder layers, the first decoder layer comprising an attention module and the second decoder layer comprising an attention module and a feed forward neural network, wherein the attention module is composed of a multi-headed attention mechanism and a normalization layer. Feature map cat (X) i ) Through space position coding and leveling operation, the target information and key background information in the fusion characteristic diagram of the region to be tracked are enhanced through a first decoder layer, and a first decoding characteristic dec' (X) is output i ). In the second decoder layer, dec' (X i ) And the coding feature enc "(T) as inputs, the Q, K, V variable is no longer exactly the same when the multi-head attention mechanism is entered, but q=dec' (X) i ) K=v=enc "(T), and finally the second decoder layer outputs a second decoding characteristic dec" (X) i ). Afterwards, dec "(X) is mapped by using view function in Pytorch deep learning library i ) The feature map is scaled to a similarity map R of 31×31×520.
Step 7: the similarity map R is passed into the Refinement Module (RM) to obtain the refined similarity map up(R); refer to fig. 4 for the network structure. The refinement module comprises five network layers. The first and second layers each contain a transposed convolution with a 3×3 kernel and a stride of 2, followed by a double convolution block; the double convolution block consists of two convolution blocks, each containing a convolution layer with a 3×3 kernel, padding of 1 and stride of 1, a normalization layer and an activation function layer. The remaining three layers differ from the first two mainly in the size and stride of the transposed convolution kernel: the third layer uses a 2×2 kernel with stride 2, the fourth layer a 2×2 kernel with stride 1, and the fifth layer a 1×1 kernel with stride 1. Finally, a 2-dimensional convolution with a 1×1 kernel produces the 1×255×255 refined similarity map up(R).
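The layer specifications above can be checked arithmetically with the transposed-convolution size formula out = (in - 1) × stride + kernel (assuming no padding, which the text does not state explicitly). Starting from the 31×31 similarity map, the five stated layers land exactly on the 255×255 search-region size:

```python
# Spatial-size check of the refinement module's transposed convolutions.
def tconv_out(n, kernel, stride):
    return (n - 1) * stride + kernel   # no padding assumed

size = 31
for kernel, stride in [(3, 2), (3, 2), (2, 2), (2, 1), (1, 1)]:
    size = tconv_out(size, kernel, stride)
    print(size)
# prints 63, 127, 254, 255, 255
```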
Step 8: when tracking the target in the regions to be tracked of frame 3 and later, the reference region is updated with the Region Update Module (RUM). First, the previous infrared frame and its predicted bounding box are input; then, taking the center point of the predicted bounding box as the origin, the infrared image is expanded outward by twice the diagonal of the bounding box to obtain the new reference image T'; finally, the new coding feature enc''(T') is obtained through the dual-feature extraction module and the encoder part of the similarity calculation module.
Step 9: establishment of the training data set: the invention uses a modified LaTOT data set (104,726 images) as the basic training set, modifying LaTOT to improve the robustness and accuracy of the network: first, each LaTOT image is expanded outward with the target center as the origin to obtain a basic image; then random translation, scaling, blurring and mirroring operations are added to create a new image. All LaTOT images undergo these two steps to produce the modified LaTOT data set. When training images are input, each image is cropped to obtain a reference image; the image to be tracked is a frame randomly extracted from the 30 frames before or after the reference image's position in its sequence, and is likewise obtained by cropping.
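The random translation/scaling/blur/mirror augmentations can be sketched with numpy only. All parameter ranges (shift of ±8 pixels, scale 0.8-1.2, a 3×3 box blur instead of whatever blur the authors used) are assumptions for illustration.

```python
# Sketch of the training-set augmentations: mirror, translate, blur,
# and scale, applied with assumed parameter ranges.
import numpy as np

rng = np.random.default_rng(2)

def augment(img):
    out = img.copy()
    if rng.random() < 0.5:                        # random mirror
        out = out[:, ::-1]
    dx, dy = rng.integers(-8, 9, size=2)          # random translation
    out = np.roll(out, (dy, dx), axis=(0, 1))
    if rng.random() < 0.5:                        # 3x3 box blur
        p = np.pad(out, 1, mode='edge')
        out = sum(p[i:i + out.shape[0], j:j + out.shape[1]]
                  for i in range(3) for j in range(3)) / 9.0
    s = rng.uniform(0.8, 1.2)                     # scale, nearest resample
    h, w = out.shape
    ys = np.clip((np.arange(h) / s).astype(int), 0, h - 1)
    xs = np.clip((np.arange(w) / s).astype(int), 0, w - 1)
    return out[np.ix_(ys, xs)]

aug = augment(np.random.rand(64, 64))
print(aug.shape)   # (64, 64)
```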
Step 10: establishment of the test data set: the invention uses a modified DIRST data set (13,655 images) as the test set. Some image sequences in the original DIRST data set have targets of only 1×1 pixel, from which no valid depth features or histogram-of-oriented-gradients features can be extracted. For this reason, each 1×1-pixel target is covered with a new target of 5×5 to 7×7 pixels whose gray values conform to a two-dimensional Gaussian distribution.
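The synthetic-target replacement can be sketched as pasting a 2D Gaussian blob over the original 1×1 target. The amplitude, the sigma choice (size/4), and compositing by maximum are assumptions; the text specifies only the 5×5 to 7×7 size and the Gaussian gray distribution.

```python
# Sketch of replacing a 1x1-pixel target with a 5x5-7x7 target whose
# gray values follow a 2D Gaussian, as in the modified DIRST test set.
import numpy as np

def paste_gaussian_target(img, cx, cy, size=5, amplitude=1.0):
    half = size // 2
    sigma = size / 4.0                      # assumed spread
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    blob = amplitude * np.exp(-(x**2 + y**2) / (2 * sigma**2))
    region = img[cy - half:cy + half + 1, cx - half:cx + half + 1]
    img[cy - half:cy + half + 1, cx - half:cx + half + 1] = \
        np.maximum(region, blob)            # composite by maximum (assumed)
    return img

frame = np.zeros((64, 64))
frame = paste_gaussian_target(frame, cx=32, cy=32, size=7)
print(frame.max(), frame[32, 32])   # peak 1.0 at the target center
```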
Step 11: model training process: the infrared dim and small target tracking network proposed by the invention for complex backgrounds is an end-to-end network model. The two cropped infrared images, the reference region and the region to be tracked, are fed into the network model together for several iterations of optimization; the classification result, center offset result and bounding-box result output by the head network undergo loss calculation against the corresponding labels, and the network parameters are optimized by gradient derivation. Test results are evaluated with the IoU (intersection over union) and the precision (Euclidean distance between center points) as evaluation indices.
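The two evaluation measures named above are standard; a minimal sketch with boxes in (x, y, w, h) form:

```python
# IoU between predicted and ground-truth boxes, and center error as the
# Euclidean distance between box centers. Boxes are (x, y, w, h).
import math

def iou(a, b):
    ax0, ay0, ax1, ay1 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx0, by0, bx1, by1 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def center_error(a, b):
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))

print(iou((0, 0, 10, 10), (5, 0, 10, 10)))           # 50/150 ≈ 0.333
print(center_error((0, 0, 10, 10), (5, 0, 10, 10)))  # 5.0
```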
Step 12: model training parameter settings: model training is carried out on a Windows server with an NVIDIA RTX 3090 graphics card with 24 GB of memory, using PyCharm 2021.2.2 as the test software. The total number of training iterations is 50; the learning rate is initialized to 0.01 and decreased exponentially during the iterations until it reaches 0.0005; the optimizer is a stochastic gradient descent (SGD) optimizer, and the network framework is PyTorch 1.8.0.
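The stated schedule fixes the per-epoch decay factor: if the learning rate falls from 0.01 to 0.0005 over 50 epochs, then 0.01 × gamma^50 = 0.0005. A quick sketch (assuming one decay step per epoch, which the text does not state):

```python
# Exponential learning-rate schedule implied by the stated parameters.
lr0, lr_final, epochs = 0.01, 0.0005, 50
gamma = (lr_final / lr0) ** (1.0 / epochs)   # per-epoch factor, ~0.9419
lrs = [lr0 * gamma**e for e in range(epochs + 1)]
print(round(gamma, 4), round(lrs[-1], 6))    # 0.9419 0.0005
```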
Claims (8)
1. A method for tracking infrared dim and small targets under a complex background, characterized in that the method comprises the following steps:
step 1: inputting an infrared image sequence Z containing an infrared weak and small target;
step 2: selecting a target area in a first frame image of the infrared image sequence Z as a reference area T;
step 3: inputting the reference area T into a dual-feature extraction module to obtain a fusion feature map cat (T);
step 4: taking the center position of the target in the previous frame as the origin in each subsequent frame image of the infrared image sequence Z, obtaining a region to be tracked X_i, i ∈ 2-n, n representing the total number of frames of the sequence Z;
step 5: inputting the region to be tracked X_i into the dual-feature extraction module to obtain a fusion feature map cat(X_i);
step 6: inputting the two fusion feature maps together into the similarity calculation module to obtain a similarity map R;
step 7: passing the similarity map R through the refinement module to obtain a similarity map up(R) consistent with the size of the region to be tracked;
step 8: outputting the target center point position and the bounding box size from the similarity map up(R) through the head network to obtain the tracking result of the current frame i;
step 9: replacing the reference region T with the target region tracked in the current frame i, and continuing the operations of steps 3-9 until the sequence ends.
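The steps of claim 1 form a tracking loop that can be sketched as a skeleton with stub components; the real DEM, similarity, refinement, and head networks are replaced by placeholder callables, and only the control flow follows the claim.

```python
# Skeleton of the claim-1 tracking loop with placeholder components.
def track_sequence(frames, first_box,
                   extract, similarity, refine, head, crop):
    reference = crop(frames[0], first_box)        # steps 1-2
    results = [first_box]
    for frame in frames[1:]:                      # steps 4-9
        search = crop(frame, results[-1])
        R = similarity(extract(reference), extract(search))  # steps 3, 5-6
        box = head(refine(R))                     # steps 7-8
        results.append(box)
        reference = crop(frame, box)              # step 9: update reference
    return results

# Smoke test with trivial stand-ins for the network components
identity = lambda x: x
boxes = track_sequence(
    frames=list(range(5)), first_box=(0, 0, 4, 4),
    extract=identity, similarity=lambda a, b: b,
    refine=identity, head=lambda r: (1, 1, 4, 4),
    crop=lambda frame, box: (frame, box))
print(len(boxes))   # 5
```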
2. The method for tracking the infrared dim target under the complex background according to claim 1, wherein:
the dual-feature extraction module comprises a deep feature extraction network and a gray-gradient feature extraction method, wherein the deep feature extraction network adaptively learns shallow detail features and deep semantic features in an image by means of deep learning and gradient descent, and the gray-gradient feature extraction method extracts gray-gradient histogram features of local areas in the image by mathematical calculation;
the similarity calculation module comprises a Transformer network structure, wherein a self-attention mechanism is used to enhance the target information of the feature maps and suppress background information, and a cross-attention mechanism calculates the similarity between the feature maps at a global level to obtain a similarity map;
the refinement module, obtained by modifying the U-Net network from the image segmentation task, reduces the receptive field of each pixel of the similarity map so as to keep a one-to-one correspondence with the pixels of the region to be tracked, increases the amount of information in the similarity map, and improves the accuracy of the head network output;
the region update module operates during testing: because the background of an infrared dim and small target changes frequently in complex scenes, relying only on the target and partial background framed in the first frame would greatly reduce the accuracy of subsequent similarity calculations; to improve target tracking accuracy, the predicted target region from tracking the previous frame is used as the new reference region.
3. The method for tracking the infrared dim target under the complex background according to claim 2, wherein: the dual-feature extraction module consists of a ResNet-18 depth network and a histogram-of-oriented-gradients feature extractor; the reference region and the region to be tracked are respectively passed through the ResNet-18 network and the histogram-of-oriented-gradients feature extractor to obtain feature maps, which are then concatenated along the channel dimension to obtain a fusion feature map.
4. The method for tracking the infrared dim target under the complex background according to claim 2, wherein: the similarity calculation module adds position codes to the feature map of the reference region and reshapes it into multi-dimensional vectors through the view function in PyTorch, so that each pixel in the feature map records spatial information; the target information in the feature map is enhanced and the background region suppressed through the self-attention mechanism of the Transformer; the feature map of the region to be tracked undergoes the same operations to obtain a one-dimensional vector with position codes added, and its target information is enhanced through a multi-head attention mechanism; finally, the enhanced reference region feature map and the feature map of the region to be tracked are passed together into the cross-attention mechanism of the Transformer, which searches for the most similar position in the feature map of the region to be tracked and generates the similarity map.
5. The method for tracking the infrared dim target under the complex background according to claim 2, wherein: the refinement module is derived from the U-Net network in the image segmentation task, retaining only the up-sampling network; its input is the similarity map output by the similarity calculation module, and it finally yields a refined similarity map up(R) consistent with the size of the region to be tracked.
6. The method for tracking the infrared dim target under the complex background according to claim 2, wherein: the region update module enables the tracking method to continuously use the target region and partial background region of the previously tracked frame as the reference region during testing; updating the reference region in real time copes better with complex and changeable tracking environments, makes the similarity calculation between the region to be tracked and the reference region more accurate, and improves the tracking precision.
7. The method for tracking the infrared dim target under the complex background according to claim 2, wherein: the image sequences of the training set are obtained by the following process: the LaTOT data set is selected as the basic data set for model training; first, with the target center point of each image in an image sequence as the origin, the image is cropped outward from the origin with a crop width of 10 times the diagonal length of the bounding box to obtain a new 511×511 image; subsequently, translation, scaling, blurring and mirroring operations are applied to the new image so that the target deviates from its original position; the reference image for training is randomly drawn from the first to the last frame of the sequence, and the image to be tracked is randomly drawn from within 30 frames before or after the reference image.
8. The method for tracking the infrared dim target in the complex background according to claim 7, wherein: the image sequences of the test set are obtained by the following process: the DIRST data set is selected as the test set; in some infrared sequences the target is only 1×1 pixel, from which no useful depth feature map or HOG feature map can be extracted, so where the target is too small a simulated infrared dim target of 5×5 to 7×7 pixels with Gaussian-distributed gray values is superimposed on the real target, and the target is re-annotated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310268997.8A CN116402851A (en) | 2023-03-17 | 2023-03-17 | Infrared dim target tracking method under complex background |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116402851A true CN116402851A (en) | 2023-07-07 |
Family
ID=87006629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310268997.8A Pending CN116402851A (en) | 2023-03-17 | 2023-03-17 | Infrared dim target tracking method under complex background |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116402851A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117011335A (en) * | 2023-07-26 | 2023-11-07 | 山东大学 | Multi-target tracking method and system based on self-adaptive double decoders |
CN117011335B (en) * | 2023-07-26 | 2024-04-09 | 山东大学 | Multi-target tracking method and system based on self-adaptive double decoders |
CN117274823A (en) * | 2023-11-21 | 2023-12-22 | 成都理工大学 | Visual transducer landslide identification method based on DEM feature enhancement |
CN117274823B (en) * | 2023-11-21 | 2024-01-26 | 成都理工大学 | Visual transducer landslide identification method based on DEM feature enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||