CN116402851A - Infrared dim target tracking method under complex background - Google Patents

Infrared dim target tracking method under complex background

Info

Publication number
CN116402851A
CN116402851A
Authority
CN
China
Prior art keywords
target
area
image
tracking
tracked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310268997.8A
Other languages
Chinese (zh)
Inventor
李大威
蔺素珍
崔晨辉
禄晓飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North University of China
Original Assignee
North University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North University of China filed Critical North University of China
Priority to CN202310268997.8A priority Critical patent/CN116402851A/en
Publication of CN116402851A publication Critical patent/CN116402851A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

To address the difficulties that effective features are hard to extract from infrared dim and small targets against complex backgrounds and that such targets are easily disturbed by surrounding interferers, the invention provides an infrared dim and small target tracking method under complex backgrounds. First, the network model passes the input reference region and the region to be tracked into a dual-feature extraction module to obtain a fused feature map for each. Then a similarity calculation module computes the similarity between the two fused feature maps; the resulting similarity map contains the classification and regression information of the target. Finally, a refinement module and a head network output the predicted position and bounding box of the target in the current frame, achieving stable tracking of infrared dim and small targets under complex backgrounds. The method can robustly track the real target in complex scenes, reduce the influence of interferers around the target, improve tracking performance, and provide accurate position information for subsequent target feature extraction and key-event judgment.

Description

Infrared dim target tracking method under complex background
Technical Field
The invention belongs to the field of infrared image target tracking, and particularly relates to a method for tracking infrared dim and small targets against different complex backgrounds using an end-to-end deep network model.
Background
Infrared dim and small target tracking technology is mainly applied to early warning against hostile targets, long-range guided weapons, and similar applications. Accurately tracking infrared dim and small targets is an important and difficult problem, and the main challenges are: 1) The distance between the infrared dim and small target and the infrared sensor is very large, so the target occupies only a few image pixels, usually between 2×2 and 9×9, and has no edge contour or texture information, making key features difficult to extract. 2) During tracking, the sensor may jitter, which breaks the trajectory of the target: the target suddenly jumps from one position in the image to another, and the tracker loses it. 3) During attitude adjustment or engine ignition and flameout, the gray value of the target changes; if the background happens to be brighter, the target can be submerged in the background, causing tracking failure. 4) Interferers may appear around the target whose gray values also follow a Gaussian distribution and resemble the real target, so the tracker may drift during discrimination and lose the real target. Therefore, proposing an infrared dim and small target tracking method for complex backgrounds is an urgent and challenging task.
At present, existing infrared dim and small target tracking methods can be divided into two categories: model-driven mathematical modeling methods (mathematical modeling methods for short) and data-driven deep learning methods (deep learning methods for short). Mathematical modeling methods learn a correlation filter online to build a target appearance model and track the target candidate region by computing a similarity map, in which the position of maximum response is taken as the true target center in the current frame. Such methods can update the target model in real time while tracking, effectively reduce the influence of attitude adjustment and brightness changes of the dim target on the tracker, and collect foreground and background information of the target to improve tracking performance. However, they also have the following drawbacks: 1) Scale estimation is usually performed with a multi-scale pyramid, which requires presetting different scale parameters to adjust the size of the candidate region to be tracked; the procedure is tedious and computationally complex, reduces tracking speed, handles only proportional scale changes, and is difficult to apply in real tracking scenes. 2) The circular-shift operation introduces boundary effects that degrade tracker performance: when training the filter, many circularly shifted images are used, so real target information also appears in the negative samples, which severely degrades the tracker and weakens its ability to discriminate between the real target and similar interferers around it. Deep learning methods currently rely mostly on a Siamese (twin) network architecture: the depth features of the reference region and the region to be tracked are extracted by a parameter-sharing backbone network, a cross-correlation operation produces a similarity map, and classification and regression head networks output the predicted target center position and bounding box. The structure is simple, requires few hyperparameters, and achieves an effective balance between accuracy and speed. Therefore, the invention builds on the Siamese network architecture to achieve accurate tracking of infrared dim and small targets in complex scenes.
Disclosure of Invention
To address the problems that effective features of infrared dim and small targets are difficult to extract under complex background conditions and that such targets are easily disturbed by surrounding interferers, the invention provides an infrared dim and small target tracking method under complex backgrounds. The method is suitable for tracking infrared dim and small targets against complex backgrounds such as forests, plains and ridges, achieves high accuracy and precision, and meets real-time requirements.
The invention adopts the following technical scheme: a method for tracking infrared dim and small targets under complex backgrounds, used to robustly track infrared dim and small targets in different background environments, comprising the following steps:
step 1: inputting an infrared image sequence Z containing an infrared dim and small target;
step 2: selecting the target area in the first frame image of the infrared image sequence Z as the reference region T;
step 3: inputting the reference region T into the dual-feature extraction module to obtain a fused feature map cat(T);
step 4: taking the center position of the target in the previous frame as the origin, obtaining the region to be tracked X_i from each subsequent frame image of the infrared image sequence Z, where i ∈ 2-n and n is the total number of frames of the sequence Z;
step 5: inputting the region to be tracked X_i into the dual-feature extraction module to obtain a fused feature map cat(X_i);
step 6: inputting the two fused feature maps together into the similarity calculation module to obtain a similarity map R;
step 7: passing the similarity map R through the refinement module to obtain a similarity map up(R) whose size is consistent with the region to be tracked;
step 8: passing the similarity map up(R) through the head network to output the target center position and bounding-box size, giving the tracking result for the current frame i;
step 9: replacing the reference region T with the target area tracked in the current frame i, and repeating steps 3-9 until the sequence ends (a compact sketch of this per-frame loop is given below).
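A compact, hedged sketch of this per-frame loop is shown below; dem, scm, rm, head and crop_region are hypothetical callables standing in for the modules and the cropping step described later in this document, not names defined by the invention.

```python
def track_sequence(frames, first_box, dem, scm, rm, head, crop_region):
    """Steps 1-9 as a loop: crop reference/search regions, compute the similarity
    map, refine it, predict the box, then update the reference region."""
    box, results = first_box, [first_box]
    ref = crop_region(frames[0], box)                            # step 2: reference region T
    for frame in frames[1:]:                                     # steps 4-9: frames 2..n
        search = crop_region(frame, box, context_scale=4.5, out_size=255)
        r = scm(dem(ref), dem(search))                           # steps 3, 5, 6: similarity map R
        box = head(rm(r))                                        # steps 7, 8: refined map -> predicted box
        results.append(box)
        ref = crop_region(frame, box)                            # step 9 / region update: new reference T'
    return results
```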
The method requires constructing a Dual-feature Extraction Module (DEM), a Similarity Calculation Module (SCM) and a Refinement Module (RM) to form the infrared dim and small target tracking network; in the test stage a Region Update Module (RUM) is added to adapt to changes of the target and its surrounding background. The DEM extracts features from the infrared dim and small target and part of its background environment, so that key features are extracted effectively. The SCM measures the similarity between the feature maps of the reference region and the region to be tracked and produces a similarity map that contains the classification and regression information of the target. The RM addresses the drop in tracking precision caused by the large amount of background introduced when the similarity map is mapped back to the region to be tracked: a neural network enlarges the similarity map to the size of the region to be tracked so that the pixel correspondence remains one-to-one, improving tracking precision. The RUM is applied in the test stage and always uses the image region of the previous frame as the reference region, so that the feature-map information is kept up to date. Combining these modules yields the infrared dim and small target tracking method under complex backgrounds. A publicly available data set is used for training, with cropping, rotation, blurring, mirroring and similar operations added to improve network robustness, and the network is optimized with a multi-loss joint training scheme. The input of the method is an infrared image sequence, and the output is the top-left and bottom-right coordinates of the predicted target position in each frame.
In the method, the dual-feature extraction module comprises a depth feature extractor and a histogram-of-oriented-gradients (HOG) feature extractor. Given an input infrared image, the depth feature extractor obtains shallow detail features and deep semantic features of the image. The HOG feature extractor divides the image uniformly into image blocks, computes the gray-gradient direction and magnitude of the pixels in each block, and finally combines the blocks to form the gradient-histogram feature of the whole image.
The similarity calculation module modifies the Transformer network: a self-attention mechanism enhances the key information of the target and background in the feature maps of the reference region and the region to be tracked, and a cross-attention mechanism searches the enhanced feature map of the region to be tracked for the most similar region. Because the Transformer can extract global context information, it adaptively attends to the dependence between similar parts of the two feature maps over the whole spatial range.
The refinement module is split from the U-Net segmentation network: the left down-sampling half of U-Net is removed and the right up-sampling half is kept. Its input is the similarity map produced by the similarity calculation module, and its output is a refined similarity map whose size is consistent with the region to be tracked. The purpose of this module is as follows: in the dual-feature extraction module, the convolution and pooling operations of the depth network down-sample the image region and keep shrinking the feature map, so the receptive field of each pixel, when mapped back onto the region to be tracked, covers a lot of background and degrades tracking performance. With the refinement module added, the classification and regression information in the similarity map becomes richer and the real target is easier to localize.
The region update module continuously updates the reference region during testing. When tracking an infrared target, part of the target's background area is usually included to improve tracking accuracy; in a complex background environment, however, that background area keeps changing, and using only the first frame of the infrared sequence as the reference region degrades the tracker in subsequent frames. The region update module is therefore added to the test procedure to improve tracking accuracy.
The depth feature extractor in the dual-feature extraction module adopts a ResNet-18 network in which the final average-pooling layer and fully connected layer are removed. A 127×127×3 reference region T and a 255×255×3 region to be tracked X are passed through the 5-stage residual network to obtain a 15×15×512 depth feature map res(T) and a 31×31×512 depth feature map res(X). The histogram-of-oriented-gradients (HOG) feature extractor takes the same inputs as the depth feature extractor and outputs a 15×15×8 HOG feature map hog(T) and a 31×31×8 HOG feature map hog(X). Finally, res(T) and hog(T) are concatenated to obtain cat(T), and the same operation on the region to be tracked gives cat(X).
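A minimal sketch of such a dual-feature extractor is given below; it is an illustration under stated assumptions, not the patented implementation. The HOG branch is computed with scikit-image using 1×1-cell blocks to keep a per-cell 15×15 layout (a simplification of the 2×2-cell blocks described later), and the depth branch is a stock torchvision ResNet-18 with its average-pooling and fully connected layers removed; reproducing the exact 15×15 depth-map size quoted above would additionally require reducing the strides of the deeper residual stages, which is not shown, so the HOG map is simply resized to the depth-map size before concatenation.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from skimage.feature import hog
from torchvision.models import resnet18

def hog_map(gray, bins=8, cell=8):
    """Per-cell HOG features of shape (H//cell, W//cell, bins); 1x1-cell blocks keep the per-cell grid."""
    h = hog(gray, orientations=bins, pixels_per_cell=(cell, cell),
            cells_per_block=(1, 1), feature_vector=False)
    return h.reshape(h.shape[0], h.shape[1], bins)                # e.g. (15, 15, 8) for a 127x127 input

class DualFeatureExtractor(nn.Module):
    """Depth branch (ResNet-18 without avgpool/fc) plus HOG branch, concatenated channel-wise."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.depth = nn.Sequential(*list(backbone.children())[:-2])  # keep conv1 ... layer4

    def forward(self, image, hog_feat):
        # image: (B, 3, H, W) infrared region; hog_feat: (B, 8, h, w) HOG map computed outside the network
        res = self.depth(image)                                       # (B, 512, h', w')
        hog_feat = F.interpolate(hog_feat, size=res.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return torch.cat([res, hog_feat], dim=1)                      # fused feature map, 520 channels

# Example with a stand-in 127x127 reference region:
gray = np.random.rand(127, 127).astype(np.float32)
img = torch.from_numpy(gray).unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)   # replicate to 3 channels
hog_t = torch.from_numpy(hog_map(gray)).permute(2, 0, 1).unsqueeze(0).float()
cat_t = DualFeatureExtractor()(img, hog_t)                               # cat(T), 520-channel fused map
```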
The similarity calculation module adapts the Transformer network structure from natural language processing to the target-tracking field and comprises an encoder and a decoder. In the encoder stage, spatial position encoding is first applied to the feature map cat(T) using the PyTorch deep learning library to obtain an encoding map P(T); P(T) is added to cat(T), and each channel of the summed feature map is flattened with the view function to obtain a 520×225 multi-dimensional feature vector f(T), which serves as the input of the Transformer encoder. After f(T) passes through the first encoder layer, the target feature information in f(T) is enhanced by a multi-head attention mechanism, whose computation is:
Attention(Q, K, V) = softmax(QK^T / √d_k)·V, where d_k is the dimension of the key vectors;
MultiHead(Q, K, V) = Concat(head_1, ..., head_8)·W^O, where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V).
In these formulas Q, K and V are identical and equal to f(T); W_i^Q, W_i^K and W_i^V are weight matrices that attend to different information, and i ∈ 1-8 means the attention mechanism attends to information at 8 positions, mapping the encoded content into 8 subspaces so that the representation capability of the model is stronger. Passing f(T) through the multi-head attention mechanism yields the encoding feature enc(T); f(T) and enc(T) are then added directly and passed through a normalization layer and a feed-forward network (FFN) consisting of two linear layers and one normalization layer. After these operations the first encoding feature enc'(T) is obtained, which can be written as enc'(T) = FFN(Norm(f(T) + enc(T))). The reference-region features are then enhanced again by a second encoder layer whose operation is identical to the first. Finally, cat(T) passes through the encoder of the similarity calculation module to give the encoding feature enc''(T). In the decoder stage, the fused feature map cat(X) of the region to be tracked is input to the decoder of the similarity calculation module (SCM). The decoder likewise has two decoder layers; spatial position encoding is added to the feature map and a flattening operation is performed before the first decoder layer, which contains an attention module composed of a multi-head attention mechanism and a normalization layer, enhances the target and key background information in the feature map of the region to be tracked, and outputs the first decoding feature dec'(X). The second decoder layer contains an attention module and a feed-forward network; it takes dec'(X) and the encoding feature enc''(T) as inputs, and when they enter the multi-head attention mechanism of the attention module the Q, K, V variables are no longer identical, but Q = dec'(X) and K = V = enc''(T). The second decoding feature dec''(X) is finally output through the normalization layer and the feed-forward network, and the dec''(X) feature map is then reshaped, using the view function in the PyTorch deep learning library, into a similarity map R of size 31×31×520.
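The following is a minimal sketch of the SCM built on PyTorch's nn.MultiheadAttention; the two encoder layers, two decoder layers, 8 heads, 520-dimensional tokens and the Q/K/V assignments follow the description above, while the FFN hidden width and other details are assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class AttnBlock(nn.Module):
    """Multi-head attention followed by an add-and-norm step, as in the SCM layers."""
    def __init__(self, dim=520, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q, kv):
        out, _ = self.attn(q, kv, kv)
        return self.norm(q + out)

class FFN(nn.Module):
    """Two linear layers plus a normalization layer, per the description."""
    def __init__(self, dim=520, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim), nn.LayerNorm(dim))

    def forward(self, x):
        return self.net(x)

# Flattened, position-encoded features: f(T) is 225 x 520 and f(X) is 961 x 520 per image.
f_t = torch.rand(1, 15 * 15, 520)
f_x = torch.rand(1, 31 * 31, 520)

enc1, enc2, ffn1, ffn2 = AttnBlock(), AttnBlock(), FFN(), FFN()
enc_t = ffn1(enc1(f_t, f_t))            # first encoder layer: enc'(T) = FFN(Norm(f(T) + enc(T)))
enc_t = ffn2(enc2(enc_t, enc_t))        # second encoder layer -> enc''(T)

dec1, dec2, ffn3 = AttnBlock(), AttnBlock(), FFN()
dec_x = dec1(f_x, f_x)                  # first decoder layer: self-attention only -> dec'(X)
dec_x = ffn3(dec2(dec_x, enc_t))        # second decoder layer: Q = dec'(X), K = V = enc''(T) -> dec''(X)

R = dec_x.transpose(1, 2).reshape(1, 520, 31, 31)   # similarity map R, 31 x 31 x 520
```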
The refinement module adopts the up-sampling network of U-Net and contains 5 layers. The first and second layers each contain a transposed convolution with a 3×3 kernel and stride 2 followed by a double convolution block; the double convolution block consists of two convolution blocks, each comprising a convolution layer with a 3×3 kernel, padding 1 and stride 1, a normalization layer and an activation layer. The remaining three layers differ from the first two mainly in the size and stride of their transposed convolutions: the third layer uses a 2×2 transposed convolution with stride 2, the fourth layer a 2×2 transposed convolution with stride 1, and the fifth layer a 1×1 transposed convolution with stride 1. Finally, a 2-D convolution with a 1×1 kernel is applied. Passing the similarity map R through the refinement module yields a 1×255×255 refined similarity map up(R).
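A minimal sketch of this up-sampling network is shown below; the transposed-convolution kernels and strides follow the description and reproduce the 31×31 to 255×255 growth, while the intermediate channel widths are assumptions.

```python
import torch
import torch.nn as nn

def double_conv(cin, cout):
    # two (3x3 conv, padding 1, stride 1) + normalization + activation blocks, as described
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class RefinementModule(nn.Module):
    def __init__(self, cin=520):
        super().__init__()
        chans = [cin, 256, 128, 64, 32, 16]          # assumed channel schedule
        tconv = [(3, 2), (3, 2), (2, 2), (2, 1), (1, 1)]  # (kernel, stride) per layer, from the text
        layers = []
        for i, (k, s) in enumerate(tconv):
            layers += [nn.ConvTranspose2d(chans[i], chans[i + 1], k, stride=s),
                       double_conv(chans[i + 1], chans[i + 1])]
        layers.append(nn.Conv2d(chans[-1], 1, 1))    # final 2-D convolution with a 1x1 kernel
        self.net = nn.Sequential(*layers)

    def forward(self, r):                            # r: similarity map (B, 520, 31, 31)
        return self.net(r)                           # up(R): (B, 1, 255, 255)

print(RefinementModule()(torch.rand(1, 520, 31, 31)).shape)  # torch.Size([1, 1, 255, 255])
```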
The above-mentioned region update module is applied during testing of the invention: when the similarity between the fused feature maps of the region to be tracked and the reference region is computed, the reference region is always taken from the previous frame as the infrared image of the target and part of its background, giving a new reference region T'; this region is then passed through the dual-feature extraction module and the encoder of the similarity calculation module to obtain the new encoding feature enc''(T').
In the infrared dim and small target tracking method under complex backgrounds, the image sequences of the training set are obtained as follows: the LaTOT data set is selected as the basic data set for model training. First, with the target center of each image in a sequence as the origin, the image is cropped outward from the origin with a crop width of 10 times the diagonal length of the bounding box to obtain a new 511×511 image. The new image is then translated, scaled, blurred and mirrored so that the target deviates from its original position. The reference image used for training is drawn at random from the first to the last frame of the sequence, and the image to be tracked is drawn at random from within 30 frames before or after the reference image. The training set used by the method contains 104726 images.
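As an illustration, the translation, scaling, blurring and mirroring operations can be expressed with torchvision transforms as sketched below; the specific magnitudes are assumptions, not values taken from the patent.

```python
from torchvision import transforms

# Applied to each cropped 511x511 training image (PIL image or tensor).
augment = transforms.Compose([
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.85, 1.15)),  # shift + scale
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),                      # blur
    transforms.RandomHorizontalFlip(p=0.5),                                        # mirror
])
```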
The image sequences of the test set of the infrared dim and small target tracking method under complex backgrounds are obtained as follows: in some infrared sequences the target is only 1×1 pixel, from which neither a useful depth feature map nor a useful HOG feature map can be extracted. Therefore, when the target is too small, a simulated infrared dim target of 5×5 to 7×7 pixels whose gray values follow a Gaussian distribution is superimposed on the real target, and the targets are annotated with the DarkLabel software.
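A minimal sketch of superimposing such a simulated Gaussian target is shown below; the peak intensity and Gaussian width are assumptions, and the patch is assumed not to touch the image border.

```python
import numpy as np

def inject_gaussian_target(img, cx, cy, size=5, peak=200.0, sigma=1.2):
    """Overlay a size x size patch whose gray values follow a 2-D Gaussian at (cx, cy)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    blob = peak * np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))        # 2-D Gaussian patch
    out = img.astype(np.float32).copy()
    ys, xs = slice(cy - half, cy + half + 1), slice(cx - half, cx + half + 1)
    out[ys, xs] = np.maximum(out[ys, xs], blob)                        # keep the brighter of blob/background
    return np.clip(out, 0, 255).astype(np.uint8)
```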
To solve the problem of robustly tracking infrared dim and small targets in different background environments, the invention provides an infrared dim and small target tracking method under complex backgrounds. By designing an end-to-end deep learning network, the tracking problem is decomposed into target classification and regression tasks, so that the real target in a complex scene can be tracked robustly, the influence of interferers around the target is reduced, tracking performance is improved, and accurate position information is provided for subsequent target feature extraction and key-event judgment.
Drawings
Fig. 1 is a diagram of a network model structure according to the present invention.
Fig. 2 is a block diagram of a dual feature extraction module according to the present invention.
FIG. 3 is a block diagram of a similarity calculation module according to the present invention.
Fig. 4 is a structural diagram of the refinement module of the present invention.
Fig. 5 is a schematic diagram of an embodiment of the present invention, where (a) shows the previous-frame reference region input to the tracking network, (b) shows the current-frame region to be tracked input to the tracking network, and (c) shows the bounding box output by the tracking network annotated on the current source image.
Detailed Description
The invention will be further described with reference to the accompanying drawings and detailed description below:
Referring to FIG. 1, the infrared dim and small target tracking method under a complex background of this embodiment comprises the following steps:
step 1: inputting an infrared image sequence Z to be tracked, wherein the sequence Z comprises n frames of images;
step 2: manually calibrating the target area to be tracked on the first frame image of the infrared image sequence Z, and filling in the surrounding background out to 2 times the diagonal length of the target area, obtaining the reference region T ∈ R^(127×127×3) (a minimal cropping sketch is given below);
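A minimal sketch of this cropping step is shown below, assuming OpenCV; the exact way the 2-times-diagonal padding is measured and the border replication used when the crop leaves the image are assumptions, and the same helper can be reused with a 4.5-times context and a 255×255 output for the region to be tracked in step 4.

```python
import cv2
import numpy as np

def crop_region(img, box, context_scale=2.0, out_size=127):
    """Crop a square of side context_scale * box diagonal centered on the box, then resize."""
    x, y, w, h = box                                   # top-left corner plus width/height
    cx, cy = x + w / 2.0, y + h / 2.0
    half = context_scale * np.hypot(w, h) / 2.0        # half-width of the padded square
    x0, y0, x1, y1 = int(cx - half), int(cy - half), int(cx + half), int(cy + half)
    pad = max(0, -x0, -y0, x1 - img.shape[1], y1 - img.shape[0])
    padded = cv2.copyMakeBorder(img, pad, pad, pad, pad, cv2.BORDER_REPLICATE)
    crop = padded[y0 + pad:y1 + pad, x0 + pad:x1 + pad]
    return cv2.resize(crop, (out_size, out_size))
```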
Step 2.1: feeding the reference region T into the dual-feature extraction module (DEM) for feature extraction; referring to the model framework in FIG. 2, the DEM comprises two parts, depth feature extraction and directional gradient histogram feature extraction;
step 2.2: depth feature extraction in DEM dual feature extraction: the ResNet-18 architecture is adopted, and consists of 5 network blocks, and the reference region T sequentially passes through the 5 network blocks to obtain a reference region depth feature map res (T). Wherein, the 1 st network block is composed of a convolution layer with a convolution kernel of 7×7, a step length of 2 and a filling of 1; the 2 nd network block consists of a maximum pooling layer with a convolution kernel of 3 multiplied by 3 and a step length of 1 and filled with 1 and two residual blocks, wherein the residual block structure is a convolution layer with the convolution kernel of 3 multiplied by 3 and the step length of 1 and filled with 1, but the output phase Concat of the output phase and the input phase of the residual block are connected in a residual way at the output position; the 3 rd to 5 th network blocks are respectively composed of two residual blocks. The structure of the residual block is similar to that of the residual block in the 2 nd network block, but the step size of the first convolution layer in the first residual block in each network block is set to 2, and the image is downsampled. And finally deleting the final average pooling layer and the linear layer to obtain a reference region depth feature map res (T) with the size of 15 multiplied by 512.
Step 2.3: directional gradient histogram feature extraction in the DEM: first, the reference region T is divided equally into cell units of 8×8 pixels; then the gradient magnitude and direction of each pixel in every cell are computed, and the gradient information of the pixels in a cell is accumulated over 8 orientations; next, 2×2 cells are combined into a larger block, and the gradient information of the cells in the block is concatenated and normalized to form the gradient-histogram feature of the block; finally, all blocks are combined to obtain a directional gradient histogram feature map hog(T) of size 15×15×8.
Step 2.4: fusion of the dual features in the DEM: the res(T) output by the depth feature extractor and the hog(T) output by the directional gradient histogram feature extractor are concatenated along the channel dimension to obtain the fused feature map cat(T).
Step 3: the fused feature map cat(T) of the reference region is input to the encoder of the similarity calculation module (SCM); see the encoder structure in FIG. 3. First, a 256-dimensional spatial position encoding is computed for the pixels of each row of the feature map cat(T) using the PyTorch deep learning library, the same operation is applied to the pixels of each column, and the row and column position encodings are concatenated along the channel dimension to obtain an encoding map P(T) of size 15×15×512. P(T) and cat(T) have the same spatial size, so corresponding pixels are added directly and every pixel of cat(T) carries the spatial information of its position; each channel of the summed feature map is then flattened with the view function, giving a 520×225 multi-dimensional feature vector f(T) that serves as the encoder input. The encoder has two encoder layers; after f(T) enters the first encoder layer, the target feature information in f(T) is enhanced by a multi-head attention mechanism, whose computation is:
Attention(Q, K, V) = softmax(QK^T / √d_k)·V, where d_k is the dimension of the key vectors;
MultiHead(Q, K, V) = Concat(head_1, ..., head_8)·W^O, where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V).
Here Q, K and V are identical and equal to f(T), and W_i^Q, W_i^K and W_i^V are weight matrices; i ∈ 1-8 means the multi-head attention mechanism attends to information at 8 positions, mapping the encoded content into 8 subspaces so that the representation capability of the model is stronger. f(T) is passed through the multi-head attention mechanism to obtain the feature enc(T); f(T) and enc(T) are added directly and then passed through a normalization layer and a feed-forward network (FFN) consisting of two linear layers and one normalization layer. After these operations the first encoding feature enc'(T) is obtained: enc'(T) = FFN(Norm(f(T) + enc(T))). The reference-region features are then enhanced again by the second encoder layer, whose operation is identical to the first. Finally, cat(T) passes through the encoder of the similarity calculation module to obtain the encoding feature enc''(T).
Step 4: for frames 2 to n of the infrared image sequence, the center position of the target in the previous frame is taken as the origin and the surrounding background is filled in out to 4.5 times the diagonal length of the previous frame's bounding box, forming the region to be tracked X_i, i ∈ 2-n.
Step 5: the region to be tracked X_i is sent into the dual-feature extraction module (DEM) for feature extraction, giving the fused feature map cat(X_i) of the region to be tracked.
Step 6: the fused feature map cat(X_i) is input to the decoder of the similarity calculation module (SCM); see the decoder structure in FIG. 3. The decoder also has two decoder layers: the first contains an attention module, and the second contains an attention module and a feed-forward network, where each attention module consists of a multi-head attention mechanism and a normalization layer. The feature map cat(X_i) undergoes spatial position encoding and flattening, then passes through the first decoder layer, which enhances the target information and key background information in the fused feature map of the region to be tracked and outputs the first decoding feature dec'(X_i). In the second decoder layer, dec'(X_i) and the encoding feature enc''(T) are taken as inputs; when they enter the multi-head attention mechanism the Q, K, V variables are no longer identical, but Q = dec'(X_i) and K = V = enc''(T), and the second decoder layer finally outputs the second decoding feature dec''(X_i). The dec''(X_i) feature map is then reshaped, using the view function in the PyTorch deep learning library, into a similarity map R of size 31×31×520.
Step 7: the similarity map R is passed into the refinement module (RM) to obtain the refined similarity map up(R); see the network structure in FIG. 4. The refinement module contains 5 layers. The first and second layers each contain a transposed convolution with a 3×3 kernel and stride 2 followed by a double convolution block, where the double convolution block consists of two convolution blocks, each comprising a convolution layer with a 3×3 kernel, padding 1 and stride 1, a normalization layer and an activation layer. The remaining three layers differ from the first two mainly in the size and stride of their transposed convolutions: the third layer uses a 2×2 transposed convolution with stride 2, the fourth a 2×2 transposed convolution with stride 1, and the fifth a 1×1 transposed convolution with stride 1. Finally, a 2-D convolution with a 1×1 kernel produces the 1×255×255 refined similarity map up(R).
Step 8: when tracking the target in the region to be tracked of frame 3 and later, the reference region is updated by the region update module (RUM). First, the previous frame's infrared image and its predicted bounding box are input; then, taking the center of the predicted bounding box as the origin, the infrared image is expanded outward by 2 times the diagonal of the bounding box to obtain a new reference image T'; finally, a new encoding feature enc''(T') is obtained through the dual-feature extraction module and the encoder of the similarity calculation module.
Step 9: construction of the training data set: the invention uses a modified LaTOT data set (104726 images) as the basic training set, modifying it to improve the robustness and accuracy of the network. First, each LaTOT image is expanded outward with the target center as the origin to obtain a base image; then random translation, scaling, blurring and mirroring are applied to create a new image. Applying these two steps to all LaTOT images gives the modified LaTOT data set. When training images are input, each image is cropped to obtain the reference image, and the image to be tracked is a frame randomly drawn from within 30 frames before or after the reference image in the sequence, likewise obtained by cropping.
Step 10: construction of the test data set: the invention uses a modified DIRST data set (13655 images) as the test set. In some image sequences of the original DIRST data set the target is only 1×1 pixel, from which neither valid depth features nor directional gradient histogram features can be extracted. For this reason, a new target of 5×5 to 7×7 pixels whose gray values follow a two-dimensional Gaussian distribution is superimposed over each 1×1-pixel target.
Step 11: model training process: the infrared dim and small target tracking network of the invention is an end-to-end model. The two cropped infrared images, the reference region and the region to be tracked, are fed into the network together for many iterations of optimization; the classification result, center-offset result and bounding-box result output by the head network are compared with the corresponding labels to compute the losses, and the network parameters are optimized by gradient back-propagation. The test results are evaluated with the IoU (intersection over union) and the precision (Euclidean distance between center points) as evaluation indices.
Step 12: model training parameter settings: training is performed on a Windows server with an NVIDIA RTX 3090 graphics card with 24 GB of video memory, and the development environment is PyCharm 2021.2.2. The total number of training epochs is 50; the learning rate is initialized to 0.01 and decays exponentially during training until it reaches 0.0005; the optimizer is stochastic gradient descent (SGD); and the network framework is PyTorch 1.8.0.
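A minimal sketch of this optimizer and learning-rate schedule is shown below; the per-epoch decay factor is derived from the stated endpoints (0.01 down to 0.0005 over 50 epochs), while momentum and weight decay are assumptions not given in the text.

```python
import torch

params = torch.nn.Linear(8, 8).parameters()            # stand-in for the tracker's parameters
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=1e-4)
gamma = (0.0005 / 0.01) ** (1.0 / 50)                   # ~0.942, reaches 0.0005 after 50 epochs
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for epoch in range(50):
    # one training pass over the modified LaTOT set, calling optimizer.step(), would go here
    scheduler.step()                                     # exponential decay of the learning rate
```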

Claims (8)

1. A method for tracking infrared dim and small targets under a complex background, characterized in that the method comprises the following steps:
step 1: inputting an infrared image sequence Z containing an infrared dim and small target;
step 2: selecting the target area in the first frame image of the infrared image sequence Z as the reference region T;
step 3: inputting the reference region T into a dual-feature extraction module to obtain a fused feature map cat(T);
step 4: taking the center position of the target in the previous frame as the origin, obtaining the region to be tracked X_i from each subsequent frame image of the infrared image sequence Z, where i ∈ 2-n and n is the total number of frames of the sequence Z;
step 5: inputting the region to be tracked X_i into the dual-feature extraction module to obtain a fused feature map cat(X_i);
step 6: inputting the two fused feature maps together into a similarity calculation module to obtain a similarity map R;
step 7: passing the similarity map R through a refinement module to obtain a similarity map up(R) whose size is consistent with the region to be tracked;
step 8: passing the similarity map up(R) through a head network to output the target center position and bounding-box size, giving the tracking result for the current frame i;
step 9: replacing the reference region T with the target area tracked in the current frame i, and repeating steps 3-9 until the sequence ends.
2. The method for tracking infrared dim and small targets under a complex background according to claim 1, characterized in that:
the dual-feature extraction module comprises a depth feature extraction network and a gray-gradient feature extraction method; the depth feature extraction network adaptively learns the shallow detail features and deep semantic features of the image through deep learning and gradient descent, and the gray-gradient feature extraction method extracts the gray-gradient histogram features of local regions of the image by mathematical computation;
the similarity calculation module comprises a Transformer network structure, in which a self-attention mechanism enhances the target information of the feature maps and suppresses background information, and a cross-attention mechanism computes the similarity between the feature maps at a global level to obtain a similarity map;
the refinement module, by modifying the U-Net network from the image segmentation task, reduces the receptive field of each pixel of the similarity map so that it keeps a one-to-one correspondence with the pixels of the region to be tracked, increases the amount of information in the similarity map, and improves the accuracy of the head-network output;
the region update module acts during testing: because the background region of an infrared dim and small target changes frequently in complex scenes, using only the target and partial background framed in the first frame would greatly reduce the accuracy of subsequent similarity calculations; to improve tracking accuracy, the predicted target region from tracking the previous frame is used as the new reference region.
3. The method for tracking infrared dim and small targets under a complex background according to claim 2, characterized in that: the dual-feature extraction module consists of a ResNet-18 depth network and a directional gradient histogram feature extractor; the reference region and the region to be tracked are each passed through the ResNet-18 network and the directional gradient histogram feature extractor to obtain feature maps, which are then concatenated along the channel dimension to obtain the fused feature maps.
4. The method for tracking infrared dim and small targets under a complex background according to claim 2, characterized in that: the similarity calculation module adds a position encoding to the feature map of the reference region and reshapes the feature map into a multi-dimensional vector through the view function in PyTorch, so that every pixel in the feature map records its spatial information; the target information in the feature map is enhanced and the background region suppressed by the self-attention mechanism of the Transformer; the feature map of the region to be tracked undergoes the same operations to obtain a vector with position encoding added, and its target information is enhanced by a multi-head attention mechanism; finally, the enhanced reference-region feature map and the feature map of the region to be tracked are passed together into the cross-attention mechanism of the Transformer, which searches the feature map of the region to be tracked for the most similar position and generates the similarity map.
5. The method for tracking infrared dim and small targets under a complex background according to claim 2, characterized in that: the refinement module is split from the U-Net network of the image segmentation task, keeping only the up-sampling network; its input is the similarity map output by the similarity calculation module, and it finally yields a refined similarity map up(R) whose size is consistent with the region to be tracked.
6. The method for tracking infrared dim and small targets under a complex background according to claim 2, characterized in that: the region update module makes the tracking method continuously use the target region and partial background region of the previous tracked frame as the reference region during testing; updating the reference region in real time copes better with complex and changing tracking environments, makes the similarity calculation between the region to be tracked and the reference region more accurate, and improves tracking precision.
7. The method for tracking infrared dim and small targets under a complex background according to claim 2, characterized in that: the image sequences of the training set of the method are obtained as follows: the LaTOT data set is selected as the basic data set for model training; first, with the target center of each image in a sequence as the origin, the image is cropped outward from the origin with a crop width of 10 times the diagonal length of the bounding box to obtain a new 511×511 image; the new image is then translated, scaled, blurred and mirrored so that the target deviates from its original position; the reference image used for training is drawn at random from the first to the last frame of the sequence, and the image to be tracked is drawn at random from within 30 frames before or after the reference image.
8. The method for tracking infrared dim and small targets under a complex background according to claim 7, characterized in that: the image sequences of the test set are obtained as follows: the DIRST data set is selected as the test set; in some infrared sequences the target is only 1×1 pixel, from which neither a useful depth feature map nor a useful HOG feature map can be extracted, so when the target is too small a simulated infrared dim target of 5×5 to 7×7 pixels whose gray values follow a Gaussian distribution is superimposed on the real target, and the targets are annotated.
CN202310268997.8A 2023-03-17 2023-03-17 Infrared dim target tracking method under complex background Pending CN116402851A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310268997.8A CN116402851A (en) 2023-03-17 2023-03-17 Infrared dim target tracking method under complex background

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310268997.8A CN116402851A (en) 2023-03-17 2023-03-17 Infrared dim target tracking method under complex background

Publications (1)

Publication Number Publication Date
CN116402851A true CN116402851A (en) 2023-07-07

Family

ID=87006629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310268997.8A Pending CN116402851A (en) 2023-03-17 2023-03-17 Infrared dim target tracking method under complex background

Country Status (1)

Country Link
CN (1) CN116402851A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117011335A (en) * 2023-07-26 2023-11-07 山东大学 Multi-target tracking method and system based on self-adaptive double decoders
CN117011335B (en) * 2023-07-26 2024-04-09 山东大学 Multi-target tracking method and system based on self-adaptive double decoders
CN117274823A (en) * 2023-11-21 2023-12-22 成都理工大学 Visual transducer landslide identification method based on DEM feature enhancement
CN117274823B (en) * 2023-11-21 2024-01-26 成都理工大学 Visual transducer landslide identification method based on DEM feature enhancement

Similar Documents

Publication Publication Date Title
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
CN114782691B (en) Robot target identification and motion detection method based on deep learning, storage medium and equipment
CN113807187B (en) Unmanned aerial vehicle video multi-target tracking method based on attention feature fusion
CN111539887B (en) Channel attention mechanism and layered learning neural network image defogging method based on mixed convolution
CN108038435B (en) Feature extraction and target tracking method based on convolutional neural network
CN116402851A (en) Infrared dim target tracking method under complex background
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN113780149A (en) Method for efficiently extracting building target of remote sensing image based on attention mechanism
CN113344932B (en) Semi-supervised single-target video segmentation method
CN111242026B (en) Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN113486894B (en) Semantic segmentation method for satellite image feature parts
CN113033432A (en) Remote sensing image residential area extraction method based on progressive supervision
CN110751271B (en) Image traceability feature characterization method based on deep neural network
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
CN110633706B (en) Semantic segmentation method based on pyramid network
Tian et al. Semantic segmentation of remote sensing image based on GAN and FCN network model
Fan et al. Hcpvf: Hierarchical cascaded point-voxel fusion for 3d object detection
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN112115786A (en) Monocular vision odometer method based on attention U-net
CN114494934A (en) Unsupervised moving object detection method based on information reduction rate

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination