CN116309705A - Satellite video single-target tracking method and system based on feature interaction


Info

Publication number
CN116309705A
CN116309705A (application CN202310149943.XA)
Authority
CN
China
Prior art keywords
target
satellite video
interaction
prediction result
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310149943.XA
Other languages
Chinese (zh)
Other versions
CN116309705B (en)
Inventor
苏芝娟
贾玉童
彭思卿
万刚
刘佳
汪国平
刘伟
尹云霞
武易天
李功
谢珠利
王振宇
李矗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Original Assignee
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peoples Liberation Army Strategic Support Force Aerospace Engineering University filed Critical Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Priority to CN202310149943.XA priority Critical patent/CN116309705B/en
Publication of CN116309705A publication Critical patent/CN116309705A/en
Application granted granted Critical
Publication of CN116309705B publication Critical patent/CN116309705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/251: Analysis of motion using feature-based methods involving models
    • G06N 3/02: Neural networks; G06N 3/08: Learning methods
    • G06T 7/277: Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G06V 10/40: Extraction of image or video features
    • G06V 10/764: Image or video recognition using classification, e.g. of video objects; G06V 10/765: using rules for classification or partitioning the feature space
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition using neural networks
    • G06V 20/13: Satellite images
    • G06T 2207/10032: Satellite or aerial image; Remote sensing
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target tracking method for satellite video, designed for the characteristics of satellite video targets and based on a transformer backbone network. The tracking algorithm comprises two branches: one performs target prediction based on feature extraction and interaction, and the other performs target prediction through trajectory fitting. The two branch prediction results are then fused to obtain the final position, improving both efficiency and accuracy.

Description

Satellite video single-target tracking method and system based on feature interaction
Technical Field
The invention relates to the technical field of remote sensing image processing, in particular to a satellite video single-target tracking method and system based on feature interaction.
Background
With the diversification of remote sensing application scenarios, single-frame static remote sensing images can no longer meet the requirements of dynamic monitoring of ground features. Video satellites can acquire time-series dynamic images of an observation area, providing rich information for applications such as traffic monitoring, rapid response to natural disasters, and military security. Target tracking is one of the key technologies in video analysis and understanding applications.
Target tracking methods are mainly divided into generative and discriminative models. Target tracking based on a generative model can be regarded as a search problem: the target region in the current frame is modeled, and the most similar region is selected as the predicted location in the next frame. However, generative models do not make full use of background information and appearance variations. In contrast, discriminative models treat target tracking as a binary classification problem: a classifier is trained with the target region in the current frame marked as positive samples and the background marked as negative samples, and in subsequent frames the classifier identifies the foreground and the result is updated. In deep learning approaches, the tracker uses deep features with strong representational power instead of hand-crafted features. In recent years, deep convolutional neural networks have been introduced into target tracking with good results, attracting the attention of many researchers.
To address the characteristics of satellite video targets and improve tracking performance, researchers have conducted many studies, including deep learning-based and correlation filtering-based approaches.
Chen et al. propose a rotation-adaptive tracker with motion constraint (RAMC) based on kernelized correlation filtering, discussing how a mixture of angle and motion information, from both the rotation and translation aspects, can improve satellite video target tracking. Xuan et al. propose an adaptive rotation correlation filter (RACF) algorithm to solve the tracking drift caused by target rotation. Song et al. propose a joint Siamese attention-aware network (JSANet) containing self-attention and cross-attention modules for efficient remote sensing tracking, countering the negative effects of weak satellite video target features and background noise. Shao et al. propose a prediction-attention-inspired Siamese network (PASiam) for satellite video tracking, which constructs a fully convolutional Siamese network with shallow features to obtain fine-grained appearance features; predictive attention is further proposed to handle occlusion and blurring. Zhang et al. propose a prediction network based on a fully convolutional network (FCN) to predict, for each pixel, the probability that the target of the next frame is located there, using previously obtained results; on this basis, a segmentation method is introduced to generate a feasible region for the target in each frame and assign that region a high probability.
However, due to the characteristics of remote sensing images, satellite video tracking faces a number of problems compared with conventional target tracking tasks or aerial image tracking based on unmanned aerial vehicles (UAVs). The challenges of applying target tracking techniques to satellite video datasets are as follows:
The target occupies a small proportion of the frame: the width and height of high-resolution satellite video are typically above 2000 pixels, while the object of interest covers only around 0.01% of the pixels of the whole frame, or even less. The large background enlarges the search range of classical tracking algorithms while reducing tracking performance. In addition, a small target has few features and resembles its surroundings, which leads to poor tracking robustness and large tracking errors.
The video frame rate is low: due to limitations of on-board hardware, the frame rate of satellite video is typically low, resulting in significant movement of the target between frames, which further affects tracking prediction and model updating. For example, if the target suddenly stops, enters a shadow, or moves abruptly, existing tracking systems may easily lose it.
Abrupt changes in illumination: since satellite video is acquired from high altitude in space, the lighting and the atmospheric refractive index change with the motion of the orbiting satellite, which can lead to abrupt changes in frame illumination. These lighting differences have an important effect on the performance and accuracy of target tracking.
Although the prior art discloses applying single-target tracking methods designed for ordinary video to satellite video target tracking, the following problems remain:
1. The performance of such algorithms is constrained by the limited number of public satellite video datasets.
2. In terms of the efficiency of satellite video target tracking, combining multiple processing modules limits the achievable efficiency gains.
3. In terms of the accuracy of satellite video target tracking, satellite video targets differ from ordinary video targets, and conventional algorithms have not been sufficiently adapted to their characteristics.
Disclosure of Invention
To address the above problems, the invention provides a satellite video single-target tracking method based on feature interaction.
The technical solution for realizing the purpose of the invention is as follows:
the satellite video single-target tracking method based on the feature interaction is characterized by comprising the following steps of:
step 1: inputting satellite video data to be tracked;
step 2: performing feature extraction and interaction on input satellite video data based on an improved transducer network to obtain a target prediction result transR;
step 3: performing target prediction on input satellite video data based on track fitting to obtain a target prediction result ployR;
step 4: fusing the transR and ployR to obtain a final prediction result finalR;
step 5: and outputting a target positioning result in the satellite video according to the obtained final prediction result finalR.
Further, the improved transformer network described in step 2 is composed of a transformer backbone for feature extraction and feature interaction, and a predictor for target localization.
Further, the specific operation steps of step 2 include:
step 21: cropping the template image and the search image to 2² times and 4² times the target bounding box to be tracked in the input satellite video data, respectively;
step 22: reshaping the cropped template image Z and search image X into two flattened sequences of 2D image blocks Z_p and X_p, where (P, P) is the image block size, and N_z = H_zW_z/P² and N_x = H_xW_x/P² are the numbers of image blocks of the template image and the search image;
step 23: mapping the 2D image blocks obtained in step 22 to 1D tokens of dimension C by linear projection, and adding position embeddings to obtain the input sequences of the backbone, comprising a template sequence e_0 and a search sequence s_0;
step 24: cropping the central area of the template image to obtain a central-region sequence e_0*;
step 25: concatenating the search sequence s_0, the template sequence e_0 and the central-region sequence e_0* along the first dimension, and sending the concatenation result to the transformer backbone network;
step 26: extracting and interacting the features of the template image and the search image through the transformer backbone network;
step 27: performing step 26 L times to output the target-related search feature s_L and sending it to the predictor;
step 28: the predictor feeds s_L directly to a classification head Φ_cls and a regression head Φ_reg to predict the position and shape of the target, namely:
y_reg = Φ_reg(s_L), y_cls = Φ_cls(s_L)
where y_reg and y_cls are the regression and classification results of the target, used to estimate its position and shape.
Further, the specific steps of step 24 include:
step 241: cropping a smaller region Z* at the center of the template image;
step 242: partitioning Z* into image blocks Z*_p;
step 243: computing the position embeddings of Z*_p;
step 244: mapping Z*_p with the same linear projection as the original image blocks Z_p, and adding the mapped features to the position embeddings to obtain the sequence e_0*.
Further, the specific steps of step 3 include:
step 31: collecting the center coordinates {P_{t-N}, P_{t-N+1}, …, P_{t-1}} of the bounding boxes obtained in the previous N frames and fitting them with two polynomial functions:
x_t = F_x(x_{t-N}, x_{t-N+1}, ..., x_{t-1})   (3)
y_t = F_y(y_{t-N}, y_{t-N+1}, ..., y_{t-1})   (4)
where x_t and y_t are the x and y coordinates of P_t;
step 32: setting a threshold ε; when the displacement over the previous N frames along the x axis or the y axis is smaller than ε, the target is assumed to be stationary in the corresponding direction, and the position prediction results are given by formulas (5) and (6) (reproduced as images in the original), where Δx and Δy are the displacements along the x and y axes over the past N frames, and ε ≤ 0.3.
Further, the specific operation steps of step 4 include:
step 41: calculating the average displacement distance s over the previous N frames;
step 42: if the distance between ployR and transR is less than 0.8s, finalR = transR;
step 43: if the distance between ployR and transR is greater than 0.8s, finalR takes the coordinates of the center between ployR and transR.
A satellite video single-target tracking system based on feature interaction, comprising:
a prediction module based on feature extraction and interaction, used for performing feature extraction and interaction on the input satellite video based on the improved transformer network to obtain a corresponding target prediction result;
the prediction module based on track fitting is used for carrying out target prediction on the input satellite video based on track fitting to obtain a corresponding target prediction result;
the fusion module is used for fusing the target prediction result obtained by the prediction module based on feature extraction and interaction with the target prediction result obtained by the prediction module based on track fitting to obtain a final prediction result;
and the target positioning module is used for outputting a target positioning result according to the final prediction result.
A computer readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of the satellite video single-target tracking method based on feature interaction according to any one of claims 1-7.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the satellite video single-target tracking method based on feature interaction as claimed in any one of claims 1 to 7 when the program is executed by the processor.
The beneficial effects are as follows:
First, all parameters in the backbone network are initialized with the parameters pre-trained by the GLIP pre-training model, so as to achieve better zero-shot and few-shot transfer performance and alleviate the problem of insufficient satellite video datasets.
Second, a multi-head attention transformer backbone with interaction is adopted, so that full interaction between the template features and the search-sample features is realized; the feature extraction stage and the interaction stage of a conventional CNN-backbone pipeline are merged, which reduces the number of modules, simplifies the workflow, and effectively improves efficiency.
Third, the downsampling operation may cause unavoidable information loss. To reduce the negative influence of downsampling, the invention adds a complete template target block when the image blocks are input, enabling the transformer backbone to capture more details in the important template image region and thereby improving recognition accuracy.
Fourth, to further mitigate model drift, the invention uses polynomial functions to fit the trajectory of the target over the past N frames. The polynomial functions estimate the historical motion pattern of the target, which is used to predict its position in the next frame.
Drawings
FIG. 1 is a conventional transformer network architecture;
FIG. 2 is a flow chart of the target tracking method according to the present invention;
FIG. 3 is a flow chart of the transformer backbone feature extraction and interaction stage.
Detailed Description
In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
The invention provides a target tracking method for satellite video, designed for the characteristics of satellite video targets and based on a transformer backbone network. The target tracking algorithm comprises two branches: one performs target prediction based on feature extraction and interaction, and the other performs target prediction through trajectory fitting. The two branch prediction results are fused to obtain the final position; the flow chart of the invention is shown in Fig. 2.
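As an illustration of this two-branch structure, the following minimal Python sketch shows how the branches could be orchestrated over a clip; all function names (transformer_branch, trajectory_branch, fuse) are hypothetical placeholders for the modules detailed below, not the actual implementation.

def track(frames, init_box, transformer_branch, trajectory_branch, fuse):
    # Two-branch tracking loop over a satellite video clip; boxes are (x, y, w, h).
    results = [init_box]
    centers = [box_center(init_box)]                      # history of bounding-box centers
    for frame in frames[1:]:
        transR = transformer_branch(frame, results[-1])   # branch 1: feature extraction + interaction
        ployR = trajectory_branch(centers)                # branch 2: polynomial trajectory fitting
        finalR = fuse(transR, ployR, centers)             # distance-based fusion of the two predictions
        results.append(finalR)
        centers.append(box_center(finalR))
    return results

def box_center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)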
1. Branch one: target prediction based on feature extraction and interaction
This branch consists of only the transformer backbone and the predictor. The conventional transformer network architecture, as shown in Fig. 1, needs a CNN backbone network to perform feature extraction, feature interaction and target localization on the template image and the search image. To reduce the number of modules and simplify the pipeline, the invention realizes both the feature extraction and the feature interaction functions with a single transformer backbone. The specific steps are as follows:
Step 1: The template image and the search image are cropped to 2² times and 4² times the target bounding box to be tracked in the input satellite video, respectively.
Step 2: image of template
Figure BDA0004090419700000081
And search for images +.>
Figure BDA0004090419700000082
Remodelling into two flattened sequences of 2D image blocks +.>
Figure BDA0004090419700000083
And->
Figure BDA0004090419700000084
Where (P, P) is the tile (square) size, N z =H z W z /P 2 And N x =H x W x /P 2 The number of image blocks for the template image and the search image; the 2D image block is mapped to a 1D token with a C dimension by linear projection. After adding the 1D token with embedded position, the input sequence of the trunk is obtained, which comprises a template sequence +.>
Figure BDA0004090419700000085
And search sequence->
Figure BDA0004090419700000086
To provide more detailed target information to the transformer backbone, a smaller region is cropped in the center of the template image
Figure BDA0004090419700000087
And Z is continued to form image block +.>
Figure BDA0004090419700000088
Wherein->
Figure BDA0004090419700000089
The division line on the central area Z is located at the center of the division line on the sample image Z to ensure +.>
Figure BDA00040904197000000810
And original plaque Z p Containing different target information. Obtaining a central region image block->
Figure BDA00040904197000000811
After that, their position embeddings are calculated. Then, with Z p Identical linear projection map->
Figure BDA00040904197000000812
And the mapped features and the positions are embedded and added to obtain a sequence e 0*
Step 3: directly concatenating search sequences s along a first dimension 0 Template sequence e 0 And a center region sequence e 0* And sends the splice results together to the transformer backbone.
Step 4: and after the splicing result is sent to the transducer backbone network, extracting and interacting the characteristics of the template image and the search image through the transducer backbone network. The specific flow is shown in fig. 3. All parameters in the backbone network are initialized with the parameters of the visual branch pre-training of GLIP.
The whole transformer backbone is executed L times. We use e_l and s_l to denote the input template and search sequences of layer (l+1), l = 0, …, L-1, and Att denotes the attention operation. In our transformer backbone, the features of the template image and the search image are learned through a(e_l, s_l) and a(s_l, e_l) and influence each other: Att(e_l) contains information from s_l, and vice versa. Information interaction between the template and search features therefore exists in every layer of the transformer backbone, so no additional interaction module needs to be added after the backbone. The output search feature s_L is sent directly to the predictor for target localization.
Att(e_l) = softmax([a(e_l, e_l), a(e_l, s_l)]) [e_l W_V, s_l W_V]^T
Att(s_l) = softmax([a(s_l, e_l), a(s_l, s_l)]) [e_l W_V, s_l W_V]^T   (1)
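The mixed attention of equation (1) can be illustrated with the following single-head sketch (multi-head attention, feed-forward layers and normalization are omitted; PyTorch is an assumed implementation choice). Both streams attend to the concatenation of template and search tokens, so feature extraction and template-search interaction happen inside the same layer.

import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)   # shared W_V as in equation (1)
        self.scale = dim ** -0.5

    def forward(self, e, s):
        # e: (B, N_e, C) template tokens, s: (B, N_s, C) search tokens
        kv = torch.cat([e, s], dim=1)               # keys/values come from [e_l, s_l]
        k, v = self.wk(kv), self.wv(kv)
        att_e = torch.softmax(self.wq(e) @ k.transpose(-2, -1) * self.scale, dim=-1) @ v
        att_s = torch.softmax(self.wq(s) @ k.transpose(-2, -1) * self.scale, dim=-1) @ v
        return att_e, att_s                         # Att(e_l), Att(s_l)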
Step 5: predictor(s)
After transforming the former backbone we get a target-related search feature S L And is directly added to a sort head phi cls Regression header phi reg Prediction is performed
y reg =Φ reg (S L ),y cls =Φ cls (S L ) (2)
Wherein y is reg 、y cls And representing the regression and classification results of the target for estimating the position and shape of the target.
2. Branch two: target prediction based on track fitting
Since objects in satellite video typically rotate slowly, a polynomial function is used to fit the trajectory of the object in the past N frames. The polynomial function is used to estimate the historical motion pattern of the object and can predict the position of the object in the next frame.
To predict the center coordinate P_t of the object in the t-th frame, the center coordinates {P_{t-N}, P_{t-N+1}, …, P_{t-1}} of the bounding boxes obtained in the past N frames are collected and fitted with two polynomial functions F_x(·) and F_y(·), which represent the x and y coordinates respectively. The two polynomial fitting functions can be expressed as:
x_t = F_x(x_{t-N}, x_{t-N+1}, ..., x_{t-1})   (3)
y_t = F_y(y_{t-N}, y_{t-N+1}, ..., y_{t-1})   (4)
where x_t and y_t are the x coordinate and y coordinate of P_t.
To avoid overfitting to a stationary object, a threshold ε is used to determine whether the object is moving. When the displacement over the past N frames along the x axis or the y axis is smaller than ε, the target is assumed to be stationary in the corresponding direction, and its position prediction is given by formulas (5) and (6) (reproduced as images in the original), where Δx and Δy are the displacements along the x and y axes over the past N frames, and ε is set to 0.3 or less.
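A minimal sketch of this trajectory-fitting branch using numpy.polyfit; the polynomial degree and the exact form of the stationarity test are assumptions, since formulas (5) and (6) appear only as images in the source.

import numpy as np

def fit_trajectory(centers, degree=2, eps=0.3):
    # centers: list of (x, y) box centers from the past N frames; returns the predicted next center.
    centers = np.asarray(centers, dtype=float)
    t = np.arange(len(centers))

    def predict_axis(vals):
        if np.abs(vals[-1] - vals[0]) < eps:          # displacement below threshold: assume stationary
            return float(vals[-1])
        coeffs = np.polyfit(t, vals, deg=min(degree, len(vals) - 1))
        return float(np.polyval(coeffs, len(vals)))   # evaluate the fitted polynomial at the next step

    return predict_axis(centers[:, 0]), predict_axis(centers[:, 1])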
3. Obtaining final predicted outcomes
The average displacement distance s over the past N frames is calculated. The result of target prediction based on feature extraction and interaction is denoted transR, the result of target prediction based on trajectory fitting is denoted ployR, and the final prediction result is denoted finalR. If the distance between ployR and transR is less than 0.8s, then finalR = transR; if the distance between ployR and transR is greater than 0.8s, finalR takes the coordinates of the center between ployR and transR as the final prediction result.
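A minimal sketch of this fusion rule over (x, y) centers; the behaviour when the distance equals exactly 0.8s, and the fallback when fewer than two past centers are available, are assumptions.

import numpy as np

def fuse_predictions(transR, ployR, past_centers):
    # transR, ployR: (x, y) centers from the two branches; past_centers: centers of the past N frames.
    past_centers = np.asarray(past_centers, dtype=float)
    if len(past_centers) < 2:
        return tuple(transR)
    s = np.mean(np.linalg.norm(np.diff(past_centers, axis=0), axis=1))   # average displacement distance
    d = np.linalg.norm(np.asarray(transR, dtype=float) - np.asarray(ployR, dtype=float))
    if d < 0.8 * s:
        return tuple(transR)
    return tuple((np.asarray(transR, dtype=float) + np.asarray(ployR, dtype=float)) / 2.0)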
What is not described in detail in this specification is prior art known to those skilled in the art. Although the present invention has been described with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the described embodiments may be modified or their elements replaced by equivalents; any modifications, equivalent substitutions and improvements made without departing from the spirit and principles of the present invention remain within its scope.

Claims (9)

1. The satellite video single-target tracking method based on feature interaction, characterized by comprising the following steps:
step 1: inputting satellite video data to be tracked;
step 2: performing feature extraction and interaction on the input satellite video data based on an improved transformer network to obtain a target prediction result transR;
step 3: performing target prediction on the input satellite video data based on track fitting to obtain a target prediction result ployR;
step 4: fusing transR and ployR to obtain a final prediction result finalR;
step 5: outputting the target positioning result in the satellite video according to the final prediction result finalR.
2. The method for tracking single satellite video targets based on feature interaction according to claim 1, wherein the improved transformer network in step 2 is composed of a transformer backbone and a predictor, the transformer backbone is used for feature extraction and feature interaction, and the predictor is used for target positioning.
3. The method for tracking a single target of a satellite video based on characteristic interaction according to claim 2, wherein the specific operation steps of step 2 include:
step 21: cropping the template image and the search image to 2² times and 4² times the target bounding box to be tracked in the input satellite video data, respectively;
step 22: reshaping the cropped template image Z and search image X into two flattened sequences of 2D image blocks Z_p and X_p, where (P, P) is the image block size, and N_z = H_zW_z/P² and N_x = H_xW_x/P² are the numbers of image blocks of the template image and the search image;
step 23: mapping the 2D image blocks obtained in step 22 to 1D tokens of dimension C by linear projection, and adding position embeddings to obtain the input sequences of the backbone, comprising a template sequence e_0 and a search sequence s_0;
step 24: cropping the central area of the template image to obtain a central-region sequence e_0*;
step 25: concatenating the search sequence s_0, the template sequence e_0 and the central-region sequence e_0* along the first dimension, and sending the concatenation result to the transformer backbone network;
step 26: extracting and interacting the features of the template image and the search image through the transformer backbone network;
step 27: performing step 26 L times to output the target-related search feature s_L and sending it to the predictor;
step 28: the predictor feeds s_L directly to a classification head Φ_cls and a regression head Φ_reg to predict the position and shape of the target, namely:
y_reg = Φ_reg(s_L), y_cls = Φ_cls(s_L)
where y_reg and y_cls are the regression and classification results of the target, used to estimate its position and shape.
4. A satellite video single-target tracking method based on feature interaction according to claim 3, wherein the specific steps of step 24 include:
step 241: cropping a smaller region Z* at the center of the template image;
step 242: partitioning Z* into image blocks Z*_p;
step 243: computing the position embeddings of Z*_p;
step 244: mapping Z*_p with the same linear projection as the original image blocks Z_p, and adding the mapped features to the position embeddings to obtain the sequence e_0*.
5. The method for tracking a single target of a satellite video based on characteristic interaction according to claim 4, wherein the specific steps of step 3 include:
step 31: collecting the center coordinates {P_{t-N}, P_{t-N+1}, …, P_{t-1}} of the bounding boxes obtained in the previous N frames and fitting them with two polynomial functions:
x_t = F_x(x_{t-N}, x_{t-N+1}, ..., x_{t-1})   (3)
y_t = F_y(y_{t-N}, y_{t-N+1}, ..., y_{t-1})   (4)
where x_t and y_t are the x and y coordinates of P_t;
step 32: setting a threshold ε; when the displacement over the previous N frames along the x axis or the y axis is smaller than ε, the target is assumed to be stationary in the corresponding direction, and the position prediction results are given by formulas (5) and (6) (reproduced as images in the original), where Δx and Δy are the displacements along the x and y axes over the past N frames, and ε ≤ 0.3.
6. The method for tracking a single target of a satellite video based on characteristic interaction according to claim 5, wherein the specific operation steps of step 4 include:
step 41: calculating the average displacement distance s over the previous N frames;
step 42: if the distance between ployR and transR is less than 0.8s, finalR = transR;
step 43: if the distance between ployR and transR is greater than 0.8s, finalR takes the coordinates of the center between ployR and transR.
7. A satellite video single-target tracking system based on feature interaction, comprising:
a prediction module based on feature extraction and interaction, used for performing feature extraction and interaction on the input satellite video based on the improved transformer network to obtain a corresponding target prediction result;
the prediction module based on track fitting is used for carrying out target prediction on the input satellite video based on track fitting to obtain a corresponding target prediction result;
the fusion module is used for fusing the target prediction result obtained by the prediction module based on feature extraction and interaction with the target prediction result obtained by the prediction module based on track fitting to obtain a final prediction result;
and the target positioning module is used for outputting a target positioning result according to the final prediction result.
8. A computer readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of the satellite video single-target tracking method based on feature interaction according to any one of claims 1-7.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the satellite video single-target tracking method based on feature interaction as claimed in any one of claims 1 to 7 when the program is executed by the processor.
CN202310149943.XA 2023-02-22 2023-02-22 Satellite video single-target tracking method and system based on feature interaction Active CN116309705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310149943.XA CN116309705B (en) 2023-02-22 2023-02-22 Satellite video single-target tracking method and system based on feature interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310149943.XA CN116309705B (en) 2023-02-22 2023-02-22 Satellite video single-target tracking method and system based on feature interaction

Publications (2)

Publication Number Publication Date
CN116309705A true CN116309705A (en) 2023-06-23
CN116309705B CN116309705B (en) 2024-07-30

Family

ID=86831596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310149943.XA Active CN116309705B (en) 2023-02-22 2023-02-22 Satellite video single-target tracking method and system based on feature interaction

Country Status (1)

Country Link
CN (1) CN116309705B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108470356A (en) * 2018-03-15 2018-08-31 浙江工业大学 A kind of target object fast ranging method based on binocular vision
CN109631829A (en) * 2018-12-17 2019-04-16 南京理工大学 A kind of binocular distance measuring method of adaptive Rapid matching
CN110580713A (en) * 2019-08-30 2019-12-17 武汉大学 Satellite video target tracking method based on full convolution twin network and track prediction
CN113076809A (en) * 2021-03-10 2021-07-06 青岛海纳云科技控股有限公司 High-altitude falling object detection method based on visual Transformer
WO2023273136A1 (en) * 2021-06-29 2023-01-05 常州工学院 Target object representation point estimation-based visual tracking method
CN113963032A (en) * 2021-12-01 2022-01-21 浙江工业大学 Twin network structure target tracking method fusing target re-identification
CN114372173A (en) * 2022-01-11 2022-04-19 中国人民公安大学 Natural language target tracking method based on Transformer architecture
CN114842047A (en) * 2022-03-29 2022-08-02 武汉大学 Twin network satellite video target tracking method based on motion prior

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ning Wang et al.: "Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking", arXiv:2103.11681v2, 24 March 2021, pages 1-13 *
Wu Jiajun: "Moving Target Detection and Tracking Based on High-Speed Vision", China Master's Theses Full-text Database, Information Science and Technology Series, 15 August 2019, page 4 *
Wang Qiang, Lu Xianling: "Transformer Target Tracking Algorithm with Spatio-temporal Template Updating", Journal of Frontiers of Computer Science and Technology, 30 September 2022, page 1 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058366A (en) * 2023-07-04 2023-11-14 南京航空航天大学 Large aircraft large part point cloud semantic segmentation method based on pre-training large model
CN117058366B (en) * 2023-07-04 2024-03-01 南京航空航天大学 Large aircraft large part point cloud semantic segmentation method based on pre-training large model
CN117197192A (en) * 2023-11-06 2023-12-08 北京观微科技有限公司 Satellite video single-target tracking method and device
CN117197192B (en) * 2023-11-06 2024-02-23 北京观微科技有限公司 Satellite video single-target tracking method and device

Also Published As

Publication number Publication date
CN116309705B (en) 2024-07-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant