CN116309705A - Satellite video single-target tracking method and system based on feature interaction


Info

Publication number
CN116309705A
CN116309705A (application CN202310149943.XA)
Authority
CN
China
Prior art keywords
target
satellite video
interaction
prediction result
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310149943.XA
Other languages
Chinese (zh)
Other versions
CN116309705B (en)
Inventor
苏芝娟
贾玉童
彭思卿
万刚
刘佳
汪国平
刘伟
尹云霞
武易天
李功
谢珠利
王振宇
李矗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Original Assignee
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peoples Liberation Army Strategic Support Force Aerospace Engineering University filed Critical Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Priority to CN202310149943.XA priority Critical patent/CN116309705B/en
Publication of CN116309705A publication Critical patent/CN116309705A/en
Application granted granted Critical
Publication of CN116309705B publication Critical patent/CN116309705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/251: Analysis of motion using feature-based methods involving models
    • G06N 3/02: Neural networks; G06N 3/08: Learning methods
    • G06T 7/277: Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G06V 10/40: Extraction of image or video features
    • G06V 10/764: Image or video recognition using classification, e.g. of video objects; G06V 10/765: using rules for classification or partitioning the feature space
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition using neural networks
    • G06V 20/13: Satellite images
    • G06T 2207/10032: Satellite or aerial image; Remote sensing
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target tracking method for satellite video, designed for the characteristics of satellite video targets and based on a transformer backbone network. The tracking algorithm comprises two branches: one performs target prediction based on feature extraction and interaction, and the other performs target prediction through trajectory fitting. The two branch prediction results are then fused to obtain the final position, improving both efficiency and accuracy.

Description

Satellite video single-target tracking method and system based on feature interaction
Technical Field
The invention relates to the technical field of remote sensing image processing, in particular to a satellite video single-target tracking method and system based on feature interaction.
Background
With the diversification of remote sensing application scenarios, single-frame static remote sensing images can no longer meet the requirements of dynamic monitoring of ground features. Video satellites can acquire time-series dynamic images of an observation area, providing rich information for applications such as traffic monitoring, rapid response to natural disasters, and military security. Target tracking is one of the key technologies in video analysis and understanding applications.
Target tracking methods are mainly divided into generative and discriminative models. Target tracking based on a generative model can be regarded as a search problem: the target region in the current frame is modeled, and the most similar region is selected as the predicted location in the next frame. However, generative models do not make full use of background information and appearance variations. In contrast, discriminative models treat target tracking as a binary classification problem: a classifier is trained with the target region in the current frame marked as positive samples and the background marked as negative samples, and in subsequent frames the classifier identifies the foreground and the result is updated. In deep learning approaches, the tracker uses deep features with strong representational power instead of hand-crafted features. In recent years, deep convolutional neural networks have been introduced into target tracking with good results, attracting the attention of many researchers.
To address the characteristics of satellite video targets and improve tracking performance, researchers have conducted many studies, including deep learning-based and correlation filtering-based approaches.
Chen et al. propose a rotation-adaptive tracker with motion constraint (RAMC) based on kernelized correlation filtering, discussing how a mixture of angle and motion information, from both the rotation and translation aspects, can improve satellite video target tracking. Xuan et al. propose an adaptive rotation correlation filter (RACF) algorithm to solve the tracking drift caused by target rotation. Song et al. propose a joint Siamese attention-aware network (JSANet) containing self-attention and cross-attention modules for efficient remote sensing tracking, countering the negative effects of weak satellite video target features and background noise. Shao et al. propose a prediction-attention-inspired Siamese network (PASiam) for satellite video tracking, which constructs a fully convolutional Siamese network with shallow features to obtain fine-grained appearance features; predictive attention is further proposed to handle occlusion and blurring. Zhang et al. propose a prediction network based on a fully convolutional network (FCN) to predict, for each pixel, the probability that the target of the next frame is located there, using previously obtained results; on this basis, a segmentation method is introduced to generate a feasible region for the target in each frame and assign that region a high probability.
However, due to the characteristics of remote sensing images, satellite video tracking faces a number of problems compared with conventional target tracking tasks or aerial image tracking based on unmanned aerial vehicles (UAVs). The challenges of applying target tracking techniques to satellite video datasets are as follows:
The target occupies a small proportion of the frame: the width and height of high-resolution satellite video are typically above 2000 pixels, while the object of interest covers only around 0.01% of the pixels of the whole frame, or even less. The large background enlarges the search range of classical tracking algorithms while reducing tracking performance. In addition, a small target has few features and resembles its surroundings, which leads to poor tracking robustness and large tracking errors.
The video frame rate is low: due to limitations of on-board hardware, the frame rate of satellite video is typically low, resulting in significant movement of the target between frames, which further affects tracking prediction and model updating. For example, if the target suddenly stops, enters a shadow, or moves abruptly, existing tracking systems may easily lose it.
Abrupt changes in illumination: since satellite video is acquired from high altitude in space, the lighting and the atmospheric refractive index change with the motion of the orbiting satellite, which can lead to abrupt changes in frame illumination. These lighting differences have an important effect on the performance and accuracy of target tracking.
Although the prior art discloses applying single-target tracking methods designed for ordinary video to satellite video target tracking, the following problems remain:
1. The performance of such algorithms is constrained by the limited number of public satellite video datasets.
2. In terms of the efficiency of satellite video target tracking, combining multiple processing modules limits the achievable efficiency gains.
3. In terms of the accuracy of satellite video target tracking, satellite video targets differ from ordinary video targets, and conventional algorithms have not been sufficiently adapted to their characteristics.
Disclosure of Invention
To address the above problems, the invention provides a satellite video single-target tracking method based on feature interaction.
The technical solution for realizing the purpose of the invention is as follows:
the satellite video single-target tracking method based on the feature interaction is characterized by comprising the following steps of:
step 1: inputting satellite video data to be tracked;
step 2: performing feature extraction and interaction on input satellite video data based on an improved transducer network to obtain a target prediction result transR;
step 3: performing target prediction on input satellite video data based on track fitting to obtain a target prediction result ployR;
step 4: fusing the transR and ployR to obtain a final prediction result finalR;
step 5: and outputting a target positioning result in the satellite video according to the obtained final prediction result finalR.
Further, the improved transformer network described in step 2 is composed of a transformer backbone for feature extraction and feature interaction, and a predictor for target localization.
Further, the specific operation steps of step 2 include:
step 21: cropping the template image and the search image to 2² times and 4² times the target bounding box to be tracked in the input satellite video data, respectively;
step 22: reshaping the cropped template image Z and search image X into two flattened sequences of 2D image blocks Z_p and X_p, where (P, P) is the image block size, and N_z = H_zW_z/P² and N_x = H_xW_x/P² are the numbers of image blocks of the template image and the search image;
step 23: mapping the 2D image blocks obtained in step 22 to 1D tokens of dimension C by linear projection, and adding position embeddings to obtain the input sequences of the backbone, comprising a template sequence e_0 and a search sequence s_0;
step 24: cropping the central area of the template image to obtain a central-region sequence e_0*;
step 25: concatenating the search sequence s_0, the template sequence e_0 and the central-region sequence e_0* along the first dimension, and sending the concatenation result to the transformer backbone network;
step 26: extracting and interacting the features of the template image and the search image through the transformer backbone network;
step 27: performing step 26 L times to output the target-related search feature s_L and sending it to the predictor;
step 28: the predictor feeds s_L directly to a classification head Φ_cls and a regression head Φ_reg to predict the position and shape of the target, namely:
y_reg = Φ_reg(s_L), y_cls = Φ_cls(s_L)
where y_reg and y_cls are the regression and classification results of the target, used to estimate its position and shape.
Further, the specific steps of step 24 include:
step 241: cropping a smaller region Z* at the center of the template image;
step 242: partitioning Z* into image blocks Z*_p;
step 243: computing the position embeddings of Z*_p;
step 244: mapping Z*_p with the same linear projection as the original image blocks Z_p, and adding the mapped features to the position embeddings to obtain the sequence e_0*.
Further, the specific steps of step 3 include:
step 31: collecting the center coordinates {P_{t-N}, P_{t-N+1}, …, P_{t-1}} of the bounding boxes obtained in the previous N frames and fitting them with two polynomial functions:
x_t = F_x(x_{t-N}, x_{t-N+1}, ..., x_{t-1})   (3)
y_t = F_y(y_{t-N}, y_{t-N+1}, ..., y_{t-1})   (4)
where x_t and y_t are the x and y coordinates of P_t;
step 32: setting a threshold ε; when the displacement over the previous N frames along the x axis or the y axis is smaller than ε, the target is assumed to be stationary in the corresponding direction, and the position prediction results are given by formulas (5) and (6) (reproduced as images in the original), where Δx and Δy are the displacements along the x and y axes over the past N frames, and ε ≤ 0.3.
Further, the specific operation steps of step 4 include:
step 41: calculating the average displacement distance s over the previous N frames;
step 42: if the distance between ployR and transR is less than 0.8s, finalR = transR;
step 43: if the distance between ployR and transR is greater than 0.8s, finalR takes the coordinates of the center between ployR and transR.
A satellite video single-target tracking system based on feature interaction, comprising:
a prediction module based on feature extraction and interaction, used for performing feature extraction and interaction on the input satellite video based on the improved transformer network to obtain a corresponding target prediction result;
the prediction module based on track fitting is used for carrying out target prediction on the input satellite video based on track fitting to obtain a corresponding target prediction result;
the fusion module is used for fusing the target prediction result obtained by the prediction module based on feature extraction and interaction with the target prediction result obtained by the prediction module based on track fitting to obtain a final prediction result;
and the target positioning module is used for outputting a target positioning result according to the final prediction result.
A computer readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of the satellite video single-target tracking method based on feature interaction according to any one of claims 1-7.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the satellite video single-target tracking method based on feature interaction as claimed in any one of claims 1 to 7 when the program is executed by the processor.
The beneficial effects are as follows:
First, all parameters in the backbone network are initialized with the parameters pre-trained by the GLIP pre-training model, so as to achieve better zero-shot and few-shot transfer performance and alleviate the problem of insufficient satellite video datasets.
Second, a multi-head attention transformer backbone with interaction is adopted, so that full interaction between the template features and the search-sample features is realized; the feature extraction stage and the interaction stage of a conventional CNN-backbone pipeline are merged, which reduces the number of modules, simplifies the workflow, and effectively improves efficiency.
Third, the downsampling operation may cause unavoidable information loss. To reduce the negative influence of downsampling, the invention adds a complete template target block when the image blocks are input, enabling the transformer backbone to capture more details in the important template image region and thereby improving recognition accuracy.
Fourth, to further mitigate model drift, the invention uses polynomial functions to fit the trajectory of the target over the past N frames. The polynomial functions estimate the historical motion pattern of the target, which is used to predict its position in the next frame.
Drawings
FIG. 1 is a conventional transformer network architecture;
FIG. 2 is a flow chart of the target tracking method according to the present invention;
FIG. 3 is a flow chart of the transformer backbone feature extraction and interaction stage.
Detailed Description
In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
The invention provides a target tracking method for satellite video, designed for the characteristics of satellite video targets and based on a transformer backbone network. The target tracking algorithm comprises two branches: one performs target prediction based on feature extraction and interaction, and the other performs target prediction through trajectory fitting. The two branch prediction results are fused to obtain the final position; the flow chart of the invention is shown in Fig. 2.
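As an illustration of this two-branch structure, the following minimal Python sketch shows how the branches could be orchestrated over a clip; all function names (transformer_branch, trajectory_branch, fuse) are hypothetical placeholders for the modules detailed below, not the actual implementation.

def track(frames, init_box, transformer_branch, trajectory_branch, fuse):
    # Two-branch tracking loop over a satellite video clip; boxes are (x, y, w, h).
    results = [init_box]
    centers = [box_center(init_box)]                      # history of bounding-box centers
    for frame in frames[1:]:
        transR = transformer_branch(frame, results[-1])   # branch 1: feature extraction + interaction
        ployR = trajectory_branch(centers)                # branch 2: polynomial trajectory fitting
        finalR = fuse(transR, ployR, centers)             # distance-based fusion of the two predictions
        results.append(finalR)
        centers.append(box_center(finalR))
    return results

def box_center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)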
1. Branch one: target prediction based on feature extraction and interaction
This branch consists of only the transformer backbone and the predictor. The conventional transformer network architecture, as shown in Fig. 1, needs a CNN backbone network to perform feature extraction, feature interaction and target localization on the template image and the search image. To reduce the number of modules and simplify the pipeline, the invention realizes both the feature extraction and the feature interaction functions with a single transformer backbone. The specific steps are as follows:
Step 1: The template image and the search image are cropped to 2² times and 4² times the target bounding box to be tracked in the input satellite video, respectively.
Step 2: image of template
Figure BDA0004090419700000081
And search for images +.>
Figure BDA0004090419700000082
Remodelling into two flattened sequences of 2D image blocks +.>
Figure BDA0004090419700000083
And->
Figure BDA0004090419700000084
Where (P, P) is the tile (square) size, N z =H z W z /P 2 And N x =H x W x /P 2 The number of image blocks for the template image and the search image; the 2D image block is mapped to a 1D token with a C dimension by linear projection. After adding the 1D token with embedded position, the input sequence of the trunk is obtained, which comprises a template sequence +.>
Figure BDA0004090419700000085
And search sequence->
Figure BDA0004090419700000086
To provide more detailed target information to the transformer backbone, a smaller region is cropped in the center of the template image
Figure BDA0004090419700000087
And Z is continued to form image block +.>
Figure BDA0004090419700000088
Wherein->
Figure BDA0004090419700000089
The division line on the central area Z is located at the center of the division line on the sample image Z to ensure +.>
Figure BDA00040904197000000810
And original plaque Z p Containing different target information. Obtaining a central region image block->
Figure BDA00040904197000000811
After that, their position embeddings are calculated. Then, with Z p Identical linear projection map->
Figure BDA00040904197000000812
And the mapped features and the positions are embedded and added to obtain a sequence e 0*
Step 3: directly concatenating search sequences s along a first dimension 0 Template sequence e 0 And a center region sequence e 0* And sends the splice results together to the transformer backbone.
Step 4: and after the splicing result is sent to the transducer backbone network, extracting and interacting the characteristics of the template image and the search image through the transducer backbone network. The specific flow is shown in fig. 3. All parameters in the backbone network are initialized with the parameters of the visual branch pre-training of GLIP.
The whole transformer backbone is executed L times. We use e_l and s_l to denote the input template and search sequences of layer (l+1), l = 0, …, L-1, and Att denotes the attention operation. In our transformer backbone, the features of the template image and the search image are learned through a(e_l, s_l) and a(s_l, e_l) and influence each other: Att(e_l) contains information from s_l, and vice versa. Information interaction between the template and search features therefore exists in every layer of the transformer backbone, so no additional interaction module needs to be added after the backbone. The output search feature s_L is sent directly to the predictor for target localization.
Att(e_l) = softmax([a(e_l, e_l), a(e_l, s_l)]) [e_l W_V, s_l W_V]^T
Att(s_l) = softmax([a(s_l, e_l), a(s_l, s_l)]) [e_l W_V, s_l W_V]^T   (1)
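The mixed attention of equation (1) can be illustrated with the following single-head sketch (multi-head attention, feed-forward layers and normalization are omitted; PyTorch is an assumed implementation choice). Both streams attend to the concatenation of template and search tokens, so feature extraction and template-search interaction happen inside the same layer.

import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)   # shared W_V as in equation (1)
        self.scale = dim ** -0.5

    def forward(self, e, s):
        # e: (B, N_e, C) template tokens, s: (B, N_s, C) search tokens
        kv = torch.cat([e, s], dim=1)               # keys/values come from [e_l, s_l]
        k, v = self.wk(kv), self.wv(kv)
        att_e = torch.softmax(self.wq(e) @ k.transpose(-2, -1) * self.scale, dim=-1) @ v
        att_s = torch.softmax(self.wq(s) @ k.transpose(-2, -1) * self.scale, dim=-1) @ v
        return att_e, att_s                         # Att(e_l), Att(s_l)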
Step 5: predictor(s)
After transforming the former backbone we get a target-related search feature S L And is directly added to a sort head phi cls Regression header phi reg Prediction is performed
y reg =Φ reg (S L ),y cls =Φ cls (S L ) (2)
Wherein y is reg 、y cls And representing the regression and classification results of the target for estimating the position and shape of the target.
2. Branch two: target prediction based on track fitting
Since objects in satellite video typically rotate slowly, a polynomial function is used to fit the trajectory of the object in the past N frames. The polynomial function is used to estimate the historical motion pattern of the object and can predict the position of the object in the next frame.
To predict the center coordinate P_t of the object in the t-th frame, the center coordinates {P_{t-N}, P_{t-N+1}, …, P_{t-1}} of the bounding boxes obtained in the past N frames are collected and fitted with two polynomial functions F_x(·) and F_y(·), which represent the x and y coordinates respectively. The two polynomial fitting functions can be expressed as:
x_t = F_x(x_{t-N}, x_{t-N+1}, ..., x_{t-1})   (3)
y_t = F_y(y_{t-N}, y_{t-N+1}, ..., y_{t-1})   (4)
where x_t and y_t are the x coordinate and y coordinate of P_t.
To avoid overfitting to a stationary object, a threshold ε is used to determine whether the object is moving. When the displacement over the past N frames along the x axis or the y axis is smaller than ε, the target is assumed to be stationary in the corresponding direction, and its position prediction is given by formulas (5) and (6) (reproduced as images in the original), where Δx and Δy are the displacements along the x and y axes over the past N frames, and ε is set to 0.3 or less.
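A minimal sketch of this trajectory-fitting branch using numpy.polyfit; the polynomial degree and the exact form of the stationarity test are assumptions, since formulas (5) and (6) appear only as images in the source.

import numpy as np

def fit_trajectory(centers, degree=2, eps=0.3):
    # centers: list of (x, y) box centers from the past N frames; returns the predicted next center.
    centers = np.asarray(centers, dtype=float)
    t = np.arange(len(centers))

    def predict_axis(vals):
        if np.abs(vals[-1] - vals[0]) < eps:          # displacement below threshold: assume stationary
            return float(vals[-1])
        coeffs = np.polyfit(t, vals, deg=min(degree, len(vals) - 1))
        return float(np.polyval(coeffs, len(vals)))   # evaluate the fitted polynomial at the next step

    return predict_axis(centers[:, 0]), predict_axis(centers[:, 1])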
3. Obtaining final predicted outcomes
The average displacement distance s over the past N frames is calculated. The result of target prediction based on feature extraction and interaction is denoted transR, the result of target prediction based on trajectory fitting is denoted ployR, and the final prediction result is denoted finalR. If the distance between ployR and transR is less than 0.8s, then finalR = transR; if the distance between ployR and transR is greater than 0.8s, finalR takes the coordinates of the center between ployR and transR as the final prediction result.
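A minimal sketch of this fusion rule over (x, y) centers; the behaviour when the distance equals exactly 0.8s, and the fallback when fewer than two past centers are available, are assumptions.

import numpy as np

def fuse_predictions(transR, ployR, past_centers):
    # transR, ployR: (x, y) centers from the two branches; past_centers: centers of the past N frames.
    past_centers = np.asarray(past_centers, dtype=float)
    if len(past_centers) < 2:
        return tuple(transR)
    s = np.mean(np.linalg.norm(np.diff(past_centers, axis=0), axis=1))   # average displacement distance
    d = np.linalg.norm(np.asarray(transR, dtype=float) - np.asarray(ployR, dtype=float))
    if d < 0.8 * s:
        return tuple(transR)
    return tuple((np.asarray(transR, dtype=float) + np.asarray(ployR, dtype=float)) / 2.0)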
What is not described in detail in this specification is prior art known to those skilled in the art. Although the present invention has been described with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the described embodiments may be modified or their elements replaced by equivalents; any modifications, equivalent substitutions and improvements made without departing from the spirit and principles of the present invention remain within its scope.

Claims (9)

1. The satellite video single-target tracking method based on feature interaction, characterized by comprising the following steps:
step 1: inputting satellite video data to be tracked;
step 2: performing feature extraction and interaction on the input satellite video data based on an improved transformer network to obtain a target prediction result transR;
step 3: performing target prediction on the input satellite video data based on track fitting to obtain a target prediction result ployR;
step 4: fusing transR and ployR to obtain a final prediction result finalR;
step 5: outputting the target positioning result in the satellite video according to the final prediction result finalR.
2. The method for tracking single satellite video targets based on feature interaction according to claim 1, wherein the improved transformer network in step 2 is composed of a transformer backbone and a predictor, the transformer backbone is used for feature extraction and feature interaction, and the predictor is used for target positioning.
3. The method for tracking a single target of a satellite video based on characteristic interaction according to claim 2, wherein the specific operation steps of step 2 include:
step 21: cropping the template image and the search image to 2² times and 4² times the target bounding box to be tracked in the input satellite video data, respectively;
step 22: reshaping the cropped template image Z and search image X into two flattened sequences of 2D image blocks Z_p and X_p, where (P, P) is the image block size, and N_z = H_zW_z/P² and N_x = H_xW_x/P² are the numbers of image blocks of the template image and the search image;
step 23: mapping the 2D image blocks obtained in step 22 to 1D tokens of dimension C by linear projection, and adding position embeddings to obtain the input sequences of the backbone, comprising a template sequence e_0 and a search sequence s_0;
step 24: cropping the central area of the template image to obtain a central-region sequence e_0*;
step 25: concatenating the search sequence s_0, the template sequence e_0 and the central-region sequence e_0* along the first dimension, and sending the concatenation result to the transformer backbone network;
step 26: extracting and interacting the features of the template image and the search image through the transformer backbone network;
step 27: performing step 26 L times to output the target-related search feature s_L and sending it to the predictor;
step 28: the predictor feeds s_L directly to a classification head Φ_cls and a regression head Φ_reg to predict the position and shape of the target, namely:
y_reg = Φ_reg(s_L), y_cls = Φ_cls(s_L)
where y_reg and y_cls are the regression and classification results of the target, used to estimate its position and shape.
4. A satellite video single-target tracking method based on feature interaction according to claim 3, wherein the specific steps of step 24 include:
step 241: cropping a smaller region Z* at the center of the template image;
step 242: partitioning Z* into image blocks Z*_p;
step 243: computing the position embeddings of Z*_p;
step 244: mapping Z*_p with the same linear projection as the original image blocks Z_p, and adding the mapped features to the position embeddings to obtain the sequence e_0*.
5. The method for tracking a single target of a satellite video based on characteristic interaction according to claim 4, wherein the specific steps of step 3 include:
step 31: collecting the center coordinates {P_{t-N}, P_{t-N+1}, …, P_{t-1}} of the bounding boxes obtained in the previous N frames and fitting them with two polynomial functions:
x_t = F_x(x_{t-N}, x_{t-N+1}, ..., x_{t-1})   (3)
y_t = F_y(y_{t-N}, y_{t-N+1}, ..., y_{t-1})   (4)
where x_t and y_t are the x and y coordinates of P_t;
step 32: setting a threshold ε; when the displacement over the previous N frames along the x axis or the y axis is smaller than ε, the target is assumed to be stationary in the corresponding direction, and the position prediction results are given by formulas (5) and (6) (reproduced as images in the original), where Δx and Δy are the displacements along the x and y axes over the past N frames, and ε ≤ 0.3.
6. The method for tracking a single target of a satellite video based on characteristic interaction according to claim 5, wherein the specific operation steps of step 4 include:
step 41: calculating the average displacement distance s over the previous N frames;
step 42: if the distance between ployR and transR is less than 0.8s, finalR = transR;
step 43: if the distance between ployR and transR is greater than 0.8s, finalR takes the coordinates of the center between ployR and transR.
7. A satellite video single-target tracking system based on feature interaction, comprising:
a prediction module based on feature extraction and interaction, used for performing feature extraction and interaction on the input satellite video based on the improved transformer network to obtain a corresponding target prediction result;
the prediction module based on track fitting is used for carrying out target prediction on the input satellite video based on track fitting to obtain a corresponding target prediction result;
the fusion module is used for fusing the target prediction result obtained by the prediction module based on feature extraction and interaction with the target prediction result obtained by the prediction module based on track fitting to obtain a final prediction result;
and the target positioning module is used for outputting a target positioning result according to the final prediction result.
8. A computer readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of the satellite video single-target tracking method based on feature interaction according to any one of claims 1-7.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the satellite video single-target tracking method based on feature interaction as claimed in any one of claims 1 to 7 when the program is executed by the processor.
CN202310149943.XA 2023-02-22 2023-02-22 Satellite video single-target tracking method and system based on feature interaction Active CN116309705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310149943.XA CN116309705B (en) 2023-02-22 2023-02-22 Satellite video single-target tracking method and system based on feature interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310149943.XA CN116309705B (en) 2023-02-22 2023-02-22 Satellite video single-target tracking method and system based on feature interaction

Publications (2)

Publication Number Publication Date
CN116309705A true CN116309705A (en) 2023-06-23
CN116309705B CN116309705B (en) 2024-07-30

Family

ID=86831596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310149943.XA Active CN116309705B (en) 2023-02-22 2023-02-22 Satellite video single-target tracking method and system based on feature interaction

Country Status (1)

Country Link
CN (1) CN116309705B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108470356A (en) * 2018-03-15 2018-08-31 浙江工业大学 A kind of target object fast ranging method based on binocular vision
CN109631829A (en) * 2018-12-17 2019-04-16 南京理工大学 A kind of binocular distance measuring method of adaptive Rapid matching
CN110580713A (en) * 2019-08-30 2019-12-17 武汉大学 Satellite video target tracking method based on full convolution twin network and track prediction
CN113076809A (en) * 2021-03-10 2021-07-06 青岛海纳云科技控股有限公司 High-altitude falling object detection method based on visual Transformer
WO2023273136A1 (en) * 2021-06-29 2023-01-05 常州工学院 Target object representation point estimation-based visual tracking method
CN113963032A (en) * 2021-12-01 2022-01-21 浙江工业大学 Twin network structure target tracking method fusing target re-identification
CN114372173A (en) * 2022-01-11 2022-04-19 中国人民公安大学 Natural language target tracking method based on Transformer architecture
CN114842047A (en) * 2022-03-29 2022-08-02 武汉大学 Twin network satellite video target tracking method based on motion prior

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ning Wang et al.: "Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking", arXiv:2103.11681v2, 24 March 2021, pages 1-13 *
Wu Jiajun: "Moving Target Detection and Tracking Based on High-Speed Vision", China Master's Theses Full-text Database, Information Science and Technology Series, 15 August 2019, page 4 *
Wang Qiang, Lu Xianling: "Transformer Target Tracking Algorithm with Spatio-temporal Template Updating", Journal of Frontiers of Computer Science and Technology, 30 September 2022, page 1 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058366A (en) * 2023-07-04 2023-11-14 南京航空航天大学 Large aircraft large part point cloud semantic segmentation method based on pre-training large model
CN117058366B (en) * 2023-07-04 2024-03-01 南京航空航天大学 Large aircraft large part point cloud semantic segmentation method based on pre-training large model
CN117197192A (en) * 2023-11-06 2023-12-08 北京观微科技有限公司 Satellite video single-target tracking method and device
CN117197192B (en) * 2023-11-06 2024-02-23 北京观微科技有限公司 Satellite video single-target tracking method and device

Also Published As

Publication number Publication date
CN116309705B (en) 2024-07-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant