CN116309705A - Satellite video single-target tracking method and system based on feature interaction - Google Patents
Satellite video single-target tracking method and system based on feature interaction
- Publication number
- CN116309705A (application CN202310149943.XA)
- Authority
- CN
- China
- Prior art keywords
- target
- satellite video
- interaction
- prediction result
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T7/251—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/277—Analysis of motion involving stochastic approaches, e.g. using Kalman filters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/765—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10032—Satellite or aerial image; Remote sensing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Astronomy & Astrophysics (AREA)
- Remote Sensing (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a target tracking method for satellite video that is based on a transformer backbone network and tailored to the characteristics of satellite video targets. The target tracking algorithm comprises two branches: one performs target prediction based on feature extraction and interaction, and the other performs target prediction through track fitting. The two branch prediction results are finally fused to obtain the final position, which improves both running efficiency and accuracy.
Description
Technical Field
The invention relates to the technical field of remote sensing image processing, in particular to a satellite video single-target tracking method and system based on feature interaction.
Background
With the diversification of remote sensing application scenarios, single-frame static remote sensing images can no longer meet the requirement of dynamic observation of ground objects. Video satellites can acquire time-series dynamic images of the observed area, providing rich information for applications such as traffic monitoring, rapid response to natural disasters, and military security. Target tracking is one of the key technologies in video analysis and understanding applications.
Target tracking methods are mainly divided into generative and discriminative models. Target tracking based on a generative model can be regarded as a search problem: the target region in the current frame is modeled, and the most similar region in the next frame is selected as the predicted location. However, generative models do not take full advantage of background information and appearance variations. In contrast, discriminative models treat target tracking as a binary classification problem: a classifier is trained in the current frame with the target marked as positive samples and the background as negative samples; in subsequent frames, the classifier identifies the foreground and the result is used to update the model. In deep learning approaches, trackers use deep features with strong representational power instead of hand-crafted features. In recent years, deep convolutional neural networks have been introduced into target tracking with good results and have attracted wide attention from researchers.
To improve tracking performance on satellite video targets, researchers have conducted many studies, including deep-learning-based and correlation-filtering-based approaches.
Chen et al. propose a new rotation-adaptive tracker with motion constraint (RAMC) based on kernelized correlation filtering, discussing how to improve satellite video target tracking by combining angle and motion information from both the rotation and translation aspects. Xuan et al. propose an adaptive rotation correlation filter (RACF) algorithm to solve the tracking drift caused by target rotation. Song et al. propose a joint Siamese attention-aware network (JSANet) containing self-attention and cross-attention modules for efficient remote sensing tracking, countering the negative effects of the weak features of satellite video targets and background noise. Shao et al. propose a predicting-attention-inspired Siamese network (PASiam) for satellite-borne video tracking, which constructs a fully convolutional Siamese network with shallow features to obtain fine-grained appearance features; in addition, a predicting-attention mechanism is proposed to deal with occlusion and blurring. Zhang et al. propose a prediction network based on a fully convolutional network (FCN) to predict, from previously obtained results, the probability that the target of the next frame is located at each pixel; on this basis, a segmentation method is introduced to generate a feasible region for the target in each frame, and a high probability is assigned to that region.
However, owing to the characteristics of remote sensing imagery, satellite video tracking faces a number of problems compared with conventional target tracking tasks or unmanned aerial vehicle (UAV) aerial image tracking. The challenges of applying target tracking techniques to satellite video datasets are as follows:
The target occupies a small proportion of the frame: the width and height of high-resolution satellite video are typically above 2000 pixels, while the object of interest occupies only about 0.01% of the pixels of an entire video frame, or even less. The large background enlarges the search range of classical tracking algorithms while reducing tracking performance. In addition, small tracking targets have few features and resemble their surroundings, leading to poor robustness and large tracking errors.
The video frame rate is low: due to limitations of on-board hardware, the frame rate of satellite video is typically low, resulting in significant movement of the target between frames, which further affects tracking prediction and model updating. For example, if the target suddenly stops, is shadowed, or turns, existing trackers may easily lose it.
Abrupt changes in illumination: since satellite video is acquired from high altitude in space, the lighting and the atmospheric refractive index change with the motion of the orbiting satellite, which can lead to abrupt changes in frame illumination. These lighting differences have a significant effect on tracking performance and accuracy.
Although single-target tracking methods designed for ordinary video have been applied to satellite video target tracking in the prior art, the following problems remain:
1. The performance of such algorithms is limited by the scarcity of public satellite video datasets.
2. In terms of efficiency, combining multiple modules limits the achievable improvement in the efficiency of satellite video target tracking.
3. In terms of accuracy, conventional algorithms are not sufficiently adapted to the characteristics of satellite video targets, which differ from ordinary video targets.
Disclosure of Invention
In view of the above problems, the invention provides a satellite video single-target tracking method based on feature interaction.
The technical solution for realizing the purpose of the invention is as follows:
the satellite video single-target tracking method based on the feature interaction is characterized by comprising the following steps of:
step 1: inputting satellite video data to be tracked;
step 2: performing feature extraction and interaction on input satellite video data based on an improved transducer network to obtain a target prediction result transR;
step 3: performing target prediction on input satellite video data based on track fitting to obtain a target prediction result ployR;
step 4: fusing the transR and ployR to obtain a final prediction result finalR;
step 5: and outputting a target positioning result in the satellite video according to the obtained final prediction result finalR.
Further, the improved transformer network described in step 2 is composed of a transformer backbone for feature extraction and feature interaction, and a predictor for target localization.
Further, the specific operation steps of step 2 include:
step 21: cropping the template image and the search image to 2² times and 4² times the area of the target bounding box to be tracked in the input satellite video data, respectively;
step 22: reshaping the template image Z ∈ R^(H_z×W_z×3) and the search image X ∈ R^(H_x×W_x×3) into two flattened sequences of 2D image blocks Z_p ∈ R^(N_z×(P²·3)) and X_p ∈ R^(N_x×(P²·3)), where (P, P) is the size of each image block, and N_z = H_z·W_z/P² and N_x = H_x·W_x/P² are the numbers of image blocks of the template image and the search image;
step 23: mapping the 2D image blocks obtained in step 22 to 1D tokens of dimension C by linear projection, and adding position embeddings to obtain the input sequence of the backbone, which comprises the template sequence e^0 ∈ R^(N_z×C) and the search sequence s^0 ∈ R^(N_x×C);
step 24: cropping the central region of the template image to obtain a central-region sequence e^0*;
step 25: concatenating the search sequence s^0, the template sequence e^0 and the central-region sequence e^0* along the first dimension, and sending the concatenated result to the transformer backbone network;
step 26: performing feature extraction and interaction on the template image and the search image through the transformer backbone network;
step 27: repeating step 26 for L layers to output the target-related search feature s^L, which is sent to the predictor;
step 28: the predictor feeds s^L directly into a classification head Φ_cls and a regression head Φ_reg for prediction, obtaining the position and shape of the target, namely:
y_reg = Φ_reg(s^L), y_cls = Φ_cls(s^L)
wherein y_reg and y_cls represent the regression and classification results of the target, which are used to estimate its position and shape.
Further, the specific steps of step 24 include:
step 241: cropping a smaller region Z* at the center of the template image;
step 242: partitioning the central region Z* into image blocks Z*_p, the partition lines of which are located at the centers of the image blocks of the template image, so that Z*_p and the original image blocks Z_p contain different target information;
step 243: computing the position embeddings of the central-region image blocks Z*_p;
step 244: applying to Z*_p the same linear projection mapping as used for the original image blocks Z_p, and adding the mapped features and the position embeddings to obtain the sequence e^0*.
Further, the specific steps of step 3 include:
step 31: collecting the center coordinates {P_{t-N}, P_{t-N+1}, …, P_{t-1}} of the bounding boxes obtained in the past N frames and fitting them with two polynomial functions:
x_t = F_x(x_{t-N}, x_{t-N+1}, ..., x_{t-1})    (3)
y_t = F_y(y_{t-N}, y_{t-N+1}, ..., y_{t-1})    (4)
wherein x_t and y_t are the x-coordinate and y-coordinate of P_t;
step 32: setting a threshold ε; when the displacement of the past N frames along the x-axis or the y-axis is smaller than the threshold ε, the target is assumed to be stationary in the corresponding direction, and the position prediction result is:
x_t = x_{t-1} if |Δx| < ε,  y_t = y_{t-1} if |Δy| < ε
where Δx and Δy are the displacements along the x-axis and y-axis over the past N frames, and ε is set to no more than 0.3.
Further, the specific operation steps of step 4 include:
step 41: calculating the average displacement distance s over the past N frames;
step 42: if the distance between polyR and transR is smaller than 0.8s, finalR = transR;
step 43: if the distance between polyR and transR is larger than 0.8s, finalR takes the coordinates of the midpoint between polyR and transR.
A satellite video single-target tracking system based on feature interaction, comprising:
the prediction module based on feature extraction and interaction is used for performing feature extraction and interaction on the input satellite video based on the improved transformer network to obtain a corresponding target prediction result;
the prediction module based on track fitting is used for carrying out target prediction on the input satellite video based on track fitting to obtain a corresponding target prediction result;
the fusion module is used for fusing the target prediction result obtained by the prediction module based on feature extraction and interaction with the target prediction result obtained by the prediction module based on track fitting to obtain a final prediction result;
and the target positioning module is used for outputting a target positioning result according to the final prediction result.
A computer readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of the satellite video single-target tracking method based on feature interaction according to any one of claims 1-7.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the satellite video single-target tracking method based on feature interaction as claimed in any one of claims 1 to 7 when the program is executed by the processor.
The beneficial effects of the invention are as follows:
First, all parameters in the backbone network are initialized with parameters pre-trained by the GLIP pre-training model in order to achieve better zero-shot and few-shot transfer performance, which alleviates the problem of insufficient satellite video datasets.
Second, a multi-head-attention transformer backbone with interaction is adopted, which realizes full interaction between the template features and the search-sample features and merges the feature-extraction and interaction stages of a conventional CNN-based pipeline, thereby reducing the number of modules, simplifying the pipeline, and effectively improving efficiency.
Third, the downsampling operation may cause unavoidable information loss. To reduce its negative influence, the invention adds complete template target blocks when constructing the input image blocks, enabling the transformer backbone to capture more details in the important template image region and improving recognition accuracy.
Fourth, to further mitigate model drift, the invention uses polynomial functions to fit the trajectory of the target over the past N frames. The polynomial functions estimate the historical motion pattern of the target and are used to predict its position in the next frame.
Drawings
FIG. 1 is a conventional transformer network architecture;
FIG. 2 is a flow chart of the target tracking method according to the present invention;
FIG. 3 is a flow chart of the transformer backbone feature extraction and interaction stage.
Detailed Description
In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
The invention provides a target tracking method for satellite video that is based on a transformer backbone network and tailored to the characteristics of satellite video targets. The target tracking algorithm comprises two branches: one performs target prediction based on feature extraction and interaction, and the other performs target prediction through track fitting. The two branch prediction results are fused to obtain the final position. The flow chart of the invention is shown in fig. 2.
1. Branch one: target prediction based on feature extraction and interaction
This branch consists only of the transformer backbone and the predictor. The conventional transformer tracking architecture, as shown in fig. 1, needs a CNN backbone network for feature extraction on the template image and the search image, followed by further modules for feature interaction and target positioning. To reduce the number of modules and simplify the pipeline, the invention realizes both feature extraction and feature interaction with a single transformer backbone. The specific steps are as follows:
Step 1: the template image and the search image are cropped to 2² times and 4² times the area of the target bounding box to be tracked in the input satellite video, respectively.
Step 2: the template image Z ∈ R^(H_z×W_z×3) and the search image X ∈ R^(H_x×W_x×3) are reshaped into two flattened sequences of 2D image blocks Z_p ∈ R^(N_z×(P²·3)) and X_p ∈ R^(N_x×(P²·3)), where (P, P) is the size of each (square) image block, and N_z = H_z·W_z/P² and N_x = H_x·W_x/P² are the numbers of image blocks of the template image and the search image. The 2D image blocks are mapped to 1D tokens of dimension C by linear projection; after position embeddings are added to the 1D tokens, the input sequence of the backbone is obtained, comprising the template sequence e^0 ∈ R^(N_z×C) and the search sequence s^0 ∈ R^(N_x×C).
To provide the transformer backbone with more detailed target information, a smaller region Z* is additionally cropped at the center of the template image Z and partitioned into image blocks Z*_p. The partition lines on the central region Z* are located at the centers of the image blocks of the template image Z, which ensures that Z*_p and the original image blocks Z_p contain different target information. After the central-region image blocks Z*_p are obtained, their position embeddings are computed; then the same linear projection as for Z_p is applied, and the mapped features and the position embeddings are added to obtain the sequence e^0*.
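For illustration only, the following PyTorch sketch shows one possible realisation of steps 1-2 above; the patch size P = 16, the token dimension C = 256, the concrete crop sizes, the half-patch offset of the central region, and the learned position embeddings are assumptions not specified in this description.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P blocks and project each to a C-dimensional token
    (the flatten + linear projection of step 2, realised as a strided convolution)."""
    def __init__(self, patch: int = 16, in_ch: int = 3, dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, img: torch.Tensor) -> torch.Tensor:        # (B, 3, H, W)
        tokens = self.proj(img)                                   # (B, C, H/P, W/P)
        return tokens.flatten(2).transpose(1, 2)                  # (B, N, C)

patch, dim = 16, 256
embed = PatchEmbed(patch, 3, dim)

template = torch.randn(1, 3, 128, 128)     # assumed template crop (about 2^2 x target box area)
search   = torch.randn(1, 3, 256, 256)     # assumed search crop (about 4^2 x target box area)
# central region: offset by half a patch so its partition lines fall at the
# centres of the template's image blocks, as described above
center   = template[:, :, 40:104, 40:104]

e0      = embed(template)                  # template sequence e^0,  N_z = (128/16)^2 = 64 tokens
s0      = embed(search)                    # search sequence  s^0,  N_x = (256/16)^2 = 256 tokens
e0_star = embed(center)                    # central-region sequence e^0*, 16 tokens

# learned position embeddings (assumed) are added before concatenation
pos_x = nn.Parameter(torch.zeros(1, s0.shape[1], dim))
pos_z = nn.Parameter(torch.zeros(1, e0.shape[1], dim))
pos_c = nn.Parameter(torch.zeros(1, e0_star.shape[1], dim))
backbone_input = torch.cat([s0 + pos_x, e0 + pos_z, e0_star + pos_c], dim=1)  # used in step 3
```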
Step 3: the search sequence s^0, the template sequence e^0 and the central-region sequence e^0* are directly concatenated along the first dimension, and the concatenated result is sent to the transformer backbone.
Step 4: after the concatenated result is sent to the transformer backbone network, feature extraction and interaction are performed on the template image and the search image through the transformer backbone network. The specific flow is shown in fig. 3. All parameters in the backbone network are initialized with the pre-trained parameters of the visual branch of GLIP.
The whole transformer backbone is run for L layers. We use e^l and s^l to denote the input template and search sequences of layer (l+1), l = 0, …, L-1, and Att denotes the self-attention model. In our transformer backbone, the features of the template image and the search image are learned through a(e^l, s^l) and a(s^l, e^l) and influence each other: Att(e^l) contains information from s^l, and vice versa. Information interaction between the template and search features therefore exists in every layer of the transformer backbone, so no additional interaction module needs to be added after the backbone. The output search feature s^L is sent directly to the predictor for target localization.
Att(e^l) = softmax([a(e^l, e^l), a(e^l, s^l)]) [e^l W_V, s^l W_V]^T
Att(s^l) = softmax([a(s^l, e^l), a(s^l, s^l)]) [e^l W_V, s^l W_V]^T    (1)
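As an illustration of equation (1), the sketch below applies standard multi-head self-attention jointly over the concatenated [search, template, central-region] tokens, so that queries from the template attend to keys from the search region and vice versa. The number of layers L, the head count, and the omitted MLP/normalisation details of a full transformer block are assumptions.

```python
import torch
import torch.nn as nn

class JointAttentionLayer(nn.Module):
    """One attention layer over the concatenated token sequence. A single softmax
    mixes a(e,e), a(e,s), a(s,e) and a(s,s), so feature extraction and feature
    interaction happen inside the same layer, as in equation (1)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:      # (B, N_x+N_z+N_c, C)
        h = self.norm(tokens)
        out, _ = self.attn(h, h, h)
        return tokens + out                                        # residual connection

L, dim = 4, 256                                  # L is an assumption
layers = nn.ModuleList([JointAttentionLayer(dim) for _ in range(L)])

x = torch.randn(1, 256 + 64 + 16, dim)           # backbone_input from the previous sketch
for layer in layers:
    x = layer(x)
s_L = x[:, :256]                                  # first N_x tokens: search feature s^L for the predictor
```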
Step 5: predictor(s)
After passing through the transformer backbone, the target-related search feature s^L is obtained and fed directly into a classification head Φ_cls and a regression head Φ_reg for prediction:
y_reg = Φ_reg(s^L), y_cls = Φ_cls(s^L)    (2)
wherein y_reg and y_cls represent the regression and classification results of the target, which are used to estimate its position and shape.
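The classification and regression heads of equation (2) can be sketched as follows. The head architecture (two-layer MLPs), the four-parameter box output, and the selection of the box at the highest-scoring token are assumptions, since the text only states that Φ_cls and Φ_reg act on s^L.

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Apply a classification head and a regression head to every token of s^L
    and read the box at the highest-scoring token (cf. equation (2))."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.cls_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.reg_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, s_L: torch.Tensor):
        y_cls = self.cls_head(s_L).squeeze(-1)          # (B, N_x) foreground score per token
        y_reg = self.reg_head(s_L)                      # (B, N_x, 4) box parameters per token
        idx = y_cls.argmax(dim=1)                       # token with the strongest target response
        best_box = y_reg[torch.arange(s_L.shape[0]), idx]
        return y_cls, y_reg, best_box

predictor = Predictor()
y_cls, y_reg, box = predictor(torch.randn(1, 256, 256))  # s^L from the backbone sketch
```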
2. Branch two: target prediction based on track fitting
Since objects in satellite video typically rotate slowly, a polynomial function is used to fit the trajectory of the object in the past N frames. The polynomial function is used to estimate the historical motion pattern of the object and can predict the position of the object in the next frame.
When predicting the center coordinate P_t of the target in the t-th frame, the center coordinates {P_{t-N}, P_{t-N+1}, …, P_{t-1}} of the bounding boxes obtained in the past N frames are collected and fitted with two polynomial functions F_x(·) and F_y(·), which model the x- and y-coordinates respectively. The two polynomial fitting functions can be expressed as:
x_t = F_x(x_{t-N}, x_{t-N+1}, ..., x_{t-1})    (3)
y_t = F_y(y_{t-N}, y_{t-N+1}, ..., y_{t-1})    (4)
wherein x_t and y_t are the x-coordinate and y-coordinate of P_t.
To avoid overfitting to a stationary target, a threshold ε is used to determine whether the target is moving. When the displacement over the past N frames along the x-axis or the y-axis is smaller than the threshold ε, the target is assumed to be stationary in the corresponding direction, so its position prediction result is:
x_t = x_{t-1} if |Δx| < ε,  y_t = y_{t-1} if |Δy| < ε
where Δx and Δy are the displacements along the x-axis and y-axis over the past N frames, and ε is set to no more than 0.3.
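A minimal NumPy sketch of this trajectory-fitting branch (equations (3) and (4) plus the stationarity test) is given below; the polynomial degree is an assumption, since the text only specifies the window of N past frames and the threshold ε.

```python
import numpy as np

def trajectory_predict(history_xy: np.ndarray, degree: int = 2, eps: float = 0.3):
    """history_xy: (N, 2) centre coordinates of the past N frames.
    Fit F_x and F_y to the past centres and evaluate them one frame ahead; if the
    displacement along an axis over the window is below eps, keep the previous
    coordinate (target treated as stationary along that axis)."""
    n = len(history_xy)
    t = np.arange(n)
    xs, ys = history_xy[:, 0], history_xy[:, 1]

    dx, dy = abs(xs[-1] - xs[0]), abs(ys[-1] - ys[0])   # displacement over the window

    fx = np.polynomial.Polynomial.fit(t, xs, deg=degree)
    fy = np.polynomial.Polynomial.fit(t, ys, deg=degree)

    x_next = xs[-1] if dx < eps else float(fx(n))
    y_next = ys[-1] if dy < eps else float(fy(n))
    return x_next, y_next

# example: a target drifting right at roughly one pixel per frame, static in y
hist = np.array([[10.0, 5.0], [11.1, 5.0], [12.0, 5.1], [13.2, 5.0], [14.1, 5.0]])
print(trajectory_predict(hist))   # x extrapolated forward, y held at 5.0
```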
3. Obtaining the final prediction result
The average displacement distance s over the past N frames is calculated. The result of target prediction based on feature extraction and interaction is denoted transR, the result of target prediction based on track fitting is denoted polyR, and the final prediction result is denoted finalR. If the distance between polyR and transR is smaller than 0.8s, then finalR = transR; if the distance between polyR and transR is larger than 0.8s, finalR takes the coordinates of the midpoint between polyR and transR as the final prediction result.
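For illustration, this fusion rule can be written as the following sketch; how the average displacement s is computed from the past centres (here, the mean frame-to-frame distance) is an assumption.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]

def fuse(trans_r: Point, poly_r: Point, history: Sequence[Point]) -> Point:
    """Fuse the transformer-branch prediction transR with the trajectory-branch
    prediction polyR using the 0.8*s rule described above."""
    # average displacement distance s over the past frames
    steps = [math.dist(a, b) for a, b in zip(history, history[1:])]
    s = sum(steps) / max(len(steps), 1)

    if math.dist(trans_r, poly_r) < 0.8 * s:
        return trans_r                                   # the two branches agree
    # otherwise take the midpoint ("the coordinates of the polyR and transR centres")
    return ((trans_r[0] + poly_r[0]) / 2.0, (trans_r[1] + poly_r[1]) / 2.0)

history = [(10.0, 5.0), (11.1, 5.0), (12.0, 5.1), (13.2, 5.0), (14.1, 5.0)]
print(fuse((15.0, 5.0), (15.2, 5.1), history))
```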
What is not described in detail in this specification is prior art known to those skilled in the art. Although the present invention has been described with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements and changes may be made without departing from the spirit and principles of the present invention.
Claims (9)
1. A satellite video single-target tracking method based on feature interaction, characterized by comprising the following steps:
step 1: inputting satellite video data to be tracked;
step 2: performing feature extraction and interaction on the input satellite video data based on an improved transformer network to obtain a target prediction result transR;
step 3: performing target prediction on the input satellite video data based on track fitting to obtain a target prediction result polyR;
step 4: fusing transR and polyR to obtain a final prediction result finalR;
step 5: outputting the target positioning result in the satellite video according to the obtained final prediction result finalR.
2. The satellite video single-target tracking method based on feature interaction according to claim 1, wherein the improved transformer network in step 2 is composed of a transformer backbone and a predictor, the transformer backbone being used for feature extraction and feature interaction, and the predictor being used for target positioning.
3. The satellite video single-target tracking method based on feature interaction according to claim 2, wherein the specific operation steps of step 2 include:
step 21: cropping the template image and the search image to 2² times and 4² times the area of the target bounding box to be tracked in the input satellite video data, respectively;
step 22: reshaping the template image Z ∈ R^(H_z×W_z×3) and the search image X ∈ R^(H_x×W_x×3) into two flattened sequences of 2D image blocks Z_p ∈ R^(N_z×(P²·3)) and X_p ∈ R^(N_x×(P²·3)), where (P, P) is the size of each image block, and N_z = H_z·W_z/P² and N_x = H_x·W_x/P² are the numbers of image blocks of the template image and the search image;
step 23: mapping the 2D image blocks obtained in step 22 to 1D tokens of dimension C by linear projection, and adding position embeddings to obtain the input sequence of the backbone, which comprises the template sequence e^0 ∈ R^(N_z×C) and the search sequence s^0 ∈ R^(N_x×C);
step 24: cropping the central region of the template image to obtain a central-region sequence e^0*;
step 25: concatenating the search sequence s^0, the template sequence e^0 and the central-region sequence e^0* along the first dimension, and sending the concatenated result to the transformer backbone network;
step 26: performing feature extraction and interaction on the template image and the search image through the transformer backbone network;
step 27: repeating step 26 for L layers to output the target-related search feature s^L, which is sent to the predictor;
step 28: the predictor feeds s^L directly into a classification head Φ_cls and a regression head Φ_reg for prediction, obtaining the position and shape of the target, namely:
y_reg = Φ_reg(s^L), y_cls = Φ_cls(s^L)
wherein y_reg and y_cls represent the regression and classification results of the target, which are used to estimate its position and shape.
4. The satellite video single-target tracking method based on feature interaction according to claim 3, wherein the specific steps of step 24 include:
step 241: cropping a smaller region Z* at the center of the template image;
step 242: partitioning the central region Z* into image blocks Z*_p, the partition lines of which are located at the centers of the image blocks of the template image, so that Z*_p and the original image blocks Z_p contain different target information;
step 243: computing the position embeddings of the central-region image blocks Z*_p;
step 244: applying to Z*_p the same linear projection mapping as used for the original image blocks Z_p, and adding the mapped features and the position embeddings to obtain the sequence e^0*.
5. The satellite video single-target tracking method based on feature interaction according to claim 4, wherein the specific steps of step 3 include:
step 31: collecting the center coordinates {P_{t-N}, P_{t-N+1}, …, P_{t-1}} of the bounding boxes obtained in the past N frames and fitting them with two polynomial functions:
x_t = F_x(x_{t-N}, x_{t-N+1}, ..., x_{t-1})    (3)
y_t = F_y(y_{t-N}, y_{t-N+1}, ..., y_{t-1})    (4)
wherein x_t and y_t are the x-coordinate and y-coordinate of P_t;
step 32: setting a threshold ε; when the displacement of the past N frames along the x-axis or the y-axis is smaller than the threshold ε, the target is assumed to be stationary in the corresponding direction, and the position prediction result is:
x_t = x_{t-1} if |Δx| < ε,  y_t = y_{t-1} if |Δy| < ε
where Δx and Δy are the displacements along the x-axis and y-axis over the past N frames, and ε is set to no more than 0.3.
6. The satellite video single-target tracking method based on feature interaction according to claim 5, wherein the specific operation steps of step 4 include:
step 41: calculating the average displacement distance s over the past N frames;
step 42: if the distance between polyR and transR is smaller than 0.8s, finalR = transR;
step 43: if the distance between polyR and transR is larger than 0.8s, finalR takes the coordinates of the midpoint between polyR and transR.
7. A satellite video single-target tracking system based on feature interaction, comprising:
the prediction module based on feature extraction and interaction is used for performing feature extraction and interaction on the input satellite video based on the improved transformer network to obtain a corresponding target prediction result;
the prediction module based on track fitting is used for carrying out target prediction on the input satellite video based on track fitting to obtain a corresponding target prediction result;
the fusion module is used for fusing the target prediction result obtained by the prediction module based on feature extraction and interaction with the target prediction result obtained by the prediction module based on track fitting to obtain a final prediction result;
and the target positioning module is used for outputting a target positioning result according to the final prediction result.
8. A computer readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of the satellite video single-target tracking method based on feature interaction according to any one of claims 1-7.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the satellite video single-target tracking method based on feature interaction as claimed in any one of claims 1 to 7 when the program is executed by the processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310149943.XA CN116309705B (en) | 2023-02-22 | 2023-02-22 | Satellite video single-target tracking method and system based on feature interaction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310149943.XA CN116309705B (en) | 2023-02-22 | 2023-02-22 | Satellite video single-target tracking method and system based on feature interaction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116309705A true CN116309705A (en) | 2023-06-23 |
CN116309705B CN116309705B (en) | 2024-07-30 |
Family
ID=86831596
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310149943.XA Active CN116309705B (en) | 2023-02-22 | 2023-02-22 | Satellite video single-target tracking method and system based on feature interaction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116309705B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117058366A (en) * | 2023-07-04 | 2023-11-14 | 南京航空航天大学 | Large aircraft large part point cloud semantic segmentation method based on pre-training large model |
CN117197192A (en) * | 2023-11-06 | 2023-12-08 | 北京观微科技有限公司 | Satellite video single-target tracking method and device |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108470356A (en) * | 2018-03-15 | 2018-08-31 | 浙江工业大学 | A kind of target object fast ranging method based on binocular vision |
CN109631829A (en) * | 2018-12-17 | 2019-04-16 | 南京理工大学 | A kind of binocular distance measuring method of adaptive Rapid matching |
CN110580713A (en) * | 2019-08-30 | 2019-12-17 | 武汉大学 | Satellite video target tracking method based on full convolution twin network and track prediction |
CN113076809A (en) * | 2021-03-10 | 2021-07-06 | 青岛海纳云科技控股有限公司 | High-altitude falling object detection method based on visual Transformer |
CN113963032A (en) * | 2021-12-01 | 2022-01-21 | 浙江工业大学 | Twin network structure target tracking method fusing target re-identification |
CN114372173A (en) * | 2022-01-11 | 2022-04-19 | 中国人民公安大学 | Natural language target tracking method based on Transformer architecture |
CN114842047A (en) * | 2022-03-29 | 2022-08-02 | 武汉大学 | Twin network satellite video target tracking method based on motion prior |
WO2023273136A1 (en) * | 2021-06-29 | 2023-01-05 | 常州工学院 | Target object representation point estimation-based visual tracking method |
-
2023
- 2023-02-22 CN CN202310149943.XA patent/CN116309705B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108470356A (en) * | 2018-03-15 | 2018-08-31 | 浙江工业大学 | A kind of target object fast ranging method based on binocular vision |
CN109631829A (en) * | 2018-12-17 | 2019-04-16 | 南京理工大学 | A kind of binocular distance measuring method of adaptive Rapid matching |
CN110580713A (en) * | 2019-08-30 | 2019-12-17 | 武汉大学 | Satellite video target tracking method based on full convolution twin network and track prediction |
CN113076809A (en) * | 2021-03-10 | 2021-07-06 | 青岛海纳云科技控股有限公司 | High-altitude falling object detection method based on visual Transformer |
WO2023273136A1 (en) * | 2021-06-29 | 2023-01-05 | 常州工学院 | Target object representation point estimation-based visual tracking method |
CN113963032A (en) * | 2021-12-01 | 2022-01-21 | 浙江工业大学 | Twin network structure target tracking method fusing target re-identification |
CN114372173A (en) * | 2022-01-11 | 2022-04-19 | 中国人民公安大学 | Natural language target tracking method based on Transformer architecture |
CN114842047A (en) * | 2022-03-29 | 2022-08-02 | 武汉大学 | Twin network satellite video target tracking method based on motion prior |
Non-Patent Citations (3)
Title |
---|
Ning Wang et al.: "Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking", arXiv:2103.11681v2, 24 March 2021 (2021-03-24), pages 1-13 *
Wu Jiajun: "Moving Object Detection and Tracking Based on High-Speed Vision", China Master's Theses Full-text Database, Information Science and Technology, 15 August 2019 (2019-08-15), page 4 *
Wang Qiang, Lu Xianling: "Transformer Object Tracking Algorithm with Spatio-Temporal Template Update", Journal of Frontiers of Computer Science and Technology, 30 September 2022 (2022-09-30), page 1 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117058366A (en) * | 2023-07-04 | 2023-11-14 | 南京航空航天大学 | Large aircraft large part point cloud semantic segmentation method based on pre-training large model |
CN117058366B (en) * | 2023-07-04 | 2024-03-01 | 南京航空航天大学 | Large aircraft large part point cloud semantic segmentation method based on pre-training large model |
CN117197192A (en) * | 2023-11-06 | 2023-12-08 | 北京观微科技有限公司 | Satellite video single-target tracking method and device |
CN117197192B (en) * | 2023-11-06 | 2024-02-23 | 北京观微科技有限公司 | Satellite video single-target tracking method and device |
Also Published As
Publication number | Publication date |
---|---|
CN116309705B (en) | 2024-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ming et al. | Deep learning for monocular depth estimation: A review | |
CN113807187B (en) | Unmanned aerial vehicle video multi-target tracking method based on attention feature fusion | |
CN116309705B (en) | Satellite video single-target tracking method and system based on feature interaction | |
CN110738673A (en) | Visual SLAM method based on example segmentation | |
CN111652081B (en) | Video semantic segmentation method based on optical flow feature fusion | |
CN115035171B (en) | Self-supervision monocular depth estimation method based on self-attention guide feature fusion | |
CN111860651A (en) | Monocular vision-based semi-dense map construction method for mobile robot | |
CN116403006B (en) | Real-time visual target tracking method, device and storage medium | |
Duan | [Retracted] Deep Learning‐Based Multitarget Motion Shadow Rejection and Accurate Tracking for Sports Video | |
Li et al. | Monocular 3-D Object Detection Based on Depth-Guided Local Convolution for Smart Payment in D2D Systems | |
Wang et al. | Metaverse Meets Intelligent Transportation System: An Efficient and Instructional Visual Perception Framework | |
Razzok et al. | Pedestrian detection under weather conditions using conditional generative adversarial network | |
CN117218378A (en) | High-precision regression infrared small target tracking method | |
Zheng et al. | 6d camera relocalization in visually ambiguous extreme environments | |
Li et al. | NeRF-MS: Neural Radiance Fields with Multi-Sequence | |
CN115880332A (en) | Target tracking method for low-altitude aircraft visual angle | |
CN115457080A (en) | Multi-target vehicle track extraction method based on pixel-level image fusion | |
CN115100565A (en) | Multi-target tracking method based on spatial correlation and optical flow registration | |
CN114757819A (en) | Structure-guided style deviation correction type style migration method and system | |
Qiu et al. | ARODNet: adaptive rain image enhancement object detection network for autonomous driving in adverse weather conditions | |
Li et al. | A Robust Deep Learning Enhanced Monocular SLAM System for Dynamic Environments | |
CN112634331A (en) | Optical flow prediction method and device | |
Tian et al. | Lightweight dual-task networks for crowd counting in aerial images | |
Tosi et al. | A survey on deep stereo matching in the twenties | |
Liu et al. | Deep learning for 3D human pose estimation and mesh recovery: A survey |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |