CN113592900A - Target tracking method and system based on attention mechanism and global reasoning - Google Patents

Target tracking method and system based on attention mechanism and global reasoning

Info

Publication number
CN113592900A
Authority
CN
China
Prior art keywords
feature map, attention mechanism, map, target tracking, global
Prior art date
Legal status
Pending
Application number
CN202110656309.6A
Other languages
Chinese (zh)
Inventor
鲍华 (Bao Hua)
束平 (Shu Ping)
许克应 (Xu Keying)
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN202110656309.6A
Publication of CN113592900A
Legal status: Pending

Classifications

    • G06T7/251 — Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving models
    • G06N3/045 — Neural network architectures; combinations of networks
    • G06N3/048 — Neural network architectures; activation functions
    • G06N3/08 — Neural networks; learning methods
    • G06N5/04 — Knowledge-based models; inference or reasoning models
    • G06T7/75 — Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06T2207/10016 — Image acquisition modality; video; image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method and system based on an attention mechanism and global reasoning, belonging to the technical field of computer vision. Target tracking is performed with a target tracking model based on a twin network; the model comprises a template branch and a search branch, each of which comprises a backbone network, a parallel attention mechanism and a global reasoning module. The target tracking method comprises the following steps: acquiring an initial frame picture and a current frame picture, and taking them respectively as the inputs of the template branch and the search branch to obtain a first score map and a second score map; performing a weighted summation of the first score map and the second score map to obtain a regression map; and determining the position of the target according to the regression map. Compared with existing tracking algorithms, the method achieves a better tracking effect.

Description

Target tracking method and system based on attention mechanism and global reasoning
Technical Field
The invention relates to the technical field of computer vision, in particular to a target tracking method and a target tracking system based on an attention mechanism and global reasoning.
Background
Target tracking is one of the challenging problems in the field of computer vision and the basis for higher-level visual understanding and scene analysis. Target tracking technology is widely used in video surveillance, human-computer interaction, robotics, video editing and autonomous driving. The visual object tracking task is to track a moving target continuously and stably in subsequent frames, given the target position and size in the initial frame. Owing to interference such as scale change, rotation, deformation and rapid motion of the target, as well as changes in background illumination, long-term stable target tracking remains a difficult task.
In recent years, research on the visual tracking task has focused on two aspects: the speed of the algorithm on the one hand, and the accuracy of tracking on the other. In terms of speed, correlation filtering is one of the most successful tracking frameworks; relying mainly on the fast Fourier transform and simple handcrafted features, it runs at speeds close to 700 frames per second. However, such methods often struggle in complex situations, where performance degrades sharply. In terms of accuracy, target tracking methods based on deep learning have shown very strong results. Compared with correlation filtering algorithms, their tracking performance is greatly improved and they handle the most difficult scenes better, but their speed is low.
To address the low tracking speed of deep-learning-based target tracking algorithms, target tracking algorithms based on twin (Siamese) networks have been proposed. Researchers first formulated target tracking with a twin network by converting the tracking problem into a patch-matching problem solved with a neural network; an end-to-end twin network tracking algorithm, SiamFC, was then proposed, whose speed led to many twin-network-based target trackers appearing in the following years.
Target tracking methods based on twin networks combine high speed with good accuracy and have therefore attracted strong attention, but existing twin-network tracking algorithms still have shortcomings. Taking the typical twin networks SiamFC and SiamRPN as examples, two disadvantages stand out. First, they use a shallow network structure, extract insufficient features, and do not attend well to the tracked target itself, so tracking may fail in the face of certain challenges. Second, neither considers context information, which easily leads to tracking failure for heavily occluded or strongly deformed objects.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and to obtain a better tracking effect.
In order to achieve the above object, in one aspect the invention adopts a target tracking method based on an attention mechanism and global inference, which performs target tracking with a target tracking model based on a twin network. The target tracking model includes a template branch and a search branch, each comprising a backbone network, a parallel attention mechanism and a global inference module, and the method includes:
acquiring an initial frame picture and a current frame picture, and respectively taking the initial frame picture and the current frame picture as the input of a template branch and a search branch to obtain a first score map and a second score map;
carrying out weighted summation on the first score map and the second score map to obtain a regression map;
and determining the position of the target according to the regression map.
Further, the backbone network adopts a ResNeXt network structure and is configured to perform feature extraction on the input initial frame picture or current frame picture, obtaining a feature map that serves as the input of the attention mechanism.
Further, the attention mechanism includes a channel attention mechanism and a spatial attention mechanism, wherein:
the spatial attention mechanism is used for processing the input feature map to obtain a first feature map;
the channel attention mechanism is used for processing the input feature map to obtain a second feature map;
and adding the first feature map and the second feature map in parallel to obtain an attention feature map, and using the attention feature map as the input of the global reasoning module.
Further, the global reasoning module is used for projecting the features of the attention feature map onto the nodes of an interaction space, performing reasoning, and then mapping the features of the interaction-space nodes back to the original space to obtain a mapped feature map; the mapped feature map is added to the attention feature map to obtain the new feature map.
Further, taking the initial frame picture and the current frame picture as the inputs of the template branch and the search branch respectively to obtain the first score map and the second score map includes:
and performing cross-correlation operation on the new feature graph output by the global reasoning module in the template branch and the new feature graph output by the global reasoning module in the search branch to respectively obtain the first score graph and the second score graph.
On the other hand, the target tracking system based on the attention mechanism and the global reasoning comprises a picture acquisition module and a target tracking module, wherein:
the image acquisition module is used for acquiring an initial frame image and a current frame image;
the target tracking module is used for processing the initial frame image and the current frame image by using a target tracking model to determine the position of a target, the target tracking model comprises a template branch and a search branch, the template branch and the search branch both comprise a trunk network, a parallel attention mechanism and a global reasoning module, the template branch and the search branch respectively process the initial frame image and the current frame image to obtain a first score map and a second score map, and the first score map and the second score map are subjected to weighted summation to obtain a regression map and determine the position of the target.
Further, the backbone network adopts a ResNeXt network structure and is configured to perform feature extraction on the input initial frame picture or current frame picture, obtaining a feature map that serves as the input of the attention mechanism.
Further, the attention mechanism includes a channel attention mechanism and a spatial attention mechanism, wherein:
the spatial attention mechanism is used for processing the input feature map to obtain a first feature map;
the channel attention mechanism is used for processing the input feature map to obtain a second feature map;
and adding the first feature map and the second feature map in parallel to obtain an attention feature map, and using the attention feature map as the input of the global reasoning module.
Further, the global reasoning module is used for projecting the features of the attention feature map onto the nodes of an interaction space, performing reasoning, and then mapping the features of the interaction-space nodes back to the original space to obtain a mapped feature map; the mapped feature map is added to the attention feature map to obtain the new feature map.
Compared with the prior art, the invention has the following technical effects: the invention uses a deeper network structure and adds a parallel attention mechanism, so that the extracted features are richer; at the same time, the added global reasoning module better accounts for global context information, thereby achieving a better tracking effect.
Drawings
The following detailed description of embodiments of the invention refers to the accompanying drawings in which:
FIG. 1 is a flow diagram of a method for target tracking based on attentional mechanisms and global reasoning;
FIG. 2 is an overall tracking block diagram of the target tracking method, which includes three parts, namely a backbone network, a parallel attention mechanism and a global reasoning module;
FIG. 3 is a block diagram of a spatial attention mechanism;
FIG. 4 is a block diagram of a channel attention mechanism;
FIG. 5 is a block diagram of a global inference module;
FIG. 6 is a comparative evaluation of the tracking algorithm of the invention and five other high-performance mainstream algorithms on the OTB100 benchmark data set, where (a) is the success rate plot and (b) is the precision plot;
FIG. 7 shows the precision plots of the tracking algorithm of the invention and five other high-performance mainstream algorithms under various challenges on the OTB100 data set;
FIG. 8 shows the success rate plots of the tracking algorithm of the invention and five other high-performance mainstream algorithms under various challenges on the OTB100 data set;
FIG. 9 is a qualitative comparison of the tracking algorithm of the invention with three other tracking algorithms on four video sequences in OTB100.
Detailed Description
To further illustrate the features of the present invention, refer to the following detailed description of the invention and the accompanying drawings. The drawings are for reference and illustration purposes only and are not intended to limit the scope of the present disclosure.
As shown in fig. 1 to 2, the present embodiment discloses a target tracking method based on an attention mechanism and global inference, which performs target tracking with a target tracking model based on a twin network. The target tracking model includes a template branch and a search branch, each comprising a backbone network, a parallel attention mechanism and a global inference module. The method includes the following steps S1 to S3:
s1, acquiring an initial frame picture and a current frame picture, and respectively taking the initial frame picture and the current frame picture as the input of a template branch and a search branch to obtain a first score map and a second score map;
s2, carrying out weighted summation on the first score map and the second score map to obtain a regression map;
and S3, determining the position of the target according to the regression map.
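Before detailing each component, the wiring of one branch can be sketched as follows. This is a minimal PyTorch sketch under the assumption that the three modules described in the remainder of this section are available as `backbone`, `attention` and `glore`; the class name and composition are illustrative, not the patent's literal implementation.

```python
import torch
import torch.nn as nn

class TrackerBranch(nn.Module):
    """One branch of the twin network: backbone -> parallel attention ->
    global reasoning. Both the template branch and the search branch share
    this structure (a sketch; module internals are given further below)."""
    def __init__(self, backbone: nn.Module, attention: nn.Module, glore: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.attention = attention
        self.glore = glore

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(img)      # feature extraction
        feat = self.attention(feat)    # parallel spatial + channel attention
        return self.glore(feat)        # global reasoning with context
```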
It should be noted that the backbone network for extracting features in the target tracking model adopts the ResNeXt network structure and is used to perform feature extraction on the input initial frame picture or current frame picture, producing a feature map that serves as the input of the attention mechanism. The details are as follows:
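The patent does not specify the ResNeXt depth or cardinality, so the sketch below assumes torchvision's resnext50_32x4d (a hypothetical choice) truncated before the final stage, so that a dense spatial feature map is produced rather than a classification vector:

```python
import torch
import torch.nn as nn
from torchvision.models import resnext50_32x4d

class ResNeXtBackbone(nn.Module):
    """Truncated ResNeXt feature extractor (a sketch; the exact variant and
    cut-off layer are assumptions, not stated in the patent)."""
    def __init__(self):
        super().__init__()
        net = resnext50_32x4d(weights=None)
        # Keep the stem and the first three stages; drop the stride-32 stage
        # and the classification head so a feature map remains.
        self.features = nn.Sequential(
            net.conv1, net.bn1, net.relu, net.maxpool,
            net.layer1, net.layer2, net.layer3,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)

# e.g. a 127x127 template crop -> a (1, 1024, 8, 8) feature map
feat = ResNeXtBackbone()(torch.randn(1, 3, 127, 127))
```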
the twin network has two branches, a template branch that takes as input a given initial frame picture, and a search branch that takes as input a picture of the current frame. The two branches are subjected to feature extraction through a complete convolution network, then cross-correlation operation is performed, and finally a score graph is obtained, wherein the specific situation can be represented by the following formula:
$$S(z, x) = \varphi(z) \star \varphi(x) + b \cdot I$$

where z denotes the template picture, x denotes the search picture, $\varphi(\cdot)$ denotes the feature map generated by the convolutional neural network, $\star$ denotes the cross-correlation operation, b is an offset value, I is an identity matrix, and S(z, x) is the final score map. The feature maps obtained from the two branches are cross-correlated to obtain S(z, x), and the position with the highest score is the position of the target.
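A minimal sketch of this cross-correlation in PyTorch, treating the template feature map as the convolution kernel (batch size 1 and the feature sizes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def xcorr(template_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    """Cross-correlate template and search features, SiamFC-style:
    S(z, x) = phi(z) * phi(x) + b. The template feature map acts as the
    convolution kernel (a sketch; in practice b is a learnable offset)."""
    b = torch.zeros(1)  # offset value
    # template_feat: (1, C, Hz, Wz), search_feat: (1, C, Hx, Wx)
    score = F.conv2d(search_feat, template_feat)  # (1, 1, Hx-Hz+1, Wx-Wz+1)
    return score + b

score = xcorr(torch.randn(1, 256, 6, 6), torch.randn(1, 256, 22, 22))
print(score.shape)  # torch.Size([1, 1, 17, 17])
```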
As a further preferred technical solution, the attention mechanism includes a channel attention mechanism and a spatial attention mechanism, wherein:
the spatial attention mechanism is used for processing the input feature map to obtain a first feature map;
the channel attention mechanism is used for processing the input feature map to obtain a second feature map;
and adding the first feature map and the second feature map in parallel to obtain an attention feature map, and using the attention feature map as the input of the global reasoning module.
As shown in fig. 3, the spatial attention mechanism in this embodiment adopts a compact, efficient spatial attention module with a small computational cost. The input feature map $F_{SA} \in \mathbb{R}^{C\times H\times W}$ is divided according to spatial position:

$$F_{SA} = [f_{1,1}, f_{1,2}, \ldots, f_{i,j}, \ldots, f_{H,W}]$$

where $f_{i,j} \in \mathbb{R}^{C\times 1\times 1}$ denotes the feature tensor at spatial position (i, j), with $i \in \{1, 2, \ldots, H\}$ and $j \in \{1, 2, \ldots, W\}$. The feature map $F_{SA}$ is fed into two branches: one branch generates the weight coefficients while the other remains unchanged. Finally, each weight coefficient is multiplied with the feature tensor at the corresponding position of the divided feature map, and the processed feature map $\hat{F}_{SA}$ is output as follows:

$$\hat{f}_{i,j} = \sigma(\mu_{i,j}) \cdot f_{i,j}$$

where $\mu_{i,j}$ is obtained from the feature tensor $f_{i,j}$ by a 1×1 convolution, and $\sigma(\cdot)$ denotes the sigmoid activation function.
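A minimal PyTorch sketch of this spatial attention module, assuming the weight map $\mu$ is produced by a single 1×1 convolution as described above:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: a 1x1 convolution produces one weight mu_{i,j} per
    spatial position, a sigmoid squashes it, and the weight rescales the
    feature tensor f_{i,j} at that position (a sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)  # C -> 1 weight map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.sigmoid(self.conv(x))  # (B, 1, H, W)
        return weights * x                     # broadcast over channels

out = SpatialAttention(256)(torch.randn(2, 256, 22, 22))
```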
As shown in FIG. 4, the channel attention mechanism in this embodiment divides the input feature map $F_{CA}$ into two branches. One branch keeps the original feature map unchanged; the other passes through global average pooling, a 1×1 convolution that compresses the channels, a 1×1 convolution that expands them back, and a sigmoid activation function, finally generating the weight coefficients. The original feature map is then weighted by the generated coefficients to obtain a new feature map.
The input feature map $F_{CA} \in \mathbb{R}^{C\times H\times W}$ is divided according to the number of channels:

$$F_{CA} = [f_1, f_2, \ldots, f_k, \ldots, f_C]$$

where $f_k \in \mathbb{R}^{1\times H\times W}$ and $k \in \{1, 2, \ldots, C\}$.
After global average pooling of the feature map, a feature tensor $z \in \mathbb{R}^{C\times 1\times 1}$ is generated; the value of its k-th channel is given by:

$$z_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_k(i, j)$$
The generated feature tensor then passes through two 1×1 convolution operations to obtain a new feature tensor z′, expressed by the following formula:

$$z' = W_1(\delta(W_2 z))$$

where $W_2 \in \mathbb{R}^{\frac{C}{r}\times C}$ is the weight of the first convolutional layer (the channel compression), $W_1 \in \mathbb{R}^{C\times\frac{C}{r}}$ is the weight of the second convolutional layer (the channel expansion), and $\delta(\cdot)$ is the ReLU activation function. The final feature map is obtained as follows:

$$\hat{F}_{CA} = \sigma(z') \cdot F_{CA}$$
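A minimal PyTorch sketch of this channel attention branch; the reduction ratio r of the compression convolution is an assumption, since the patent does not state it:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global average pooling, a 1x1 convolution that
    compresses C -> C/r, ReLU, a 1x1 convolution that expands back to C,
    then a sigmoid weight per channel (a sketch; r is an assumption)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                    # z_k per channel
        self.compress = nn.Conv2d(channels, channels // r, 1)  # W2
        self.expand = nn.Conv2d(channels // r, channels, 1)    # W1
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.pool(x)                               # (B, C, 1, 1)
        zp = self.expand(self.relu(self.compress(z)))  # z' = W1(delta(W2 z))
        return torch.sigmoid(zp) * x                   # per-channel reweighting

out = ChannelAttention(256)(torch.randn(2, 256, 22, 22))
```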
Finally, in the parallel attention mechanism, the feature map $\hat{F}_{SA}$ generated by the spatial attention mechanism and the feature map $\hat{F}_{CA}$ generated by the channel attention mechanism are added in parallel to yield a new feature map F, as shown in the following equation:

$$F = \hat{F}_{SA} + \hat{F}_{CA}$$
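Combining the two attention branches is then a single element-wise addition; the sketch below simply reuses the SpatialAttention and ChannelAttention modules sketched above.

```python
import torch.nn as nn

class ParallelAttention(nn.Module):
    """Parallel attention: both branches see the same input and their
    outputs are summed, F = F_SA + F_CA (reusing the sketches above)."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = SpatialAttention(channels)
        self.channel = ChannelAttention(channels)

    def forward(self, x):
        return self.spatial(x) + self.channel(x)
```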
as a further preferred technical solution, the global inference module is configured to project the features of the attention feature map onto nodes of an interaction space to form a completely connected map, perform inference, and then map the features of the interaction space nodes to an original space to obtain a new feature map; and adding the new feature map and the attention feature map to obtain a new feature map.
As shown in fig. 5, the global inference module consists of five convolutions: two for size reduction and expansion on the input feature map X and the output feature map Y (leftmost and rightmost), one for generating the bi-directional projection B between the coordinate space and the latent interaction space (top), and two for global reasoning based on the graph $A_g$ in the interaction space (middle). Here V encodes the region features as graph nodes, and $W_g$ represents the parameters of the graph convolution. The input feature map is reshaped into $X \in \mathbb{R}^{L\times c}$, where c is the number of channels and $L = h \times w$, and is mapped to the interaction space by the linear transformation:

$$V = BX$$

where $B \in \mathbb{R}^{N\times L}$ is the projection matrix and $V \in \mathbb{R}^{N\times c}$ gathers the features of the N interaction-space nodes.
The graph convolution in the interaction space can be expressed as:

$$Z = GVW_g = ((I - A_g)V)W_g$$

where G and $A_g$ denote the N×N node adjacency matrix, which diffuses information between the nodes; $W_g$ denotes the state update parameters; and $V \in \mathbb{R}^{N\times c}$ denotes the node matrix.
The interaction space is then mapped back to the original space to obtain a new feature map, specifically via the linear mapping:

$$\tilde{X} = B^{\mathsf{T}} Z$$

Finally, the new feature map is added to the original feature map to obtain the final feature map carrying context information.
In this embodiment, a global reasoning module is added to each of the two branches of the twin network, and each performs a cross-correlation operation with the other branch to obtain a score map; the resulting score maps are weighted and summed according to the following formula:
$$S(z, x) = \phi S_1(z, x) + (1 - \phi) S_2(z, x)$$

where $S_1(z, x)$ is the score map obtained with the global reasoning module added to the template branch, $S_2(z, x)$ is the score map obtained with the global reasoning module added to the search branch, and $\phi$ is the weight coefficient, with $\phi = 0.5$ in this embodiment; S(z, x) is the final output score map.
The embodiment also discloses a target tracking system based on the attention mechanism and the global reasoning, which comprises a picture acquisition module and a target tracking module, wherein:
the image acquisition module is used for acquiring an initial frame image and a current frame image;
the target tracking module is used for processing the initial frame image and the current frame image by using a target tracking model to determine the position of a target, the target tracking model comprises a template branch and a search branch, the template branch and the search branch both comprise a trunk network, a parallel attention mechanism and a global reasoning module, the template branch and the search branch respectively process the initial frame image and the current frame image to obtain a first score map and a second score map, and the first score map and the second score map are subjected to weighted summation to obtain a regression map and determine the position of the target.
As a further preferred technical solution, the backbone network adopts a ResNeXt network structure and is configured to perform feature extraction on the input initial frame picture or current frame picture, obtaining a feature map that serves as the input of the attention mechanism.
As a further preferred technical solution, the attention mechanism includes a channel attention mechanism and a spatial attention mechanism, wherein:
the spatial attention mechanism is used for processing the input feature map to obtain a first feature map;
the channel attention mechanism is used for processing the input feature map to obtain a second feature map;
and adding the first feature map and the second feature map in parallel to obtain an attention feature map, and using the attention feature map as the input of the global reasoning module.
As a further preferred technical solution, the global inference module is configured to project the features of the attention feature map onto the nodes of an interaction space, perform inference, and then map the features of the interaction-space nodes back to the original space to obtain a mapped feature map; the mapped feature map is added to the attention feature map to obtain the new feature map.
The target tracking system based on attention mechanism and global inference disclosed in this embodiment has the same technical features and technical effects as the target tracking method based on attention mechanism and global inference disclosed in the above embodiment, and details are not repeated here.
Compared with existing tracking algorithms, the invention achieves a better tracking effect, as verified by the following experiments:
the data set for the experiments of this example was OTB100[ Yi Wu, Jongwood Lim, and Ming-Hsua Yang. object tracking bearing mark. IEEE Transactions on Pattern Analysis and Machine Analysis, 37(9): 1834-1848,2015.2 ]. It consists of 100 video frames and was proposed in 2015. Different data sets are also labeled with different attributes. There are 11 different attributes. These attributes may represent common difficulties in the field of target tracking. Such as Illumination Variation (IV), Scale Variation (SV), Occlusion (OCC), Deformation (DEF), Motion Blur (MB), Fast Motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background similarity (BC), Low Resolution (LR).
The quality of a tracking algorithm on the OTB100 data set is assessed by its precision plot and success plot. The precision plot shows the percentage of frames for which the distance between the target center estimated by the tracking algorithm and the manually annotated target center is less than a given threshold. Since the precision plot cannot reflect changes in the size and scale of the target object, the success plot was proposed. The success plot shows, for a given overlap-rate threshold, the percentage of frames whose overlap rate exceeds that threshold, where the overlap rate is calculated by the following formula:
$$O = \frac{|B \cap G|}{|B \cup G|}$$

where O is the overlap rate, B is the bounding-box area obtained by the tracking algorithm, G is the ground-truth bounding-box area, ∩ is the intersection operation, and ∪ is the union operation.
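A small self-contained sketch of this overlap computation for axis-aligned boxes given as (x, y, w, h):

```python
def overlap_rate(b, g):
    """Overlap (coincidence) rate O = |B ∩ G| / |B ∪ G| between a tracked
    box b and a ground-truth box g, each given as (x, y, w, h)."""
    x1, y1 = max(b[0], g[0]), max(b[1], g[1])
    x2 = min(b[0] + b[2], g[0] + g[2])
    y2 = min(b[1] + b[3], g[1] + g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # intersection area
    union = b[2] * b[3] + g[2] * g[3] - inter        # union area
    return inter / union if union > 0 else 0.0

print(overlap_rate((0, 0, 10, 10), (5, 5, 10, 10)))  # 25/175 ≈ 0.143
```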
As shown in FIG. 6, the tracking algorithm of the invention is tested on the OTB100 data set and the results are compared with five mainstream algorithms of recent years: SiamRPN [Li B, Yan J, Wu W, et al. High performance visual tracking with siamese region proposal network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 8971-8980], CFNet [Valmadre J, Bertinetto L, Henriques J, et al. End-to-end representation learning for correlation filter based tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2805-2813], SiamFC3s [Wang L, Ouyang W, Wang X, et al. Visual tracking with fully convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, 2015: 3119-3127], Staple [Bertinetto L, Valmadre J, Golodetz S, et al. Staple: Complementary learners for real-time tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 1401-1409] and fDSST [Danelljan M, Häger G, Khan F S, et al. Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(8): 1561-1575]. The algorithm of the invention achieves a better effect. Compared with the SiamFC algorithm, the disclosed tracking algorithm improves both average success rate and precision, with the success rate higher by 6.9 percentage points and the precision higher by 9.6 percentage points. Compared with the SiamRPN algorithm it also improves, with the success rate and precision higher by 1.8 and 2.4 percentage points respectively.
As shown in fig. 7 and 8, the five classic tracking algorithms and the algorithm of the invention are compared in terms of success rate and precision under the different attributes of OTB100. The tracking challenges represented in panels (a), (b), (c), (d), (e), (f), (g) and (h) of fig. 7 and 8 are, respectively: background clutter, deformation, low resolution, motion blur, occlusion, out-of-plane rotation, out-of-view and scale variation. Fig. 7 and 8 show that the tracking algorithm of the invention outperforms the SiamRPN, CFNet, SiamFC3s, Staple and fDSST algorithms in both success rate and precision in the face of the above challenges.
As shown in fig. 9, four challenging video sequences in the OTB100 data set are selected, and the results of the tracking algorithm of the invention are compared with the ground truth and with the results of SiamFC and Staple. It can be seen that the tracking algorithm of the invention has significant advantages in handling occlusion, deformation, motion blur and scale change.
The tracking challenges in the "Bolt2" video sequence are deformation and background clutter. All of the above algorithms handle the deformation challenge fairly well, but under the background-clutter challenge SiamFC performs worse, as shown in frames 235 and 252, while the algorithm of the invention still performs well.
The tracking challenges in the "Box" video sequence include illumination change, scale change, occlusion, motion blur, in-plane rotation, out-of-plane rotation, out-of-view, background clutter and low resolution. For illumination change, out-of-view and low resolution, the three tracking algorithms behave similarly in the figure, but for the six challenges of scale change, occlusion, motion blur, in-plane rotation, out-of-plane rotation and background clutter, the algorithm of the invention performs better. In the figure, the SiamFC algorithm loses the target in frames 43, 357 and 945, and the Staple algorithm loses it in frame 641, whereas the tracking state of the proposed algorithm remains stable throughout and the target is never lost.
The tracking challenges present in the "DragonBaby" video sequence are scale change, occlusion, motion blur, fast motion, in-plane rotation, out-of-plane rotation and out-of-view. As shown, all three algorithms can basically follow the target at frame 19, but only the tracking algorithm of the invention keeps following it when motion blur and fast motion occur at frame 44, occlusion at frame 48, and out-of-plane rotation at frame 80.
The tracking challenges present in the "Girl2" video sequence are scale variation, occlusion, deformation, motion blur and out-of-plane rotation. As shown in the figure, when occlusion is about to occur at frame 107, the tracking states of the three algorithms are similar, but after the occlusion only the tracking algorithm of the invention still tracks the target at frame 239. Similarly, the proposed algorithm still keeps up with the target under the deformation in frames 842 and 927, where the other two algorithms perform poorly.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A target tracking method based on an attention mechanism and global reasoning, characterized in that a target tracking model based on a twin network is used for target tracking, the target tracking model comprises a template branch and a search branch, the template branch and the search branch each comprise a backbone network, a parallel attention mechanism and a global reasoning module, and the target tracking method comprises the following steps:
acquiring an initial frame picture and a current frame picture, and respectively taking the initial frame picture and the current frame picture as the input of a template branch and a search branch to obtain a first score map and a second score map;
carrying out weighted summation on the first score map and the second score map to obtain a regression map;
and determining the position of the target according to the regression map.
2. The target tracking method based on attention mechanism and global inference as claimed in claim 1, wherein the backbone network adopts a ResNeXt network structure and is used to perform feature extraction on the input initial frame picture or current frame picture, obtaining a feature map that serves as the input of the attention mechanism.
3. The method of target tracking based on attention mechanism and global inference as claimed in claim 1, wherein said attention mechanism comprises a channel attention mechanism and a spatial attention mechanism, wherein:
the spatial attention mechanism is used for processing the input feature map to obtain a first feature map;
the channel attention mechanism is used for processing the input feature map to obtain a second feature map;
and adding the first feature map and the second feature map in parallel to obtain an attention feature map, and using the attention feature map as the input of the global reasoning module.
4. The target tracking method based on attention mechanism and global inference as claimed in claim 3, wherein the global inference module is used for projecting the features of the attention feature map onto the nodes of the interaction space, performing inference, and then mapping the features of the interaction-space nodes back to the original space to obtain a mapped feature map; the mapped feature map is added to the attention feature map to obtain the new feature map.
5. The method for tracking targets based on attention mechanism and global inference as claimed in claim 4, wherein taking the initial frame picture and the current frame picture as the inputs of the template branch and the search branch respectively to obtain the first score map and the second score map comprises:
and performing cross-correlation operation on the new feature graph output by the global reasoning module in the template branch and the new feature graph output by the global reasoning module in the search branch to respectively obtain the first score graph and the second score graph.
6. A target tracking system based on attention mechanism and global reasoning is characterized by comprising a picture acquisition module and a target tracking module, wherein:
the image acquisition module is used for acquiring an initial frame image and a current frame image;
the target tracking module is used for processing the initial frame image and the current frame image by using a target tracking model to determine the position of a target, the target tracking model comprises a template branch and a search branch, the template branch and the search branch both comprise a trunk network, a parallel attention mechanism and a global reasoning module, the template branch and the search branch respectively process the initial frame image and the current frame image to obtain a first score map and a second score map, and the first score map and the second score map are subjected to weighted summation to obtain a regression map and determine the position of the target.
7. The target tracking system based on attention mechanism and global inference as claimed in claim 6, wherein the backbone network adopts a ResNeXt network structure and is used to perform feature extraction on the input initial frame picture or current frame picture, obtaining a feature map that serves as the input of the attention mechanism.
8. The target tracking system based on attention mechanism and global inference of claim 6, characterized in that the attention mechanism comprises a channel attention mechanism and a spatial attention mechanism, wherein:
the spatial attention mechanism is used for processing the input feature map to obtain a first feature map;
the channel attention mechanism is used for processing the input feature map to obtain a second feature map;
and adding the first feature map and the second feature map in parallel to obtain an attention feature map, and using the attention feature map as the input of the global reasoning module.
9. The target tracking system based on attention mechanism and global inference as claimed in claim 8, wherein the global inference module is used for projecting the features of the attention feature map onto the nodes of the interaction space, performing inference, and then mapping the features of the interaction-space nodes back to the original space to obtain a mapped feature map; the mapped feature map is added to the attention feature map to obtain the new feature map.
CN202110656309.6A 2021-06-11 2021-06-11 Target tracking method and system based on attention mechanism and global reasoning Pending CN113592900A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110656309.6A CN113592900A (en) 2021-06-11 2021-06-11 Target tracking method and system based on attention mechanism and global reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110656309.6A CN113592900A (en) 2021-06-11 2021-06-11 Target tracking method and system based on attention mechanism and global reasoning

Publications (1)

Publication Number Publication Date
CN113592900A true CN113592900A (en) 2021-11-02

Family

ID=78243779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110656309.6A Pending CN113592900A (en) 2021-06-11 2021-06-11 Target tracking method and system based on attention mechanism and global reasoning

Country Status (1)

Country Link
CN (1) CN113592900A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115018878A (en) * 2022-04-21 2022-09-06 哈尔滨工业大学 Attention mechanism-based target tracking method in complex scene, storage medium and equipment
CN115661207A (en) * 2022-11-14 2023-01-31 南昌工程学院 Target tracking method and system based on space consistency matching and weight learning


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978921A (en) * 2019-04-01 2019-07-05 南京信息工程大学 A kind of real-time video target tracking algorithm based on multilayer attention mechanism
CN110472495A (en) * 2019-07-08 2019-11-19 南京邮电大学盐城大数据研究院有限公司 A kind of deep learning face identification method based on graphical inference global characteristics
CN111192292A (en) * 2019-12-27 2020-05-22 深圳大学 Target tracking method based on attention mechanism and twin network and related equipment
CN111192270A (en) * 2020-01-03 2020-05-22 中山大学 Point cloud semantic segmentation method based on point global context reasoning
CN111354017A (en) * 2020-03-04 2020-06-30 江南大学 Target tracking method based on twin neural network and parallel attention module
CN111428699A (en) * 2020-06-10 2020-07-17 南京理工大学 Driving fatigue detection method and system combining pseudo-3D convolutional neural network and attention mechanism
CN112560695A (en) * 2020-12-17 2021-03-26 中国海洋大学 Underwater target tracking method, system, storage medium, equipment, terminal and application

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yunpeng Chen: "Graph-Based Global Reasoning Networks", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9 January 2020 (2020-01-09), pages 433-442 *


Similar Documents

Publication Publication Date Title
CN108090919B (en) Improved kernel correlation filtering tracking method based on super-pixel optical flow and adaptive learning factor
Tang et al. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition
CN108830170B (en) End-to-end target tracking method based on layered feature representation
CN110135365B (en) Robust target tracking method based on illusion countermeasure network
Kim et al. Fast pedestrian detection in surveillance video based on soft target training of shallow random forest
CN110942476A (en) Improved three-dimensional point cloud registration method and system based on two-dimensional image guidance and readable storage medium
CN113592900A (en) Target tracking method and system based on attention mechanism and global reasoning
CN112288627A (en) Recognition-oriented low-resolution face image super-resolution method
CN111415318B (en) Unsupervised related filtering target tracking method and system based on jigsaw task
CN112163498A (en) Foreground guiding and texture focusing pedestrian re-identification model establishing method and application thereof
CN111488932A (en) Self-supervision video time-space characterization learning method based on frame rate perception
CN111882581B (en) Multi-target tracking method for depth feature association
CN112183675A (en) Twin network-based tracking method for low-resolution target
CN112785626A (en) Twin network small target tracking method based on multi-scale feature fusion
Yu et al. Background subtraction based on GAN and domain adaptation for VHR optical remote sensing videos
CN111968155A (en) Target tracking method based on segmented target mask updating template
Wani et al. Deep learning-based video action recognition: a review
Savadi Hosseini et al. A hybrid deep learning architecture using 3d cnns and grus for human action recognition
CN115063717B (en) Video target detection and tracking method based on real scene modeling of key area
CN116734834A (en) Positioning and mapping method and device applied to dynamic scene and intelligent equipment
CN115512263A (en) Dynamic visual monitoring method and device for falling object
CN114972426A (en) Single-target tracking method based on attention and convolution
Wang et al. A spatio-temporal attention convolution block for action recognition
Ouanan et al. Pubface: Celebrity face identification based on deep learning
Sun et al. Research on robot target recognition based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination