CN117576149A - Single-target tracking method based on attention mechanism - Google Patents
- Publication number: CN117576149A
- Application number: CN202311360574.5A
- Authority: CN (China)
- Prior art keywords: network, feature map, attention, training, feature
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T7/248 — Analysis of motion using feature-based methods involving reference images or patches
- G06N3/045 — Combinations of networks
- G06N3/0455 — Auto-encoder networks; encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/084 — Backpropagation, e.g. using gradient descent
- G06N3/0985 — Hyperparameter optimisation; meta-learning; learning-to-learn
- G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/443 — Local feature extraction by matching or filtering
- G06V10/449 — Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451 — Biologically inspired filters with interaction between the filter responses
- G06V10/454 — Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/62 — Extraction of image or video features relating to a temporal dimension; pattern tracking
- G06V10/764 — Recognition using classification, e.g. of video objects
- G06V10/766 — Recognition using regression, e.g. by projecting features on hyperplanes
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Recognition using neural networks
- G06V20/46 — Extracting features or characteristics from video content, e.g. video fingerprints, representative shots or key frames
- G06T2207/10016 — Video; image sequence
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
Abstract
The invention belongs to the technical field of computer vision, and in particular relates to a single-target tracking method based on an attention mechanism, comprising the following steps: step 1, data preprocessing, which prepares the data for subsequent network model training; step 2, constructing a twin network framework based on temporal information; step 3, adding an attention module; step 4, model training, namely training the constructed network model to obtain the network weights of the attention-based single-target tracking architecture; step 5, adding a bounding box refinement module; and step 6, model testing, namely using the network weights obtained through training to test the target tracking effect on a new video sequence. By combining the attention module with the online extraction module, the invention reduces the interference of unnecessary features in the computation and thereby further improves tracking precision, and by introducing the bounding box refinement module it greatly improves model performance.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a single-target tracking method based on an attention mechanism.
Background
Single-target tracking is one of the hot topics in computer vision research and is widely applied: camera focus tracking, automatic target tracking by unmanned aerial vehicles (UAVs), and similar applications all rely on single-target tracking technology. Tracking of specific objects — such as human body tracking, vehicle tracking in traffic monitoring systems, face tracking, and gesture tracking in intelligent interaction systems — likewise depends on the target tracking task. With sustained research effort, visual target tracking has made breakthrough progress over the past decade or so, and visual tracking algorithms are no longer limited to traditional machine learning methods; in recent years they have incorporated deep learning, the current wave of artificial intelligence, together with correlation filters and related methods, yielding robust and accurate results. Following the great success of deep learning in speech recognition, image classification, target detection, and other fields, deep learning frameworks are being applied ever more widely to the target tracking task, and target tracking technology has found broader application across many areas of society. In the current social environment especially, the demand from all sectors for high-technology tracking continues to rise, making target tracking technology all the more important.
In recent years, UAVs — by virtue of their small size, flexible operation, large detection range, and strong autonomy — have been increasingly applied in many fields such as civilian use, military use, and scientific research, and UAV tracking of ground targets has drawn growing attention. Most existing tracking algorithms target natural scenes, and there is as yet no complete solution for tracking tasks in UAV scenes. In single-target tracking datasets for general scenes, the tracked targets are mostly common object categories such as pedestrians, vehicles, and animals, typically captured by ordinary cameras, so the targets are relatively large, with distinct outlines and clear textures. The corresponding difficulty is that such targets deform easily during motion — a pedestrian, for example, may turn, raise a hand, or jump — which poses great challenges for a tracking algorithm. Driven by the shooting requirements of special scenes, UAVs are being used ever more widely in daily life, and many researchers are studying how to equip them with rich computer vision capabilities such as target detection and tracking. One benefit of using UAVs for target tracking is the ability to track group targets such as gathered crowds, animals, and ships; meanwhile, aerial target tracking is being applied across industries, for example filming extreme sports such as high-altitude search and rescue or gliding, monitoring dense crowds, and monitoring wildlife.
The main differences between single-target tracking in UAV scenes and in general scenes are that UAV tracking in aerial environments generally lasts longer; the greater shooting height makes the tracking bounding boxes inaccurate; the limited capability of the aerial platform constrains computing resources, so redundant features interfere with the computation; and the shooting angle changes easily, making large target-scale changes caused by rapid camera motion more pronounced. To address these problems, a single-target tracking method based on an attention mechanism is proposed.
Disclosure of Invention
(I) technical problems to be solved
Aiming at the defects of the prior art, the invention provides a single-target tracking method based on an attention mechanism, which solves the problems in the background art.
(II) technical scheme
The invention adopts the following technical scheme for realizing the purposes:
a single target tracking method based on an attention mechanism, comprising the steps of:
step 1, data preprocessing, namely providing data preparation for subsequent network model training;
step 2, constructing a twin network frame based on time sequence information;
step 3, adding an attention module;
step 4, training the model, namely training the constructed network model, and finally obtaining the network weights of the attention-based single-target tracking architecture;
step 5, adding a bounding box refinement module;
and step 6, testing the model, namely using the network weights obtained through training to test the target tracking effect on a new video sequence.
Further, in step 1, a data preprocessing operation cuts every video frame in each dataset to a fixed size and places the results in newly generated folders; these folders contain the cropped template and search-area sample pictures used for training, where the template picture Z is 127×127 and the search-area picture X is 511×511.
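As a minimal sketch of this preprocessing step (the centre-crop-with-padding policy shown here is an illustrative assumption; the patent only specifies the output sizes 127×127 and 511×511), the crop can be written in NumPy as:

```python
import numpy as np

def center_crop(img: np.ndarray, size: int) -> np.ndarray:
    """Crop a square size x size patch around the image centre,
    zero-padding when the image is smaller than the crop."""
    h, w = img.shape[:2]
    out = np.zeros((size, size) + img.shape[2:], dtype=img.dtype)
    cy, cx = h // 2, w // 2
    y0, x0 = cy - size // 2, cx - size // 2
    ys, ye = max(y0, 0), min(y0 + size, h)
    xs, xe = max(x0, 0), min(x0 + size, w)
    out[ys - y0:ye - y0, xs - x0:xe - x0] = img[ys:ye, xs:xe]
    return out

frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
template = center_crop(frame, 127)   # template picture Z
search = center_crop(frame, 511)     # search-area picture X
```

In practice the crop would be centred on the annotated target rather than the frame centre; the fixed sizes match the two folders of sample pictures described above.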
Further, the framework in step 2 comprises a twin network for feature extraction, a Transformer-based similarity-map refinement network, and a classification-regression sub-network for classifying and regressing the target location.
Further, in step 3 an attention module is added. The attention module first applies an adaptive average pooling layer that reduces the input feature to a global average; the reduced feature is then mapped to context weights through a series of convolution layers and activation functions; finally, the context weights are combined with the input feature to obtain a weighted feature representation. This attention mechanism both models the global context effectively and is computationally lightweight, reducing the interference of unnecessary features in the computation. The attention mechanism is expressed as:

z_i = x_i + δ( Σ_j α_j · x_j ), with α_j = exp(W_k x_j) / Σ_m exp(W_k x_m)

where α_j is the weight of the global attention pooling and δ(·) = W_v2 ReLU(LN(W_v1(·))) denotes a bottleneck transformation. This attention module is lightweight, and long-range dependencies can be better captured without increasing computational cost.
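The global-context weighting and bottleneck transform described above can be sketched in NumPy as follows; the tensor shapes and random projection matrices are illustrative assumptions, not the patent's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(v, eps=1e-5):
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def global_context_block(x, w_k, w_v1, w_v2):
    """x: (C, N) features flattened over N spatial positions.
    Adds a per-channel global-context term delta(sum_j alpha_j x_j)."""
    logits = w_k @ x                        # (1, N) attention logits W_k x_j
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                    # global attention pooling weights
    context = x @ alpha.T                   # (C, 1) weighted sum of features
    z = w_v2 @ np.maximum(layer_norm(w_v1 @ context), 0)  # bottleneck delta(.)
    return x + z                            # broadcast add over all positions

C, N, R = 8, 25, 2                          # channels, positions, bottleneck dim
x = rng.standard_normal((C, N))
y = global_context_block(
    x, rng.standard_normal((1, C)),
    rng.standard_normal((R, C)), rng.standard_normal((C, R)))
```

Note that the added context term is the same vector at every spatial position, which is what makes the block cheap: one softmax over positions and one bottleneck MLP per frame.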
Further, the model training in step 4 comprises the following steps:
s1, inputting sample pictures into a twin network for training, wherein the training process is performed offline, and 5 data sets of COCO, imageNet-VID 2015, GOT-10K, laSOT and YOUTUBEBB are adopted for training;
s2, the twin network is used for measuring the similarity of input samples: the sample comprises a target image and a search image, wherein the target image is 127x127 of images to be tracked, the search image is 511x511 of images to be executed to track the target; the twin neural network has two input branches, one branch target template Z is used as input, the other branch is used as input, the two inputs are sent into the two weight-shared neural networks, and the two neural networks map the inputs to new spaces respectively to form representations of the inputs in the new spaces;
s3, extracting a network from the feature map: the target template and the search area are placed in the same feature extraction network, namely Alexnet, three layers of convolutions in front of the Alexnet network are reserved, the fourth layer of convolutions and the fifth layer of convolutions are replaced by on-line time sequence self-adaptive convolutions, and finally two feature maps are obtained through the feature extraction network, namely a target image feature map and a search area feature map;
s4, refining a network based on a similarity graph of a transducer: the two-branch feature map information output by the feature extraction network is simultaneously input into a transducer network, firstly, timing sequence priori knowledge with fixed size is designed, the priori knowledge is effectively encoded in a high-efficiency memory mode, then the priori knowledge is decoded for accurately adjusting a similar map, and then information filtering is carried out to obtain a feature map of a current frame, wherein the feature map is used as the most basic component of the transducer, a multi-head attention formula is shown as follows, and 6 heads are used in the text:
For clarity, denote the temporal knowledge of frame t−1 by F̃_{t−1} and the feature of the current frame by F_t; an intermediate result F̄_t is first produced from them by the attention mechanism. The output F̂_t of the information filter is then obtained from F̄_t through a convolution layer F, and the temporal knowledge F̃_t of the final current frame is obtained by combining F̂_t with F_t;
for t=1, an independent convolution is used for the initialization operation;
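Under the standard scaled dot-product definition given above, a minimal NumPy sketch of six-head attention follows; the learned projection matrices W_i^Q, W_i^K, W_i^V, W^O are omitted (identity projections assumed) for brevity, which is a simplification relative to a full Transformer layer:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(q, k, v, heads=6):
    """Scaled dot-product attention split over `heads` heads.
    q, k, v: (N, D) arrays with D divisible by `heads`."""
    n, d = q.shape
    dh = d // heads
    out = np.empty_like(q)
    for i in range(heads):
        s = slice(i * dh, (i + 1) * dh)
        scores = softmax(q[:, s] @ k[:, s].T / np.sqrt(dh))  # (N, N) per head
        out[:, s] = scores @ v[:, s]
    return out

rng = np.random.default_rng(2)
q = rng.standard_normal((10, 48))            # 48 = 6 heads x 8 dims per head
y = multi_head_attention(q, q, q, heads=6)   # self-attention over 10 tokens
```

Because each row of the attention scores sums to one, attending over a constant value sequence returns that constant — a quick sanity check on the implementation.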
s5, classifying and returning to a network: firstly, mapping each position (i, j) in the feature map back to a search area (x, y), and obtaining a classification branch and a regression branch through convolution of a response map; classifying branches to obtain a classification characteristic diagramAnd center feature map->The classification characteristic map is a class for predicting each position, classification characteristic map +.>Each point (i, j,:) above contains a 2D vector representing the corresponding foreground and background scores, respectively; simultaneously with the classification feature map there is also a central feature map, central feature map +.>The center of each pixel point is given a score, the center is the center position with high score, the center can be used for deleting abnormal values, and the position far away from the center often generates a low-quality prediction boundary box;
regression branch output regression feature map of classified regression networkRegression profile->Each point (i, j,:) contains a 4D vector t (i, j) = (l, t, r, b) representing the distance from the corresponding location to the four sides of the bounding box in the input search area, set (x) 0 ,y 0 ) And (x) 1 ,y 1 ) Representing truth bounding boxes(x, y) represents the corresponding position of point (i, j) at a point +.>Regression objective of->The method can be calculated by the following formula:
wherein (x) 0 ,y 0 ) And (x) 1 ,y 1 ) Representing the top left and bottom right corners of the truth bounding box,representing the corresponding point on the regression feature map +.>Regression objective of->Respectively representing the distances from the points on the regression feature map to the four sides of the bounding box.
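The regression target above can be computed directly; this small helper is an illustrative sketch of the formula, with the concrete numbers chosen only for demonstration:

```python
import numpy as np

def regression_targets(x, y, box):
    """Distances (l, t, r, b) from location (x, y) to the four sides of a
    ground-truth box given as (x0, y0, x1, y1) top-left / bottom-right."""
    x0, y0, x1, y1 = box
    return np.array([x - x0, y - y0, x1 - x, y1 - y], dtype=float)

t = regression_targets(60.0, 80.0, (20.0, 30.0, 120.0, 150.0))
# l = 60-20 = 40, t = 80-30 = 50, r = 120-60 = 60, b = 150-80 = 70
```

A location yields all-positive targets exactly when it lies inside the ground-truth box, which is how such points are usually selected as positive training samples.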
The whole network is trained end-to-end. Let the classification loss be L_cls, the bounding box regression loss be L_reg, and the centerness loss be L_cen; these components are weighted by their corresponding coefficients and summed to form the overall loss of the whole system:

L = L_cls + λ_1 L_cen + λ_2 L_reg

where cross-entropy loss is adopted for classification, IoU loss is used for regression, and the centerness loss is used as well.
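A hedged sketch of the three loss terms and their weighted sum follows; the λ values and sample scores are illustrative assumptions (the patent does not state the weights), and the IoU loss uses the (l, t, r, b) parameterisation from the regression branch, where predicted and ground-truth boxes share an anchor point:

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy over predicted probabilities p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def iou_loss(pred, gt):
    """1 - IoU for two boxes given as (l, t, r, b) distances from the
    same location, so intersection sides use element-wise minima."""
    pl, pt, pr, pb = pred
    gl, gt_, gr, gb = gt
    inter = (min(pl, gl) + min(pr, gr)) * (min(pt, gt_) + min(pb, gb))
    union = (pl + pr) * (pt + pb) + (gl + gr) * (gt_ + gb) - inter
    return 1.0 - inter / union

lam1, lam2 = 1.0, 1.0   # lambda_1, lambda_2: illustrative weights
L = bce(np.array([0.9, 0.2]), np.array([1.0, 0.0])) \
    + lam1 * bce(np.array([0.8]), np.array([1.0])) \
    + lam2 * iou_loss((40, 50, 60, 70), (35, 50, 65, 70))
```

A perfectly predicted box gives an IoU loss of zero, so with confident classification and centerness scores the total loss approaches zero.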
according to the calculation gradient of the loss function L, updating the parameters of the network by using an optimizer SGD to reduce the loss function of the whole network until convergence, and ending the whole training to obtain trained network weight based on single-target tracking of attention;
s6, pre-training: the network backbone is pre-trained on ImageNet, the same initialization is used for the online time sequence self-adaptive convolution layer, the similarity graph refinement network is randomly initialized, parameters of the backbone are frozen for the first 5 epochs, the learning rate on the rest epochs of the training process is reduced from 0.005 to 0.0005, SGD is adopted as an optimizer, the momentum is 0.9, the small batch size is 124, and the input sizes of the template and the search area are 127x127 and 511x511 respectively.
Further, in step 5 a bounding box refinement network is added. The bounding box refinement module first requires template information as input for initialization; its search area comes from a local search region, about twice the target size, cropped around the current bounding box output by the tracker. The feature fusion operation plays a key role in the bounding box refinement module, and point-wise convolution is adopted in the module for feature fusion. Suppose the template feature map is K ∈ R^{C×H_0×W_0} and the search-area feature map is S ∈ R^{C×H×W}; K is first decomposed into H_0×W_0 small feature maps K_j ∈ R^{C×1×1}, each of which then performs an ordinary correlation operation with S, yielding a fused feature map of size (H_0·W_0)×H×W. This process can be described as

R_j = K_j ⋆ S, j = 1, …, H_0·W_0

where ⋆ denotes the ordinary correlation operation. Point-wise convolution of the template feature map with the search-area feature map avoids the spatial ambiguity caused by sliding-window cross-correlation of the whole template feature map over the search-area feature map. The refined bounding box predicted by this module is taken as the final output result, and the tracker is simultaneously updated with this result.
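The point-wise decomposition above reduces to a single tensor contraction: each of the H_0·W_0 template positions acts as a 1×1 kernel applied over the whole search map. A NumPy sketch (with illustrative channel and spatial sizes) is:

```python
import numpy as np

def pointwise_correlation(k, s):
    """k: template features (C, H0, W0); s: search features (C, H, W).
    Flattens k into H0*W0 kernels K_j in R^{C x 1 x 1} and correlates
    each with the whole search map, giving an (H0*W0, H, W) fused map."""
    c, h0, w0 = k.shape
    kernels = k.reshape(c, h0 * w0)
    return np.einsum('cj,chw->jhw', kernels, s)

rng = np.random.default_rng(1)
fused = pointwise_correlation(rng.standard_normal((64, 5, 5)),
                              rng.standard_normal((64, 16, 16)))
```

Because every kernel covers only one template position, the response map preserves where in the template each match comes from, which is the spatial-ambiguity advantage claimed over whole-template sliding-window cross-correlation.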
Further, the model test in step 6 comprises the following step:
S1, using the trained weight parameters to test the tracking effect on a new, previously unseen video sequence.
(III) beneficial effects
Compared with the prior art, the invention provides a single-target tracking method based on an attention mechanism, which has the following beneficial effects:
according to the invention, the attention template is added, so that the problem of interference of unnecessary features on calculation is reduced, and the object representation with more discrimination is generated, thereby greatly improving the model performance.
By introducing the bounding box refinement module, when the tracker finds the target in the local search area the algorithm outputs the target's bounding box as the current tracking result. Because existing algorithms can only roughly estimate the target's bounding box, in many cases they fail to enclose the target well, which lowers tracking accuracy; using the output of the bounding box refinement module as the final result improves algorithm performance.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a schematic diagram of the overall network framework structure of the present invention;
FIG. 3 is a schematic diagram of the attention structure of the present invention;
FIG. 4 is a schematic diagram of a training architecture of the present invention;
FIG. 5 is a schematic diagram of a bounding box refinement module of the present invention;
FIG. 6 is a graph evaluating the present invention on the UAV123 test dataset;
FIG. 7 is a schematic diagram of the structure of the result of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
As shown in fig. 1-7, a single-target tracking method based on an attention mechanism according to an embodiment of the present invention includes the following steps:
step 1, a data preprocessing operation cuts every video frame in each dataset to a fixed size and places the results in newly generated folders; these folders contain the cropped template and search-area sample pictures used for training, where the template picture Z is 127×127 and the search-area picture X is 511×511;
step 2, constructing the temporal-information twin network framework; the core idea of target tracking is that the object to be tracked is first framed in the initial frame and serves as the retrieval basis for subsequent frames. Next, the target and search images are fed into the twin network simultaneously, and two feature maps are output. The two-branch feature map information output by the feature extraction network is then input into the Transformer network simultaneously; temporal knowledge is encoded in a memory-efficient manner and then decoded to accurately adjust the similarity map, yielding a finer feature description.
The backbone network for feature extraction is a CNN shared by the two branches: the target template and the search area are passed through the same feature extraction network, namely AlexNet. The first three convolution layers of the AlexNet backbone are retained, while the fourth and fifth convolution layers are replaced with online temporally adaptive convolutions to enhance the spatial features; this process calibrates the convolution weights of the next frame according to those of the previous frame. The feature extraction network finally yields two feature maps: the target image feature map and the search-area feature map;
to increase the operation speed, we replace the normal convolution with the on-line time sequence adaptive convolution, first define the t frameThe image is X t Is X-after convolution calculation t The formula is as follows:
X^ t =W*X t +b
wherein, operator represents convolution operation, W t 、b t Is the time weight and offset of the convolution, in the online convolution layer, the parameter is a parameter (W t 、b t ) And convolution operator, the calibration factors being different for each frame, i.eAnd->
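A minimal 1-D sketch of a per-frame calibrated convolution follows; the multiplicative calibration form W_t = α_t·W, b_t = β_t·b is an illustrative assumption, since the patent only states that the calibration factors differ for each frame:

```python
import numpy as np

def calibrated_conv1d(x_t, w_base, b_base, alpha_t, beta_t):
    """Valid 1-D convolution whose weight and bias are the shared base
    parameters scaled by per-frame calibration factors alpha_t, beta_t."""
    w_t = alpha_t * w_base          # temporal weight W_t
    b_t = beta_t * b_base           # temporal bias b_t
    n, k = len(x_t), len(w_t)
    return np.array([np.dot(x_t[i:i + k], w_t) + b_t
                     for i in range(n - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = calibrated_conv1d(x, np.array([1.0, 1.0]), 0.5, alpha_t=2.0, beta_t=1.0)
# w_t = [2, 2], b_t = 0.5 -> [6.5, 10.5, 14.5]
```

Only the scalar calibration factors change between frames, so the shared base kernel is learned once offline while the per-frame adaptation stays cheap at inference time.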
The two-branch feature map information output by the feature extraction network is simultaneously input into the Transformer network; fixed-size temporal prior knowledge is designed, old knowledge is continually extracted and merged into new knowledge, and information filtering yields the feature map of the current frame. Multi-head attention, the most basic component of the Transformer, is given below; six heads are used here:

MultiHead(Q, K, V) = Concat(head_1, …, head_6) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) and Attention(Q, K, V) = softmax(Q K^T / √d_k) V
for the sake of clarity, we define the timing knowledge of the t-1 frame asThe current frame is F t Intermediate result->Can be expressed as:
thus the output of the information filterThe method comprises the following steps:
wherein F represents a convolution layer.
The temporal knowledge $\tilde{F}_t$ of the final current frame can be expressed as:
for t=1, an independent convolution is used for the initialization operation.
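The standard multi-head attention referred to above can be sketched in NumPy as follows; the learned Q/K/V/output projection matrices are omitted for brevity, so this is a simplified illustration rather than the patent's exact network:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads=6):
    """Multi-head scaled dot-product attention without learned projections.
    q, k, v: (seq_len, d_model) with d_model divisible by num_heads."""
    n, d = q.shape
    dh = d // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(dh)  # (n, n) per head
        heads.append(softmax(scores) @ v[:, sl])      # weighted sum of values
    return np.concatenate(heads, axis=-1)             # concatenate the heads

rng = np.random.default_rng(1)
q = rng.standard_normal((5, 12))
k = rng.standard_normal((5, 12))
v = rng.standard_normal((5, 12))
out = multi_head_attention(q, k, v, num_heads=6)
```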
Step 3, adding an attention module. The attention module comprises an adaptive average pooling layer that reduces the input features to a global average; the reduced features are then mapped into context weights through a series of convolution layers and activation functions, and the context weights are multiplied with the input features to obtain a weighted feature representation. This attention mechanism combines the advantage of efficient global-context modeling with lightweight computation, reducing the interference of unnecessary features on the calculation. The attention mechanism is expressed as follows:

$$z = x \cdot \delta\!\left(\sum_{j=1}^{N_p} \alpha_j x_j\right), \qquad \alpha_j = \frac{e^{W_k x_j}}{\sum_{m=1}^{N_p} e^{W_k x_m}}$$
where $\alpha_j$ is the weight of the global attention pooling and $\delta(\cdot) = W_{v2}\,\mathrm{ReLU}(\mathrm{LN}(W_{v1}(\cdot)))$ represents a bottleneck transformation. This attention module is lightweight and can better capture long-range dependencies without increasing the computational cost.
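A minimal NumPy sketch of this global-context style attention: softmax attention pooling over all spatial positions, a bottleneck transform with a simplified LayerNorm and ReLU, then multiplication with the input features. The weight shapes and the simplified LayerNorm are illustrative assumptions:

```python
import numpy as np

def global_context_attention(x, w_k, w_v1, w_v2, eps=1e-5):
    """Global-context attention sketch.
    x: (C, H, W); w_k: (C,); w_v1: (C_mid, C); w_v2: (C, C_mid)."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)                  # (C, N) spatial positions
    logits = w_k @ flat                         # (N,) pooling logits
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                        # softmax pooling weights
    context = flat @ alpha                      # (C,) pooled global context
    z = w_v1 @ context                          # bottleneck down-projection
    z = (z - z.mean()) / (z.std() + eps)        # simplified LayerNorm
    z = np.maximum(z, 0.0)                      # ReLU
    gain = w_v2 @ z                             # (C,) per-channel context weights
    return x * gain[:, None, None]              # weighted feature representation

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 6, 6))
y = global_context_attention(x, rng.standard_normal(8),
                             rng.standard_normal((2, 8)),
                             rng.standard_normal((8, 2)))
```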
Step 4, inputting sample pictures into the network for training; the training process is performed offline. The template pictures are 127x127 and the search pictures are 511x511, and the network backbone is pre-trained on ImageNet-1k. For the first 5 epochs the parameters of the backbone are frozen, and over the remaining epochs the learning rate is reduced from 0.005 to 0.0005, for a total of 10 training epochs. SGD is used as the optimizer with a momentum of 0.9 and a mini-batch size of 124; the input sizes of the template and search region are 127x127 and 511x511, respectively.
Step 5, adding a bounding box refinement module. The bounding box refinement module first needs template information as input for initialization; its search region comes from the local region cropped around the current bounding box output by the tracker, about 2 times the target size. The feature fusion operation plays a key role in the bounding box refinement module, and point-by-point convolution is adopted in the module for feature fusion. Assume the template feature map is expressed as $K \in \mathbb{R}^{C \times H_0 \times W_0}$ and the search region feature map as $S \in \mathbb{R}^{C \times H \times W}$. K is first decomposed into $H_0 \times W_0$ small feature maps $K_j \in \mathbb{R}^{C \times 1 \times 1}$, each of which then performs an ordinary correlation with S, yielding a fused feature map of size $(H_0 W_0) \times H \times W$. This process can be described as

$$P_j = K_j \star S, \quad j = 1, \ldots, H_0 W_0$$
where $\star$ represents the ordinary correlation operation. This point-by-point convolution avoids the spatial ambiguity caused by cross-correlating the whole template feature map with the search region feature map via a sliding window. The refined bounding box predicted and output by this module is taken as the final output result.
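The point-by-point (pixel-wise) correlation described above can be sketched as follows; since each $K_j$ is a C-dimensional vector, the whole operation reduces to a single matrix multiply:

```python
import numpy as np

def pointwise_correlation(template, search):
    """Point-by-point correlation: the template feature map K (C, H0, W0)
    is split into H0*W0 vectors K_j (C, 1, 1), each correlated with the
    search features S (C, H, W); output has shape (H0*W0, H, W)."""
    c, h0, w0 = template.shape
    _, h, w = search.shape
    k = template.reshape(c, h0 * w0)    # column j is K_j
    s = search.reshape(c, h * w)
    fused = k.T @ s                     # channel-wise dot products
    return fused.reshape(h0 * w0, h, w)

rng = np.random.default_rng(3)
template = rng.standard_normal((4, 2, 2))
search = rng.standard_normal((4, 5, 5))
fused = pointwise_correlation(template, search)
```

Each output channel j is the response of one template location over the whole search region, which is why this avoids the spatial blurring of whole-template cross-correlation.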
Step 6, model testing; the dataset provided by the UAV123 official benchmark is adopted, and the training effect of the method is tested according to the UAV123 evaluation metrics. As can be seen from fig. 6, the single-target tracking algorithm provided by the invention performs better than the original baseline algorithm.
As shown in fig. 2, in some embodiments, the basic network framework is a time series information twin network framework, including a twin network for feature extraction, a Transformer-based similarity graph refinement network, and a classification regression sub-network for classification and regression of target locations.
Feature map extraction network: the target template and the search region are put into the same feature extraction network for feature extraction, namely AlexNet; the first three convolutional layers of AlexNet are retained, and the fourth and fifth convolutional layers are replaced with online temporally adaptive convolutions; the feature extraction network finally yields two feature maps, namely a target image feature map and a search region feature map;
Transformer-based similarity map refinement network: the two-branch feature map information output by the feature extraction network is fed into the Transformer network simultaneously. First, a fixed-size temporal prior knowledge is designed and encoded in a memory-efficient manner; the prior knowledge is then decoded to accurately adjust the similarity map, and information filtering yields the feature map of the current frame. As the most basic component of the Transformer, the multi-head attention formula is shown below, where we use 6 heads:
for clarity, we define the temporal knowledge of frame t-1 as $\tilde{F}_{t-1}$ and the feature of the current frame as $F_t$; the intermediate result $F'_t$ can be expressed as:
Thus the output $\bar{F}_t$ of the information filter is:
wherein F represents a convolution layer.
The temporal knowledge $\tilde{F}_t$ of the final current frame can be expressed as:
for t=1, an independent convolution is used for the initialization operation.
Classification regression network: first, each position (i, j) in the feature map can be mapped back to a position (x, y) in the search region, and the response map yields a classification branch and a regression branch through convolution. The classification branch produces a classification feature map $A^{cls} \in \mathbb{R}^{H \times W \times 2}$ and a centerness feature map $A^{cen} \in \mathbb{R}^{H \times W \times 1}$. The classification feature map predicts the class of each position: each point (i, j, :) contains a 2D vector representing the foreground and background scores, respectively. Alongside the classification feature map there is also the centerness feature map, which assigns each pixel a centerness score (positions with high scores are near the centre) and can be used to remove outliers, since positions far from the centre often produce low-quality predicted bounding boxes;
The regression branch of the classification regression network outputs a regression feature map $A^{reg} \in \mathbb{R}^{H \times W \times 4}$. Each point (i, j, :) of the regression feature map contains a 4D vector t(i, j) = (l, t, r, b) representing the distances from the corresponding location to the four sides of the bounding box in the input search region. Let $(x_0, y_0)$ and $(x_1, y_1)$ denote the top-left and bottom-right corners of the ground-truth bounding box, and (x, y) the position corresponding to point (i, j). The regression target $\tilde{t}(i, j)$ at a point on the regression feature map can be calculated by the following formula:

$$l = x - x_0, \quad t = y - y_0, \quad r = x_1 - x, \quad b = y_1 - y$$

where l, t, r, b respectively represent the distances from the point on the regression feature map to the four sides of the bounding box.
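The regression targets (l, t, r, b), together with an FCOS-style centerness score (an assumption consistent with the centre map described, since the text does not state the exact scoring rule), can be computed as:

```python
import numpy as np

def regression_targets(xy, box):
    """Distances (l, t, r, b) from location (x, y) to the four sides of
    the ground-truth box (x0, y0, x1, y1)."""
    x, y = xy
    x0, y0, x1, y1 = box
    return np.array([x - x0, y - y0, x1 - x, y1 - y])

def centerness(ltrb):
    """FCOS-style centerness: 1.0 at the box centre, lower further away."""
    l, t, r, b = ltrb
    return np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

ltrb = regression_targets((5.0, 5.0), (2.0, 3.0, 10.0, 9.0))  # -> l=3, t=2, r=5, b=4
score = centerness(ltrb)
```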
Training the whole network in an end-to-end manner; the classification loss is $L_{cls}$, the bounding box regression loss is $L_{reg}$, and the centerness loss is $L_{cen}$; the three components are weighted together with the corresponding weights as the overall weighted loss function;
$$L = L_{cls} + \lambda_1 L_{cen} + \lambda_2 L_{reg}$$
in the above formula, the classification term adopts cross-entropy loss, the regression term adopts IoU loss, and the centerness loss is also included;
according to the calculated gradient of the loss function L, the SGD optimizer is used to update the network parameters so that the overall network loss decreases until convergence; the whole training then ends, yielding the trained network weights for attention-based single-target tracking.
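A toy sketch of the loss composition, with cross-entropy for classification, IoU loss for regression, and the weighted sum $L = L_{cls} + \lambda_1 L_{cen} + \lambda_2 L_{reg}$; the λ values shown are placeholders, not the patent's choices:

```python
import numpy as np

def cross_entropy(p_fg, is_fg):
    """Binary cross entropy on the foreground probability of one location."""
    p = p_fg if is_fg else 1.0 - p_fg
    return -np.log(max(p, 1e-12))

def iou_loss(pred, gt):
    """IoU loss for two (l, t, r, b) distance vectors sharing one anchor
    point: -log(IoU), zero when the boxes coincide."""
    pl, pt, pr, pb = pred
    gl, gt_, gr, gb = gt
    inter = (min(pl, gl) + min(pr, gr)) * (min(pt, gt_) + min(pb, gb))
    union = (pl + pr) * (pt + pb) + (gl + gr) * (gt_ + gb) - inter
    return -np.log(max(inter / union, 1e-12))

def total_loss(l_cls, l_cen, l_reg, lam1=1.0, lam2=1.0):
    """Weighted sum L = L_cls + lambda1 * L_cen + lambda2 * L_reg."""
    return l_cls + lam1 * l_cen + lam2 * l_reg
```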
As shown in fig. 3, in some embodiments, the feature extraction network introduces an attention mechanism. The attention module comprises an adaptive average pooling layer that reduces the input features to a global average; the reduced features are then mapped into context weights through a series of convolution layers and activation functions, and finally the context weights are multiplied with the input features to obtain a weighted feature representation. This attention mechanism combines the advantage of efficient global-context modeling with lightweight computation, reducing the interference of unnecessary features on the calculation. The attention mechanism is expressed as follows:

$$z = x \cdot \delta\!\left(\sum_{j=1}^{N_p} \alpha_j x_j\right), \qquad \alpha_j = \frac{e^{W_k x_j}}{\sum_{m=1}^{N_p} e^{W_k x_m}}$$
where $\alpha_j$ is the weight of the global attention pooling and $\delta(\cdot) = W_{v2}\,\mathrm{ReLU}(\mathrm{LN}(W_{v1}(\cdot)))$ represents a bottleneck transformation. This attention module is lightweight and can better capture long-range dependencies without increasing the computational cost.
As shown in fig. 4, in some embodiments, the model training includes the steps of:
firstly, samples are obtained and input into the network model for training. A sample comprises a target image and a search image: the target image is the 127x127 image of the tracked object, and the search image is the 511x511 image in which tracking is performed. The backbone network is a CNN structure shared by two branches, one branch taking the target template Z as input and the other taking the search region X as input. The target template and the search region are both put into the same feature extraction network, namely AlexNet; the first three convolutional layers of AlexNet are retained, and the fourth and fifth convolutional layers are replaced with online temporally adaptive convolutions. The feature extraction network finally yields two feature maps, namely a target image feature map and a search region feature map;
for generating the target feature map, the target is first selected and preprocessed; cropping is then performed with the crop box centred on the target, the centre of the preprocessed picture being directly taken as the centre, giving the coordinates of a crop box of size 127. Operations such as random scaling, random translation by several pixels, and random flipping are then applied to the crop box, and finally an affine transformation yields a picture with the target at its centre;
similar to the generation of the target feature map, the search region feature map is generated, and classification regression is performed on the feature maps, where classification distinguishes whether an anchor box is foreground or background, and regression gives the deviation between the anchor box position and the actual position, used to correct the position from the previous frame.
The network backbone is pre-trained on ImageNet. The same initialization is used for the online temporally adaptive convolutional layers, and the similarity map refinement network is randomly initialized. For the first 5 epochs the parameters of the backbone are frozen, and over the remaining epochs the learning rate is reduced from 0.005 to 0.0005. SGD is adopted as the optimizer with a momentum of 0.9 and a mini-batch size of 124; the input sizes of the template and search region are 127x127 and 511x511, respectively.
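The freeze-then-decay schedule described above can be sketched as a small helper; the exponential decay shape is an assumption, since the text only gives the start and end learning rates:

```python
def schedule(epoch, total_epochs=10, freeze_epochs=5,
             lr_start=0.005, lr_end=0.0005):
    """Backbone frozen for the first `freeze_epochs` epochs; learning rate
    decayed exponentially from lr_start to lr_end over the run
    (the decay shape is an assumption, not stated in the text)."""
    backbone_frozen = epoch < freeze_epochs
    span = max(total_epochs - 1, 1)
    lr = lr_start * (lr_end / lr_start) ** (epoch / span)
    return backbone_frozen, lr

# One entry per epoch: (backbone_frozen, learning_rate).
plan = [schedule(e) for e in range(10)]
```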
As shown in fig. 5, in some embodiments, the network incorporates a bounding box refinement module. The module first needs template information as input for initialization; its search region comes from the local region cropped around the current bounding box output by the tracker, about 2 times the target size. The feature fusion operation plays a key role in the bounding box refinement module, and point-by-point convolution is adopted in the module for feature fusion. Assume the template feature map is expressed as $K \in \mathbb{R}^{C \times H_0 \times W_0}$ and the search region feature map as $S \in \mathbb{R}^{C \times H \times W}$. K is first decomposed into $H_0 \times W_0$ small feature maps $K_j \in \mathbb{R}^{C \times 1 \times 1}$, each of which then performs an ordinary correlation with S, yielding a fused feature map of size $(H_0 W_0) \times H \times W$. This process can be described as:

$$P_j = K_j \star S, \quad j = 1, \ldots, H_0 W_0$$
where $\star$ represents the ordinary correlation operation. This point-by-point convolution avoids the spatial ambiguity caused by cross-correlating the whole template feature map with the search region feature map via a sliding window. The refined bounding box predicted and output by this module is taken as the final output result.
As shown in fig. 6, in some embodiments, the model test includes the steps of: testing the tracking effect of the trained model in a new video sequence;
in single-target tracking, as shown in fig. 7, a rectangular box containing the centre position and size of the target to be tracked is typically given in the first frame; this box is usually manually annotated. The tracking algorithm is then required to follow this box in subsequent frames, computing the target's position offset and size change in each frame. The test results on the UAV123 dataset are presented on the video sequence for a direct visual impression.
Finally, it should be noted that the foregoing description covers only preferred embodiments of the present invention, and the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described therein or make equivalent replacements for some of the technical features. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (7)
1. A single target tracking method based on an attention mechanism is characterized in that: the method comprises the following steps:
step 1, data preprocessing, namely providing data preparation for subsequent network model training;
step 2, constructing a twin network frame based on time sequence information;
step 3, adding an attention module;
step 4, training the model, namely training the constructed network model, and finally obtaining the network weights of the attention-based single-target network architecture;
step 5, adding a bounding box thinning module;
and 6, testing a model, namely testing the effect of tracking the target by using the network weight obtained through training in a new video sequence.
2. The attention-based single-object tracking method of claim 1, wherein: in the step 1, a data preprocessing operation cuts each video picture in each dataset to a fixed size and places it in a newly generated folder, the folder containing the cropped template and search-region sample pictures used for training, the size of the template picture Z being 127x127 and the size of the search-region picture X being 511x511.
3. The attention-based single-object tracking method of claim 1, wherein: the step 2 includes a twin network for feature extraction, a Transformer-based similarity map refinement network, and a classification regression sub-network for classification and regression of target positions.
4. The attention-based single-object tracking method of claim 1, wherein: the attention module added in the step 3 comprises an adaptive average pooling layer that reduces the input features to a global average; the reduced features are then mapped into context weights through a series of convolution layers and activation functions, and finally the context weights are multiplied with the input features to obtain a weighted feature representation; the attention mechanism has the advantage of effectively modeling the global context with lightweight computation, reducing the interference of unnecessary features on the calculation; the formula of the attention mechanism is expressed as follows:

$$z = x \cdot \delta\!\left(\sum_{j=1}^{N_p} \alpha_j x_j\right), \qquad \alpha_j = \frac{e^{W_k x_j}}{\sum_{m=1}^{N_p} e^{W_k x_m}}$$
wherein $\alpha_j$ is the weight of the global attention pooling and $\delta(\cdot) = W_{v2}\,\mathrm{ReLU}(\mathrm{LN}(W_{v1}(\cdot)))$ represents a bottleneck transformation; this attention module is lightweight, and long-range dependencies can be better captured without increasing the computational cost.
5. The method of claim 1, wherein the step 4 model training comprises the steps of:
S1, inputting sample pictures into the twin network for training, the training process being performed offline, and 5 datasets, COCO, ImageNet-VID 2015, GOT-10K, LaSOT and YouTube-BB, being adopted for training;
S2, the twin network is used for measuring the similarity of the input samples: a sample comprises a target image and a search image, the target image being the 127x127 image of the tracked object and the search image being the 511x511 image in which the target is tracked; the twin neural network has two input branches, one taking the target template Z as input and the other taking the search region X as input; the two inputs are sent into two weight-sharing neural networks, which map the inputs into a new space respectively to form representations of the inputs in the new space;
S3, feature map extraction network: the target template and the search region are placed in the same feature extraction network, namely AlexNet; the first three convolutional layers of AlexNet are retained, and the fourth and fifth convolutional layers are replaced with online temporally adaptive convolutions; the feature extraction network finally yields two feature maps, namely a target image feature map and a search region feature map;
S4, Transformer-based similarity map refinement network: the two-branch feature map information output by the feature extraction network is fed into the Transformer network simultaneously; first, a fixed-size temporal prior knowledge is designed and encoded in a memory-efficient manner; the prior knowledge is then decoded to accurately adjust the similarity map, and information filtering yields the feature map of the current frame; as the most basic component of the Transformer, the multi-head attention formula is shown below, 6 heads being used herein:
for clarity, we define the temporal knowledge of frame t-1 as $\tilde{F}_{t-1}$ and the feature of the current frame as $F_t$; the intermediate result $F'_t$ can be expressed as:
Thus the output $\bar{F}_t$ of the information filter is:
wherein F represents a convolution layer;
the temporal knowledge $\tilde{F}_t$ of the final current frame can be expressed as:
for t=1, an independent convolution is used for the initialization operation;
S5, classification regression network: first, each position (i, j) in the feature map can be mapped back to a position (x, y) in the search region, and the response map yields a classification branch and a regression branch through convolution; the classification branch produces a classification feature map $A^{cls} \in \mathbb{R}^{H \times W \times 2}$ and a centerness feature map $A^{cen} \in \mathbb{R}^{H \times W \times 1}$; the classification feature map predicts the class of each position, each point (i, j, :) containing a 2D vector that represents the foreground and background scores, respectively; alongside the classification feature map there is also the centerness feature map, which assigns each pixel a centerness score (positions with high scores are near the centre) and can be used to remove outliers, since positions far from the centre often produce low-quality predicted bounding boxes;
the regression branch of the classification regression network outputs a regression feature map $A^{reg} \in \mathbb{R}^{H \times W \times 4}$; each point (i, j, :) of the regression feature map contains a 4D vector t(i, j) = (l, t, r, b) representing the distances from the corresponding location to the four sides of the bounding box in the input search region; let $(x_0, y_0)$ and $(x_1, y_1)$ denote the top-left and bottom-right corners of the ground-truth bounding box and (x, y) the position corresponding to point (i, j); the regression target $\tilde{t}(i, j)$ at a point on the regression feature map can be calculated by the following formula:

$$l = x - x_0, \quad t = y - y_0, \quad r = x_1 - x, \quad b = y_1 - y$$

where l, t, r, b respectively represent the distances from the point on the regression feature map to the four sides of the bounding box.
Training the whole network in an end-to-end manner; the classification loss is $L_{cls}$, the bounding box regression loss is $L_{reg}$, and the centerness loss is $L_{cen}$; the three components are weighted together with the corresponding weights as the overall weighted loss function;
$$L = L_{cls} + \lambda_1 L_{cen} + \lambda_2 L_{reg}$$
in the above formula, the classification term adopts cross-entropy loss, the regression term adopts IoU loss, and the centerness loss is also included;
according to the calculated gradient of the loss function L, the SGD optimizer is used to update the network parameters so that the overall network loss decreases until convergence; the whole training then ends, yielding the trained network weights for attention-based single-target tracking;
S6, pre-training: the network backbone is pre-trained on ImageNet; the same initialization is used for the online temporally adaptive convolutional layers, and the similarity map refinement network is randomly initialized; for the first 5 epochs the parameters of the backbone are frozen, and over the remaining epochs the learning rate is reduced from 0.005 to 0.0005; SGD is adopted as the optimizer with a momentum of 0.9 and a mini-batch size of 124; the input sizes of the template and search region are 127x127 and 511x511, respectively.
6. The attention-based single-object tracking method of claim 1, wherein: in the step 5, a bounding box refinement module is added; the bounding box refinement module needs template information as input for initialization; the search region of the bounding box refinement module comes from the local region cropped around the current bounding box output by the tracker, about 2 times the target size; the feature fusion operation plays a key role in the bounding box refinement module, and point-by-point convolution is adopted in the module for feature fusion; assuming the template feature map is expressed as $K \in \mathbb{R}^{C \times H_0 \times W_0}$ and the search region feature map as $S \in \mathbb{R}^{C \times H \times W}$, K is first decomposed into $H_0 \times W_0$ small feature maps $K_j \in \mathbb{R}^{C \times 1 \times 1}$, each of which then performs an ordinary correlation with S, yielding a fused feature map of size $(H_0 W_0) \times H \times W$; this process can be described as

$$P_j = K_j \star S, \quad j = 1, \ldots, H_0 W_0$$
where $\star$ represents the ordinary correlation operation; this point-by-point convolution avoids the spatial ambiguity caused by cross-correlating the whole template feature map with the search region feature map via a sliding window; the refined bounding box predicted and output by this module is taken as the final output result, and the tracker is simultaneously updated with this result.
7. The method for single-object tracking based on the attention mechanism as recited in claim 1, wherein said step 6 model test comprises the steps of:
S1, testing the tracking effect in a new, previously unseen video sequence using the trained weight parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311360574.5A CN117576149A (en) | 2023-10-19 | 2023-10-19 | Single-target tracking method based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117576149A true CN117576149A (en) | 2024-02-20 |
Family
ID=89885202
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311360574.5A Pending CN117576149A (en) | 2023-10-19 | 2023-10-19 | Single-target tracking method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117576149A (en) |
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN117974722A (en) * | 2024-04-02 | 2024-05-03 | 江西师范大学 | Single-target tracking system and method based on attention mechanism and improved Transformer
CN117974722B (en) * | 2024-04-02 | 2024-06-11 | 江西师范大学 | Single-target tracking system and method based on attention mechanism and improved Transformer
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||