CN116630850A - Twin target tracking method based on multi-attention task fusion and bounding box coding - Google Patents

Twin target tracking method based on multi-attention task fusion and bounding box coding

Info

Publication number
CN116630850A
Authority
CN
China
Prior art keywords
attention
network
template
features
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310555213.XA
Other languages
Chinese (zh)
Inventor
胡昭华
刘浩男
林潇
王莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202310555213.XA priority Critical patent/CN116630850A/en
Publication of CN116630850A publication Critical patent/CN116630850A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/48 Matching video sequences

Abstract

The invention discloses a twin target tracking method based on multi-attention task fusion and bounding box coding. A twin target tracking network is first constructed; a first multi-attention fusion module performs channel attention and spatial attention enhancement on the template feature extraction branch, and a second multi-attention fusion module performs channel attention and spatial attention enhancement on the search feature extraction branch. The output features then enter a cross-correlation matching network, where they are fused with the bounding box encoding features, and the fused features are input into a classification regression network to obtain a classification score map and a regression prediction map. According to the invention, the feature extraction network is divided into a template feature extraction branch and a search feature extraction branch, and a multi-attention fusion module is added to the feature extraction network to perform channel attention and spatial attention preprocessing, so that problems such as occlusion, the target leaving the field of view, motion blur, background clutter and scale variation can be handled well, and good tracking performance is achieved.

Description

Twin target tracking method based on multi-attention task fusion and bounding box coding
Technical Field
The invention relates to the field of target tracking, in particular to a twin target tracking method based on multi-attention task fusion and bounding box coding.
Background
Target tracking is a fundamental and challenging task in computer vision and has been one of the most active research topics in the field for decades. The task of target tracking is defined as follows: given only the position of the tracked target in the initial frame of a video sequence, the tracker must keep locating the target accurately in every subsequent frame. Target tracking has wide applications in fields such as autonomous driving, video surveillance, marine exploration and medical imaging, and has therefore attracted attention from both academia and industry. Target tracking can be divided into two main branches: target tracking based on correlation filtering, and target tracking based on deep learning.
With the rapid development of deep learning, target tracking models based on deep learning have become an important means of further improving tracking accuracy. Current research shows that among deep-learning-based trackers, those based on twin (Siamese) networks achieve a good balance between tracking accuracy and inference speed.
As a representative twin target tracking algorithm, SiamFC (Bertinetto L, Valmadre J, Henriques J F, et al. Fully-convolutional siamese networks for object tracking[C]//European Conference on Computer Vision. Springer, Cham, 2016: 850-865.) achieves a good balance between tracking speed and accuracy by designing a compact and well-structured tracking network. SiamRPN++ (Li B, Wu W, Wang Q, et al. SiamRPN++: Evolution of siamese visual tracking with very deep networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 4282-4291.) introduces the deep neural network ResNet50 into the twin tracking network, greatly improving tracking performance. However, most target tracking models, including the algorithms above, usually adopt pre-trained weight parameters for the feature extraction network during training, yet the pre-trained deep neural network does not actually fit the task definition of target tracking. Target tracking differs from other visual tasks such as image classification, target detection and instance segmentation: in classification and detection, the target classes for training and testing are predefined and the tasks focus on class prediction, whereas target tracking must handle targets of arbitrary classes during tracking and focuses on predicting the position of the foreground target. In addition, a pre-trained deep neural network is biased toward inter-class differences, and the features it extracts may not be sensitive to changes within the target class, so adopting a pre-trained feature extraction network limits the discrimination capability of the tracking model to a certain extent.
Mainstream twin tracking algorithms generally use the given bounding box information of the template frame in one of two ways: one is to perform conventional center cropping of the input image, and the other is to use the bounding box coordinates for ROI (Region Of Interest) mapping or to extract an image pixel mask. Both approaches use the existing bounding box coordinates to extract and manipulate image-level information, but they tend to ignore the value of the bounding box coordinates themselves as unstructured data.
The patent with application number 2022113443642 discloses a target tracking method and system based on a cross-correlation matching enhanced twin network, but this method likewise cannot solve the technical problem that the discrimination capability of the tracking model is limited.
Disclosure of Invention
The invention aims to solve the following technical problems: the pre-trained feature extraction network in existing tracking algorithms does not fully meet the definition of the tracking task, and the prior information available to the network cannot be fully utilized. To this end, the invention provides a twin target tracking method based on multi-attention task fusion and bounding box coding. Through the multi-attention task fusion operation and the bounding box coding, the tracking model can deal well with problems such as occlusion, the target leaving the field of view, motion blur, background clutter and scale variation, the discrimination capability of the tracking model is improved, good tracking performance is achieved, and good robustness can be maintained during long-term tracking.
The invention adopts the following technical scheme for solving the technical problems:
the invention provides a twin target tracking method based on multi-attention task fusion and bounding box coding, which comprises the following steps:
s1, constructing a twin target tracking network, wherein the twin target tracking network comprises a feature extraction network, a cross-correlation matching network, a classification regression network, a multi-attention task fusion module and a boundary box coding module;
the feature extraction network comprises a template feature extraction branch, a search feature extraction branch, a first multi-attention fusion module and a second multi-attention fusion module; the first multi-attention fusion module performs channel attention and space attention enhancement operation on the template feature extraction branches to obtain enhanced template branch features; the second multi-attention fusion module performs channel attention and space attention enhancement operation on the search feature extraction branch to obtain enhanced search branch features; the first multi-attention fusion module and the second multi-attention fusion module are identical in structure, and the expressions are as follows:
F_MA = F_CA · F_SA = CA(F_I) · DC(EC(F_CA))
where F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement; F_SA ∈ R^(C×H×W) denotes the feature map after spatial attention enhancement; CA(·) denotes the channel attention enhancement operation; DC(·) denotes the decoding (up-sampling) operation; EC(·) denotes the encoding (down-sampling) operation; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module;
the boundary frame coding module codes boundary frame information of the template image to obtain boundary frame coding characteristics;
the cross-correlation matching network carries out cross-correlation matching on the template features output by the first multi-attention fusion module and the search features output by the second multi-attention fusion module to obtain cross-correlation features, then fuses the cross-correlation features with the boundary frame coding features, inputs the fused features into the classification regression network, carries out convolution layer calculation, and obtains a classification score graph and a regression prediction graph; the regression prediction graph is used for predicting offset distances between the center position of the target and four edges of the regression frame; the classification score map is a prediction target foreground score, and each score corresponds to the offset distance of four sides in the regression prediction map;
s2, randomly extracting pairs of template images and search images from the training sample set by the feature extraction network in an offline mode, carrying out gradient back-propagation using the stochastic gradient descent (SGD) method with momentum, and optimizing the network parameters until the joint task loss function converges;
s3, performing online tracking test, cutting a first frame image of a test video sequence, inputting the first frame image as a template image into a constructed and trained twin target tracking network, sending the template image into a first multi-attention fusion module through a template feature extraction branch to obtain enhanced template features, and meanwhile, performing boundary frame information coding on the template image through a boundary frame coding module to obtain boundary frame coding features; cutting each subsequent frame of the test video sequence by using an area which is four times that of the template image, taking the center of the cut area as a predicted target center point of the previous frame, and sending the cut area as a search image into a constructed and trained twin target tracking network to obtain search features; and sending the obtained search features and template features to a cross-correlation matching network to be fused with the boundary frame coding features, inputting the fused features to a classification regression network to obtain a classification score graph and a regression prediction graph, and obtaining the final position of the target on the video sequence frame according to the position with the maximum response value in the classification score graph and the offset of the regression prediction graph.
Further, the template feature extraction branch and the search feature extraction branch are both modified from the original ResNet50 deep neural network: the strides of the third and fourth layers of the ResNet50 deep neural network are set to 1, the dilation of the dilated convolution is set to 4, and the fifth convolutional layer of ResNet50 is removed.
Further, the size of the template image is 127×127×3, and the size of the search image is 255×255×3.
Further, the channel attention enhancement operation comprises the sub-steps of:
inputting a template feature map and a search feature map, and carrying out global average pooling operation along a channel to obtain a template feature map and a search feature map with unchanged channel number and compressed size; respectively carrying out two nonlinear convolution transformations of dimension reduction and dimension increase on the obtained template feature diagram and the search feature diagram; the template feature map and the search feature map after dimension increase are subjected to activation function operation to obtain feature maps representing the weights of all channels, and the channel weight feature maps are multiplied with the input feature map to obtain feature maps subjected to channel attention enhancement, wherein the calculation formula is as follows:
F_sd = GAP(F)
F_ac = Conv_up(Conv_down(F_sd))
F_CA = σ(F_ac) · F
where GAP(·) denotes the global average pooling operation; F_sd ∈ R^(C×1×1) denotes the feature map after global average pooling along the channel direction; Conv_down(·) denotes the convolutional dimension-reduction operation; Conv_up(·) denotes the convolutional dimension-increase operation; the reduction/expansion ratio r is set to 16; σ denotes the Sigmoid activation function; F_ac ∈ R^(C×1×1) denotes the feature map after dimension increase; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module; F_CA denotes the feature map after channel attention enhancement.
Further, the spatial attention enhancing operation comprises the sub-steps of:
the extracted template feature map and the search feature map are used as input features to carry out spatial feature transformation through a multi-layer coding and decoding network structure to obtain a feature map representing spatial position weight, wherein the coding structure is composed of a plurality of downsampling convolution layers, the decoding structure is composed of a plurality of upsampling deconvolution layers, the features are restored to be single-channel feature maps at the last layer of the upsampling deconvolution layers, the single-channel feature maps are used as spatial weight response feature maps, and the calculation formula is as follows:
F_SA = DC(EC(F_CA))
where EC(·) denotes the encoding (down-sampling) operation; DC(·) denotes the decoding (up-sampling) operation; F_SA ∈ R^(1×H×W) denotes the spatial weight response feature map finally obtained by the encoding and decoding operations; F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement.
Further, the process of obtaining the boundary box coding feature is as follows:
converting the bounding box coordinates of the template frame into a one-dimensional feature vector B = (x, y, w, h), where (x, y) are the corner coordinates of the target bounding box, w is the width of the target bounding box and h is the height of the target bounding box; the feature vector B is passed through multiple fully connected layers to obtain the output feature map B_C:
B_C = f_C(B)
where f_C denotes the fully connected layer structure and B_C denotes the output feature of the feature vector B after the fully connected layers;
the feature map output by the fully connected layers and the feature map F_C ∈ R^(C×H×W) output by the cross-correlation network undergo a broadcast addition operation:
F_b = F_C + B_C
where F_b ∈ R^(C×H×W);
the final bounding box encoding output is obtained by a convolutional encoding of dimension 1×C:
F_BM = f_g(F_b)
where f_g denotes a convolutional encoding operation of dimension 1×C; F_BM ∈ R^(C×H×W) denotes the final bounding box encoding output.
Further, the size of the classification score map is 25×25×1, and the size of the regression prediction map is 25×25×4.
Further, the calculation formula of the joint task loss function is as follows:
L = λ_1·L_cls + λ_2·L_reg
where L_cls is the binary cross-entropy loss function, L_reg is the IoU loss function, and λ_1 and λ_2 are the weights of L_cls and L_reg, respectively.
Compared with the prior art, the invention adopts the technical proposal and has the following remarkable technical effects:
1. The twin target tracking method based on multi-attention task fusion and bounding box coding does not use the fifth-layer features of ResNet50, but only the features of the third and fourth layers of ResNet50; this greatly accelerates the training and inference speed of the twin neural network without noticeably affecting accuracy.
2. In the twin target tracking method based on multi-attention task fusion and bounding box coding, the proposed multi-attention task fusion module can enhance the target feature channels and adaptively adjust the importance of the relevant channels, thereby further improving the feature expression capability for the target; the spatial attention can enhance the spatial region where the target is located, and the overall multi-attention task fusion module enhances the tracking features in a more targeted way, which better improves tracking performance.
3. The twin target tracking method based on multi-attention task fusion and bounding box coding provided by the invention encodes the given ground-truth bounding box information of the target into the tracking model in order to make better use of the existing prior information, which further helps to enhance the response of the target foreground region.
Drawings
FIG. 1 is a network structure diagram corresponding to a twin object tracking method based on multi-attention task fusion and bounding box coding.
FIG. 2 is a block diagram of a channel attention module according to the present invention.
Fig. 3 is a block diagram of a spatial attention module according to the present invention.
FIG. 4 is a block diagram of a multi-attention task fusion module of the present invention.
Fig. 5 is an internal structural diagram of the bounding box encoding module of the present invention.
Fig. 6 is a schematic diagram of the performance evaluation results of the present invention on OTB100 and other mainstream tracking algorithms.
Fig. 7 is a schematic diagram of the visual results of the present invention on an OTB100 video sequence.
FIG. 8 is a graph showing the performance evaluation results of the present invention on GOT10K and other mainstream tracking algorithms.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
Fig. 1 is a network structure diagram corresponding to a twin target tracking method based on multi-attention task fusion and bounding box coding, and an embodiment of the invention discloses a twin target tracking method based on multi-attention task fusion and bounding box coding, which comprises the following steps:
s1, constructing a twin target tracking network, wherein the twin target tracking network comprises a feature extraction network, a cross-correlation matching network, a classification regression network, a multi-attention task fusion module and a boundary box coding module. The feature extraction network comprises a template feature extraction branch, a search feature extraction branch, a first multi-attention fusion module and a second multi-attention fusion module; the first multi-attention fusion module performs channel attention and space attention enhancement operation on the template feature extraction branches to obtain enhanced template branch features; the second multi-attention fusion module performs channel attention and space attention enhancement operation on the search feature extraction branch to obtain enhanced search branch features; the first multi-attention fusion module and the second multi-attention fusion module are identical in structure, and the expressions are as follows:
F_MA = F_CA · F_SA = CA(F_I) · DC(EC(F_CA)),
where F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement; F_SA ∈ R^(C×H×W) denotes the feature map after spatial attention enhancement; CA(·) denotes the channel attention enhancement operation; DC(·) denotes the decoding (up-sampling) operation; EC(·) denotes the encoding (down-sampling) operation; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module.
And the boundary frame coding module codes boundary frame information of the template image to obtain boundary frame coding characteristics.
The cross-correlation matching network carries out cross-correlation matching on the template features output by the first multi-attention fusion module and the search features output by the second multi-attention fusion module to obtain cross-correlation features, then fuses the cross-correlation features with the boundary frame coding features, inputs the fused features into the classification regression network, carries out convolution layer calculation, and obtains a classification score graph and a regression prediction graph; the regression prediction graph is used for predicting offset distances between the center position of the target and four edges of the regression frame; the classification score map is a prediction target foreground score, and each score corresponds to the offset distance of four sides in the regression prediction map.
S2, in an offline mode, the feature extraction network randomly extracts pairs of template images and search images from the training sample set, gradient back-propagation is carried out using the stochastic gradient descent (SGD) method with momentum, and the network parameters are optimized until the joint task loss function converges.
S3, performing online tracking test, cutting a first frame image of a test video sequence, inputting the first frame image as a template image into a constructed and trained twin target tracking network, sending the template image into a first multi-attention fusion module through a template feature extraction branch to obtain enhanced template features, and meanwhile, performing boundary frame information coding on the template image through a boundary frame coding module to obtain boundary frame coding features; cutting each subsequent frame of the test video sequence by using an area which is four times that of the template image, taking the center of the cut area as a predicted target center point of the previous frame, and sending the cut area as a search image into a constructed and trained twin target tracking network to obtain search features; and sending the obtained search features and template features to a cross-correlation matching network to be fused with the boundary frame coding features, inputting the fused features to a classification regression network to obtain a classification score graph and a regression prediction graph, and obtaining the final position of the target on the video sequence frame according to the position with the maximum response value in the classification score graph and the offset of the regression prediction graph.
1. Constructing a twin target tracking network
First, a twin target tracking network is constructed, the twin target tracking network comprising a feature extraction network, a cross-correlation matching network, a classification regression network, a multi-attention task fusion module, and a bounding box encoding module.
The feature extraction network comprises a template feature extraction branch, a search feature extraction branch, a first multi-attention fusion module and a second multi-attention fusion module; the first multi-attention fusion module performs channel attention and space attention enhancement operation on the template feature extraction branches to obtain enhanced template branch features; the second multi-attention fusion module performs channel attention and space attention enhancement operation on the search feature extraction branch to obtain enhanced search branch features; the first multi-attention fusion module and the second multi-attention fusion module have the same structure, and the expressions are:
F_MA = F_CA · F_SA = CA(F_I) · DC(EC(F_CA)),
where F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement; F_SA ∈ R^(C×H×W) denotes the feature map after spatial attention enhancement; CA(·) denotes the channel attention enhancement operation; DC(·) denotes the decoding (up-sampling) operation; EC(·) denotes the encoding (down-sampling) operation; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module.
And the boundary frame coding module codes boundary frame information of the template image to obtain boundary frame coding characteristics.
The cross-correlation matching network carries out cross-correlation matching on the template features output by the first multi-attention fusion module and the search features output by the second multi-attention fusion module to obtain cross-correlation features, then fuses the cross-correlation features with the boundary frame coding features, inputs the fused features into the classification regression network, carries out convolution layer calculation, and obtains a classification score graph and a regression prediction graph; the regression prediction graph is used for predicting offset distances between the center position of the target and four edges of the regression frame; the classification score map is a prediction target foreground score, and each score corresponds to the offset distance of four sides in the regression prediction map.
For example, setting the stride of the last two stages of the original ResNet50 to 1 and the dilation of the dilated convolution to 4 enlarges the receptive field. Meanwhile, when designing the feature extraction network, the experimental analysis of the backbone network in SiamRPN++ is used for reference and the fifth convolutional layer of ResNet50 is removed, so that, compared with AlexNet and GoogLeNet, the ResNet-based feature extraction network still has deeper feature extraction capability. Furthermore, the fifth-layer features of ResNet50 are not used in the present invention; only the third- and fourth-layer features of ResNet50 are used. This greatly accelerates the training and inference speed of the twin neural network without noticeably affecting accuracy.
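As a rough illustration of this backbone modification (a sketch, not the patent's exact implementation), the PyTorch snippet below assumes the modified stages map to torchvision's layer2/layer3, whose strides are replaced by dilation, and that the last stage (layer4) is simply dropped; the class name and the use of torchvision are illustrative choices.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class TrackingBackbone(nn.Module):
    """Modified ResNet-50: strides of the third/fourth stages replaced by dilation,
    last stage removed, features taken from the third and fourth stages."""

    def __init__(self):
        super().__init__()
        net = resnet50(weights=None,  # pretrained weights could be loaded here instead
                       replace_stride_with_dilation=[True, True, False])
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2, self.layer3 = net.layer1, net.layer2, net.layer3
        # net.layer4 (the last convolutional stage) is deliberately not kept.

    def forward(self, x):
        x = self.stem(x)
        x = self.layer1(x)
        f3 = self.layer2(x)   # third-stage features
        f4 = self.layer3(f3)  # fourth-stage features
        return f3, f4


if __name__ == "__main__":
    template = torch.randn(1, 3, 127, 127)
    f3, f4 = TrackingBackbone()(template)
    print(f3.shape, f4.shape)  # same spatial resolution, because strides were replaced by dilation
```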
The whole network is divided into two input branches, namely a template image branch and a search image branch. The size of the template image is set to 127×127×3, and the size of the search image is set to 255×255×3. The two branch images respectively enter the first multi-attention fusion module and the second multi-attention fusion module for feature extraction, yielding the template branch feature, a feature map of height H_z, width W_z and C channels, and the search branch feature, a feature map of height H_x, width W_x and C channels.
FIG. 2 is a structural diagram of the channel attention module of the present invention, and FIG. 3 is a structural diagram of the spatial attention module of the present invention. The template image and the search image are first subjected to feature extraction and then enhanced by structurally identical multi-attention task fusion modules. Let the feature map of an input image after feature extraction be F ∈ R^(C×H×W); this feature is channel-enhanced and spatially enhanced by the multi-attention task fusion module. The input feature map F ∈ R^(C×H×W) first undergoes a global average pooling operation along the channels, giving a feature map with an unchanged number of channels and a compressed spatial size. The feature map is then subjected to two nonlinear convolutional transformations, one reducing and one increasing the dimension. Finally, the dimension-increased feature passes through an activation function to obtain a feature map representing the weight of each channel, and this channel weight map is multiplied with the input feature to obtain the final channel-attention-enhanced feature map F_CA ∈ R^(C×H×W). The specific flow of the channel attention part corresponds to the following formulas:
F_sd = GAP(F)
F_ac = Conv_up(Conv_down(F_sd))
F_CA = σ(F_ac) · F
where GAP(·) denotes the global average pooling operation; F_sd ∈ R^(C×1×1) denotes the feature map after global average pooling along the channel direction; Conv_down(·) denotes the convolutional dimension-reduction operation; Conv_up(·) denotes the convolutional dimension-increase operation; the reduction/expansion ratio r is set to 16; σ denotes the Sigmoid activation function; F_ac ∈ R^(C×1×1) denotes the feature map after dimension increase; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module; F_CA denotes the feature map after channel attention enhancement.
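The following PyTorch sketch, given only for illustration and not taken from the patent, implements the three formulas above directly; the ReLU between the two 1×1 convolutions and the layer names are assumptions.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention: GAP -> 1x1 conv down -> 1x1 conv up -> Sigmoid reweighting (r = 16)."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                      # F_sd = GAP(F)
        self.conv_down = nn.Conv2d(channels, channels // r, 1)  # dimension reduction
        self.relu = nn.ReLU(inplace=True)                       # non-linearity between the two convs (assumed)
        self.conv_up = nn.Conv2d(channels // r, channels, 1)    # dimension increase
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):
        f_sd = self.gap(f)                                      # (N, C, 1, 1)
        f_ac = self.conv_up(self.relu(self.conv_down(f_sd)))    # (N, C, 1, 1)
        return self.sigmoid(f_ac) * f                           # F_CA = sigma(F_ac) * F
```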
The extracted template feature map and search feature map are taken as input features and undergo a spatial feature transformation through a multi-layer encoding and decoding network structure to obtain a feature map representing spatial position weights. The encoding structure consists of several down-sampling convolutional layers, and the decoding structure consists of several up-sampling deconvolution layers. At the last up-sampling layer, the features are restored to a single-channel feature map used to represent the weights of the spatial positions. The finally obtained spatial weight response feature map corresponds to the following formula:
F_SA = DC(EC(F_CA))
where EC(·) denotes the encoding (down-sampling) operation; DC(·) denotes the decoding (up-sampling) operation; F_SA ∈ R^(1×H×W) denotes the spatial weight response feature map finally obtained by the encoding and decoding operations; F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement.
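A minimal PyTorch sketch of such an encoding-decoding spatial attention branch is given below; the number of stages, the channel widths, the Sigmoid on the output and the final resize are assumptions made here for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAttention(nn.Module):
    """Spatial attention: strided-conv encoder, transposed-conv decoder ending in a single-channel map."""

    def __init__(self, channels):
        super().__init__()
        self.encoder = nn.Sequential(  # EC(.): down-sampling convolutions
            nn.Conv2d(channels, channels // 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, channels // 4, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(  # DC(.): up-sampling deconvolutions
            nn.ConvTranspose2d(channels // 4, channels // 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels // 2, 1, 4, stride=2, padding=1),  # single-channel weight map
            nn.Sigmoid(),
        )

    def forward(self, f_ca):
        w = self.decoder(self.encoder(f_ca))
        # Resize back to the input resolution in case of odd feature-map sizes.
        return F.interpolate(w, size=f_ca.shape[-2:], mode="bilinear", align_corners=False)
```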
FIG. 4 is a structural diagram of the multi-attention task fusion module of the present invention. In the embodiment of the invention, the first half of the multi-attention task fusion module is the channel attention and the second half is the spatial attention; the input of the spatial attention module is the channel-attention-enhanced feature, and the output feature map after fusing channel attention and spatial attention is obtained as:
F_MA = F_CA · F_SA = CA(F_I) · DC(EC(F_CA)),
where F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement; F_SA ∈ R^(C×H×W) denotes the feature map after spatial attention enhancement; CA(·) denotes the channel attention enhancement operation; DC(·) denotes the decoding (up-sampling) operation; EC(·) denotes the encoding (down-sampling) operation; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module.
The invention performs feature enhancement by combining a channel attention mechanism and a spatial attention mechanism: the channel attention mechanism is responsible for enhancing the feature channels of the target, while the spatial attention mechanism is responsible for enhancing the spatial position of the target. To avoid the difficulty that the global pooling used by conventional spatial attention mechanisms has in distinguishing distractors, the invention designs an encoding-decoding network for the spatial feature enhancement operation, so that the feature enhancement better matches the definition of the tracking task and the discrimination capability of the tracking model can be better improved.
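Combining the two sketches above, a minimal (illustrative, not authoritative) version of the whole multi-attention task fusion module could look as follows; `ChannelAttention` and `SpatialAttention` refer to the example classes defined earlier.

```python
import torch.nn as nn


class MultiAttentionFusion(nn.Module):
    """F_MA = CA(F_I) * DC(EC(CA(F_I))): channel attention followed by spatial attention."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.channel_attention = ChannelAttention(channels, r)
        self.spatial_attention = SpatialAttention(channels)

    def forward(self, f_i):
        f_ca = self.channel_attention(f_i)    # channel-enhanced features, (N, C, H, W)
        f_sa = self.spatial_attention(f_ca)   # spatial weight map, (N, 1, H, W)
        return f_ca * f_sa                    # broadcast over channels
```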
Fig. 5 is an internal structural diagram of the bounding box encoding module of the present invention. The bounding box encoding module first converts the bounding box coordinates of the template frame into a one-dimensional feature vector B = (x, y, w, h), where (x, y) represents the corner coordinates of the target bounding box, w represents the width of the target bounding box, and h represents the height of the target bounding box. The feature vector B is passed through multiple fully connected layers to obtain the output feature:
B_C = f_C(B),
where f_C denotes the fully connected layer structure and B_C denotes the output feature of the feature vector B after the fully connected layers.
The feature map output by the fully connected layers and the feature map F_C ∈ R^(C×H×W) output by the cross-correlation network undergo a broadcast addition operation:
F_b = F_C + B_C,
where F_b ∈ R^(C×H×W).
The final bounding box encoding output is obtained by a convolutional encoding of dimension 1×C:
F_BM = f_g(F_b),
where f_g denotes a convolutional encoding operation of dimension 1×C and F_BM ∈ R^(C×H×W) denotes the final bounding box encoding output.
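For illustration, a minimal PyTorch sketch of the bounding box encoding module is given below; the hidden width of the fully connected layers and the exact shape handling are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn


class BoxEncoding(nn.Module):
    """Encode the template box (x, y, w, h), broadcast-add it to the correlation features,
    and apply a 1x1 convolutional encoding."""

    def __init__(self, channels, hidden=64):
        super().__init__()
        self.fc = nn.Sequential(               # f_C: multi-layer fully connected structure
            nn.Linear(4, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )
        self.encode = nn.Conv2d(channels, channels, kernel_size=1)  # f_g: 1x1 convolutional encoding

    def forward(self, box, f_c):
        # box: (N, 4) tensor holding (x, y, w, h); f_c: (N, C, H, W) correlation features.
        b_c = self.fc(box).unsqueeze(-1).unsqueeze(-1)  # B_C reshaped to (N, C, 1, 1)
        f_b = f_c + b_c                                 # broadcast addition: F_b = F_C + B_C
        return self.encode(f_b)                         # F_BM
```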
The invention can better utilize the tracking priori information by designing the boundary box coding module, thereby further improving the performance of the tracking model.
After the output features F_BM ∈ R^(C×H×W) of the cross-correlation matching network are obtained, they are passed through a series of convolutional layers, finally outputting a classification score map of size 25×25×1 and a regression prediction map of size 25×25×4. The classification branch is responsible for predicting the score of the target being foreground. The regression branch is responsible for predicting four distances (l, t, b, r), which represent the offset distances from the target center position to the four edges of the regression box.
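The patent does not spell out the exact form of the cross-correlation, so the sketch below assumes the depth-wise cross-correlation commonly used in twin trackers, followed by a small convolutional head that produces the 25×25×1 classification map and the 25×25×4 regression map; the head depths are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def depthwise_xcorr(search, template):
    """Depth-wise cross-correlation: search (N, C, Hx, Wx) correlated with template (N, C, Hz, Wz)."""
    n, c, h, w = search.shape
    out = F.conv2d(search.reshape(1, n * c, h, w),
                   template.reshape(n * c, 1, *template.shape[-2:]),
                   groups=n * c)
    return out.reshape(n, c, out.shape[-2], out.shape[-1])


class ClsRegHead(nn.Module):
    """Convolutional head producing a 1-channel score map and a 4-channel (l, t, b, r) offset map."""

    def __init__(self, channels):
        super().__init__()
        self.cls = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(channels, 1, 1))   # foreground score map
        self.reg = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(channels, 4, 1))   # offsets (l, t, b, r)

    def forward(self, f_bm):
        return self.cls(f_bm), self.reg(f_bm)
```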
2. Training network
In an offline mode, the feature extraction network randomly extracts pairs of template images and search images from the training sample set, gradient back-propagation is carried out using the stochastic gradient descent (SGD) method with momentum, and the network parameters are optimized until the joint task loss function converges.
The whole network is trained end-to-end offline on several large-scale datasets; the training data comprise the five large-scale datasets ImageNet VID, YouTube-BoundingBoxes, GOT10K, ImageNet DET and COCO. The input images comprise a template image and a search image, which come from different frames of the same video sequence. In the training stage, the preprocessed template image and search image are used as inputs of the network and share the same network and network parameters; after passing through the whole twin target tracking network they are output as a classification score map and a regression prediction map, and the whole network is trained and optimized end-to-end using the joint task loss function.
For example, during the offline training process the template image and the search image are randomly selected from the same video sequence. Specifically, during training the batch size of each iteration is set to 16 on a single GPU, and gradient back-propagation optimization is performed using stochastic gradient descent (SGD) with momentum. The total number of training epochs is 50: the first five epochs use a warm-up learning rate increasing gradually from 0.001 to 0.005, and the subsequent forty-five epochs use a learning rate decaying gradually from 0.005 to 0.00001. The weight decay factor and momentum parameter are set to 0.0001 and 0.9, respectively. In the training stage, the preprocessed template image and search image are used as inputs of the network and share the same network and network parameters; after passing through the whole twin target tracking network, the finally output classification score map and regression prediction map are obtained, and the whole network is trained and optimized end-to-end using the joint task loss function.
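A minimal sketch of this optimizer and learning-rate schedule is shown below (illustrative only; the linear warm-up and geometric decay shapes are assumptions, since only the start and end values are stated above).

```python
import torch


def build_optimizer_and_schedule(model, epochs=50, warmup_epochs=5):
    """SGD with momentum 0.9 and weight decay 1e-4; warm-up 0.001 -> 0.005 over the
    first 5 epochs, then decay 0.005 -> 1e-5 over the remaining epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                momentum=0.9, weight_decay=0.0001)

    def lr_at(epoch):
        if epoch < warmup_epochs:                                    # linear warm-up (assumed shape)
            return 0.001 + (0.005 - 0.001) * epoch / (warmup_epochs - 1)
        t = (epoch - warmup_epochs) / (epochs - warmup_epochs - 1)   # 0 -> 1 over the remaining epochs
        return 0.005 * (0.00001 / 0.005) ** t                        # geometric decay (assumed shape)

    def set_lr(epoch):
        for group in optimizer.param_groups:
            group["lr"] = lr_at(epoch)

    return optimizer, set_lr
```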
For a classification branch, the classification score graph ultimately predicts the probability that the target is foreground. For the regression branches, the regression prediction graph ultimately predicts four center offset distances. The joint task loss function L can thus be expressed as follows:
L = λ_1·L_cls + λ_2·L_reg
where L_cls is the binary cross-entropy loss function and L_reg is the IoU loss function. In the embodiment of the invention, λ_1 = 1 and λ_2 = 1.
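The following sketch shows one way the joint task loss could be written; the use of logits for the classification map and the restriction of the IoU term to foreground locations are assumptions not stated above.

```python
import torch
import torch.nn.functional as F


def iou_loss(pred, target, eps=1e-6):
    """IoU loss for (l, t, b, r) offsets from the centre to the box edges; shapes (..., 4)."""
    pl, pt, pb, pr = pred.unbind(-1)
    tl, tt, tb, tr = target.unbind(-1)
    inter = (torch.min(pl, tl) + torch.min(pr, tr)) * (torch.min(pt, tt) + torch.min(pb, tb))
    union = (pl + pr) * (pt + pb) + (tl + tr) * (tt + tb) - inter
    return (1 - inter / (union + eps)).mean()


def joint_loss(cls_pred, cls_target, reg_pred, reg_target, lam1=1.0, lam2=1.0):
    # cls_pred: (N, 1, 25, 25) logits; cls_target: same shape, 1 for foreground locations.
    l_cls = F.binary_cross_entropy_with_logits(cls_pred, cls_target)
    pos = cls_target.squeeze(1) > 0.5                       # regress only at foreground locations (assumed)
    if pos.any():
        l_reg = iou_loss(reg_pred.permute(0, 2, 3, 1)[pos],
                         reg_target.permute(0, 2, 3, 1)[pos])
    else:
        l_reg = reg_pred.sum() * 0                          # keep the graph when no positives exist
    return lam1 * l_cls + lam2 * l_reg
```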
3. On-line tracking test
The first frame image of the test video sequence is cropped and input as the template image into the constructed and trained twin target tracking network; the template image is sent through the feature extraction network into the multi-attention fusion module to obtain the fused template features, and at the same time the bounding box information of the template image is encoded by the bounding box encoding module to obtain the bounding box encoding features. Each subsequent frame of the test video sequence is cropped with an area four times that of the template image, the center of the cropped area being the target center point predicted in the previous frame; the cropped area is sent as the search image into the constructed and trained twin target tracking network to obtain the search features. The obtained search features and template features are sent to the cross-correlation matching network and fused with the bounding box encoding features, the fused features are input into the classification regression network to obtain the classification score map and the regression prediction map, and the final position of the target in the video frame is obtained from the position with the maximum response value in the classification score map combined with the offsets of the regression prediction map.
For example, in the embodiment of the invention, in the test stage the first frame image of the test video sequence is cropped and processed as the template image and sent to the feature extraction network; the template features obtained after preprocessing by the multi-attention module are fixed in the network, and the bounding box information given by the template frame is encoded. Each subsequent frame of the test video sequence is cropped with an area four times that of the template image, the center of the cropped area being the target center point predicted in the previous frame; the cropped area is sent as the search image into the twin tracking network, features are extracted by the ResNet50 deep neural network and enhanced by the multi-attention module, and together with the previously obtained template features they pass through the cross-correlation matching network and the classification regression network to obtain the classification score map and the regression prediction map. Finally, through a series of post-processing operations, the four offsets in the regression prediction map corresponding to the position with the maximum classification score are used to obtain the final predicted bounding box of the target in the frame. After the last frame of the video sequence has been tracked, the whole test process ends, or tracking of the next video sequence starts and the corresponding template features are updated.
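As a rough illustration of a single inference step (not the patent's exact post-processing, which involves further operations), the sketch below picks the highest-scoring location in the classification map and converts its (l, t, b, r) offsets into a box; the stride and offset used to map grid cells back to search-crop pixels are assumptions.

```python
import torch


@torch.no_grad()
def track_step(net, search_img, stride=8, grid_offset=31.0):
    """One inference step: net(search_img) is assumed to return a (1, 1, 25, 25)
    classification map and a (1, 4, 25, 25) regression map of (l, t, b, r) offsets."""
    cls_map, reg_map = net(search_img)
    score = cls_map.sigmoid()[0, 0]                # (25, 25) foreground scores
    idx = int(torch.argmax(score))
    row, col = divmod(idx, score.shape[1])
    l, t, b, r = reg_map[0, :, row, col].tolist()  # offsets at the best location
    cx = col * stride + grid_offset                # grid cell -> search-crop pixel (assumed mapping)
    cy = row * stride + grid_offset
    box = (cx - l, cy - t, cx + r, cy + b)         # (x1, y1, x2, y2) in the search crop
    return box, float(score[row, col])
```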
The invention evaluates tracking performance on the OTB100 dataset using two metrics of the one-pass evaluation: the precision and the area under the curve (AUC) of the success plot. The mainstream tracking algorithms compared with the invention include ATOM, DaSiamRPN, GradNet, SiamRPN, CFNet and SiamFC. As shown in fig. 6, on the success-plot AUC the tracking algorithm proposed by the invention is 0.6, 1.5, 3.4, 4.4, 8.6 and 8.7 percentage points higher than ATOM, DaSiamRPN, GradNet, SiamRPN, CFNet and SiamFC, respectively. On the precision metric, the invention is 0.7, 0.8, 2.6, 4.0, 10.9 and 11.5 percentage points higher than ATOM, DaSiamRPN, GradNet, SiamRPN, CFNet and SiamFC, respectively. In addition, three challenging attributes of the OTB100 benchmark that strongly affect the tracking features, namely fast motion, motion blur and deformation, are selected to further test the tracking performance of the invention. As shown in fig. 7, in these three challenging scenarios the method ranks first among the compared tracking algorithms and obtains good performance, which demonstrates the effectiveness of the multi-attention task fusion and bounding box coding design in improving tracking performance.
As shown in fig. 8, the performance of the tracking algorithm proposed by the invention on the large tracking dataset GOT10K is compared with that of various mainstream tracking algorithms. Most of the video sequences in the GOT10K dataset are outdoor scenes, which better matches real tracking scenarios and poses more challenges to tracking. The tracking algorithm proposed by the invention ranks first on the average overlap rate and on the success rate with an overlap threshold of 0.75, and ranks second on the success rate with an overlap threshold of 0.5, so it still obtains competitive tracking results compared with the various mainstream tracking algorithms.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereto, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the present invention.

Claims (8)

1. The twin target tracking method based on multi-attention task fusion and bounding box coding is characterized by comprising the following steps of:
s1, constructing a twin target tracking network, wherein the twin target tracking network comprises a feature extraction network, a cross-correlation matching network, a classification regression network, a multi-attention task fusion module and a boundary box coding module;
the feature extraction network comprises a template feature extraction branch, a search feature extraction branch, a first multi-attention fusion module and a second multi-attention fusion module; the first multi-attention fusion module performs channel attention and space attention enhancement operation on the template feature extraction branches to obtain enhanced template branch features; the second multi-attention fusion module performs channel attention and space attention enhancement operation on the search feature extraction branch to obtain enhanced search branch features; the first multi-attention fusion module and the second multi-attention fusion module are identical in structure, and the expressions are as follows:
F_MA = F_CA · F_SA = CA(F_I) · DC(EC(F_CA))
where F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement; F_SA ∈ R^(C×H×W) denotes the feature map after spatial attention enhancement; CA(·) denotes the channel attention enhancement operation; DC(·) denotes the decoding (up-sampling) operation; EC(·) denotes the encoding (down-sampling) operation; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module;
the boundary frame coding module codes boundary frame information of the template image to obtain boundary frame coding characteristics;
the cross-correlation matching network carries out cross-correlation matching on the template features output by the first multi-attention fusion module and the search features output by the second multi-attention fusion module to obtain cross-correlation features, then fuses the cross-correlation features with the boundary frame coding features, inputs the fused features into the classification regression network, carries out convolution layer calculation, and obtains a classification score graph and a regression prediction graph; the regression prediction graph is used for predicting offset distances between the center position of the target and four edges of the regression frame; the classification score map is a prediction target foreground score, and each score corresponds to the offset distance of four sides in the regression prediction map;
s2, randomly extracting pairs of template images and search images from the training sample set by the feature extraction network in an offline mode, carrying out gradient back-propagation using the stochastic gradient descent (SGD) method with momentum, and optimizing the network parameters until the joint task loss function converges;
s3, performing online tracking test, cutting a first frame image of a test video sequence, inputting the first frame image as a template image into a constructed and trained twin target tracking network, sending the template image into a first multi-attention fusion module through a template feature extraction branch to obtain enhanced template features, and meanwhile, performing boundary frame information coding on the template image through a boundary frame coding module to obtain boundary frame coding features; cutting each subsequent frame of the test video sequence by using an area which is four times that of the template image, taking the center of the cut area as a predicted target center point of the previous frame, and sending the cut area as a search image into a constructed and trained twin target tracking network to obtain search features; and sending the obtained search features and template features to a cross-correlation matching network to be fused with the boundary frame coding features, inputting the fused features to a classification regression network to obtain a classification score graph and a regression prediction graph, and obtaining the final position of the target on the video sequence frame according to the position with the maximum response value in the classification score graph and the offset of the regression prediction graph.
2. The twin object tracking method based on multi-attention task fusion and bounding box coding according to claim 1, wherein the template feature extraction branch and the search feature extraction branch are both modified from the original ResNet50 deep neural network: the strides of the third and fourth layers of the ResNet50 deep neural network are set to 1, the dilation of the dilated convolution is set to 4, and the fifth convolutional layer of ResNet50 is removed.
3. The twin object tracking method based on multi-attention task fusion and bounding box encoding of claim 1, wherein the template image has a size of 127×127×3 and the search image has a size of 255×255×3.
4. The multi-attention task fusion and bounding box encoding based twin object tracking method of claim 1, wherein the channel attention enhancement operation comprises the sub-steps of:
inputting a template feature map and a search feature map, and carrying out global average pooling operation along a channel to obtain a template feature map and a search feature map with unchanged channel number and compressed size; respectively carrying out two nonlinear convolution transformations of dimension reduction and dimension increase on the obtained template feature diagram and the search feature diagram; the template feature map and the search feature map after dimension increase are subjected to activation function operation to obtain feature maps representing the weights of all channels, and the channel weight feature maps are multiplied with the input feature map to obtain feature maps subjected to channel attention enhancement, wherein the calculation formula is as follows:
F_sd = GAP(F)
F_ac = Conv_up(Conv_down(F_sd))
F_CA = σ(F_ac) · F
where GAP(·) denotes the global average pooling operation; F_sd ∈ R^(C×1×1) denotes the feature map after global average pooling along the channel direction; Conv_down(·) denotes the convolutional dimension-reduction operation; Conv_up(·) denotes the convolutional dimension-increase operation; the reduction/expansion ratio r is set to 16; σ denotes the Sigmoid activation function; F_ac ∈ R^(C×1×1) denotes the feature map after dimension increase; F_I ∈ R^(C×H×W) denotes the input feature map of the multi-attention task fusion module; F_CA denotes the feature map after channel attention enhancement.
5. The multi-attention task fusion and bounding box encoding based twin object tracking method as claimed in claim 1 wherein the spatial attention enhancement operation comprises the sub-steps of:
the extracted template feature map and the search feature map are used as input features to carry out spatial feature transformation through a multi-layer coding and decoding network structure to obtain a feature map representing spatial position weight, wherein the coding structure is composed of a plurality of downsampling convolution layers, the decoding structure is composed of a plurality of upsampling deconvolution layers, the features are restored to be single-channel feature maps at the last layer of the upsampling deconvolution layers, the single-channel feature maps are used as spatial weight response feature maps, and the calculation formula is as follows:
F_SA = DC(EC(F_CA))
where EC(·) denotes the encoding (down-sampling) operation; DC(·) denotes the decoding (up-sampling) operation; F_SA ∈ R^(1×H×W) denotes the spatial weight response feature map finally obtained by the encoding and decoding operations; F_CA ∈ R^(C×H×W) denotes the feature map after channel attention enhancement.
6. The twin object tracking method based on multi-attention task fusion and bounding box coding according to claim 1, wherein the process of obtaining the bounding box coding features is as follows:
converting the bounding box coordinates of the template frame into a one-dimensional feature vector B = (x, y, w, h), where (x, y) are the corner coordinates of the target bounding box, w is the width of the target bounding box and h is the height of the target bounding box; the feature vector B is passed through multiple fully connected layers to obtain the output feature map B_C:
B_C = f_C(B)
where f_C denotes the fully connected layer structure and B_C denotes the output feature of the feature vector B after the fully connected layers;
the feature map output by the fully connected layers and the feature map F_C ∈ R^(C×H×W) output by the cross-correlation network undergo a broadcast addition operation:
F_b = F_C + B_C
where F_b ∈ R^(C×H×W);
the final bounding box encoding output is obtained by a convolutional encoding of dimension 1×C:
F_BM = f_g(F_b)
where f_g denotes a convolutional encoding operation of dimension 1×C; F_BM ∈ R^(C×H×W) denotes the final bounding box encoding output.
7. The twin object tracking method based on multi-attention task fusion and bounding box encoding of claim 1, wherein the classification score map is 25×25×1 in size and the regression prediction map is 25×25×4 in size.
8. The twin object tracking method based on multi-attention task fusion and bounding box coding according to claim 1, wherein the calculation formula of the joint task loss function is:
L = λ_1·L_cls + λ_2·L_reg
where L_cls is the binary cross-entropy loss function, L_reg is the IoU loss function, and λ_1 and λ_2 are the weights of L_cls and L_reg, respectively.
CN202310555213.XA 2023-05-17 2023-05-17 Twin target tracking method based on multi-attention task fusion and bounding box coding Pending CN116630850A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310555213.XA CN116630850A (en) 2023-05-17 2023-05-17 Twin target tracking method based on multi-attention task fusion and bounding box coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310555213.XA CN116630850A (en) 2023-05-17 2023-05-17 Twin target tracking method based on multi-attention task fusion and bounding box coding

Publications (1)

Publication Number Publication Date
CN116630850A true CN116630850A (en) 2023-08-22

Family

ID=87641089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310555213.XA Pending CN116630850A (en) 2023-05-17 2023-05-17 Twin target tracking method based on multi-attention task fusion and bounding box coding

Country Status (1)

Country Link
CN (1) CN116630850A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333515A (en) * 2023-12-01 2024-01-02 南昌工程学院 Target tracking method and system based on regional awareness
CN117333515B (en) * 2023-12-01 2024-02-09 南昌工程学院 Target tracking method and system based on regional awareness

Similar Documents

Publication Publication Date Title
Zhang et al. Ga-net: Guided aggregation net for end-to-end stereo matching
Xu et al. Line segment detection using transformers without edges
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN109919032B (en) Video abnormal behavior detection method based on motion prediction
CN113628249B (en) RGBT target tracking method based on cross-modal attention mechanism and twin structure
CN111723693B (en) Crowd counting method based on small sample learning
CN111767847B (en) Pedestrian multi-target tracking method integrating target detection and association
CN113744311A (en) Twin neural network moving target tracking method based on full-connection attention module
CN112668522B (en) Human body key point and human body mask joint detection network and method
KR20200027887A (en) Learning method, learning device for optimizing parameters of cnn by using multiple video frames and testing method, testing device using the same
Han et al. A method based on multi-convolution layers joint and generative adversarial networks for vehicle detection
CN116630850A (en) Twin target tracking method based on multi-attention task fusion and bounding box coding
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN112184731A (en) Multi-view stereo depth estimation method based on antagonism training
Wang et al. Water hazard detection using conditional generative adversarial network with mixture reflection attention units
CN116934796A (en) Visual target tracking method based on twinning residual error attention aggregation network
CN116823878A (en) Visual multi-target tracking method based on fusion paradigm
CN111104855A (en) Workflow identification method based on time sequence behavior detection
CN116188555A (en) Monocular indoor depth estimation algorithm based on depth network and motion information
CN114463187B (en) Image semantic segmentation method and system based on aggregation edge features
KR102527642B1 (en) System and method for detecting small target based deep learning
Wang et al. LRDif: Diffusion Models for Under-Display Camera Emotion Recognition
CN112652059B (en) Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method
CN115546252A (en) Target tracking method and system based on cross-correlation matching enhanced twin network
Xue et al. Multiscale feature extraction network for real-time semantic segmentation of road scenes on the autonomous robot

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination