CN116229112A - Twin network target tracking method based on multi-attention - Google Patents

Twin network target tracking method based on multi-attention

Info

Publication number
CN116229112A
Authority
CN
China
Prior art keywords
attention
features
image
branch
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211558887.7A
Other languages
Chinese (zh)
Inventor
周丽芳
刘金兰
李伟生
马将凯
卢峻民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202211558887.7A priority Critical patent/CN116229112A/en
Publication of CN116229112A publication Critical patent/CN116229112A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/08 Learning methods
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00 Arrangements for image or video recognition or understanding
            • G06V 10/20 Image preprocessing
              • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
            • G06V 10/40 Extraction of image or video features
              • G06V 10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
            • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V 10/764 Using classification, e.g. of video objects
              • G06V 10/766 Using regression, e.g. by projecting features on hyperplanes
              • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                  • G06V 10/806 Fusion of extracted features
              • G06V 10/82 Using neural networks
          • G06V 20/00 Scenes; Scene-specific elements
            • G06V 20/40 Scenes; Scene-specific elements in video content
              • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a twin network target tracking method based on multi-attention (SiamMAN, Siamese Multi-attention Network), belonging to the technical field of computer vision. The method mainly comprises the following steps: first, in order to use target feature information more effectively and cope with complex background interference, a multi-attention module is designed to refine the features, in which a channel attention branch gives higher weight to more discriminative channels and a position attention branch fully utilizes the position information of the target; second, in order to make better use of shallow features, a feature fusion method is designed inside the multi-attention module, which fuses the shallow features with the attention-refined features by residual learning and then fuses the two attention features, further enhancing the feature representation; finally, the Focal-EIoU loss is used as the regression loss function, guiding the tracker to generate a more accurate tracking box.

Description

Twin network target tracking method based on multi-attention
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a visual target tracking method.
Background
Video object tracking is an important research topic in the field of computer vision: given the size and position of an object in the initial frame of a video sequence, the tracker must predict the size and position of that object in subsequent frames. In recent years, with the continuous development of target tracking algorithms, target tracking theory has become increasingly mature, and target tracking technology is widely applied in video surveillance, pedestrian tracking, intelligent transportation, modern military applications and other fields. However, in complex and changeable real scenes, a moving target is often affected by factors such as occlusion, scale change, background clutter and illumination change, so accurately tracking an arbitrary target remains a very challenging task; research on target tracking technology therefore still has great research and practical value.
At present, target tracking methods can be roughly divided into two main types: methods based on correlation filtering and methods based on deep learning. Correlation-filtering-based tracking performs a correlation operation between a designed filter template and target candidate regions and locates the target at the maximum of the response map, which makes it fast. However, because of the cosine window and the limited search region, the filter template learns too little information, which easily leads to tracking drift under large deformation and complex background interference. Deep-learning-based target tracking methods, in contrast, provide more discriminative features and therefore yield more robust trackers. Target tracking algorithms based on the twin network convert the tracking problem into a similarity matching problem and simplify it through end-to-end training, achieving higher tracking accuracy.
However, target tracking methods based on the twin network still have the following problems: 1) in complex and changeable real scenes, complex background interference is a very troublesome problem for the tracking field; most existing trackers that address this challenge extract features with a deep backbone network, which yields richer features but introduces a large number of parameters, increasing the computational cost and slowing down tracking; 2) other trackers addressing the complex background challenge enhance the anti-interference capability of the network by adding training data, which increases the time cost and training time while bringing only limited improvement in tracking performance. To alleviate these shortcomings, the invention provides a twin network target tracking method based on multi-attention, aimed at the complex background challenge in real scenes and at improving tracking performance.
CN111192292A, a target tracking method based on an attention mechanism and a twin network and related devices, inputs a target template and a search area into a preset target tracking model and outputs target tracking information of the target template in the search area through the preset target tracking model; the target tracking model comprises a twin network, and a channel attention module and/or a spatial attention module is added to the twin network. By adding the channel attention module and/or the spatial attention module to the residual network, that target tracking method and related devices significantly improve the average expected overlap and robustness of the twin tracking algorithm, in particular the robustness to motion change, camera motion, occlusion and size change; therefore, a more accurate result can be obtained when the target tracking method provided by that embodiment is used for target tracking prediction.
The invention with publication number CN111192292A takes SiamRPN++ as its basic network, adopts ResNet-50 as the feature extraction network, adds a channel attention and/or spatial attention module in the twin network, and obtains the box regression result and the feature classification result through box regression and classification branches. Although both are target tracking methods based on twin networks and both use an attention mechanism to improve tracking performance, the present invention and the invention of publication number CN111192292A differ in the following points:
(1) Selection of the feature extraction network: the invention of publication number CN111192292A utilizes the deep ResNet-50 as its feature extraction network, while the present invention employs the relatively shallow GoogLeNet. The present invention therefore has fewer parameters and a faster tracking speed.
(2) Placement of the attention module: the invention of publication number CN111192292A inserts channel attention and/or spatial attention modules into the residual network, whereas the present invention designs a multi-attention module consisting of a channel attention branch and a position attention branch; the features extracted by the feature extraction network enter the two attention branches in parallel and are then fused, so that channel information and position information are fully utilized, interference from redundant information is reduced, and tracking performance is improved.
(3) Mode of target prediction: the invention of publication number CN111192292A predicts the target position in an anchor-based manner, while the present invention adopts an anchor-free manner: a classification branch distinguishes the foreground from the background, a centrality branch computes the distance between each predicted point and the target centre point so as to suppress predicted points that are far from the centre, and a regression branch produces the tracking box.
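For illustration of the anchor-free prediction described above, the following Python (PyTorch) sketch gives a common centrality (centerness) definition of the kind used by anchor-free detectors and trackers such as FCOS; the patent does not disclose the exact formula of its centrality branch, so this is an assumed, standard form rather than the invention's own.

    import torch

    def centerness(l, r, t, b):
        # FCOS-style centerness: close to 1 at the target centre and close to 0
        # near the box border.  l, r, t, b are tensors holding the distances from
        # each predicted point to the left/right/top/bottom sides of the target box.
        # Assumed standard definition; the patent only states that the centrality
        # branch suppresses points far from the target centre point.
        lr = torch.min(l, r) / torch.max(l, r).clamp(min=1e-6)
        tb = torch.min(t, b) / torch.max(t, b).clamp(min=1e-6)
        return torch.sqrt(lr * tb)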
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by providing a twin network target tracking method based on multi-attention. The technical scheme of the invention is as follows:
A twin network target tracking method based on multi-attention, comprising the following steps:
step 1: selecting a first frame image of a video as a template image, selecting other subsequent frame images of the video as search images, and respectively preprocessing the template image and the search images;
step 2: respectively inputting the preprocessed template image and the preprocessed search image into a template branch and a search branch of a twin network, and extracting features through a GoogLeNet feature extraction backbone network to obtain a feature map of the template image and a feature map of the search image;
step 3: the template image features and the search image features are respectively input into a multi-attention module composed of a channel attention branch and a position attention branch in parallel, wherein the channel attention branch gives higher weight to more discriminative feature channels and the position attention branch fully utilizes the position information of the target, so that the features are further refined; target information is then propagated from the template image features to the search image features through a graph attention mechanism to obtain a feature response map;
step 4: the feature response map is input into a classification-regression sub-network, and the regression branch replaces the IoU loss with the Focal-EIoU loss, guiding the tracker to generate a more accurate tracking box; the obtained tracking box is the final tracking result.
Further, the step 1: selecting a first frame image of a video as a template image, selecting other subsequent frame images of the video as search images, and respectively preprocessing the template image and the search images, wherein the method specifically comprises the following steps of:
A1, template image preprocessing: select the first frame image of the video and calibrate the target area with a rectangular box, the centre point of the rectangular box representing the position of the target centre point; expand each of the four sides of the target rectangular box by p pixels; if the expanded box exceeds the image boundary, fill the exceeding part with the image pixel mean; finally, scale the cropped target image to 127 × 127 pixels;
A2, search image preprocessing: select the other subsequent frame images of the video and calibrate the target area with a rectangular box, the centre point of the rectangular box representing the position of the target centre point; expand each of the four sides of the target rectangular box by p pixels; if the expanded box exceeds the image boundary, fill the exceeding part with the image pixel mean; finally, scale the cropped target image to 287 × 287 pixels.
Further, the step 2: the preprocessed template image and the preprocessed search image are respectively input into a template branch and a search branch of a twin network, feature extraction is carried out through a GoogLeNet feature extraction backbone network, and features of the template image and features of the search image are obtained, specifically comprising the following steps:
B1, obtain a template image z of size 127 × 127 according to step A1, and obtain a search image x of size 287 × 287 according to step A2;
B2, input the template image z into the template branch of the twin network and extract the template image features F_z through the GoogLeNet (Inception v3) feature extraction backbone network;
B3, input the search image x into the search branch of the twin network and extract the search image features F_x through the GoogLeNet (Inception v3) feature extraction backbone network.
Further, the step 3: the template image features and the search image features are respectively input into a multi-attention module composed of a channel attention branch and a position attention branch in parallel, wherein the channel attention branch gives higher weight to more discriminative feature channels and the position attention branch fully utilizes the position information of the target, so that the features are further refined; target information is transmitted from the template image features to the search image features through a graph attention mechanism to obtain a feature response map, specifically comprising the following steps:
C1, input the template image features F_z obtained in step B2 into the multi-attention module, which mainly comprises a channel attention branch and a position attention branch, to obtain two different attention features; fuse the two attention features by a pixel-wise multiplication operation, then further enhance the feature representation through two 3 × 3 convolution operations, and finally obtain the refined template image features F'_z;
C2, input the search image features F_x obtained in step B3 into the multi-attention module, which mainly comprises a channel attention branch and a position attention branch, to obtain two different attention features; fuse the two attention features by a pixel-wise multiplication operation, then further enhance the feature representation through two 3 × 3 convolution operations, and finally obtain the refined search image features F'_x;
C3, for the refined template image features F'_z and the refined search image features F'_x, transmit target information from the template image features to the search image features through a graph attention mechanism to obtain the feature response map F_fin.
Further, in the step C3, the refined template image features F'_z and the refined search image features F'_x transmit target information from the template image features to the search image features through a graph attention mechanism, with the following specific steps:
D1, regard each 1 × 1 × C grid of the template image features F'_z as a node, where C is the number of feature channels, and obtain the node set V_z containing all such nodes;
D2, regard each 1 × 1 × C grid of the search image features F'_x as a node, where C is the number of feature channels, and obtain the node set V_x containing all such nodes;
D3, construct a complete bipartite graph G = (V, E), where V = V_z ∪ V_x and E contains an edge between every node of V_x and every node of V_z; the two sub-graphs of G are the graph formed by the template nodes V_z and the graph formed by the search nodes V_x;
D4, because the more similar a location in the search image is to a local region of the template image, the more likely it is to be foreground, more target information should be transferred to it. First apply a linear transformation to each node, then compute the inner product of the transformed nodes to obtain a correlation score that measures the similarity of the two nodes, and finally generate the response map. The formula is:
e_{i,j} = (W_x f_i^x)^T (W_z f_j^z)    (1)
where e_{i,j} denotes the correlation score between node i ∈ V_x and node j ∈ V_z, W_x and W_z are linear transformation matrices, and f_i^x and f_j^z are the feature vectors of nodes i and j, respectively.
Further, the specific steps of the multi-attention module are as follows:
E1, input the feature F before optimization into the channel attention branch, which gives higher weight to more discriminative channels; then, using residual learning, perform a pixel-wise multiplication with the feature before optimization to obtain the feature F_1. The formula is:
F_1 = σ(F_SENet) ⊗ F    (2)
where F_1 denotes the feature optimized by the channel attention branch of the multi-attention module, F denotes the feature before optimization, ⊗ denotes the pixel-wise multiplication operation, σ denotes the Sigmoid activation function, and F_SENet denotes the feature obtained through the channel attention mechanism;
E2, input the feature F before optimization into the position attention branch and, using residual learning, perform a pixel-wise multiplication with the feature before optimization to obtain the feature F_2. The formula is:
F_2 = σ(F_CA) ⊗ F    (3)
where F_2 denotes the feature optimized by the position attention branch of the multi-attention module, F denotes the feature before optimization, ⊗ denotes the pixel-wise multiplication operation, σ denotes the Sigmoid activation function, and F_CA denotes the feature obtained through the position attention mechanism;
E3, fuse the feature F_1 obtained in step E1 with the feature F_2 obtained in step E2: first perform a pixel-wise multiplication of the two features, then further enhance the feature representation through two 3 × 3 convolution operations to obtain the optimized feature F_output. The formula is:
F_output = F_3×3(F_3×3(F_1 ⊗ F_2))    (4)
where F_output denotes the optimized feature, F_3×3 denotes a 3 × 3 convolution operation, and ⊗ denotes the pixel-wise multiplication operation.
Further, in the step 4, the feature response map is input into the classification-regression sub-network, and the regression branch replaces the IoU loss with the Focal-EIoU loss to guide the tracker to generate a more accurate tracking box; the obtained tracking box is the final tracking result. The specific steps are:
F1, convolve the feature response map obtained in step C3 and input it into a classification branch, a centrality branch and a regression branch, respectively;
F2, the classification branch performs the classification task with the conventional cross-entropy loss, giving the classification loss L_cls;
F3, the centrality branch, in parallel with the classification branch, is used to remove abnormal data, giving the centrality loss L_cen;
F4, the regression branch performs the regression task with the Focal-EIoU loss, giving the regression loss L_reg;
F5, with the classification loss L_cls obtained in step F2, the centrality loss L_cen obtained in step F3 and the regression loss L_reg obtained in step F4, the final total loss function is calculated as:
L = L_cls + λ_1 L_cen + λ_2 L_reg    (5)
where L denotes the total loss function, λ_1 denotes the hyper-parameter weighting the centrality loss, and λ_2 denotes the hyper-parameter weighting the regression loss.
Further, the EIoU loss and the loss function of the regression branch are:
L_EIoU = 1 − IoU + ρ²(b, b^gt) / ((w^c)² + (h^c)²) + ρ²(w, w^gt) / (w^c)² + ρ²(h, h^gt) / (h^c)²    (6)
where L_EIoU denotes the EIoU loss, IoU denotes the intersection-over-union of the two boxes, ρ(·,·) denotes the Euclidean distance, b denotes the centre point of the predicted box, b^gt the centre point of the ground-truth box, w the width of the predicted box, w^gt the width of the ground-truth box, h the height of the predicted box, h^gt the height of the ground-truth box, w^c the width of the smallest enclosing box, and h^c the height of the smallest enclosing box;
L_reg = IoU^γ · L_EIoU    (7)
where L_reg denotes the regression loss calculated from the Focal-EIoU loss and γ is a hyper-parameter.
The invention has the advantages and beneficial effects as follows:
1. Aiming at the common problem of complex background interference in the field of target tracking, the invention designs a twin network target tracking method based on multi-attention and improves tracking performance at the feature level by designing a multi-attention module. Compared with current state-of-the-art trackers (SiamFC++, SiamCAR, SiamGAT), the invention shows superior tracking performance on common target tracking datasets;
2. Optimizing the features is an effective way to improve the tracking performance of a tracker, so the invention designs a multi-attention module comprising a channel attention branch and a position attention branch, which enhances the network's ability to select features and reduces the burden of redundant information on the network. The channel attention branch gives higher weight to more discriminative feature channels, the position attention branch fully utilizes the position information of the target, and a feature fusion module then fuses the two different features, further enhancing the feature representation; the resulting more robust representation effectively improves the tracking accuracy of the tracker;
3. The target tracking task comprises a classification branch and a regression branch: positive and negative samples are determined by the classification branch, and the bounding box of the target is determined by the regression branch. Most current target tracking methods use the IoU loss as the regression loss function, but when two boxes do not intersect, the IoU loss cannot reflect the distance between them. The Focal-EIoU loss proposed in the EIoU paper explicitly takes the width and height of the tracking box into account. The invention therefore introduces the Focal-EIoU loss on the regression branch to guide the tracker to generate a more accurate regression box, further improving the tracking performance of the tracker.
Drawings
FIG. 1 is a general framework diagram of the twin network target tracking method based on multi-attention in accordance with a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of the multi-attention module designed in the present invention;
FIG. 3 shows the tracking results of the present invention on the MotorRolling, Board and Soccer video sequences of the OTB100 dataset.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the embodiment of the invention is based on a SiamGAT target tracking framework as a basic framework, and the details are shown in documents Dongyan Guo, yanyan Shao, yes Cui, zhenhua Wang, ryan Zhang, chunhua Shan. Firstly, a tracking frame is built by using SiamGAT as a basis, then a multi-attention module is designed, characteristics extracted through a backbone network are optimized, focal-EIoU loss is introduced, and a tracker is guided to obtain a more accurate regression frame, so that tracking precision is improved.
As shown in fig. 1, the twin network target tracking method based on multi-attention comprises the following steps:
1. as shown in fig. 1, a first frame image of a video is selected as a template image, other subsequent frame images of the video are selected as search images, and then preprocessing operation is performed on the template image and the search images, specifically including:
1.1 Template image preprocessing: select the first frame image of the video and calibrate the target area with a rectangular box, the centre point of the rectangular box representing the position of the target centre point; expand each of the four sides of the target rectangular box by p pixels; if the expanded box exceeds the image boundary, fill the exceeding part with the image pixel mean; finally, scale the cropped target image to 127 × 127 pixels;
1.2 Search image preprocessing: select the other subsequent frame images of the video and calibrate the target area with a rectangular box, the centre point of the rectangular box representing the position of the target centre point; expand each of the four sides of the target rectangular box by p pixels; if the expanded box exceeds the image boundary, fill the exceeding part with the image pixel mean; finally, scale the cropped target image to 287 × 287 pixels.
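A minimal Python sketch of the cropping in steps 1.1 and 1.2 is given below, assuming OpenCV-style H × W × C colour images and boxes in (x1, y1, x2, y2) form; the function name, the expansion p and the example values are illustrative only and are not prescribed by the patent.

    import cv2

    def crop_and_resize(img, box, p, out_size):
        # Crop the target region enlarged by p pixels on each side, pad any part
        # that falls outside the image with the per-channel pixel mean, then
        # resize to out_size x out_size (127 for the template, 287 for the search image).
        x1, y1, x2, y2 = [int(round(v)) for v in box]
        x1, y1, x2, y2 = x1 - p, y1 - p, x2 + p, y2 + p
        h, w = img.shape[:2]
        pad_l, pad_t = max(0, -x1), max(0, -y1)
        pad_r, pad_b = max(0, x2 - w), max(0, y2 - h)
        mean = img.reshape(-1, img.shape[2]).mean(axis=0)          # image pixel mean
        padded = cv2.copyMakeBorder(img, pad_t, pad_b, pad_l, pad_r,
                                    cv2.BORDER_CONSTANT, value=mean.tolist())
        patch = padded[y1 + pad_t:y2 + pad_t, x1 + pad_l:x2 + pad_l]
        return cv2.resize(patch, (out_size, out_size))

    # illustrative usage (the value of p is chosen arbitrarily here):
    # template = crop_and_resize(first_frame, init_box, p=32, out_size=127)
    # search   = crop_and_resize(current_frame, previous_box, p=32, out_size=287)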
2. As shown in fig. 1, the preprocessed template image and the search image are respectively input into a template branch and a search branch of a twin network, and feature extraction is performed through a google net feature extraction backbone network to obtain features of the template image and features of the search image, which specifically comprises:
2.1 Obtain a template image z of size 127 × 127 according to step 1.1, and obtain a search image x of size 287 × 287 according to step 1.2;
2.2 Input the template image z into the template branch of the twin network and extract the template image features F_z through the GoogLeNet (Inception v3) feature extraction backbone network;
2.3 Input the search image x into the search branch of the twin network and extract the search image features F_x through the GoogLeNet (Inception v3) feature extraction backbone network.
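To make the weight sharing between the template branch and the search branch explicit, the following sketch applies one shared backbone module to both inputs; the GoogLeNet (Inception v3) internals are not reproduced here, so `backbone` is a stand-in for whatever feature extractor is actually used.

    import torch
    import torch.nn as nn

    class SiameseFeatureExtractor(nn.Module):
        # One backbone, two inputs: the template branch and the search branch share
        # the same weights.  `backbone` stands in for the GoogLeNet (Inception v3)
        # feature extraction network of the patent; any nn.Module mapping an image
        # batch to a C x H x W feature map can be plugged in.
        def __init__(self, backbone: nn.Module):
            super().__init__()
            self.backbone = backbone

        def forward(self, template: torch.Tensor, search: torch.Tensor):
            f_z = self.backbone(template)   # features of the 127 x 127 template image z
            f_x = self.backbone(search)     # features of the 287 x 287 search image x
            return f_z, f_x

    # toy stand-in backbone, illustrative only (not Inception v3):
    # net = SiameseFeatureExtractor(nn.Sequential(nn.Conv2d(3, 256, 7, stride=2), nn.ReLU()))
    # f_z, f_x = net(torch.randn(1, 3, 127, 127), torch.randn(1, 3, 287, 287))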
3. As shown in fig. 1, the features of the template image and the features of the search image are respectively input into a multi-attention module composed of a channel attention branch and a position attention branch in parallel, wherein the channel attention branch gives higher weight to more discriminative feature channels and the position attention branch fully utilizes the position information of the target, so that the features are further refined; target information is transmitted from the template image features to the search image features through a graph attention mechanism to obtain a feature response map. The specific steps are as follows:
3.1 Input the template image features F_z obtained in step 2.2 into the multi-attention module, which mainly comprises a channel attention branch and a position attention branch, to obtain two different attention features; fuse the two attention features by a pixel-wise multiplication operation, then further enhance the feature representation through two 3 × 3 convolution operations, and finally obtain the refined template image features F'_z;
3.2 Input the search image features F_x obtained in step 2.3 into the multi-attention module, which mainly comprises a channel attention branch and a position attention branch, to obtain two different attention features; fuse the two attention features by a pixel-wise multiplication operation, then further enhance the feature representation through two 3 × 3 convolution operations, and finally obtain the refined search image features F'_x;
3.3 For the refined template image features F'_z and the refined search image features F'_x, transmit target information from the template image features to the search image features through a graph attention mechanism to obtain the feature response map F_fin.
4. The refined template image features F'_z and search image features F'_x transmit target information from the template image features to the search image features through a graph attention mechanism, with the following specific steps:
4.1 Regard each 1 × 1 × C grid of the template image features F'_z as a node, where C is the number of feature channels, and obtain the node set V_z containing all such nodes;
4.2 Regard each 1 × 1 × C grid of the search image features F'_x as a node, where C is the number of feature channels, and obtain the node set V_x containing all such nodes;
4.3 Construct a complete bipartite graph G = (V, E), where V = V_z ∪ V_x and E contains an edge between every node of V_x and every node of V_z; the two sub-graphs of G are the graph formed by the template nodes V_z and the graph formed by the search nodes V_x;
4.4 Because the more similar a location in the search image is to a local region of the template image, the more likely it is to be foreground, more target information should be transferred to it. First apply a linear transformation to each node, then compute the inner product of the transformed nodes to obtain a correlation score that measures the similarity of the two nodes, and finally generate the response map. The formula is:
e_{i,j} = (W_x f_i^x)^T (W_z f_j^z)    (1)
where e_{i,j} denotes the correlation score between node i ∈ V_x and node j ∈ V_z, W_x and W_z are linear transformation matrices, and f_i^x and f_j^z are the feature vectors of nodes i and j, respectively.
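The following sketch implements formula (1) under the assumption that the linear transformations W_x and W_z are realised as 1 × 1 convolutions; the softmax aggregation that turns the correlation scores into the propagated response features follows the SiamGAT design and is likewise an assumption, since the patent only states that a response map is generated.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphAttentionPropagation(nn.Module):
        # Computes e_{i,j} = (W_x f_i^x)^T (W_z f_j^z) between every search node i
        # and every template node j, then (assumption, as in SiamGAT) aggregates the
        # template nodes into each search location with a softmax over j.
        def __init__(self, channels: int):
            super().__init__()
            self.w_x = nn.Conv2d(channels, channels, kernel_size=1)  # W_x
            self.w_z = nn.Conv2d(channels, channels, kernel_size=1)  # W_z

        def forward(self, f_z: torch.Tensor, f_x: torch.Tensor) -> torch.Tensor:
            b, c, hx, wx = f_x.shape
            q = self.w_x(f_x).flatten(2)               # B x C x Nx, transformed search nodes
            k = self.w_z(f_z).flatten(2)               # B x C x Nz, transformed template nodes
            e = torch.einsum('bci,bcj->bij', q, k)     # B x Nx x Nz, correlation scores e_{i,j}
            attn = F.softmax(e, dim=-1)                # normalise over template nodes j (assumption)
            v = f_z.flatten(2)                         # B x C x Nz, template node features
            out = torch.einsum('bij,bcj->bci', attn, v)
            return out.view(b, c, hx, wx)              # propagated features on the search grid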
5. As shown in fig. 2, the multi-attention module includes a channel attention branch and a position attention branch, and specifically comprises:
5.1 Input the feature F before optimization into the channel attention branch, which gives higher weight to more discriminative channels; then, using residual learning, perform a pixel-wise multiplication with the feature before optimization to avoid redundant information and suppress background noise, obtaining the feature F_1. The formula is:
F_1 = σ(F_SENet) ⊗ F    (2)
where F_1 denotes the feature optimized by the channel attention branch of the multi-attention module, F denotes the feature before optimization, ⊗ denotes the pixel-wise multiplication operation, σ denotes the Sigmoid activation function, and F_SENet denotes the feature obtained through the channel attention mechanism.
5.2 Input the feature F before optimization into the position attention branch, which fully utilizes the position information of the target; then, using residual learning, perform a pixel-wise multiplication with the feature before optimization to avoid redundant information and suppress background noise, obtaining the feature F_2. The formula is:
F_2 = σ(F_CA) ⊗ F    (3)
where F_2 denotes the feature optimized by the position attention branch of the multi-attention module, F denotes the feature before optimization, ⊗ denotes the pixel-wise multiplication operation, σ denotes the Sigmoid activation function, and F_CA denotes the feature obtained through the position attention mechanism.
5.3 Fuse the feature F_1 obtained in step 5.1 with the feature F_2 obtained in step 5.2: first perform a pixel-wise multiplication of the two features, then further enhance the feature representation through two 3 × 3 convolution operations to obtain the optimized feature F_output. The formula is:
F_output = F_3×3(F_3×3(F_1 ⊗ F_2))    (4)
where F_output denotes the optimized feature, F_3×3 denotes a 3 × 3 convolution operation, and ⊗ denotes the pixel-wise multiplication operation.
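A minimal PyTorch sketch of the multi-attention module of fig. 2, following formulas (2) to (4), is given below: an SE-style channel attention branch and a position attention branch each re-weight the input feature through a Sigmoid gate, the two branch outputs are fused by pixel-wise multiplication, and two 3 × 3 convolutions refine the result. The internal structure of the two branches (reduction ratio, pooling scheme, kernel size) is not specified in the patent and is assumed here.

    import torch
    import torch.nn as nn

    class MultiAttention(nn.Module):
        # Channel attention branch + position attention branch fused as in
        # formulas (2)-(4).  The branch internals are assumptions: an SE-style
        # squeeze-and-excitation block for F_SENet and a spatial map built from
        # channel-pooled statistics for F_CA.
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.channel = nn.Sequential(                        # produces F_SENet
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
            )
            self.position = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # produces F_CA (assumed form)
            self.refine = nn.Sequential(                         # the two 3 x 3 convolutions of formula (4)
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
            )

        def forward(self, f: torch.Tensor) -> torch.Tensor:
            f1 = torch.sigmoid(self.channel(f)) * f              # F_1 = sigma(F_SENet) (x) F, formula (2)
            pooled = torch.cat([f.mean(dim=1, keepdim=True),
                                f.max(dim=1, keepdim=True).values], dim=1)
            f2 = torch.sigmoid(self.position(pooled)) * f        # F_2 = sigma(F_CA) (x) F, formula (3)
            return self.refine(f1 * f2)                          # F_output, formula (4)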
6. As shown in fig. 1, the feature response map is input into the classification-regression sub-network, and the regression branch replaces the IoU loss with the Focal-EIoU loss to guide the tracker to generate a more accurate tracking box; the obtained tracking box is the final tracking result. The specific steps are:
6.1 Convolve the feature response map obtained in step 3.3 and input it into a classification branch, a centrality branch and a regression branch, respectively;
6.2 The classification branch performs the classification task with the conventional cross-entropy loss, giving the classification loss L_cls;
6.3 The centrality branch, in parallel with the classification branch, is used to remove abnormal data, giving the centrality loss L_cen;
6.4 The regression branch performs the regression task with the Focal-EIoU loss, giving the regression loss L_reg, calculated as:
L_EIoU = 1 − IoU + ρ²(b, b^gt) / ((w^c)² + (h^c)²) + ρ²(w, w^gt) / (w^c)² + ρ²(h, h^gt) / (h^c)²    (5)
L_reg = IoU^γ · L_EIoU    (6)
where L_EIoU denotes the EIoU loss, IoU denotes the intersection-over-union of the two boxes, ρ(·,·) denotes the Euclidean distance, b denotes the centre point of the predicted box, b^gt the centre point of the ground-truth box, w the width of the predicted box, w^gt the width of the ground-truth box, h the height of the predicted box, h^gt the height of the ground-truth box, w^c the width of the smallest enclosing box, h^c the height of the smallest enclosing box, L_reg the regression loss calculated from the Focal-EIoU loss, and γ a hyper-parameter.
6.5 With the classification loss L_cls obtained in step 6.2, the centrality loss L_cen obtained in step 6.3 and the regression loss L_reg obtained in step 6.4, the final total loss function is calculated as:
L = L_cls + λ_1 L_cen + λ_2 L_reg    (7)
where L denotes the total loss function, λ_1 denotes the hyper-parameter weighting the centrality loss, and λ_2 denotes the hyper-parameter weighting the regression loss.
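A Python sketch of the regression loss of formulas (5) and (6) and the total loss of formula (7) is given below, following the published Focal-EIoU definition; the (x1, y1, x2, y2) box format, the mean reduction and the example value of γ are assumptions.

    import torch

    def eiou_loss(pred, gt, eps=1e-7):
        # EIoU loss of formula (5): 1 - IoU plus the normalised centre-distance,
        # width-difference and height-difference terms.  pred, gt: (N, 4) boxes
        # given as (x1, y1, x2, y2).
        px1, py1, px2, py2 = pred.unbind(-1)
        gx1, gy1, gx2, gy2 = gt.unbind(-1)
        pw, ph = px2 - px1, py2 - py1
        gw, gh = gx2 - gx1, gy2 - gy1
        inter = ((torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(min=0) *
                 (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(min=0))
        union = pw * ph + gw * gh - inter
        iou = inter / (union + eps)
        cw = torch.max(px2, gx2) - torch.min(px1, gx1)   # width  w^c of the smallest enclosing box
        ch = torch.max(py2, gy2) - torch.min(py1, gy1)   # height h^c of the smallest enclosing box
        rho2 = (((px1 + px2) - (gx1 + gx2)) ** 2 +
                ((py1 + py2) - (gy1 + gy2)) ** 2) / 4    # squared centre distance rho^2(b, b^gt)
        loss = (1 - iou
                + rho2 / (cw ** 2 + ch ** 2 + eps)
                + (pw - gw) ** 2 / (cw ** 2 + eps)
                + (ph - gh) ** 2 / (ch ** 2 + eps))
        return iou, loss

    def focal_eiou_loss(pred, gt, gamma=0.5):
        # L_reg = IoU^gamma * L_EIoU, formula (6); gamma is a hyper-parameter.
        iou, loss = eiou_loss(pred, gt)
        return (iou.detach() ** gamma * loss).mean()

    # total loss of formula (7), with lambda_1 and lambda_2 as hyper-parameters:
    # L = L_cls + lambda_1 * L_cen + lambda_2 * L_reg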
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The above examples should be understood as illustrative only and not limiting the scope of the invention. Various changes and modifications to the present invention may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the invention as defined in the appended claims.

Claims (8)

1. A twin network target tracking method based on multi-attention, characterized by comprising the following steps:
step 1: selecting a first frame image of a video as a template image, selecting other subsequent frame images of the video as search images, and respectively preprocessing the template image and the search images;
step 2: respectively inputting the preprocessed template image and the preprocessed search image into a template branch and a search branch of a twin network, and extracting features through a GoogLeNet feature extraction backbone network to obtain a feature map of the template image and a feature map of the search image;
step 3: the template image features and the search image features are respectively input into a multi-attention module composed of a channel attention branch and a position attention branch in parallel, wherein the channel attention branch gives higher weight to more discriminative feature channels and the position attention branch fully utilizes the position information of the target, so that the features are further refined; target information is then propagated from the template image features to the search image features through a graph attention mechanism to obtain a feature response map;
step 4: the feature response map is input into a classification-regression sub-network, and the regression branch replaces the IoU loss with the Focal-EIoU loss, guiding the tracker to generate a more accurate tracking box; the obtained tracking box is the final tracking result.
2. The twin network target tracking method based on multi-attention according to claim 1, wherein the step 1: selecting a first frame image of a video as a template image, selecting other subsequent frame images of the video as search images, and respectively preprocessing the template image and the search images, specifically comprises the following steps:
A1, template image preprocessing: select the first frame image of the video and calibrate the target area with a rectangular box, the centre point of the rectangular box representing the position of the target centre point; expand each of the four sides of the target rectangular box by p pixels; if the expanded box exceeds the image boundary, fill the exceeding part with the image pixel mean; finally, scale the cropped target image to 127 × 127 pixels;
A2, search image preprocessing: select the other subsequent frame images of the video and calibrate the target area with a rectangular box, the centre point of the rectangular box representing the position of the target centre point; expand each of the four sides of the target rectangular box by p pixels; if the expanded box exceeds the image boundary, fill the exceeding part with the image pixel mean; finally, scale the cropped target image to 287 × 287 pixels.
3. The twin network target tracking method based on multi-attention according to claim 2, wherein the step 2: inputting the preprocessed template image and the preprocessed search image respectively into a template branch and a search branch of a twin network, and performing feature extraction through a GoogLeNet feature extraction backbone network to obtain the features of the template image and the features of the search image, specifically comprises the following steps:
B1, obtain a template image z of size 127 × 127 according to step A1, and obtain a search image x of size 287 × 287 according to step A2;
B2, input the template image z into the template branch of the twin network and extract the template image features F_z through the GoogLeNet (Inception v3) feature extraction backbone network;
B3, input the search image x into the search branch of the twin network and extract the search image features F_x through the GoogLeNet (Inception v3) feature extraction backbone network.
4. The twin network target tracking method based on multi-attention according to claim 3, wherein the step 3: the features of the template image and the features of the search image are respectively input into a multi-attention module composed of a channel attention branch and a position attention branch in parallel, wherein the channel attention branch gives higher weight to more discriminative feature channels and the position attention branch fully utilizes the position information of the target, so that the features are further refined; target information is transmitted from the template image features to the search image features through a graph attention mechanism to obtain a feature response map, specifically comprising the following steps:
C1, input the template image features F_z obtained in step B2 into the multi-attention module, which mainly comprises a channel attention branch and a position attention branch, to obtain two different attention features; fuse the two attention features by a pixel-wise multiplication operation, then further enhance the feature representation through two 3 × 3 convolution operations, and finally obtain the refined template image features F'_z;
C2, input the search image features F_x obtained in step B3 into the multi-attention module, which mainly comprises a channel attention branch and a position attention branch, to obtain two different attention features; fuse the two attention features by a pixel-wise multiplication operation, then further enhance the feature representation through two 3 × 3 convolution operations, and finally obtain the refined search image features F'_x;
C3, for the refined template image features F'_z and the refined search image features F'_x, transmit target information from the template image features to the search image features through a graph attention mechanism to obtain the feature response map F_fin.
5. The twin network target tracking method based on multi-attention according to claim 4, wherein in the step C3 the refined template image features F'_z and the refined search image features F'_x transmit target information from the template image features to the search image features through a graph attention mechanism, with the following specific steps:
D1, regard each 1 × 1 × C grid of the template image features F'_z as a node, where C is the number of feature channels, and obtain the node set V_z containing all such nodes;
D2, regard each 1 × 1 × C grid of the search image features F'_x as a node, where C is the number of feature channels, and obtain the node set V_x containing all such nodes;
D3, construct a complete bipartite graph G = (V, E), where V = V_z ∪ V_x and E contains an edge between every node of V_x and every node of V_z; the two sub-graphs of G are the graph formed by the template nodes V_z and the graph formed by the search nodes V_x;
D4, because the more similar a location in the search image is to a local region of the template image, the more likely it is to be foreground, more target information should be transferred to it; first apply a linear transformation to each node, then compute the inner product of the transformed nodes to obtain a correlation score that measures the similarity of the two nodes, and finally generate the response map, the formula being:
e_{i,j} = (W_x f_i^x)^T (W_z f_j^z)    (1)
where e_{i,j} denotes the correlation score between node i ∈ V_x and node j ∈ V_z, W_x and W_z are linear transformation matrices, and f_i^x and f_j^z are the feature vectors of nodes i and j, respectively.
6. The twin network target tracking method based on multi-attention according to claim 4, wherein the specific steps of the multi-attention module are as follows:
E1, input the feature F before optimization into the channel attention branch, which gives higher weight to more discriminative channels; then, using residual learning, perform a pixel-wise multiplication with the feature before optimization to obtain the feature F_1. The formula is:
F_1 = σ(F_SENet) ⊗ F    (2)
where F_1 denotes the feature optimized by the channel attention branch of the multi-attention module, F denotes the feature before optimization, ⊗ denotes the pixel-wise multiplication operation, σ denotes the Sigmoid activation function, and F_SENet denotes the feature obtained through the channel attention mechanism;
E2, input the feature F before optimization into the position attention branch and, using residual learning, perform a pixel-wise multiplication with the feature before optimization to obtain the feature F_2. The formula is:
F_2 = σ(F_CA) ⊗ F    (3)
where F_2 denotes the feature optimized by the position attention branch of the multi-attention module, F denotes the feature before optimization, ⊗ denotes the pixel-wise multiplication operation, σ denotes the Sigmoid activation function, and F_CA denotes the feature obtained through the position attention mechanism;
E3, fuse the feature F_1 obtained in step E1 with the feature F_2 obtained in step E2: first perform a pixel-wise multiplication of the two features, then further enhance the feature representation through two 3 × 3 convolution operations to obtain the optimized feature F_output. The formula is:
F_output = F_3×3(F_3×3(F_1 ⊗ F_2))    (4)
where F_output denotes the optimized feature, F_3×3 denotes a 3 × 3 convolution operation, and ⊗ denotes the pixel-wise multiplication operation.
7. The twin network target tracking method based on multi-attention according to claim 6, wherein the step 4 inputs the feature response map into the classification-regression sub-network, and the regression branch replaces the IoU loss with the Focal-EIoU loss to guide the tracker to generate a more accurate tracking box, the obtained tracking box being the final tracking result, specifically comprising the following steps:
F1, convolve the feature response map obtained in step C3 and input it into a classification branch, a centrality branch and a regression branch, respectively;
F2, the classification branch performs the classification task with the conventional cross-entropy loss, giving the classification loss L_cls;
F3, the centrality branch, in parallel with the classification branch, is used to remove abnormal data, giving the centrality loss L_cen;
F4, the regression branch performs the regression task with the Focal-EIoU loss, giving the regression loss L_reg;
F5, with the classification loss L_cls obtained in step F2, the centrality loss L_cen obtained in step F3 and the regression loss L_reg obtained in step F4, the final total loss function is calculated as:
L = L_cls + λ_1 L_cen + λ_2 L_reg    (5)
where L denotes the total loss function, λ_1 denotes the hyper-parameter weighting the centrality loss, and λ_2 denotes the hyper-parameter weighting the regression loss.
8. The twin network target tracking method based on multi-attention according to claim 7, wherein the Focal-EIoU loss and the loss function of the regression branch are:
L_EIoU = 1 − IoU + ρ²(b, b^gt) / ((w^c)² + (h^c)²) + ρ²(w, w^gt) / (w^c)² + ρ²(h, h^gt) / (h^c)²    (6)
where L_EIoU denotes the EIoU loss, IoU denotes the intersection-over-union of the two boxes, ρ(·,·) denotes the Euclidean distance, b denotes the centre point of the predicted box, b^gt the centre point of the ground-truth box, w the width of the predicted box, w^gt the width of the ground-truth box, h the height of the predicted box, h^gt the height of the ground-truth box, w^c the width of the smallest enclosing box, and h^c the height of the smallest enclosing box;
L_reg = IoU^γ · L_EIoU    (7)
where L_reg denotes the regression loss calculated from the Focal-EIoU loss and γ is a hyper-parameter.
CN202211558887.7A 2022-12-06 2022-12-06 Twin network target tracking method based on multiple attentives Pending CN116229112A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211558887.7A CN116229112A (en) 2022-12-06 2022-12-06 Twin network target tracking method based on multiple attentives

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211558887.7A CN116229112A (en) 2022-12-06 2022-12-06 Twin network target tracking method based on multiple attentives

Publications (1)

Publication Number Publication Date
CN116229112A true CN116229112A (en) 2023-06-06

Family

ID=86579290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211558887.7A Pending CN116229112A (en) 2022-12-06 2022-12-06 Twin network target tracking method based on multiple attentives

Country Status (1)

Country Link
CN (1) CN116229112A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934796A (en) * 2023-07-20 2023-10-24 河南大学 Visual target tracking method based on twinning residual error attention aggregation network
CN117670938A (en) * 2024-01-30 2024-03-08 江西方兴科技股份有限公司 Multi-target space-time tracking method based on super-treatment robot


Similar Documents

Publication Publication Date Title
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN112001385B (en) Target cross-domain detection and understanding method, system, equipment and storage medium
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
CN116229112A (en) Twin network target tracking method based on multi-attention
CN111160407B (en) Deep learning target detection method and system
CN112560656A (en) Pedestrian multi-target tracking method combining attention machine system and end-to-end training
CN110942471B (en) Long-term target tracking method based on space-time constraint
CN110929848B (en) Training and tracking method based on multi-challenge perception learning model
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN111767847A (en) Pedestrian multi-target tracking method integrating target detection and association
CN110781785A (en) Traffic scene pedestrian detection method improved based on fast RCNN algorithm
CN113129335B (en) Visual tracking algorithm and multi-template updating strategy based on twin network
Liang et al. Comparison detector for cervical cell/clumps detection in the limited data scenario
CN113963032A (en) Twin network structure target tracking method fusing target re-identification
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN111931953A (en) Multi-scale characteristic depth forest identification method for waste mobile phones
CN113052184A (en) Target detection method based on two-stage local feature alignment
CN112884147A (en) Neural network training method, image processing method, device and electronic equipment
CN110349176B (en) Target tracking method and system based on triple convolutional network and perceptual interference learning
Dong et al. Learning regional purity for instance segmentation on 3d point clouds
CN112287906B (en) Template matching tracking method and system based on depth feature fusion
CN113888586A (en) Target tracking method and device based on correlation filtering
CN116664867A (en) Feature extraction method and device for selecting training samples based on multi-evidence fusion
CN111291785A (en) Target detection method, device, equipment and storage medium
Lu et al. Siamese Graph Attention Networks for robust visual object tracking

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination