LU102992B1 - Siamese network target tracking method based on channel and spatial attention mechanisms - Google Patents

Siamese network target tracking method based on channel and spatial attention mechanisms

Info

Publication number
LU102992B1
Authority
LU
Luxembourg
Prior art keywords
target
channel
network model
spatial attention
target images
Prior art date
Application number
LU102992A
Other languages
French (fr)
Inventor
Jun Wang
Yuanyun Wang
Original Assignee
Nanchang Inst Tech
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Inst Tech filed Critical Nanchang Inst Tech
Priority to LU102992A priority Critical patent/LU102992B1/en
Application granted granted Critical
Publication of LU102992B1 publication Critical patent/LU102992B1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/09 - Supervised learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a siamese network target tracking method based on channel and spatial attention mechanisms, including the following steps: processing a video or image data set to obtain a plurality of target images having a uniform image size; constructing and obtaining a new backbone network model on the basis of a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism; extracting training samples from the plurality of target images to train the new backbone network model; extracting deep features of target image samples from the plurality of target images by the well-trained new backbone network model, and performing similarity matching on the deep features of the target image samples in a target image candidate region to obtain a plurality of target candidate blocks; and utilizing a target candidate block with a maximum similarity score acquired, thereby performing target tracking.

Description

SIAMESE NETWORK TARGET TRACKING METHOD BASED ON CHANNEL AND SPATIAL ATTENTION MECHANISMS
TECHNICAL FIELD
[01] The invention relates to the technical field of computer vision, and in particular, to a siamese network target tracking method based on channel and spatial attention mechanisms.
BACKGROUND ART
[02] Target tracking, as an important topic in computer vision, has practical applications in fields such as autonomous driving, video surveillance, video analysis, medical treatment and military science; however, target tracking against a complex background usually suffers from challenges such as target deformation, motion blurring and occlusion.
[03] Generally, target tracking algorithms include discriminative algorithms and generative algorithms. In recent years, deep learning and tracking algorithms based on siamese networks have also attracted wide attention. A siamese network architecture is used for template matching of the detected target candidate samples, and the location of the target image is obtained by calculating the maximum similarity between a target region and a candidate region.
[04] However, a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism have not been combined simultaneously in the prior art, and accuracy and robustness are unsatisfactory.
SUMMARY
[05] The invention aims to solve the problems that vision target tracking does not simultaneously combine a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism, and that target tracking accuracy and robustness are unsatisfactory.
[06] The invention provides a siamese network target tracking method based on channel and spatial attention mechanisms, comprising the following steps:
[07] step I: processing a video or image data set to obtain a plurality of target images having a uniform image size;
[08] step II: constructing and obtaining a new backbone network model on the basis of a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism;
[09] step III: extracting training samples from the plurality of target images to train the new backbone network model;
[10] step IV: extracting deep features of target image samples from the plurality of target images by the well-trained new backbone network model, and performing similarity matching on the deep features of the target image samples in a target image candidate region to obtain a plurality of target candidate blocks, with each target candidate block corresponding to a similarity score;
[11] step V: utilizing a target candidate block with a maximum similarity score acquired, thereby performing target tracking.
[12] In the invention, GOT-10k is used as a training set to adjust model parameters for off-line training, which can more accurately represent a target in a video; and then, the lightweight convolutional neural network model is used for feature extraction. An appearance model of a tracking algorithm designed in the invention has better robustness and accuracy.
[13] Provided is the siamese network target tracking method based on channel and spatial attention mechanisms, wherein in the step of constructing and obtaining the new backbone network model on the basis of the convolutional neural network model, the channel attention mechanism and the spatial attention mechanism,
[14] the plurality of target images are taken as a training data set for training, wherein the training data set includes 560 motion objects and 87 motion pattern categories;
[15] a stochastic gradient descent method is used for training and construction, wherein the momentum is set as 0.9.
[16] Provided is the siamese network target tracking method based on channel and spatial attention mechanisms, wherein the sizes of target image features respectively extracted by a template branch and a search branch in the siamese network framework are “6x6x128” and “22x22x128”.
BRIEF DESCRIPTION OF THE DRAWINGS
[17] FIG. 1 is a flow diagram of a siamese network target tracking method based on channel and spatial attention mechanisms according to the invention;
[18] FIG. 2 is a principle diagram of a siamese network target tracking method based on channel and spatial attention mechanisms according to the invention.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[19] The invention discloses a siamese network target tracking method based on channel and spatial attention mechanisms, as shown in FIG. 1 and FIG. 2, and the method includes the following steps:
[20] S101, a video or image data set is processed to obtain a plurality of target images having a uniform image size.
[21] Images in a video or image data set are processed to have a uniform size, which is not only convenient for subsequent input, but also convenient for extracting deep features of images having uniform sizes in a tracking stage.
[22] S102, a new backbone network model is constructed and obtained on the basis of a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism.
[23] The new backbone network model is a siamese network framework which includes a template branch and a search branch. As shown in FIG. 2, Z corresponds to the template branch, and X corresponds to the search branch.
[24] In the dotted-line box in the middle of FIG. 2, the convolutional neural network model, a channel attention module and a spatial attention module are combined to construct the new backbone network model. Therein, the convolutional neural network model includes a convolutional layer 1, a convolutional layer 2, a convolutional layer 3, a convolutional layer 4 and a convolutional layer 5. Therein, the channel attention module and the spatial attention module are located between the convolutional layer 1 and the convolutional layer 2. This structure is used in a subsequent step for processing the deep features of target image samples extracted from the target images.
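By way of illustration, the following is a minimal PyTorch sketch of such a backbone, assuming an AlexNet-style five-layer convolutional stack; the intermediate channel widths, strides and the reduction ratio of the attention gate are assumptions, while the layer count, the attention placement between layer 1 and layer 2, and the 128-channel output follow the description. With these assumed strides, 127x127x3 and 255x255x3 inputs yield 6x6x128 and 22x22x128 feature maps, matching the sizes stated elsewhere in this description. Formula-faithful channel and spatial attention modules are sketched separately further below.

```python
import torch
import torch.nn as nn

class CBAMGate(nn.Module):
    """Channel gate followed by a spatial gate (minimal stand-in; the
    formula-faithful channel/spatial modules are sketched further below)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling descriptor
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class AttentionBackbone(nn.Module):
    """Five convolutional layers, no fully connected layer; the attention
    gate sits between convolutional layer 1 (plus pooling) and layer 2."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 96, 11, stride=2), nn.ReLU(inplace=True),
                                   nn.MaxPool2d(3, stride=2))
        self.attn = CBAMGate(96)
        self.conv2 = nn.Sequential(nn.Conv2d(96, 256, 5), nn.ReLU(inplace=True),
                                   nn.MaxPool2d(3, stride=2))
        self.conv3 = nn.Sequential(nn.Conv2d(256, 384, 3), nn.ReLU(inplace=True))
        self.conv4 = nn.Sequential(nn.Conv2d(384, 384, 3), nn.ReLU(inplace=True))
        self.conv5 = nn.Conv2d(384, 128, 3)  # 128-channel output features

    def forward(self, x):
        x = self.attn(self.conv1(x))
        for layer in (self.conv2, self.conv3, self.conv4, self.conv5):
            x = layer(x)
        return x
```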
[25] S103, training samples are extracted from the plurality of target images to train the new backbone network model.
[26] During training, the picture size needs to be determined according to the model complexity and the available GPU memory. In the invention, the size of the sample images input into the template branch is 127x127x3, and the size of the sample images input into the search branch is 255x255x3.
[27] It shall be additionally described that:
[28] when a child window for searching the target images extends beyond the boundary of the target images, the missing image part is filled with the RGB mean value. In a subsequent testing stage (including step S104 and step S105), target images of the two channels will be respectively introduced into the template branch and the search branch of the siamese network framework, so as to acquire the deep features of the target image samples.
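As a hedged illustration of the mean-value fill described above, the following NumPy sketch crops a square search window around a target centre and fills any part that falls outside the frame with the per-image RGB mean; the function name and the exact crop convention are illustrative assumptions.

```python
import numpy as np

def crop_with_mean_fill(image, cx, cy, size):
    """Crop a square window of side `size` centred at (cx, cy); pixels that
    fall outside the frame are filled with the per-image RGB mean value."""
    h, w, _ = image.shape
    rgb_mean = image.reshape(-1, 3).mean(axis=0)

    half = size // 2
    x1, y1 = cx - half, cy - half
    x2, y2 = x1 + size, y1 + size

    patch = np.empty((size, size, 3), dtype=image.dtype)
    patch[:] = rgb_mean                        # pre-fill with the RGB mean

    # Overlap between the requested window and the actual frame.
    ix1, iy1 = max(x1, 0), max(y1, 0)
    ix2, iy2 = min(x2, w), min(y2, h)
    if ix1 < ix2 and iy1 < iy2:
        patch[iy1 - y1:iy2 - y1, ix1 - x1:ix2 - x1] = image[iy1:iy2, ix1:ix2]
    return patch
```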
[29] In the step of taking the plurality of target images as a training data set for training, the training data set includes 560 motion objects and 87 motion pattern categories. Moreover, the training data set provides video clips of over 10,000 real-world motion objects and over 1,500,000 manually annotated bounding boxes. The new backbone network model designed above can be trained end to end on the large-scale data set GOT-10k.
[30] In addition, a stochastic gradient descent (SGD) method is used for training and construction, wherein the momentum is set to 0.9. The learning rate is reduced over the iterations from an initial learning rate of 0.01 to a final learning rate of 0.00001. The new backbone network model disclosed in the invention is trained for 50 epochs in total, the weight decay is set to 0.0005, and the batch size is 16.
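A hedged PyTorch sketch of this training configuration is given below: SGD with momentum 0.9 and weight decay 0.0005, and a learning rate decayed from 0.01 to 0.00001 over 50 epochs. The geometric per-epoch decay is one common reading of the schedule and is an assumption, as are the placeholder model, loss and data loader shown in the commented loop.

```python
import torch
from torch.optim.lr_scheduler import ExponentialLR

def build_optimizer(model, epochs=50, lr_start=1e-2, lr_end=1e-5):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_start,
                                momentum=0.9, weight_decay=5e-4)
    # Decay the learning rate geometrically so it reaches lr_end in the last epoch.
    gamma = (lr_end / lr_start) ** (1.0 / (epochs - 1))
    scheduler = ExponentialLR(optimizer, gamma=gamma)
    return optimizer, scheduler

# Typical epoch loop (batch size 16 is set in the DataLoader; names are placeholders):
# loader = torch.utils.data.DataLoader(got10k_pairs, batch_size=16, shuffle=True)
# for epoch in range(50):
#     for z, x, label in loader:
#         loss = criterion(tracker(z, x), label)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
#     scheduler.step()
```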
[31] S104, the deep features of the target image samples are extracted from the plurality of target images by the well-trained new backbone network model, and similarity matching is performed on the deep features of the target image samples in a target image candidate region to obtain a plurality of target candidate blocks, with each target candidate block corresponding to a similarity score.
[32] Specifically, for the above-mentioned new backbone network model, the convolutional neural network model (CNN model) includes 5 convolutional layers, without any fully-connected layer. The channel attention mechanism and the spatial attention mechanism consist of the channel attention module and the spatial attention module. According to their successive placement positions, the channel attention module and the spatial attention module are constructed behind the first convolutional layer and a pooling layer. A 7x7 convolution kernel is used as the receptive field of the spatial attention module.
[33] In the siamese network framework, the following relation holds:

$h(L_{k\tau}x) = L_{\tau}h(x)$  (1)

[34] In the formula, $h$ represents the mapping function of the input/output signals, $k$ represents the stride, $\tau$ is the shift value of an effective region in the input/output signals, $L_{k\tau}$ and $L_{\tau}$ both represent translation operators, and $x$ represents an input target image.
[35] Moreover, a convolutional inline function $p(\cdot)$ is usually used to associate the two input target images $X$ and $Z$ with each other to generate an output response diagram used for representing a similarity score between the deep features of the target image samples after the two input target images pass through the siamese network framework.
[36] Therein, the formula expression of the similarity score is as follows:

$f(Z,X) = p(Z) * p(X) + b\mathbb{1}$  (2)

[37] In the formula, $f(Z,X)$ represents the similarity score between the two input target images, $b\mathbb{1}$ represents a bias value $b \in \mathbb{R}$ taken at every position, $\mathbb{R}$ represents the set of real numbers, $p(Z)$ and $p(X)$ represent the output features of the two input target images after passing through the siamese network framework, $Z$ and $X$ represent the two input target images, and $p(\cdot)$ represents the convolutional inline function.
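A hedged sketch of how formula (2) is typically evaluated is given below: the template features $p(Z)$ act as a convolution kernel slid over the search features $p(X)$ (a cross-correlation), and a scalar bias is added at every position. The exact implementation is an assumption; the feature sizes follow the 6x6x128 and 22x22x128 values stated in this description.

```python
import torch
import torch.nn.functional as F

def similarity_map(feat_z, feat_x, bias=0.0):
    """f(Z, X) = p(Z) * p(X) + b·1, realised as a cross-correlation.

    feat_z: template features, shape (1, 128, 6, 6)   -> p(Z)
    feat_x: search features,   shape (1, 128, 22, 22) -> p(X)
    Returns a (1, 1, 17, 17) response map of similarity scores.
    """
    return F.conv2d(feat_x, feat_z) + bias

# Example with the feature sizes stated in the description:
z = torch.randn(1, 128, 6, 6)
x = torch.randn(1, 128, 22, 22)
print(similarity_map(z, x).shape)   # torch.Size([1, 1, 17, 17])
```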
[38] Specifically, in the step of extracting the deep features of the target image samples from the plurality of target images by the well-trained new backbone network model, the channel attention mechanism includes the following steps:
[39] A1. Features of the target images of two channels are obtained by max-pooling and global average-pooling.
[40] In the invention, the size of the input target images Z is "HxWxC", the features of the target images of the two channels are obtained by the max-pooling and the global average-pooling, and the size of the features of the target images of the two channels is "1x1xC".
[41] B1. The features of the target images of the two channels obtained after the max-pooling and the global average-pooling are input into a multi-layer perceptron network, and a feature vector is obtained by element-wise summation.
[42] Specifically, the features of the target images of the two channels obtained after the max-pooling and the global average-pooling are input into a multi-layer perceptron network (namely, MLP), wherein the first-layer neuron number is C/r, the activation function is ReLU, and the second-layer neuron number is C; the network parameters of the two layers are shared. The feature vector is output after element-wise summation.
[43] Therein, the feature vector is $WR_{e}(F_{avg}(Z)) + WR_{e}(F_{max}(Z))$.
[44] C1. The feature vector is processed by a Sigmoid activation function to obtain a first weight coefficient, and the first weight coefficient is multiplied with the input target image Z to obtain a first weighted new feature.
[45] In this step, the first weight coefficient is represented as follows:

$\theta_{c}(Z) = \sigma(WR_{e}(F_{avg}(Z)) + WR_{e}(F_{max}(Z)))$  (3)

[46] In the formula, $\theta_{c}(Z)$ is the first weight coefficient, $\sigma$ represents the Sigmoid activation function, $W$ represents a weight of the shared multi-layer perceptron network, $R_{e}$ represents the ReLU function, $F_{avg}(\cdot)$ is the global average-pooling function, and $F_{max}(\cdot)$ is the max-pooling function;
[47] the first weighted new feature is represented as follows:

$F_{c} = \theta_{c}(Z) \otimes Z$  (4)

[48] In the formula, $F_{c}$ represents the first weighted new feature, $\otimes$ represents element-wise multiplication, and $Z$ represents the input target images.
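A hedged PyTorch sketch of the channel attention module corresponding to formulas (3) and (4) is shown below; the reduction ratio r, the batch-first tensor layout and the exact placement of the ReLU inside the shared perceptron are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """theta_c(Z) = sigmoid(W·R_e(F_avg(Z)) + W·R_e(F_max(Z))), F_c = theta_c(Z) ⊗ Z."""
    def __init__(self, channels, reduction=16):        # reduction ratio r is assumed
        super().__init__()
        self.mlp = nn.Sequential(                       # shared two-layer perceptron W
            nn.Linear(channels, channels // reduction), # first layer: C/r neurons
            nn.ReLU(inplace=True),                      # activation R_e
            nn.Linear(channels // reduction, channels)) # second layer: C neurons

    def forward(self, z):
        b, c, _, _ = z.shape
        avg = self.mlp(z.mean(dim=(2, 3)))              # F_avg(Z): 1x1xC descriptor
        mx = self.mlp(z.amax(dim=(2, 3)))               # F_max(Z): 1x1xC descriptor
        theta_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return theta_c * z                              # F_c, the first weighted feature
```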
[49] Following the channel attention module, the spatial attention module is introduced to focus on which features are significant in the input target images. Specifically, in the step of extracting the deep features of the target image samples from the plurality of target images by the well-trained new backbone network model, the spatial attention mechanism includes the following steps:
[50] A2. The features of the target images of the two channels are obtained by the max-pooling and the global average-pooling, and the features of the target images of the two channels are spliced by a first convolutional layer.
[51] Similar to the channel attention module, the size of the input target images Z is "HxWxC". The size of each of the features of the target images of the two channels, obtained by the max-pooling and the global average-pooling along the channel dimension, is "HxWx1", and the two features are spliced together and processed by a standard convolutional layer (the first convolutional layer).
[52] B2. The spliced features of the target images of the two channels are calculated by a second convolutional layer and the Sigmoid activation function to obtain a second weight coefficient $\theta_{s}(Z')$.
[53] Then, the weight coefficient $\theta_{s}(Z')$ is obtained by the 7x7 convolutional layer and the Sigmoid activation function. Finally, the weight coefficient $\theta_{s}(Z')$ is multiplied with the input feature $Z'$ to obtain a second weighted new feature $F_{s}$.
[54] Therein, the second weight coefficient is represented as follows:

$\theta_{s}(Z') = \sigma(f^{7\times7}([F_{avg}(Z'); F_{max}(Z')]))$  (5)

[55] In the formula, $\theta_{s}(Z')$ is the second weight coefficient, $f^{7\times7}$ represents that the receptive field of the convolution kernel is 7x7, $Z'$ represents the first weighted new feature obtained above, and $\sigma$ represents the Sigmoid activation function.
[56] C2. The second weight coefficient is multiplied with the first weighted new feature to obtain a second weighted new feature.
[57] The second weighted new feature is represented as follows:

$F_{s} = \theta_{s}(Z') \otimes Z'$  (6)

[58] In the formula, $F_{s}$ is the second weighted new feature.
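For completeness, a hedged PyTorch sketch of the spatial attention module corresponding to formulas (5) and (6) follows; padding the 7x7 convolution so that the spatial size is preserved is an assumption. In the backbone sketched earlier, ChannelAttention followed by SpatialAttention would take the place of the combined gate between convolutional layer 1 and convolutional layer 2.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """theta_s(Z') = sigmoid(f7x7([F_avg(Z'); F_max(Z')])), F_s = theta_s(Z') ⊗ Z'."""
    def __init__(self):
        super().__init__()
        # 7x7 receptive field; padding assumed so that the spatial size is preserved.
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, z_prime):
        avg = z_prime.mean(dim=1, keepdim=True)         # F_avg(Z'): HxWx1 map
        mx = z_prime.amax(dim=1, keepdim=True)          # F_max(Z'): HxWx1 map
        theta_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return theta_s * z_prime                        # F_s, the second weighted feature
```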
[59] Further, the similarity matching is performed in the target image candidate region. In other words, the similarities of all translated child windows are calculated in a dense grid, and the specific operation is as shown in Formula (2). Namely, mutual association is performed by the convolutional inline function $p(\cdot)$ to generate the output response diagram used for representing the similarity score between the deep features of the target image samples after the two input target images pass through the siamese network framework.
[60] The target candidate blocks described here are all obtained by the search branch, and the corresponding size is “22x22x128”. The above-mentioned similarity score is obtained by similarity comparison between the target candidate blocks (which are essentially target image features) in the search branch and the sample image features in the template branch.
[61] S105, a target candidate block with a maximum similarity score acquired thereby is used for target tracking.
[62] This step specifically includes: the similarity between the deep features of the target image samples (in the template branch) and the deep features of the candidate target image samples (in the search branch) is calculated and compared, and the target image region with the maximum similarity score found in subsequent frames is determined as the expected result, thereby achieving target tracking.
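A hedged sketch of this selection step is shown below: the location of the maximum value in the response map is mapped back to search-region pixel coordinates. The total stride of 8 and the centre-offset convention are assumptions that depend on the backbone configuration.

```python
import torch

def locate_target(response, stride=8, search_size=255):
    """Pick the candidate with the maximum similarity score and map its
    response-map position back to search-region pixel coordinates."""
    response = response.squeeze()                       # e.g. a (17, 17) score map
    idx = torch.argmax(response)
    row, col = divmod(idx.item(), response.shape[1])
    centre = (response.shape[0] - 1) / 2.0
    # Displacement of the best candidate from the search-region centre, in pixels.
    dx = (col - centre) * stride
    dy = (row - centre) * stride
    return search_size / 2.0 + dx, search_size / 2.0 + dy
```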

Claims (3)

WHAT IS CLAIMED IS:
1. A siamese network target tracking method based on channel and spatial attention mechanisms, comprising the following steps: step I: processing a video or image data set to obtain a plurality of target images having a uniform image size; step II: constructing and obtaining a new backbone network model on the basis of a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism; step III: extracting training samples from the plurality of target images to train the new backbone network model; step IV: extracting deep features of target image samples from the plurality of target images by the well-trained new backbone network model, and performing similarity matching on the deep features of the target image samples in a target image candidate region to obtain a plurality of target candidate blocks, with each target candidate block corresponding to a similarity score; step V: utilizing a target candidate block with a maximum similarity score acquired, thereby performing target tracking.
2. The siamese network target tracking method based on the channel and spatial attention mechanisms according to claim 1, wherein the new backbone network model is a siamese network framework which comprises a template branch and a search branch; and the step of extracting the training samples from the plurality of target images comprises: when expanding a child window for searching the target images beyond the scope of the target images, filling an image missing part by an RGB mean value.
3. The siamese network target tracking method based on the channel and spatial attention mechanisms according to claim 2, wherein the sizes of target image features respectively extracted by the template branch and the search branch in the siamese network framework are "6x6x128" and "22x22x128".
LU102992A 2022-08-02 2022-08-02 Siamese network target tracking method based on channel and spatial attention mechanisms LU102992B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
LU102992A LU102992B1 (en) 2022-08-02 2022-08-02 Siamese network target tracking method based on channel and spatial attention mechanisms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
LU102992A LU102992B1 (en) 2022-08-02 2022-08-02 Siamese network target tracking method based on channel and spatial attention mechanisms

Publications (1)

Publication Number Publication Date
LU102992B1 true LU102992B1 (en) 2024-02-02

Family

ID=89720120

Family Applications (1)

Application Number Title Priority Date Filing Date
LU102992A LU102992B1 (en) 2022-08-02 2022-08-02 Siamese network target tracking method based on channel and spatial attention mechanisms

Country Status (1)

Country Link
LU (1) LU102992B1 (en)


Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20240202