CN113283407A - Twin network target tracking method based on channel and space attention mechanism

Info

Publication number: CN113283407A
Application number: CN202110828947.1A
Authority: CN (China)
Prior art keywords: target, attention mechanism, channel, target image, network model
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王军, 孟晨晨, 邓承志, 王员云, 张珮芸
Current Assignee: Nanchang Institute of Technology
Original Assignee: Nanchang Institute of Technology
Filing date: 2021-07-22
Publication date: 2021-08-20
Application filed by Nanchang Institute of Technology
Priority to CN202110828947.1A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention provides a twin network target tracking method based on channel and spatial attention mechanisms, comprising the following steps: processing a video or image data set to obtain a plurality of target images of uniform size; constructing a novel backbone network model from a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism; extracting training samples from the target images to train the novel backbone network model; extracting deep features of a target image sample from the target images with the trained model, and performing similarity matching of those deep features in the target image candidate region to obtain a plurality of target candidate blocks, each with a corresponding similarity score; and tracking the target using the candidate block with the maximum similarity score. The appearance model of the tracking algorithm designed by the invention offers better robustness and accuracy.

Description

Twin network target tracking method based on channel and space attention mechanism
Technical Field
The invention relates to the technical field of computer vision, in particular to a twin network target tracking method based on a channel and space attention mechanism.
Background
Target tracking is an important topic in computer vision, with practical applications in automatic driving, video monitoring, video analysis, medical treatment, the military and other fields. Because application scenes are wide-ranging and complex, targets tracked against cluttered backgrounds often deform, and challenges such as motion blur and occlusion arise. Given the demands of commercial, industrial, military and medical applications, research on target tracking technology is extremely valuable.
Generally, target tracking algorithms fall into two types: discriminative and generative. Discriminative model-based algorithms can effectively distinguish the tracked object from the surrounding background: they compare the target image sample with candidate samples in a given search region using a learned similarity function. In recent years, with the advent of large-scale labeled public image data sets and rapid advances in computer hardware and software, deep learning has been highly successful across image processing. In particular, discriminative correlation filters based on deep learning have been successfully applied to target tracking because of their fast operation. Tracking algorithms based on twin networks have also gained wide attention in target tracking tasks: a twin network architecture performs template matching against detected target candidate samples and computes the highest similarity between the target region and the candidate region to obtain the position of the target image.
However, the prior art does not combine a convolutional neural network model with channel and spatial attention mechanisms simultaneously when performing visual target tracking, and the resulting accuracy and robustness are not ideal.
Disclosure of Invention
In view of the above, there is a need to address the problem that prior-art visual target tracking, which does not simultaneously combine a convolutional neural network model with channel and spatial attention mechanisms, achieves accuracy and robustness that are not ideal.
The embodiment of the invention provides a twin network target tracking method based on a channel and space attention mechanism, wherein the method comprises the following steps:
Step 1: processing the video or image data set to obtain a plurality of target images of uniform image size;
Step 2: constructing a novel backbone network model based on a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism;
Step 3: extracting training samples from the plurality of target images to train the novel backbone network model;
Step 4: extracting deep features of a target image sample from the plurality of target images using the trained novel backbone network model, and performing similarity matching on those deep features in the target image candidate region to obtain a plurality of target candidate blocks, each corresponding to a similarity score;
Step 5: tracking the target using the acquired target candidate block with the maximum similarity score.
The invention provides a twin network target tracking method based on channel and spatial attention mechanisms. A video or image data set is first processed to obtain target images of uniform size. A novel backbone network model is then constructed jointly from a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism. Training samples are extracted from the target images to train the model. The trained model extracts deep features of the target image sample from the target images, similarity matching in the target image candidate region yields a plurality of target candidate blocks, and the candidate block with the maximum similarity score is finally used for target tracking.
The method uses GOT-10k as the training set to adjust the model parameters during offline training, so that targets in the video are represented more accurately, and performs feature extraction with a lightweight convolutional neural network model. The appearance model of the tracking algorithm designed by the invention offers better robustness and accuracy.
The twin network target tracking method based on the channel and space attention mechanism is characterized in that the novel backbone network model is a twin network framework, and the twin network framework comprises a template branch and a search branch;
wherein the step of extracting training samples from the plurality of target images comprises:
when the sub-window searching the target image extends beyond the range of the target image, the missing image portion is filled with the average RGB values.
In the twin network target tracking method based on the channel and space attention mechanism, within the twin network framework, the method comprises:
inputting a target image through the template branch and the search branch respectively, and acquiring deep features of a target image sample according to the template branch and the search branch;
in the twin network framework, the following formula holds:

$h(L_{k\tau}x) = L_{\tau}h(x)$

where $h$ denotes the mapping function of the input-output signal, $k$ denotes the stride length, $\tau$ is the translation value of the active area in the input-output signal, $L_{k\tau}$ and $L_{\tau}$ both represent translation operators, and $x$ represents the input target image.
The twin network target tracking method based on the channel and space attention mechanism is characterized in that in Step 4, the similarity score is expressed by the following formula:

$f(z, x) = \varphi(z) * \varphi(x) + b\mathbb{1}$

where $f(z, x)$ represents the similarity score between the two input target images, $b\mathbb{1}$ denotes a signal taking the deviation value $b$ at every location, $b \in \mathbb{R}$ with $\mathbb{R}$ the set of real numbers, $\varphi(z)$ and $\varphi(x)$ represent the output features of the two input target images after passing through the twin network framework, $z$ and $x$ represent the two input target images, and $\varphi$ represents the convolution embedding function.
In the twin network target tracking method based on the channel and space attention mechanism, in the step of extracting deep features of target image samples from the plurality of target images using the trained novel backbone network model, the channel attention mechanism performs the following steps:
obtaining the features of the target images of the two channels through maximum pooling and global average pooling;

inputting the features obtained after the maximum pooling and the global average pooling into a multilayer perceptron network, and obtaining a feature vector after element-wise summation;

passing the feature vector through a Sigmoid activation function to obtain a first weight coefficient, and multiplying the first weight coefficient with the input target image $Z$ to obtain a first weighted new feature.
The twin network target tracking method based on the channel and space attention mechanism is characterized in that the first weight coefficient is expressed as:

$M_c(Z) = \sigma\big(W_1(\delta(W_0(\mathrm{AvgPool}(Z)))) + W_1(\delta(W_0(\mathrm{MaxPool}(Z))))\big)$

where $M_c(Z)$ is the first weight coefficient, $\sigma$ denotes the Sigmoid activation function, $W_0$ and $W_1$ represent the weights of the shared multilayer perceptron network, $\delta$ represents the ReLU function, $\mathrm{AvgPool}$ is the global average pooling function, $\mathrm{MaxPool}$ is the maximum pooling function, and $Z$ represents the input target image;

the first weighted new feature is represented as:

$Z' = M_c(Z) \otimes Z$

where $Z'$ represents the first weighted new feature and $\otimes$ represents multiplication at the element level.
In the twin network target tracking method based on the channel and spatial attention mechanism, in the step of extracting deep features of target image samples from the plurality of target images using the trained novel backbone network model, the spatial attention mechanism performs the following steps:
obtaining the features of the target images of the two channels through maximum pooling and global average pooling, and splicing the features of the two channels through the first convolutional layer;

calculating the spliced features through a second convolutional layer and a Sigmoid activation function to obtain a second weight coefficient;

multiplying the second weight coefficient and the first weighted new feature to obtain a second weighted new feature.
The twin network target tracking method based on the channel and space attention mechanism is characterized in that the second weight coefficient is expressed as:

$M_s(Z') = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(Z'); \mathrm{MaxPool}(Z')])\big)$

where $M_s(Z')$ is the second weight coefficient, $f^{7\times 7}$ denotes a convolution whose kernel has a 7 × 7 perceptual domain, $Z'$ is the first weighted new feature, and $\sigma$ represents the Sigmoid activation function;

the second weighted new feature is represented as:

$Z'' = M_s(Z') \otimes Z'$

where $Z''$ is the second weighted new feature.
In the twin network target tracking method based on the channel and space attention mechanism, in the step of constructing the novel backbone network model based on the convolutional neural network model, the channel attention mechanism and the spatial attention mechanism:

training is performed with the plurality of target images as a training data set, wherein the training data set contains 560 moving objects and 87 motion pattern classes;

a stochastic gradient descent method is used in the training construction, with the momentum set to 0.9.
In the twin network target tracking method based on the channel and space attention mechanism, the target image features extracted by the template branch and the search branch in the twin network framework have sizes 6 × 6 × 128 and 22 × 22 × 128 respectively.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of a twin network target tracking method based on a channel and space attention mechanism according to the present invention;
FIG. 2 is a schematic diagram of a twin network target tracking method based on a channel and space attention mechanism according to the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
These and other aspects of embodiments of the invention will be apparent with reference to the following description and attached drawings. In the description and drawings, particular embodiments of the invention have been disclosed in detail as being indicative of some of the ways in which the principles of the embodiments of the invention may be practiced, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
In the prior art, when visual target tracking is carried out, a convolutional neural network model, a channel attention mechanism and a space attention mechanism are not combined at the same time, and the accuracy and robustness of target tracking are not ideal.
In order to solve the technical problem, the present invention provides a twin network target tracking method based on channel and spatial attention mechanism, please refer to fig. 1 and fig. 2, wherein the method includes the following steps:
s101, processing the video or image data set to obtain a plurality of target images with uniform image sizes.
In this step, the images in the video or image data set need to be processed to a uniform size. Working with uniformly sized target images makes it convenient to input them and extract their deep features in the subsequent tracking stage.
And S102, constructing and obtaining a novel backbone network model based on the convolutional neural network model, the channel attention mechanism and the space attention mechanism.
In this embodiment, the novel backbone network model is a twin network framework comprising a template branch and a search branch. As shown in FIG. 2, $z$ corresponds to the template branch and $x$ corresponds to the search branch.

Additionally, in FIG. 2, a convolutional neural network model, a channel attention module and a spatial attention module are integrated to construct the novel backbone network model within the middle dashed box. The convolutional neural network model comprises convolutional layers 1 through 5, with the channel attention module and the spatial attention module located between convolutional layer 1 and convolutional layer 2. These modules process the deep features of the target image sample extracted from the target image in the subsequent steps.
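For concreteness, a minimal PyTorch sketch of such a backbone follows. The patent fixes only the overall structure (five convolutional layers, 128 output channels, attention between layers 1 and 2) and the feature sizes; the kernel sizes, strides, channel widths and batch normalization below are illustrative assumptions in the style of an AlexNet-like tracking backbone, chosen so that a 127 × 127 × 3 template yields a 6 × 6 × 128 feature map and a 255 × 255 × 3 search image yields 22 × 22 × 128. The ChannelAttention and SpatialAttention modules are sketched under formulas (3) and (5) below.

```python
import torch.nn as nn

class AttentionBackbone(nn.Module):
    """Sketch of the novel backbone: 5 conv layers, with channel and
    spatial attention modules inserted between conv1 and conv2."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(                  # conv layer 1 + pooling
            nn.Conv2d(3, 96, kernel_size=11, stride=2),
            nn.BatchNorm2d(96), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2))
        self.channel_att = ChannelAttention(96)      # see formula (3) sketch
        self.spatial_att = SpatialAttention()        # see formula (5) sketch
        self.conv2 = nn.Sequential(
            nn.Conv2d(96, 256, kernel_size=5),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2))
        self.conv3 = nn.Sequential(
            nn.Conv2d(256, 384, kernel_size=3),
            nn.BatchNorm2d(384), nn.ReLU(inplace=True))
        self.conv4 = nn.Sequential(
            nn.Conv2d(384, 384, kernel_size=3),
            nn.BatchNorm2d(384), nn.ReLU(inplace=True))
        self.conv5 = nn.Conv2d(384, 128, kernel_size=3)  # 128-channel output

    def forward(self, x):
        x = self.conv1(x)
        x = self.channel_att(x) * x   # Z'  = M_c(Z)  ⊗ Z,  formula (4)
        x = self.spatial_att(x) * x   # Z'' = M_s(Z') ⊗ Z', formula (6)
        return self.conv5(self.conv4(self.conv3(self.conv2(x))))
```

With these assumed strides, the template branch produces a 6 × 6 × 128 feature map and the search branch a 22 × 22 × 128 map, matching the sizes stated later in this description.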
S103, extracting training samples from the target images to train the novel backbone network model.
During training, the picture size needs to be determined according to the complexity of the model and the size of the video memory. In the present invention, the sample image input to the template branch has size 127 × 127 × 3, and the target image input to the search branch has size 255 × 255 × 3.
In the step of extracting the training samples from the plurality of target images, it is additionally noted that:
when the sub-window searching for the target image extends beyond the range of the target image, the missing image portion is filled with the average RGB values. In the subsequent testing stage (including step S104 and step S105), the target images of the two channels are respectively introduced into the template branch and the search branch of the twin network framework to obtain the deep features of the target image sample.
It should be noted that the plurality of target images serve as the training data set, which contains 560 moving objects and 87 motion pattern classes. The training data set also provides video clips of over 10,000 real-world moving objects with more than 1.5 million manually labeled bounding boxes. The novel backbone network model designed above can be trained end to end on the large-scale GOT-10k data set.
In addition, a stochastic gradient descent (SGD) method is used in the training construction, with the momentum set to 0.9. The learning rate decreases at each iteration from an initial value of 0.01 to a final value of 0.00001. The novel backbone network model proposed in the invention is trained for a total of 50 epochs, with the weight decay set to 0.0005 and a batch size of 16.
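These hyperparameters map directly onto a standard optimizer setup. The sketch below assumes a geometric (exponential) decay so that the learning rate falls from 0.01 to 0.00001 over the 50 epochs, which is one natural reading of "decreases from the initial learning rate to the final learning rate".

```python
import torch

EPOCHS, LR_INITIAL, LR_FINAL = 50, 1e-2, 1e-5

model = AttentionBackbone()                       # from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=LR_INITIAL,
                            momentum=0.9, weight_decay=5e-4)
# Decay factor chosen so the rate reaches LR_FINAL at the last epoch.
gamma = (LR_FINAL / LR_INITIAL) ** (1.0 / (EPOCHS - 1))
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

# Training loop outline (batch size 16, per the patent):
# for epoch in range(EPOCHS):
#     for z, x, labels in loader:
#         ...forward, loss, backward, optimizer.step()...
#     scheduler.step()
```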
And S104, extracting deep features of the target image sample from the plurality of target images by using the trained novel backbone network model, and performing similarity matching on the deep features of the target image sample in a target image candidate region to obtain a plurality of target candidate blocks, wherein each target candidate block corresponds to a similarity score.
Specifically, in the novel backbone network model, the convolutional neural network model (CNN model) comprises 5 convolutional layers but no fully connected layer. The channel and spatial attention mechanisms consist of a channel attention module and a spatial attention module, arranged in the order channel attention module then spatial attention module and constructed after the first convolutional and pooling layers. The receptive field of the spatial attention module employs a 7 × 7 convolution kernel.
In the twin network framework, the following formula holds:

$h(L_{k\tau}x) = L_{\tau}h(x)$    (1)

where $h$ denotes the mapping function of the input-output signal, $k$ denotes the stride length, $\tau$ is the translation value of the active area in the input-output signal, $L_{k\tau}$ and $L_{\tau}$ both represent translation operators, and $x$ represents the input target image.
Furthermore, a convolution embedding function $\varphi$ is typically used to correlate the two input target images $z$ and $x$ and generate an output response map representing the similarity score between the deep features of the target image sample after the two input target images pass through the twin network framework.
The formula for the similarity score is expressed as:

$f(z, x) = \varphi(z) * \varphi(x) + b\mathbb{1}$    (2)

where $f(z, x)$ represents the similarity score between the two input target images, $b\mathbb{1}$ denotes a signal taking the deviation value $b$ at every location, $b \in \mathbb{R}$ with $\mathbb{R}$ the set of real numbers, $\varphi(z)$ and $\varphi(x)$ represent the output features of the two input target images after passing through the twin network framework, $z$ and $x$ represent the two input target images, and $\varphi$ represents the convolution embedding function.
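Formula (2) is the cross-correlation of the two branch features plus a constant bias. A minimal PyTorch sketch follows; the grouped-convolution trick for batching the correlation is a common implementation device, not something the patent prescribes.

```python
import torch
import torch.nn.functional as F

def similarity_map(phi_z, phi_x, b=0.0):
    """Formula (2): f(z, x) = phi(z) ⋆ phi(x) + b·1.
    phi_z: (B, C, 6, 6) template features; phi_x: (B, C, 22, 22) search
    features. Returns a (B, 1, 17, 17) response map of similarity scores."""
    B, C, h, w = phi_z.shape
    # Each template acts as the convolution kernel for its own search image.
    x = phi_x.reshape(1, B * C, phi_x.size(2), phi_x.size(3))
    out = F.conv2d(x, phi_z, groups=B)            # (1, B, 17, 17)
    return out.reshape(B, 1, out.size(2), out.size(3)) + b
```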
For the channel attention module described above, each channel of the feature map acts as a specific detector when extracting the relevant features of the input target image. Measures therefore need to be taken so that the channel attention module focuses on the specific features that are useful for the input target image.
Specifically, in the step of extracting deep features of the target image sample from the plurality of target images by using the trained novel backbone network model, the channel attention mechanism performs the following steps:
and A1, obtaining the characteristics of the target images of the two channels through maximum pooling and global average pooling.
In the present invention, the input target image $Z$ has size H × W × C. Max pooling and global average pooling are used to obtain the features of the target images of the two channels, each of size 1 × 1 × C.
B1: inputting the features of the target images of the two channels obtained after the maximum pooling and the global average pooling into the multilayer perceptron network, and summing element-wise to obtain a feature vector.
The features obtained after the maximum pooling and the global average pooling are then input into a multilayer perceptron network (MLP), in which the first layer has C/r neurons with a ReLU activation function and the second layer has C neurons; the parameters of the two layers are shared between the two pooled inputs. Element-wise summation of the two outputs yields a feature vector.
The resulting feature vector has size 1 × 1 × C.
C1: passing the feature vector through a Sigmoid activation function to obtain a first weight coefficient, and multiplying the first weight coefficient with the input target image $Z$ to obtain a first weighted new feature.
In this step, the first weight coefficient is expressed as:

$M_c(Z) = \sigma\big(W_1(\delta(W_0(\mathrm{AvgPool}(Z)))) + W_1(\delta(W_0(\mathrm{MaxPool}(Z))))\big)$    (3)

where $M_c(Z)$ is the first weight coefficient, $\sigma$ denotes the Sigmoid activation function, $W_0$ and $W_1$ represent the weights of the shared multilayer perceptron network, $\delta$ represents the ReLU function, $\mathrm{AvgPool}$ is the global average pooling function, and $\mathrm{MaxPool}$ is the maximum pooling function.

The first weighted new feature is represented as:

$Z' = M_c(Z) \otimes Z$    (4)

where $Z'$ represents the first weighted new feature, $\otimes$ represents multiplication at the element level, and $Z$ represents the input target image.
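The following is a minimal PyTorch sketch of formulas (3) and (4). The reduction ratio r is an assumed hyperparameter (the patent states only that the first MLP layer has C/r neurons), and 1 × 1 convolutions stand in for the shared fully connected layers, a standard equivalence for 1 × 1 × C inputs.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Returns the first weight coefficient M_c(Z) of formula (3); the
    caller forms the first weighted feature Z' = M_c(Z) ⊗ Z, formula (4)."""
    def __init__(self, channels, r=16):
        super().__init__()
        # Shared two-layer perceptron: C -> C/r (ReLU) -> C.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1, bias=False))

    def forward(self, z):
        avg = self.mlp(torch.mean(z, dim=(2, 3), keepdim=True))  # AvgPool branch
        mx = self.mlp(torch.amax(z, dim=(2, 3), keepdim=True))   # MaxPool branch
        return torch.sigmoid(avg + mx)            # M_c(Z), shape (B, C, 1, 1)

# z_prime = ChannelAttention(C)(z) * z            # formula (4)
```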
Further, after the channel attention module, a spatial attention module is introduced to focus on which features in the input target image are meaningful. Specifically, in the step of extracting deep features of the target image sample from the plurality of target images by using the trained novel backbone network model, the spatial attention mechanism performs the following steps:
and A2, obtaining the characteristics of the target images of the two channels through maximum pooling and global average pooling, and splicing the characteristics of the target images of the two channels through the first convolution layer.
Similar to the channel attention module, the input target image $Z'$ has size H × W × C. Max pooling and global average pooling along the channel dimension are used to obtain the features of the target images of the two channels, each of size H × W × 1, and these are spliced together through a standard convolutional layer (the first convolutional layer).
B2: calculating the spliced features of the target images of the two channels through a second convolutional layer and a Sigmoid activation function to obtain a second weight coefficient.
Then the second weight coefficient $M_s(Z')$ is obtained through the 7 × 7 convolutional layer and a Sigmoid activation function. Finally, the weight coefficient $M_s(Z')$ is multiplied with the input target image $Z'$ to obtain the second weighted new feature $Z''$.
The second weight coefficient is expressed as:

$M_s(Z') = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(Z'); \mathrm{MaxPool}(Z')])\big)$    (5)

where $M_s(Z')$ is the second weight coefficient, $f^{7\times 7}$ denotes a convolution whose kernel has a 7 × 7 perceptual domain, $Z'$ is the first weighted new feature, and $\sigma$ represents the Sigmoid activation function.
C2: multiplying the second weight coefficient and the first weighted new feature to obtain a second weighted new feature.
The second weighted new feature is represented as:

$Z'' = M_s(Z') \otimes Z'$    (6)

where $Z''$ is the second weighted new feature.
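A matching PyTorch sketch of formulas (5) and (6); padding of 3 is assumed so that the 7 × 7 convolution preserves the spatial size of Z'.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Returns the second weight coefficient M_s(Z') of formula (5); the
    caller forms the second weighted feature Z'' = M_s(Z') ⊗ Z', formula (6)."""
    def __init__(self):
        super().__init__()
        # One convolution with a 7 x 7 receptive field over the two pooled maps.
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, z_prime):
        avg = torch.mean(z_prime, dim=1, keepdim=True)  # H x W x 1 average map
        mx = torch.amax(z_prime, dim=1, keepdim=True)   # H x W x 1 max map
        cat = torch.cat([avg, mx], dim=1)               # splice the two maps
        return torch.sigmoid(self.conv(cat))            # M_s(Z')

# z_double_prime = SpatialAttention()(z_prime) * z_prime   # formula (6)
```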
For step S104, in summary: in the test tracking phase, the convolutional features of the target images from the two branches of the original twin network structure contain no background context information. The tracker therefore has difficulty distinguishing the target from complex background information and is prone to tracking drift and failure. For this reason, the trained novel backbone network model is used to extract deep features of the target image sample that are distinguished from the background information, so as to focus on important features and suppress useless information.
The sequential channel attention module and spatial attention module then assign greater weight to these features, and play an important role in improving the discriminative ability of the tracker. Finally, the target image features extracted by the template branch and the search branch in the twin network framework have sizes 6 × 6 × 128 and 22 × 22 × 128 respectively.
Similarity matching is then performed in the target image candidate region; that is, the similarity of all translated sub-windows is calculated on a dense grid, as in formula (2) above: the convolution embedding function $\varphi$ correlates the two inputs to generate an output response map representing the similarity score between the deep features of the target image sample after the two input target images pass through the twin network framework.

The target candidate blocks described here are all obtained from the search branch, with corresponding size 22 × 22 × 128. The similarity score is obtained by comparing the similarity between the target candidate block in the search branch (which is in essence also a target image feature) and the sample image feature in the template branch.
And S105, performing target tracking by using the acquired target candidate block with the maximum similarity score.
In this step, the target candidate block with the maximum similarity score is used for target tracking. Specifically, the similarity between the deep features of the target image sample (in the template branch) and the deep features of the candidate target image samples (in the search branch) is calculated and compared, and the region with the highest similarity score in the subsequent frame is taken as the predicted target image, thereby realizing target tracking.
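One common way to turn the response map into a predicted position is to take its peak and map the peak's offset back to search-image coordinates. The sketch below assumes a total network stride of 8 (2 from conv1 and 2 from each of the two pooling layers in the earlier backbone sketch) and omits the response upsampling and cosine-window penalty that trackers of this family often add.

```python
import torch

def locate_target(response, total_stride=8, search_center=127.0):
    """Map the peak of a (1, 1, 17, 17) response map to (x, y) pixel
    coordinates in the 255 x 255 search image (stride value assumed)."""
    _, _, H, W = response.shape
    idx = torch.argmax(response.reshape(-1)).item()
    row, col = divmod(idx, W)
    # Offset of the peak from the map center, scaled by the network stride.
    dx = (col - (W - 1) / 2) * total_stride
    dy = (row - (H - 1) / 2) * total_stride
    return search_center + dx, search_center + dy
```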
The invention provides a twin network target tracking method based on channel and spatial attention mechanisms. A video or image data set is first processed to obtain target images of uniform size; a novel backbone network model is then constructed jointly from a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism; training samples are extracted from the target images to train the model; the trained model extracts deep features of the target image sample from the target images, similarity matching in the target image candidate region yields the corresponding similarity scores, and target tracking is finally performed with the target candidate block having the maximum similarity score.
The method uses GOT-10k as the training set to adjust the model parameters during offline training, so that targets in the video are represented more accurately, and performs feature extraction with a lightweight convolutional neural network model. The appearance model of the tracking algorithm designed by the invention offers better robustness and accuracy.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the appended claims.

Claims (10)

1. A twin network target tracking method based on a channel and space attention mechanism is characterized by comprising the following steps:
Step 1: processing the video or image data set to obtain a plurality of target images of uniform image size;
Step 2: constructing a novel backbone network model based on a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism;
Step 3: extracting training samples from the plurality of target images to train the novel backbone network model;
Step 4: extracting deep features of a target image sample from the plurality of target images using the trained novel backbone network model, and performing similarity matching on those deep features in the target image candidate region to obtain a plurality of target candidate blocks, each corresponding to a similarity score;
Step 5: tracking the target using the acquired target candidate block with the maximum similarity score.
2. The twin network target tracking method based on the channel and space attention mechanism as claimed in claim 1, wherein the novel backbone network model is a twin network framework, the twin network framework comprises a template branch and a search branch;
wherein the step of extracting training samples from the plurality of target images comprises:
when the sub-window searching the target image extends beyond the range of the target image, the missing image portion is filled with the average RGB values.
3. The twin network target tracking method based on the channel and space attention mechanism as claimed in claim 2, wherein in the twin network framework, the method comprises:
inputting a target image through the template branch and the search branch respectively, and acquiring deep features of a target image sample according to the template branch and the search branch;
in the twin network framework, the following formula holds:

$h(L_{k\tau}x) = L_{\tau}h(x)$

where $h$ denotes the mapping function of the input-output signal, $k$ denotes the stride length, $\tau$ is the translation value of the active area in the input-output signal, $L_{k\tau}$ and $L_{\tau}$ both represent translation operators, and $x$ represents the input target image.
4. The twin network target tracking method based on the channel and space attention mechanism as claimed in claim 3, wherein in Step 4 the formula for the similarity score is expressed as:

$f(z, x) = \varphi(z) * \varphi(x) + b\mathbb{1}$

where $f(z, x)$ represents the similarity score between the two input target images, $b\mathbb{1}$ denotes a signal taking the deviation value $b$ at every location, $b \in \mathbb{R}$ with $\mathbb{R}$ the set of real numbers, $\varphi(z)$ and $\varphi(x)$ represent the output features of the two input target images after passing through the twin network framework, $z$ and $x$ represent the two input target images, and $\varphi$ represents the convolution embedding function.
5. The method of claim 4, wherein in the step of extracting deep features of the target image samples from the plurality of target images by using the trained novel backbone network model, the channel attention mechanism performs the following steps:
obtaining the features of the target images of the two channels through maximum pooling and global average pooling;

inputting the features obtained after the maximum pooling and the global average pooling into a multilayer perceptron network, and obtaining a feature vector after element-wise summation;

and obtaining a first weight coefficient by passing the feature vector through a Sigmoid activation function, and multiplying the first weight coefficient with an input target image to obtain a first weighted new feature.
6. The twin network target tracking method based on the channel and space attention mechanism as claimed in claim 5, wherein the first weight coefficient is expressed as:
$M_c(Z) = \sigma\big(W_1(\delta(W_0(\mathrm{AvgPool}(Z)))) + W_1(\delta(W_0(\mathrm{MaxPool}(Z))))\big)$

where $M_c(Z)$ is the first weight coefficient, $\sigma$ denotes the Sigmoid activation function, $W_0$ and $W_1$ represent the weights of the shared multilayer perceptron network, $\delta$ represents the ReLU function, $\mathrm{AvgPool}$ is the global average pooling function, $\mathrm{MaxPool}$ is the maximum pooling function, and $Z$ represents the input target image;

the first weighted new feature is represented as:

$Z' = M_c(Z) \otimes Z$

where $Z'$ represents the first weighted new feature and $\otimes$ represents multiplication at the element level.
7. The twin network target tracking method based on channel and spatial attention mechanism of claim 6, wherein in the step of extracting deep features of target image samples from the plurality of target images by using the trained novel backbone network model, the spatial attention mechanism performs the following steps:
obtaining the features of the target images of the two channels through maximum pooling and global average pooling, and splicing the features of the two channels through the first convolutional layer;

calculating the spliced features through a second convolutional layer and a Sigmoid activation function to obtain a second weight coefficient;

and multiplying the second weight coefficient and the first weighted new feature to obtain a second weighted new feature.
8. The twin network target tracking method based on the channel and space attention mechanism as claimed in claim 7,
the second weight coefficient is expressed as:

$M_s(Z') = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(Z'); \mathrm{MaxPool}(Z')])\big)$

where $M_s(Z')$ is the second weight coefficient, $f^{7\times 7}$ denotes a convolution whose kernel has a 7 × 7 perceptual domain, $Z'$ is the first weighted new feature, and $\sigma$ represents the Sigmoid activation function;

the second weighted new feature is represented as:

$Z'' = M_s(Z') \otimes Z'$

where $Z''$ is the second weighted new feature.
9. The twin network target tracking method based on channel and space attention mechanism as claimed in claim 1, wherein in the step of constructing a new backbone network model based on the convolutional neural network model, the channel attention mechanism and the space attention mechanism,
training with the plurality of target images as a training data set, wherein the training data set contains 560 moving objects and 87 motion pattern classes;
a stochastic gradient descent method is used in the training construction, with the momentum set to 0.9.
10. The twin network target tracking method based on the channel and space attention mechanism as claimed in claim 2, wherein the target image features extracted by the template branch and the search branch in the twin network framework have sizes 6 × 6 × 128 and 22 × 22 × 128 respectively.


Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination
- RJ01: Rejection of invention patent application after publication (application publication date: 2021-08-20)