CN115423847B - Twin multi-modal target tracking method based on Transformer - Google Patents

Twin multi-modal target tracking method based on Transformer

Info

Publication number
CN115423847B
CN115423847B (application number CN202211376018.2A)
Authority
CN
China
Prior art keywords
network
representing
frame
feature
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211376018.2A
Other languages
Chinese (zh)
Other versions
CN115423847A (en)
Inventor
王辉
韩星宇
范自柱
杨辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University filed Critical East China Jiaotong University
Priority to CN202211376018.2A priority Critical patent/CN115423847B/en
Publication of CN115423847A publication Critical patent/CN115423847A/en
Application granted granted Critical
Publication of CN115423847B publication Critical patent/CN115423847B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a Transformer-based twin multi-modal target tracking method, which acquires RGB image information and thermal image information in a scene; extracts high-level features of the different modalities through a pre-trained ResNet network while obtaining the features common to the modalities through a twin-network-based cross-modal feature fusion network; then feeds the high-level features of the corresponding modalities into a Transformer module designed for multi-modal input to perform cross-modal information fusion, feeds the result into a regression network based on a fully connected convolutional neural network to regress the final detection frame, back-propagates the errors produced in this process into each preceding network, and constructs a target tracking network from the final network weights so as to track a target under multi-modal conditions. The method can accurately predict the position information of an object in each modality, improves target tracking and positioning accuracy, and can be widely applied in a variety of scenes.

Description

Twin multi-modal target tracking method based on Transformer
Technical Field
The invention relates to the technical field of computer target tracking, in particular to a twin multi-modal target tracking method based on a Transformer.
Background
Visual target tracking that exploits both the RGB and thermal infrared (TIR) spectra, known as RGBT tracking for short, can effectively overcome the shortcomings of traditional tracking tasks, in which the target is easily lost and performance degrades under extreme illumination conditions. At present, common multi-modal target tracking methods fall into two broad categories: mathematical tracking methods based on traditional graphics, and feature matching methods based on twin networks.
A mathematical tracking method based on traditional graphics generally constructs a kernel function f over the target detection area, convolves it with a filtering template h, and then optimizes through a corresponding algorithm to obtain a globally optimal regression frame. However, such methods, for example target tracking based on correlation filtering, linear regression filtering, or multi-feature algorithms, have difficulty tracking objects with complex foregrounds, so that the target is easily lost or the target frame cannot be regressed accurately.
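For orientation only (this formulation is a textbook example of the correlation-filter family and is not taken from the patent itself), such methods typically learn the filtering template h by ridge regression against a desired response g:

\min_{h}\;\lVert f \ast h - g \rVert^{2} + \lambda \lVert h \rVert^{2}

where the asterisk denotes correlation/convolution over the detection area, g is usually a Gaussian-shaped target response, and lambda is a regularization weight; the tracked position in the next frame is then taken at the peak of the filter response.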
Disclosure of Invention
Therefore, the embodiment of the invention provides a Transformer-based twin multi-modal target tracking method to solve the above technical problems.
The invention provides a twin multi-modal target tracking method based on a Transformer, which comprises the following steps:
acquiring RGB image information and thermal image information in a current scene through a camera and a thermal imaging device;
secondly, respectively performing feature extraction on the RGB image information and the thermal image information by utilizing a pre-trained ResNet feature extraction network to correspondingly obtain RGB image features and thermal image features; aligning the RGB image information and the thermal image information by a method based on linear hypothesis, and performing feature extraction on the RGB image information and the thermal image information together by using a twin network based on ResNet to obtain RGB-thermal image features;
thirdly, matching the RGB image characteristics, the thermal image characteristics and the RGB-thermal image characteristics in pairs by using a characteristic fusion network based on a Transformer encoder to perform composite encoding so as to obtain an encoded characteristic diagram;
inputting the coded feature map into a Transformer-based feature matching network for expansion and matching to obtain a matching result of the template feature map and the background feature map, and expanding and re-matching this result using a cyclic-window attention matching mechanism to obtain a first feature map;
inputting the first feature map into a regressor based on a multilayer perceptron model to perform regression of a regression frame, returning an error calculation value based on a designed loss function and performing back propagation;
step six, evaluating the loss of the current regression frame through a fast gradient descent method, and, when the regression-frame loss reaches its minimum, finishing training and outputting each network weight file;
and step seven, constructing a multi-modal target tracker according to the finally obtained network weight files and determining the position of the tracked target in the image in real time.
The invention provides a Transformer-based twin multi-modal target tracking method, which acquires RGB image information and thermal image information in a scene; extracts high-level features of the different modalities through a pre-trained ResNet network while obtaining the features common to the modalities through a twin-network-based cross-modal feature fusion network; then feeds the high-level features of the corresponding modalities into a Transformer module designed for multi-modal input to perform cross-modal information fusion, feeds the result into a regression network based on a fully connected convolutional neural network to regress the final detection frame, back-propagates the errors produced in this process into each preceding network, and constructs a target tracking network from the final network weights so as to track a target under multi-modal conditions. The method can accurately predict the position information of an object in each modality, improves target tracking and positioning accuracy, and can be widely applied in a variety of scenes.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of embodiments of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of a Transformer-based twin multi-modal target tracking method according to the present invention;
FIG. 2 is a schematic block diagram of a twin multi-modal target tracking method based on Transformer according to the present invention;
FIG. 3 is a schematic execution diagram of the Transformer-based twin multi-modal target tracking method according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1 to fig. 3, the present invention provides a method for twin multi-modal target tracking based on Transformer, wherein the method includes the following steps:
s101, collecting RGB image information and thermal image information under a current scene through a camera and a thermal imaging device.
S102, respectively performing feature extraction on RGB image information and thermal image information by using a pre-trained ResNet feature extraction network to correspondingly obtain RGB image features and thermal image features; the method based on the linear hypothesis aligns the RGB image information with the thermal image information, and uses a ResNet-based twin network to perform feature extraction on the RGB image information and the thermal image information together to obtain RGB-thermal image features.
In the present invention, the above-mentioned ResNet feature extraction network is a ResNet50 feature extraction network, and specifically, in step S102, the method further includes:
s1021, pre-training data of the network on the ImageNet10k data set is extracted by using the ResNet50 features, and feature extraction is respectively carried out on the RGB image information and the thermal image information.
S1022, the RGB image in the RGB image information is adjusted according to the set image size and the given first frame data.
Specifically, in the step of adjusting the RGB image in the RGB image information, the corresponding expression is:
(The adjustment formula is reproduced only as an image in the original publication.) Its symbols denote: the output of the processed RGB image; the input of the current RGB image; the size of the current thermal image; the size of the current RGB image; and the offset of the image center point.
And S1023, carrying out constraint calculation on the ResNet50 feature extraction network by utilizing KL divergence to obtain a loss value of current output.
Specifically, in the step of performing constraint calculation on the ResNet50 feature extraction network by using the KL divergence to obtain the currently output loss value, the corresponding expression is as follows:
(The KL-divergence expression is reproduced only as an image in the original publication.) Its symbols denote: the loss value of the current output; the dimension of the output feature vector; the i-th column of the feature vector output by the RGB image through the ResNet50 feature extraction network; the i-th column of the feature vector output by the thermal image through the ResNet50 feature extraction network; and the column index i of the output feature vector.
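The formula image itself is not reproduced; a KL-divergence constraint consistent with the symbol descriptions above would take the standard form below, where the exact normalization is an assumption:

L_{KL} = \sum_{i=1}^{n} P^{RGB}_{i} \log \frac{P^{RGB}_{i}}{P^{T}_{i}}

with P^{RGB}_i and P^{T}_i the (normalized, e.g. softmax-processed) i-th columns of the feature vectors produced by the RGB and thermal branches of the ResNet50 network, and n the number of columns.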
And S1024, calculating to obtain a final network loss value corresponding to the whole network according to the currently output loss value.
Wherein the whole network is composed of a ResNet feature extraction network (equivalent to the RGB feature extraction network and the thermal feature extraction network in FIG. 2), a ResNet-based twin network (equivalent to the thermal-RGB fusion feature extraction network in FIG. 2), a feature fusion network based on a Transformer encoder (equivalent to the feature fusion module in FIG. 2), and a Transformer-based feature matching network (equivalent to the Transformer-based feature matching-expansion network in FIG. 2). It should be noted that, in FIG. 2, L denotes the number of current features, r denotes the size of the template, and d denotes the dimension of the current features. In FIG. 2, Q denotes the operation through the Query vector generation network, K denotes the operation through the Key vector generation network, and V denotes the operation through the Value vector generation network.
In this step, the final network loss value corresponding to the whole network is expressed as:
(The formula is reproduced only as an image in the original publication.) Its symbols denote: the final network loss value corresponding to the whole network; the loss value back-propagated by the subsequent network; and a hyper-parameter, whose value in the present embodiment is 0.97.
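The formula image is not reproduced; read together with the 0.97 hyper-parameter, a weighted combination of the following form is a plausible reading (an assumption, not the patent's verbatim expression):

L_{final} = L_{back} + \lambda \, L_{KL}, \qquad \lambda = 0.97

where L_{back} is the loss value back-propagated by the subsequent networks and L_{KL} is the KL-divergence loss of step S1023.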
S103, matching the RGB image characteristics, the thermal image characteristics and the RGB-thermal image characteristics in pairs by using a feature fusion network based on a Transformer encoder to perform composite encoding so as to obtain an encoded feature map.
In the step of combining the RGB image features, the thermal image features and the RGB-thermal image features in pairs for composite encoding to obtain the encoded feature map, the formula corresponding to the encoding operation is expressed as:
(The encoder formulas are reproduced only as images in the original publication.) Their symbols denote: the output of the encoder; the Softmax function; the feature vector of the RGB image produced by the ResNet50 feature extraction network; the RGB image; the thermal image; the feature vector of the thermal image produced by the ResNet50 feature extraction network; the dimension of the overall feature vector; the natural constant; a convolution operation; and the input of the current layer.
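As an illustrative sketch only (module names, dimensions and the residual wiring are assumptions; the patent gives no code), pairwise cross-modal composite encoding of this kind can be written with a standard multi-head attention layer:

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    # Minimal sketch: fuse two modality token sequences with cross-attention.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, L, dim) flattened feature maps of two modalities
        fused, _ = self.attn(query=feat_a, key=feat_b, value=feat_b)
        return self.norm(feat_a + fused)  # residual connection + LayerNorm

# Hypothetical pairwise encoding of RGB, thermal and RGB-thermal features.
fuse = CrossModalFusion(dim=256)
rgb, tir, rgbt = (torch.randn(1, 196, 256) for _ in range(3))
enc_rgb_tir = fuse(rgb, tir)
enc_rgb_rgbt = fuse(rgb, rgbt)
enc_tir_rgbt = fuse(tir, rgbt)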
Further, a formula corresponding to the feature matching network based on the Transformer is represented as:
(The formulas are reproduced only as images in the original publication.) Their symbols denote: the output of the Transformer network; the output of the Query vector generation network; the output of the Key vector generation network; the output of the Value vector generation network; the dimension of the current layer; the first type of learnable parameters; the second type of learnable parameters; and the matrix transposition operation.
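The formula images are not reproduced; the Query/Key/Value generation networks, the layer dimension and the matrix transposition described above match the standard scaled dot-product attention, which (as an assumed reconstruction) reads:

Q = XW_{Q}, \quad K = XW_{K}, \quad V = XW_{V}, \qquad \mathrm{Attention}(Q,K,V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V

where X is the input of the current layer, W_Q, W_K and W_V are the learnable projection matrices, and d is the dimension of the current layer.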
S104, inputting the coded feature map into a Transformer-based feature matching network for expansion and matching to obtain a matching result of the template feature map and the background feature map, and expanding and re-matching this result using a cyclic-window attention matching mechanism to obtain a first feature map.
Specifically, step S104 specifically includes:
s1041, translating the input template feature diagram up, down, left and right on the background feature diagram, and generating a matching thermodynamic diagram larger than the size of the original background feature diagram.
S1042, a template group of a set size is used to match the template frame against the background frame on the expanded coded feature map, so as to obtain the matching result of the template feature map and the background feature map; the matching is carried out with a fixed step size. (The sizes of the template group, the matching step size and the resulting first feature map are given by formulas reproduced only as images in the original publication; the remaining symbols denote the side length of the i-th template and the dimension of the i-th feature vector.)
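A minimal sketch of sliding-template matching that produces a heat map larger than the original background map follows (the depth-wise cross-correlation operator and padding scheme are assumptions, not the patent's exact matching rule):

import torch
import torch.nn.functional as F

def template_match(background, template, stride=1):
    # background: (B, C, H, W) feature map; template: (B, C, h, w) template group member.
    # Returns a matching heat map larger than the original background map.
    h, w = template.shape[-2:]
    padded = F.pad(background, (w - 1, w - 1, h - 1, h - 1))  # expand the search area
    heat = torch.stack([
        F.conv2d(padded[i:i + 1], template[i:i + 1], stride=stride)  # per-sample cross-correlation
        for i in range(background.size(0))
    ]).squeeze(1)
    return heat   # (B, 1, H + h - 1, W + w - 1) when stride = 1

bg = torch.randn(2, 64, 31, 31)
tpl = torch.randn(2, 64, 7, 7)
heatmap = template_match(bg, tpl)   # (2, 1, 37, 37)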
And S105, inputting the first feature map into a regressor based on a multilayer perceptron model to perform regression of a regression frame, and returning an error calculation value based on a designed loss function and performing back propagation.
In step S105, the first feature map is input to the MLP-based regressor to perform regression of the regression frame, and the corresponding formula is expressed as:
(The formulas are reproduced only as images in the original publication.) Their symbols denote: the output of the multilayer perceptron model network; and the final regression result.
Further, for the loss function of the output, there is the following formula:
(The loss-function formulas are reproduced only as images in the original publication.) Their symbols denote: the loss of the current frame with respect to the real frame; the degree of overlap between the current frame and the real frame; the difference between the coordinate positions of the current frame and the real frame; the difference between the coordinate sizes of the current frame and the real frame; the mean square error of the abscissas of the current frame and the real frame; the mean square error of the ordinates of the current frame and the real frame; the mean square error of the current frame and the real frame along the abscissa or ordinate; the height difference between the current frame and the real frame; the width difference between the current frame and the real frame; the abscissa of the real frame of the target; the ordinate of the real frame of the target; the abscissa of the target predicted by the tracker; the ordinate of the target predicted by the tracker; the scaling coefficient between the current frame and the real frame; the ratio of the size of the current frame to that of the real frame; the mean square error of the coordinate sizes of the current frame and the real frame; the scaling factor of the width between the current frame and the real frame; the scaling factor of the height between the current frame and the real frame; the value calculated from the width or height scaling coefficient between the current frame and the real frame; the width predicted by the tracker; the height predicted by the tracker; the width of the real frame of the target; the height of the real frame of the target; and a given hyper-parameter, whose value in the present embodiment is 4.
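The formula images are not reproduced; the described terms — overlap, center-coordinate errors and width/height scaling penalties with a given hyper-parameter (set to 4 in this embodiment) — resemble a complete-IoU style regression loss, which, as an assumed reconstruction rather than the patent's verbatim formula, can be written:

L_{box} = 1 - \mathrm{IoU}(b, b^{gt}) + \frac{\rho^{2}\big((x,y),(x^{gt},y^{gt})\big)}{c^{2}} + \alpha v, \qquad v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}

where (x, y, w, h) and (x^{gt}, y^{gt}, w^{gt}, h^{gt}) are the predicted and real frames, rho is the distance between their centers, c is the diagonal length of the smallest enclosing box, and alpha is a weighting coefficient.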
Further, after each round of iterative computation of back propagation is completed, the learning rate is updated by using a preset learning rate formula, and the corresponding learning rate updating formula is expressed as:
(The learning-rate update formula is reproduced only as an image in the original publication.) Its symbols denote: the updated learning rate for the current round; the minimum learning rate; the maximum learning rate; the index of the current epoch; and the index of the maximum epoch.
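Given the minimum and maximum learning rates and the epoch indices described above, a cosine-annealing schedule of the standard form below would be consistent (an assumption, not the patent's verbatim formula):

\eta_{t} = \eta_{min} + \frac{1}{2}\left(\eta_{max} - \eta_{min}\right)\left(1 + \cos\frac{t\,\pi}{T}\right)

where t is the index of the current epoch and T the index of the maximum epoch.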
In this embodiment, the preferred total number of training iterations is set to 500, and the initial learning rate is set to 0.003.
S106, the loss of the current regression frame is evaluated through a fast gradient descent method; when the regression-frame loss reaches its minimum, training is finished and each network weight file is output.
And S107, constructing a multi-modal target tracker according to each finally obtained network weight file, and determining the position of the tracked target in the image in real time.
In this embodiment, the target tracking frame is the box enclosed by the diagonal-vertex coordinates output by the algorithm. The multi-modal feature fusion algorithm is proposed on the basis of a Transformer hybrid architecture and applied to multi-modal target tracking, which can greatly improve the accuracy and robustness of the target tracking task. In addition, a common feature extraction network based on a twin network, combined with the KL-divergence mathematical index and the back-propagated loss value, serves as the algorithm for extracting the features common to the two modalities. By adjusting the step-size parameter, both global search and fast convergence of the algorithm can be ensured. Applying the Transformer hybrid architecture to multi-modal target tracking combines the advantages of deep-learning-based target tracking methods with those of traditional graphics algorithms, and is characterized by high tracking precision and avoidance of target loss.
The invention provides a Transformer-based twin multi-modal target tracking method, which acquires RGB image information and thermal image information in a scene; extracts high-level features of the different modalities through a pre-trained ResNet network while obtaining the features common to the modalities through a twin-network-based cross-modal feature fusion network; then feeds the high-level features of the corresponding modalities into a Transformer module designed for multi-modal input to perform cross-modal information fusion, feeds the result into a regression network based on a fully connected convolutional neural network to regress the final detection frame, back-propagates the errors produced in this process into each preceding network, and constructs a target tracking network from the final network weights so as to track a target under multi-modal conditions. The method can accurately predict the position information of an object in each modality, improves target tracking and positioning accuracy, and can be widely applied in a variety of scenes.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following technologies, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (7)

1. A twin multi-modal target tracking method based on a Transformer, characterized by comprising the following steps:
acquiring RGB image information and thermal image information under a current scene through a camera and a thermal imaging device;
secondly, respectively performing feature extraction on the RGB image information and the thermal image information by utilizing a pre-trained ResNet feature extraction network to correspondingly obtain RGB image features and thermal image features; aligning the RGB image information and the thermal image information by a method based on linear hypothesis, and performing feature extraction on the RGB image information and the thermal image information together by using a twin network based on ResNet to obtain RGB-thermal image features;
thirdly, matching the RGB image characteristics, the thermal image characteristics and the RGB-thermal image characteristics in pairs by using a characteristic fusion network based on a Transformer encoder to perform composite encoding so as to obtain an encoded characteristic diagram;
inputting the coded feature map into a Transformer-based feature matching network for expansion and matching to obtain a matching result of the template feature map and the background feature map, and expanding and re-matching this result using a cyclic-window attention matching mechanism to obtain a first feature map;
inputting the first feature map into a regressor based on a multilayer perceptron model to perform regression of a regression frame, returning an error calculation value based on a designed loss function and performing back propagation;
step six, evaluating the loss of the current regression frame through a fast gradient descent method, and, when the regression-frame loss reaches its minimum, finishing training and outputting each network weight file;
step seven, according to each finally obtained network weight file, a multi-modal target tracker is constructed, and the position of the tracked target in the image is determined in real time;
the ResNet feature extraction network is a ResNet50 feature extraction network, and in the second step, the method further includes:
using a ResNet50 feature extraction network pre-trained on the ImageNet10k data set, feature extraction is performed on the RGB image information and the thermal image information respectively;
adjusting the RGB image in the RGB image information according to the set image size and the given first frame diagram data;
performing constraint calculation on the ResNet50 feature extraction network by using KL divergence to obtain a loss value of current output;
calculating to obtain a final network loss value corresponding to the whole network according to the currently output loss value, wherein the whole network consists of a ResNet feature extraction network, a twin network based on ResNet, a feature fusion network based on a Transformer encoder and a feature matching network based on a Transformer;
in the step of adjusting the RGB image in the RGB image information, the corresponding expression is:
(The adjustment formula is reproduced only as an image in the original publication.) Its symbols denote: the output of the processed RGB image; the input of the current RGB image; the size of the current thermal image; the size of the current RGB image; and the offset of the image center point;
in the step of performing constraint calculation on the ResNet50 feature extraction network by using KL divergence to obtain a loss value of current output, a corresponding expression is as follows:
(The KL-divergence expression is reproduced only as an image in the original publication.) Its symbols denote: the loss value of the current output; the dimension of the output feature vector; the i-th column of the feature vector output by the RGB image through the ResNet50 feature extraction network; the i-th column of the feature vector output by the thermal image through the ResNet50 feature extraction network; and the column index i of the output feature vector;
in the step of calculating a final network loss value corresponding to the entire network according to the currently output loss value, the final network loss value corresponding to the entire network is expressed as:
(The formula is reproduced only as an image in the original publication.) Its symbols denote: the final network loss value corresponding to the whole network; the loss value back-propagated by the subsequent network; and a hyper-parameter.
2. The Transformer-based twin multi-modal target tracking method as claimed in claim 1, wherein in the step three, in the step of performing the composite encoding on the RGB image features, the thermal image features and the RGB-thermal image features in pairwise combination to obtain the encoded feature map, a formula corresponding to the encoding operation is represented as follows:
(The encoder formulas are reproduced only as images in the original publication.) Their symbols denote: the output of the encoder; the Softmax function; the feature vector of the RGB image produced by the ResNet50 feature extraction network; the RGB image; the thermal image; the feature vector of the thermal image produced by the ResNet50 feature extraction network; the dimension of the overall feature vector; the natural constant; a convolution operation; and the input of the current layer.
3. The Transformer-based twin multi-modal target tracking method as recited in claim 2, wherein the Transformer-based feature matching network corresponds to a formula represented as:
(The formulas are reproduced only as images in the original publication.) Their symbols denote: the output of the Transformer network; the output of the Query vector generation network; the output of the Key vector generation network; the output of the Value vector generation network; the dimension of the current layer; the first type of learnable parameters; the second type of learnable parameters; and the matrix transposition operation.
4. The Transformer-based twin multimodal target tracking method according to claim 3, wherein the fourth step specifically comprises:
translating the input template feature map up, down, left and right over the background feature map, and generating a matching heat map larger than the original background feature map;
a template group of a set size is used to match the template frame against the background frame on the expanded coded feature map, so as to obtain the matching result of the template feature map and the background feature map; the matching is carried out with a fixed step size. (The sizes of the template group, the matching step size and the resulting first feature map are given by formulas reproduced only as images in the original publication; the remaining symbols denote the side length of the i-th template and the dimension of the i-th feature vector.)
5. The method for twin multi-modal target tracking based on Transformer as claimed in claim 4, wherein in the step five, the first feature map is input into a regressor based on a multi-layer perceptron model to perform regression of a regression box, and a corresponding formula is expressed as:
(The formulas are reproduced only as images in the original publication.) Their symbols denote: the output of the multilayer perceptron model network; and the final regression result.
6. The method for twin multi-modal target tracking based on Transformer as claimed in claim 5, wherein in the step five, in the step of returning error calculation value based on designed loss function and performing back propagation, the following formula exists for the output loss function:
(The loss-function formulas are reproduced only as images in the original publication.) Their symbols denote: the loss of the current frame with respect to the real frame; the degree of overlap between the current frame and the real frame; the difference between the coordinate positions of the current frame and the real frame; the difference between the coordinate sizes of the current frame and the real frame; the mean square error of the abscissas of the current frame and the real frame; the mean square error of the ordinates of the current frame and the real frame; the mean square error of the current frame and the real frame along the abscissa or ordinate; the height difference between the current frame and the real frame; the width difference between the current frame and the real frame; the abscissa of the real frame of the target; the ordinate of the real frame of the target; the abscissa of the target predicted by the tracker; the ordinate of the target predicted by the tracker; the scaling coefficient between the current frame and the real frame; the ratio of the size of the current frame to that of the real frame; the mean square error of the coordinate sizes of the current frame and the real frame; the scaling factor of the width between the current frame and the real frame; the scaling factor of the height between the current frame and the real frame; the value calculated from the width or height scaling coefficient between the current frame and the real frame; the width predicted by the tracker; the height predicted by the tracker; the width of the real frame of the target; the height of the real frame of the target; and a given hyper-parameter.
7. The Transformer-based twin multimodal target tracking method according to claim 6, wherein in the step five, the method further comprises:
after each round of iterative computation of back propagation is completed, updating the learning rate by using a preset learning rate formula, wherein the corresponding learning rate updating formula is represented as:
(The learning-rate update formula is reproduced only as an image in the original publication.) Its symbols denote: the updated learning rate for the current round; the minimum learning rate; the maximum learning rate; the index of the current epoch; and the index of the maximum epoch.
CN202211376018.2A 2022-11-04 2022-11-04 Twin multi-modal target tracking method based on Transformer Active CN115423847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211376018.2A CN115423847B (en) 2022-11-04 2022-11-04 Twin multi-modal target tracking method based on Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211376018.2A CN115423847B (en) 2022-11-04 2022-11-04 Twin multi-modal target tracking method based on Transformer

Publications (2)

Publication Number Publication Date
CN115423847A (en) 2022-12-02
CN115423847B (en) 2023-02-07

Family

ID=84207365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211376018.2A Active CN115423847B (en) 2022-11-04 2022-11-04 Twin multi-modal target tracking method based on Transformer

Country Status (1)

Country Link
CN (1) CN115423847B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116563569B (en) * 2023-04-17 2023-11-17 昆明理工大学 Hybrid twin network-based heterogeneous image key point detection method and system
CN117876824B (en) * 2024-03-11 2024-05-10 华东交通大学 Multi-modal crowd counting model training method, system, storage medium and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110021033A (en) * 2019-02-22 2019-07-16 广西师范大学 A kind of method for tracking target based on the twin network of pyramid
CN110223324A (en) * 2019-06-05 2019-09-10 东华大学 A kind of method for tracking target of the twin matching network indicated based on robust features

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11599730B2 (en) * 2019-12-09 2023-03-07 Salesforce.Com, Inc. Learning dialogue state tracking with limited labeled data
US11604719B2 (en) * 2021-02-01 2023-03-14 Microsoft Technology Licensing, Llc. Automated program repair using stack traces and back translations
CN114372173A (en) * 2022-01-11 2022-04-19 中国人民公安大学 Natural language target tracking method based on Transformer architecture
CN115187799A (en) * 2022-07-04 2022-10-14 河南工业大学 Single-target long-time tracking method
CN115205590A (en) * 2022-07-11 2022-10-18 齐齐哈尔大学 Hyperspectral image classification method based on complementary integration Transformer network
CN115100235B (en) * 2022-08-18 2022-12-20 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Target tracking method, system and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110021033A (en) * 2019-02-22 2019-07-16 广西师范大学 A kind of method for tracking target based on the twin network of pyramid
CN110223324A (en) * 2019-06-05 2019-09-10 东华大学 A kind of method for tracking target of the twin matching network indicated based on robust features

Also Published As

Publication number Publication date
CN115423847A (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN115423847B (en) Twin multi-modal target tracking method based on Transformer
CN113902926B (en) General image target detection method and device based on self-attention mechanism
US11763433B2 (en) Depth image generation method and device
CN110781838A (en) Multi-modal trajectory prediction method for pedestrian in complex scene
CN113205466A (en) Incomplete point cloud completion method based on hidden space topological structure constraint
CN113297972B (en) Transformer substation equipment defect intelligent analysis method based on data fusion deep learning
CN111832484A (en) Loop detection method based on convolution perception hash algorithm
CN116049459B (en) Cross-modal mutual retrieval method, device, server and storage medium
CN111460894A (en) Intelligent car logo detection method based on convolutional neural network
CN116385761A (en) 3D target detection method integrating RGB and infrared information
CN115439694A (en) High-precision point cloud completion method and device based on deep learning
CN116188825A (en) Efficient feature matching method based on parallel attention mechanism
Lin et al. DA-Net: density-adaptive downsampling network for point cloud classification via end-to-end learning
Kim et al. Self-supervised keypoint detection based on multi-layer random forest regressor
CN117213470A (en) Multi-machine fragment map aggregation updating method and system
CN116228825B (en) Point cloud registration method based on significant anchor point geometric embedding
CN111578956A (en) Visual SLAM positioning method based on deep learning
CN115578574A (en) Three-dimensional point cloud completion method based on deep learning and topology perception
CN116645514A (en) Improved U 2 Ceramic tile surface defect segmentation method of Net
CN116363552A (en) Real-time target detection method applied to edge equipment
CN114399628A (en) Insulator high-efficiency detection system under complex space environment
CN114155406A (en) Pose estimation method based on region-level feature fusion
Zhu et al. Recurrent multi-view collaborative registration network for 3D reconstruction and optical measurement of blade profiles
Kaviani et al. Semi-Supervised 3D hand shape and pose estimation with label propagation
Xiong et al. SPEAL: Skeletal Prior Embedded Attention Learning for Cross-Source Point Cloud Registration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant