CN115423847B - Twin multi-modal target tracking method based on Transformer - Google Patents

Twin multi-modal target tracking method based on Transformer

Info

Publication number
CN115423847B
CN115423847B (application number CN202211376018.2A)
Authority
CN
China
Prior art keywords
network
representing
frame
feature
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211376018.2A
Other languages
Chinese (zh)
Other versions
CN115423847A (en)
Inventor
王辉
韩星宇
范自柱
杨辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University filed Critical East China Jiaotong University
Priority to CN202211376018.2A priority Critical patent/CN115423847B/en
Publication of CN115423847A publication Critical patent/CN115423847A/en
Application granted granted Critical
Publication of CN115423847B publication Critical patent/CN115423847B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a Transformer-based twin multi-modal target tracking method, which acquires RGB image information and thermal image information in a scene; extracts high-level features of the different modalities through a pre-trained ResNet network while obtaining the features common to the modalities through a twin-network-based cross-modal feature fusion network; then feeds the high-level features of the corresponding modalities into a Transformer module designed for multi-modal input to perform cross-modal information fusion, feeds the result into a regression network based on a fully connected convolutional neural network to regress the final detection frame, back-propagates the errors produced in this process into each preceding network, and constructs a target tracking network from the final network weights so as to track a target under multi-modal conditions. The method can accurately predict the position information of an object in each modality, improves target tracking and positioning accuracy, and can be widely applied in a variety of scenes.

Description

Twin multi-modal target tracking method based on Transformer
Technical Field
The invention relates to the technical field of computer target tracking, in particular to a twin multi-modal target tracking method based on a Transformer.
Background
Visual target tracking that exploits both the RGB and thermal infrared (TIR) spectra, known as RGBT tracking for short, can effectively overcome the shortcomings of traditional tracking tasks, in which the target is easily lost and performance degrades under extreme illumination conditions. At present, common multi-modal target tracking methods fall into two broad categories: mathematical tracking methods based on traditional graphics, and feature matching methods based on twin networks.
A mathematical tracking method based on traditional graphics generally constructs a kernel function f over the target detection area, convolves it with a filtering template h, and then optimizes through a corresponding algorithm to obtain a globally optimal regression frame. However, such methods, for example target tracking based on correlation filtering, linear regression filtering, or multi-feature algorithms, have difficulty tracking objects with complex foregrounds, so that the target is easily lost or the target frame cannot be regressed accurately.
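For orientation only (this formulation is a textbook example of the correlation-filter family and is not taken from the patent itself), such methods typically learn the filtering template h by ridge regression against a desired response g:

\min_{h}\;\lVert f \ast h - g \rVert^{2} + \lambda \lVert h \rVert^{2}

where the asterisk denotes correlation/convolution over the detection area, g is usually a Gaussian-shaped target response, and lambda is a regularization weight; the tracked position in the next frame is then taken at the peak of the filter response.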
Disclosure of Invention
Therefore, the embodiment of the invention provides a Transformer-based twin multi-modal target tracking method to solve the above technical problems.
The invention provides a twin multi-modal target tracking method based on a Transformer, which comprises the following steps:
acquiring RGB image information and thermal image information in a current scene through a camera and a thermal imaging device;
secondly, respectively performing feature extraction on the RGB image information and the thermal image information by utilizing a pre-trained ResNet feature extraction network to correspondingly obtain RGB image features and thermal image features; aligning the RGB image information and the thermal image information by a method based on linear hypothesis, and performing feature extraction on the RGB image information and the thermal image information together by using a twin network based on ResNet to obtain RGB-thermal image features;
thirdly, matching the RGB image characteristics, the thermal image characteristics and the RGB-thermal image characteristics in pairs by using a characteristic fusion network based on a Transformer encoder to perform composite encoding so as to obtain an encoded characteristic diagram;
inputting the coded feature map into a Transformer-based feature matching network for expansion and matching to obtain a matching result of the template feature map and the background feature map, and expanding and re-matching this result using a cyclic-window attention matching mechanism to obtain a first feature map;
inputting the first feature map into a regressor based on a multilayer perceptron model to perform regression of a regression frame, returning an error calculation value based on a designed loss function and performing back propagation;
step six, evaluating the loss of the current regression frame through a fast gradient descent method, and, when the regression-frame loss reaches its minimum, finishing training and outputting each network weight file;
and step seven, constructing a multi-modal target tracker according to the finally obtained network weight files and determining the position of the tracked target in the image in real time.
The invention provides a Transformer-based twin multi-modal target tracking method, which acquires RGB image information and thermal image information in a scene; extracts high-level features of the different modalities through a pre-trained ResNet network while obtaining the features common to the modalities through a twin-network-based cross-modal feature fusion network; then feeds the high-level features of the corresponding modalities into a Transformer module designed for multi-modal input to perform cross-modal information fusion, feeds the result into a regression network based on a fully connected convolutional neural network to regress the final detection frame, back-propagates the errors produced in this process into each preceding network, and constructs a target tracking network from the final network weights so as to track a target under multi-modal conditions. The method can accurately predict the position information of an object in each modality, improves target tracking and positioning accuracy, and can be widely applied in a variety of scenes.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of embodiments of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of a Transformer-based twin multi-modal target tracking method according to the present invention;
FIG. 2 is a schematic block diagram of a twin multi-modal target tracking method based on Transformer according to the present invention;
FIG. 3 is a schematic execution diagram of the Transformer-based twin multi-modal target tracking method according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1 to fig. 3, the present invention provides a method for twin multi-modal target tracking based on Transformer, wherein the method includes the following steps:
s101, collecting RGB image information and thermal image information under a current scene through a camera and a thermal imaging device.
S102, respectively performing feature extraction on RGB image information and thermal image information by using a pre-trained ResNet feature extraction network to correspondingly obtain RGB image features and thermal image features; the method based on the linear hypothesis aligns the RGB image information with the thermal image information, and uses a ResNet-based twin network to perform feature extraction on the RGB image information and the thermal image information together to obtain RGB-thermal image features.
In the present invention, the above-mentioned ResNet feature extraction network is a ResNet50 feature extraction network, and specifically, in step S102, the method further includes:
s1021, pre-training data of the network on the ImageNet10k data set is extracted by using the ResNet50 features, and feature extraction is respectively carried out on the RGB image information and the thermal image information.
S1022, the RGB image in the RGB image information is adjusted according to the set image size and the given first frame data.
Specifically, in the step of adjusting the RGB image in the RGB image information, the corresponding expression is:
(The adjustment formula is reproduced only as an image in the original publication.) Its symbols denote: the output of the processed RGB image; the input of the current RGB image; the size of the current thermal image; the size of the current RGB image; and the offset of the image center point.
And S1023, carrying out constraint calculation on the ResNet50 feature extraction network by utilizing KL divergence to obtain a loss value of current output.
Specifically, in the step of performing constraint calculation on the ResNet50 feature extraction network by using the KL divergence to obtain the currently output loss value, the corresponding expression is as follows:
(The KL-divergence expression is reproduced only as an image in the original publication.) Its symbols denote: the loss value of the current output; the dimension of the output feature vector; the i-th column of the feature vector output by the RGB image through the ResNet50 feature extraction network; the i-th column of the feature vector output by the thermal image through the ResNet50 feature extraction network; and the column index i of the output feature vector.
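The formula image itself is not reproduced; a KL-divergence constraint consistent with the symbol descriptions above would take the standard form below, where the exact normalization is an assumption:

L_{KL} = \sum_{i=1}^{n} P^{RGB}_{i} \log \frac{P^{RGB}_{i}}{P^{T}_{i}}

with P^{RGB}_i and P^{T}_i the (normalized, e.g. softmax-processed) i-th columns of the feature vectors produced by the RGB and thermal branches of the ResNet50 network, and n the number of columns.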
And S1024, calculating to obtain a final network loss value corresponding to the whole network according to the currently output loss value.
Wherein the whole network is composed of a ResNet feature extraction network (equivalent to the RGB feature extraction network and the thermal feature extraction network in FIG. 2), a ResNet-based twin network (equivalent to the thermal-RGB fusion feature extraction network in FIG. 2), a feature fusion network based on a Transformer encoder (equivalent to the feature fusion module in FIG. 2), and a Transformer-based feature matching network (equivalent to the Transformer-based feature matching-expansion network in FIG. 2). It should be noted that, in FIG. 2, L denotes the number of current features, r denotes the size of the template, and d denotes the dimension of the current features. In FIG. 2, Q denotes the operation through the Query vector generation network, K denotes the operation through the Key vector generation network, and V denotes the operation through the Value vector generation network.
In this step, the final network loss value corresponding to the whole network is expressed as:
(The formula is reproduced only as an image in the original publication.) Its symbols denote: the final network loss value corresponding to the whole network; the loss value back-propagated by the subsequent network; and a hyper-parameter, whose value in the present embodiment is 0.97.
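The formula image is not reproduced; read together with the 0.97 hyper-parameter, a weighted combination of the following form is a plausible reading (an assumption, not the patent's verbatim expression):

L_{final} = L_{back} + \lambda \, L_{KL}, \qquad \lambda = 0.97

where L_{back} is the loss value back-propagated by the subsequent networks and L_{KL} is the KL-divergence loss of step S1023.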
S103, matching the RGB image characteristics, the thermal image characteristics and the RGB-thermal image characteristics in pairs by using a feature fusion network based on a Transformer encoder to perform composite encoding so as to obtain an encoded feature map.
In the step of combining the RGB image features, the thermal image features and the RGB-thermal image features in pairs for composite encoding to obtain the encoded feature map, the formula corresponding to the encoding operation is expressed as:
(The encoder formulas are reproduced only as images in the original publication.) Their symbols denote: the output of the encoder; the Softmax function; the feature vector of the RGB image produced by the ResNet50 feature extraction network; the RGB image; the thermal image; the feature vector of the thermal image produced by the ResNet50 feature extraction network; the dimension of the overall feature vector; the natural constant; a convolution operation; and the input of the current layer.
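As an illustrative sketch only (module names, dimensions and the residual wiring are assumptions; the patent gives no code), pairwise cross-modal composite encoding of this kind can be written with a standard multi-head attention layer:

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    # Minimal sketch: fuse two modality token sequences with cross-attention.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, L, dim) flattened feature maps of two modalities
        fused, _ = self.attn(query=feat_a, key=feat_b, value=feat_b)
        return self.norm(feat_a + fused)  # residual connection + LayerNorm

# Hypothetical pairwise encoding of RGB, thermal and RGB-thermal features.
fuse = CrossModalFusion(dim=256)
rgb, tir, rgbt = (torch.randn(1, 196, 256) for _ in range(3))
enc_rgb_tir = fuse(rgb, tir)
enc_rgb_rgbt = fuse(rgb, rgbt)
enc_tir_rgbt = fuse(tir, rgbt)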
Further, a formula corresponding to the feature matching network based on the Transformer is represented as:
(The formulas are reproduced only as images in the original publication.) Their symbols denote: the output of the Transformer network; the output of the Query vector generation network; the output of the Key vector generation network; the output of the Value vector generation network; the dimension of the current layer; the first type of learnable parameters; the second type of learnable parameters; and the matrix transposition operation.
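The formula images are not reproduced; the Query/Key/Value generation networks, the layer dimension and the matrix transposition described above match the standard scaled dot-product attention, which (as an assumed reconstruction) reads:

Q = XW_{Q}, \quad K = XW_{K}, \quad V = XW_{V}, \qquad \mathrm{Attention}(Q,K,V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V

where X is the input of the current layer, W_Q, W_K and W_V are the learnable projection matrices, and d is the dimension of the current layer.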
S104, inputting the coded feature map into a Transformer-based feature matching network for expansion and matching to obtain a matching result of the template feature map and the background feature map, and expanding and re-matching this result using a cyclic-window attention matching mechanism to obtain a first feature map.
Specifically, step S104 specifically includes:
s1041, translating the input template feature diagram up, down, left and right on the background feature diagram, and generating a matching thermodynamic diagram larger than the size of the original background feature diagram.
S1042, a template group of a set size is used to match the template frame against the background frame on the expanded coded feature map, so as to obtain the matching result of the template feature map and the background feature map; the matching is carried out with a fixed step size. (The sizes of the template group, the matching step size and the resulting first feature map are given by formulas reproduced only as images in the original publication; the remaining symbols denote the side length of the i-th template and the dimension of the i-th feature vector.)
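A minimal sketch of sliding-template matching that produces a heat map larger than the original background map follows (the depth-wise cross-correlation operator and padding scheme are assumptions, not the patent's exact matching rule):

import torch
import torch.nn.functional as F

def template_match(background, template, stride=1):
    # background: (B, C, H, W) feature map; template: (B, C, h, w) template group member.
    # Returns a matching heat map larger than the original background map.
    h, w = template.shape[-2:]
    padded = F.pad(background, (w - 1, w - 1, h - 1, h - 1))  # expand the search area
    heat = torch.stack([
        F.conv2d(padded[i:i + 1], template[i:i + 1], stride=stride)  # per-sample cross-correlation
        for i in range(background.size(0))
    ]).squeeze(1)
    return heat   # (B, 1, H + h - 1, W + w - 1) when stride = 1

bg = torch.randn(2, 64, 31, 31)
tpl = torch.randn(2, 64, 7, 7)
heatmap = template_match(bg, tpl)   # (2, 1, 37, 37)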
And S105, inputting the first feature map into a regressor based on a multilayer perceptron model to perform regression of a regression frame, and returning an error calculation value based on a designed loss function and performing back propagation.
In step S105, the first feature map is input to the MLP-based regressor to perform regression of the regression frame, and the corresponding formula is expressed as:
(The formulas are reproduced only as images in the original publication.) Their symbols denote: the output of the multilayer perceptron model network; and the final regression result.
Further, for the loss function of the output, there is the following formula:
(The loss-function formulas are reproduced only as images in the original publication.) Their symbols denote: the loss of the current frame with respect to the real frame; the degree of overlap between the current frame and the real frame; the difference between the coordinate positions of the current frame and the real frame; the difference between the coordinate sizes of the current frame and the real frame; the mean square error of the abscissas of the current frame and the real frame; the mean square error of the ordinates of the current frame and the real frame; the mean square error of the current frame and the real frame along the abscissa or ordinate; the height difference between the current frame and the real frame; the width difference between the current frame and the real frame; the abscissa of the real frame of the target; the ordinate of the real frame of the target; the abscissa of the target predicted by the tracker; the ordinate of the target predicted by the tracker; the scaling coefficient between the current frame and the real frame; the ratio of the size of the current frame to that of the real frame; the mean square error of the coordinate sizes of the current frame and the real frame; the scaling factor of the width between the current frame and the real frame; the scaling factor of the height between the current frame and the real frame; the value calculated from the width or height scaling coefficient between the current frame and the real frame; the width predicted by the tracker; the height predicted by the tracker; the width of the real frame of the target; the height of the real frame of the target; and a given hyper-parameter, whose value in the present embodiment is 4.
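The formula images are not reproduced; the described terms — overlap, center-coordinate errors and width/height scaling penalties with a given hyper-parameter (set to 4 in this embodiment) — resemble a complete-IoU style regression loss, which, as an assumed reconstruction rather than the patent's verbatim formula, can be written:

L_{box} = 1 - \mathrm{IoU}(b, b^{gt}) + \frac{\rho^{2}\big((x,y),(x^{gt},y^{gt})\big)}{c^{2}} + \alpha v, \qquad v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}

where (x, y, w, h) and (x^{gt}, y^{gt}, w^{gt}, h^{gt}) are the predicted and real frames, rho is the distance between their centers, c is the diagonal length of the smallest enclosing box, and alpha is a weighting coefficient.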
Further, after each round of iterative computation of back propagation is completed, the learning rate is updated by using a preset learning rate formula, and the corresponding learning rate updating formula is expressed as:
(The learning-rate update formula is reproduced only as an image in the original publication.) Its symbols denote: the updated learning rate for the current round; the minimum learning rate; the maximum learning rate; the index of the current epoch; and the index of the maximum epoch.
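Given the minimum and maximum learning rates and the epoch indices described above, a cosine-annealing schedule of the standard form below would be consistent (an assumption, not the patent's verbatim formula):

\eta_{t} = \eta_{min} + \frac{1}{2}\left(\eta_{max} - \eta_{min}\right)\left(1 + \cos\frac{t\,\pi}{T}\right)

where t is the index of the current epoch and T the index of the maximum epoch.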
In this embodiment, the preferred total number of training iterations is set to 500, and the initial learning rate is set to 0.003.
S106, the loss of the current regression frame is evaluated through a fast gradient descent method; when the regression-frame loss reaches its minimum, training is finished and each network weight file is output.
And S107, constructing a multi-modal target tracker according to each finally obtained network weight file, and determining the position of the tracked target in the image in real time.
In this embodiment, the target tracking frame is the box enclosed by the diagonal-vertex coordinates output by the algorithm. The multi-modal feature fusion algorithm is proposed on the basis of a Transformer hybrid architecture and applied to multi-modal target tracking, which can greatly improve the accuracy and robustness of the target tracking task. In addition, a common feature extraction network based on a twin network, combined with the KL-divergence mathematical index and the back-propagated loss value, serves as the algorithm for extracting the features common to the two modalities. By adjusting the step-size parameter, both global search and fast convergence of the algorithm can be ensured. Applying the Transformer hybrid architecture to multi-modal target tracking combines the advantages of deep-learning-based target tracking methods with those of traditional graphics algorithms, and is characterized by high tracking precision and avoidance of target loss.
The invention provides a Transformer-based twin multi-modal target tracking method, which acquires RGB image information and thermal image information in a scene; extracts high-level features of the different modalities through a pre-trained ResNet network while obtaining the features common to the modalities through a twin-network-based cross-modal feature fusion network; then feeds the high-level features of the corresponding modalities into a Transformer module designed for multi-modal input to perform cross-modal information fusion, feeds the result into a regression network based on a fully connected convolutional neural network to regress the final detection frame, back-propagates the errors produced in this process into each preceding network, and constructs a target tracking network from the final network weights so as to track a target under multi-modal conditions. The method can accurately predict the position information of an object in each modality, improves target tracking and positioning accuracy, and can be widely applied in a variety of scenes.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following technologies, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (7)

1. A twin multi-modal target tracking method based on a Transformer, characterized by comprising the following steps:
acquiring RGB image information and thermal image information under a current scene through a camera and a thermal imaging device;
secondly, respectively performing feature extraction on the RGB image information and the thermal image information by utilizing a pre-trained ResNet feature extraction network to correspondingly obtain RGB image features and thermal image features; aligning the RGB image information and the thermal image information by a method based on linear hypothesis, and performing feature extraction on the RGB image information and the thermal image information together by using a twin network based on ResNet to obtain RGB-thermal image features;
thirdly, matching the RGB image characteristics, the thermal image characteristics and the RGB-thermal image characteristics in pairs by using a characteristic fusion network based on a Transformer encoder to perform composite encoding so as to obtain an encoded characteristic diagram;
inputting the coded feature map into a Transformer-based feature matching network for expansion and matching to obtain a matching result of the template feature map and the background feature map, and expanding and re-matching this result using a cyclic-window attention matching mechanism to obtain a first feature map;
inputting the first feature map into a regressor based on a multilayer perceptron model to perform regression of a regression frame, returning an error calculation value based on a designed loss function and performing back propagation;
step six, evaluating the loss of the current regression frame through a fast gradient descent method, and, when the regression-frame loss reaches its minimum, finishing training and outputting each network weight file;
step seven, according to each finally obtained network weight file, a multi-modal target tracker is constructed, and the position of the tracked target in the image is determined in real time;
the ResNet feature extraction network is a ResNet50 feature extraction network, and in the second step, the method further includes:
using a ResNet50 feature extraction network pre-trained on the ImageNet10k data set, feature extraction is performed on the RGB image information and the thermal image information respectively;
adjusting the RGB image in the RGB image information according to the set image size and the given first frame diagram data;
performing constraint calculation on the ResNet50 feature extraction network by using KL divergence to obtain a loss value of current output;
calculating to obtain a final network loss value corresponding to the whole network according to the currently output loss value, wherein the whole network consists of a ResNet feature extraction network, a twin network based on ResNet, a feature fusion network based on a Transformer encoder and a feature matching network based on a Transformer;
in the step of adjusting the RGB image in the RGB image information, the corresponding expression is:
(The adjustment formula is reproduced only as an image in the original publication.) Its symbols denote: the output of the processed RGB image; the input of the current RGB image; the size of the current thermal image; the size of the current RGB image; and the offset of the image center point;
in the step of performing constraint calculation on the ResNet50 feature extraction network by using KL divergence to obtain a loss value of current output, a corresponding expression is as follows:
(The KL-divergence expression is reproduced only as an image in the original publication.) Its symbols denote: the loss value of the current output; the dimension of the output feature vector; the i-th column of the feature vector output by the RGB image through the ResNet50 feature extraction network; the i-th column of the feature vector output by the thermal image through the ResNet50 feature extraction network; and the column index i of the output feature vector;
in the step of calculating a final network loss value corresponding to the entire network according to the currently output loss value, the final network loss value corresponding to the entire network is expressed as:
(The formula is reproduced only as an image in the original publication.) Its symbols denote: the final network loss value corresponding to the whole network; the loss value back-propagated by the subsequent network; and a hyper-parameter.
2. The Transformer-based twin multi-modal target tracking method as claimed in claim 1, wherein in the step three, in the step of performing the composite encoding on the RGB image features, the thermal image features and the RGB-thermal image features in pairwise combination to obtain the encoded feature map, a formula corresponding to the encoding operation is represented as follows:
(The encoder formulas are reproduced only as images in the original publication.) Their symbols denote: the output of the encoder; the Softmax function; the feature vector of the RGB image produced by the ResNet50 feature extraction network; the RGB image; the thermal image; the feature vector of the thermal image produced by the ResNet50 feature extraction network; the dimension of the overall feature vector; the natural constant; a convolution operation; and the input of the current layer.
3. The Transformer-based twin multi-modal target tracking method as recited in claim 2, wherein the Transformer-based feature matching network corresponds to a formula represented as:
(The formulas are reproduced only as images in the original publication.) Their symbols denote: the output of the Transformer network; the output of the Query vector generation network; the output of the Key vector generation network; the output of the Value vector generation network; the dimension of the current layer; the first type of learnable parameters; the second type of learnable parameters; and the matrix transposition operation.
4. The Transformer-based twin multimodal target tracking method according to claim 3, wherein the fourth step specifically comprises:
translating the input template feature map up, down, left and right over the background feature map, and generating a matching heat map larger than the original background feature map;
a template group of a set size is used to match the template frame against the background frame on the expanded coded feature map, so as to obtain the matching result of the template feature map and the background feature map; the matching is carried out with a fixed step size. (The sizes of the template group, the matching step size and the resulting first feature map are given by formulas reproduced only as images in the original publication; the remaining symbols denote the side length of the i-th template and the dimension of the i-th feature vector.)
5. The method for twin multi-modal target tracking based on Transformer as claimed in claim 4, wherein in the step five, the first feature map is input into a regressor based on a multi-layer perceptron model to perform regression of a regression box, and a corresponding formula is expressed as:
(The formulas are reproduced only as images in the original publication.) Their symbols denote: the output of the multilayer perceptron model network; and the final regression result.
6. The method for twin multi-modal target tracking based on Transformer as claimed in claim 5, wherein in the step five, in the step of returning error calculation value based on designed loss function and performing back propagation, the following formula exists for the output loss function:
(The loss-function formulas are reproduced only as images in the original publication.) Their symbols denote: the loss of the current frame with respect to the real frame; the degree of overlap between the current frame and the real frame; the difference between the coordinate positions of the current frame and the real frame; the difference between the coordinate sizes of the current frame and the real frame; the mean square error of the abscissas of the current frame and the real frame; the mean square error of the ordinates of the current frame and the real frame; the mean square error of the current frame and the real frame along the abscissa or ordinate; the height difference between the current frame and the real frame; the width difference between the current frame and the real frame; the abscissa of the real frame of the target; the ordinate of the real frame of the target; the abscissa of the target predicted by the tracker; the ordinate of the target predicted by the tracker; the scaling coefficient between the current frame and the real frame; the ratio of the size of the current frame to that of the real frame; the mean square error of the coordinate sizes of the current frame and the real frame; the scaling factor of the width between the current frame and the real frame; the scaling factor of the height between the current frame and the real frame; the value calculated from the width or height scaling coefficient between the current frame and the real frame; the width predicted by the tracker; the height predicted by the tracker; the width of the real frame of the target; the height of the real frame of the target; and a given hyper-parameter.
7. The Transformer-based twin multimodal target tracking method according to claim 6, wherein in the step five, the method further comprises:
after each round of iterative computation of back propagation is completed, updating the learning rate by using a preset learning rate formula, wherein the corresponding learning rate updating formula is represented as:
(The learning-rate update formula is reproduced only as an image in the original publication.) Its symbols denote: the updated learning rate for the current round; the minimum learning rate; the maximum learning rate; the index of the current epoch; and the index of the maximum epoch.
CN202211376018.2A 2022-11-04 2022-11-04 Twin multi-modal target tracking method based on Transformer Active CN115423847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211376018.2A CN115423847B (en) 2022-11-04 2022-11-04 Twin multi-modal target tracking method based on Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211376018.2A CN115423847B (en) 2022-11-04 2022-11-04 Twin multi-modal target tracking method based on Transformer

Publications (2)

Publication Number Publication Date
CN115423847A (en) 2022-12-02
CN115423847B (en) 2023-02-07

Family

ID=84207365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211376018.2A Active CN115423847B (en) 2022-11-04 2022-11-04 Twin multi-modal target tracking method based on Transformer

Country Status (1)

Country Link
CN (1) CN115423847B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116563569B (en) * 2023-04-17 2023-11-17 昆明理工大学 Hybrid twin network-based heterogeneous image key point detection method and system
CN117876824B (en) * 2024-03-11 2024-05-10 华东交通大学 Multi-modal crowd counting model training method, system, storage medium and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110021033A (en) * 2019-02-22 2019-07-16 广西师范大学 A kind of method for tracking target based on the twin network of pyramid
CN110223324A (en) * 2019-06-05 2019-09-10 东华大学 A kind of method for tracking target of the twin matching network indicated based on robust features

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11599730B2 (en) * 2019-12-09 2023-03-07 Salesforce.Com, Inc. Learning dialogue state tracking with limited labeled data
US11604719B2 (en) * 2021-02-01 2023-03-14 Microsoft Technology Licensing, Llc. Automated program repair using stack traces and back translations
CN114372173A (en) * 2022-01-11 2022-04-19 中国人民公安大学 Natural language target tracking method based on Transformer architecture
CN115187799A (en) * 2022-07-04 2022-10-14 河南工业大学 Single-target long-time tracking method
CN115205590A (en) * 2022-07-11 2022-10-18 齐齐哈尔大学 Hyperspectral image classification method based on complementary integration Transformer network
CN115100235B (en) * 2022-08-18 2022-12-20 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Target tracking method, system and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110021033A (en) * 2019-02-22 2019-07-16 广西师范大学 A kind of method for tracking target based on the twin network of pyramid
CN110223324A (en) * 2019-06-05 2019-09-10 东华大学 A kind of method for tracking target of the twin matching network indicated based on robust features

Also Published As

Publication number Publication date
CN115423847A (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN115423847B (en) Twin multi-modal target tracking method based on Transformer
CN113902926B (en) General image target detection method and device based on self-attention mechanism
US11763433B2 (en) Depth image generation method and device
CN110781838A (en) Multi-modal trajectory prediction method for pedestrian in complex scene
CN113205466A (en) Incomplete point cloud completion method based on hidden space topological structure constraint
CN113297972B (en) Transformer substation equipment defect intelligent analysis method based on data fusion deep learning
CN111832484A (en) Loop detection method based on convolution perception hash algorithm
CN116049459B (en) Cross-modal mutual retrieval method, device, server and storage medium
CN111460894A (en) Intelligent car logo detection method based on convolutional neural network
CN116385761A (en) 3D target detection method integrating RGB and infrared information
CN115439694A (en) High-precision point cloud completion method and device based on deep learning
CN116188825A (en) Efficient feature matching method based on parallel attention mechanism
Lin et al. DA-Net: density-adaptive downsampling network for point cloud classification via end-to-end learning
Kim et al. Self-supervised keypoint detection based on multi-layer random forest regressor
CN117213470A (en) Multi-machine fragment map aggregation updating method and system
CN116228825B (en) Point cloud registration method based on significant anchor point geometric embedding
CN111578956A (en) Visual SLAM positioning method based on deep learning
CN115578574A (en) Three-dimensional point cloud completion method based on deep learning and topology perception
CN116645514A (en) Improved U 2 Ceramic tile surface defect segmentation method of Net
CN116363552A (en) Real-time target detection method applied to edge equipment
CN114399628A (en) Insulator high-efficiency detection system under complex space environment
CN114155406A (en) Pose estimation method based on region-level feature fusion
Zhu et al. Recurrent multi-view collaborative registration network for 3D reconstruction and optical measurement of blade profiles
Kaviani et al. Semi-Supervised 3D hand shape and pose estimation with label propagation
Xiong et al. SPEAL: Skeletal Prior Embedded Attention Learning for Cross-Source Point Cloud Registration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant