CN117522923A - Target tracking system and method integrating multi-mode characteristics - Google Patents

Target tracking system and method integrating multi-mode characteristics Download PDF

Info

Publication number
CN117522923A
CN117522923A (application CN202311554028.5A)
Authority
CN
China
Prior art keywords
feature
feature map
attention
classification
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202311554028.5A
Other languages
Chinese (zh)
Inventor
潘祥生 (Pan Xiangsheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming Mingaozhi Data Technology Co ltd
Original Assignee
Kunming Mingaozhi Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming Mingaozhi Data Technology Co ltd filed Critical Kunming Mingaozhi Data Technology Co ltd
Priority to CN202311554028.5A priority Critical patent/CN117522923A/en
Publication of CN117522923A publication Critical patent/CN117522923A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of target tracking, and in particular discloses a target tracking system and method fusing multi-modal features. Using deep learning, RGB features and thermal infrared features are extracted from the RGB image and the thermal infrared image acquired for target tracking, the two sets of features are fused by linear combination, a residual attention mechanism then adaptively adjusts the feature weights in the spatial and channel dimensions, and finally a classifier judges whether the target object has left the tracking range. By effectively fusing multi-modal features in target tracking, the accuracy of target tracking is improved and a better tracking effect is achieved.

Description

Target tracking system and method integrating multi-mode characteristics
Technical Field
The present disclosure relates to the field of target tracking technologies, and in particular, to a target tracking system and method that incorporates multi-modal features.
Background
Target tracking is an important problem in the field of computer vision and is widely applied in sports event broadcasting, security monitoring, unmanned aerial vehicles, unmanned vehicles, robots and other fields. In the process of target tracking, the target object needs to be identified and tracked, and information such as its position and pose needs to be monitored in real time. Conventional target tracking methods typically use features of only one modality, such as visible light images. However, in practical application scenarios, due to the complexity of environmental factors and of the target's own characteristics, it is often difficult to obtain a good tracking effect with the features of a single modality.
In order to improve the accuracy and robustness of target tracking, target tracking methods based on multi-modal feature fusion have been proposed. Such methods effectively fuse the features of different modalities, obtaining richer target information and thereby improving the accuracy and robustness of target tracking. At present, a common multi-modal feature fusion approach is weighted fusion. However, weighted fusion is highly subjective, and it is difficult to determine the weight of each modal feature adaptively.
Therefore, a target tracking system and method that fuse multi-modal features are desired.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. The embodiments of the application provide a target tracking system and method fusing multi-modal features: RGB features and thermal infrared features are extracted from the RGB image and the thermal infrared image acquired for target tracking by means of deep learning, the two sets of features are fused by linear combination, a residual attention mechanism then adaptively adjusts the feature weights in the spatial and channel dimensions, and finally a classifier judges whether the target object has left the tracking range. In this way, the multi-modal features are effectively fused in target tracking, the accuracy of target tracking is improved, and a better tracking effect is achieved.
Accordingly, according to one aspect of the present application, there is provided a target tracking system incorporating multi-modal features, comprising:
the data acquisition module is used for acquiring an RGB image and a thermal infrared image of target tracking in real time;
the RGB image feature extraction module is used for enabling the RGB image to pass through a first convolution neural network model serving as a filter to obtain an RGB feature map;
the thermal infrared image feature extraction module is used for enabling the thermal infrared image to pass through a second convolution neural network model serving as a filter so as to obtain a thermal infrared feature map;
the linear combination module is used for carrying out feature fusion on the RGB feature map and the thermal infrared feature map in a linear combination mode so as to obtain a multi-mode fusion feature map;
the residual double-attention module is used for enabling the multi-mode fusion feature map to pass through a residual double-attention mechanism model to obtain a classification feature map;
the characteristic strengthening module is used for carrying out autocorrelation strengthening of characteristic value quantization distribution characteristics on the classification characteristic diagram so as to obtain a strengthened classification characteristic diagram;
and the tracking result generation module is used for enabling the enhanced classification characteristic diagram to pass through a classifier to obtain a classification result, wherein the classification result is used for indicating whether the target object is out of a tracking range.
In the above target tracking system with multi-modal feature fusion, the RGB image feature extraction module is configured to: in the forward pass of each layer of the first convolutional neural network model used as the filter, perform the following steps on the input data: perform convolution processing on the input data based on a two-dimensional convolution kernel to generate a convolution feature map; perform mean pooling based on a local feature matrix on the convolution feature map to generate a pooled feature map; and perform nonlinear activation on the feature values at all positions in the pooled feature map to generate an activation feature map; wherein the output of the last layer of the first convolutional neural network model is the RGB feature map, the input of each layer from the second layer to the last layer is the output of the previous layer, and the input of the first layer of the first convolutional neural network model is the RGB image.
In the target tracking system with the multi-mode feature fusion, the thermal infrared image feature extraction module is configured to: and respectively carrying out two-dimensional convolution processing, local feature matrix-based mean pooling processing and nonlinear activation processing on input data in forward transfer of layers by using each layer of the second convolution neural network model serving as a filter so as to output the thermal infrared feature map by the last layer of the second convolution neural network model.
In the above target tracking system fusing multi-modal features, the residual dual-attention module includes: the spatial attention unit is used for inputting the multi-mode fusion feature map into a spatial attention module of the residual double-attention mechanism model to obtain a spatial attention map; the channel attention unit is used for inputting the multi-mode fusion feature map into a channel attention module of the residual double-attention mechanism model to obtain a channel attention map; an attention fusion unit for fusing the spatial attention map and the channel attention map to obtain a fused attention map; the activating unit is used for inputting the fusion attention map into a Sigmoid activating function to activate so as to obtain a fusion attention feature map; the attention applying unit is used for calculating the weighted feature map obtained by multiplying the position-by-position points of the fusion attention feature map and the multi-mode fusion feature map; and the residual fusion unit is used for fusing the weighted feature map and the multi-mode fusion feature map to obtain the classification feature map.
In the above target tracking system that merges multi-modal features, the spatial attention unit includes: the space perception subunit is used for carrying out convolution encoding on the multi-mode fusion feature map by using a convolution layer of a space attention module of the residual double-attention mechanism model so as to obtain an initial convolution feature map; a probability subunit, configured to pass the initial convolution feature map through a Softmax function to obtain a spatial attention score map; and the spatial attention applying subunit is used for multiplying the spatial attention score graph and the multi-mode fusion feature graph by position points to obtain the spatial attention graph.
In the above target tracking system that merges multi-modal features, the channel attention unit includes: the channel dimension pooling subunit is used for carrying out global average pooling along the channel dimension on the multi-mode fusion feature map so as to obtain a channel feature vector; a nonlinear activation subunit, configured to pass the channel feature vector through a Softmax activation function to obtain a channel weight feature vector; and the channel attention applying subunit is used for weighting each feature matrix of the multi-mode fusion feature map along the channel dimension by taking the feature value of each position in the channel weight feature vector as a weight so as to obtain the channel attention map.
In the above target tracking system fusing multi-modal features, the feature enhancement module includes: the characteristic squeezing unit is used for passing the classification characteristic map through a characteristic squeezing module based on a convolution layer to obtain a squeezing classification characteristic map; the characteristic excitation unit is used for enabling the squeezing classification characteristic diagram to pass through a characteristic excitation module based on a deconvolution layer to obtain an excitation classification characteristic diagram; the cosine similarity calculation unit is used for calculating cosine similarity between channel feature vectors of every two pixel positions of the excitation classification feature map to obtain a classification feature autocorrelation matrix; the normalization unit is used for normalizing the classification characteristic autocorrelation matrix through a Softmax function to obtain an autocorrelation class attention matrix; and the strengthening unit is used for modeling the relation between any two pixel points in the excitation classification characteristic diagram by utilizing the autocorrelation class attention matrix so as to obtain the strengthening classification characteristic diagram.
In the target tracking system integrating the multi-mode features, the reinforcement unit is configured to: modeling the relation between any two pixel points in the excitation classification feature map by using the autocorrelation class attention matrix according to the following reinforcement formula to obtain the reinforcement classification feature map after the association feature mapping; wherein, the strengthening formula is:
wherein S represents the autocorrelation class attention matrix, F_1 represents the excitation classification feature map, W represents a matrix of learnable parameters, ⊗ represents matrix multiplication, and F_2 represents the enhanced classification feature map after the association feature mapping.
In the target tracking system integrating the multi-mode features, the tracking result generating module is configured to: process the enhanced classification feature map with the classifier according to the following classification formula to obtain the classification result; wherein, the classification formula is: O = softmax{(M_c, B_c) | Project(F_2)}, where Project(F_2) represents projecting the enhanced classification feature map as a vector, M_c represents the weight matrix of the fully connected layer, B_c represents the bias matrix of the fully connected layer, softmax represents the normalized exponential function, and O represents the classification result.
According to another aspect of the present application, there is provided a target tracking method fusing multi-modal characteristics, including:
acquiring an RGB image and a thermal infrared image of target tracking in real time;
passing the RGB image through a first convolutional neural network model serving as a filter to obtain an RGB feature map;
the thermal infrared image is passed through a second convolution neural network model serving as a filter to obtain a thermal infrared characteristic diagram;
performing feature fusion on the RGB feature map and the thermal infrared feature map in a linear combination mode to obtain a multi-mode fusion feature map;
the multi-mode fusion feature map is subjected to a residual double-attention mechanism model to obtain a classification feature map;
performing autocorrelation strengthening of characteristic value quantization distribution characteristics on the classification characteristic map to obtain a strengthened classification characteristic map;
and the enhanced classification characteristic diagram is passed through a classifier to obtain a classification result, wherein the classification result is used for indicating whether the target object is out of the tracking range.
Compared with the prior art, the target tracking system and method fusing multi-modal features provided by the application extract RGB features and thermal infrared features from the RGB image and the thermal infrared image acquired for target tracking by means of deep learning, fuse the two sets of features by linear combination, adaptively adjust the feature weights in the spatial and channel dimensions with a residual attention mechanism, and finally judge through a classifier whether the target object has left the tracking range. In this way, the multi-modal features are effectively fused in target tracking, the accuracy of target tracking is improved, and a better tracking effect is achieved.
Drawings
The foregoing and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application, are incorporated in and constitute a part of this specification, illustrate the application, and do not constitute a limitation of the application. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 is a block diagram of a target tracking system incorporating multi-modal features according to an embodiment of the present application.
Fig. 2 is a schematic architecture diagram of a target tracking system incorporating multi-modal features according to an embodiment of the present application.
Fig. 3 is a block diagram of a residual dual attention module in a target tracking system incorporating multi-modal features in accordance with an embodiment of the present application.
Fig. 4 is a block diagram of a spatial attention unit in a target tracking system incorporating multi-modal features in accordance with an embodiment of the present application.
Fig. 5 is a block diagram of a channel attention unit in a target tracking system incorporating multi-modal features according to an embodiment of the present application.
Fig. 6 is a flowchart of a target tracking method incorporating multi-modal features according to an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
FIG. 1 is a block diagram of a target tracking system incorporating multi-modal features according to an embodiment of the present application. Fig. 2 is a schematic architecture diagram of a target tracking system incorporating multi-modal features according to an embodiment of the present application. As shown in fig. 1 and 2, a target tracking system 100 that incorporates multi-modal features according to an embodiment of the present application includes: the data acquisition module 110 is used for acquiring an RGB image and a thermal infrared image of target tracking in real time; an RGB image feature extraction module 120, configured to pass the RGB image through a first convolutional neural network model serving as a filter to obtain an RGB feature map; a thermal infrared image feature extraction module 130, configured to pass the thermal infrared image through a second convolutional neural network model serving as a filter to obtain a thermal infrared feature map; the linear combination module 140 is configured to perform feature fusion on the RGB feature map and the thermal infrared feature map in a linear combination manner to obtain a multi-mode fusion feature map; the residual dual-attention module 150 is configured to pass the multi-mode fusion feature map through a residual dual-attention mechanism model to obtain a classification feature map; the feature enhancement module 160 is configured to perform auto-correlation enhancement of feature value quantization distribution features on the classification feature map to obtain an enhanced classification feature map; the tracking result generating module 170 is configured to pass the enhanced classification feature map through a classifier to obtain a classification result, where the classification result is used to indicate whether the target object deviates from the tracking range.
In the above-mentioned target tracking system 100 that fuses multi-modal features, the data acquisition module 110 is configured to acquire an RGB image and a thermal infrared image for target tracking in real time. As described in the background, a target tracking method with multi-modal feature fusion is proposed in order to improve the accuracy and robustness of target tracking. Accordingly, an RGB image can provide clear color and texture information under good illumination conditions and can thus be used to extract the appearance characteristics of the tracked object, while a thermal infrared image provides thermal energy information of the tracked object and is advantageous for tracking in low light or against complex backgrounds. The RGB image and the thermal infrared image therefore complement each other and improve target tracking accuracy. Based on this, in the technical scheme of the application, the RGB image for target tracking is first acquired by a camera, and the thermal infrared image is acquired by a thermal infrared imager.
In the above-mentioned target tracking system 100 with multi-modal feature fusion, the RGB image feature extraction module 120 is configured to pass the RGB image through a first convolutional neural network model as a filter to obtain an RGB feature map. In target tracking, the RGB image contains rich color and texture information, and the appearance features have important significance for target tracking identification. In order to extract the appearance feature information of the RGB image, the RGB image is further feature-mined using a convolutional neural network having good performance in image feature extraction. It should be appreciated that convolutional neural networks (Convolutional Neural Network, CNN) are a deep learning model that is adept at processing image data. By using the CNN model, useful features can be effectively extracted from RGB images and applied to target tracking tasks. Specifically, by inputting the RGB image into a convolutional neural network model, the network extracts local implicit features in the image through a series of convolution, pooling and activation operations to capture information such as edges, textures, shapes and the like of the target, further distinguish tracking targets from backgrounds and provide valuable feature representations.
Accordingly, in one specific example, the RGB image feature extraction module 120 is configured to: in the forward pass of each layer of the first convolutional neural network model used as the filter, perform the following steps on the input data: perform convolution processing on the input data based on a two-dimensional convolution kernel to generate a convolution feature map; perform mean pooling based on a local feature matrix on the convolution feature map to generate a pooled feature map; and perform nonlinear activation on the feature values at all positions in the pooled feature map to generate an activation feature map; wherein the output of the last layer of the first convolutional neural network model is the RGB feature map, the input of each layer from the second layer to the last layer is the output of the previous layer, and the input of the first layer of the first convolutional neural network model is the RGB image.
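By way of illustration, the per-layer processing described above for the first convolutional neural network model (and, in the same way, for the second convolutional neural network model applied to the thermal infrared image) can be sketched in PyTorch as follows. The number of layers, the channel widths, the kernel sizes and the choice of ReLU as the nonlinear activation are assumptions made for this sketch and are not prescribed by the present application.

```python
import torch
import torch.nn as nn

class ConvFilterBackbone(nn.Module):
    """Each layer: 2D convolution -> local mean pooling -> nonlinear activation."""
    def __init__(self, in_channels: int = 3, widths=(32, 64, 128)):
        super().__init__()
        layers, prev = [], in_channels
        for w in widths:
            layers += [
                nn.Conv2d(prev, w, kernel_size=3, padding=1),  # two-dimensional convolution kernel
                nn.AvgPool2d(kernel_size=2),                   # mean pooling over local regions
                nn.ReLU(inplace=True),                         # nonlinear activation
            ]
            prev = w
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The first layer consumes the image; every later layer consumes the output of
        # the previous layer; the output of the last layer is the feature map.
        return self.body(x)

rgb_backbone = ConvFilterBackbone(in_channels=3)   # first CNN for the RGB image
tir_backbone = ConvFilterBackbone(in_channels=1)   # second CNN for the thermal infrared image
rgb_feature_map = rgb_backbone(torch.randn(1, 3, 224, 224))
tir_feature_map = tir_backbone(torch.randn(1, 1, 224, 224))
```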
In the target tracking system 100 with multi-modal feature fusion, the thermal infrared image feature extraction module 130 is configured to pass the thermal infrared image through a second convolutional neural network model serving as a filter to obtain a thermal infrared feature map. It should be appreciated that thermal infrared images are obtained by measuring the thermal energy emitted by the tracked object and can provide the thermal energy distribution of the object. In the target tracking task, thermal infrared images are advantageous in low light or against complex backgrounds, because they are not affected by lighting conditions and can highlight the thermal energy characteristics of the object. Similarly to the processing of the RGB image, the thermal infrared image is input into the second convolutional neural network model, which captures local implicit features in the thermal infrared image through convolution, pooling and activation operations to reflect information such as the thermal energy distribution, shape and texture of the target, which then serve the subsequent feature fusion and target tracking tasks.
Accordingly, in one specific example, the thermal infrared image feature extraction module 130 is configured to: and respectively carrying out two-dimensional convolution processing, local feature matrix-based mean pooling processing and nonlinear activation processing on input data in forward transfer of layers by using each layer of the second convolution neural network model serving as a filter so as to output the thermal infrared feature map by the last layer of the second convolution neural network model.
In the target tracking system 100 with multi-modal feature fusion, the linear combination module 140 is configured to perform feature fusion on the RGB feature map and the thermal infrared feature map by linear combination to obtain a multi-modal fusion feature map. The RGB image and the thermal infrared image represent different sources of information and provide complementary features in the target tracking task; fusing the information of the two modalities exploits the advantages of both and improves the accuracy of target tracking. Specifically, the RGB feature map and the thermal infrared feature map are fused by linear combination, and the combination can be adjusted according to the characteristics of the specific task and dataset to adapt to different tracking scenes. Meanwhile, with a reasonable weight distribution, the fused feature map can better capture the key features of the target and provide a more accurate target representation, thereby improving subsequent target tracking performance.
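As a minimal sketch of this fusion step, assuming the two feature maps share the same shape and that the combination weights are simple scalars (either fixed or learnable — the application only specifies fusion by linear combination):

```python
import torch
import torch.nn as nn

class LinearFusion(nn.Module):
    """Fuse two same-shaped feature maps by a weighted linear combination."""
    def __init__(self, learnable: bool = True):
        super().__init__()
        if learnable:
            # learnable scalar weights, initialised to equal contributions
            self.alpha = nn.Parameter(torch.tensor(0.5))
            self.beta = nn.Parameter(torch.tensor(0.5))
        else:
            self.register_buffer("alpha", torch.tensor(0.5))
            self.register_buffer("beta", torch.tensor(0.5))

    def forward(self, rgb_feat: torch.Tensor, tir_feat: torch.Tensor) -> torch.Tensor:
        # multi-modal fusion feature map = alpha * RGB feature map + beta * thermal infrared feature map
        return self.alpha * rgb_feat + self.beta * tir_feat

fusion = LinearFusion()
fused_feature_map = fusion(torch.randn(1, 128, 28, 28), torch.randn(1, 128, 28, 28))
```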
In the target tracking system 100 that merges the multi-modal features, the residual dual-attention module 150 is configured to pass the multi-modal fusion feature map through a residual dual-attention mechanism model to obtain a classification feature map. The residual dual-attention mechanism model is a variant of the attention mechanism that adaptively learns the important information in a feature map and enhances meaningful feature representations. Specifically, it adaptively selects and weights the features at different positions of the feature map through an attention mechanism, and can adjust the feature weights simultaneously in the channel dimension and the spatial dimension, so that the model focuses more on the target region and the important features while suppressing background interference and irrelevant information. In addition, by introducing residual connections, the model can effectively transfer and integrate feature representations of different levels, alleviate information loss and vanishing gradients, and improve the expressive power and discriminability of the features.
Fig. 3 is a block diagram of a residual dual attention module in a target tracking system incorporating multi-modal features in accordance with an embodiment of the present application. As shown in fig. 3, the residual dual-attention module 150 includes: a spatial attention unit 151, configured to input the multi-modal fusion feature map into a spatial attention module of the residual dual-attention mechanism model to obtain a spatial attention map; a channel attention unit 152, configured to input the multimodal fusion feature map into a channel attention module of the residual dual-attention mechanism model to obtain a channel attention map; an attention fusion unit 153 for fusing the spatial attention profile and the channel attention profile to obtain a fused attention profile; an activating unit 154, configured to activate the fused attention map by inputting a Sigmoid activating function to obtain a fused attention profile; an attention applying unit 155 for calculating a weighted feature map obtained by multiplying the fused attention feature map and the multi-modal fused feature map by the position points; and a residual fusion unit 156, configured to fuse the weighted feature map and the multi-mode fusion feature map to obtain the classification feature map.
Fig. 4 is a block diagram of a spatial attention unit in a target tracking system incorporating multi-modal features in accordance with an embodiment of the present application. As shown in fig. 4, the spatial attention unit 151 includes: a spatial perception subunit 1511, configured to convolutionally encode the multi-modal fusion feature map using a convolution layer of a spatial attention module of the residual dual-attention mechanism model to obtain an initial convolution feature map; a probabilizing subunit 1512, configured to pass the initial convolution feature map through a Softmax function to obtain a spatial attention score map; a spatial attention applying subunit 1513 is configured to multiply the spatial attention score map and the multi-modal fusion feature map by location points to obtain the spatial attention map.
Fig. 5 is a block diagram of a channel attention unit in a target tracking system incorporating multi-modal features according to an embodiment of the present application. As shown in fig. 5, the channel attention unit 152 includes: a channel dimension pooling subunit 1521, configured to pool the multi-modal fusion feature map along a global average of channel dimensions to obtain a channel feature vector; a nonlinear activation subunit 1522, configured to obtain a channel weight feature vector by using a Softmax activation function on the channel feature vector; a channel attention applying subunit 1523, configured to weight each feature matrix of the multi-modal fusion feature map along a channel dimension with a feature value of each position in the channel weight feature vector as a weight to obtain the channel attention map.
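Putting the units of Fig. 3 to Fig. 5 together, a minimal PyTorch sketch of the residual dual-attention block is given below. The 3×3 convolution in the spatial branch, the softmax taken over spatial positions, and the use of element-wise addition to fuse the spatial and channel attention maps are assumptions of this sketch rather than details fixed by the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualDualAttention(nn.Module):
    """Spatial and channel attention maps are fused, activated with Sigmoid,
    applied to the input, and combined with the input through a residual connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape

        # spatial attention: convolutional encoding -> Softmax over positions -> position-wise weighting
        conv = self.spatial_conv(x)
        spatial_score = F.softmax(conv.view(b, c, -1), dim=-1).view(b, c, h, w)
        spatial_att = spatial_score * x

        # channel attention: global average pooling per channel -> Softmax -> channel-wise weighting
        chan_vec = x.mean(dim=(2, 3))                              # channel feature vector (b, c)
        chan_weight = F.softmax(chan_vec, dim=1).view(b, c, 1, 1)  # channel weight feature vector
        channel_att = chan_weight * x

        # fuse the two attention maps, activate, apply, and add the residual
        fused_att = torch.sigmoid(spatial_att + channel_att)       # fusion attention feature map
        weighted = fused_att * x                                   # weighted feature map
        return weighted + x                                        # classification feature map

block = ResidualDualAttention(channels=128)
classification_feature_map = block(torch.randn(1, 128, 28, 28))
```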
In the target tracking system 100 with multi-modal feature fusion, the feature enhancement module 160 is configured to perform autocorrelation enhancement of the feature-value quantized distribution characteristics on the classification feature map to obtain an enhanced classification feature map. In particular, in the technical scheme of the application, the RGB feature map and the thermal infrared feature map are fused by linear combination during feature fusion. If the distributions of the two feature maps do not match or differ greatly, abnormal values and outliers may appear in the fused feature map. Moreover, the residual dual-attention mechanism model may contain a large number of parameters and involve complex computation, especially when processing the multi-modal fusion feature map; such complexity may make the model overly sensitive to abnormal values and outliers, amplifying them or introducing them into the classification feature map. To reduce the abnormal values and outliers in the classification feature map, the correlation among the feature values at all positions of the classification feature map is considered: the overall quantized distribution over all positions contains important pattern features. Therefore, if this distribution characteristic is quantified from the feature values of the classification feature map and an autocorrelation enhancement is applied to it, the degree of information distillation and the certainty of the classification feature map can be improved, and the accuracy of the classification result obtained when the classification feature map is passed through the classifier is improved.
Specifically, the feature enhancement module 160 includes: the characteristic squeezing unit is used for passing the classification characteristic map through a characteristic squeezing module based on a convolution layer to obtain a squeezing classification characteristic map; the characteristic excitation unit is used for enabling the squeezing classification characteristic diagram to pass through a characteristic excitation module based on a deconvolution layer to obtain an excitation classification characteristic diagram; the cosine similarity calculation unit is used for calculating cosine similarity between channel feature vectors of every two pixel positions of the excitation classification feature map to obtain a classification feature autocorrelation matrix; the normalization unit is used for normalizing the classification characteristic autocorrelation matrix through a Softmax function to obtain an autocorrelation class attention matrix; and the strengthening unit is used for modeling the relation between any two pixel points in the excitation classification characteristic diagram by utilizing the autocorrelation class attention matrix so as to obtain the strengthening classification characteristic diagram.
In one embodiment of the present application, the reinforcement unit is configured to: model the relation between any two pixel points in the excitation classification feature map by using the autocorrelation class attention matrix through an element-by-element multiplication operation, so as to obtain the enhanced classification feature map after the association feature mapping. The correlation feature effectively aggregates the complete information of each target according to the similarity between pixels, and the calculation formula is as follows:
wherein S represents the autocorrelation class attention matrix, F_1 represents the excitation classification feature map, W represents a matrix of learnable parameters, ⊗ represents matrix multiplication, and F_2 represents the enhanced classification feature map after the association feature mapping.
That is, in the technical solution of the present application, the classification feature map is first information-purified by a squeeze-and-excitation mechanism. Then, in the step of obtaining the excitation classification feature map, a cosine similarity operation is used to obtain the relation matrix among the pixels of the excitation classification feature map; that is, the cosine similarity between the channel feature vectors of every two pixel positions of the excitation classification feature map is calculated to obtain the classification feature autocorrelation matrix. After the pixel relations are dynamically learned, the classification feature autocorrelation matrix is passed through a Softmax function, which normalizes the similarity matrix to obtain the autocorrelation class attention matrix. Finally, the relation between any two pixel points in the excitation classification feature map is modeled by using the autocorrelation class attention matrix to obtain the enhanced classification feature map.
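A sketch of this feature enhancement module is given below. The strided convolution used for squeezing, the transposed convolution used for excitation, and the reading of the reinforcement formula (not reproduced above) as F_2 = W(S · F_1) — aggregating each pixel from similar pixels via the attention matrix S and then applying the learnable matrix W — are assumptions of the sketch; even spatial dimensions are also assumed so that the excitation step restores the original resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnhancement(nn.Module):
    """Squeeze/excitation purification followed by pixel-wise autocorrelation attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.excite = nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1)
        self.learnable_w = nn.Linear(channels, channels, bias=False)  # learnable parameter matrix W

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        f1 = self.excite(self.squeeze(f))           # excitation classification feature map F_1
        pix = f1.view(b, c, -1).transpose(1, 2)     # channel feature vector of every pixel: (b, h*w, c)

        # cosine similarity between the channel feature vectors of every two pixel positions
        pix_n = F.normalize(pix, dim=-1)
        autocorr = pix_n @ pix_n.transpose(1, 2)    # classification feature autocorrelation matrix
        s = F.softmax(autocorr, dim=-1)             # autocorrelation class attention matrix S

        # aggregate each pixel's information from similar pixels, then map with W
        f2 = self.learnable_w(s @ pix)              # (b, h*w, c)
        return f2.transpose(1, 2).reshape(b, c, h, w)  # enhanced classification feature map F_2

enhance = FeatureEnhancement(channels=128)
enhanced_map = enhance(torch.randn(1, 128, 28, 28))
```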
In the target tracking system 100 that merges the multi-modal features, the tracking result generating module 170 is configured to pass the enhanced classification feature map through a classifier to obtain a classification result, where the classification result is used to indicate whether the target object deviates from the tracking range. It should be appreciated that a classifier is a machine learning model that is capable of learning input feature representations and class discriminant rules and outputting corresponding class labels or probability distributions based on the input features. Here, the enhanced classification feature map is processed by a classifier to obtain a classification result, for example, the classification result is a classification label such as "target object is out of tracking range" or "target object is in tracking range". In this way, the system can be assisted in determining if the target deviates from the original tracking area based on the classification results and taking corresponding actions, such as repositioning the target or triggering an alarm, etc.
Accordingly, in one specific example, the tracking result generating module 170 is configured to: process the enhanced classification feature map with the classifier according to the following classification formula to obtain the classification result; wherein, the classification formula is: O = softmax{(M_c, B_c) | Project(F_2)}, where Project(F_2) represents projecting the enhanced classification feature map as a vector, M_c represents the weight matrix of the fully connected layer, B_c represents the bias matrix of the fully connected layer, softmax represents the normalized exponential function, and O represents the classification result.
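A minimal sketch of this classifier step is shown below. Global average pooling is assumed for Project(F_2) (a flatten followed by a linear projection would work equally well), and two classes — "target within tracking range" and "target out of tracking range" — are assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackingClassifier(nn.Module):
    """Project the enhanced classification feature map to a vector and classify it."""
    def __init__(self, channels: int, num_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)  # holds the weight matrix M_c and bias B_c

    def forward(self, f2: torch.Tensor) -> torch.Tensor:
        v = f2.mean(dim=(2, 3))              # Project(F_2): pool the feature map into a vector
        return F.softmax(self.fc(v), dim=1)  # classification result O

classifier = TrackingClassifier(channels=128)
probs = classifier(torch.randn(1, 128, 28, 28))  # e.g. [p(out of range), p(within range)]
```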
In summary, a target tracking system with multi-modal feature fusion according to an embodiment of the present application has been illustrated. It extracts RGB features and thermal infrared features from the RGB image and the thermal infrared image acquired for target tracking by means of deep learning, fuses the two sets of features by linear combination, adaptively adjusts the feature weights in the spatial and channel dimensions with a residual attention mechanism, and finally judges through a classifier whether the target object has left the tracking range. In this way, the multi-modal features are effectively fused in target tracking, the accuracy of target tracking is improved, and a better tracking effect is achieved.
Fig. 6 is a flowchart of a target tracking method incorporating multi-modal features according to an embodiment of the present application. As shown in fig. 6, a target tracking method for fusing multi-modal features according to an embodiment of the present application includes the steps of: s110, acquiring an RGB image and a thermal infrared image of target tracking in real time; s120, passing the RGB image through a first convolution neural network model serving as a filter to obtain an RGB feature map; s130, passing the thermal infrared image through a second convolution neural network model serving as a filter to obtain a thermal infrared characteristic diagram; s140, carrying out feature fusion on the RGB feature map and the thermal infrared feature map in a linear combination mode to obtain a multi-mode fusion feature map; s150, the multi-mode fusion feature map is subjected to a residual double-attention mechanism model to obtain a classification feature map; s160, carrying out autocorrelation strengthening of characteristic value quantization distribution characteristics on the classification characteristic map to obtain a strengthened classification characteristic map; and S170, the enhanced classification characteristic diagram is passed through a classifier to obtain a classification result, wherein the classification result is used for indicating whether the target object is out of the tracking range.
Here, it will be understood by those skilled in the art that the specific operations of the respective steps in the above-described multi-modal feature-fused object tracking method have been described in detail in the above description of the multi-modal feature-fused object tracking system with reference to fig. 1 to 5, and thus, repetitive descriptions thereof will be omitted.

Claims (10)

1. A target tracking system incorporating multi-modal features, comprising:
the data acquisition module is used for acquiring an RGB image and a thermal infrared image of target tracking in real time;
the RGB image feature extraction module is used for enabling the RGB image to pass through a first convolution neural network model serving as a filter to obtain an RGB feature map;
the thermal infrared image feature extraction module is used for enabling the thermal infrared image to pass through a second convolution neural network model serving as a filter so as to obtain a thermal infrared feature map;
the linear combination module is used for carrying out feature fusion on the RGB feature map and the thermal infrared feature map in a linear combination mode so as to obtain a multi-mode fusion feature map;
the residual double-attention module is used for enabling the multi-mode fusion feature map to pass through a residual double-attention mechanism model to obtain a classification feature map;
the characteristic strengthening module is used for carrying out autocorrelation strengthening of characteristic value quantization distribution characteristics on the classification characteristic diagram so as to obtain a strengthened classification characteristic diagram;
and the tracking result generation module is used for enabling the enhanced classification characteristic diagram to pass through a classifier to obtain a classification result, wherein the classification result is used for indicating whether the target object is out of a tracking range.
2. The multi-modal feature fused target tracking system of claim 1, wherein the RGB image feature extraction module is configured to: each layer of the first convolutional neural network model used as the filter performs the following steps on input data in forward transfer of the layer:
performing convolution processing on the input data based on a two-dimensional convolution kernel to generate a convolution feature map;
carrying out mean pooling processing based on a local feature matrix on the convolution feature map to generate a pooled feature map;
non-linear activation is carried out on the characteristic values of all the positions in the pooled characteristic map so as to generate an activated characteristic map;
the output of the last layer of the first convolutional neural network model is the RGB feature map, the input of each layer from the second layer to the last layer of the first convolutional neural network model is the output of the previous layer, and the input of the first layer of the first convolutional neural network model is the RGB image.
3. The multi-modal feature fused target tracking system of claim 2, wherein the thermal infrared image feature extraction module is configured to: and respectively carrying out two-dimensional convolution processing, local feature matrix-based mean pooling processing and nonlinear activation processing on input data in forward transfer of layers by using each layer of the second convolution neural network model serving as a filter so as to output the thermal infrared feature map by the last layer of the second convolution neural network model.
4. A multi-modal feature fused target tracking system as claimed in claim 3, wherein the residual dual-attention module comprises:
the spatial attention unit is used for inputting the multi-mode fusion feature map into a spatial attention module of the residual double-attention mechanism model to obtain a spatial attention map;
the channel attention unit is used for inputting the multi-mode fusion feature map into a channel attention module of the residual double-attention mechanism model to obtain a channel attention map;
an attention fusion unit for fusing the spatial attention map and the channel attention map to obtain a fused attention map;
the activating unit is used for inputting the fusion attention map into a Sigmoid activating function to activate so as to obtain a fusion attention feature map;
the attention applying unit is used for calculating the weighted feature map obtained by multiplying the position-by-position points of the fusion attention feature map and the multi-mode fusion feature map;
and the residual fusion unit is used for fusing the weighted feature map and the multi-mode fusion feature map to obtain the classification feature map.
5. The multi-modal feature fused target tracking system of claim 4, wherein the spatial attention unit comprises:
the space perception subunit is used for carrying out convolution encoding on the multi-mode fusion feature map by using a convolution layer of a space attention module of the residual double-attention mechanism model so as to obtain an initial convolution feature map;
a probability subunit, configured to pass the initial convolution feature map through a Softmax function to obtain a spatial attention score map;
and the spatial attention applying subunit is used for multiplying the spatial attention score graph and the multi-mode fusion feature graph by position points to obtain the spatial attention graph.
6. The multi-modal feature fused target tracking system of claim 5, wherein the channel attention unit comprises:
the channel dimension pooling subunit is used for carrying out global average pooling along the channel dimension on the multi-mode fusion feature map so as to obtain a channel feature vector;
a nonlinear activation subunit, configured to pass the channel feature vector through a Softmax activation function to obtain a channel weight feature vector;
and the channel attention applying subunit is used for weighting each feature matrix of the multi-mode fusion feature map along the channel dimension by taking the feature value of each position in the channel weight feature vector as a weight so as to obtain the channel attention map.
7. The multi-modal feature fused target tracking system of claim 6, wherein the feature augmentation module comprises:
the characteristic squeezing unit is used for passing the classification characteristic map through a characteristic squeezing module based on a convolution layer to obtain a squeezing classification characteristic map;
the characteristic excitation unit is used for enabling the squeezing classification characteristic diagram to pass through a characteristic excitation module based on a deconvolution layer to obtain an excitation classification characteristic diagram;
the cosine similarity calculation unit is used for calculating cosine similarity between channel feature vectors of every two pixel positions of the excitation classification feature map to obtain a classification feature autocorrelation matrix;
the normalization unit is used for normalizing the classification characteristic autocorrelation matrix through a Softmax function to obtain an autocorrelation class attention matrix;
and the strengthening unit is used for modeling the relation between any two pixel points in the excitation classification characteristic diagram by utilizing the autocorrelation class attention matrix so as to obtain the strengthening classification characteristic diagram.
8. The multi-modal feature fused target tracking system of claim 7, wherein the reinforcement unit is configured to: modeling the relation between any two pixel points in the excitation classification feature map by using the autocorrelation class attention matrix according to the following reinforcement formula to obtain the reinforcement classification feature map after the association feature mapping; wherein, the strengthening formula is:
wherein S represents the autocorrelation class attention matrix, F_1 represents the excitation classification feature map, W represents a matrix of learnable parameters, ⊗ represents matrix multiplication, and F_2 represents the enhanced classification feature map after the association feature mapping.
9. The target tracking system for merging multi-modal features according to claim 8, wherein the tracking result generating module is configured to: processing the enhanced classification feature map with the classifier in the following classification formula to obtain the classification result;
wherein, the classification formula is: O = softmax{(M_c, B_c) | Project(F_2)}, where Project(F_2) represents projecting the enhanced classification feature map as a vector, M_c represents the weight matrix of the fully connected layer, B_c represents the bias matrix of the fully connected layer, softmax represents the normalized exponential function, and O represents the classification result.
10. The target tracking method integrating the multi-mode features is characterized by comprising the following steps of:
acquiring an RGB image and a thermal infrared image of target tracking in real time;
passing the RGB image through a first convolutional neural network model serving as a filter to obtain an RGB feature map;
the thermal infrared image is passed through a second convolution neural network model serving as a filter to obtain a thermal infrared characteristic diagram;
performing feature fusion on the RGB feature map and the thermal infrared feature map in a linear combination mode to obtain a multi-mode fusion feature map;
the multi-mode fusion feature map is subjected to a residual double-attention mechanism model to obtain a classification feature map;
performing autocorrelation strengthening of characteristic value quantization distribution characteristics on the classification characteristic map to obtain a strengthened classification characteristic map;
and the enhanced classification characteristic diagram is passed through a classifier to obtain a classification result, wherein the classification result is used for indicating whether the target object is out of the tracking range.
CN202311554028.5A 2023-11-21 2023-11-21 Target tracking system and method integrating multi-mode characteristics Withdrawn CN117522923A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311554028.5A CN117522923A (en) 2023-11-21 2023-11-21 Target tracking system and method integrating multi-mode characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311554028.5A CN117522923A (en) 2023-11-21 2023-11-21 Target tracking system and method integrating multi-mode characteristics

Publications (1)

Publication Number Publication Date
CN117522923A true CN117522923A (en) 2024-02-06

Family

ID=89760411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311554028.5A Withdrawn CN117522923A (en) 2023-11-21 2023-11-21 Target tracking system and method integrating multi-mode characteristics

Country Status (1)

Country Link
CN (1) CN117522923A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117611245A (en) * 2023-12-14 2024-02-27 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities
CN117611245B (en) * 2023-12-14 2024-05-31 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities


Similar Documents

Publication Publication Date Title
Goh et al. Micro-expression recognition: an updated review of current trends, challenges and solutions
US10346464B2 (en) Cross-modiality image matching method
CN108229267B (en) Object attribute detection, neural network training and region detection method and device
CN109902548B (en) Object attribute identification method and device, computing equipment and system
CN107239730B (en) Quaternion deep neural network model method for intelligent automobile traffic sign recognition
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
Fendri et al. Fusion of thermal infrared and visible spectra for robust moving object detection
CN111709311A (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111723822B (en) RGBD image significance detection method and system based on multi-level fusion
CN114663670A (en) Image detection method and device, electronic equipment and storage medium
CN112991350B (en) RGB-T image semantic segmentation method based on modal difference reduction
CN113052185A (en) Small sample target detection method based on fast R-CNN
Das et al. Automated Indian sign language recognition system by fusing deep and handcrafted feature
WO2019167784A1 (en) Position specifying device, position specifying method, and computer program
CN117252904B (en) Target tracking method and system based on long-range space perception and channel enhancement
JP7327077B2 (en) Road obstacle detection device, road obstacle detection method, and road obstacle detection program
CN116824625A (en) Target re-identification method based on generation type multi-mode image fusion
Gao et al. CI-Net: Contextual information for joint semantic segmentation and depth estimation
US20230196841A1 (en) Behavior recognition artificial intelligence network system and method for efficient recognition of hand signals and gestures
CN113033587A (en) Image recognition result evaluation method and device, electronic equipment and storage medium
CN117522923A (en) Target tracking system and method integrating multi-mode characteristics
CN116110118A (en) Pedestrian re-recognition and gait recognition method based on space-time feature complementary fusion
CN114494934A (en) Unsupervised moving object detection method based on information reduction rate
Vaidya et al. Lightweight Hardware Architecture for Object Detection in Driver Assistance Systems
CN113728355A (en) Image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20240206