CN112949451B - Cross-modal target tracking method and system through modal perception feature learning - Google Patents

Cross-modal target tracking method and system through modal perception feature learning

Info

Publication number
CN112949451B
CN112949451B CN202110214908.2A
Authority
CN
China
Prior art keywords
tracking
network
modal
target
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110214908.2A
Other languages
Chinese (zh)
Other versions
CN112949451A (en)
Inventor
李成龙
白曼
朱天昊
汤进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202110214908.2A priority Critical patent/CN112949451B/en
Publication of CN112949451A publication Critical patent/CN112949451A/en
Application granted granted Critical
Publication of CN112949451B publication Critical patent/CN112949451B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/48 - Matching video sequences

Abstract

The invention provides a cross-modal target tracking method and system through modal perception feature learning. A modality-aware fusion module is added to the original VGG network so as to adaptively combine effective features from the two branches, and a two-stage learning algorithm is developed to train the network efficiently and effectively. In the training phase, the classification loss of training samples from either modality is back-propagated to both modality branches; modality information is available during training but not during testing. The two-stage learning algorithm effectively trains the proposed network and retrains the final fully-connected layer; the modality-aware fusion module adaptively combines effective features from the two branches and takes either modality as input, so as to bridge the appearance gap between the RGB and near-infrared modalities.

Description

Cross-modal target tracking method and system through modal perception feature learning
Technical Field
The invention relates to the field of computer vision, in particular to a cross-modal target tracking method and a cross-modal target tracking system through modal perception feature learning.
Background
Visual tracking is an important problem in computer vision and plays a key role in vision systems such as visual surveillance, intelligent transportation and robotics. Existing tracking methods are usually based on RGB image sequences, which are sensitive to illumination changes, so some targets cannot be imaged effectively under low-light conditions; in such cases the tracking performance of existing methods may degrade significantly. Introducing data from other modalities, such as depth and thermal infrared, is an effective way to overcome the imaging limitations of a single RGB source. Depth sensors provide valuable additional depth information and improve tracking results by handling occlusion and model drift robustly, but their range is limited (roughly 4-5 meters) and they are restricted to indoor environments. Thermal sensors are usually independent of RGB sensors and have very different visual characteristics, so considerable effort is required for platform design and frame alignment; they are therefore not yet used in many practical applications.
Near-infrared imaging is an essential part of many surveillance cameras, whose imaging switches between RGB and near infrared according to the light intensity, as shown in fig. 3. This imaging scheme effectively overcomes the imaging limitation of the RGB source under weak light while avoiding the imaging and platform problems of existing multi-modal vision systems; FIG. 4 shows a typical example of near-infrared camera imaging. However, the near-infrared and RGB modalities are heterogeneous, as shown on the left side of fig. 3: their visual properties differ greatly, so the appearance of the target object is completely different in the two modalities. This appearance gap poses a significant challenge for visual tracking.
For example, application number CN201911146615.4 discloses a human body detection and tracking method based on multi-modal information perception: the color camera and the depth camera are calibrated and the data are filtered; based on multi-modal information perception, the body and the head of a person are detected in the color image and the depth image respectively, and the two detection results are fused according to the spatial proportion of head to body; based on multi-modal information perception, the body and the head are tracked in the color image and the depth image respectively with a kernelized correlation filter tracking algorithm, and a model of the tracked object is built; the tracking mechanism is then refined using spatial constraints between the tracked-object model and the head-to-body ratio. That method, based on multi-modal information perception, overcomes the shortcomings of vision-only detection and tracking methods, has wide application in the field of indoor service robots, and supports functions such as human-computer interaction and user following. However, it targets depth and visible-light data and requires calibration to align two cameras, so it is not applicable to a single camera that switches between modalities.
Existing tracking work has not investigated this challenging problem. We therefore study a new topic, cross-modal target tracking, aiming to solve the following two problems:
1) in many vision systems, visual tracking is based on RGB image sequences, in which some targets cannot be imaged effectively under low-light conditions, severely degrading tracking performance;
2) introducing other modalities such as depth and infrared data can overcome the imaging limitations of a single source, but multi-modal imaging platforms typically require careful design and are not yet applicable in many practical settings.
Based on the above analysis, the key research questions of the present application are: how to create a video benchmark dataset to facilitate research and development of cross-modal target tracking, and how to design a suitable baseline algorithm that reduces the appearance gap between the RGB and near-infrared modalities, achieves robust cross-modal target tracking, and reduces the influence of modality-switching moments on tracking performance.
Disclosure of Invention
The invention aims to provide a cross-modal target tracking method.
The invention solves the technical problems through the following technical means:
the cross-modal target tracking method through modal perception feature learning comprises the following steps:
s01, acquiring training sample data; the training sample data comprises at least a first sample set and a second sample set; the first sample set includes at least two subsets of known-modality samples; the second sample set comprises full-frame data covering all modalities of the first sample set, the modality of each frame being unknown;
s02, network training, wherein the network training is divided into two-stage learning; in the first stage, the two kinds of modality data in the first sample set are used to train the corresponding branch networks of the VGG network without the modal perception fusion module; in the second stage, the modal perception fusion module is added to the VGG network obtained from the first stage, and the network is then trained with the second sample set to obtain a target network with the modal perception fusion module;
s03, target tracking, wherein the target network performs target tracking on the multi-modal video data and outputs a tracking result; before the result is output, a meta-updater controller judges whether the tracking result is successful; if so, tracking samples are collected to update the sample pool online, and the target network uses the latest tracking result in the sample pool when tracking the target in the next frame.
The present invention introduces a new task, called cross-modal object tracking. A modality-aware fusion module is added to the original VGG network to adaptively combine effective features from the two branches, and a two-stage learning algorithm is developed to train the network efficiently and effectively. In the training phase, the classification loss of training samples from either modality is back-propagated to both modality branches; modality information is available during training but not during testing. The two-stage learning algorithm effectively trains the proposed network and retrains the final fully-connected layer. Specifically, the parameters of the base network are first initialized with a pre-trained VGG-M model. In the first training stage, the modality-aware fusion module is removed and each modality is used on its own to train the corresponding branch network. In the second training stage, images of both modalities are used as input; since it is not known which modality the current frame belongs to, all frames are used to learn the modality-aware fusion module and the classifier while the parameters of the base network are fine-tuned. The invention thus designs a modality-aware fusion module that adaptively combines effective features from the two branches and takes either modality as input, so as to bridge the appearance gap between the RGB and near-infrared modalities.
Further, the modalities in the step S01 at least include near infrared and visible light modalities.
Further, in the first stage learning in step S02, each branch adopts an Inception network.
Further, the specific method of target tracking in step S03 is as follows:
A first frame in which the target to be tracked is marked is first input to the target network, and tracking starts from the second frame. When the t-th frame is tracked, 256 candidate regions are sampled from a Gaussian distribution around the latest tracking result in the sample pool; binary classification with the softmax cross-entropy loss and the instance embedding loss distinguishes foreground from background; the target network then computes the scores of these candidate regions, and the average bounding box of the 5 highest-scoring candidate samples is selected as the tracking result of the t-th frame.
The invention also provides a cross-modal target tracking system through modal perception feature learning, which comprises:
a training sample data acquisition module, wherein the training sample data at least comprises a first sample set and a second sample set; the first sample set includes at least two subsets of known-modality samples; the second sample set comprises full-frame data covering all modalities of the first sample set, the modality of each frame being unknown;
a network training module, wherein the network training is divided into two-stage learning; in the first stage, the two kinds of modality data in the first sample set are used to train the corresponding branch networks of the VGG network without the modal perception fusion module; in the second stage, the modal perception fusion module is added to the VGG network obtained from the first stage, and the network is then trained with the second sample set to obtain a target network with the modal perception fusion module;
a target tracking module, wherein the target network performs target tracking on the multi-modal video data and outputs a tracking result; before the result is output, a meta-updater controller judges whether the tracking result is successful; if so, tracking samples are collected to update the sample pool online, and the target network uses the latest tracking result in the sample pool when tracking the target in the next frame.
Furthermore, the modalities of the training sample data acquisition module at least include two modalities, namely near infrared modality and visible light modality.
Furthermore, in the first stage learning in the network training module, each branch adopts an Inception network.
Further, a specific method for tracking the target in the target tracking module is as follows:
A first frame in which the target to be tracked is marked is first input to the target network, and tracking starts from the second frame. When the t-th frame is tracked, 256 candidate regions are sampled from a Gaussian distribution around the latest tracking result in the sample pool; binary classification with the softmax cross-entropy loss and the instance embedding loss distinguishes foreground from background; the target network then computes the scores of these candidate regions, and the average bounding box of the 5 highest-scoring candidate samples is selected as the tracking result of the t-th frame.
The present invention also provides a processing device comprising at least one processor, and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the methods described above.
The invention also provides a computer-readable storage medium storing computer instructions for causing the computer to perform the method as described above.
The invention has the advantages that:
the invention adds a modal perception module on the original basis of the VGG network to adaptively combine the effective characteristics from two branches; a two-stage learning algorithm is developed to train the efficiency and effectiveness of the network. In the training phase, the classification loss of training samples of any modality will be propagated back to the two modality branches. Modal information is available during the training phase and not available during the testing phase; a two-stage learning algorithm is used to efficiently train the proposed network, retraining the final fully-connected layer, specifically, first initializing the parameters of the base network in VGG-M using a pre-trained model. In a first training phase, the modality-aware fusion module is removed and a single modality is used to train the corresponding branch network. In the second training phase, images with two modes are used as input, and since it is not known which mode the current frame belongs to, all frames are used to learn the mode perception fusion module and the classifier, and parameters of the basic network are fine-tuned. The invention designs a modal perception fusion module to adaptively combine effective characteristics from two branches, and takes any mode as input to solve the appearance difference between RGB and near infrared modes.
Drawings
FIG. 1 is a schematic flow chart of the overall process of the method in the embodiment of the invention.
FIG. 2 shows feature maps under the modality-switch challenge in an embodiment of the invention: the input is two adjacent frames around the position where the modality switch occurs (first column); the second column visualizes the feature map of the baseline tracker RT-MDNet; the third column visualizes the feature map of the method of the invention.
FIG. 3 illustrates the heterogeneity between the RGB and near-infrared modalities in an embodiment of the invention: when the light intensity drops below normal, the camera switches from RGB imaging to near-infrared imaging, and vice versa.
FIG. 4 is a typical example sequence in an embodiment of the invention: several key frames are shown, with the bounding box representing the ground truth; as the light intensity changes, the imaging switches between RGB and near infrared.
Fig. 5 visualizes target feature sequences in an embodiment of the invention: the left plot shows the feature distribution predicted by the baseline tracker and the right plot the feature distribution of the method of the invention, where circles and triangles represent target features of the RGB and near-infrared modalities, respectively.
FIG. 6 illustrates the two-stage learning algorithm in an embodiment of the invention: in the first stage, the modality-aware fusion module is removed and each modality sample set trains the corresponding branch network; in the second stage, the modality-aware fusion module is trained with the sample sets of both modalities.
Fig. 7 and 8 are test comparison tables of the method of the present invention and other trackers in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment provides a cross-modal target tracking method through modal perception feature learning, which comprises the following steps:
step 01, acquiring training sample data; the training sample data comprises at least a first sample set and a second sample set; the first sample set comprises at least two subsets of known modality samples; the second sample set comprises full frame data comprising all the modalities of the first sample set, each frame data modality being unknown; the first sample set and the second sample set are both video data that have been labeled with objects, and may be 500 positive samples and 5000 negative samples taken in the first frame with the initial bounding box, which are greater than 0.7 and less than 0.3, respectively, from IoU of the initial bounding box. The modes in this embodiment include at least two modes, near infrared and visible light.
Step 02, network training, wherein the network training is divided into two-stage learning: in the first stage, the two kinds of modality data in the first sample set are used to train the corresponding branch networks of the VGG network without the modal perception fusion module; in the second stage, the modal perception fusion module is added to the VGG network obtained from the first stage, and the network is then trained with the second sample set to obtain a target network with the modal perception fusion module.
The present embodiment uses a VGG network as a base network to extract features of two modes, and the kernel sizes of the three convolutional layers are 7 × 7, 5 × 5, and 3 × 3, respectively.
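For concreteness, a minimal PyTorch sketch of such a three-convolution backbone is given below; the kernel sizes follow the description above, while the channel widths, strides and normalization layers follow the common VGG-M configuration and are assumptions not stated in this embodiment.

```python
import torch.nn as nn

class BaseBackbone(nn.Module):
    """Three-layer convolutional feature extractor with 7x7, 5x5 and 3x3 kernels,
    loosely following the VGG-M layout assumed here."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2),    # conv1: 7x7
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2),   # conv2: 5x5
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1),  # conv3: 3x3
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.features(x)
```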
This embodiment proposes a two-stage learning algorithm to train the proposed network effectively. In the first learning stage the modality-aware fusion module is removed; a two-stream CNN is designed that extracts the RGB and NIR representations in parallel, and each modality sample set is used to train the corresponding branch network. Each branch adopts an Inception network, which enables efficient computation.
In the second training stage, because the modality of the current frame is unknown, the modality-aware fusion module is added after the two branches of the VGG network obtained from the first stage; all frames are used to learn the fusion module and the classifier while the parameters of the base network are fine-tuned. The final result is shown in fig. 5, with the features predicted by the baseline tracker on the left and the features predicted by the invention on the right; circles and triangles represent the target features of the RGB and NIR modalities, respectively. The results show that the heterogeneous gap between the different modalities of the target object is alleviated to some extent.
Step 03, target tracking, namely performing target tracking on the multi-modal video data with the target network and outputting a tracking result. Owing to the particularity of cross-modal target tracking, two branches are designed to capture modality-specific representations, and a modality-aware fusion module is designed: a first 1 × 1 convolutional layer captures the general representation under the attribute; the result is then split into two streams by two further half-channel 1 × 1 convolutional layers, which reduce the dimensionality of the filter space; a 3 × 3 receptive field is realized in the streams in two ways so as to perceive regions of different sizes, one using a conventional 3 × 3 filter and the other a cascade of 1 × 3 and 3 × 1 filters; their outputs are concatenated as the attribute-specific representation. Given an input of either modality, the modality-aware fusion module adaptively integrates the features of the two branch outputs, which demonstrates its effectiveness, as shown in fig. 6.
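A minimal PyTorch sketch of a fusion module with this structure follows; the channel sizes, and the element-wise sum used to merge the two branch features before the first 1 × 1 convolution, are illustrative assumptions rather than details fixed by the embodiment.

```python
import torch
import torch.nn as nn

class ModalityAwareFusion(nn.Module):
    """First 1x1 conv captures a general representation, two half-channel 1x1
    convs split it into two streams, one stream uses a plain 3x3 conv and the
    other a 1x3 + 3x1 cascade, and the stream outputs are concatenated."""
    def __init__(self, in_channels=512, mid_channels=128):
        super().__init__()
        half = mid_channels // 2
        self.general = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.reduce_a = nn.Conv2d(mid_channels, half, kernel_size=1)
        self.reduce_b = nn.Conv2d(mid_channels, half, kernel_size=1)
        self.stream_a = nn.Conv2d(half, half, kernel_size=3, padding=1)
        self.stream_b = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=(1, 3), padding=(0, 1)),
            nn.Conv2d(half, half, kernel_size=(3, 1), padding=(1, 0)),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feat_rgb_branch, feat_nir_branch):
        # Merge the two branch outputs (the sum is an assumption), then build the
        # attribute-specific representation from the two receptive-field streams.
        x = self.relu(self.general(feat_rgb_branch + feat_nir_branch))
        a = self.relu(self.stream_a(self.reduce_a(x)))
        b = self.relu(self.stream_b(self.reduce_b(x)))
        return torch.cat([a, b], dim=1)
```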
In the target tracking process, the target is marked in the first frame and tracking starts from the second frame. During tracking, 50 positive samples with IoU greater than 0.7 and 200 negative samples with IoU less than 0.3 with respect to the current-frame tracking result are collected to adapt to changes of the target. The two branches capture modality-specific representations, and the modality-aware fusion module built from 1 × 1 convolutional layers adaptively integrates the features of the two branch outputs given an input of either modality.
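A minimal sketch of this IoU-based online sample collection, assuming boxes in (x, y, w, h) form; the thresholds and counts follow the values given above.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return inter / (aw * ah + bw * bh - inter + 1e-8)

def collect_update_samples(candidate_boxes, result_box, n_pos=50, n_neg=200):
    """Keep boxes with IoU > 0.7 to the current result as positives and
    boxes with IoU < 0.3 as negatives, up to the requested counts."""
    pos = [b for b in candidate_boxes if iou(b, result_box) > 0.7][:n_pos]
    neg = [b for b in candidate_boxes if iou(b, result_box) < 0.3][:n_neg]
    return pos, neg
```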
In order to learn a representation that distinguishes the target from the background, the loss used by the method of the present invention is a classification loss. Given an input image x^d of domain d and a bounding box R, the output score f^d of the network, i.e. the activation of the last fully connected layer, is:

f^d = φ^d(x^d; R)
where φ^d(·) denotes the two-dimensional binary classification score of the last fully connected layer (fc6) in domain d, and D is the number of domains in the training dataset. The output features are passed to a softmax function for binary classification, which decides whether the bounding box R is target or background in domain d; the output features are also subjected to multi-domain instance discrimination through another softmax operator. The softmax function is:

[σ(f^d)]_i = exp(f_i^d) / Σ_{j=1}^{2} exp(f_j^d),  i = 1, 2
For the classification loss, the binary classification loss of domain d(k) = (k mod D) in the k-th iteration is:

L_cls = -(1/N) Σ_{i=1}^{N} Σ_{c∈{+,-}} [y_i = c] · log([σ(f_i^{d(k)})]_c)

where N is the number of samples in a mini-batch and y_i is the binary label of the i-th sample.
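In code, this per-domain binary classification term is a softmax cross-entropy over the two-dimensional fc6 scores; the following PyTorch sketch assumes `scores` of shape (N, 2) and labels 1 for target and 0 for background.

```python
import torch
import torch.nn.functional as F

def binary_classification_loss(scores, labels):
    """Mean negative log-probability of the correct class, i.e. the softmax
    cross-entropy used as the binary classification loss above."""
    log_probs = F.log_softmax(scores, dim=1)               # (N, 2)
    picked = log_probs[torch.arange(scores.size(0)), labels]
    return -picked.mean()

# In the k-th iteration only the fc6 branch of domain d(k) = k mod D is active,
# and the resulting loss is back-propagated through both modality branches.
```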
before the output of the tracking target, a meta-update controller is used for judging whether the tracking result is successfully tracked or not so as to adapt to the change in the target tracking process. And if the tracking is successful, collecting a tracking sample to perform online updating on the sample pool, and judging the target tracking of the next frame by the target network by using the current latest tracking result of the sample pool. In the embodiment, long-term updating is preset to be performed every 10 frames, firstly, a RoI alignment pooling layer is introduced, and the features of the candidate region are directly extracted from the feature map, so that the feature extraction in the tracking process is greatly accelerated. The last fully connected layer (fc6) learning rate is set to 0.003, the other layer (fc4-5) learning rate is set to 0.0015, and the epoch number is set to 15. Three fully-connected layers (fc4-6) are then used to accommodate the appearance changes of the instances in different videos and frames. And (4) tracking failure in the current frame, training the hyper-parameters and long-term updating, and performing short-term updating when the hyper-parameters and the long-term updating are the same. In tracking the t-th frame, 256 candidate regions are sampled with a gaussian distribution around the latest frame tracking result in the sample pool. Finally, the cross entropy loss and the instance embedding loss of softmax are utilized to carry out binary classification so as to distinguish the foreground from the background. The scores for these candidate regions are then calculated using the trained network. And selecting the average bounding box of the 5 candidate samples with the highest score as the tracking result of the t-th frame.
The present invention introduces a new task, called cross-modal object tracking. A modality-aware fusion module is added to the original VGG network to adaptively combine effective features from the two branches, and a two-stage learning algorithm is developed to train the network efficiently and effectively. In the training phase, the classification loss of training samples from either modality is back-propagated to both modality branches; modality information is available during training but not during testing. The two-stage learning algorithm effectively trains the proposed network and retrains the final fully-connected layer. Specifically, the parameters of the base network are first initialized with a pre-trained VGG-M model. In the first training stage, the modality-aware fusion module is removed and each modality is used on its own to train the corresponding branch network. In the second training stage, images of both modalities are used as input; since it is not known which modality the current frame belongs to, all frames are used to learn the modality-aware fusion module and the classifier while the parameters of the base network are fine-tuned. The invention thus designs a modality-aware fusion module that adaptively combines effective features from the two branches and takes either modality as input, so as to bridge the appearance gap between the RGB and near-infrared modalities.
Fig. 7 and 8 show the experimental results of the present invention: the method was tested on the public dataset CMOTB and compared with other trackers in terms of SR (success rate), PR (precision rate) and NPR (normalized precision rate). MArMOT denotes the tracking accuracy of the method of the invention; compared with the other methods, its tracking performance is consistently and substantially improved. In fig. 7 and 8, Trackers is the name of the tracker;
MArMOT is Cross-Modal Object Tracking via Modality-Aware Feature Learning (the method of the invention);
LTMU is High-Performance Long-Term Tracking with Meta-Updater;
ATOM is ATOM: Accurate Tracking by Overlap Maximization;
SiamRPN++ is SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks;
SiamBAN is Siamese Box Adaptive Network for Visual Tracking;
VITAL is VITAL: Visual Tracking via Adversarial Learning;
SiamMask is Fast Online Object Tracking and Segmentation: A Unifying Approach;
MDNet is Learning Multi-Domain Convolutional Neural Networks for Visual Tracking;
DaSiamRPN is Distractor-aware Siamese Networks for Visual Object Tracking;
TACT is Visual Tracking by TridentAlign and Context Embedding;
SiamFC is Fully-Convolutional Siamese Networks for Object Tracking;
GradNet is GradNet: Gradient-Guided Network for Visual Object Tracking;
SiamDW is Deeper and Wider Siamese Networks for Real-Time Visual Tracking;
SPLT is 'Skimming-Perusal' Tracking: A Framework for Real-Time and Robust Long-term Tracking;
RT-MDNet is Real-Time MDNet;
Ocean is Ocean: Object-aware Anchor-free Tracking;
GlobalTrack is GlobalTrack: A Simple and Strong Baseline for Long-term Tracking.
The present embodiment further provides a cross-modal target tracking system through modality-aware feature learning, including:
a training sample data acquisition module; the training sample data comprises at least a first sample set and a second sample set; the first sample set includes at least two subsets of known-modality samples; the second sample set consists of full-frame data covering all modalities of the first sample set, with the modality of each frame unknown. Both sample sets are video data in which the target has been annotated; for example, 500 positive samples and 5000 negative samples, whose IoU with the initial bounding box is greater than 0.7 and less than 0.3 respectively, may be drawn around the initial bounding box in the first frame. The modalities in this embodiment include at least near infrared and visible light.
The network training module divides the network training into two-stage learning: in the first stage, the two kinds of modality data in the first sample set are used to train the corresponding branch networks of the VGG network without the modal perception fusion module; in the second stage, the modal perception fusion module is added to the VGG network obtained from the first stage, and the network is then trained with the second sample set to obtain a target network with the modal perception fusion module.
This embodiment uses the VGG network as a base network to extract features of two modes, the kernel sizes of the three convolutional layers being 7 × 7, 5 × 5, and 3 × 3, respectively.
In this embodiment, a two-stage learning algorithm is proposed to train the proposed network effectively. In the first learning stage the modality-aware fusion module is removed; a two-stream CNN is designed that extracts the RGB and NIR representations in parallel, and each modality sample set is used to train the corresponding branch network. Each branch adopts an Inception network, which enables efficient computation.
In the second training stage, because it is not known which modality the current frame belongs to, all frames are used to learn the modality-aware fusion module and the classifier while the parameters of the base network are fine-tuned. The final result is shown in fig. 5, with the features predicted by the baseline tracker on the left and the features predicted by the invention on the right; circles and triangles represent the target features of the RGB and NIR modalities, respectively. The results show that the heterogeneous gap between the different modalities of the target object is alleviated to some extent.
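A minimal sketch of this two-stage schedule in PyTorch; the submodule names (`rgb_branch`, `nir_branch`, `fusion`, `fc_layers`), the `forward_branch` helper, the learning rates and the epoch counts are hypothetical placeholders, not values fixed by the embodiment.

```python
import torch
import torch.nn.functional as F

def train_two_stage(model, rgb_loader, nir_loader, mixed_loader,
                    stage1_epochs=50, stage2_epochs=50):
    # Stage 1: fusion module removed, each modality trains its own branch.
    for modality, branch, loader in (("rgb", model.rgb_branch, rgb_loader),
                                     ("nir", model.nir_branch, nir_loader)):
        opt = torch.optim.SGD(branch.parameters(), lr=1e-4, momentum=0.9)
        for _ in range(stage1_epochs):
            for img, box, label in loader:
                loss = F.cross_entropy(model.forward_branch(img, box, modality), label)
                opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: all frames (modality unknown) train the fusion module and the
    # classifier, while the base network is fine-tuned with a smaller rate.
    opt = torch.optim.SGD([
        {"params": model.fusion.parameters(), "lr": 1e-3},
        {"params": model.fc_layers.parameters(), "lr": 1e-3},
        {"params": model.rgb_branch.parameters(), "lr": 1e-4},
        {"params": model.nir_branch.parameters(), "lr": 1e-4},
    ], lr=1e-4, momentum=0.9)
    for _ in range(stage2_epochs):
        for img, box, label in mixed_loader:
            loss = F.cross_entropy(model(img, box), label)
            opt.zero_grad(); loss.backward(); opt.step()
```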
The target tracking module is used for performing target tracking on the multi-modal video data with the target network and outputting a tracking result. Owing to the particularity of cross-modal target tracking, two branches are designed to capture modality-specific representations, and a modality-aware fusion module is designed: a first 1 × 1 convolutional layer captures the general representation under the attribute; the result is then split into two streams by two further half-channel 1 × 1 convolutional layers, which reduce the dimensionality of the filter space; a 3 × 3 receptive field is realized in the streams in two ways so as to perceive regions of different sizes, one using a conventional 3 × 3 filter and the other a cascade of 1 × 3 and 3 × 1 filters; their outputs are concatenated as the attribute-specific representation. Given an input of either modality, the modality-aware fusion module adaptively integrates the features of the two branch outputs, which demonstrates its effectiveness, as shown in fig. 6.
In the target tracking process, the target is marked in the first frame and tracking starts from the second frame. During tracking, 50 positive samples with IoU greater than 0.7 and 200 negative samples with IoU less than 0.3 with respect to the current-frame tracking result are collected to adapt to changes of the target. The two branches capture modality-specific representations, and the modality-aware fusion module built from 1 × 1 convolutional layers adaptively integrates the features of the two branch outputs given an input of either modality.
In order to learn a representation that distinguishes the target from the background, the loss used by the method of the present invention is a classification loss. Given an input image x^d of domain d and a bounding box R, the output score f^d of the network, i.e. the activation of the last fully connected layer, is:

f^d = φ^d(x^d; R)
where φ^d(·) denotes the two-dimensional binary classification score of the last fully connected layer (fc6) in domain d, and D is the number of domains in the training dataset. The output features are passed to a softmax function for binary classification, which decides whether the bounding box R is target or background in domain d; the output features are also subjected to multi-domain instance discrimination through another softmax operator. The softmax function is:

[σ(f^d)]_i = exp(f_i^d) / Σ_{j=1}^{2} exp(f_j^d),  i = 1, 2
For the classification loss, the binary classification loss of domain d(k) = (k mod D) in the k-th iteration is:

L_cls = -(1/N) Σ_{i=1}^{N} Σ_{c∈{+,-}} [y_i = c] · log([σ(f_i^{d(k)})]_c)

where N is the number of samples in a mini-batch and y_i is the binary label of the i-th sample.
and the online updating module judges whether the tracking result is successfully tracked or not by using the meta-update updating controller before the tracking result is output so as to adapt to the change in the target tracking process. And if the tracking is successful, collecting a tracking sample to perform online updating on the sample pool, and judging the target tracking of the next frame by the target network by using the current latest tracking result of the sample pool. In the embodiment, long-term updating is preset to be performed every 10 frames, firstly, a RoI alignment pooling layer is introduced, and the features of the candidate region are directly extracted from the feature map, so that the feature extraction in the tracking process is greatly accelerated. The last fully connected layer (fc6) learning rate is set to 0.003, the other layer (fc4-5) learning rate is set to 0.0015, and the epoch number is set to 15. Three fully-connected layers (fc4-6) are then used to accommodate the appearance changes of the instances in different videos and frames. And (4) tracking failure in the current frame, training the hyper-parameters and long-term updating, and performing short-term updating when the hyper-parameters and the long-term updating are the same. In tracking the t-th frame, 256 candidate regions are sampled with a gaussian distribution around the latest frame tracking result in the sample pool. Finally, the cross entropy loss and the instance embedding loss of softmax are utilized to carry out binary classification so as to distinguish the foreground from the background. The scores for these candidate regions are then calculated using the trained network. And selecting the average bounding box of the 5 candidate samples with the highest scores as the tracking result of the t-th frame.
Fig. 7 and 8 show the experimental results of the present invention: the method was tested on the public dataset CMOTB and compared with other trackers in terms of SR (success rate), PR (precision rate) and NPR (normalized precision rate). MArMOT denotes the tracking accuracy of the method of the invention; compared with the other methods, its tracking performance is consistently and substantially improved. In fig. 7 and 8, Trackers is the name of the tracker;
MArMOT is Cross-Modal Object Tracking via Modality-Aware Feature Learning (the method of the present invention);
LTMU is High-Performance Long-Term Tracking with Meta-Updater;
ATOM is ATOM: Accurate Tracking by Overlap Maximization;
SiamRPN++ is SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks;
SiamBAN is Siamese Box Adaptive Network for Visual Tracking;
VITAL is VITAL: Visual Tracking via Adversarial Learning;
SiamMask is Fast Online Object Tracking and Segmentation: A Unifying Approach;
MDNet is Learning Multi-Domain Convolutional Neural Networks for Visual Tracking;
DaSiamRPN is Distractor-aware Siamese Networks for Visual Object Tracking;
TACT is Visual Tracking by TridentAlign and Context Embedding;
SiamFC is Fully-Convolutional Siamese Networks for Object Tracking;
GradNet is GradNet: Gradient-Guided Network for Visual Object Tracking;
SiamDW is Deeper and Wider Siamese Networks for Real-Time Visual Tracking;
SPLT is 'Skimming-Perusal' Tracking: A Framework for Real-Time and Robust Long-term Tracking;
RT-MDNet is Real-Time MDNet;
Ocean is Ocean: Object-aware Anchor-free Tracking;
GlobalTrack is GlobalTrack: A Simple and Strong Baseline for Long-term Tracking.
The present embodiments also provide a processing device, comprising at least one processor, and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, which when called upon are capable of performing the methods described above.
The present embodiments also provide a computer-readable storage medium storing computer instructions that cause the computer to perform the method as described above.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. The cross-modal target tracking method through modal perception feature learning is characterized by comprising the following steps of:
s01, acquiring training sample data; the training sample data comprises at least a first sample set and a second sample set; the first sample set includes at least two subsets of known-modality samples; the second sample set comprises full-frame data covering all modalities of the first sample set, the modality of each frame being unknown;
s02, network training, wherein the network training is divided into two-stage learning; in the first stage, the two kinds of modality data in the first sample set are used to train the corresponding branch networks of the VGG network without the modal perception fusion module; in the second stage, the modal perception fusion module is added to the VGG network obtained from the first stage, and the network is then trained with the second sample set to obtain a target network with the modal perception fusion module;
s03, target tracking, wherein the target network performs target tracking on the multi-modal video data and outputs a tracking result; before the result is output, a meta-updater controller judges whether the tracking result is successful; if so, a tracking sample is collected to update the sample pool online, and the target network uses the latest tracking result in the sample pool when tracking the target in the next frame;
the specific method for tracking the target in step S03 is as follows:
a first frame in which the target to be tracked is marked is first input to the target network, and tracking starts from the second frame; when the t-th frame is tracked, 256 candidate regions are sampled from a Gaussian distribution around the latest tracking result in the sample pool; binary classification with the softmax cross-entropy loss and the instance embedding loss distinguishes foreground from background; the target network then computes the scores of these candidate regions, and the average bounding box of the 5 highest-scoring candidate samples is selected as the tracking result of the t-th frame.
2. The method for cross-modal target tracking through modal-aware feature learning according to claim 1, wherein the modalities of step S01 include at least two modalities of near infrared and visible light.
3. The method for cross-modal object tracking through modal-aware feature learning according to claim 1, wherein in the first stage learning in step S02, an Inception network is adopted for each branch.
4. A cross-modal target tracking system through modality-aware feature learning, comprising:
a training sample data acquisition module, wherein the training sample data at least comprises a first sample set and a second sample set; the first sample set includes at least two subsets of known-modality samples; the second sample set comprises full-frame data covering all modalities of the first sample set, the modality of each frame being unknown;
a network training module, wherein the network training is divided into two-stage learning; in the first stage, the two kinds of modality data in the first sample set are used to train the corresponding branch networks of the VGG network without the modal perception fusion module; in the second stage, the modal perception fusion module is added to the VGG network obtained from the first stage, and the network is then trained with the second sample set to obtain a target network with the modal perception fusion module;
a target tracking module, wherein the target network performs target tracking on the multi-modal video data and outputs a tracking result; before the result is output, a meta-updater controller judges whether the tracking result is successful; if so, a tracking sample is collected to update the sample pool online, and the target network uses the latest tracking result in the sample pool when tracking the target in the next frame;
the specific method for tracking the target in the target tracking module comprises the following steps:
a first frame in which the target to be tracked is marked is first input to the target network, and tracking starts from the second frame; when the t-th frame is tracked, 256 candidate regions are sampled from a Gaussian distribution around the latest tracking result in the sample pool; binary classification with the softmax cross-entropy loss and the instance embedding loss distinguishes foreground from background; the target network then computes the scores of these candidate regions, and the average bounding box of the 5 highest-scoring candidate samples is selected as the tracking result of the t-th frame.
5. The system according to claim 4, wherein the modalities of the training sample data acquisition module include at least two modalities of near infrared and visible light.
6. The system according to claim 4, wherein in the first stage learning in the network training module, an Inception network is used for each branch.
7. A processing device comprising at least one processor and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, the processor being capable of executing the method of any one of claims 1 to 3 when invoked by the processor.
8. A computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 3.
CN202110214908.2A 2021-02-24 2021-02-24 Cross-modal target tracking method and system through modal perception feature learning Active CN112949451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110214908.2A CN112949451B (en) 2021-02-24 2021-02-24 Cross-modal target tracking method and system through modal perception feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110214908.2A CN112949451B (en) 2021-02-24 2021-02-24 Cross-modal target tracking method and system through modal perception feature learning

Publications (2)

Publication Number Publication Date
CN112949451A CN112949451A (en) 2021-06-11
CN112949451B true CN112949451B (en) 2022-09-09

Family

ID=76246387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110214908.2A Active CN112949451B (en) 2021-02-24 2021-02-24 Cross-modal target tracking method and system through modal perception feature learning

Country Status (1)

Country Link
CN (1) CN112949451B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920171B (en) * 2021-12-09 2022-10-25 南京理工大学 Bimodal target tracking method based on feature level and decision level fusion
CN116188528B (en) * 2023-01-10 2024-03-15 中国人民解放军军事科学院国防科技创新研究院 RGBT unmanned aerial vehicle target tracking method and system based on multi-stage attention mechanism

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598654A (en) * 2019-09-18 2019-12-20 合肥工业大学 Multi-granularity cross modal feature fusion pedestrian re-identification method and re-identification system
CN110874590A (en) * 2019-11-18 2020-03-10 安徽大学 Training and visible light infrared visual tracking method based on adapter mutual learning model
CN110929848A (en) * 2019-11-18 2020-03-27 安徽大学 Training and tracking method based on multi-challenge perception learning model
CN110942060A (en) * 2019-10-22 2020-03-31 清华大学 Material identification method and device based on laser speckle and modal fusion
CN111209810A (en) * 2018-12-26 2020-05-29 浙江大学 Bounding box segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time in visible light and infrared images
CN111401107A (en) * 2019-01-02 2020-07-10 上海大学 Multi-mode face recognition method based on feature fusion neural network
CN111539246A (en) * 2020-03-10 2020-08-14 西安电子科技大学 Cross-spectrum face recognition method and device, electronic equipment and storage medium thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019071370A1 (en) * 2017-10-09 2019-04-18 Intel Corporation Feature fusion for multi-modal machine learning analysis

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209810A (en) * 2018-12-26 2020-05-29 浙江大学 Bounding box segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time in visible light and infrared images
CN111401107A (en) * 2019-01-02 2020-07-10 上海大学 Multi-mode face recognition method based on feature fusion neural network
CN110598654A (en) * 2019-09-18 2019-12-20 合肥工业大学 Multi-granularity cross modal feature fusion pedestrian re-identification method and re-identification system
CN110942060A (en) * 2019-10-22 2020-03-31 清华大学 Material identification method and device based on laser speckle and modal fusion
CN110874590A (en) * 2019-11-18 2020-03-10 安徽大学 Training and visible light infrared visual tracking method based on adapter mutual learning model
CN110929848A (en) * 2019-11-18 2020-03-27 安徽大学 Training and tracking method based on multi-challenge perception learning model
CN111539246A (en) * 2020-03-10 2020-08-14 西安电子科技大学 Cross-spectrum face recognition method and device, electronic equipment and storage medium thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Effective Fusion of Multi-Modal Data with Group Convolutions for Semantic Segmentation of Aerial Imagery; K. Chen et al.; 2019 IEEE International Geoscience and Remote Sensing Symposium; 2019; pp. 3911-3914 *
Real-time multi-modal target tracking method based on reliable correlation; Lu Yulong et al.; Journal of Anhui University (Natural Science Edition); May 2019; Vol. 43, No. 3; pp. 33-38 *
A survey of multi-modal fusion technology for deep learning; He Jun et al.; Computer Engineering; May 2020; Vol. 46, No. 5; pp. 1-11 *

Also Published As

Publication number Publication date
CN112949451A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN109344701B (en) Kinect-based dynamic gesture recognition method
US20230045519A1 (en) Target Detection Method and Apparatus
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN108062525B (en) Deep learning hand detection method based on hand region prediction
CN106886216B (en) Robot automatic tracking method and system based on RGBD face detection
CN112949451B (en) Cross-modal target tracking method and system through modal perception feature learning
CN112287875B (en) Abnormal license plate recognition method, device, equipment and readable storage medium
KR100572768B1 (en) Automatic detection method of human facial objects for the digital video surveillance
Choi et al. Attention-based multimodal image feature fusion module for transmission line detection
CN110020658B (en) Salient object detection method based on multitask deep learning
US11804026B2 (en) Device and a method for processing data sequences using a convolutional neural network
CN111428664A (en) Real-time multi-person posture estimation method based on artificial intelligence deep learning technology for computer vision
CN114639042A (en) Video target detection algorithm based on improved CenterNet backbone network
CN111199238A (en) Behavior identification method and equipment based on double-current convolutional neural network
Gal Automatic obstacle detection for USV’s navigation using vision sensors
Botterill et al. Finding a vine's structure by bottom-up parsing of cane edges
CN108563997B (en) Method and device for establishing face detection model and face recognition
CN112686122A (en) Human body and shadow detection method, device, electronic device and storage medium
Mehrübeoglu et al. Real-time iris tracking with a smart camera
CN108288041B (en) Preprocessing method for removing false detection of pedestrian target
CN110555406A (en) Video moving target identification method based on Haar-like characteristics and CNN matching
CN115861981A (en) Driver fatigue behavior detection method and system based on video attitude invariance
CN112200840B (en) Moving object detection system in visible light and infrared image combination
CN112446292B (en) 2D image salient object detection method and system
CN113361475A (en) Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant