CN112949451B - Cross-modal target tracking method and system through modal perception feature learning - Google Patents

Cross-modal target tracking method and system through modal perception feature learning

Info

Publication number
CN112949451B
CN112949451B CN202110214908.2A
Authority
CN
China
Prior art keywords
tracking
network
modal
target
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110214908.2A
Other languages
Chinese (zh)
Other versions
CN112949451A (en)
Inventor
李成龙
白曼
朱天昊
汤进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202110214908.2A priority Critical patent/CN112949451B/en
Publication of CN112949451A publication Critical patent/CN112949451A/en
Application granted granted Critical
Publication of CN112949451B publication Critical patent/CN112949451B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/48 - Matching video sequences

Abstract

The invention provides a cross-modal target tracking method and system through modal perception feature learning. A modality-aware fusion module is added to the original VGG network so as to adaptively combine effective features from the two branches, and a two-stage learning algorithm is developed to train the network efficiently and effectively. In the training phase, the classification loss of training samples from either modality is back-propagated to both modality branches; modality information is available during training but not during testing. The two-stage learning algorithm effectively trains the proposed network and retrains the final fully-connected layer; the modality-aware fusion module adaptively combines effective features from the two branches and takes either modality as input, so as to bridge the appearance gap between the RGB and near-infrared modalities.

Description

Cross-modal target tracking method and system through modal perception feature learning
Technical Field
The invention relates to the field of computer vision, in particular to a cross-modal target tracking method and a cross-modal target tracking system through modal perception feature learning.
Background
Visual tracking is an important problem in computer vision and plays a key role in vision systems such as visual surveillance, intelligent transportation and robotics. Existing tracking methods are usually based on RGB image sequences, which are sensitive to illumination changes, so some targets cannot be imaged effectively under low-light conditions; in such cases the tracking performance of existing methods may degrade significantly. Introducing data from other modalities, such as depth and thermal infrared, is an effective way to overcome the imaging limitations of a single RGB source. Depth sensors provide valuable additional depth information and improve tracking results by handling occlusion and model drift robustly, but their range is limited (roughly 4-5 meters) and they are restricted to indoor environments. Thermal sensors are usually independent of RGB sensors and have very different visual characteristics, so considerable effort is required for platform design and frame alignment; they are therefore not yet used in many practical applications.
Near-infrared imaging is an essential part of many surveillance cameras, whose imaging switches between RGB and near infrared according to the light intensity, as shown in fig. 3. This imaging scheme effectively overcomes the imaging limitation of the RGB source under weak light while avoiding the imaging and platform problems of existing multi-modal vision systems; FIG. 4 shows a typical example of near-infrared camera imaging. However, the near-infrared and RGB modalities are heterogeneous, as shown on the left side of fig. 3: their visual properties differ greatly, so the appearance of the target object is completely different in the two modalities. This appearance gap poses a significant challenge for visual tracking.
For example, application number CN201911146615.4 discloses a human body detection and tracking method based on multi-modal information perception: the color camera and the depth camera are calibrated and the data are filtered; based on multi-modal information perception, the body and the head of a person are detected in the color image and the depth image respectively, and the two detection results are fused according to the spatial proportion of head to body; based on multi-modal information perception, the body and the head are tracked in the color image and the depth image respectively with a kernelized correlation filter tracking algorithm, and a model of the tracked object is built; the tracking mechanism is then refined using spatial constraints between the tracked-object model and the head-to-body ratio. That method, based on multi-modal information perception, overcomes the shortcomings of vision-only detection and tracking methods, has wide application in the field of indoor service robots, and supports functions such as human-computer interaction and user following. However, it targets depth and visible-light data and requires calibration to align two cameras, so it is not applicable to a single camera that switches between modalities.
Existing tracking work has not investigated this challenging problem. We therefore study a new topic, cross-modal target tracking, aiming to solve the following two problems:
1) in many vision systems, visual tracking is based on RGB image sequences, in which some targets cannot be imaged effectively under low-light conditions, severely degrading tracking performance;
2) introducing other modalities such as depth and infrared data can overcome the imaging limitations of a single source, but multi-modal imaging platforms typically require careful design and are not yet applicable in many practical settings.
Based on the above analysis, the key research questions of the present application are: how to create a video benchmark dataset to facilitate research and development of cross-modal target tracking, and how to design a suitable baseline algorithm that reduces the appearance gap between the RGB and near-infrared modalities, achieves robust cross-modal target tracking, and reduces the influence of modality-switching moments on tracking performance.
Disclosure of Invention
The invention aims to provide a cross-modal target tracking method.
The invention solves the technical problems through the following technical means:
the cross-modal target tracking method through modal perception feature learning comprises the following steps:
s01, acquiring training sample data; the training sample data comprises at least a first sample set and a second sample set; the first sample set includes at least two subsets of known-modality samples; the second sample set comprises full-frame data covering all modalities of the first sample set, the modality of each frame being unknown;
s02, network training, wherein the network training is divided into two-stage learning; in the first stage, the two kinds of modality data in the first sample set are used to train the corresponding branch networks of the VGG network without the modal perception fusion module; in the second stage, the modal perception fusion module is added to the VGG network obtained from the first stage, and the network is then trained with the second sample set to obtain a target network with the modal perception fusion module;
s03, target tracking, wherein the target network performs target tracking on the multi-modal video data and outputs a tracking result; before the result is output, a meta-updater controller judges whether the tracking result is successful; if so, tracking samples are collected to update the sample pool online, and the target network uses the latest tracking result in the sample pool when tracking the target in the next frame.
The present invention introduces a new task, called cross-modal object tracking. A modality-aware fusion module is added to the original VGG network to adaptively combine effective features from the two branches, and a two-stage learning algorithm is developed to train the network efficiently and effectively. In the training phase, the classification loss of training samples from either modality is back-propagated to both modality branches; modality information is available during training but not during testing. The two-stage learning algorithm effectively trains the proposed network and retrains the final fully-connected layer. Specifically, the parameters of the base network are first initialized with a pre-trained VGG-M model. In the first training stage, the modality-aware fusion module is removed and each modality is used on its own to train the corresponding branch network. In the second training stage, images of both modalities are used as input; since it is not known which modality the current frame belongs to, all frames are used to learn the modality-aware fusion module and the classifier while the parameters of the base network are fine-tuned. The invention thus designs a modality-aware fusion module that adaptively combines effective features from the two branches and takes either modality as input, so as to bridge the appearance gap between the RGB and near-infrared modalities.
Further, the modalities in the step S01 at least include near infrared and visible light modalities.
Further, in the first stage learning in step S02, each branch adopts an Inception network.
Further, the specific method of target tracking in step S03 is as follows:
A first frame in which the target to be tracked is marked is first input to the target network, and tracking starts from the second frame. When the t-th frame is tracked, 256 candidate regions are sampled from a Gaussian distribution around the latest tracking result in the sample pool; binary classification with the softmax cross-entropy loss and the instance embedding loss distinguishes foreground from background; the target network then computes the scores of these candidate regions, and the average bounding box of the 5 highest-scoring candidate samples is selected as the tracking result of the t-th frame.
The invention also provides a cross-modal target tracking system through modal perception feature learning, which comprises:
a training sample data acquisition module, wherein the training sample data at least comprises a first sample set and a second sample set; the first sample set includes at least two subsets of known-modality samples; the second sample set comprises full-frame data covering all modalities of the first sample set, the modality of each frame being unknown;
a network training module, wherein the network training is divided into two-stage learning; in the first stage, the two kinds of modality data in the first sample set are used to train the corresponding branch networks of the VGG network without the modal perception fusion module; in the second stage, the modal perception fusion module is added to the VGG network obtained from the first stage, and the network is then trained with the second sample set to obtain a target network with the modal perception fusion module;
a target tracking module, wherein the target network performs target tracking on the multi-modal video data and outputs a tracking result; before the result is output, a meta-updater controller judges whether the tracking result is successful; if so, tracking samples are collected to update the sample pool online, and the target network uses the latest tracking result in the sample pool when tracking the target in the next frame.
Furthermore, the modalities of the training sample data acquisition module at least include two modalities, namely near infrared modality and visible light modality.
Furthermore, in the first stage learning in the network training module, each branch adopts an Inception network.
Further, a specific method for tracking the target in the target tracking module is as follows:
A first frame in which the target to be tracked is marked is first input to the target network, and tracking starts from the second frame. When the t-th frame is tracked, 256 candidate regions are sampled from a Gaussian distribution around the latest tracking result in the sample pool; binary classification with the softmax cross-entropy loss and the instance embedding loss distinguishes foreground from background; the target network then computes the scores of these candidate regions, and the average bounding box of the 5 highest-scoring candidate samples is selected as the tracking result of the t-th frame.
The present invention also provides a processing device comprising at least one processor, and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the methods described above.
The invention also provides a computer-readable storage medium storing computer instructions for causing the computer to perform the method as described above.
The invention has the advantages that:
the invention adds a modal perception module on the original basis of the VGG network to adaptively combine the effective characteristics from two branches; a two-stage learning algorithm is developed to train the efficiency and effectiveness of the network. In the training phase, the classification loss of training samples of any modality will be propagated back to the two modality branches. Modal information is available during the training phase and not available during the testing phase; a two-stage learning algorithm is used to efficiently train the proposed network, retraining the final fully-connected layer, specifically, first initializing the parameters of the base network in VGG-M using a pre-trained model. In a first training phase, the modality-aware fusion module is removed and a single modality is used to train the corresponding branch network. In the second training phase, images with two modes are used as input, and since it is not known which mode the current frame belongs to, all frames are used to learn the mode perception fusion module and the classifier, and parameters of the basic network are fine-tuned. The invention designs a modal perception fusion module to adaptively combine effective characteristics from two branches, and takes any mode as input to solve the appearance difference between RGB and near infrared modes.
Drawings
FIG. 1 is a schematic flow chart of the overall process of the method in the embodiment of the invention.
FIG. 2 shows feature maps under the modality-switch challenge in an embodiment of the invention: the input is two adjacent frames around the position where the modality switch occurs (first column); the second column visualizes the feature map of the baseline tracker RT-MDNet; the third column visualizes the feature map of the method of the invention.
FIG. 3 illustrates the heterogeneity between the RGB and near-infrared modalities in an embodiment of the invention: when the light intensity drops below normal, the camera switches from RGB imaging to near-infrared imaging, and vice versa.
FIG. 4 is a typical example sequence in an embodiment of the invention: several key frames are shown, with the bounding box representing the ground truth; as the light intensity changes, the imaging switches between RGB and near infrared.
Fig. 5 visualizes target feature sequences in an embodiment of the invention: the left plot shows the feature distribution predicted by the baseline tracker and the right plot the feature distribution of the method of the invention, where circles and triangles represent target features of the RGB and near-infrared modalities, respectively.
FIG. 6 illustrates the two-stage learning algorithm in an embodiment of the invention: in the first stage, the modality-aware fusion module is removed and each modality sample set trains the corresponding branch network; in the second stage, the modality-aware fusion module is trained with the sample sets of both modalities.
Fig. 7 and 8 are test comparison tables of the method of the present invention and other trackers in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment provides a cross-modal target tracking method through modal perception feature learning, which comprises the following steps:
step 01, acquiring training sample data; the training sample data comprises at least a first sample set and a second sample set; the first sample set comprises at least two subsets of known modality samples; the second sample set comprises full frame data comprising all the modalities of the first sample set, each frame data modality being unknown; the first sample set and the second sample set are both video data that have been labeled with objects, and may be 500 positive samples and 5000 negative samples taken in the first frame with the initial bounding box, which are greater than 0.7 and less than 0.3, respectively, from IoU of the initial bounding box. The modes in this embodiment include at least two modes, near infrared and visible light.
Step 02, network training, wherein the network training is divided into two-stage learning: in the first stage, the two kinds of modality data in the first sample set are used to train the corresponding branch networks of the VGG network without the modal perception fusion module; in the second stage, the modal perception fusion module is added to the VGG network obtained from the first stage, and the network is then trained with the second sample set to obtain a target network with the modal perception fusion module.
The present embodiment uses a VGG network as a base network to extract features of two modes, and the kernel sizes of the three convolutional layers are 7 × 7, 5 × 5, and 3 × 3, respectively.
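For concreteness, a minimal PyTorch sketch of such a three-convolution backbone is given below; the kernel sizes follow the description above, while the channel widths, strides and normalization layers follow the common VGG-M configuration and are assumptions not stated in this embodiment.

```python
import torch.nn as nn

class BaseBackbone(nn.Module):
    """Three-layer convolutional feature extractor with 7x7, 5x5 and 3x3 kernels,
    loosely following the VGG-M layout assumed here."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2),    # conv1: 7x7
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2),   # conv2: 5x5
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1),  # conv3: 3x3
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.features(x)
```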
This embodiment proposes a two-stage learning algorithm to train the proposed network effectively. In the first learning stage the modality-aware fusion module is removed; a two-stream CNN is designed that extracts the RGB and NIR representations in parallel, and each modality sample set is used to train the corresponding branch network. Each branch adopts an Inception network, which enables efficient computation.
In the second training stage, because the modality of the current frame is unknown, the modality-aware fusion module is added after the two branches of the VGG network obtained from the first stage; all frames are used to learn the fusion module and the classifier while the parameters of the base network are fine-tuned. The final result is shown in fig. 5, with the features predicted by the baseline tracker on the left and the features predicted by the invention on the right; circles and triangles represent the target features of the RGB and NIR modalities, respectively. The results show that the heterogeneous gap between the different modalities of the target object is alleviated to some extent.
Step 03, target tracking, namely performing target tracking on the multi-modal video data with the target network and outputting a tracking result. Owing to the particularity of cross-modal target tracking, two branches are designed to capture modality-specific representations, and a modality-aware fusion module is designed: a first 1 × 1 convolutional layer captures the general representation under the attribute; the result is then split into two streams by two further half-channel 1 × 1 convolutional layers, which reduce the dimensionality of the filter space; a 3 × 3 receptive field is realized in the streams in two ways so as to perceive regions of different sizes, one using a conventional 3 × 3 filter and the other a cascade of 1 × 3 and 3 × 1 filters; their outputs are concatenated as the attribute-specific representation. Given an input of either modality, the modality-aware fusion module adaptively integrates the features of the two branch outputs, which demonstrates its effectiveness, as shown in fig. 6.
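A minimal PyTorch sketch of a fusion module with this structure follows; the channel sizes, and the element-wise sum used to merge the two branch features before the first 1 × 1 convolution, are illustrative assumptions rather than details fixed by the embodiment.

```python
import torch
import torch.nn as nn

class ModalityAwareFusion(nn.Module):
    """First 1x1 conv captures a general representation, two half-channel 1x1
    convs split it into two streams, one stream uses a plain 3x3 conv and the
    other a 1x3 + 3x1 cascade, and the stream outputs are concatenated."""
    def __init__(self, in_channels=512, mid_channels=128):
        super().__init__()
        half = mid_channels // 2
        self.general = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.reduce_a = nn.Conv2d(mid_channels, half, kernel_size=1)
        self.reduce_b = nn.Conv2d(mid_channels, half, kernel_size=1)
        self.stream_a = nn.Conv2d(half, half, kernel_size=3, padding=1)
        self.stream_b = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=(1, 3), padding=(0, 1)),
            nn.Conv2d(half, half, kernel_size=(3, 1), padding=(1, 0)),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feat_rgb_branch, feat_nir_branch):
        # Merge the two branch outputs (the sum is an assumption), then build the
        # attribute-specific representation from the two receptive-field streams.
        x = self.relu(self.general(feat_rgb_branch + feat_nir_branch))
        a = self.relu(self.stream_a(self.reduce_a(x)))
        b = self.relu(self.stream_b(self.reduce_b(x)))
        return torch.cat([a, b], dim=1)
```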
In the target tracking process, the target is marked in the first frame and tracking starts from the second frame. During tracking, 50 positive samples with IoU greater than 0.7 and 200 negative samples with IoU less than 0.3 with respect to the current-frame tracking result are collected to adapt to changes of the target. The two branches capture modality-specific representations, and the modality-aware fusion module built from 1 × 1 convolutional layers adaptively integrates the features of the two branch outputs given an input of either modality.
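A minimal sketch of this IoU-based online sample collection, assuming boxes in (x, y, w, h) form; the thresholds and counts follow the values given above.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return inter / (aw * ah + bw * bh - inter + 1e-8)

def collect_update_samples(candidate_boxes, result_box, n_pos=50, n_neg=200):
    """Keep boxes with IoU > 0.7 to the current result as positives and
    boxes with IoU < 0.3 as negatives, up to the requested counts."""
    pos = [b for b in candidate_boxes if iou(b, result_box) > 0.7][:n_pos]
    neg = [b for b in candidate_boxes if iou(b, result_box) < 0.3][:n_neg]
    return pos, neg
```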
In order to learn a representation that distinguishes the target from the background, the loss used by the method of the present invention is a classification loss. Given an input image x^d of domain d and a bounding box R, the output score f^d of the network, i.e. the activation of the last fully connected layer, is:

f^d = φ^d(x^d; R)
where φ^d(·) denotes the two-dimensional binary classification score of the last fully connected layer (fc6) in domain d, and D is the number of domains in the training dataset. The output features are passed to a softmax function for binary classification, which decides whether the bounding box R is target or background in domain d; the output features are also subjected to multi-domain instance discrimination through another softmax operator. The softmax function is:

[σ(f^d)]_i = exp(f_i^d) / Σ_{j=1}^{2} exp(f_j^d),  i = 1, 2
For the classification loss, the binary classification loss of domain d(k) = (k mod D) in the k-th iteration is:

L_cls = -(1/N) Σ_{i=1}^{N} Σ_{c∈{+,-}} [y_i = c] · log([σ(f_i^{d(k)})]_c)

where N is the number of samples in a mini-batch and y_i is the binary label of the i-th sample.
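In code, this per-domain binary classification term is a softmax cross-entropy over the two-dimensional fc6 scores; the following PyTorch sketch assumes `scores` of shape (N, 2) and labels 1 for target and 0 for background.

```python
import torch
import torch.nn.functional as F

def binary_classification_loss(scores, labels):
    """Mean negative log-probability of the correct class, i.e. the softmax
    cross-entropy used as the binary classification loss above."""
    log_probs = F.log_softmax(scores, dim=1)               # (N, 2)
    picked = log_probs[torch.arange(scores.size(0)), labels]
    return -picked.mean()

# In the k-th iteration only the fc6 branch of domain d(k) = k mod D is active,
# and the resulting loss is back-propagated through both modality branches.
```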
before the output of the tracking target, a meta-update controller is used for judging whether the tracking result is successfully tracked or not so as to adapt to the change in the target tracking process. And if the tracking is successful, collecting a tracking sample to perform online updating on the sample pool, and judging the target tracking of the next frame by the target network by using the current latest tracking result of the sample pool. In the embodiment, long-term updating is preset to be performed every 10 frames, firstly, a RoI alignment pooling layer is introduced, and the features of the candidate region are directly extracted from the feature map, so that the feature extraction in the tracking process is greatly accelerated. The last fully connected layer (fc6) learning rate is set to 0.003, the other layer (fc4-5) learning rate is set to 0.0015, and the epoch number is set to 15. Three fully-connected layers (fc4-6) are then used to accommodate the appearance changes of the instances in different videos and frames. And (4) tracking failure in the current frame, training the hyper-parameters and long-term updating, and performing short-term updating when the hyper-parameters and the long-term updating are the same. In tracking the t-th frame, 256 candidate regions are sampled with a gaussian distribution around the latest frame tracking result in the sample pool. Finally, the cross entropy loss and the instance embedding loss of softmax are utilized to carry out binary classification so as to distinguish the foreground from the background. The scores for these candidate regions are then calculated using the trained network. And selecting the average bounding box of the 5 candidate samples with the highest score as the tracking result of the t-th frame.
The present invention introduces a new task, called cross-modal object tracking. A modality-aware fusion module is added to the original VGG network to adaptively combine effective features from the two branches, and a two-stage learning algorithm is developed to train the network efficiently and effectively. In the training phase, the classification loss of training samples from either modality is back-propagated to both modality branches; modality information is available during training but not during testing. The two-stage learning algorithm effectively trains the proposed network and retrains the final fully-connected layer. Specifically, the parameters of the base network are first initialized with a pre-trained VGG-M model. In the first training stage, the modality-aware fusion module is removed and each modality is used on its own to train the corresponding branch network. In the second training stage, images of both modalities are used as input; since it is not known which modality the current frame belongs to, all frames are used to learn the modality-aware fusion module and the classifier while the parameters of the base network are fine-tuned. The invention thus designs a modality-aware fusion module that adaptively combines effective features from the two branches and takes either modality as input, so as to bridge the appearance gap between the RGB and near-infrared modalities.
Fig. 7 and 8 show the experimental results of the present invention: the method was tested on the public dataset CMOTB and compared with other trackers in terms of SR (success rate), PR (precision rate) and NPR (normalized precision rate). MArMOT denotes the tracking accuracy of the method of the invention; compared with the other methods, its tracking performance is consistently and substantially improved. In fig. 7 and 8, Trackers is the name of the tracker;
MArMOT is Cross-Modal Object Tracking via Modality-Aware Feature Learning (the method of the invention);
LTMU is High-Performance Long-Term Tracking with Meta-Updater;
ATOM is ATOM: Accurate Tracking by Overlap Maximization;
SiamRPN++ is SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks;
SiamBAN is Siamese Box Adaptive Network for Visual Tracking;
VITAL is VITAL: Visual Tracking via Adversarial Learning;
SiamMask is Fast Online Object Tracking and Segmentation: A Unifying Approach;
MDNet is Learning Multi-Domain Convolutional Neural Networks for Visual Tracking;
DaSiamRPN is Distractor-aware Siamese Networks for Visual Object Tracking;
TACT is Visual Tracking by TridentAlign and Context Embedding;
SiamFC is Fully-Convolutional Siamese Networks for Object Tracking;
GradNet is GradNet: Gradient-Guided Network for Visual Object Tracking;
SiamDW is Deeper and Wider Siamese Networks for Real-Time Visual Tracking;
SPLT is 'Skimming-Perusal' Tracking: A Framework for Real-Time and Robust Long-term Tracking;
RT-MDNet is Real-Time MDNet;
Ocean is Ocean: Object-aware Anchor-free Tracking;
GlobalTrack is GlobalTrack: A Simple and Strong Baseline for Long-term Tracking.
The present embodiment further provides a cross-modal target tracking system through modality-aware feature learning, including:
a training sample data acquisition module; the training sample data comprises at least a first sample set and a second sample set; the first sample set includes at least two subsets of known-modality samples; the second sample set consists of full-frame data covering all modalities of the first sample set, with the modality of each frame unknown. Both sample sets are video data in which the target has been annotated; for example, 500 positive samples and 5000 negative samples, whose IoU with the initial bounding box is greater than 0.7 and less than 0.3 respectively, may be drawn around the initial bounding box in the first frame. The modalities in this embodiment include at least near infrared and visible light.
The network training module divides the network training into two-stage learning: in the first stage, the two kinds of modality data in the first sample set are used to train the corresponding branch networks of the VGG network without the modal perception fusion module; in the second stage, the modal perception fusion module is added to the VGG network obtained from the first stage, and the network is then trained with the second sample set to obtain a target network with the modal perception fusion module.
This embodiment uses the VGG network as a base network to extract features of two modes, the kernel sizes of the three convolutional layers being 7 × 7, 5 × 5, and 3 × 3, respectively.
In this embodiment, a two-stage learning algorithm is proposed to train the proposed network effectively. In the first learning stage the modality-aware fusion module is removed; a two-stream CNN is designed that extracts the RGB and NIR representations in parallel, and each modality sample set is used to train the corresponding branch network. Each branch adopts an Inception network, which enables efficient computation.
In the second training stage, because it is not known which modality the current frame belongs to, all frames are used to learn the modality-aware fusion module and the classifier while the parameters of the base network are fine-tuned. The final result is shown in fig. 5, with the features predicted by the baseline tracker on the left and the features predicted by the invention on the right; circles and triangles represent the target features of the RGB and NIR modalities, respectively. The results show that the heterogeneous gap between the different modalities of the target object is alleviated to some extent.
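A minimal sketch of this two-stage schedule in PyTorch; the submodule names (`rgb_branch`, `nir_branch`, `fusion`, `fc_layers`), the `forward_branch` helper, the learning rates and the epoch counts are hypothetical placeholders, not values fixed by the embodiment.

```python
import torch
import torch.nn.functional as F

def train_two_stage(model, rgb_loader, nir_loader, mixed_loader,
                    stage1_epochs=50, stage2_epochs=50):
    # Stage 1: fusion module removed, each modality trains its own branch.
    for modality, branch, loader in (("rgb", model.rgb_branch, rgb_loader),
                                     ("nir", model.nir_branch, nir_loader)):
        opt = torch.optim.SGD(branch.parameters(), lr=1e-4, momentum=0.9)
        for _ in range(stage1_epochs):
            for img, box, label in loader:
                loss = F.cross_entropy(model.forward_branch(img, box, modality), label)
                opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: all frames (modality unknown) train the fusion module and the
    # classifier, while the base network is fine-tuned with a smaller rate.
    opt = torch.optim.SGD([
        {"params": model.fusion.parameters(), "lr": 1e-3},
        {"params": model.fc_layers.parameters(), "lr": 1e-3},
        {"params": model.rgb_branch.parameters(), "lr": 1e-4},
        {"params": model.nir_branch.parameters(), "lr": 1e-4},
    ], lr=1e-4, momentum=0.9)
    for _ in range(stage2_epochs):
        for img, box, label in mixed_loader:
            loss = F.cross_entropy(model(img, box), label)
            opt.zero_grad(); loss.backward(); opt.step()
```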
The target tracking module is used for performing target tracking on the multi-modal video data with the target network and outputting a tracking result. Owing to the particularity of cross-modal target tracking, two branches are designed to capture modality-specific representations, and a modality-aware fusion module is designed: a first 1 × 1 convolutional layer captures the general representation under the attribute; the result is then split into two streams by two further half-channel 1 × 1 convolutional layers, which reduce the dimensionality of the filter space; a 3 × 3 receptive field is realized in the streams in two ways so as to perceive regions of different sizes, one using a conventional 3 × 3 filter and the other a cascade of 1 × 3 and 3 × 1 filters; their outputs are concatenated as the attribute-specific representation. Given an input of either modality, the modality-aware fusion module adaptively integrates the features of the two branch outputs, which demonstrates its effectiveness, as shown in fig. 6.
In the target tracking process, the target is marked in the first frame and tracking starts from the second frame. During tracking, 50 positive samples with IoU greater than 0.7 and 200 negative samples with IoU less than 0.3 with respect to the current-frame tracking result are collected to adapt to changes of the target. The two branches capture modality-specific representations, and the modality-aware fusion module built from 1 × 1 convolutional layers adaptively integrates the features of the two branch outputs given an input of either modality.
In order to learn a representation that distinguishes the target from the background, the loss used by the method of the present invention is a classification loss. Given an input image x^d of domain d and a bounding box R, the output score f^d of the network, i.e. the activation of the last fully connected layer, is:

f^d = φ^d(x^d; R)
where φ^d(·) denotes the two-dimensional binary classification score of the last fully connected layer (fc6) in domain d, and D is the number of domains in the training dataset. The output features are passed to a softmax function for binary classification, which decides whether the bounding box R is target or background in domain d; the output features are also subjected to multi-domain instance discrimination through another softmax operator. The softmax function is:

[σ(f^d)]_i = exp(f_i^d) / Σ_{j=1}^{2} exp(f_j^d),  i = 1, 2
For the classification loss, the binary classification loss of domain d(k) = (k mod D) in the k-th iteration is:

L_cls = -(1/N) Σ_{i=1}^{N} Σ_{c∈{+,-}} [y_i = c] · log([σ(f_i^{d(k)})]_c)

where N is the number of samples in a mini-batch and y_i is the binary label of the i-th sample.
and the online updating module judges whether the tracking result is successfully tracked or not by using the meta-update updating controller before the tracking result is output so as to adapt to the change in the target tracking process. And if the tracking is successful, collecting a tracking sample to perform online updating on the sample pool, and judging the target tracking of the next frame by the target network by using the current latest tracking result of the sample pool. In the embodiment, long-term updating is preset to be performed every 10 frames, firstly, a RoI alignment pooling layer is introduced, and the features of the candidate region are directly extracted from the feature map, so that the feature extraction in the tracking process is greatly accelerated. The last fully connected layer (fc6) learning rate is set to 0.003, the other layer (fc4-5) learning rate is set to 0.0015, and the epoch number is set to 15. Three fully-connected layers (fc4-6) are then used to accommodate the appearance changes of the instances in different videos and frames. And (4) tracking failure in the current frame, training the hyper-parameters and long-term updating, and performing short-term updating when the hyper-parameters and the long-term updating are the same. In tracking the t-th frame, 256 candidate regions are sampled with a gaussian distribution around the latest frame tracking result in the sample pool. Finally, the cross entropy loss and the instance embedding loss of softmax are utilized to carry out binary classification so as to distinguish the foreground from the background. The scores for these candidate regions are then calculated using the trained network. And selecting the average bounding box of the 5 candidate samples with the highest scores as the tracking result of the t-th frame.
Fig. 7 and 8 show the experimental results of the present invention: the method was tested on the public dataset CMOTB and compared with other trackers in terms of SR (success rate), PR (precision rate) and NPR (normalized precision rate). MArMOT denotes the tracking accuracy of the method of the invention; compared with the other methods, its tracking performance is consistently and substantially improved. In fig. 7 and 8, Trackers is the name of the tracker;
MArMOT is Cross-Modal Object Tracking via Modality-Aware Feature Learning (the method of the present invention);
LTMU is High-Performance Long-Term Tracking with Meta-Updater;
ATOM is ATOM: Accurate Tracking by Overlap Maximization;
SiamRPN++ is SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks;
SiamBAN is Siamese Box Adaptive Network for Visual Tracking;
VITAL is VITAL: Visual Tracking via Adversarial Learning;
SiamMask is Fast Online Object Tracking and Segmentation: A Unifying Approach;
MDNet is Learning Multi-Domain Convolutional Neural Networks for Visual Tracking;
DaSiamRPN is Distractor-aware Siamese Networks for Visual Object Tracking;
TACT is Visual Tracking by TridentAlign and Context Embedding;
SiamFC is Fully-Convolutional Siamese Networks for Object Tracking;
GradNet is GradNet: Gradient-Guided Network for Visual Object Tracking;
SiamDW is Deeper and Wider Siamese Networks for Real-Time Visual Tracking;
SPLT is 'Skimming-Perusal' Tracking: A Framework for Real-Time and Robust Long-term Tracking;
RT-MDNet is Real-Time MDNet;
Ocean is Ocean: Object-aware Anchor-free Tracking;
GlobalTrack is GlobalTrack: A Simple and Strong Baseline for Long-term Tracking.
The present embodiments also provide a processing device, comprising at least one processor, and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, which when called upon are capable of performing the methods described above.
The present embodiments also provide a computer-readable storage medium storing computer instructions that cause the computer to perform the method as described above.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. The cross-modal target tracking method through modal perception feature learning is characterized by comprising the following steps of:
s01, acquiring training sample data; the training sample data comprises at least a first sample set and a second sample set; the first sample set includes at least two subsets of known-modality samples; the second sample set comprises full-frame data covering all modalities of the first sample set, the modality of each frame being unknown;
s02, network training, wherein the network training is divided into two-stage learning; in the first stage, the two kinds of modality data in the first sample set are used to train the corresponding branch networks of the VGG network without the modal perception fusion module; in the second stage, the modal perception fusion module is added to the VGG network obtained from the first stage, and the network is then trained with the second sample set to obtain a target network with the modal perception fusion module;
s03, target tracking, wherein the target network performs target tracking on the multi-modal video data and outputs a tracking result; before the result is output, a meta-updater controller judges whether the tracking result is successful; if so, a tracking sample is collected to update the sample pool online, and the target network uses the latest tracking result in the sample pool when tracking the target in the next frame;
the specific method for tracking the target in step S03 is as follows:
a first frame in which the target to be tracked is marked is first input to the target network, and tracking starts from the second frame; when the t-th frame is tracked, 256 candidate regions are sampled from a Gaussian distribution around the latest tracking result in the sample pool; binary classification with the softmax cross-entropy loss and the instance embedding loss distinguishes foreground from background; the target network then computes the scores of these candidate regions, and the average bounding box of the 5 highest-scoring candidate samples is selected as the tracking result of the t-th frame.
2. The method for cross-modal target tracking through modal-aware feature learning according to claim 1, wherein the modalities of step S01 include at least two modalities of near infrared and visible light.
3. The method for cross-modal object tracking through modal-aware feature learning according to claim 1, wherein in the first stage learning in step S02, an Inception network is adopted for each branch.
4. A cross-modal target tracking system through modality-aware feature learning, comprising:
a training sample data acquisition module, wherein the training sample data at least comprises a first sample set and a second sample set; the first sample set includes at least two subsets of known-modality samples; the second sample set comprises full-frame data covering all modalities of the first sample set, the modality of each frame being unknown;
a network training module, wherein the network training is divided into two-stage learning; in the first stage, the two kinds of modality data in the first sample set are used to train the corresponding branch networks of the VGG network without the modal perception fusion module; in the second stage, the modal perception fusion module is added to the VGG network obtained from the first stage, and the network is then trained with the second sample set to obtain a target network with the modal perception fusion module;
a target tracking module, wherein the target network performs target tracking on the multi-modal video data and outputs a tracking result; before the result is output, a meta-updater controller judges whether the tracking result is successful; if so, a tracking sample is collected to update the sample pool online, and the target network uses the latest tracking result in the sample pool when tracking the target in the next frame;
the specific method for tracking the target in the target tracking module comprises the following steps:
a first frame in which the target to be tracked is marked is first input to the target network, and tracking starts from the second frame; when the t-th frame is tracked, 256 candidate regions are sampled from a Gaussian distribution around the latest tracking result in the sample pool; binary classification with the softmax cross-entropy loss and the instance embedding loss distinguishes foreground from background; the target network then computes the scores of these candidate regions, and the average bounding box of the 5 highest-scoring candidate samples is selected as the tracking result of the t-th frame.
5. The system according to claim 4, wherein the modalities of the training sample data acquisition module include at least two modalities of near infrared and visible light.
6. The system according to claim 4, wherein in the first stage learning in the network training module, an Inception network is used for each branch.
7. A processing device comprising at least one processor and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, the processor being capable of executing the method of any one of claims 1 to 3 when invoked by the processor.
8. A computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 3.
CN202110214908.2A 2021-02-24 2021-02-24 Cross-modal target tracking method and system through modal perception feature learning Active CN112949451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110214908.2A CN112949451B (en) 2021-02-24 2021-02-24 Cross-modal target tracking method and system through modal perception feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110214908.2A CN112949451B (en) 2021-02-24 2021-02-24 Cross-modal target tracking method and system through modal perception feature learning

Publications (2)

Publication Number Publication Date
CN112949451A CN112949451A (en) 2021-06-11
CN112949451B true CN112949451B (en) 2022-09-09

Family

ID=76246387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110214908.2A Active CN112949451B (en) 2021-02-24 2021-02-24 Cross-modal target tracking method and system through modal perception feature learning

Country Status (1)

Country Link
CN (1) CN112949451B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920171B (en) * 2021-12-09 2022-10-25 南京理工大学 Bimodal target tracking method based on feature level and decision level fusion
CN116188528B (en) * 2023-01-10 2024-03-15 中国人民解放军军事科学院国防科技创新研究院 RGBT unmanned aerial vehicle target tracking method and system based on multi-stage attention mechanism

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598654A (en) * 2019-09-18 2019-12-20 合肥工业大学 Multi-granularity cross modal feature fusion pedestrian re-identification method and re-identification system
CN110874590A (en) * 2019-11-18 2020-03-10 安徽大学 Training and visible light infrared visual tracking method based on adapter mutual learning model
CN110929848A (en) * 2019-11-18 2020-03-27 安徽大学 Training and tracking method based on multi-challenge perception learning model
CN110942060A (en) * 2019-10-22 2020-03-31 清华大学 Material identification method and device based on laser speckle and modal fusion
CN111209810A (en) * 2018-12-26 2020-05-29 浙江大学 Bounding box segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time in visible light and infrared images
CN111401107A (en) * 2019-01-02 2020-07-10 上海大学 Multi-mode face recognition method based on feature fusion neural network
CN111539246A (en) * 2020-03-10 2020-08-14 西安电子科技大学 Cross-spectrum face recognition method and device, electronic equipment and storage medium thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019071370A1 (en) * 2017-10-09 2019-04-18 Intel Corporation Feature fusion for multi-modal machine learning analysis

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209810A (en) * 2018-12-26 2020-05-29 浙江大学 Bounding box segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time in visible light and infrared images
CN111401107A (en) * 2019-01-02 2020-07-10 上海大学 Multi-mode face recognition method based on feature fusion neural network
CN110598654A (en) * 2019-09-18 2019-12-20 合肥工业大学 Multi-granularity cross modal feature fusion pedestrian re-identification method and re-identification system
CN110942060A (en) * 2019-10-22 2020-03-31 清华大学 Material identification method and device based on laser speckle and modal fusion
CN110874590A (en) * 2019-11-18 2020-03-10 安徽大学 Training and visible light infrared visual tracking method based on adapter mutual learning model
CN110929848A (en) * 2019-11-18 2020-03-27 安徽大学 Training and tracking method based on multi-challenge perception learning model
CN111539246A (en) * 2020-03-10 2020-08-14 西安电子科技大学 Cross-spectrum face recognition method and device, electronic equipment and storage medium thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Effective Fusion of Multi-Modal Data with Group Convolutions for Semantic Segmentation of Aerial Imagery; K. Chen et al.; 2019 IEEE International Geoscience and Remote Sensing Symposium; 2019; pp. 3911-3914 *
Real-time multi-modal target tracking method based on reliable correlation; Lu Yulong et al.; Journal of Anhui University (Natural Science Edition); May 2019; Vol. 43, No. 3; pp. 33-38 *
A survey of multi-modal fusion technology for deep learning; He Jun et al.; Computer Engineering; May 2020; Vol. 46, No. 5; pp. 1-11 *

Also Published As

Publication number Publication date
CN112949451A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN109344701B (en) Kinect-based dynamic gesture recognition method
US20230045519A1 (en) Target Detection Method and Apparatus
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN108062525B (en) Deep learning hand detection method based on hand region prediction
CN106886216B (en) Robot automatic tracking method and system based on RGBD face detection
CN112949451B (en) Cross-modal target tracking method and system through modal perception feature learning
CN112287875B (en) Abnormal license plate recognition method, device, equipment and readable storage medium
KR100572768B1 (en) Automatic detection method of human facial objects for the digital video surveillance
Choi et al. Attention-based multimodal image feature fusion module for transmission line detection
CN110020658B (en) Salient object detection method based on multitask deep learning
US11804026B2 (en) Device and a method for processing data sequences using a convolutional neural network
CN111428664A (en) Real-time multi-person posture estimation method based on artificial intelligence deep learning technology for computer vision
CN114639042A (en) Video target detection algorithm based on improved CenterNet backbone network
CN111199238A (en) Behavior identification method and equipment based on double-current convolutional neural network
Gal Automatic obstacle detection for USV’s navigation using vision sensors
Botterill et al. Finding a vine's structure by bottom-up parsing of cane edges
CN108563997B (en) Method and device for establishing face detection model and face recognition
CN112686122A (en) Human body and shadow detection method, device, electronic device and storage medium
Mehrübeoglu et al. Real-time iris tracking with a smart camera
CN108288041B (en) Preprocessing method for removing false detection of pedestrian target
CN110555406A (en) Video moving target identification method based on Haar-like characteristics and CNN matching
CN115861981A (en) Driver fatigue behavior detection method and system based on video attitude invariance
CN112200840B (en) Moving object detection system in visible light and infrared image combination
CN112446292B (en) 2D image salient object detection method and system
CN113361475A (en) Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant