CN114445461A - Visible light infrared target tracking training method and device based on non-paired data - Google Patents


Info

Publication number
CN114445461A
CN114445461A
Authority
CN
China
Prior art keywords: modality; visible light; module; modal; specific network
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number
CN202210095429.8A
Other languages
Chinese (zh)
Inventor
李成龙
何小倩
沈庆
汤进
Current Assignee (the listed assignees may be inaccurate)
Anhui University
Original Assignee
Anhui University
Priority date (the priority date is an assumption and is not a legal conclusion)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202210095429.8A priority Critical patent/CN114445461A/en
Publication of CN114445461A publication Critical patent/CN114445461A/en


Classifications

    • G06T7/251 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving models
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 — Matching criteria, e.g. proximity measures
    • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045 — Combinations of networks
    • G06N3/08 — Learning methods for neural networks
    • G06T2207/10004 — Still image; Photographic image
    • G06T2207/10048 — Infrared image
    • G06T2207/20081 — Training; Learning
    • G06T2207/20084 — Artificial neural networks [ANN]

Abstract

The invention discloses a visible light infrared target tracking training method and device based on unpaired data. The method comprises: acquiring unpaired visible light images and thermal infrared images and generating candidate samples; and training a visible light infrared tracker with the candidate samples. The tracker comprises a modality-specific module, a modality-sharing module, a modality-adaptive attention module and a modality adaptation module connected in sequence, where the modality-specific module comprises a first modality-specific network and a second modality-specific network. The visible light image serves as the input of the first modality-specific network and of the modality-sharing module, and the thermal infrared image serves as the input of the second modality-specific network and of the modality-sharing module; the outputs of the first and second modality-specific networks are each fused with the output of the modality-sharing module before being passed to the modality-adaptive attention module. The method removes the dependence on large-scale registered data and improves target tracking performance.

Description

Visible light infrared target tracking training method and device based on non-paired data
Technical Field
The invention relates to the technical field of computer vision, in particular to a visible light infrared target tracking training method and device based on unpaired data.
Background
Target tracking has long been an important problem in computer vision. In recent years, target tracking technology has made great breakthroughs and is widely applied in intelligent transportation, autonomous driving, robotics and other fields. The task of target tracking is to predict the size and position of a target in the subsequent frames of a video sequence, given the size and position of the target in the initial frame.
Most target tracking algorithms operate on a single visible light modality and achieve excellent performance under that condition, but their robustness still needs improvement in complex environments or extreme conditions such as haze and low light. In recent years, more and more sensor technologies have been applied to target tracking, such as thermal infrared sensors and depth sensors. A thermal infrared sensor images a target by capturing its temperature information and has low sensitivity to illumination conditions. Meanwhile, visible light data can compensate for the blurred edges and limited detail of thermal infrared images, and the complementarity of the visible light and thermal infrared data can help an algorithm achieve stable tracking.
At present, most research on visible light infrared tracking algorithms focuses on multi-modal complementarity, fusing the information of each modality to achieve a more robust tracking result. However, such algorithms usually rely on registered (paired) visible light and thermal infrared data; for example, the all-weather real-time target tracking method based on visible light and infrared images disclosed in invention patent application No. 201510521038.8 must register the visible light and infrared images before performing tracking detection. Thermal infrared data can be acquired with a thermal infrared sensor, but pairing it with matching visible light data requires extensive manual selection and annotation, which makes dataset construction challenging; moreover, few standard paired visible light thermal infrared datasets are publicly available, so the advantages of the thermal infrared modality cannot be fully exploited.
Disclosure of Invention
The technical problem addressed by the invention is how to remove the dependence of visible light infrared target tracking training on large-scale registered data.
The invention solves the technical problems through the following technical means:
on one hand, the embodiment of the invention provides a visible light infrared target tracking training method based on non-paired data, which comprises the following steps:
acquiring an unpaired visible light image and thermal infrared image, and generating a candidate sample based on the visible light image and the thermal infrared image, wherein the candidate sample comprises a positive sample and a negative sample;
training a visible light infrared tracker by using the candidate sample;
the visible light infrared tracker comprises a modality-specific module, a modality-sharing module, a modality-adaptive attention module and a modality adaptation module connected in sequence, where the modality-specific module comprises a first modality-specific network and a second modality-specific network; the visible light image is used as the input of the first modality-specific network and of the modality-sharing module, and the thermal infrared image is used as the input of the second modality-specific network and of the modality-sharing module; the visible light modality feature, obtained by adding the output of the first modality-specific network to the output of the modality-sharing module, and the thermal infrared modality feature, obtained by adding the output of the second modality-specific network to the output of the modality-sharing module, are both used as inputs of the modality-adaptive attention module.
The invention provides a first modality-specific network and a second modality-specific network to extract features from the visible light image and the thermal infrared image respectively, a modality-sharing module to extract the similar features of the visible light and thermal infrared data and thereby strengthen the relation between modalities, and a modality-adaptive attention module to realize learning and mutual enhancement between the unpaired visible light and thermal infrared modalities. This unlocks the power of unpaired visible light infrared data, effectively avoids the problem of insufficient training data, and fully mines and exploits bimodal information on a limited dataset to achieve mutual enhancement between unpaired multimodal data, training a robust visible light infrared tracker.
Further, the first modality-specific network, the second modality-specific network and the modality-sharing module each comprise three convolutional layers connected in sequence, and the outputs of the first two convolutional layers of the first and second modality-specific networks are used as inputs of the last two convolutional layers of the modality-sharing module;
and the output of the last convolutional layer of the first modality-specific network is added to the output of the last convolutional layer of the modality-sharing module as one input of the modality-adaptive attention module, while the output of the last convolutional layer of the second modality-specific network is added to the output of the last convolutional layer of the modality-sharing module as the other input of the modality-adaptive attention module.
Further, the modality-adaptive attention module includes first and second fully-connected layers sharing weights, third and fourth fully-connected layers being modality-specific, and a fifth fully-connected layer being modality-shared;
the visible light modal characteristics and the thermal infrared modal characteristics are respectively used as the input of the first full connection layer and the second full connection layer, and dimension reduction processing is carried out to obtain the visible light modal characteristics after dimension reduction and the thermal infrared modal characteristics after dimension reduction;
processing the dimension-reduced visible light modality features and the dimension-reduced thermal infrared modality features through the third and fourth fully-connected layers respectively, using a QKV mechanism, to form the attention matrices corresponding to the two modalities;
forming a modal shared query set by the attention matrixes corresponding to the two modalities through the fifth fully-connected layer;
and multiplying the modality-shared query set with the attention matrices corresponding to the two modalities respectively to obtain the enhanced feature maps corresponding to the two modalities.
Further, the modality adaptation module comprises two fully-connected layers and a modality connection layer connected in sequence, where the modality connection layer comprises a visible light modality fully-connected layer and a thermal infrared modality fully-connected layer corresponding to the two modalities;
a dropout (random neuron deactivation) function is added after the first two sequentially connected fully-connected layers;
and the visible light modality fully-connected layer and the thermal infrared modality fully-connected layer each comprise a softmax layer, used to compute the score values of the positive and negative candidate samples and to predict the target position.
Further, the method further comprises:
training the visible light infrared tracker with a cross-entropy loss computed from the sample label y (1 for positive samples, 0 for negative samples) and the predicted positive score ŷ, where the cross-entropy loss is:
Loss = -(y·log(ŷ) + (1-y)·log(1-ŷ));
and optimizing the whole network of the visible light infrared tracker by stochastic gradient descent.
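As a minimal illustration, the cross-entropy loss above can be written directly in NumPy; the clipping epsilon is an implementation detail added here for numerical safety, not part of the patent:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean cross-entropy between sample labels y (1 = positive sample,
    0 = negative sample) and predicted positive-class scores y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))))
```

For a well-scored pair such as y = [1, 0] and ŷ = [0.9, 0.1], the loss is -log 0.9 ≈ 0.105.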
In another aspect, the present invention provides a visible light infrared target tracking training apparatus based on unpaired data, comprising:
an acquisition module, used to acquire unpaired visible light images and thermal infrared images and generate candidate samples based on the visible light images and the thermal infrared images, the candidate samples comprising positive samples and negative samples;
the training module is used for training the visible light infrared tracker by using the candidate sample;
the visible light infrared tracker comprises a modality specific module, a modality sharing module, a modality self-adaptive attention module and a modality adaptation module which are sequentially connected, wherein the modality specific module comprises a first modality specific network and a second modality specific network; the visible light image is used as the input of the first modality specific network and the input of the modality sharing module, the thermal infrared image is used as the input of the second modality specific network and the input of the modality sharing module, the visible light modality feature obtained by adding the output of the first modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module, and the thermal infrared modality feature obtained by adding the output of the second modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module.
Further, the first modality-specific network, the second modality-specific network and the modality-sharing module each comprise three convolutional layers connected in sequence, and the outputs of the first two convolutional layers of the first and second modality-specific networks are used as inputs of the last two convolutional layers of the modality-sharing module;
and the output of the last convolutional layer of the first modality-specific network is added to the output of the last convolutional layer of the modality-sharing module as one input of the modality-adaptive attention module, while the output of the last convolutional layer of the second modality-specific network is added to the output of the last convolutional layer of the modality-sharing module as the other input of the modality-adaptive attention module.
Further, the modality-adaptive attention module includes first and second fully-connected layers sharing weights, third and fourth fully-connected layers being modality-specific, and a fifth fully-connected layer being modality-shared;
the visible light modal characteristics and the thermal infrared modal characteristics are respectively used as the input of the first full connection layer and the second full connection layer, and dimension reduction processing is carried out to obtain the visible light modal characteristics after dimension reduction and the thermal infrared modal characteristics after dimension reduction;
processing the dimension-reduced visible light modality features and the dimension-reduced thermal infrared modality features through the third and fourth fully-connected layers respectively, using a QKV mechanism, to form the attention matrices corresponding to the two modalities;
forming a modal shared query set by the attention matrixes corresponding to the two modalities through the fifth fully-connected layer;
and multiplying the modality-shared query set with the attention matrices corresponding to the two modalities respectively to obtain the enhanced feature maps corresponding to the two modalities.
Further, the modality adaptation module comprises two fully-connected layers and a modality connection layer connected in sequence, where the modality connection layer comprises a visible light modality fully-connected layer and a thermal infrared modality fully-connected layer corresponding to the two modalities;
a dropout (random neuron deactivation) function is added after the first two sequentially connected fully-connected layers;
and the visible light modality fully-connected layer and the thermal infrared modality fully-connected layer each comprise a softmax layer, used to compute the score values of the positive and negative candidate samples and to predict the target position.
Further, the training module comprises:
the training unit is used to train the visible light infrared tracker with a cross-entropy loss computed from the sample label y (1 for positive samples, 0 for negative samples) and the predicted positive score ŷ, where the cross-entropy loss is:
Loss = -(y·log(ŷ) + (1-y)·log(1-ŷ));
and the optimization unit is used to optimize the whole network of the visible light infrared tracker by stochastic gradient descent.
The invention has the advantages that:
(1) The invention provides a first modality-specific network and a second modality-specific network to extract features from the visible light image and the thermal infrared image respectively, a modality-sharing module to extract the similar features of the visible light and thermal infrared data and thereby strengthen the relation between modalities, and a modality-adaptive attention module to realize learning and mutual enhancement between the unpaired visible light and thermal infrared modalities. This unlocks the power of unpaired visible light infrared data, effectively avoids the problem of insufficient training data, and fully mines and exploits bimodal information on a limited dataset to achieve mutual enhancement between unpaired multimodal data, training a robust visible light infrared tracker.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of a visible light infrared target tracking training method based on unpaired data according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating a visible light infrared target tracking training method based on unpaired data according to a first embodiment of the present invention;
FIG. 3 is a block diagram of a visible infrared tracker in accordance with the present invention;
FIG. 4 is a block diagram of a non-paired data based visible light infrared target tracking training device according to a second embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1 to fig. 3, an embodiment of the present invention provides a visible light infrared target tracking training method based on unpaired data, including the following steps:
s10, acquiring an unpaired visible light image and thermal infrared image, and generating a candidate sample based on the visible light image and the thermal infrared image, wherein the candidate sample comprises a positive sample and a negative sample;
it should be noted that, by sampling 8 consecutive frames in each unpaired multimodal video, an artificially labeled detection box in the image is used to represent the current tracked target, and 256 positive samples and 768 negative samples are generated for the sampled data of the two modalities in a gaussian distribution random manner.
S20, training the visible light infrared tracker by using the candidate sample;
the visible light infrared tracker comprises a modality-specific module, a modality-sharing module, a modality-adaptive attention module and a modality adaptation module connected in sequence, where the modality-specific module comprises a first modality-specific network and a second modality-specific network; the visible light image is used as the input of the first modality-specific network and of the modality-sharing module, and the thermal infrared image is used as the input of the second modality-specific network and of the modality-sharing module; the visible light modality feature, obtained by adding the output of the first modality-specific network to the output of the modality-sharing module, and the thermal infrared modality feature, obtained by adding the output of the second modality-specific network to the output of the modality-sharing module, are both used as inputs of the modality-adaptive attention module.
The first and second modality-specific networks extract the feature maps of the visible light image and the thermal infrared image respectively; the modality-sharing module extracts the similar features of the visible light and thermal infrared feature maps; the modality-adaptive attention module performs inter-modality learning and enhancement on the unpaired data; and the modality adaptation module separates the modalities and performs target tracking.
In this embodiment, the modality-sharing module extracts the similar features of the visible light and thermal infrared data, further strengthening the connection between modalities, while the modality-adaptive attention module realizes learning and enhancement between the unpaired visible light and thermal infrared modalities. This unlocks the power of unpaired visible light infrared data, effectively avoids the problem of insufficient training data, and fully mines and exploits bimodal information on a limited dataset to achieve mutual enhancement between unpaired multimodal data.
In an embodiment, the first modality-specific network, the second modality-specific network and the modality-sharing module each comprise three convolutional layers connected in sequence, and the outputs of the first two convolutional layers of the first and second modality-specific networks are used as inputs of the last two convolutional layers of the modality-sharing module;
and the output of the last convolutional layer of the first modality-specific network is added to the output of the last convolutional layer of the modality-sharing module as one input of the modality-adaptive attention module, while the output of the last convolutional layer of the second modality-specific network is added to the output of the last convolutional layer of the modality-sharing module as the other input of the modality-adaptive attention module.
It should be noted that the samples of each modality are input into their own modality-specific network in parallel, while the samples of the two modalities are jointly input into the modality-sharing module. The first modality-specific network, the second modality-specific network and the modality-sharing module all borrow the first three convolutional layers of the VGG network, with kernel sizes 7 × 7, 5 × 5 and 3 × 3 respectively. The outputs of the first two convolutional layers of each modality-specific network are used as inputs of the next convolutional layer of the modality-sharing module, and the output of the modality-sharing module is fused with the output of the first modality-specific network and with the output of the second modality-specific network respectively to obtain the final modality features.
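The wiring just described can be sketched structurally. The layer callables below are stand-ins for the 7 × 7, 5 × 5 and 3 × 3 convolutions, and the exact routing inside the shared branch is an assumption; the text only fixes that the first two specific-layer outputs feed the last two shared layers and that the final feature fuses the two branches by addition:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def extract_modality_feature(x, specific, shared):
    """One modality's feature extraction: `specific` and `shared` are lists
    of three layer callables. The shared branch's last two layers also
    consume the specific branch's first two outputs, and the modality
    feature is the element-wise sum of both branches' final outputs."""
    s1 = relu(specific[0](x))
    s2 = relu(specific[1](s1))
    s3 = relu(specific[2](s2))
    h1 = relu(shared[0](x))        # shared conv1 sees the raw input
    h2 = relu(shared[1](s1 + h1))  # assumed fusion of specific conv1 output
    h3 = relu(shared[2](s2 + h2))  # assumed fusion of specific conv2 output
    return s3 + h3                 # modality feature fed to the attention module
```

Each modality calls this with its own `specific` list while the same `shared` list is reused for both, which is what ties the two branches together.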
It should be noted that the modality-sharing module extracts the similar features of the visible light and thermal infrared data, further strengthens the relation between modalities, helps fuse the data of the two modalities, and thereby realizes feature complementation between modalities. Although unpaired visible light and thermal infrared data cannot be fused directly by a common fusion network, the two kinds of data have great similarity and share distribution characteristics; this embodiment extracts the shared features of the two modalities through the modality-sharing module to balance their fusion ratio.
In an embodiment, the modality-adaptive attention module includes first and second fully-connected layers sharing weights, third and fourth fully-connected layers being modality-specific, and a fifth fully-connected layer being modality-shared;
the visible light modal characteristics and the thermal infrared modal characteristics are respectively used as the input of the first full connection layer and the second full connection layer, and dimension reduction processing is carried out to obtain the visible light modal characteristics after dimension reduction and the thermal infrared modal characteristics after dimension reduction;
processing the dimension-reduced visible light modality features and the dimension-reduced thermal infrared modality features through the third and fourth fully-connected layers respectively, using a QKV mechanism, to form the attention matrices corresponding to the two modalities;
forming a modal shared query set by the attention matrixes corresponding to the two modalities through the fifth fully-connected layer;
and multiplying the modality-shared query set with the attention matrices corresponding to the two modalities respectively to obtain the enhanced feature maps corresponding to the two modalities.
In particular, the modality-adaptive attention module is composed of the first and second fully-connected layers of size 512 that share weights, the modality-specific third and fourth fully-connected layers of size 64, and the modality-shared fifth fully-connected layer of size 64; because the fifth fully-connected layer is modality-shared, each modality passes through it to obtain its features. First, the feature map output by the modality-sharing module is flattened to 512 dimensions and reduced to 64 dimensions by the first and second fully-connected layers. The modality-specific third and fourth fully-connected layers then form the visible light and thermal infrared specific keys K and values V, and vector dot products form the two modality-specific sub-attention matrices. The modality-shared fifth fully-connected layer forms the modality-shared query set Q, which is multiplied with the generated modality-specific attention matrices to obtain the final mutually enhanced features of the two modalities.
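The per-modality attention flow above can be sketched as follows; the weight shapes and the standard softmax(QKᵀ)V combination rule are assumptions standing in for the patent's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def enhance(feat, W_red, W_k, W_v, W_q):
    """One modality's pass: shared 512 -> 64 reduction (W_red),
    modality-specific key/value projections (W_k, W_v), and a
    modality-shared query projection (W_q); the attention matrix then
    re-weights the values to give the enhanced feature."""
    z = feat @ W_red
    k, v, q = z @ W_k, z @ W_v, z @ W_q
    attn = softmax(q @ k.T)  # sub-attention matrix for this modality
    return attn @ v

# Illustrative shapes: 10 candidate features of dimension 512.
rng = np.random.default_rng(0)
W_red = rng.normal(size=(512, 64)) * 0.01
W_q = rng.normal(size=(64, 64)) * 0.1            # modality-shared query
out_rgb = enhance(rng.normal(size=(10, 512)), W_red,
                  rng.normal(size=(64, 64)) * 0.1,  # visible-specific key
                  rng.normal(size=(64, 64)) * 0.1,  # visible-specific value
                  W_q)
```

The thermal infrared branch would call `enhance` with its own `W_k`/`W_v` while reusing `W_red` and `W_q`, mirroring the shared/specific split of the fully-connected layers.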
The modality-adaptive attention module thus enables information interaction: it provides a strong attention mechanism that automatically learns the specific gradient information preferences of the two modalities, is then optimized cooperatively during overall network optimization, and fully exploits the advantages of each modality to achieve stable visible light infrared target tracking. It unlocks the power of unpaired visible light infrared data and fully mines the potential of single-modality data.
In one embodiment, the modality adaptation module comprises two fully-connected layers and a modality connection layer connected in sequence, and the modality connection layer comprises a visible light modality fully-connected layer and a thermal infrared modality fully-connected layer corresponding to the two modalities;
a dropout (random neuron deactivation) function is added after the first two sequentially connected fully-connected layers;
and the visible light modality fully-connected layer and the thermal infrared modality fully-connected layer each comprise a softmax layer, used to compute the score values of the positive and negative candidate samples and to predict the target position.
Specifically, the modal adaptation module comprises four full-connection layers with sizes of 1024,512,2 and 2 respectively, two full-connection layers with sizes of 2 and 2 are combined in parallel to form the modal connection layer, and a dropout (neuron random activation) normalization method is added behind the two full-connection layers with sizes of 1024,512 to reduce the risk of overfitting. Finally, two full-connection layers with the size of 2 and divided according to the mode comprise softmax layers which are respectively used for calculating the positive score f and the negative score f of each candidate sample characteristic in parallel+(xi) And f-(xi) Calculating the target probability of the candidate sample, wherein the detection box with the highest target probability is the predicted target tracking result, and finally predicting the target position according to the following formula:
x* = argmax over xi of f+(xi)

where xi denotes the i-th candidate sample, f+(xi) denotes the probability that the sample is the target, f-(xi) denotes the probability that the sample is background, and x* is the predicted target position.
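Under the assumption that the final size-2 layer emits raw (negative, positive) scores per candidate, the softmax scoring and argmax selection described above can be sketched as follows (the column convention is an illustrative assumption):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def predict_target(scores):
    """Pick the candidate with the highest positive (target) probability.

    scores: (N, 2) raw scores from the final size-2 full-connection layer,
    one row per candidate sample x_i; column 0 is treated as the negative
    (background) score and column 1 as the positive (target) score.
    Returns (best_index, f_plus) so that x* = x_{best_index}.
    """
    probs = softmax(scores)        # each row becomes (f-(x_i), f+(x_i))
    f_plus = probs[:, 1]           # target probability per candidate
    return int(np.argmax(f_plus)), f_plus

# toy run: three candidates, the second clearly looks like the target
idx, f_plus = predict_target(np.array([[2.0, -1.0], [0.1, 3.0], [1.0, 0.5]]))
```

The argmax over f+(xi) directly implements the prediction formula above.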
In some embodiments, the method further comprises:
training the visible light infrared tracker by using a cross entropy loss function generated from the sample label y and the predicted score y^, wherein the cross entropy loss function is as follows:
Loss=-(y*log(y^)+(1-y)*log(1-y^));
optimizing the whole network of the visible light infrared tracker by a stochastic gradient descent method.
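A minimal NumPy sketch of this cross entropy loss and of one plain stochastic gradient descent update, with illustrative values and the loss averaged over a small batch:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # Loss = -(y*log(y^) + (1-y)*log(1-y^)), averaged over the batch
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return float(np.mean(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))))

def sgd_step(w, grad, lr=0.001):
    # one plain stochastic gradient descent update: w <- w - lr * grad
    return w - lr * grad

labels = np.array([1.0, 1.0, 0.0])   # candidate labels (target / background)
preds = np.array([0.9, 0.8, 0.2])    # predicted target probabilities y^
loss = cross_entropy(labels, preds)
```

Perfect predictions drive the loss to zero, and the SGD step simply moves each parameter against its gradient.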
Specifically, the training process of the visible-light infrared tracker is as follows:
(1) A pre-trained VGG model is used to initialize the parameters of the modality-specific module and the modality-sharing module, while the modality-adaptive attention module and the modality adaptation module are randomly initialized. The modality-specific module and the modality-sharing module each consist of three convolutional layers followed by a ReLU (rectified linear unit) nonlinearity; the first two convolutional layers are additionally followed by LRN (local response normalization) and MaxPool (max pooling), and the convolution kernels are of sizes 7×7×96, 5×5×256 and 3×3×512.
(2) The whole network is trained using artificially labeled visible light and thermal infrared data that do not need to be paired, and 256 candidate samples are randomly selected near the ground-truth box using a Gaussian distribution.
(3) In the first stage, thermal infrared data or visible light data is randomly input to train the modality sharing module; network parameters are updated with the stochastic gradient descent algorithm (SGD), and every training sequence updates the parameters of the modality sharing module. In the second stage, thermal infrared data and visible light data are input simultaneously to train the remaining modules of each modality; SGD again updates the network parameters, and the branch of each modality is iteratively updated with its corresponding video sequences. The final model is saved for the online tracking phase.
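The Gaussian candidate sampling of step (2) can be sketched as follows; the sigma values and the log-space scale jitter are illustrative assumptions, since the description only states that 256 candidates are drawn near the ground-truth box with a Gaussian distribution:

```python
import numpy as np

def sample_candidates(gt_box, n=256, pos_sigma=0.1, scale_sigma=0.05, rng=None):
    """Draw candidate boxes around a ground-truth box (x, y, w, h).

    Centres are perturbed with Gaussian noise proportional to the box size,
    and scales with Gaussian noise in log-space; the sigma values here are
    illustrative choices, not parameters stated in the description.
    """
    rng = np.random.default_rng(rng)
    x, y, w, h = gt_box
    dx = rng.normal(0.0, pos_sigma * w, n)      # centre jitter, x direction
    dy = rng.normal(0.0, pos_sigma * h, n)      # centre jitter, y direction
    ds = np.exp(rng.normal(0.0, scale_sigma, n))  # multiplicative scale jitter
    return np.stack([x + dx, y + dy, w * ds, h * ds], axis=1)

cands = sample_candidates((50.0, 40.0, 30.0, 20.0), n=256, rng=0)
```

Candidates whose overlap with the ground truth is high become positive samples, the rest negative samples.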
In the model training process, the modality sharing module and the modality specific module are trained separately, using unpaired multi-modal data. Training with unpaired multi-modal data solves the problem of dependence on large-scale aligned training data in RGBT tracking, makes full use of existing thermal infrared and visible light data sets, and saves a large amount of labor and time cost. The trained tracker reveals the strength of unpaired RGBT data and effectively exerts the advantages of each modality.
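As an illustration of the two-stage schedule in step (3), the following sketch logs which module each unpaired sequence would update; the function and variable names are hypothetical, and real SGD parameter updates are replaced by log entries so the data flow is explicit:

```python
import random

def train_two_stage(rgb_seqs, tir_seqs, seed=0):
    """Log which module each unpaired sequence would update (names hypothetical)."""
    rng = random.Random(seed)
    log = []

    # Stage 1: each step draws ONE modality at random and updates only the
    # modality sharing module with that unpaired sequence.
    pool = [('rgb', s) for s in rgb_seqs] + [('tir', s) for s in tir_seqs]
    rng.shuffle(pool)
    for _modality, seq in pool:
        log.append(('shared', seq))

    # Stage 2: both modalities are trained; each modality-specific branch is
    # iteratively updated only with its own video sequences.
    for seq in rgb_seqs:
        log.append(('specific_rgb', seq))
    for seq in tir_seqs:
        log.append(('specific_tir', seq))
    return log

log = train_two_stage(['rgb_a', 'rgb_b'], ['tir_a'])
```

Note that nothing in the schedule requires an RGB sequence and a TIR sequence to show the same scene, which is exactly what allows unpaired data.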
It should be noted that the existing visible light and thermal infrared data sets are difficult to be perfectly paired, and the existing visible light infrared target tracking algorithm is designed according to the characteristics of data in two paired modes, which may cause the module design to fail to fully exert its performance.
In an embodiment, the process of tracking the target by using the trained visible light infrared tracker includes:
(1) According to the visible light and thermal infrared paired video sequence, the first-frame ground-truth box of the video sequence is extracted, and the network model is initialized with the pre-trained parameters to obtain the new layers. At this time, the learning rates of the first two full-connection layers of the modality adaptation module are set to 0.001, and the learning rate of the last full-connection layer is set to 0.0005. After initialization is completed, 256 candidate samples are generated using Gaussian-distributed sampling.
(2) The candidate samples are sent to the corresponding modality-specific modules and, in parallel, to the modality-sharing module; the result of each modality-specific layer is fed into the next modality-sharing layer, and the result of the modality-sharing module is fused with the result of the modality-specific module per modality to obtain the modal features. The modal features are sent into the modality-adaptive attention module, where the modality-shared full connection forms the modality-shared query set Q, which is multiplied by the generated modality-specific attention matrices to obtain the final, mutually enhanced modal features. The enhanced feature maps of the different modalities from the last convolutional layer are concatenated along the channel dimension to obtain an overall feature map, which is sent into the final modality adaptation module; a softmax function then yields the classification score, and the target position is predicted.
(3) When the target probability of the predicted sample is greater than 0.5, tracking is judged successful. When it is less than 0.5, tracking fails and a short-term update is performed: if the number of frames in the positive and negative sample data sets exceeds 20, the negative sample region of the earliest frame is discarded; 32 positive samples and 96 negative samples are then drawn from the sample sets to fine-tune the parameters of the full-connection layers for 10 iterations with a learning rate of 0.00003.
(4) During online tracking, a long-term update is performed every 8 frames: if the number of frames in the positive and negative sample data sets exceeds 100, the positive sample region of the earliest frame is discarded; 32 positive samples and 96 negative samples are drawn from the sample sets to fine-tune the parameters of the full-connection layers for 10 iterations with a learning rate of 0.00003. If neither the short-term nor the long-term update condition is met, the next frame is tracked directly without updating the model.
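The per-modality feature fusion of step (2) — adding the modality-specific and modality-shared outputs, enhancing each modality with its attention matrix, then concatenating along channels — can be sketched as follows, with small illustrative 2-D matrices standing in for the real feature maps and attention matrices:

```python
import numpy as np

def forward_fusion(rgb_specific, tir_specific, shared_rgb, shared_tir,
                   attn_rgb, attn_tir):
    """Data-flow sketch of step (2): (C, D) matrices stand in for real
    feature maps, (D, D) matrices for the attention matrices."""
    rgb = rgb_specific + shared_rgb        # visible light modal feature
    tir = tir_specific + shared_tir        # thermal infrared modal feature
    rgb_enh = rgb @ attn_rgb               # attention-enhanced visible feature
    tir_enh = tir @ attn_tir               # attention-enhanced thermal feature
    # channel-wise concatenation into the overall feature map
    return np.concatenate([rgb_enh, tir_enh], axis=0)

# toy run with identity attention, so the fusion sums are easy to verify
out = forward_fusion(np.ones((2, 4)), np.zeros((2, 4)),
                     np.full((2, 4), 0.5), np.full((2, 4), 0.5),
                     np.eye(4), np.eye(4))
```

With identity attention, the output simply stacks the two per-modality sums, making the additive fusion visible.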
Both the long-term update and the short-term update adapt to changes in the appearance of the tracking target by updating the model parameters with sample data. The short-term update is applied immediately when tracking fails (an immediate adjustment), while the long-term update adapts to gradual changes in the target during tracking (an adjustment at a fixed frame interval).
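A minimal sketch of the update scheduling in steps (3)-(4), assuming the long-term interval of 8 frames and the 0.5 success threshold stated above, with a tracking failure taking priority over the periodic update:

```python
def update_decision(frame_idx, target_prob, long_interval=8):
    """Return which update fires for a frame (failure takes priority)."""
    if target_prob < 0.5:
        return 'short_term'   # immediate fine-tune after a tracking failure
    if frame_idx % long_interval == 0:
        return 'long_term'    # periodic fine-tune every `long_interval` frames
    return 'none'             # track the next frame without updating
```

For example, frame 16 with probability 0.9 triggers a long-term update, while any frame with probability below 0.5 triggers a short-term one.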
In this embodiment, the method of the present invention and several existing methods are tested on the public data sets GTOT, RGBT234 and LasHeR, and the test results are evaluated against other trackers on SR (success rate) and PR (precision rate); the results are shown in Table 1:
TABLE 1
[Table 1 is provided as an image in the original publication: SR/PR scores of UMT and the comparison trackers on GTOT, RGBT234 and LasHeR.]
Where "_" indicates that no experiment was performed on that data set, and UMT denotes the target tracking method used in the present invention (the other methods are comparison methods). As can be observed from the data in Table 1, the method of the invention achieves a higher success rate and precision on the existing data sets, and the tracking performance is uniformly improved to a certain extent.
In addition, as shown in fig. 4, an embodiment of the present invention further provides a training apparatus for tracking a visible light infrared target based on unpaired data, where the training apparatus includes:
an obtaining module 10, configured to obtain an unpaired visible light image and thermal infrared image, and generate a candidate sample based on the visible light image and the thermal infrared image, where the candidate sample includes a positive sample and a negative sample;
a training module 20, configured to train the visible light infrared tracker by using the candidate sample;
the visible light infrared tracker comprises a modality specific module, a modality sharing module, a modality self-adaptive attention module and a modality adaptation module which are sequentially connected, wherein the modality specific module comprises a first modality specific network and a second modality specific network; the visible light image is used as the input of the first modality specific network and the input of the modality sharing module, the thermal infrared image is used as the input of the second modality specific network and the input of the modality sharing module, the visible light modality feature obtained by adding the output of the first modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module, and the thermal infrared modality feature obtained by adding the output of the second modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module.
In an embodiment, the first modality specific network, the second modality specific network and the modality sharing module each include three convolutional layers connected in sequence, and the outputs of the first two convolutional layers of the first modality specific network and the second modality specific network are used as inputs of the last two convolutional layers of the modality sharing module;

and the output of the last convolutional layer of the first modality specific network is added to the output of the last convolutional layer of the modality sharing module and used as an input of the modality self-adaptive attention module, and the output of the last convolutional layer of the second modality specific network is added to the output of the last convolutional layer of the modality sharing module and used as an input of the modality self-adaptive attention module.
In an embodiment, the modality-adaptive attention module includes first and second fully-connected layers sharing weights, third and fourth fully-connected layers being modality-specific, and a fifth fully-connected layer being modality-shared;
the visible light modal characteristics and the thermal infrared modal characteristics are respectively used as the input of the first full connection layer and the second full connection layer, and dimension reduction processing is carried out to obtain the visible light modal characteristics after dimension reduction and the thermal infrared modal characteristics after dimension reduction;
processing the dimension-reduced visible light modal characteristics and the dimension-reduced thermal infrared modal characteristics respectively through the third full connection layer and the fourth full connection layer using a QKV (query-key-value) mechanism, so as to respectively form the attention matrices corresponding to the two modalities;
forming a modal shared query set by the attention matrixes corresponding to the two modalities through the fifth fully-connected layer;
and multiplying the modality-shared query set by the attention matrices corresponding to the two modalities respectively to obtain the enhanced feature maps corresponding to the two modalities.
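The five-fully-connected-layer attention block described above can be sketched as follows. The weight shapes, the addition used to form the shared query set Q, and the elementwise multiplication used for "multiplying the query set by the attention matrices" are illustrative assumptions, since the text does not pin down the exact operations:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def modality_adaptive_attention(rgb, tir, w_shared, w_rgb, w_tir, w_q):
    """Sketch of the five fully connected layers described above.

    w_shared : weight of the shared 1st/2nd FC layers (dimension reduction)
    w_rgb, w_tir : weights of the modality-specific 3rd/4th FC layers
    w_q : weight of the modality-shared 5th FC layer producing the query set Q
    """
    rgb_red = rgb @ w_shared              # dimension-reduced visible feature
    tir_red = tir @ w_shared              # dimension-reduced thermal feature
    attn_rgb = softmax(rgb_red @ w_rgb)   # attention matrix, visible modality
    attn_tir = softmax(tir_red @ w_tir)   # attention matrix, thermal modality
    q = (attn_rgb + attn_tir) @ w_q       # modality-shared query set Q
    # "multiply Q by the attention matrices respectively" read here as an
    # elementwise gating of the shared query set (an assumption)
    return q * attn_rgb, q * attn_tir

rng = np.random.default_rng(0)
rgb = rng.normal(size=(5, 16))            # 5 candidates, 16-dim visible feats
tir = rng.normal(size=(5, 16))
w_shared = rng.normal(size=(16, 8))
w_rgb, w_tir, w_q = (rng.normal(size=(8, 8)) for _ in range(3))
rgb_enh, tir_enh = modality_adaptive_attention(rgb, tir, w_shared, w_rgb, w_tir, w_q)
```

Because the same w_shared and w_q act on both modalities, the two enhanced outputs are coupled, which is the "mutual enhancement" the module is meant to provide.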
In one embodiment, the modality adaptation module comprises two full-connection layers and a modality connection layer which are sequentially connected, and the modality connection layer comprises a visible light modality full-connection layer and a thermal infrared modality full-connection layer which correspond to the two modalities;
adding a neuron random deactivation (dropout) function after the first two sequentially connected full-connection layers;
the visible light modal full-connection layer and the thermal infrared modal full-connection layer comprise softmax layers, and the softmax layers are used for calculating positive and negative sample score values in the candidate samples and predicting target positions.
In one embodiment, the training module 20 includes:
the training unit is used for training the visible light infrared tracker by using a cross entropy loss function generated from the sample label y and the predicted score y^, wherein the cross entropy loss function is as follows:
Loss=-(y*log(y^)+(1-y)*log(1-y^));
and the optimization unit is used for optimizing the whole network of the visible light infrared tracker by a stochastic gradient descent method.
It should be noted that for other embodiments or implementations of the non-paired-data-based visible light infrared target tracking training apparatus of the present invention, reference can be made to the above method embodiments, which are not repeated here.
It should be noted that the logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A visible light infrared target tracking training method based on unpaired data is characterized by comprising the following steps:
acquiring an unpaired visible light image and thermal infrared image, and generating a candidate sample based on the visible light image and the thermal infrared image, wherein the candidate sample comprises a positive sample and a negative sample;
training a visible light infrared tracker by using the candidate sample;
the visible light infrared tracker comprises a modality specific module, a modality sharing module, a modality self-adaptive attention module and a modality adaptation module which are sequentially connected, wherein the modality specific module comprises a first modality specific network and a second modality specific network; the visible light image is used as the input of the first modality specific network and the input of the modality sharing module, the thermal infrared image is used as the input of the second modality specific network and the input of the modality sharing module, the visible light modality feature obtained by adding the output of the first modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module, and the thermal infrared modality feature obtained by adding the output of the second modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module.
2. The visible light infrared target tracking training method based on unpaired data according to claim 1, wherein the first modality specific network, the second modality specific network and the modality sharing module each include three convolutional layers connected in sequence, and outputs of the first two convolutional layers of the first modality specific network and the second modality specific network are used as inputs of the last two convolutional layers of the modality sharing module;

and the output of the last convolutional layer of the first modality specific network is added to the output of the last convolutional layer of the modality sharing module and used as an input of the modality self-adaptive attention module, and the output of the last convolutional layer of the second modality specific network is added to the output of the last convolutional layer of the modality sharing module and used as an input of the modality self-adaptive attention module.
3. The visible light infrared target tracking training method based on unpaired data according to claim 1, wherein the modality-adaptive attention module includes first and second fully-connected layers sharing weights, third and fourth fully-connected layers being modality-specific, and a fifth fully-connected layer being modality-shared;
the visible light modal characteristics and the thermal infrared modal characteristics are respectively used as the input of the first full connection layer and the second full connection layer, and dimension reduction processing is carried out to obtain the visible light modal characteristics after dimension reduction and the thermal infrared modal characteristics after dimension reduction;
processing the dimension-reduced visible light modal characteristics and the dimension-reduced thermal infrared modal characteristics respectively through the third full connection layer and the fourth full connection layer using a QKV (query-key-value) mechanism, so as to respectively form the attention matrices corresponding to the two modalities;
forming a modal shared query set by the attention matrixes corresponding to the two modalities through the fifth fully-connected layer;
and multiplying the modality-shared query set by the attention matrices corresponding to the two modalities respectively to obtain the enhanced feature maps corresponding to the two modalities.
4. The visible light infrared target tracking training method based on unpaired data according to claim 1, wherein the mode adapting module comprises two full connection layers and a mode connection layer which are connected in sequence, and the mode connection layer comprises a visible light mode full connection layer and a thermal infrared mode full connection layer which correspond to two modes;
adding a neuron random deactivation (dropout) function after the first two sequentially connected full-connection layers;
the visible light modal full connection layer and the thermal infrared modal full connection layer comprise softmax layers, and the softmax layers are used for calculating positive and negative sample score values in the candidate samples and predicting target positions.
5. The visible light infrared target tracking training method based on unpaired data according to claim 4, further comprising:
training the visible light infrared tracker by using a cross entropy loss function generated from the sample label y and the predicted score y^, wherein the cross entropy loss function is as follows:
Loss=-(y*log(y^)+(1-y)*log(1-y^));
optimizing the whole network of the visible light infrared tracker by a stochastic gradient descent method.
6. A non-paired data based visible light infrared target tracking training device, the device comprising:
the device comprises an acquisition module, a comparison module and a processing module, wherein the acquisition module is used for acquiring unpaired visible light images and thermal infrared images and generating candidate samples based on the visible light images and the thermal infrared images, and the candidate samples comprise positive samples and negative samples;
the training module is used for training the visible light infrared tracker by using the candidate sample;
the visible light infrared tracker comprises a modality specific module, a modality sharing module, a modality self-adaptive attention module and a modality adaptation module which are sequentially connected, wherein the modality specific module comprises a first modality specific network and a second modality specific network; the visible light image is used as the input of the first modality specific network and the input of the modality sharing module, the thermal infrared image is used as the input of the second modality specific network and the input of the modality sharing module, the visible light modality feature obtained by adding the output of the first modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module, and the thermal infrared modality feature obtained by adding the output of the second modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module.
7. The visible light infrared target tracking training device based on unpaired data according to claim 6, wherein the first modality specific network, the second modality specific network and the modality sharing module each include three convolutional layers connected in sequence, and outputs of the first two convolutional layers of the first modality specific network and the second modality specific network are used as inputs of the last two convolutional layers of the modality sharing module;

and the output of the last convolutional layer of the first modality specific network is added to the output of the last convolutional layer of the modality sharing module and used as an input of the modality self-adaptive attention module, and the output of the last convolutional layer of the second modality specific network is added to the output of the last convolutional layer of the modality sharing module and used as an input of the modality self-adaptive attention module.
8. The visible light infrared target tracking training device based on unpaired data according to claim 7, wherein the modality-adaptive attention module includes first and second fully-connected layers sharing weights, third and fourth fully-connected layers being modality-specific, and a fifth fully-connected layer being modality-shared;
the visible light modal characteristics and the thermal infrared modal characteristics are respectively used as the input of the first full connection layer and the second full connection layer, and dimension reduction processing is carried out to obtain the visible light modal characteristics after dimension reduction and the thermal infrared modal characteristics after dimension reduction;
processing the dimension-reduced visible light modal characteristics and the dimension-reduced thermal infrared modal characteristics respectively through the third full connection layer and the fourth full connection layer using a QKV (query-key-value) mechanism, so as to respectively form the attention matrices corresponding to the two modalities;
forming a modal shared query set by the attention matrixes corresponding to the two modalities through the fifth fully-connected layer;
and multiplying the modality-shared query set by the attention matrices corresponding to the two modalities respectively to obtain the enhanced feature maps corresponding to the two modalities.
9. The visible light infrared target tracking training device based on unpaired data according to claim 7, wherein the mode adapting module comprises two full connection layers and a mode connection layer which are connected in sequence, and the mode connection layer comprises a visible light mode full connection layer and a thermal infrared mode full connection layer which correspond to two modes;
adding a neuron random deactivation (dropout) function after the first two sequentially connected full-connection layers;
the visible light modal full-connection layer and the thermal infrared modal full-connection layer comprise softmax layers, and the softmax layers are used for calculating positive and negative sample score values in the candidate samples and predicting target positions.
10. The visible-light infrared target tracking training device based on unpaired data of claim 9, wherein the training module comprises:
the training unit is used for training the visible light infrared tracker by using a cross entropy loss function generated from the sample label y and the predicted score y^, wherein the cross entropy loss function is as follows:
Loss=-(y*log(y^)+(1-y)*log(1-y^));
and the optimization unit is used for optimizing the whole network of the visible light infrared tracker by a stochastic gradient descent method.
CN202210095429.8A 2022-01-26 2022-01-26 Visible light infrared target tracking training method and device based on non-paired data Pending CN114445461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210095429.8A CN114445461A (en) 2022-01-26 2022-01-26 Visible light infrared target tracking training method and device based on non-paired data


Publications (1)

Publication Number Publication Date
CN114445461A true CN114445461A (en) 2022-05-06

Family

ID=81370637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210095429.8A Pending CN114445461A (en) 2022-01-26 2022-01-26 Visible light infrared target tracking training method and device based on non-paired data

Country Status (1)

Country Link
CN (1) CN114445461A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115115919A (en) * 2022-06-24 2022-09-27 国网智能电网研究院有限公司 Power grid equipment thermal defect identification method and device
CN115115919B (en) * 2022-06-24 2023-05-05 国网智能电网研究院有限公司 Power grid equipment thermal defect identification method and device
CN115018884A (en) * 2022-07-19 2022-09-06 安徽大学 Visible light infrared visual tracking method based on multi-strategy fusion tree
CN115018884B (en) * 2022-07-19 2024-03-15 安徽大学 Visible light infrared visual tracking method based on multi-strategy fusion tree
CN116563584A (en) * 2023-07-10 2023-08-08 安徽启新明智科技有限公司 Image matching method, device and equipment
CN116563584B (en) * 2023-07-10 2023-11-14 安徽启新明智科技有限公司 Image matching method, device and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination