CN114445461A - Visible light infrared target tracking training method and device based on non-paired data - Google Patents


Info

Publication number
CN114445461A
CN114445461A
Authority
CN
China
Prior art keywords: modality; visible light; module; modal; specific network
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number
CN202210095429.8A
Other languages
Chinese (zh)
Inventor
李成龙
何小倩
沈庆
汤进
Current Assignee (the listed assignees may be inaccurate)
Anhui University
Original Assignee
Anhui University
Priority date (the priority date is an assumption and is not a legal conclusion)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202210095429.8A priority Critical patent/CN114445461A/en
Publication of CN114445461A publication Critical patent/CN114445461A/en


Classifications

    • G06T7/251 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving models
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 — Matching criteria, e.g. proximity measures
    • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045 — Combinations of networks
    • G06N3/08 — Learning methods for neural networks
    • G06T2207/10004 — Still image; Photographic image
    • G06T2207/10048 — Infrared image
    • G06T2207/20081 — Training; Learning
    • G06T2207/20084 — Artificial neural networks [ANN]

Abstract

The invention discloses a visible light infrared target tracking training method and device based on unpaired data. The method comprises: acquiring unpaired visible light images and thermal infrared images and generating candidate samples; and training a visible light infrared tracker with the candidate samples. The tracker comprises a modality-specific module, a modality-sharing module, a modality-adaptive attention module and a modality adaptation module connected in sequence, where the modality-specific module comprises a first modality-specific network and a second modality-specific network. The visible light image serves as the input of the first modality-specific network and of the modality-sharing module, and the thermal infrared image serves as the input of the second modality-specific network and of the modality-sharing module; the outputs of the first and second modality-specific networks are each fused with the output of the modality-sharing module before being passed to the modality-adaptive attention module. The method removes the dependence on large-scale registered data and improves target tracking performance.

Description

Visible light infrared target tracking training method and device based on non-paired data
Technical Field
The invention relates to the technical field of computer vision, in particular to a visible light infrared target tracking training method and device based on unpaired data.
Background
Target tracking has long been an important problem in computer vision. In recent years, target tracking technology has made great breakthroughs and is widely applied in intelligent transportation, autonomous driving, robotics and other fields. The task of target tracking is to predict the size and position of a target in the subsequent frames of a video sequence, given the size and position of the target in the initial frame.
Most target tracking algorithms operate on a single visible light modality and achieve excellent performance under that condition, but their robustness still needs improvement in complex environments or extreme conditions such as haze and low light. In recent years, more and more sensor technologies have been applied to target tracking, such as thermal infrared sensors and depth sensors. A thermal infrared sensor images a target by capturing its temperature information and has low sensitivity to illumination conditions. Meanwhile, visible light data can compensate for the blurred edges and limited detail of thermal infrared images, and the complementarity of the visible light and thermal infrared data can help an algorithm achieve stable tracking.
At present, most research on visible light infrared tracking algorithms focuses on multi-modal complementarity, fusing the information of each modality to achieve a more robust tracking result. However, such algorithms usually rely on registered (paired) visible light and thermal infrared data; for example, the all-weather real-time target tracking method based on visible light and infrared images disclosed in invention patent application No. 201510521038.8 must register the visible light and infrared images before performing tracking detection. Thermal infrared data can be acquired with a thermal infrared sensor, but pairing it with matching visible light data requires extensive manual selection and annotation, which makes dataset construction challenging; moreover, few standard paired visible light thermal infrared datasets are publicly available, so the advantages of the thermal infrared modality cannot be fully exploited.
Disclosure of Invention
The technical problem addressed by the invention is how to remove the dependence of visible light infrared target tracking training on large-scale registered data.
The invention solves the technical problems through the following technical means:
on one hand, the embodiment of the invention provides a visible light infrared target tracking training method based on non-paired data, which comprises the following steps:
acquiring an unpaired visible light image and thermal infrared image, and generating a candidate sample based on the visible light image and the thermal infrared image, wherein the candidate sample comprises a positive sample and a negative sample;
training a visible light infrared tracker by using the candidate sample;
the visible light infrared tracker comprises a modality-specific module, a modality-sharing module, a modality-adaptive attention module and a modality adaptation module connected in sequence, where the modality-specific module comprises a first modality-specific network and a second modality-specific network; the visible light image is used as the input of the first modality-specific network and of the modality-sharing module, and the thermal infrared image is used as the input of the second modality-specific network and of the modality-sharing module; the visible light modality feature, obtained by adding the output of the first modality-specific network to the output of the modality-sharing module, and the thermal infrared modality feature, obtained by adding the output of the second modality-specific network to the output of the modality-sharing module, are both used as inputs of the modality-adaptive attention module.
The invention provides a first modality-specific network and a second modality-specific network to extract features from the visible light image and the thermal infrared image respectively, a modality-sharing module to extract the similar features of the visible light and thermal infrared data and thereby strengthen the relation between modalities, and a modality-adaptive attention module to realize learning and mutual enhancement between the unpaired visible light and thermal infrared modalities. This unlocks the power of unpaired visible light infrared data, effectively avoids the problem of insufficient training data, and fully mines and exploits bimodal information on a limited dataset to achieve mutual enhancement between unpaired multimodal data, training a robust visible light infrared tracker.
Further, the first modality-specific network, the second modality-specific network and the modality-sharing module each comprise three convolutional layers connected in sequence, and the outputs of the first two convolutional layers of the first and second modality-specific networks are used as inputs of the last two convolutional layers of the modality-sharing module;
and the output of the last convolutional layer of the first modality-specific network is added to the output of the last convolutional layer of the modality-sharing module as one input of the modality-adaptive attention module, while the output of the last convolutional layer of the second modality-specific network is added to the output of the last convolutional layer of the modality-sharing module as the other input of the modality-adaptive attention module.
Further, the modality-adaptive attention module includes first and second fully-connected layers sharing weights, third and fourth fully-connected layers being modality-specific, and a fifth fully-connected layer being modality-shared;
the visible light modal characteristics and the thermal infrared modal characteristics are respectively used as the input of the first full connection layer and the second full connection layer, and dimension reduction processing is carried out to obtain the visible light modal characteristics after dimension reduction and the thermal infrared modal characteristics after dimension reduction;
processing the dimension-reduced visible light modality features and the dimension-reduced thermal infrared modality features through the third and fourth fully-connected layers respectively, using a QKV mechanism, to form the attention matrices corresponding to the two modalities;
forming a modal shared query set by the attention matrixes corresponding to the two modalities through the fifth fully-connected layer;
and multiplying the modality-shared query set with the attention matrices corresponding to the two modalities respectively to obtain the enhanced feature maps corresponding to the two modalities.
Further, the modality adaptation module comprises two fully-connected layers and a modality connection layer connected in sequence, where the modality connection layer comprises a visible light modality fully-connected layer and a thermal infrared modality fully-connected layer corresponding to the two modalities;
a dropout (random neuron deactivation) function is added after the first two sequentially connected fully-connected layers;
and the visible light modality fully-connected layer and the thermal infrared modality fully-connected layer each comprise a softmax layer, used to compute the score values of the positive and negative candidate samples and to predict the target position.
Further, the method further comprises:
training the visible light infrared tracker with a cross-entropy loss computed from the sample label y (1 for positive samples, 0 for negative samples) and the predicted positive score ŷ, where the cross-entropy loss is:
Loss = -(y·log(ŷ) + (1-y)·log(1-ŷ));
and optimizing the whole network of the visible light infrared tracker by stochastic gradient descent.
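As a minimal illustration, the cross-entropy loss above can be written directly in NumPy; the clipping epsilon is an implementation detail added here for numerical safety, not part of the patent:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean cross-entropy between sample labels y (1 = positive sample,
    0 = negative sample) and predicted positive-class scores y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))))
```

For a well-scored pair such as y = [1, 0] and ŷ = [0.9, 0.1], the loss is -log 0.9 ≈ 0.105.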
In another aspect, the present invention provides a visible light infrared target tracking training apparatus based on unpaired data, comprising:
an acquisition module, used to acquire unpaired visible light images and thermal infrared images and generate candidate samples based on the visible light images and the thermal infrared images, the candidate samples comprising positive samples and negative samples;
the training module is used for training the visible light infrared tracker by using the candidate sample;
the visible light infrared tracker comprises a modality specific module, a modality sharing module, a modality self-adaptive attention module and a modality adaptation module which are sequentially connected, wherein the modality specific module comprises a first modality specific network and a second modality specific network; the visible light image is used as the input of the first modality specific network and the input of the modality sharing module, the thermal infrared image is used as the input of the second modality specific network and the input of the modality sharing module, the visible light modality feature obtained by adding the output of the first modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module, and the thermal infrared modality feature obtained by adding the output of the second modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module.
Further, the first modality-specific network, the second modality-specific network and the modality-sharing module each comprise three convolutional layers connected in sequence, and the outputs of the first two convolutional layers of the first and second modality-specific networks are used as inputs of the last two convolutional layers of the modality-sharing module;
and the output of the last convolutional layer of the first modality-specific network is added to the output of the last convolutional layer of the modality-sharing module as one input of the modality-adaptive attention module, while the output of the last convolutional layer of the second modality-specific network is added to the output of the last convolutional layer of the modality-sharing module as the other input of the modality-adaptive attention module.
Further, the modality-adaptive attention module includes first and second fully-connected layers sharing weights, third and fourth fully-connected layers being modality-specific, and a fifth fully-connected layer being modality-shared;
the visible light modal characteristics and the thermal infrared modal characteristics are respectively used as the input of the first full connection layer and the second full connection layer, and dimension reduction processing is carried out to obtain the visible light modal characteristics after dimension reduction and the thermal infrared modal characteristics after dimension reduction;
processing the dimension-reduced visible light modality features and the dimension-reduced thermal infrared modality features through the third and fourth fully-connected layers respectively, using a QKV mechanism, to form the attention matrices corresponding to the two modalities;
forming a modal shared query set by the attention matrixes corresponding to the two modalities through the fifth fully-connected layer;
and multiplying the modality-shared query set with the attention matrices corresponding to the two modalities respectively to obtain the enhanced feature maps corresponding to the two modalities.
Further, the modality adaptation module comprises two fully-connected layers and a modality connection layer connected in sequence, where the modality connection layer comprises a visible light modality fully-connected layer and a thermal infrared modality fully-connected layer corresponding to the two modalities;
a dropout (random neuron deactivation) function is added after the first two sequentially connected fully-connected layers;
and the visible light modality fully-connected layer and the thermal infrared modality fully-connected layer each comprise a softmax layer, used to compute the score values of the positive and negative candidate samples and to predict the target position.
Further, the training module comprises:
the training unit is used to train the visible light infrared tracker with a cross-entropy loss computed from the sample label y (1 for positive samples, 0 for negative samples) and the predicted positive score ŷ, where the cross-entropy loss is:
Loss = -(y·log(ŷ) + (1-y)·log(1-ŷ));
and the optimization unit is used to optimize the whole network of the visible light infrared tracker by stochastic gradient descent.
The invention has the advantages that:
(1) The invention provides a first modality-specific network and a second modality-specific network to extract features from the visible light image and the thermal infrared image respectively, a modality-sharing module to extract the similar features of the visible light and thermal infrared data and thereby strengthen the relation between modalities, and a modality-adaptive attention module to realize learning and mutual enhancement between the unpaired visible light and thermal infrared modalities. This unlocks the power of unpaired visible light infrared data, effectively avoids the problem of insufficient training data, and fully mines and exploits bimodal information on a limited dataset to achieve mutual enhancement between unpaired multimodal data, training a robust visible light infrared tracker.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of a visible light infrared target tracking training method based on unpaired data according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating a visible light infrared target tracking training method based on unpaired data according to a first embodiment of the present invention;
FIG. 3 is a block diagram of a visible infrared tracker in accordance with the present invention;
FIG. 4 is a block diagram of a non-paired data based visible light infrared target tracking training device according to a second embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1 to fig. 3, an embodiment of the present invention provides a visible light infrared target tracking training method based on unpaired data, including the following steps:
s10, acquiring an unpaired visible light image and thermal infrared image, and generating a candidate sample based on the visible light image and the thermal infrared image, wherein the candidate sample comprises a positive sample and a negative sample;
it should be noted that, by sampling 8 consecutive frames in each unpaired multimodal video, an artificially labeled detection box in the image is used to represent the current tracked target, and 256 positive samples and 768 negative samples are generated for the sampled data of the two modalities in a gaussian distribution random manner.
S20, training the visible light infrared tracker by using the candidate sample;
the visible light infrared tracker comprises a modality-specific module, a modality-sharing module, a modality-adaptive attention module and a modality adaptation module connected in sequence, where the modality-specific module comprises a first modality-specific network and a second modality-specific network; the visible light image is used as the input of the first modality-specific network and of the modality-sharing module, and the thermal infrared image is used as the input of the second modality-specific network and of the modality-sharing module; the visible light modality feature, obtained by adding the output of the first modality-specific network to the output of the modality-sharing module, and the thermal infrared modality feature, obtained by adding the output of the second modality-specific network to the output of the modality-sharing module, are both used as inputs of the modality-adaptive attention module.
The first and second modality-specific networks extract the feature maps of the visible light image and the thermal infrared image respectively; the modality-sharing module extracts the similar features of the visible light and thermal infrared feature maps; the modality-adaptive attention module performs inter-modality learning and enhancement on the unpaired data; and the modality adaptation module separates the modalities and performs target tracking.
In this embodiment, the modality-sharing module extracts the similar features of the visible light and thermal infrared data, further strengthening the connection between modalities, while the modality-adaptive attention module realizes learning and enhancement between the unpaired visible light and thermal infrared modalities. This unlocks the power of unpaired visible light infrared data, effectively avoids the problem of insufficient training data, and fully mines and exploits bimodal information on a limited dataset to achieve mutual enhancement between unpaired multimodal data.
In an embodiment, the first modality-specific network, the second modality-specific network and the modality-sharing module each comprise three convolutional layers connected in sequence, and the outputs of the first two convolutional layers of the first and second modality-specific networks are used as inputs of the last two convolutional layers of the modality-sharing module;
and the output of the last convolutional layer of the first modality-specific network is added to the output of the last convolutional layer of the modality-sharing module as one input of the modality-adaptive attention module, while the output of the last convolutional layer of the second modality-specific network is added to the output of the last convolutional layer of the modality-sharing module as the other input of the modality-adaptive attention module.
It should be noted that the samples of each modality are input into their own modality-specific network in parallel, while the samples of the two modalities are jointly input into the modality-sharing module. The first modality-specific network, the second modality-specific network and the modality-sharing module all borrow the first three convolutional layers of the VGG network, with kernel sizes 7 × 7, 5 × 5 and 3 × 3 respectively. The outputs of the first two convolutional layers of each modality-specific network are used as inputs of the next convolutional layer of the modality-sharing module, and the output of the modality-sharing module is fused with the output of the first modality-specific network and with the output of the second modality-specific network respectively to obtain the final modality features.
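The wiring just described can be sketched structurally. The layer callables below are stand-ins for the 7 × 7, 5 × 5 and 3 × 3 convolutions, and the exact routing inside the shared branch is an assumption; the text only fixes that the first two specific-layer outputs feed the last two shared layers and that the final feature fuses the two branches by addition:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def extract_modality_feature(x, specific, shared):
    """One modality's feature extraction: `specific` and `shared` are lists
    of three layer callables. The shared branch's last two layers also
    consume the specific branch's first two outputs, and the modality
    feature is the element-wise sum of both branches' final outputs."""
    s1 = relu(specific[0](x))
    s2 = relu(specific[1](s1))
    s3 = relu(specific[2](s2))
    h1 = relu(shared[0](x))        # shared conv1 sees the raw input
    h2 = relu(shared[1](s1 + h1))  # assumed fusion of specific conv1 output
    h3 = relu(shared[2](s2 + h2))  # assumed fusion of specific conv2 output
    return s3 + h3                 # modality feature fed to the attention module
```

Each modality calls this with its own `specific` list while the same `shared` list is reused for both, which is what ties the two branches together.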
It should be noted that the modality-sharing module extracts the similar features of the visible light and thermal infrared data, further strengthens the relation between modalities, helps fuse the data of the two modalities, and thereby realizes feature complementation between modalities. Although unpaired visible light and thermal infrared data cannot be fused directly by a common fusion network, the two kinds of data have great similarity and share distribution characteristics; this embodiment extracts the shared features of the two modalities through the modality-sharing module to balance their fusion ratio.
In an embodiment, the modality-adaptive attention module includes first and second fully-connected layers sharing weights, third and fourth fully-connected layers being modality-specific, and a fifth fully-connected layer being modality-shared;
the visible light modal characteristics and the thermal infrared modal characteristics are respectively used as the input of the first full connection layer and the second full connection layer, and dimension reduction processing is carried out to obtain the visible light modal characteristics after dimension reduction and the thermal infrared modal characteristics after dimension reduction;
processing the dimension-reduced visible light modality features and the dimension-reduced thermal infrared modality features through the third and fourth fully-connected layers respectively, using a QKV mechanism, to form the attention matrices corresponding to the two modalities;
forming a modal shared query set by the attention matrixes corresponding to the two modalities through the fifth fully-connected layer;
and multiplying the modality-shared query set with the attention matrices corresponding to the two modalities respectively to obtain the enhanced feature maps corresponding to the two modalities.
In particular, the modality-adaptive attention module is composed of the first and second fully-connected layers of size 512 that share weights, the modality-specific third and fourth fully-connected layers of size 64, and the modality-shared fifth fully-connected layer of size 64; because the fifth fully-connected layer is modality-shared, each modality passes through it to obtain its features. First, the feature map output by the modality-sharing module is flattened to 512 dimensions and reduced to 64 dimensions by the first and second fully-connected layers. The modality-specific third and fourth fully-connected layers then form the visible light and thermal infrared specific keys K and values V, and vector dot products form the two modality-specific sub-attention matrices. The modality-shared fifth fully-connected layer forms the modality-shared query set Q, which is multiplied with the generated modality-specific attention matrices to obtain the final mutually enhanced features of the two modalities.
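The per-modality attention flow above can be sketched as follows; the weight shapes and the standard softmax(QKᵀ)V combination rule are assumptions standing in for the patent's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def enhance(feat, W_red, W_k, W_v, W_q):
    """One modality's pass: shared 512 -> 64 reduction (W_red),
    modality-specific key/value projections (W_k, W_v), and a
    modality-shared query projection (W_q); the attention matrix then
    re-weights the values to give the enhanced feature."""
    z = feat @ W_red
    k, v, q = z @ W_k, z @ W_v, z @ W_q
    attn = softmax(q @ k.T)  # sub-attention matrix for this modality
    return attn @ v

# Illustrative shapes: 10 candidate features of dimension 512.
rng = np.random.default_rng(0)
W_red = rng.normal(size=(512, 64)) * 0.01
W_q = rng.normal(size=(64, 64)) * 0.1            # modality-shared query
out_rgb = enhance(rng.normal(size=(10, 512)), W_red,
                  rng.normal(size=(64, 64)) * 0.1,  # visible-specific key
                  rng.normal(size=(64, 64)) * 0.1,  # visible-specific value
                  W_q)
```

The thermal infrared branch would call `enhance` with its own `W_k`/`W_v` while reusing `W_red` and `W_q`, mirroring the shared/specific split of the fully-connected layers.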
The modality-adaptive attention module thus enables information interaction: it provides a strong attention mechanism that automatically learns the specific gradient information preferences of the two modalities, is then optimized cooperatively during overall network optimization, and fully exploits the advantages of each modality to achieve stable visible light infrared target tracking. It unlocks the power of unpaired visible light infrared data and fully mines the potential of single-modality data.
In one embodiment, the modality adaptation module comprises two fully-connected layers and a modality connection layer connected in sequence, and the modality connection layer comprises a visible light modality fully-connected layer and a thermal infrared modality fully-connected layer corresponding to the two modalities;
a dropout (random neuron deactivation) function is added after the first two sequentially connected fully-connected layers;
and the visible light modality fully-connected layer and the thermal infrared modality fully-connected layer each comprise a softmax layer, used to compute the score values of the positive and negative candidate samples and to predict the target position.
Specifically, the modal adaptation module comprises four full-connection layers with sizes of 1024,512,2 and 2 respectively, two full-connection layers with sizes of 2 and 2 are combined in parallel to form the modal connection layer, and a dropout (neuron random activation) normalization method is added behind the two full-connection layers with sizes of 1024,512 to reduce the risk of overfitting. Finally, two full-connection layers with the size of 2 and divided according to the mode comprise softmax layers which are respectively used for calculating the positive score f and the negative score f of each candidate sample characteristic in parallel+(xi) And f-(xi) Calculating the target probability of the candidate sample, wherein the detection box with the highest target probability is the predicted target tracking result, and finally predicting the target position according to the following formula:
x* = argmax over xi of f+(xi)

where xi denotes the i-th candidate sample, f+(xi) denotes the probability that the sample is the target, f-(xi) denotes the probability that the sample is background, and x* is the predicted target position.
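Under the assumption that the final size-2 layer emits raw (negative, positive) scores per candidate, the softmax scoring and argmax selection described above can be sketched as follows (the column convention is an illustrative assumption):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def predict_target(scores):
    """Pick the candidate with the highest positive (target) probability.

    scores: (N, 2) raw scores from the final size-2 full-connection layer,
    one row per candidate sample x_i; column 0 is treated as the negative
    (background) score and column 1 as the positive (target) score.
    Returns (best_index, f_plus) so that x* = x_{best_index}.
    """
    probs = softmax(scores)        # each row becomes (f-(x_i), f+(x_i))
    f_plus = probs[:, 1]           # target probability per candidate
    return int(np.argmax(f_plus)), f_plus

# toy run: three candidates, the second clearly looks like the target
idx, f_plus = predict_target(np.array([[2.0, -1.0], [0.1, 3.0], [1.0, 0.5]]))
```

The argmax over f+(xi) directly implements the prediction formula above.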
In some embodiments, the method further comprises:
training the visible light infrared tracker by using a cross entropy loss function generated from the sample label y and the predicted score y^, wherein the cross entropy loss function is as follows:
Loss=-(y*log(y^)+(1-y)*log(1-y^));
optimizing the whole network of the visible light infrared tracker by a stochastic gradient descent method.
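A minimal NumPy sketch of this cross entropy loss and of one plain stochastic gradient descent update, with illustrative values and the loss averaged over a small batch:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # Loss = -(y*log(y^) + (1-y)*log(1-y^)), averaged over the batch
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return float(np.mean(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))))

def sgd_step(w, grad, lr=0.001):
    # one plain stochastic gradient descent update: w <- w - lr * grad
    return w - lr * grad

labels = np.array([1.0, 1.0, 0.0])   # candidate labels (target / background)
preds = np.array([0.9, 0.8, 0.2])    # predicted target probabilities y^
loss = cross_entropy(labels, preds)
```

Perfect predictions drive the loss to zero, and the SGD step simply moves each parameter against its gradient.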
Specifically, the training process of the visible-light infrared tracker is as follows:
(1) A pre-trained VGG model is used to initialize the parameters of the modality-specific module and the modality-sharing module, while the modality-adaptive attention module and the modality adaptation module are randomly initialized. The modality-specific module and the modality-sharing module each consist of three convolutional layers followed by a ReLU (rectified linear unit) nonlinearity; the first two convolutional layers are additionally followed by LRN (local response normalization) and MaxPool (max pooling), and the convolution kernels are of sizes 7×7×96, 5×5×256 and 3×3×512.
(2) The whole network is trained using artificially labeled visible light and thermal infrared data that do not need to be paired, and 256 candidate samples are randomly selected near the ground-truth box using a Gaussian distribution.
(3) In the first stage, thermal infrared data or visible light data is randomly input to train the modality sharing module; network parameters are updated with the stochastic gradient descent algorithm (SGD), and every training sequence updates the parameters of the modality sharing module. In the second stage, thermal infrared data and visible light data are input simultaneously to train the remaining modules of each modality; SGD again updates the network parameters, and the branch of each modality is iteratively updated with its corresponding video sequences. The final model is saved for the online tracking phase.
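The Gaussian candidate sampling of step (2) can be sketched as follows; the sigma values and the log-space scale jitter are illustrative assumptions, since the description only states that 256 candidates are drawn near the ground-truth box with a Gaussian distribution:

```python
import numpy as np

def sample_candidates(gt_box, n=256, pos_sigma=0.1, scale_sigma=0.05, rng=None):
    """Draw candidate boxes around a ground-truth box (x, y, w, h).

    Centres are perturbed with Gaussian noise proportional to the box size,
    and scales with Gaussian noise in log-space; the sigma values here are
    illustrative choices, not parameters stated in the description.
    """
    rng = np.random.default_rng(rng)
    x, y, w, h = gt_box
    dx = rng.normal(0.0, pos_sigma * w, n)      # centre jitter, x direction
    dy = rng.normal(0.0, pos_sigma * h, n)      # centre jitter, y direction
    ds = np.exp(rng.normal(0.0, scale_sigma, n))  # multiplicative scale jitter
    return np.stack([x + dx, y + dy, w * ds, h * ds], axis=1)

cands = sample_candidates((50.0, 40.0, 30.0, 20.0), n=256, rng=0)
```

Candidates whose overlap with the ground truth is high become positive samples, the rest negative samples.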
In the model training process, the modality sharing module and the modality specific module are trained separately, using unpaired multi-modal data. Training with unpaired multi-modal data solves the problem of dependence on large-scale aligned training data in RGBT tracking, makes full use of existing thermal infrared and visible light data sets, and saves a large amount of labor and time cost. The trained tracker reveals the strength of unpaired RGBT data and effectively exerts the advantages of each modality.
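As an illustration of the two-stage schedule in step (3), the following sketch logs which module each unpaired sequence would update; the function and variable names are hypothetical, and real SGD parameter updates are replaced by log entries so the data flow is explicit:

```python
import random

def train_two_stage(rgb_seqs, tir_seqs, seed=0):
    """Log which module each unpaired sequence would update (names hypothetical)."""
    rng = random.Random(seed)
    log = []

    # Stage 1: each step draws ONE modality at random and updates only the
    # modality sharing module with that unpaired sequence.
    pool = [('rgb', s) for s in rgb_seqs] + [('tir', s) for s in tir_seqs]
    rng.shuffle(pool)
    for _modality, seq in pool:
        log.append(('shared', seq))

    # Stage 2: both modalities are trained; each modality-specific branch is
    # iteratively updated only with its own video sequences.
    for seq in rgb_seqs:
        log.append(('specific_rgb', seq))
    for seq in tir_seqs:
        log.append(('specific_tir', seq))
    return log

log = train_two_stage(['rgb_a', 'rgb_b'], ['tir_a'])
```

Note that nothing in the schedule requires an RGB sequence and a TIR sequence to show the same scene, which is exactly what allows unpaired data.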
It should be noted that the existing visible light and thermal infrared data sets are difficult to be perfectly paired, and the existing visible light infrared target tracking algorithm is designed according to the characteristics of data in two paired modes, which may cause the module design to fail to fully exert its performance.
In an embodiment, the process of tracking the target by using the trained visible light infrared tracker includes:
(1) According to the visible light and thermal infrared paired video sequence, the first-frame ground-truth box of the video sequence is extracted, and the network model is initialized with the pre-trained parameters to obtain the new layers. At this time, the learning rates of the first two full-connection layers of the modality adaptation module are set to 0.001, and the learning rate of the last full-connection layer is set to 0.0005. After initialization is completed, 256 candidate samples are generated using Gaussian-distributed sampling.
(2) The candidate samples are sent to the corresponding modality-specific modules and, in parallel, to the modality-sharing module; the result of each modality-specific layer is fed into the next modality-sharing layer, and the result of the modality-sharing module is fused with the result of the modality-specific module per modality to obtain the modal features. The modal features are sent into the modality-adaptive attention module, where the modality-shared full connection forms the modality-shared query set Q, which is multiplied by the generated modality-specific attention matrices to obtain the final, mutually enhanced modal features. The enhanced feature maps of the different modalities from the last convolutional layer are concatenated along the channel dimension to obtain an overall feature map, which is sent into the final modality adaptation module; a softmax function then yields the classification score, and the target position is predicted.
(3) When the target probability of the predicted sample is greater than 0.5, tracking is judged successful. When it is less than 0.5, tracking fails and a short-term update is performed: if the number of frames in the positive and negative sample data sets exceeds 20, the negative sample region of the earliest frame is discarded; 32 positive samples and 96 negative samples are then drawn from the sample sets to fine-tune the parameters of the full-connection layers for 10 iterations with a learning rate of 0.00003.
(4) During online tracking, a long-term update is performed every 8 frames: if the number of frames in the positive and negative sample data sets exceeds 100, the positive sample region of the earliest frame is discarded; 32 positive samples and 96 negative samples are drawn from the sample sets to fine-tune the parameters of the full-connection layers for 10 iterations with a learning rate of 0.00003. If neither the short-term nor the long-term update condition is met, the next frame is tracked directly without updating the model.
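The per-modality feature fusion of step (2) — adding the modality-specific and modality-shared outputs, enhancing each modality with its attention matrix, then concatenating along channels — can be sketched as follows, with small illustrative 2-D matrices standing in for the real feature maps and attention matrices:

```python
import numpy as np

def forward_fusion(rgb_specific, tir_specific, shared_rgb, shared_tir,
                   attn_rgb, attn_tir):
    """Data-flow sketch of step (2): (C, D) matrices stand in for real
    feature maps, (D, D) matrices for the attention matrices."""
    rgb = rgb_specific + shared_rgb        # visible light modal feature
    tir = tir_specific + shared_tir        # thermal infrared modal feature
    rgb_enh = rgb @ attn_rgb               # attention-enhanced visible feature
    tir_enh = tir @ attn_tir               # attention-enhanced thermal feature
    # channel-wise concatenation into the overall feature map
    return np.concatenate([rgb_enh, tir_enh], axis=0)

# toy run with identity attention, so the fusion sums are easy to verify
out = forward_fusion(np.ones((2, 4)), np.zeros((2, 4)),
                     np.full((2, 4), 0.5), np.full((2, 4), 0.5),
                     np.eye(4), np.eye(4))
```

With identity attention, the output simply stacks the two per-modality sums, making the additive fusion visible.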
Both the long-term update and the short-term update adapt to changes in the appearance of the tracking target by updating the model parameters with sample data. The short-term update is applied immediately when tracking fails (an immediate adjustment), while the long-term update adapts to gradual changes in the target during tracking (an adjustment at a fixed frame interval).
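A minimal sketch of the update scheduling in steps (3)-(4), assuming the long-term interval of 8 frames and the 0.5 success threshold stated above, with a tracking failure taking priority over the periodic update:

```python
def update_decision(frame_idx, target_prob, long_interval=8):
    """Return which update fires for a frame (failure takes priority)."""
    if target_prob < 0.5:
        return 'short_term'   # immediate fine-tune after a tracking failure
    if frame_idx % long_interval == 0:
        return 'long_term'    # periodic fine-tune every `long_interval` frames
    return 'none'             # track the next frame without updating
```

For example, frame 16 with probability 0.9 triggers a long-term update, while any frame with probability below 0.5 triggers a short-term one.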
In this embodiment, the method of the present invention and several existing methods are tested on the public data sets GTOT, RGBT234 and LasHeR, and the test results are evaluated against other trackers on SR (success rate) and PR (precision rate); the results are shown in Table 1:
TABLE 1
[Table 1 is provided as an image in the original publication: SR/PR scores of UMT and the comparison trackers on GTOT, RGBT234 and LasHeR.]
Where "_" indicates that no experiment was performed on that data set, and UMT denotes the target tracking method used in the present invention (the other methods are comparison methods). As can be observed from the data in Table 1, the method of the invention achieves a higher success rate and precision on the existing data sets, and the tracking performance is uniformly improved to a certain extent.
In addition, as shown in fig. 4, an embodiment of the present invention further provides a training apparatus for tracking a visible light infrared target based on unpaired data, where the training apparatus includes:
an obtaining module 10, configured to obtain an unpaired visible light image and thermal infrared image, and generate a candidate sample based on the visible light image and the thermal infrared image, where the candidate sample includes a positive sample and a negative sample;
a training module 20, configured to train the visible light infrared tracker by using the candidate sample;
the visible light infrared tracker comprises a modality specific module, a modality sharing module, a modality self-adaptive attention module and a modality adaptation module which are sequentially connected, wherein the modality specific module comprises a first modality specific network and a second modality specific network; the visible light image is used as the input of the first modality specific network and the input of the modality sharing module, the thermal infrared image is used as the input of the second modality specific network and the input of the modality sharing module, the visible light modality feature obtained by adding the output of the first modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module, and the thermal infrared modality feature obtained by adding the output of the second modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module.
In an embodiment, the first modality specific network, the second modality specific network and the modality sharing module each include three convolutional layers connected in sequence, and the outputs of the first two convolutional layers of the first modality specific network and the second modality specific network are used as inputs of the last two convolutional layers of the modality sharing module;

and the output of the last convolutional layer of the first modality specific network is added to the output of the last convolutional layer of the modality sharing module and used as an input of the modality self-adaptive attention module, and the output of the last convolutional layer of the second modality specific network is added to the output of the last convolutional layer of the modality sharing module and used as an input of the modality self-adaptive attention module.
In an embodiment, the modality-adaptive attention module includes first and second fully-connected layers sharing weights, third and fourth fully-connected layers being modality-specific, and a fifth fully-connected layer being modality-shared;
the visible light modal characteristics and the thermal infrared modal characteristics are respectively used as the input of the first full connection layer and the second full connection layer, and dimension reduction processing is carried out to obtain the visible light modal characteristics after dimension reduction and the thermal infrared modal characteristics after dimension reduction;
processing the dimension-reduced visible light modal characteristics and the dimension-reduced thermal infrared modal characteristics respectively through the third full connection layer and the fourth full connection layer using a QKV (query-key-value) mechanism, so as to respectively form the attention matrices corresponding to the two modalities;
forming a modal shared query set by the attention matrixes corresponding to the two modalities through the fifth fully-connected layer;
and multiplying the modality-shared query set by the attention matrices corresponding to the two modalities respectively to obtain the enhanced feature maps corresponding to the two modalities.
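The five-fully-connected-layer attention block described above can be sketched as follows. The weight shapes, the addition used to form the shared query set Q, and the elementwise multiplication used for "multiplying the query set by the attention matrices" are illustrative assumptions, since the text does not pin down the exact operations:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def modality_adaptive_attention(rgb, tir, w_shared, w_rgb, w_tir, w_q):
    """Sketch of the five fully connected layers described above.

    w_shared : weight of the shared 1st/2nd FC layers (dimension reduction)
    w_rgb, w_tir : weights of the modality-specific 3rd/4th FC layers
    w_q : weight of the modality-shared 5th FC layer producing the query set Q
    """
    rgb_red = rgb @ w_shared              # dimension-reduced visible feature
    tir_red = tir @ w_shared              # dimension-reduced thermal feature
    attn_rgb = softmax(rgb_red @ w_rgb)   # attention matrix, visible modality
    attn_tir = softmax(tir_red @ w_tir)   # attention matrix, thermal modality
    q = (attn_rgb + attn_tir) @ w_q       # modality-shared query set Q
    # "multiply Q by the attention matrices respectively" read here as an
    # elementwise gating of the shared query set (an assumption)
    return q * attn_rgb, q * attn_tir

rng = np.random.default_rng(0)
rgb = rng.normal(size=(5, 16))            # 5 candidates, 16-dim visible feats
tir = rng.normal(size=(5, 16))
w_shared = rng.normal(size=(16, 8))
w_rgb, w_tir, w_q = (rng.normal(size=(8, 8)) for _ in range(3))
rgb_enh, tir_enh = modality_adaptive_attention(rgb, tir, w_shared, w_rgb, w_tir, w_q)
```

Because the same w_shared and w_q act on both modalities, the two enhanced outputs are coupled, which is the "mutual enhancement" the module is meant to provide.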
In one embodiment, the modality adaptation module comprises two full-connection layers and a modality connection layer which are sequentially connected, and the modality connection layer comprises a visible light modality full-connection layer and a thermal infrared modality full-connection layer which correspond to the two modalities;
adding a neuron random deactivation (dropout) function after the first two sequentially connected full-connection layers;
the visible light modal full-connection layer and the thermal infrared modal full-connection layer comprise softmax layers, and the softmax layers are used for calculating positive and negative sample score values in the candidate samples and predicting target positions.
In one embodiment, the training module 20 includes:
the training unit is used for training the visible light infrared tracker by using a cross entropy loss function generated from the sample label y and the predicted score y^, wherein the cross entropy loss function is as follows:
Loss=-(y*log(y^)+(1-y)*log(1-y^));
and the optimization unit is used for optimizing the whole network of the visible light infrared tracker by a stochastic gradient descent method.
It should be noted that for other embodiments or implementations of the non-paired-data-based visible light infrared target tracking training apparatus of the present invention, reference can be made to the above method embodiments, which are not repeated here.
It should be noted that the logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A visible light infrared target tracking training method based on unpaired data is characterized by comprising the following steps:
acquiring an unpaired visible light image and thermal infrared image, and generating a candidate sample based on the visible light image and the thermal infrared image, wherein the candidate sample comprises a positive sample and a negative sample;
training a visible light infrared tracker by using the candidate sample;
the visible light infrared tracker comprises a modality specific module, a modality sharing module, a modality self-adaptive attention module and a modality adaptation module which are sequentially connected, wherein the modality specific module comprises a first modality specific network and a second modality specific network; the visible light image is used as the input of the first modality specific network and the input of the modality sharing module, the thermal infrared image is used as the input of the second modality specific network and the input of the modality sharing module, the visible light modality feature obtained by adding the output of the first modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module, and the thermal infrared modality feature obtained by adding the output of the second modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module.
2. The visible light infrared target tracking training method based on unpaired data according to claim 1, wherein the first modality specific network, the second modality specific network and the modality sharing module each include three convolutional layers connected in sequence, and outputs of the first two convolutional layers of the first modality specific network and the second modality specific network are used as inputs of the last two convolutional layers of the modality sharing module;

and the output of the last convolutional layer of the first modality specific network is added to the output of the last convolutional layer of the modality sharing module and used as an input of the modality self-adaptive attention module, and the output of the last convolutional layer of the second modality specific network is added to the output of the last convolutional layer of the modality sharing module and used as an input of the modality self-adaptive attention module.
3. The visible light infrared target tracking training method based on unpaired data according to claim 1, wherein the modality-adaptive attention module includes first and second fully-connected layers sharing weights, third and fourth fully-connected layers being modality-specific, and a fifth fully-connected layer being modality-shared;
the visible light modal characteristics and the thermal infrared modal characteristics are respectively used as the input of the first full connection layer and the second full connection layer, and dimension reduction processing is carried out to obtain the visible light modal characteristics after dimension reduction and the thermal infrared modal characteristics after dimension reduction;
processing the dimension-reduced visible light modal characteristics and the dimension-reduced thermal infrared modal characteristics respectively through the third full connection layer and the fourth full connection layer using a QKV (query-key-value) mechanism, so as to respectively form the attention matrices corresponding to the two modalities;
forming a modal shared query set by the attention matrixes corresponding to the two modalities through the fifth fully-connected layer;
and multiplying the modality-shared query set by the attention matrices corresponding to the two modalities respectively to obtain the enhanced feature maps corresponding to the two modalities.
4. The visible light infrared target tracking training method based on unpaired data according to claim 1, wherein the mode adapting module comprises two full connection layers and a mode connection layer which are connected in sequence, and the mode connection layer comprises a visible light mode full connection layer and a thermal infrared mode full connection layer which correspond to two modes;
adding a neuron random deactivation (dropout) function after the first two sequentially connected full-connection layers;
the visible light modal full connection layer and the thermal infrared modal full connection layer comprise softmax layers, and the softmax layers are used for calculating positive and negative sample score values in the candidate samples and predicting target positions.
5. The visible light infrared target tracking training method based on unpaired data according to claim 4, further comprising:
training the visible light infrared tracker by using a cross entropy loss function generated from the sample label y and the predicted score y^, wherein the cross entropy loss function is as follows:
Loss=-(y*log(y^)+(1-y)*log(1-y^));
optimizing the whole network of the visible light infrared tracker by a stochastic gradient descent method.
6. A non-paired data based visible light infrared target tracking training device, the device comprising:
the device comprises an acquisition module, a comparison module and a processing module, wherein the acquisition module is used for acquiring unpaired visible light images and thermal infrared images and generating candidate samples based on the visible light images and the thermal infrared images, and the candidate samples comprise positive samples and negative samples;
the training module is used for training the visible light infrared tracker by using the candidate sample;
the visible light infrared tracker comprises a modality specific module, a modality sharing module, a modality self-adaptive attention module and a modality adaptation module which are sequentially connected, wherein the modality specific module comprises a first modality specific network and a second modality specific network; the visible light image is used as the input of the first modality specific network and the input of the modality sharing module, the thermal infrared image is used as the input of the second modality specific network and the input of the modality sharing module, the visible light modality feature obtained by adding the output of the first modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module, and the thermal infrared modality feature obtained by adding the output of the second modality specific network and the output of the modality sharing module is used as the input of the modality self-adaptive attention module.
7. The visible light infrared target tracking training device based on unpaired data according to claim 6, wherein the first modality specific network, the second modality specific network and the modality sharing module each include three convolutional layers connected in sequence, and outputs of the first two convolutional layers of the first modality specific network and the second modality specific network are used as inputs of the last two convolutional layers of the modality sharing module;

and the output of the last convolutional layer of the first modality specific network is added to the output of the last convolutional layer of the modality sharing module and used as an input of the modality self-adaptive attention module, and the output of the last convolutional layer of the second modality specific network is added to the output of the last convolutional layer of the modality sharing module and used as an input of the modality self-adaptive attention module.
8. The visible light infrared target tracking training device based on unpaired data according to claim 7, wherein the modality-adaptive attention module includes first and second fully-connected layers sharing weights, third and fourth fully-connected layers being modality-specific, and a fifth fully-connected layer being modality-shared;
the visible light modal characteristics and the thermal infrared modal characteristics are respectively used as the input of the first full connection layer and the second full connection layer, and dimension reduction processing is carried out to obtain the visible light modal characteristics after dimension reduction and the thermal infrared modal characteristics after dimension reduction;
processing the dimension-reduced visible light modal characteristics and the dimension-reduced thermal infrared modal characteristics respectively through the third full connection layer and the fourth full connection layer using a QKV (query-key-value) mechanism, so as to respectively form the attention matrices corresponding to the two modalities;
forming a modal shared query set by the attention matrixes corresponding to the two modalities through the fifth fully-connected layer;
and multiplying the modality-shared query set by the attention matrices corresponding to the two modalities respectively to obtain the enhanced feature maps corresponding to the two modalities.
9. The visible light infrared target tracking training device based on unpaired data according to claim 7, wherein the mode adapting module comprises two full connection layers and a mode connection layer which are connected in sequence, and the mode connection layer comprises a visible light mode full connection layer and a thermal infrared mode full connection layer which correspond to two modes;
adding a neuron random deactivation (dropout) function after the first two sequentially connected full-connection layers;
the visible light modal full-connection layer and the thermal infrared modal full-connection layer comprise softmax layers, and the softmax layers are used for calculating positive and negative sample score values in the candidate samples and predicting target positions.
10. The visible-light infrared target tracking training device based on unpaired data of claim 9, wherein the training module comprises:
the training unit is used for training the visible light infrared tracker by using a cross entropy loss function generated from the sample label y and the predicted score y^, wherein the cross entropy loss function is as follows:
Loss=-(y*log(y^)+(1-y)*log(1-y^));
and the optimization unit is used for optimizing the whole network of the visible light infrared tracker by a stochastic gradient descent method.
CN202210095429.8A 2022-01-26 2022-01-26 Visible light infrared target tracking training method and device based on non-paired data Pending CN114445461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210095429.8A CN114445461A (en) 2022-01-26 2022-01-26 Visible light infrared target tracking training method and device based on non-paired data


Publications (1)

Publication Number Publication Date
CN114445461A true CN114445461A (en) 2022-05-06

Family

ID=81370637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210095429.8A Pending CN114445461A (en) 2022-01-26 2022-01-26 Visible light infrared target tracking training method and device based on non-paired data

Country Status (1)

Country Link
CN (1) CN114445461A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115115919A (en) * 2022-06-24 2022-09-27 国网智能电网研究院有限公司 Power grid equipment thermal defect identification method and device
CN115115919B (en) * 2022-06-24 2023-05-05 国网智能电网研究院有限公司 Power grid equipment thermal defect identification method and device
CN115018884A (en) * 2022-07-19 2022-09-06 安徽大学 Visible light infrared visual tracking method based on multi-strategy fusion tree
CN115018884B (en) * 2022-07-19 2024-03-15 安徽大学 Visible light infrared visual tracking method based on multi-strategy fusion tree
CN116563584A (en) * 2023-07-10 2023-08-08 安徽启新明智科技有限公司 Image matching method, device and equipment
CN116563584B (en) * 2023-07-10 2023-11-14 安徽启新明智科技有限公司 Image matching method, device and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination