CN115294176B - Double-light multi-model long-time target tracking method and system and storage medium - Google Patents

Double-light multi-model long-time target tracking method and system and storage medium

Info

Publication number
CN115294176B
Authority
CN
China
Prior art keywords
module
dual
training
tracking
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211177765.3A
Other languages
Chinese (zh)
Other versions
CN115294176A (en)
Inventor
何震宇
毛凯歌
田超
杨超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202211177765.3A priority Critical patent/CN115294176B/en
Publication of CN115294176A publication Critical patent/CN115294176A/en
Application granted granted Critical
Publication of CN115294176B publication Critical patent/CN115294176B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention provides a double-light (visible-thermal infrared) multi-model long-time target tracking method, system, and storage medium. The beneficial effects of the invention are that the visible light-thermal infrared dual-light target tracker has better robustness and generalization capability and can accurately and rapidly achieve long-time tracking of the target.

Description

Double-light multi-model long-time target tracking method and system and storage medium
Technical Field
The invention relates to the technical field of target tracking, in particular to a double-light multi-model long-time target tracking method, a double-light multi-model long-time target tracking system and a storage medium.
Background
Target tracking is an important research direction in computer vision and is widely applied in unmanned driving, video surveillance, intelligent robotics, human-computer interaction, and other fields. Given the target's position in the first frame, the task of target tracking is to model the target and, combining video context information, predict the target's position in subsequent frames. After years of development, and especially with the application of deep learning in recent years, the performance of target tracking algorithms has improved continuously. However, in harsh environments (such as extreme illumination, occlusion, and interference from similar objects), there is still considerable room for improvement, and how to raise algorithm performance in these scenes remains an open problem.
The basic framework of the classic long-time target tracking algorithm is shown in fig. 1 and mainly comprises a tracking module, a detection module, a learning module, and an integrator. To improve performance in long-time tracking scenes, the algorithm runs a traditional target tracking algorithm and a target detection algorithm separately; the integrator combines their results to obtain the final tracking result, and the learning module continuously updates the tracking module and the detection module online, improving the model's adaptability to challenges such as target deformation, scale change, and occlusion and thereby enhancing the robustness of the algorithm.
In terms of data, current target tracking methods generally train only on visible light (or thermal infrared) images and, after training, test (apply) only on visible light (or thermal infrared) data. Visible light-thermal infrared dual-light (RGB-T) tracking algorithms have also been proposed; they use paired, view-aligned bimodal data in both model training and testing (practical application). As shown in fig. 2, two or more feature extractors are usually used in parallel to extract the features of each modality. The advantage of this approach is that the complementary information provided by the two modalities can be exploited, yielding better tracking in complex scenes.
The prior art has the following defects:
Existing visible light-thermal infrared fusion modules are divided into image-level fusion (the same network parameters extract features from both dual-light images simultaneously) and feature-level fusion (different network parameters extract features from the two dual-light images separately before fusing them). In general, a visible-thermal infrared image pair may contain both a large number of modality-shared features and some modality-specific features. Image-level fusion therefore ignores the existence of modality-specific features, while the independent feature extraction and feature fusion in feature-level fusion weakens the modality-shared features.
When the classic long-time target tracking algorithm processes each frame, the detection module is started to search globally for the target regardless of whether the output of the tracking module is reliable. However, the detection module involves a large amount of computation (e.g., a detector consisting of three cascaded classifiers), and enabling a global search in every frame slows the algorithm down. In addition, some existing methods switch among multiple different target tracking methods depending on the target state.
In the classic long-time target tracking algorithm, after the target is tracked successfully, the learning module updates the target model with the tracking result as a positive sample to improve the algorithm's adaptability to changes in target appearance, scale, and so on. However, when the target is occluded yet tracking still succeeds, the occluded target is also learned as a positive sample. The occluded target then contains many background features that are erroneously learned and added to the model's sample library, degrading performance in subsequent tracking and causing the tracking result to drift or even fail.
In terms of training, existing methods generally initialize the model with parameters pre-trained on large-scale visible light datasets. However, a network pre-trained only on visible light images lacks the ability to extract and fuse visible light-thermal infrared dual-light image features, and therefore cannot be applied well to visible light-thermal infrared dual-light tracking scenes.
In terms of loss functions, existing methods directly predict the coordinates of the target bounding box and train the regression branch by computing the loss against the real annotation, ignoring the negative effect of the weak misalignment present in training images and actual test scenes; and training the classification branch with a binary cross entropy loss cannot directly guide the network to select the best of several candidate prediction boxes as the final prediction result.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a double-light multi-model long-time target tracking method, system, and storage medium that improve the performance of long-time tracking algorithms in environments such as extreme illumination and severe weather.
The invention provides a double-light multi-model long-time target tracking method, which comprises the following steps:
pre-training: mutual-reconstruction pre-training is performed on the dual-light fusion module using unlabeled visible light-thermal infrared images to obtain initialization weight parameters;
training: the dual-light fusion module is weight-initialized with the initialization weight parameters obtained in the pre-training step, and tracking training is performed on a visible light-thermal infrared tracking dataset using a regression loss function based on boundary distribution prediction and an intersection-over-union (IoU) aware classification loss function;
a re-parameterization step: both the pre-training step and the training step use a dual-light fusion module with a residual structure, and the dual-light fusion module with the residual structure is converted into a single-path ("straight-tube") dual-light fusion module through re-parameterization;
the inference step comprises: the following steps are performed for each frame of the input visible-thermal infrared image pair:
step a: extracting features of an input image frame by using a dual-light fusion module;
step b: the current algorithm running state comprises local tracking or global detection, and a local tracking module or a global detection module is run based on the current algorithm running state;
step c: based on the result obtained in step b, the state switching module evaluates whether the current frame is tracked successfully and determines whether to switch the running state;
step d: based on the result obtained in step b, combined with historical target information, the update control module evaluates whether the current frame should update the template in the local tracking module and the classifier in the state switching module.
As a further improvement of the present invention, the dual-light fusion module consists of a dual-stream convolution network whose coupling rate differs across convolution layers and gradually increases with the depth of the dual-stream convolution network. Through the coupled convolution kernels, the dual-light fusion module extracts features common to the visible light and thermal infrared modalities; through the uncoupled convolution kernels, it extracts the private features of the visible light/thermal infrared images respectively. The features extracted by the dual-stream convolution network are input into a channel attention module for fusion.
As a further improvement of the present invention, the state switching module performs the following steps:
step 1: the prediction result of the local tracking module or the global detection module is input into the classifier of the state switching module, and the classifier evaluates the prediction result to obtain a score s_s;
step 2: judge the state of the current algorithm; if the current algorithm is in the local tracking state, execute the first branch step; if the current algorithm is in the global detection state, execute the second branch step;
the first branch step: judge whether the score s_s is less than the threshold γ_s; if so, the local tracking module is considered to have failed to track the target and the algorithm switches to the global detection state; otherwise, the local tracking module is considered to have tracked the target successfully and the local tracking state is maintained;
the second branch step: judge whether the score s_s is greater than the threshold γ_s; if so, the global detection module is considered to have detected the target again and the algorithm switches back to the local tracking state; otherwise, the target is considered not yet detected and the global detection state is maintained.
As a further improvement of the present invention, the state switching module comprises a classifier that is updated according to the prediction results during tracking. The update process is as follows: first, samples are drawn randomly around the prediction result and divided into positive and negative samples according to their IoU with the prediction result; the obtained positive and negative samples are then used to train the classifier and update its parameters, where IoU is the overlap rate between a sampled box and the predicted box.
As a further improvement of the present invention, the update control module is formed by stacking LSTM layers and a fully connected layer, as shown in formula 1, where LSTM denotes a multi-layer long short-term memory network whose time step is gradually shortened and l denotes the number of LSTM layers. Inputting X_t into the LSTM aggregates the target's context information over the time sequence into a feature representing the recent overall state of the target; passing this feature through the fully connected layer FC yields the score s_u, and the local tracking module and the state switching module are updated only when the score s_u is greater than the update threshold γ_u;

$$s_u = \mathrm{FC}\left(\mathrm{LSTM}_l\left(\mathrm{LSTM}_{l-1}\left(\cdots \mathrm{LSTM}_1(X_t)\cdots\right)\right)\right) \qquad \text{(formula 1)}$$

where X_t denotes the history information of the target, composed of the state information of the most recent t_s frames.
As a further improvement of the present invention, in the pre-training step, the visible light image and the thermal infrared image are first uniformly divided into squares of size $P \times P$; then $N$ squares are randomly selected in each image and the image content in the selected squares is occluded with color blocks, yielding $\tilde{I}_v$ and $\tilde{I}_i$, the occluded visible light image and thermal infrared image respectively, both in $\mathbb{R}^{H \times W \times 3}$, where $\mathbb{R}$ denotes the real space and H and W denote the height and width of the image. The randomly occluded images serve as the input of the dual-light fusion module, and the visible light reconstruction module and the thermal infrared reconstruction module recover the visible light image and the infrared image respectively from the features extracted by the dual-light fusion module, obtaining $\hat{I}_v$ and $\hat{I}_i$, the restored visible light image and thermal infrared image. Finally, taking the original images as ground truth, the mean square error loss shown in formula 2 is computed and the model is trained until convergence. During tracking training, the model loads the pre-trained parameters to initialize the dual-light fusion module, and the visible light reconstruction module and the thermal infrared reconstruction module are discarded;

$$L_{MSE} = \| I_v - \hat{I}_v \|_2^2 + \| I_i - \hat{I}_i \|_2^2 \qquad \text{(formula 2)}$$

where $L_{MSE}$ denotes the loss on the image pair, $I_v$ and $I_i$ denote the visible light original image and the thermal infrared original image respectively, $\hat{I}_v$ denotes the restored visible light image, and $\hat{I}_i$ denotes the restored thermal infrared image.
As a further improvement of the present invention, in the training step, the regression loss function based on boundary distribution prediction is shown in formula 5, with the distribution loss defined in formula 4:

$$L_{DF} = -\sum_{e \in E}\left(\left(\lfloor \hat{e} \rfloor + 1 - \hat{e}\right)\log P_e\!\left(\lfloor \hat{e} \rfloor\right) + \left(\hat{e} - \lfloor \hat{e} \rfloor\right)\log P_e\!\left(\lfloor \hat{e} \rfloor + 1\right)\right) \qquad \text{(formula 4)}$$

In formula 4, e denotes a certain boundary of the bounding box, E denotes the set of boundaries of the bounding box, $\hat{e}$ denotes the real label, $\lfloor \hat{e} \rfloor$ denotes the integer part of the real label, $P_e(l)$ denotes the probability that the boundary e of the target falls at $l$ within the range $[0, L]$, and $P_e(l+1)$ denotes the probability that the boundary e of the target falls at $l+1$ within the interval $[0, L]$;

$$L_{reg} = L_{IoU}(b, \hat{b}) + L_{DF} \qquad \text{(formula 5)}$$

In formula 5, $L_{IoU}$ denotes the conventional intersection-over-union loss, and $b$ and $\hat{b}$ denote the predicted target bounding box and the true target bounding box respectively.
As a further improvement of the present invention, in the training step, for a positive sample, its corresponding label is adjusted from 1 to the intersection-over-union between the corresponding candidate box and the ground-truth box; for a negative sample, its corresponding label is kept at 0. The finally obtained IoU-aware classification loss function is shown in formula 6 and is used to train the classification branch of the tracker, where Pos denotes the set of positive samples and $p_i$ denotes the confidence prediction result of sample $i$:

$$L_{cls} = -\sum_{i \in Pos}\left(IoU_i \log p_i + \left(1 - IoU_i\right)\log\left(1 - p_i\right)\right) - \sum_{i \notin Pos}\log\left(1 - p_i\right) \qquad \text{(formula 6)}$$

where $IoU_i$ denotes the overlap rate between the predicted target bounding box and the true target bounding box.
As a further improvement of the invention, in the re-parameterization step, the dual-light fusion module is composed of a $3 \times 3$ convolution layer plus a side $1 \times 1$ convolution branch and an identity mapping branch. The $1 \times 1$ convolution and the identity mapping are both regarded as $3 \times 3$ convolutions whose kernel parameters are 0 everywhere except at the central position; then, according to the additivity of convolution, the $3 \times 3$ convolution parameters of the three branches are added to obtain a single-path model that is identical to the original model's output and contains only one $3 \times 3$ convolution, as shown in formula 7, where Parameters denotes the parameter space of the corresponding $3 \times 3$ convolution kernel;

$$W' = W_{3\times3} + W_{1\times1} + W_{id}, \qquad W_{3\times3},\ W_{1\times1},\ W_{id},\ W' \in \mathrm{Parameters}(3 \times 3) \qquad \text{(formula 7)}$$

where $W_{3\times3}$, $W_{1\times1}$, and $W_{id}$ denote the parameters of the $3 \times 3$ convolution obtained in training, the parameters of the $1 \times 1$ convolution obtained in training (zero-padded to $3 \times 3$), and the parameters of the identity mapping obtained in training (expressed as a $3 \times 3$ kernel) respectively.
The invention also provides a double-light multi-model long-time target tracking system, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the double-light multi-model long-time target tracking method when invoked by the processor.
The invention also provides a computer-readable storage medium storing a computer program configured to implement the steps of the double-light multi-model long-time target tracking method when invoked by a processor.
The invention has the beneficial effects that the visible light-thermal infrared dual-light target tracker has better robustness and generalization capability and can accurately and rapidly achieve long-time tracking of the target.
Drawings
FIG. 1 is a classic long-time target tracking algorithm framework diagram;
FIG. 2 is a general flow diagram of feature-fusion RGB-T tracking;
FIG. 3 is an overall frame diagram of the present invention;
fig. 4 is a schematic diagram of a dual light fusion module;
FIG. 5 is a flow chart of the operation of the state switching module;
FIG. 6 is a pre-training diagram of visible light-thermal infrared image mutual reconstruction;
FIG. 7 is a diagram of reparameterization inference acceleration.
Detailed Description
The invention discloses a double-light multi-model long-time target tracking method, which comprises the following steps:
pre-training: mutual-reconstruction pre-training is performed on the dual-light fusion module using a large number of unlabeled visible light-thermal infrared images to obtain good initialization weight parameters; that is, before formal training, mutual reconstruction is used as a proxy task for pre-training on a large number of unlabeled visible light-thermal infrared image pairs;
training: the dual-light fusion module is weight-initialized with the initialization weight parameters obtained in the pre-training step, and tracking training is performed on a visible light-thermal infrared tracking dataset using a regression loss function based on boundary distribution prediction and an IoU-aware classification loss function; in formal training, the regression branch is trained with the regression loss function based on boundary distribution prediction, improving the prediction accuracy of the algorithm in weakly misaligned scenes, and the classification branch is trained with the IoU-aware classification loss function, encouraging the algorithm to select more accurate candidate boxes as final results;
a re-parameterization step: both the pre-training step and the training step use a dual-light fusion module with a residual structure, which is converted into a single-path ("straight-tube") dual-light fusion module through re-parameterization, improving the practical inference speed of the model;
the inference step comprises:
the overall framework of the algorithm proposed by the present invention is shown in fig. 3. After the model initialization is completed, the following steps are carried out on the visible light-thermal infrared pair input in each frame:
step a: extracting features of an input image frame by using a dual-light fusion module;
step b: the current algorithm running state comprises local tracking or global detection, and a local tracking module or a global detection module is run based on the current algorithm running state;
step c: based on the result obtained in step b, the state switching module is run to evaluate whether the current frame is tracked successfully and to determine whether to switch the running state; switching the algorithm state with a state switching module avoids spending resources on a global search in every frame;
step d: based on the result obtained in step b, combined with historical target information, the update control module evaluates whether the current frame should update the template in the local tracking module and the classifier in the state switching module; using an update control module to judge whether the current frame is suitable for updating the target state avoids the adverse effect that updating while the target is occluded would have on subsequent tracking. A sketch of this per-frame loop is given after this list.
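For illustration only, the per-frame control flow of steps a-d can be sketched in Python as below. All module objects, method names (track, detect, classify, should_update, update_template, update_classifier), and the attribute gamma_s are placeholder assumptions of this sketch; the patent fixes the behavior, not this interface.

```python
LOCAL, GLOBAL = "local_tracking", "global_detection"

def track_sequence(frames, fusion, local_tracker, global_detector,
                   state_switch, update_ctrl):
    """Per-frame loop of steps a-d; every module object is a placeholder
    standing in for the patent's corresponding module."""
    state, results = LOCAL, []
    for rgb, tir in frames:                          # visible/thermal image pair
        feat = fusion(rgb, tir)                      # step a: fused dual-light features
        box = (local_tracker.track(feat) if state == LOCAL
               else global_detector.detect(feat))    # step b: run the active module
        score = state_switch.classify(box, feat)     # step c: score the prediction
        if state == LOCAL and score < state_switch.gamma_s:
            state = GLOBAL                           # tracking failed -> global search
        elif state == GLOBAL and score > state_switch.gamma_s:
            state = LOCAL                            # target re-found -> local tracking
        if update_ctrl.should_update(box, score):    # step d: history-gated update
            local_tracker.update_template(box, feat)
            state_switch.update_classifier(box, feat)
        results.append(box)
    return results
```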
The invention uses dual-light fusion modules with different structures but identical outputs in the training step and the inference step, using the re-parameterization technique to achieve a better balance between speed and accuracy.
The invention is explained in detail as follows:
One, the dual-light fusion module:
In order to further improve the long-time target tracking effect in extreme illumination and in severe weather scenes such as rain and snow, and considering the physically complementary characteristics of visible light and thermal infrared, the invention takes a visible light-thermal infrared image pair as input and introduces a dual-light fusion module that fuses the bimodal visible/thermal infrared features for the other modules of the algorithm.
The dual-light fusion module is shown in fig. 4. It consists of a dual-stream convolution network whose coupling rate (indicated in fig. 4 by the size of the shared part of each feature layer) differs across convolution layers and gradually increases with network depth. Through the coupled convolution kernels, the model extracts features common to the visible light and thermal infrared modalities; through the uncoupled convolution kernels, it extracts the private features of the visible light/thermal infrared images respectively. The features extracted by the dual-stream convolution network are input into a channel attention module for fusion, as sketched below.
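As a minimal PyTorch sketch of this idea, the code below implements one partially coupled convolution layer and a squeeze-and-excitation-style channel attention fusion head. The class names, the per-layer coupling_rate values, and the exact attention design are assumptions of this illustration; the patent specifies only shared plus private kernels with a depth-increasing coupling rate and a channel attention fusion module.

```python
import torch
import torch.nn as nn

class PartiallyCoupledConv(nn.Module):
    """One dual-stream layer: a fraction `coupling_rate` of the output channels
    comes from kernels shared by both modalities (modality-common features);
    the rest comes from per-modality kernels (modality-private features)."""
    def __init__(self, in_ch, out_ch, coupling_rate):
        super().__init__()
        shared = int(out_ch * coupling_rate)
        self.shared = nn.Conv2d(in_ch, shared, 3, padding=1)
        self.priv_rgb = nn.Conv2d(in_ch, out_ch - shared, 3, padding=1)
        self.priv_tir = nn.Conv2d(in_ch, out_ch - shared, 3, padding=1)

    def forward(self, x_rgb, x_tir):
        f_rgb = torch.cat([self.shared(x_rgb), self.priv_rgb(x_rgb)], dim=1)
        f_tir = torch.cat([self.shared(x_tir), self.priv_tir(x_tir)], dim=1)
        return f_rgb, f_tir

class ChannelAttentionFusion(nn.Module):
    """Concatenate the two streams on the channel axis and re-weight every
    channel with a learned attention vector (squeeze-and-excitation style)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2 * channels, 2 * channels // reduction), nn.ReLU(),
            nn.Linear(2 * channels // reduction, 2 * channels), nn.Sigmoid())

    def forward(self, f_rgb, f_tir):
        f = torch.cat([f_rgb, f_tir], dim=1)
        w = self.fc(f).unsqueeze(-1).unsqueeze(-1)   # one weight per channel
        return f * w                                 # fused, re-weighted feature

# A deeper network would stack such layers with a growing coupling rate,
# e.g. coupling_rate = 0.25, 0.5, 0.75 from shallow to deep layers.
```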
Two, the state switching module:
In order to avoid searching globally with the global detection module while the local tracking module is tracking the target successfully, thereby reducing the computation of the long-time tracking method and increasing its running speed, the invention introduces a state switching module.
The state switching module comprises a classifier and a preset threshold γ_s.
As shown in fig. 5, the state switching module performs the following steps:
step 1: the prediction result of the local tracking module or the global detection module is input into the classifier of the state switching module, and the classifier evaluates the prediction result to obtain a score s_s;
step 2: judge the state of the current algorithm; if the current algorithm is in the local tracking state, execute the first branch step; if the current algorithm is in the global detection state, execute the second branch step;
the first branch step: judge whether the score s_s is less than the threshold γ_s; if so, the local tracking module is considered to have failed to track the target and the algorithm switches to the global detection state; otherwise, the local tracking module is considered to have tracked the target successfully and the local tracking state is maintained;
the second branch step: judge whether the score s_s is greater than the threshold γ_s; if so, the global detection module is considered to have detected the target again and the algorithm switches back to the local tracking state; otherwise, the target is considered not yet detected and the global detection state is maintained.
In long-time target tracking, the appearance, shape, and other characteristics of the target may change considerably, so the classifier is continuously updated according to the prediction results during tracking. The update process is as follows: first, samples are drawn randomly around the prediction result and divided into positive and negative samples according to their IoU with the prediction result; these samples are then used to train the classifier and update its parameters, as sketched below.
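A small, self-contained sketch of this sampling scheme follows. The sample count, jitter magnitude, and the IoU thresholds (0.7/0.3) are assumed values for illustration; the patent specifies the procedure but not these numbers.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def sample_training_boxes(pred_box, n=64, jitter=0.3, pos_iou=0.7, neg_iou=0.3):
    """Randomly jitter the predicted box and split the samples into positives
    and negatives by their IoU with the prediction; the two sets are then
    used for one update step of the state-switching classifier."""
    x1, y1, x2, y2 = pred_box
    w, h = x2 - x1, y2 - y1
    rng = np.random.default_rng()
    pos, neg = [], []
    for _ in range(n):
        dx, dy, dw, dh = rng.normal(0.0, jitter, 4)
        box = (x1 + dx * w, y1 + dy * h, x2 + (dx + dw) * w, y2 + (dy + dh) * h)
        score = iou(box, pred_box)
        if score >= pos_iou:
            pos.append(box)
        elif score <= neg_iou:
            neg.append(box)
    return pos, neg
```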
Three, the update control module:
In order to prevent the model from updating the target model at unsuitable times (for example, when the target is partially occluded), which would cause tracking drift or even failure, the invention introduces an update control module based on historical information that decides, from the target's historical state, whether to update the local tracking module and the state switching module in the current frame.
After the tracking result of the t-th frame is obtained, the algorithm stores the following information about the target: 1. the minimal target bounding box predicted by the local tracking module; 2. the response map of the target position predicted by the local tracking module; 3. the features of the target in the current frame, extracted according to the bounding box in 1; 4. the features of the target in the initial frame, extracted according to the initial information of the local tracking module. Item 1 contains the target's latest motion and scale-change information, item 2 reflects the reliability of the local tracking module, and items 3 and 4 capture the target's appearance change. This information is mapped into vectors and concatenated to form the target's state information x_t in the current frame. The state information of the most recent t_s frames constitutes the target's history information X_t.
The update control module is formed by stacking multiple long short-term memory (LSTM) layers, whose time step shortens layer by layer, and a fully connected layer, as shown in formula 1, where l denotes the number of LSTM layers. Inputting X_t into the LSTM network aggregates the target's context information over the time sequence into a feature representing the recent overall state of the target; passing this feature through the fully connected layer FC yields the score s_u, and the local tracking module and the state switching module are updated only when s_u is greater than the update threshold γ_u:

$$s_u = \mathrm{FC}\left(\mathrm{LSTM}_l\left(\mathrm{LSTM}_{l-1}\left(\cdots \mathrm{LSTM}_1(X_t)\cdots\right)\right)\right) \qquad \text{(formula 1)}$$
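A minimal PyTorch sketch of formula 1 follows. The state dimension, hidden size, number of layers, and the specific way the time step is shortened between layers (here: keeping only the second half of the sequence) are assumptions of this illustration.

```python
import torch
import torch.nn as nn

class UpdateControl(nn.Module):
    """Stacked LSTMs aggregate the state vectors of the last t_s frames; a
    fully connected layer maps the final hidden state to the score s_u."""
    def __init__(self, state_dim=128, hidden=64, layers=3):
        super().__init__()
        self.lstms = nn.ModuleList(
            nn.LSTM(state_dim if i == 0 else hidden, hidden, batch_first=True)
            for i in range(layers))
        self.fc = nn.Linear(hidden, 1)

    def forward(self, X_t):                      # X_t: (batch, t_s, state_dim)
        h = X_t
        for lstm in self.lstms:
            h, _ = lstm(h)
            h = h[:, h.size(1) // 2:, :]         # shorten the time step per layer
        return torch.sigmoid(self.fc(h[:, -1]))  # s_u; update only if s_u > gamma_u
```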
Four, visible light-thermal infrared image mutual reconstruction pre-training:
parameter initialization during deep learning model training has a great influence on the performance of a training result. In order to improve the generalization capability and robustness of the model, the invention introduces visible light-thermal infrared image mutual reconstruction pre-training, utilizes a large amount of unmarked visible light-thermal infrared images to train the feature extraction capability of the double-light fusion module, and then uses the feature extraction capability as an initial parameter to carry out target tracking training.
The process of visible light-thermal infrared image mutual reconstruction pre-training is shown in fig. 6, where each reconstruction module is formed by stacking several layers composed of convolution and up-sampling operations. Each training sample in pre-training consists of a visible light-thermal infrared image pair $(I_v, I_i)$. First, the two images are uniformly divided into squares of size $P \times P$; then $N$ squares are randomly selected in each image and the image content of the selected squares is occluded with color blocks, yielding the occluded pair $(\tilde{I}_v, \tilde{I}_i)$. The randomly occluded images serve as the input of the dual-light fusion module, and the reconstruction modules recover the visible light image and the infrared image respectively from the features extracted by the dual-light fusion module, obtaining $(\hat{I}_v, \hat{I}_i)$. Finally, taking the original images as ground truth, the mean square error loss shown in formula 2 is computed and the model is trained until convergence. During tracking training, the model loads the pre-trained parameters to initialize the dual-light fusion module, and the reconstruction modules are discarded.

$$L_{MSE} = \| I_v - \hat{I}_v \|_2^2 + \| I_i - \hat{I}_i \|_2^2 \qquad \text{(formula 2)}$$
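One pre-training step can be sketched as below; fusion, rec_rgb, and rec_tir stand for the dual-light fusion module and the two reconstruction modules, while the grid size, the number of masked squares, and the gray fill value are assumed numbers, not values from the patent.

```python
import torch
import torch.nn.functional as F

def pretrain_step(fusion, rec_rgb, rec_tir, I_rgb, I_tir, grid=16, n_mask=40):
    """One mutual-reconstruction step implementing formula 2."""
    def mask(img):
        out = img.clone()
        _, _, H, W = img.shape
        ph, pw = H // grid, W // grid
        for _ in range(n_mask):                          # occlude random squares
            i = torch.randint(grid, (1,)).item()
            j = torch.randint(grid, (1,)).item()
            out[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = 0.5
        return out
    feat = fusion(mask(I_rgb), mask(I_tir))              # features of the masked pair
    loss = (F.mse_loss(rec_rgb(feat), I_rgb)             # recover the visible image
            + F.mse_loss(rec_tir(feat), I_tir))          # recover the thermal image
    return loss
```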
Five, the regression loss function based on boundary distribution prediction:
The weak misalignment of the visible light-thermal infrared image pair makes the boundaries of the target bounding box uncertain. To address this, and unlike other methods that directly regress the distance from each boundary of the target box to the regression center, the invention predicts a probability distribution over discrete positions for each boundary and takes its expectation as the predicted boundary position.
Specifically, the distribution interval of a boundary is specified as $[0, L]$. For each boundary e of the target, the regression branch of the tracker predicts $\{P_e(l)\}_{l=0}^{L}$, where $P_e(l)$ denotes the probability that the boundary e lies within this interval and falls at position $l$. As shown in formula 3, the position of the boundary predicted by the model, $\bar{e}$, is obtained by computing the expected value of the corresponding distribution for the boundary e of the target:

$$\bar{e} = \sum_{l=0}^{L} l \cdot P_e(l) \qquad \text{(formula 3)}$$
Typically, the actual location of an object should be in the vicinity of its real label even if the label carries uncertainty. Therefore, when the real label $\hat{e}$ of boundary e lies in the interval $[\lfloor \hat{e} \rfloor, \lfloor \hat{e} \rfloor + 1]$, the predicted distribution probabilities $P_e(\lfloor \hat{e} \rfloor)$ and $P_e(\lfloor \hat{e} \rfloor + 1)$ should also be large. To encourage the model to predict larger probability values at positions near these true values, the loss function shown in formula 4 is introduced:

$$L_{DF} = -\sum_{e \in E}\left(\left(\lfloor \hat{e} \rfloor + 1 - \hat{e}\right)\log P_e\!\left(\lfloor \hat{e} \rfloor\right) + \left(\hat{e} - \lfloor \hat{e} \rfloor\right)\log P_e\!\left(\lfloor \hat{e} \rfloor + 1\right)\right) \qquad \text{(formula 4)}$$
The overall loss function of the tracker's regression branch is shown in formula 5, where $L_{IoU}$ denotes the conventional intersection-over-union loss and $b$ and $\hat{b}$ denote the predicted target bounding box and the true target bounding box respectively:

$$L_{reg} = L_{IoU}(b, \hat{b}) + L_{DF} \qquad \text{(formula 5)}$$
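Formulas 3 and 4 translate directly into a few lines of tensor code. In the PyTorch sketch below, the tensor layout (N candidates, 4 boundaries, L+1 discrete positions) is an assumption of this illustration:

```python
import torch

def boundary_distribution_head(logits, target):
    """logits: (N, 4, L+1) scores over positions 0..L for the four boundaries;
    target: (N, 4) continuous ground-truth boundary positions in [0, L].
    Returns the expected boundary positions (formula 3) and the distribution
    loss that concentrates probability mass around each label (formula 4)."""
    prob = logits.softmax(dim=-1)                               # P_e(l)
    L = logits.size(-1) - 1
    pos = torch.arange(L + 1, device=prob.device, dtype=prob.dtype)
    expected = (prob * pos).sum(-1)                             # formula 3
    lo = target.floor().clamp(0, L - 1).long()                  # floor(e_hat)
    hi = lo + 1
    w_lo = hi.to(prob.dtype) - target                           # weight for lo
    w_hi = target - lo.to(prob.dtype)                           # weight for hi
    logp = (prob + 1e-9).log()
    loss = -(w_lo * logp.gather(-1, lo.unsqueeze(-1)).squeeze(-1)
             + w_hi * logp.gather(-1, hi.unsqueeze(-1)).squeeze(-1))
    return expected, loss.sum(-1).mean()                        # sum over E, mean over N
```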
Six, the IoU-aware classification loss function:
the tracker generally needs to select the final tracking result from the candidate box according to the confidence score of the classification branch prediction. In order to promote the classification branches to select more accurate target bounding boxes, the invention introduces a cross-over ratio perception classification loss function to train the classification branches of the tracker.
Since the intersection-over-union between the predicted bounding box and the ground-truth bounding box directly reflects the accuracy of the prediction result, letting the classification branch learn to predict the IoU between each candidate box and the ground-truth box as its confidence score helps to pick the most accurate prediction result. Based on this assumption, the invention improves the conventional cross entropy loss function: for a positive sample, its corresponding label is adjusted from 1 to the IoU between the corresponding candidate box and the ground-truth box; for a negative sample, its corresponding label is kept at 0. The resulting loss function is shown in formula 6, where Pos denotes the set of positive samples and $p_i$ denotes the confidence prediction result of sample $i$:

$$L_{cls} = -\sum_{i \in Pos}\left(IoU_i \log p_i + \left(1 - IoU_i\right)\log\left(1 - p_i\right)\right) - \sum_{i \notin Pos}\log\left(1 - p_i\right) \qquad \text{(formula 6)}$$
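Because formula 6 is an ordinary binary cross entropy with soft targets, it reduces to a one-liner; the tensor shapes below are assumptions of this illustration:

```python
import torch
import torch.nn.functional as F

def iou_aware_cls_loss(pred_logits, ious, pos_mask):
    """pred_logits: (N,) raw classification scores; ious: (N,) IoU of each
    candidate box with the ground-truth box; pos_mask: (N,) bool indicator
    of positive samples. Positive targets become their IoU, negatives 0."""
    targets = torch.where(pos_mask, ious, torch.zeros_like(ious))
    return F.binary_cross_entropy_with_logits(pred_logits, targets)
```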
Seven, re-parameterization for inference acceleration:
In order to simultaneously exploit the high performance of multi-branch models during training and the fast inference of single-path models, the invention re-parameterizes the feature extraction network so that the model achieves a better balance between inference speed and tracking performance.
The basic constituent unit of the dual-light fusion module in the pre-training and training phases is shown in fig. 7: a $3 \times 3$ convolution layer plus a side $1 \times 1$ convolution branch and an identity mapping branch. Compared with a single $3 \times 3$ convolution, such a structure builds residual connections that produce an implicit ensemble of a large number of sub-models, thereby improving model performance. After training is completed, as shown in fig. 7, the $1 \times 1$ convolution and the identity mapping can both be regarded as $3 \times 3$ convolutions whose kernel parameters are 0 everywhere except at the central position. Then, according to the additivity of convolution, the $3 \times 3$ convolution parameters of the three branches are added to obtain a single-path model that is identical to the original model's output and contains only one $3 \times 3$ convolution, which improves the inference speed of the model. The procedure is shown in formula 7, where Parameters denotes the parameter space of the corresponding $3 \times 3$ convolution kernel:

$$W' = W_{3\times3} + W_{1\times1} + W_{id}, \qquad W_{3\times3},\ W_{1\times1},\ W_{id},\ W' \in \mathrm{Parameters}(3 \times 3) \qquad \text{(formula 7)}$$
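The branch fusion of formula 7 can be verified with the short PyTorch sketch below. Bias-free, batch-norm-free branches are assumed for brevity; in practice each branch's batch-norm statistics would first be folded into its kernel, as in RepVGG-style re-parameterization:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def reparameterize(conv3, conv1, channels):
    """Fold a 3x3 conv, a side 1x1 conv, and an identity branch into a single
    3x3 conv whose output equals conv3(x) + conv1(x) + x."""
    fused = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
    w = conv3.weight.clone()                     # W_3x3
    w[:, :, 1:2, 1:2] += conv1.weight            # W_1x1 zero-padded to the center
    for c in range(channels):                    # identity as a 3x3 kernel:
        w[c, c, 1, 1] += 1.0                     # a single 1 at the center
    fused.weight.copy_(w)                        # W' = W_3x3 + W_1x1 + W_id
    return fused

# Quick check of output equivalence:
c3 = nn.Conv2d(8, 8, 3, padding=1, bias=False)
c1 = nn.Conv2d(8, 8, 1, bias=False)
x = torch.randn(1, 8, 16, 16)
fused = reparameterize(c3, c1, 8)
assert torch.allclose(c3(x) + c1(x) + x, fused(x), atol=1e-4)
```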
Potential application scenarios of the invention include unmanned driving, assisted driving, intelligent security, military applications, and other fields. The application mode is to deploy the algorithm and model on computing equipment and track a designated target in the input infrared + visible light dual-channel video stream.
The invention has the beneficial effects that, through the above scheme, the visible light-thermal infrared dual-light target tracker has better robustness and generalization capability and can accurately and rapidly achieve long-time tracking of the target. Specifically:
1. The dual-light fusion module with partially coupled convolution layers better extracts and fuses the features of the visible light-thermal infrared image pair, improving the robustness of the long-time tracking algorithm under challenges such as extreme illumination and severe weather.
2. Introducing the state switching module to switch dynamically between local tracking and global detection avoids the extra computation of running global detection in every frame and increases the running speed of the algorithm.
3. Introducing the update control module to maintain a more accurate target model alleviates the tracking drift and even tracking failure caused by unreliable online updates.
4. Before formal training, large-scale pre-training on unlabeled visible light-thermal infrared image pairs with the mutual-reconstruction proxy task improves the performance and generalization capability of the model.
5. Training with the regression loss function based on boundary distribution prediction improves tracking accuracy in weakly misaligned scenes.
6. Training with the IoU-aware classification loss function selects better candidate boxes as the final prediction results.
7. After training, the multi-branch large model used in training is converted into an equivalent single-path small model through re-parameterization, improving the inference speed of the model.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A double-light multi-model long-time target tracking method is characterized by comprising the following steps:
pre-training: mutual-reconstruction pre-training is performed on the dual-light fusion module using unlabeled visible light-thermal infrared images to obtain initialization weight parameters;
training: the dual-light fusion module is weight-initialized with the initialization weight parameters obtained in the pre-training step, and tracking training is performed on a visible light-thermal infrared tracking dataset using a regression loss function based on boundary distribution prediction and an intersection-over-union (IoU) aware classification loss function;
a re-parameterization step: both the pre-training step and the training step use a dual-light fusion module with a residual structure, and the dual-light fusion module with the residual structure is converted into a single-path ("straight-tube") dual-light fusion module through re-parameterization;
the inference step comprises: the following steps are performed for each frame of the input visible-thermal infrared image pair:
step a: extracting features of an input image frame by using a dual-light fusion module;
step b: the current algorithm running state comprises local tracking or global detection, and a local tracking module or a global detection module is run based on the current algorithm running state;
step c: based on the result obtained in step b, a state switching module is run to evaluate whether the current frame is tracked successfully and to determine whether to switch the running state;
step d: based on the result obtained in step b, combined with historical target information, an update control module evaluates whether the current frame should update the template in the local tracking module and the classifier in the state switching module;
in the training step, the regression loss function based on boundary distribution prediction is shown in formula 5, with the distribution loss defined in formula 4:

$$L_{DF} = -\sum_{e \in E}\left(\left(\lfloor \hat{e} \rfloor + 1 - \hat{e}\right)\log P_e\!\left(\lfloor \hat{e} \rfloor\right) + \left(\hat{e} - \lfloor \hat{e} \rfloor\right)\log P_e\!\left(\lfloor \hat{e} \rfloor + 1\right)\right) \qquad \text{(formula 4)}$$

in formula 4, e denotes a certain boundary of the bounding box, E denotes the set of boundaries of the bounding box, $\hat{e}$ denotes the real label, $\lfloor \hat{e} \rfloor$ denotes the integer part of the real label, $P_e(l)$ denotes the probability that the boundary e of the target falls at $l$ within the range $[0, L]$, and $P_e(l+1)$ denotes the probability that the boundary e of the target falls at $l+1$ within the interval $[0, L]$;

$$L_{reg} = L_{IoU}(b, \hat{b}) + L_{DF} \qquad \text{(formula 5)}$$

in formula 5, $L_{IoU}$ denotes the conventional intersection-over-union loss, and $b$ and $\hat{b}$ denote the predicted target bounding box and the true target bounding box respectively.
2. The double-light multi-model long-time target tracking method according to claim 1, wherein the dual-light fusion module consists of a dual-stream convolution network whose coupling rate differs across convolution layers and becomes larger as the depth of the dual-stream convolution network increases; through the coupled convolution kernels, the dual-light fusion module extracts features common to the visible light and thermal infrared modalities, and through the uncoupled convolution kernels, it extracts the private features of the visible light/thermal infrared images respectively; the features extracted by the dual-stream convolution network are input into a channel attention module for fusion.
3. The double-light multi-model long-time target tracking method according to claim 1, wherein the state switching module performs the following steps:
step 1: the prediction result of the local tracking module or the global detection module is input into the classifier of the state switching module, and the classifier evaluates the prediction result to obtain a score s_s;
step 2: judge the state of the current algorithm; if the current algorithm is in the local tracking state, execute the first branch step; if the current algorithm is in the global detection state, execute the second branch step;
the first branch step: judge whether the score s_s is less than the threshold γ_s; if so, the local tracking module is considered to have failed to track the target and the algorithm switches to the global detection state; otherwise, the local tracking module is considered to have tracked the target successfully and the local tracking state is maintained;
the second branch step: judge whether the score s_s is greater than the threshold γ_s; if so, the global detection module is considered to have detected the target again and the algorithm switches back to the local tracking state; otherwise, the target is considered not yet detected and the global detection state is maintained.
4. The double-light multi-model long-time target tracking method according to claim 3, wherein the state switching module comprises a classifier that is updated according to the prediction results during tracking; the update process is as follows: first, samples are drawn randomly around the prediction result and divided into positive and negative samples according to their IoU with the prediction result; the obtained positive and negative samples are then used to train the classifier and update its parameters, where IoU is the overlap rate between a sampled box and the predicted box.
5. The double-light multi-model long-time target tracking method according to claim 1, wherein the update control module is formed by stacking LSTM layers and a fully connected layer, as shown in formula 1, where LSTM denotes a multi-layer long short-term memory network whose time step is gradually shortened and n denotes the number of LSTM layers; inputting X_t into the LSTM aggregates the target's context information over the time sequence into a feature representing the recent overall state of the target; passing this feature through the fully connected layer FC yields the score s_u, and the local tracking module and the state switching module are updated only when the score s_u is greater than the update threshold γ_u;

$$s_u = \mathrm{FC}\left(\mathrm{LSTM}_n\left(\mathrm{LSTM}_{n-1}\left(\cdots \mathrm{LSTM}_1(X_t)\cdots\right)\right)\right) \qquad \text{(formula 1)}$$

where X_t denotes the history information of the target, composed of the state information of the most recent t_s frames.
6. The double-light multi-model long-time target tracking method according to claim 1, wherein in the pre-training step, the visible light image and the thermal infrared image are first uniformly divided into squares of size $P \times P$; then $N$ squares are randomly selected in each image and the image content in the selected squares is occluded with color blocks, yielding $\tilde{I}_v$ and $\tilde{I}_i$, the occluded visible light image and thermal infrared image respectively, both in $\mathbb{R}^{H \times W \times 3}$, where $\mathbb{R}$ denotes the real number space and H and W denote the height and width of the image; the randomly occluded images serve as the input of the dual-light fusion module, and the visible light reconstruction module and the thermal infrared reconstruction module recover the visible light image and the infrared image respectively from the features extracted by the dual-light fusion module, obtaining $\hat{I}_v$ and $\hat{I}_i$, the restored visible light image and thermal infrared image respectively; finally, taking the original images as ground truth, the mean square error loss shown in formula 2 is computed and the model is trained until convergence; during tracking training, the model loads the pre-trained parameters to initialize the dual-light fusion module, and the visible light reconstruction module and the thermal infrared reconstruction module are discarded;

$$L_{MSE} = \| I_v - \hat{I}_v \|_2^2 + \| I_i - \hat{I}_i \|_2^2 \qquad \text{(formula 2)}$$

where $L_{MSE}$ denotes the loss on the image pair, $I_v$ and $I_i$ denote the visible light original image and the thermal infrared original image respectively, $\hat{I}_v$ denotes the restored visible light image, and $\hat{I}_i$ denotes the restored thermal infrared image.
7. The double-light multi-model long-time target tracking method according to claim 1, wherein in the training step, for a positive sample, its corresponding label is adjusted from 1 to the intersection-over-union between the corresponding candidate box and the ground-truth box; for a negative sample, its corresponding label is kept at 0; the finally obtained IoU-aware classification loss function is shown in formula 6 and is used to train the classification branch of the tracker, where Pos denotes the set of positive samples and $p_i$ denotes the confidence prediction result of sample $i$;

$$L_{cls} = -\sum_{i \in Pos}\left(IoU_i \log p_i + \left(1 - IoU_i\right)\log\left(1 - p_i\right)\right) - \sum_{i \notin Pos}\log\left(1 - p_i\right) \qquad \text{(formula 6)}$$

where $IoU_i$ denotes the overlap rate between the predicted target bounding box and the true target bounding box.
8. The double-light multi-model long-time target tracking method according to claim 1, wherein in the re-parameterization step, the dual-light fusion module is composed of a $3 \times 3$ convolution layer plus a side $1 \times 1$ convolution branch and an identity mapping branch; the $1 \times 1$ convolution and the identity mapping are both regarded as $3 \times 3$ convolutions whose kernel parameters are 0 everywhere except at the central position; then, according to the additivity of convolution, the $3 \times 3$ convolution parameters of the three branches are added to obtain a single-path model that is identical to the original model's output and contains only one $3 \times 3$ convolution, as shown in formula 7, where Parameters denotes the parameter space of the corresponding $3 \times 3$ convolution kernel;

$$W' = W_{3\times3} + W_{1\times1} + W_{id}, \qquad W_{3\times3},\ W_{1\times1},\ W_{id},\ W' \in \mathrm{Parameters}(3 \times 3) \qquad \text{(formula 7)}$$

where $W_{3\times3}$, $W_{1\times1}$, and $W_{id}$ denote the parameters of the $3 \times 3$ convolution obtained in training, the parameters of the $1 \times 1$ convolution obtained in training (zero-padded to $3 \times 3$), and the parameters of the identity mapping obtained in training (expressed as a $3 \times 3$ kernel) respectively.
9. A double-light multi-model long-time target tracking system, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the double-light multi-model long-time target tracking method of any one of claims 1-8 when invoked by the processor.
10. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores a computer program configured to implement the steps of the double-light multi-model long-time target tracking method of any one of claims 1-8 when invoked by a processor.
CN202211177765.3A 2022-09-27 2022-09-27 Double-light multi-model long-time target tracking method and system and storage medium Active CN115294176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211177765.3A CN115294176B (en) 2022-09-27 2022-09-27 Double-light multi-model long-time target tracking method and system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211177765.3A CN115294176B (en) 2022-09-27 2022-09-27 Double-light multi-model long-time target tracking method and system and storage medium

Publications (2)

Publication Number Publication Date
CN115294176A CN115294176A (en) 2022-11-04
CN115294176B true CN115294176B (en) 2023-04-07

Family

ID=83834523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211177765.3A Active CN115294176B (en) 2022-09-27 2022-09-27 Double-light multi-model long-time target tracking method and system and storage medium

Country Status (1)

Country Link
CN (1) CN115294176B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168322B (en) * 2023-01-10 2024-02-23 中国人民解放军军事科学院国防科技创新研究院 Unmanned aerial vehicle long-time tracking method and system based on multi-mode fusion

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327271A (en) * 2021-05-28 2021-08-31 北京理工大学重庆创新中心 Decision-level target tracking method and system based on double-optical twin network and storage medium
CN114170269A (en) * 2021-11-18 2022-03-11 安徽清新互联信息科技有限公司 Multi-target tracking method, equipment and storage medium based on space-time correlation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018105544A1 (en) * 2018-03-09 2019-09-12 Jena-Optronik Gmbh A method for initializing a tracking algorithm, method for training an artificial neural network, computer program product, computer-readable storage medium and data carrier signal for carrying out such methods and apparatus for data processing
CN113077491B (en) * 2021-04-02 2023-05-02 安徽大学 RGBT target tracking method based on cross-modal sharing and specific representation form
CN113628249B (en) * 2021-08-16 2023-04-07 电子科技大学 RGBT target tracking method based on cross-modal attention mechanism and twin structure
CN115100235B (en) * 2022-08-18 2022-12-20 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Target tracking method, system and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant