CN115294176B - Double-light multi-model long-time target tracking method and system and storage medium - Google Patents

Double-light multi-model long-time target tracking method and system and storage medium

Info

Publication number
CN115294176B
Authority
CN
China
Prior art keywords
module
dual
training
tracking
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211177765.3A
Other languages
Chinese (zh)
Other versions
CN115294176A (en)
Inventor
何震宇
毛凯歌
田超
杨超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202211177765.3A priority Critical patent/CN115294176B/en
Publication of CN115294176A publication Critical patent/CN115294176A/en
Application granted granted Critical
Publication of CN115294176B publication Critical patent/CN115294176B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention provides a double-light (visible-thermal infrared) multi-model long-time target tracking method, system, and storage medium. The beneficial effects of the invention are that the visible light-thermal infrared dual-light target tracker has better robustness and generalization capability and can accurately and rapidly achieve long-time tracking of the target.

Description

Double-light multi-model long-time target tracking method and system and storage medium
Technical Field
The invention relates to the technical field of target tracking, in particular to a double-light multi-model long-time target tracking method, a double-light multi-model long-time target tracking system and a storage medium.
Background
Target tracking is an important research direction in computer vision and is widely applied in unmanned driving, video surveillance, intelligent robotics, human-computer interaction, and other fields. Given the target's position in the first frame, the task of target tracking is to model the target and, combining video context information, predict the target's position in subsequent frames. After years of development, and especially with the application of deep learning in recent years, the performance of target tracking algorithms has improved continuously. However, in harsh environments (such as extreme illumination, occlusion, and interference from similar objects), there is still considerable room for improvement, and how to raise algorithm performance in these scenes remains an open problem.
The basic framework of the classic long-time target tracking algorithm is shown in fig. 1 and mainly comprises a tracking module, a detection module, a learning module, and an integrator. To improve performance in long-time tracking scenes, the algorithm runs a traditional target tracking algorithm and a target detection algorithm separately; the integrator combines their results to obtain the final tracking result, and the learning module continuously updates the tracking module and the detection module online, improving the model's adaptability to challenges such as target deformation, scale change, and occlusion and thereby enhancing the robustness of the algorithm.
In terms of data, current target tracking methods generally train only on visible light (or thermal infrared) images and, after training, test (apply) only on visible light (or thermal infrared) data. Visible light-thermal infrared dual-light (RGB-T) tracking algorithms have also been proposed; they use paired, view-aligned bimodal data in both model training and testing (practical application). As shown in fig. 2, two or more feature extractors are usually used in parallel to extract the features of each modality. The advantage of this approach is that the complementary information provided by the two modalities can be exploited, yielding better tracking in complex scenes.
The prior art has the following defects:
Existing visible light-thermal infrared fusion modules are divided into image-level fusion (the same network parameters extract features from both dual-light images simultaneously) and feature-level fusion (different network parameters extract features from the two dual-light images separately before fusing them). In general, a visible-thermal infrared image pair may contain both a large number of modality-shared features and some modality-specific features. Image-level fusion therefore ignores the existence of modality-specific features, while the independent feature extraction and feature fusion in feature-level fusion weakens the modality-shared features.
When the classic long-time target tracking algorithm processes each frame, the detection module is started to search globally for the target regardless of whether the output of the tracking module is reliable. However, the detection module involves a large amount of computation (e.g., a detector consisting of three cascaded classifiers), and enabling a global search in every frame slows the algorithm down. In addition, some existing methods switch among multiple different target tracking methods depending on the target state.
In the classic long-time target tracking algorithm, after the target is tracked successfully, the learning module updates the target model with the tracking result as a positive sample to improve the algorithm's adaptability to changes in target appearance, scale, and so on. However, when the target is occluded yet tracking still succeeds, the occluded target is also learned as a positive sample. The occluded target then contains many background features that are erroneously learned and added to the model's sample library, degrading performance in subsequent tracking and causing the tracking result to drift or even fail.
In terms of training, existing methods generally initialize the model with parameters pre-trained on large-scale visible light datasets. However, a network pre-trained only on visible light images lacks the ability to extract and fuse visible light-thermal infrared dual-light image features, and therefore cannot be applied well to visible light-thermal infrared dual-light tracking scenes.
In terms of loss functions, existing methods directly predict the coordinates of the target bounding box and train the regression branch by computing the loss against the real annotation, ignoring the negative effect of the weak misalignment present in training images and actual test scenes; and training the classification branch with a binary cross entropy loss cannot directly guide the network to select the best of several candidate prediction boxes as the final prediction result.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a double-light multi-model long-time target tracking method, system, and storage medium that improve the performance of long-time tracking algorithms in environments such as extreme illumination and severe weather.
The invention provides a double-light multi-model long-time target tracking method, which comprises the following steps:
pre-training: mutual-reconstruction pre-training is performed on the dual-light fusion module using unlabeled visible light-thermal infrared images to obtain initialization weight parameters;
training: the dual-light fusion module is weight-initialized with the initialization weight parameters obtained in the pre-training step, and tracking training is performed on a visible light-thermal infrared tracking dataset using a regression loss function based on boundary distribution prediction and an intersection-over-union (IoU) aware classification loss function;
a re-parameterization step: both the pre-training step and the training step use a dual-light fusion module with a residual structure, and the dual-light fusion module with the residual structure is converted into a single-path ("straight-tube") dual-light fusion module through re-parameterization;
the inference step comprises: the following steps are performed for each frame of the input visible-thermal infrared image pair:
step a: extracting features of an input image frame by using a dual-light fusion module;
step b: the current algorithm running state comprises local tracking or global detection, and a local tracking module or a global detection module is run based on the current algorithm running state;
step c: based on the result obtained in step b, the state switching module evaluates whether the current frame is tracked successfully and determines whether to switch the running state;
step d: based on the result obtained in step b, combined with historical target information, the update control module evaluates whether the current frame should update the template in the local tracking module and the classifier in the state switching module.
As a further improvement of the present invention, the dual-light fusion module consists of a dual-stream convolution network whose coupling rate differs across convolution layers and gradually increases with the depth of the dual-stream convolution network. Through the coupled convolution kernels, the dual-light fusion module extracts features common to the visible light and thermal infrared modalities; through the uncoupled convolution kernels, it extracts the private features of the visible light/thermal infrared images respectively. The features extracted by the dual-stream convolution network are input into a channel attention module for fusion.
As a further improvement of the present invention, the state switching module performs the following steps:
step 1: the prediction result of the local tracking module or the global detection module is input into the classifier of the state switching module, and the classifier evaluates the prediction result to obtain a score s_s;
step 2: judge the state of the current algorithm; if the current algorithm is in the local tracking state, execute the first branch step; if the current algorithm is in the global detection state, execute the second branch step;
the first branch step: judge whether the score s_s is less than the threshold γ_s; if so, the local tracking module is considered to have failed to track the target and the algorithm switches to the global detection state; otherwise, the local tracking module is considered to have tracked the target successfully and the local tracking state is maintained;
the second branch step: judge whether the score s_s is greater than the threshold γ_s; if so, the global detection module is considered to have detected the target again and the algorithm switches back to the local tracking state; otherwise, the target is considered not yet detected and the global detection state is maintained.
As a further improvement of the present invention, the state switching module comprises a classifier that is updated according to the prediction results during tracking. The update process is as follows: first, samples are drawn randomly around the prediction result and divided into positive and negative samples according to their IoU with the prediction result; the obtained positive and negative samples are then used to train the classifier and update its parameters, where IoU is the overlap rate between a sampled box and the predicted box.
As a further improvement of the present invention, the update control module is formed by stacking LSTM layers and a fully connected layer, as shown in formula 1, where LSTM denotes a multi-layer long short-term memory network whose time step is gradually shortened and l denotes the number of LSTM layers. Inputting X_t into the LSTM aggregates the target's context information over the time sequence into a feature representing the recent overall state of the target; passing this feature through the fully connected layer FC yields the score s_u, and the local tracking module and the state switching module are updated only when the score s_u is greater than the update threshold γ_u;

$$s_u = \mathrm{FC}\left(\mathrm{LSTM}_l\left(\mathrm{LSTM}_{l-1}\left(\cdots \mathrm{LSTM}_1(X_t)\cdots\right)\right)\right) \qquad \text{(formula 1)}$$

where X_t denotes the history information of the target, composed of the state information of the most recent t_s frames.
As a further improvement of the present invention, in the pre-training step, the visible light image and the thermal infrared image are first uniformly divided into squares of size $P \times P$; then $N$ squares are randomly selected in each image and the image content in the selected squares is occluded with color blocks, yielding $\tilde{I}_v$ and $\tilde{I}_i$, the occluded visible light image and thermal infrared image respectively, both in $\mathbb{R}^{H \times W \times 3}$, where $\mathbb{R}$ denotes the real space and H and W denote the height and width of the image. The randomly occluded images serve as the input of the dual-light fusion module, and the visible light reconstruction module and the thermal infrared reconstruction module recover the visible light image and the infrared image respectively from the features extracted by the dual-light fusion module, obtaining $\hat{I}_v$ and $\hat{I}_i$, the restored visible light image and thermal infrared image. Finally, taking the original images as ground truth, the mean square error loss shown in formula 2 is computed and the model is trained until convergence. During tracking training, the model loads the pre-trained parameters to initialize the dual-light fusion module, and the visible light reconstruction module and the thermal infrared reconstruction module are discarded;

$$L_{MSE} = \| I_v - \hat{I}_v \|_2^2 + \| I_i - \hat{I}_i \|_2^2 \qquad \text{(formula 2)}$$

where $L_{MSE}$ denotes the loss on the image pair, $I_v$ and $I_i$ denote the visible light original image and the thermal infrared original image respectively, $\hat{I}_v$ denotes the restored visible light image, and $\hat{I}_i$ denotes the restored thermal infrared image.
As a further improvement of the present invention, in the training step, the regression loss function based on boundary distribution prediction is shown in formula 5, with the distribution loss defined in formula 4:

$$L_{DF} = -\sum_{e \in E}\left(\left(\lfloor \hat{e} \rfloor + 1 - \hat{e}\right)\log P_e\!\left(\lfloor \hat{e} \rfloor\right) + \left(\hat{e} - \lfloor \hat{e} \rfloor\right)\log P_e\!\left(\lfloor \hat{e} \rfloor + 1\right)\right) \qquad \text{(formula 4)}$$

In formula 4, e denotes a certain boundary of the bounding box, E denotes the set of boundaries of the bounding box, $\hat{e}$ denotes the real label, $\lfloor \hat{e} \rfloor$ denotes the integer part of the real label, $P_e(l)$ denotes the probability that the boundary e of the target falls at $l$ within the range $[0, L]$, and $P_e(l+1)$ denotes the probability that the boundary e of the target falls at $l+1$ within the interval $[0, L]$;

$$L_{reg} = L_{IoU}(b, \hat{b}) + L_{DF} \qquad \text{(formula 5)}$$

In formula 5, $L_{IoU}$ denotes the conventional intersection-over-union loss, and $b$ and $\hat{b}$ denote the predicted target bounding box and the true target bounding box respectively.
As a further improvement of the present invention, in the training step, for a positive sample, its corresponding label is adjusted from 1 to the intersection-over-union between the corresponding candidate box and the ground-truth box; for a negative sample, its corresponding label is kept at 0. The finally obtained IoU-aware classification loss function is shown in formula 6 and is used to train the classification branch of the tracker, where Pos denotes the set of positive samples and $p_i$ denotes the confidence prediction result of sample $i$:

$$L_{cls} = -\sum_{i \in Pos}\left(IoU_i \log p_i + \left(1 - IoU_i\right)\log\left(1 - p_i\right)\right) - \sum_{i \notin Pos}\log\left(1 - p_i\right) \qquad \text{(formula 6)}$$

where $IoU_i$ denotes the overlap rate between the predicted target bounding box and the true target bounding box.
As a further improvement of the invention, in the re-parameterization step, the dual-light fusion module is composed of a $3 \times 3$ convolution layer plus a side $1 \times 1$ convolution branch and an identity mapping branch. The $1 \times 1$ convolution and the identity mapping are both regarded as $3 \times 3$ convolutions whose kernel parameters are 0 everywhere except at the central position; then, according to the additivity of convolution, the $3 \times 3$ convolution parameters of the three branches are added to obtain a single-path model that is identical to the original model's output and contains only one $3 \times 3$ convolution, as shown in formula 7, where Parameters denotes the parameter space of the corresponding $3 \times 3$ convolution kernel;

$$W' = W_{3\times3} + W_{1\times1} + W_{id}, \qquad W_{3\times3},\ W_{1\times1},\ W_{id},\ W' \in \mathrm{Parameters}(3 \times 3) \qquad \text{(formula 7)}$$

where $W_{3\times3}$, $W_{1\times1}$, and $W_{id}$ denote the parameters of the $3 \times 3$ convolution obtained in training, the parameters of the $1 \times 1$ convolution obtained in training (zero-padded to $3 \times 3$), and the parameters of the identity mapping obtained in training (expressed as a $3 \times 3$ kernel) respectively.
The invention also provides a double-light multi-model long-time target tracking system, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the double-light multi-model long-time target tracking method when invoked by the processor.
The invention also provides a computer-readable storage medium storing a computer program configured to implement the steps of the double-light multi-model long-time target tracking method when invoked by a processor.
The invention has the beneficial effects that the visible light-thermal infrared dual-light target tracker has better robustness and generalization capability and can accurately and rapidly achieve long-time tracking of the target.
Drawings
FIG. 1 is a classic long-time target tracking algorithm framework diagram;
FIG. 2 is a general flow diagram of feature-fusion RGB-T tracking;
FIG. 3 is an overall frame diagram of the present invention;
fig. 4 is a schematic diagram of a dual light fusion module;
FIG. 5 is a flow chart of the operation of the state switching module;
FIG. 6 is a pre-training diagram of visible light-thermal infrared image mutual reconstruction;
FIG. 7 is a diagram of reparameterization inference acceleration.
Detailed Description
The invention discloses a double-light multi-model long-time target tracking method, which comprises the following steps:
pre-training: mutual-reconstruction pre-training is performed on the dual-light fusion module using a large number of unlabeled visible light-thermal infrared images to obtain good initialization weight parameters; that is, before formal training, mutual reconstruction is used as a proxy task for pre-training on a large number of unlabeled visible light-thermal infrared image pairs;
training: the dual-light fusion module is weight-initialized with the initialization weight parameters obtained in the pre-training step, and tracking training is performed on a visible light-thermal infrared tracking dataset using a regression loss function based on boundary distribution prediction and an IoU-aware classification loss function; in formal training, the regression branch is trained with the regression loss function based on boundary distribution prediction, improving the prediction accuracy of the algorithm in weakly misaligned scenes, and the classification branch is trained with the IoU-aware classification loss function, encouraging the algorithm to select more accurate candidate boxes as final results;
a re-parameterization step: both the pre-training step and the training step use a dual-light fusion module with a residual structure, which is converted into a single-path ("straight-tube") dual-light fusion module through re-parameterization, improving the practical inference speed of the model;
the inference step comprises:
the overall framework of the algorithm proposed by the present invention is shown in fig. 3. After the model initialization is completed, the following steps are carried out on the visible light-thermal infrared pair input in each frame:
step a: extracting features of an input image frame by using a dual-light fusion module;
step b: the current algorithm running state comprises local tracking or global detection, and a local tracking module or a global detection module is run based on the current algorithm running state;
step c: based on the result obtained in step b, the state switching module is run to evaluate whether the current frame is tracked successfully and to determine whether to switch the running state; switching the algorithm state with a state switching module avoids spending resources on a global search in every frame;
step d: based on the result obtained in step b, combined with historical target information, the update control module evaluates whether the current frame should update the template in the local tracking module and the classifier in the state switching module; using an update control module to judge whether the current frame is suitable for updating the target state avoids the adverse effect that updating while the target is occluded would have on subsequent tracking. A sketch of this per-frame loop is given after this list.
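For illustration only, the per-frame control flow of steps a-d can be sketched in Python as below. All module objects, method names (track, detect, classify, should_update, update_template, update_classifier), and the attribute gamma_s are placeholder assumptions of this sketch; the patent fixes the behavior, not this interface.

```python
LOCAL, GLOBAL = "local_tracking", "global_detection"

def track_sequence(frames, fusion, local_tracker, global_detector,
                   state_switch, update_ctrl):
    """Per-frame loop of steps a-d; every module object is a placeholder
    standing in for the patent's corresponding module."""
    state, results = LOCAL, []
    for rgb, tir in frames:                          # visible/thermal image pair
        feat = fusion(rgb, tir)                      # step a: fused dual-light features
        box = (local_tracker.track(feat) if state == LOCAL
               else global_detector.detect(feat))    # step b: run the active module
        score = state_switch.classify(box, feat)     # step c: score the prediction
        if state == LOCAL and score < state_switch.gamma_s:
            state = GLOBAL                           # tracking failed -> global search
        elif state == GLOBAL and score > state_switch.gamma_s:
            state = LOCAL                            # target re-found -> local tracking
        if update_ctrl.should_update(box, score):    # step d: history-gated update
            local_tracker.update_template(box, feat)
            state_switch.update_classifier(box, feat)
        results.append(box)
    return results
```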
The invention uses dual-light fusion modules with different structures but identical outputs in the training step and the inference step, using the re-parameterization technique to achieve a better balance between speed and accuracy.
The invention is explained in detail as follows:
One, the dual-light fusion module:
In order to further improve the long-time target tracking effect in extreme illumination and in severe weather scenes such as rain and snow, and considering the physically complementary characteristics of visible light and thermal infrared, the invention takes a visible light-thermal infrared image pair as input and introduces a dual-light fusion module that fuses the bimodal visible/thermal infrared features for the other modules of the algorithm.
The dual-light fusion module is shown in fig. 4. It consists of a dual-stream convolution network whose coupling rate (indicated in fig. 4 by the size of the shared part of each feature layer) differs across convolution layers and gradually increases with network depth. Through the coupled convolution kernels, the model extracts features common to the visible light and thermal infrared modalities; through the uncoupled convolution kernels, it extracts the private features of the visible light/thermal infrared images respectively. The features extracted by the dual-stream convolution network are input into a channel attention module for fusion, as sketched below.
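As a minimal PyTorch sketch of this idea, the code below implements one partially coupled convolution layer and a squeeze-and-excitation-style channel attention fusion head. The class names, the per-layer coupling_rate values, and the exact attention design are assumptions of this illustration; the patent specifies only shared plus private kernels with a depth-increasing coupling rate and a channel attention fusion module.

```python
import torch
import torch.nn as nn

class PartiallyCoupledConv(nn.Module):
    """One dual-stream layer: a fraction `coupling_rate` of the output channels
    comes from kernels shared by both modalities (modality-common features);
    the rest comes from per-modality kernels (modality-private features)."""
    def __init__(self, in_ch, out_ch, coupling_rate):
        super().__init__()
        shared = int(out_ch * coupling_rate)
        self.shared = nn.Conv2d(in_ch, shared, 3, padding=1)
        self.priv_rgb = nn.Conv2d(in_ch, out_ch - shared, 3, padding=1)
        self.priv_tir = nn.Conv2d(in_ch, out_ch - shared, 3, padding=1)

    def forward(self, x_rgb, x_tir):
        f_rgb = torch.cat([self.shared(x_rgb), self.priv_rgb(x_rgb)], dim=1)
        f_tir = torch.cat([self.shared(x_tir), self.priv_tir(x_tir)], dim=1)
        return f_rgb, f_tir

class ChannelAttentionFusion(nn.Module):
    """Concatenate the two streams on the channel axis and re-weight every
    channel with a learned attention vector (squeeze-and-excitation style)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2 * channels, 2 * channels // reduction), nn.ReLU(),
            nn.Linear(2 * channels // reduction, 2 * channels), nn.Sigmoid())

    def forward(self, f_rgb, f_tir):
        f = torch.cat([f_rgb, f_tir], dim=1)
        w = self.fc(f).unsqueeze(-1).unsqueeze(-1)   # one weight per channel
        return f * w                                 # fused, re-weighted feature

# A deeper network would stack such layers with a growing coupling rate,
# e.g. coupling_rate = 0.25, 0.5, 0.75 from shallow to deep layers.
```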
Two, the state switching module:
In order to avoid searching globally with the global detection module while the local tracking module is tracking the target successfully, thereby reducing the computation of the long-time tracking method and increasing its running speed, the invention introduces a state switching module.
The state switching module comprises a classifier and a preset threshold γ_s.
As shown in fig. 5, the state switching module performs the following steps:
step 1: the prediction result of the local tracking module or the global detection module is input into the classifier of the state switching module, and the classifier evaluates the prediction result to obtain a score s_s;
step 2: judge the state of the current algorithm; if the current algorithm is in the local tracking state, execute the first branch step; if the current algorithm is in the global detection state, execute the second branch step;
the first branch step: judge whether the score s_s is less than the threshold γ_s; if so, the local tracking module is considered to have failed to track the target and the algorithm switches to the global detection state; otherwise, the local tracking module is considered to have tracked the target successfully and the local tracking state is maintained;
the second branch step: judge whether the score s_s is greater than the threshold γ_s; if so, the global detection module is considered to have detected the target again and the algorithm switches back to the local tracking state; otherwise, the target is considered not yet detected and the global detection state is maintained.
In long-time target tracking, the appearance, shape, and other characteristics of the target may change considerably, so the classifier is continuously updated according to the prediction results during tracking. The update process is as follows: first, samples are drawn randomly around the prediction result and divided into positive and negative samples according to their IoU with the prediction result; these samples are then used to train the classifier and update its parameters, as sketched below.
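A small, self-contained sketch of this sampling scheme follows. The sample count, jitter magnitude, and the IoU thresholds (0.7/0.3) are assumed values for illustration; the patent specifies the procedure but not these numbers.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def sample_training_boxes(pred_box, n=64, jitter=0.3, pos_iou=0.7, neg_iou=0.3):
    """Randomly jitter the predicted box and split the samples into positives
    and negatives by their IoU with the prediction; the two sets are then
    used for one update step of the state-switching classifier."""
    x1, y1, x2, y2 = pred_box
    w, h = x2 - x1, y2 - y1
    rng = np.random.default_rng()
    pos, neg = [], []
    for _ in range(n):
        dx, dy, dw, dh = rng.normal(0.0, jitter, 4)
        box = (x1 + dx * w, y1 + dy * h, x2 + (dx + dw) * w, y2 + (dy + dh) * h)
        score = iou(box, pred_box)
        if score >= pos_iou:
            pos.append(box)
        elif score <= neg_iou:
            neg.append(box)
    return pos, neg
```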
Three, the update control module:
In order to prevent the model from updating the target model at unsuitable times (for example, when the target is partially occluded), which would cause tracking drift or even failure, the invention introduces an update control module based on historical information that decides, from the target's historical state, whether to update the local tracking module and the state switching module in the current frame.
After the tracking result of the t-th frame is obtained, the algorithm stores the following information about the target: 1. the minimal target bounding box predicted by the local tracking module; 2. the response map of the target position predicted by the local tracking module; 3. the features of the target in the current frame, extracted according to the bounding box in 1; 4. the features of the target in the initial frame, extracted according to the initial information of the local tracking module. Item 1 contains the target's latest motion and scale-change information, item 2 reflects the reliability of the local tracking module, and items 3 and 4 capture the target's appearance change. This information is mapped into vectors and concatenated to form the target's state information x_t in the current frame. The state information of the most recent t_s frames constitutes the target's history information X_t.
The update control module is formed by stacking multiple long short-term memory (LSTM) layers, whose time step shortens layer by layer, and a fully connected layer, as shown in formula 1, where l denotes the number of LSTM layers. Inputting X_t into the LSTM network aggregates the target's context information over the time sequence into a feature representing the recent overall state of the target; passing this feature through the fully connected layer FC yields the score s_u, and the local tracking module and the state switching module are updated only when s_u is greater than the update threshold γ_u:

$$s_u = \mathrm{FC}\left(\mathrm{LSTM}_l\left(\mathrm{LSTM}_{l-1}\left(\cdots \mathrm{LSTM}_1(X_t)\cdots\right)\right)\right) \qquad \text{(formula 1)}$$
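A minimal PyTorch sketch of formula 1 follows. The state dimension, hidden size, number of layers, and the specific way the time step is shortened between layers (here: keeping only the second half of the sequence) are assumptions of this illustration.

```python
import torch
import torch.nn as nn

class UpdateControl(nn.Module):
    """Stacked LSTMs aggregate the state vectors of the last t_s frames; a
    fully connected layer maps the final hidden state to the score s_u."""
    def __init__(self, state_dim=128, hidden=64, layers=3):
        super().__init__()
        self.lstms = nn.ModuleList(
            nn.LSTM(state_dim if i == 0 else hidden, hidden, batch_first=True)
            for i in range(layers))
        self.fc = nn.Linear(hidden, 1)

    def forward(self, X_t):                      # X_t: (batch, t_s, state_dim)
        h = X_t
        for lstm in self.lstms:
            h, _ = lstm(h)
            h = h[:, h.size(1) // 2:, :]         # shorten the time step per layer
        return torch.sigmoid(self.fc(h[:, -1]))  # s_u; update only if s_u > gamma_u
```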
Four, visible light-thermal infrared image mutual reconstruction pre-training:
parameter initialization during deep learning model training has a great influence on the performance of a training result. In order to improve the generalization capability and robustness of the model, the invention introduces visible light-thermal infrared image mutual reconstruction pre-training, utilizes a large amount of unmarked visible light-thermal infrared images to train the feature extraction capability of the double-light fusion module, and then uses the feature extraction capability as an initial parameter to carry out target tracking training.
The process of visible light-thermal infrared image mutual reconstruction pre-training is shown in fig. 6, where each reconstruction module is formed by stacking several layers composed of convolution and up-sampling operations. Each training sample in pre-training consists of a visible light-thermal infrared image pair $(I_v, I_i)$. First, the two images are uniformly divided into squares of size $P \times P$; then $N$ squares are randomly selected in each image and the image content of the selected squares is occluded with color blocks, yielding the occluded pair $(\tilde{I}_v, \tilde{I}_i)$. The randomly occluded images serve as the input of the dual-light fusion module, and the reconstruction modules recover the visible light image and the infrared image respectively from the features extracted by the dual-light fusion module, obtaining $(\hat{I}_v, \hat{I}_i)$. Finally, taking the original images as ground truth, the mean square error loss shown in formula 2 is computed and the model is trained until convergence. During tracking training, the model loads the pre-trained parameters to initialize the dual-light fusion module, and the reconstruction modules are discarded.

$$L_{MSE} = \| I_v - \hat{I}_v \|_2^2 + \| I_i - \hat{I}_i \|_2^2 \qquad \text{(formula 2)}$$
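One pre-training step can be sketched as below; fusion, rec_rgb, and rec_tir stand for the dual-light fusion module and the two reconstruction modules, while the grid size, the number of masked squares, and the gray fill value are assumed numbers, not values from the patent.

```python
import torch
import torch.nn.functional as F

def pretrain_step(fusion, rec_rgb, rec_tir, I_rgb, I_tir, grid=16, n_mask=40):
    """One mutual-reconstruction step implementing formula 2."""
    def mask(img):
        out = img.clone()
        _, _, H, W = img.shape
        ph, pw = H // grid, W // grid
        for _ in range(n_mask):                          # occlude random squares
            i = torch.randint(grid, (1,)).item()
            j = torch.randint(grid, (1,)).item()
            out[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = 0.5
        return out
    feat = fusion(mask(I_rgb), mask(I_tir))              # features of the masked pair
    loss = (F.mse_loss(rec_rgb(feat), I_rgb)             # recover the visible image
            + F.mse_loss(rec_tir(feat), I_tir))          # recover the thermal image
    return loss
```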
Five, the regression loss function based on boundary distribution prediction:
The weak misalignment of the visible light-thermal infrared image pair makes the boundaries of the target bounding box uncertain. To address this, and unlike other methods that directly regress the distance from each boundary of the target box to the regression center, the invention predicts a probability distribution over discrete positions for each boundary and takes its expectation as the predicted boundary position.
Specifically, the distribution interval of a boundary is specified as $[0, L]$. For each boundary e of the target, the regression branch of the tracker predicts $\{P_e(l)\}_{l=0}^{L}$, where $P_e(l)$ denotes the probability that the boundary e lies within this interval and falls at position $l$. As shown in formula 3, the position of the boundary predicted by the model, $\bar{e}$, is obtained by computing the expected value of the corresponding distribution for the boundary e of the target:

$$\bar{e} = \sum_{l=0}^{L} l \cdot P_e(l) \qquad \text{(formula 3)}$$
Typically, the actual location of an object should be in the vicinity of its real label even if the label carries uncertainty. Therefore, when the real label $\hat{e}$ of boundary e lies in the interval $[\lfloor \hat{e} \rfloor, \lfloor \hat{e} \rfloor + 1]$, the predicted distribution probabilities $P_e(\lfloor \hat{e} \rfloor)$ and $P_e(\lfloor \hat{e} \rfloor + 1)$ should also be large. To encourage the model to predict larger probability values at positions near these true values, the loss function shown in formula 4 is introduced:

$$L_{DF} = -\sum_{e \in E}\left(\left(\lfloor \hat{e} \rfloor + 1 - \hat{e}\right)\log P_e\!\left(\lfloor \hat{e} \rfloor\right) + \left(\hat{e} - \lfloor \hat{e} \rfloor\right)\log P_e\!\left(\lfloor \hat{e} \rfloor + 1\right)\right) \qquad \text{(formula 4)}$$
The overall loss function of the tracker's regression branch is shown in formula 5, where $L_{IoU}$ denotes the conventional intersection-over-union loss and $b$ and $\hat{b}$ denote the predicted target bounding box and the true target bounding box respectively:

$$L_{reg} = L_{IoU}(b, \hat{b}) + L_{DF} \qquad \text{(formula 5)}$$
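Formulas 3 and 4 translate directly into a few lines of tensor code. In the PyTorch sketch below, the tensor layout (N candidates, 4 boundaries, L+1 discrete positions) is an assumption of this illustration:

```python
import torch

def boundary_distribution_head(logits, target):
    """logits: (N, 4, L+1) scores over positions 0..L for the four boundaries;
    target: (N, 4) continuous ground-truth boundary positions in [0, L].
    Returns the expected boundary positions (formula 3) and the distribution
    loss that concentrates probability mass around each label (formula 4)."""
    prob = logits.softmax(dim=-1)                               # P_e(l)
    L = logits.size(-1) - 1
    pos = torch.arange(L + 1, device=prob.device, dtype=prob.dtype)
    expected = (prob * pos).sum(-1)                             # formula 3
    lo = target.floor().clamp(0, L - 1).long()                  # floor(e_hat)
    hi = lo + 1
    w_lo = hi.to(prob.dtype) - target                           # weight for lo
    w_hi = target - lo.to(prob.dtype)                           # weight for hi
    logp = (prob + 1e-9).log()
    loss = -(w_lo * logp.gather(-1, lo.unsqueeze(-1)).squeeze(-1)
             + w_hi * logp.gather(-1, hi.unsqueeze(-1)).squeeze(-1))
    return expected, loss.sum(-1).mean()                        # sum over E, mean over N
```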
Six, the IoU-aware classification loss function:
the tracker generally needs to select the final tracking result from the candidate box according to the confidence score of the classification branch prediction. In order to promote the classification branches to select more accurate target bounding boxes, the invention introduces a cross-over ratio perception classification loss function to train the classification branches of the tracker.
Since the intersection-over-union between the predicted bounding box and the ground-truth bounding box directly reflects the accuracy of the prediction result, letting the classification branch learn to predict the IoU between each candidate box and the ground-truth box as its confidence score helps to pick the most accurate prediction result. Based on this assumption, the invention improves the conventional cross entropy loss function: for a positive sample, its corresponding label is adjusted from 1 to the IoU between the corresponding candidate box and the ground-truth box; for a negative sample, its corresponding label is kept at 0. The resulting loss function is shown in formula 6, where Pos denotes the set of positive samples and $p_i$ denotes the confidence prediction result of sample $i$:

$$L_{cls} = -\sum_{i \in Pos}\left(IoU_i \log p_i + \left(1 - IoU_i\right)\log\left(1 - p_i\right)\right) - \sum_{i \notin Pos}\log\left(1 - p_i\right) \qquad \text{(formula 6)}$$
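Because formula 6 is an ordinary binary cross entropy with soft targets, it reduces to a one-liner; the tensor shapes below are assumptions of this illustration:

```python
import torch
import torch.nn.functional as F

def iou_aware_cls_loss(pred_logits, ious, pos_mask):
    """pred_logits: (N,) raw classification scores; ious: (N,) IoU of each
    candidate box with the ground-truth box; pos_mask: (N,) bool indicator
    of positive samples. Positive targets become their IoU, negatives 0."""
    targets = torch.where(pos_mask, ious, torch.zeros_like(ious))
    return F.binary_cross_entropy_with_logits(pred_logits, targets)
```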
Seven, re-parameterization for inference acceleration:
In order to simultaneously exploit the high performance of multi-branch models during training and the fast inference of single-path models, the invention re-parameterizes the feature extraction network so that the model achieves a better balance between inference speed and tracking performance.
The basic constituent unit of the dual-light fusion module in the pre-training and training phases is shown in fig. 7: a $3 \times 3$ convolution layer plus a side $1 \times 1$ convolution branch and an identity mapping branch. Compared with a single $3 \times 3$ convolution, such a structure builds residual connections that produce an implicit ensemble of a large number of sub-models, thereby improving model performance. After training is completed, as shown in fig. 7, the $1 \times 1$ convolution and the identity mapping can both be regarded as $3 \times 3$ convolutions whose kernel parameters are 0 everywhere except at the central position. Then, according to the additivity of convolution, the $3 \times 3$ convolution parameters of the three branches are added to obtain a single-path model that is identical to the original model's output and contains only one $3 \times 3$ convolution, which improves the inference speed of the model. The procedure is shown in formula 7, where Parameters denotes the parameter space of the corresponding $3 \times 3$ convolution kernel:

$$W' = W_{3\times3} + W_{1\times1} + W_{id}, \qquad W_{3\times3},\ W_{1\times1},\ W_{id},\ W' \in \mathrm{Parameters}(3 \times 3) \qquad \text{(formula 7)}$$
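The branch fusion of formula 7 can be verified with the short PyTorch sketch below. Bias-free, batch-norm-free branches are assumed for brevity; in practice each branch's batch-norm statistics would first be folded into its kernel, as in RepVGG-style re-parameterization:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def reparameterize(conv3, conv1, channels):
    """Fold a 3x3 conv, a side 1x1 conv, and an identity branch into a single
    3x3 conv whose output equals conv3(x) + conv1(x) + x."""
    fused = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
    w = conv3.weight.clone()                     # W_3x3
    w[:, :, 1:2, 1:2] += conv1.weight            # W_1x1 zero-padded to the center
    for c in range(channels):                    # identity as a 3x3 kernel:
        w[c, c, 1, 1] += 1.0                     # a single 1 at the center
    fused.weight.copy_(w)                        # W' = W_3x3 + W_1x1 + W_id
    return fused

# Quick check of output equivalence:
c3 = nn.Conv2d(8, 8, 3, padding=1, bias=False)
c1 = nn.Conv2d(8, 8, 1, bias=False)
x = torch.randn(1, 8, 16, 16)
fused = reparameterize(c3, c1, 8)
assert torch.allclose(c3(x) + c1(x) + x, fused(x), atol=1e-4)
```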
Potential application scenarios of the invention include unmanned driving, assisted driving, intelligent security, military applications, and other fields. The application mode is to deploy the algorithm and model on computing equipment and track a designated target in the input infrared + visible light dual-channel video stream.
The invention has the beneficial effects that, through the above scheme, the visible light-thermal infrared dual-light target tracker has better robustness and generalization capability and can accurately and rapidly achieve long-time tracking of the target. Specifically:
1. The dual-light fusion module with partially coupled convolution layers better extracts and fuses the features of the visible light-thermal infrared image pair, improving the robustness of the long-time tracking algorithm under challenges such as extreme illumination and severe weather.
2. Introducing the state switching module to switch dynamically between local tracking and global detection avoids the extra computation of running global detection in every frame and increases the running speed of the algorithm.
3. Introducing the update control module to maintain a more accurate target model alleviates the tracking drift and even tracking failure caused by unreliable online updates.
4. Before formal training, large-scale pre-training on unlabeled visible light-thermal infrared image pairs with the mutual-reconstruction proxy task improves the performance and generalization capability of the model.
5. Training with the regression loss function based on boundary distribution prediction improves tracking accuracy in weakly misaligned scenes.
6. Training with the IoU-aware classification loss function selects better candidate boxes as the final prediction results.
7. After training, the multi-branch large model used in training is converted into an equivalent single-path small model through re-parameterization, improving the inference speed of the model.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A double-light multi-model long-time target tracking method is characterized by comprising the following steps:
pre-training: mutual-reconstruction pre-training is performed on the dual-light fusion module using unlabeled visible light-thermal infrared images to obtain initialization weight parameters;
training: the dual-light fusion module is weight-initialized with the initialization weight parameters obtained in the pre-training step, and tracking training is performed on a visible light-thermal infrared tracking dataset using a regression loss function based on boundary distribution prediction and an intersection-over-union (IoU) aware classification loss function;
a re-parameterization step: both the pre-training step and the training step use a dual-light fusion module with a residual structure, and the dual-light fusion module with the residual structure is converted into a single-path ("straight-tube") dual-light fusion module through re-parameterization;
the inference step comprises: the following steps are performed for each frame of the input visible-thermal infrared image pair:
step a: extracting features of an input image frame by using a dual-light fusion module;
step b: the current algorithm running state comprises local tracking or global detection, and a local tracking module or a global detection module is run based on the current algorithm running state;
step c: based on the result obtained in step b, a state switching module is run to evaluate whether the current frame is tracked successfully and to determine whether to switch the running state;
step d: based on the result obtained in step b, combined with historical target information, an update control module evaluates whether the current frame should update the template in the local tracking module and the classifier in the state switching module;
in the training step, the regression loss function based on boundary distribution prediction is shown in formula 5, with the distribution loss defined in formula 4:

$$L_{DF} = -\sum_{e \in E}\left(\left(\lfloor \hat{e} \rfloor + 1 - \hat{e}\right)\log P_e\!\left(\lfloor \hat{e} \rfloor\right) + \left(\hat{e} - \lfloor \hat{e} \rfloor\right)\log P_e\!\left(\lfloor \hat{e} \rfloor + 1\right)\right) \qquad \text{(formula 4)}$$

in formula 4, e denotes a certain boundary of the bounding box, E denotes the set of boundaries of the bounding box, $\hat{e}$ denotes the real label, $\lfloor \hat{e} \rfloor$ denotes the integer part of the real label, $P_e(l)$ denotes the probability that the boundary e of the target falls at $l$ within the range $[0, L]$, and $P_e(l+1)$ denotes the probability that the boundary e of the target falls at $l+1$ within the interval $[0, L]$;

$$L_{reg} = L_{IoU}(b, \hat{b}) + L_{DF} \qquad \text{(formula 5)}$$

in formula 5, $L_{IoU}$ denotes the conventional intersection-over-union loss, and $b$ and $\hat{b}$ denote the predicted target bounding box and the true target bounding box respectively.
2. The double-light multi-model long-time target tracking method according to claim 1, wherein the dual-light fusion module consists of a dual-stream convolution network whose coupling rate differs across convolution layers and becomes larger as the depth of the dual-stream convolution network increases; through the coupled convolution kernels, the dual-light fusion module extracts features common to the visible light and thermal infrared modalities, and through the uncoupled convolution kernels, it extracts the private features of the visible light/thermal infrared images respectively; the features extracted by the dual-stream convolution network are input into a channel attention module for fusion.
3. The double-light multi-model long-time target tracking method according to claim 1, wherein the state switching module performs the following steps:
step 1: the prediction result of the local tracking module or the global detection module is input into the classifier of the state switching module, and the classifier evaluates the prediction result to obtain a score s_s;
step 2: judge the state of the current algorithm; if the current algorithm is in the local tracking state, execute the first branch step; if the current algorithm is in the global detection state, execute the second branch step;
the first branch step: judge whether the score s_s is less than the threshold γ_s; if so, the local tracking module is considered to have failed to track the target and the algorithm switches to the global detection state; otherwise, the local tracking module is considered to have tracked the target successfully and the local tracking state is maintained;
the second branch step: judge whether the score s_s is greater than the threshold γ_s; if so, the global detection module is considered to have detected the target again and the algorithm switches back to the local tracking state; otherwise, the target is considered not yet detected and the global detection state is maintained.
4. The double-light multi-model long-time target tracking method according to claim 3, wherein the state switching module comprises a classifier that is updated according to the prediction results during tracking; the update process is as follows: first, samples are drawn randomly around the prediction result and divided into positive and negative samples according to their IoU with the prediction result; the obtained positive and negative samples are then used to train the classifier and update its parameters, where IoU is the overlap rate between a sampled box and the predicted box.
5. The double-light multi-model long-time target tracking method according to claim 1, wherein the update control module is formed by stacking LSTM layers and a fully connected layer, as shown in formula 1, where LSTM denotes a multi-layer long short-term memory network whose time step is gradually shortened and n denotes the number of LSTM layers; inputting X_t into the LSTM aggregates the target's context information over the time sequence into a feature representing the recent overall state of the target; passing this feature through the fully connected layer FC yields the score s_u, and the local tracking module and the state switching module are updated only when the score s_u is greater than the update threshold γ_u;

$$s_u = \mathrm{FC}\left(\mathrm{LSTM}_n\left(\mathrm{LSTM}_{n-1}\left(\cdots \mathrm{LSTM}_1(X_t)\cdots\right)\right)\right) \qquad \text{(formula 1)}$$

where X_t denotes the history information of the target, composed of the state information of the most recent t_s frames.
6. The double-light multi-model long-time target tracking method according to claim 1, wherein in the pre-training step, the visible light image and the thermal infrared image are first uniformly divided into squares of size $P \times P$; then $N$ squares are randomly selected in each image and the image content in the selected squares is occluded with color blocks, yielding $\tilde{I}_v$ and $\tilde{I}_i$, the occluded visible light image and thermal infrared image respectively, both in $\mathbb{R}^{H \times W \times 3}$, where $\mathbb{R}$ denotes the real number space and H and W denote the height and width of the image; the randomly occluded images serve as the input of the dual-light fusion module, and the visible light reconstruction module and the thermal infrared reconstruction module recover the visible light image and the infrared image respectively from the features extracted by the dual-light fusion module, obtaining $\hat{I}_v$ and $\hat{I}_i$, the restored visible light image and thermal infrared image respectively; finally, taking the original images as ground truth, the mean square error loss shown in formula 2 is computed and the model is trained until convergence; during tracking training, the model loads the pre-trained parameters to initialize the dual-light fusion module, and the visible light reconstruction module and the thermal infrared reconstruction module are discarded;

$$L_{MSE} = \| I_v - \hat{I}_v \|_2^2 + \| I_i - \hat{I}_i \|_2^2 \qquad \text{(formula 2)}$$

where $L_{MSE}$ denotes the loss on the image pair, $I_v$ and $I_i$ denote the visible light original image and the thermal infrared original image respectively, $\hat{I}_v$ denotes the restored visible light image, and $\hat{I}_i$ denotes the restored thermal infrared image.
7. The double-light multi-model long-time target tracking method according to claim 1, wherein in the training step, for a positive sample, its corresponding label is adjusted from 1 to the intersection-over-union between the corresponding candidate box and the ground-truth box; for a negative sample, its corresponding label is kept at 0; the finally obtained IoU-aware classification loss function is shown in formula 6 and is used to train the classification branch of the tracker, where Pos denotes the set of positive samples and $p_i$ denotes the confidence prediction result of sample $i$;

$$L_{cls} = -\sum_{i \in Pos}\left(IoU_i \log p_i + \left(1 - IoU_i\right)\log\left(1 - p_i\right)\right) - \sum_{i \notin Pos}\log\left(1 - p_i\right) \qquad \text{(formula 6)}$$

where $IoU_i$ denotes the overlap rate between the predicted target bounding box and the true target bounding box.
8. The double-light multi-model long-time target tracking method according to claim 1, wherein in the re-parameterization step, the dual-light fusion module is composed of a $3 \times 3$ convolution layer plus a side $1 \times 1$ convolution branch and an identity mapping branch; the $1 \times 1$ convolution and the identity mapping are both regarded as $3 \times 3$ convolutions whose kernel parameters are 0 everywhere except at the central position; then, according to the additivity of convolution, the $3 \times 3$ convolution parameters of the three branches are added to obtain a single-path model that is identical to the original model's output and contains only one $3 \times 3$ convolution, as shown in formula 7, where Parameters denotes the parameter space of the corresponding $3 \times 3$ convolution kernel;

$$W' = W_{3\times3} + W_{1\times1} + W_{id}, \qquad W_{3\times3},\ W_{1\times1},\ W_{id},\ W' \in \mathrm{Parameters}(3 \times 3) \qquad \text{(formula 7)}$$

where $W_{3\times3}$, $W_{1\times1}$, and $W_{id}$ denote the parameters of the $3 \times 3$ convolution obtained in training, the parameters of the $1 \times 1$ convolution obtained in training (zero-padded to $3 \times 3$), and the parameters of the identity mapping obtained in training (expressed as a $3 \times 3$ kernel) respectively.
9. A double-light multi-model long-time target tracking system, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the double-light multi-model long-time target tracking method of any one of claims 1-8 when invoked by the processor.
10. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores a computer program configured to implement the steps of the double-light multi-model long-time target tracking method of any one of claims 1-8 when invoked by a processor.
CN202211177765.3A 2022-09-27 2022-09-27 Double-light multi-model long-time target tracking method and system and storage medium Active CN115294176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211177765.3A CN115294176B (en) 2022-09-27 2022-09-27 Double-light multi-model long-time target tracking method and system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211177765.3A CN115294176B (en) 2022-09-27 2022-09-27 Double-light multi-model long-time target tracking method and system and storage medium

Publications (2)

Publication Number Publication Date
CN115294176A CN115294176A (en) 2022-11-04
CN115294176B true CN115294176B (en) 2023-04-07

Family

ID=83834523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211177765.3A Active CN115294176B (en) 2022-09-27 2022-09-27 Double-light multi-model long-time target tracking method and system and storage medium

Country Status (1)

Country Link
CN (1) CN115294176B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168322B (en) * 2023-01-10 2024-02-23 中国人民解放军军事科学院国防科技创新研究院 Unmanned aerial vehicle long-time tracking method and system based on multi-mode fusion

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327271A (en) * 2021-05-28 2021-08-31 北京理工大学重庆创新中心 Decision-level target tracking method and system based on double-optical twin network and storage medium
CN114170269A (en) * 2021-11-18 2022-03-11 安徽清新互联信息科技有限公司 Multi-target tracking method, equipment and storage medium based on space-time correlation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018105544A1 (en) * 2018-03-09 2019-09-12 Jena-Optronik Gmbh A method for initializing a tracking algorithm, method for training an artificial neural network, computer program product, computer-readable storage medium and data carrier signal for carrying out such methods and apparatus for data processing
CN113077491B (en) * 2021-04-02 2023-05-02 安徽大学 RGBT target tracking method based on cross-modal sharing and specific representation form
CN113628249B (en) * 2021-08-16 2023-04-07 电子科技大学 RGBT target tracking method based on cross-modal attention mechanism and twin structure
CN115100235B (en) * 2022-08-18 2022-12-20 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Target tracking method, system and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant