CN112862860A - Object perception image fusion method for multi-modal target tracking - Google Patents

Object perception image fusion method for multi-modal target tracking

Info

Publication number
CN112862860A
CN112862860A
Authority
CN
China
Prior art keywords
image
network
target
modal
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110169737.6A
Other languages
Chinese (zh)
Other versions
CN112862860B (en)
Inventor
朱鹏飞
王童
胡清华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110169737.6A priority Critical patent/CN112862860B/en
Publication of CN112862860A publication Critical patent/CN112862860A/en
Application granted granted Critical
Publication of CN112862860B publication Critical patent/CN112862860B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an object perception image fusion method for multi-modal target tracking, which comprises the following steps: an adaptive fused image is acquired by inputting an RGB image and a thermal-modality image into two channels, each channel containing three aggregation modules, performing saliency detection on the images, and cascading them with the outputs of different layers in the network; from the concatenated deep features, an adaptive guidance network reconstructs the fused image by evaluating the image gray values, pixel intensity, a similarity measure and a consistency loss. A feature combination module combines the features extracted from the template and search images with a depth-wise cross-correlation operation to generate corresponding similarity features for subsequent target localization, followed by classification and regression. Tracking training uses an anchor-free regression network that takes all pixels inside the ground-truth bounding box as training samples, so that weak predictions can be corrected to the proper position to a certain extent. Deformable convolution is adopted to change the sampling positions so that they align with the predicted bounding box and the classification confidence corresponds to the target object, making the confidence more reliable.

Description

Object perception image fusion method for multi-modal target tracking
Technical Field
The invention relates to the field of target tracking, and in particular to an object perception image fusion method for multi-modal target tracking.
Background
Target tracking is widely used in video surveillance, autonomous driving and robotics, and has long been a focus of computer vision. It is defined as follows: given the size and position of an object in the initial frame of a video sequence, predict its size and position in subsequent frames. The main challenges of tracking are that the target object may be severely occluded, greatly deformed, or subject to illumination variation.
In recent years it has been found that thermal infrared provides a more stable signal, and the popularization of thermal infrared cameras has driven progress in fields such as object segmentation, person re-identification (Re-ID) and pedestrian detection; multi-modal (RGBT) tracking has therefore attracted increasing research attention. Multi-modal tracking can be seen as an extension of video tracking that aims to estimate the state of an object by exploiting the complementary advantages of RGB information and thermal images. How to fully exploit RGB and thermal images for robust multi-modal tracking remains an open problem.
Existing work focuses primarily on integrating modality-specific information from two directions. Most methods still use sparse representations (usually with hand-crafted features) for multi-modal tracking. For example, one approach incorporates the modality weights and the sparse-representation computation into a single model and performs online object tracking in a Bayesian filtering framework; another proposes a cross-modal ranking algorithm to improve the robustness of the computation and thereby generate a more robust RGBT feature representation. These methods weight the two modalities equally, whereas in practice one modality may carry more value than the other. On the other hand, several baseline RGBT trackers have been designed by extending single-modality trackers to multi-modality ones: the features of the RGB and thermal modalities are directly concatenated into a vector that is fed into the tracker, and weights are used to fuse the complementary features of the RGB and thermal images so that modality-specific properties can be exploited effectively. However, this neglects the potential value of modality-shared and object information, which is crucial for multi-modality tracking.
These methods rely on hand-crafted features or single-structure adapter deep networks for target localization, and struggle with the appearance changes caused by target deformation, sudden motion, background clutter, occlusion and so on.
Disclosure of Invention
The invention provides an object perception image fusion method for multi-modal target tracking. An unsupervised fusion method is applied to multi-modal tracking to make the fused image more salient, a feature combination module is introduced to improve the reliability of the classification network, and the tracking performance and robustness are greatly improved, as described in detail below:
a method of object-aware image fusion for multi-modal target tracking, the method comprising the steps of:
acquiring an adaptive fused image: the RGB image and the thermal-modality image are input into two channels respectively, each channel comprising three aggregation modules; saliency detection is performed on the images, which are cascaded and connected with the outputs of different layers in the network; from the concatenated deep features, the adaptive guidance network reconstructs the fused image by evaluating, through a sliding window, the image gray values, pixel intensity, similarity measure and consistency loss;
training the adaptive fusion network with a paired training set, adjusting the network parameters according to a mixed loss function, testing the fusion network model with a validation set, and improving the network weights by tuning the hyper-parameters;
wherein the mixed loss function is specifically:
SSIM(X, Y | W) = ((2·μ_X·μ_Y + C) · (2·σ_XY + C)) / ((μ_X² + μ_Y² + C) · (σ_X² + σ_Y² + C))
SSIM is a measure of the structural similarity between two images, where X and Y denote the two modality images, μ_X and μ_Y are their average pixel intensities within the window, C = 9 × 10⁴, W is the sliding window, σ is the standard deviation, and σ_XY is the cross-correlation between X and Y. The sliding window size is 11 × 11, and the SSIM score is measured by calculating the average intensity of the pixels in each window.
E(I | W) = (1 / |W|) Σ_{i∈W} P_i
Score(W) = SSIM(I_F, I_t | W) if E(I_t | W) ≥ E(I_rgb | W), and SSIM(I_F, I_rgb | W) otherwise
L_SSIM = 1 − (1/N) Σ_{W=1}^{N} Score(W)
where N is the total number of sliding windows in a single image, I_rgb is the RGB image, I_t is the thermal image, and I_F is the fused image. When E(I_t | W) ≥ E(I_rgb | W), the thermal infrared image has more texture information, and SSIM will guide the network to retain the thermal infrared features, which makes I_F more similar to I_t, and vice versa; P_i is the value of pixel i.
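As an illustration of how this window-based selection might be computed, the following is a minimal NumPy sketch: it box-filters each image to obtain the window means E(I|W), evaluates a single-constant SSIM per window, keeps whichever source modality has the larger mean intensity, and averages. The function and variable names, the small default value of the stabilizing constant C, and the 1 − mean(score) reduction are illustrative assumptions rather than the patent's exact formulation.

```python
import numpy as np
from scipy.signal import convolve2d

def _window_mean(img, win=11):
    """Mean pixel intensity E(I|W) over every win x win sliding window."""
    k = np.ones((win, win)) / (win * win)
    return convolve2d(img, k, mode="valid")

def ssim_map(a, b, C=9e-4, win=11):
    """Window-wise SSIM between images a and b (single-constant form)."""
    mu_a, mu_b = _window_mean(a, win), _window_mean(b, win)
    var_a = np.clip(_window_mean(a * a, win) - mu_a ** 2, 0, None)
    var_b = np.clip(_window_mean(b * b, win) - mu_b ** 2, 0, None)
    cov = _window_mean(a * b, win) - mu_a * mu_b
    return ((2 * mu_a * mu_b + C) * (2 * cov + C)) / (
        (mu_a ** 2 + mu_b ** 2 + C) * (var_a + var_b + C))

def ssim_fusion_loss(i_rgb, i_t, i_f, win=11):
    """Per window, keep the modality with the larger mean intensity E(.|W)
    and score the fused image against it; return 1 - mean(score)."""
    prefer_thermal = _window_mean(i_t, win) >= _window_mean(i_rgb, win)
    score = np.where(prefer_thermal,
                     ssim_map(i_f, i_t, win=win),
                     ssim_map(i_f, i_rgb, win=win))
    return 1.0 - score.mean()
```

The `mode="valid"` convolution means only fully covered 11 × 11 windows contribute to the score.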
In experiments with anchor-based trackers, we find that when the predicted box is inaccurate the tracker quickly loses the target. The fundamental reason is that, in training, these methods rely on anchor boxes whose IoU with the ground-truth box is greater than a threshold (i.e. IoU ≥ 0.6), so they cannot accurately locate the target region when, for example, the anchor overlap is small. To solve this problem, we use an anchor-free regression network for tracking training and take all pixels inside the bounding box in the ground truth as training samples, so that weak predictions can be corrected to the proper position to a certain extent. If a pixel coordinate (x, y) falls within the ground-truth box B, it is regarded as a regression sample. Let B = (x0, y0, x1, y1) ∈ R⁴ denote the upper-left and lower-right corners of the ground-truth box; the sample label is computed as T = (d1, d2, d3, d4), where
d1=x-x0,d2=y-y0,
d3=x1-x,d4=y1-y,
which represent the distances from the location to the four edges of the bounding box; the regression network learns them through four 3 × 3 convolutional layers. In most Siamese tracking methods, the classification confidence is estimated by sampling features from a fixed region of the feature map; such features describe a fixed local region of the image and cannot be rescaled to accommodate changes in object scale, with the result that the classification confidence is not reliable for distinguishing the target object from a complex background.
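A small sketch of how these per-pixel regression labels could be generated for all pixels falling inside the ground-truth box; the array layout and function name are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def regression_targets(box, height, width):
    """Per-pixel anchor-free regression labels.

    box = (x0, y0, x1, y1) is the ground-truth bounding box; every pixel
    (x, y) inside it is a positive sample with label
    (x - x0, y - y0, x1 - x, y1 - y), i.e. the distances to the four edges.
    Returns targets of shape (H, W, 4) and a boolean positive-sample mask.
    """
    x0, y0, x1, y1 = box
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    targets = np.stack([xs - x0, ys - y0, x1 - xs, y1 - ys], axis=-1)
    inside = targets.min(axis=-1) > 0          # all four distances positive
    return targets, inside
```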
Deformable convolution is adopted to change the sampling positions so that they align with the predicted bounding box and the classification confidence corresponds to the target object, making the confidence more reliable. Specifically, for each position (c_x, c_y) in the classification map there is an object bounding box M = (m_x, m_y, m_w, m_h) predicted by the regression network, where (m_x, m_y) denotes the box center and (m_w, m_h) its width and height. The goal is to estimate the classification confidence at each position by sampling features in the corresponding candidate region M. A spatial transformation T is applied to the sampling network G, changing the fixed sampling positions into the predicted positions, which can be expressed as:
F′(c) = Σ_{Δτ∈T} w(Δτ) · F(c + Δτ)
where F is the input feature map, w denotes the learned convolution weights, c is a position in the feature map, F′ is the output feature map, and the spatial transformation Δτ ∈ T represents the distance vector that aligns the original sampling point with the predicted bounding box.
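To illustrate the alignment idea, the sketch below samples a 3 × 3 grid of features spanning each predicted box with `torch.nn.functional.grid_sample`; it is a simplified stand-in for the deformable-convolution sampling described above (one box per batch element, normalized box coordinates, and all names are assumptions).

```python
import torch
import torch.nn.functional as F

def box_aligned_features(feat, boxes, k=3):
    """Sample a k*k grid of features spanning each predicted box.

    feat:  (B, C, H, W) feature map.
    boxes: (B, 4) predicted boxes (m_x, m_y, m_w, m_h) in normalized [0, 1]
           feature-map coordinates.
    Returns (B, C, k, k) box-aligned features.
    """
    b = feat.size(0)
    mx, my, mw, mh = boxes.unbind(dim=1)                      # each (B,)
    lin = torch.linspace(-0.5, 0.5, k, device=feat.device, dtype=feat.dtype)
    gy, gx = torch.meshgrid(lin, lin, indexing="ij")          # relative (k, k) grid
    # grid_sample expects (x, y) coordinates in [-1, 1]
    x = (mx.view(b, 1, 1) + gx * mw.view(b, 1, 1)) * 2 - 1
    y = (my.view(b, 1, 1) + gy * mh.view(b, 1, 1)) * 2 - 1
    grid = torch.stack([x, y], dim=-1)                        # (B, k, k, 2)
    return F.grid_sample(feat, grid, align_corners=False)
```

The sampled features would then feed the classification head, so that the confidence at each location reflects the region covered by its predicted box rather than a fixed neighborhood.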
The technical scheme provided by the invention has the following beneficial effects:
1. The invention provides a multi-modal image fusion target tracking framework with a salient target: multi-modal image fusion with salient targets can effectively enrich the texture information of the fused image; an unsupervised fusion method is proposed that enhances network robustness and removes redundant information from the fused image; a method of training the model according to the ground-truth bounding box is provided, and the fusion network model trained in this way performs excellently;
2. The invention provides a feature combination module that improves the reliability of the classification network, and the trained tracking network model performs excellently;
3. The method introduces an anchor-free regression network for tracking training and takes all pixels inside the bounding box in the ground truth as training samples, so that weak predictions can be corrected to the proper position to a certain extent.
Drawings
FIG. 1 is a schematic diagram of a converged network architecture;
FIG. 2 is a graph of RGBT234 performance results;
FIG. 3 is a diagram of the component ablation experiment;
fig. 4 is a flowchart of an object-aware image fusion method for multi-modal target tracking.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
To solve the problems in the background art, the embodiments of the invention provide an object perception image fusion method for multi-modal target tracking. Its highlight is that an unsupervised algorithm is used to fuse the images adaptively, which makes the target more salient and benefits the tracking network; the network can fuse adaptively according to the picture information, which avoids losing the information of one modality due to biased fusion weights and achieves modality information sharing. A feature combination module is also provided to improve the reliability of the classification network.
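The abstract and claims describe the feature combination module as combining the features of the template (sample) and search images with a depth-wise cross-correlation. A minimal PyTorch sketch of that operation, as commonly implemented in Siamese trackers, is given below; tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    """Depth-wise cross-correlation between search-region and template features.

    search:   (B, C, Hs, Ws) features of the search image.
    template: (B, C, Ht, Wt) features of the exemplar/template image.
    Returns a (B, C, Hs-Ht+1, Ws-Wt+1) similarity map, one channel per feature
    channel, which the classification and regression heads then consume.
    """
    b, c, hs, ws = search.shape
    # fold the batch into the channel axis and use grouped convolution so each
    # template channel correlates only with its matching search channel
    search = search.reshape(1, b * c, hs, ws)
    kernel = template.reshape(b * c, 1, template.size(2), template.size(3))
    out = F.conv2d(search, kernel, groups=b * c)
    return out.reshape(b, c, out.size(2), out.size(3))
```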
To address the hidden risks of anchor-box detection, an anchor-free regression network is introduced for tracking training, which can correct weak predictions to a certain degree and improves the robustness of the network.
Example 1
The embodiment of the invention provides an object perception image fusion method for multi-modal target tracking, which comprises the following steps:
101: acquiring an adaptive fused image: the RGB image and the thermal-modality image are input into two channels respectively, each channel comprising three aggregation modules; saliency detection is performed on the images, which are cascaded and connected with the outputs of different layers in the network; from the concatenated deep features, the adaptive guidance network reconstructs the fused image by evaluating, through a sliding window, the image gray values, pixel intensity, similarity measure and consistency loss;
Further, the network structure is shown in fig. 1; the hyper-parameters are adjusted using the validation set.
102: passing the training set through the adaptive fusion network to obtain a fused training set, training the tracking network with the fused training set, dividing the template image and the search region according to the ground-truth box, testing the target tracking model with the validation set, and selecting the optimal tracking network;
103: testing the whole tracking framework with the trained tracking network model and the trained fusion model.
In conclusion, the unsupervised fusion method is applied to multi-modal tracking so that the fused image is more salient; the feature combination module improves the reliability of the classification network; tracking performance and robustness are greatly improved; the anchor-free regression network can correct weak predictions; and deformable convolution improves the classification confidence. The network tracks in real time.
Example 2
The scheme in example 1 is further described below with reference to specific examples and calculation formulas, which are described in detail below:
During training the invention adopts RGBT234, the largest publicly available multi-modal target tracking dataset. It is extended from RGBT210 and comprises 234 aligned RGB and thermal-modality video sequences with about 200,000 frames in total; the longest video sequence reaches 4,000 frames.
The fusion task generates an informative image containing a large amount of thermal information and texture detail. As shown in fig. 1, the fusion network consists of three main components: feature extraction, feature fusion and feature reconstruction. Saliency detection is performed on the whole image, and the outputs of the different layers in the network are cascaded and fully connected. The RGB image and the thermal image are fed into two channels respectively; each channel consists of one layer C1 and one dense block containing D1, D2 and D3. The first layer C1 contains a 3 × 3 convolution to extract low-level features, and each dense-block layer also uses 3 × 3 convolutions. In the feature-fusion part, the deep features are directly concatenated. Finally, the result of the fusion layer passes through another five convolutional layers (C2, C3, C4, C5 and C6), which reconstruct the fused image from the fused features. Table 1 summarizes the network architecture in more detail. After the fusion network, the fused image contains abundant thermal information and texture detail.
TABLE 1 (detailed architecture of the fusion network)
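Since Table 1's exact layer configuration is given only as an image, the following PyTorch sketch shows one plausible realization of the described structure (C1 plus a dense block D1–D3 per modality branch, concatenation fusion, and a five-layer C2–C6 reconstruction decoder); the channel widths, activations and single-channel inputs are assumptions.

```python
import torch
import torch.nn as nn

class DenseBranch(nn.Module):
    """One modality branch: C1 followed by a dense block of D1-D3 (all 3x3 convs)."""
    def __init__(self, in_ch=1, ch=16):
        super().__init__()
        self.c1 = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.dense = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch * (i + 1), ch, 3, padding=1), nn.ReLU(inplace=True))
            for i in range(3))

    def forward(self, x):
        feats = [self.c1(x)]
        for layer in self.dense:             # dense connections: reuse all earlier outputs
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)       # deep features, 4*ch channels

class FusionNet(nn.Module):
    """Two branches, concatenation fusion, and a C2-C6 reconstruction decoder."""
    def __init__(self, in_ch=1, ch=16):
        super().__init__()
        self.rgb = DenseBranch(in_ch, ch)
        self.thermal = DenseBranch(in_ch, ch)
        widths = [8 * ch, 4 * ch, 2 * ch, ch, ch]        # assumed channel widths
        layers = []
        for cin, cout in zip(widths, widths[1:] + [1]):  # C2 ... C6
            layers += [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True)]
        layers[-1] = nn.Sigmoid()                        # fused image in [0, 1]
        self.decode = nn.Sequential(*layers)

    def forward(self, i_rgb, i_t):
        fused = torch.cat([self.rgb(i_rgb), self.thermal(i_t)], dim=1)
        return self.decode(fused)

# e.g. FusionNet()(torch.rand(2, 1, 128, 128), torch.rand(2, 1, 128, 128)) -> (2, 1, 128, 128)
```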
In tests of anchor-based trackers, when the predicted box is inaccurate the tracker quickly loses the target. The fundamental reason is that these methods rely on anchor boxes whose IoU with the ground-truth box is greater than a threshold (i.e. IoU ≥ 0.6), so they cannot accurately locate the target region when, for example, the anchor overlap is small. To solve this problem, an anchor-free regression network is used for tracking training, and all pixels inside the bounding box in the ground truth are taken as training samples, so that weak predictions can be corrected to the proper position to a certain extent.
In most Siamese tracking methods, the classification confidence is estimated by sampling features from a fixed region of the feature map; such features describe a fixed local region of the image and cannot be rescaled to accommodate changes in object scale, with the result that the classification confidence is not reliable for distinguishing the target object from a complex background. Deformable convolution is adopted to change the sampling positions so that they align with the predicted bounding box and the classification confidence corresponds to the target object, making the confidence more reliable.
In the adaptive fusion network we use the SSIM loss, a measure of the structural similarity between two images. To account for gradient changes and suppress some noise, we introduce an additional objective function and design a mixed loss function, described as follows:
OA(i, j) = I_t(i, j) + I_rgb(i, j) − 2·I_F(i, j)   (1)
L_OA = Σ_{i,j} ‖OA(i, j)‖₂   (2)
where OA is the difference between the original images and the fused image, and ‖·‖₂ is the ℓ₂ distance. Since the two types of loss are not of the same order of magnitude, a hyper-parameter λ is set, and the loss function is described as follows:
Loss_F = L_SSIM + λ·L_OA   (3)
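A short sketch of the mixed fusion loss of equation (3), assuming a mean-squared reduction for the ℓ₂ term of equation (2) and taking the SSIM-guided term from the earlier sketch; the default λ = 0.01 follows the value mentioned in Example 3.

```python
import torch

def fusion_loss(i_rgb, i_t, i_f, l_ssim, lam=0.01):
    """Mixed fusion loss of equation (3): Loss_F = L_SSIM + lambda * L_OA.

    i_rgb, i_t, i_f: (B, 1, H, W) source images and fused image in [0, 1].
    l_ssim: the pre-computed SSIM-guided loss term.
    """
    oa = i_t + i_rgb - 2.0 * i_f      # equation (1): OA(i, j)
    l_oa = oa.pow(2).mean()           # assumed mean-squared reduction of the l2 term (2)
    return l_ssim + lam * l_oa
```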
To optimize the tracking network, the regression and classification networks are trained with an IoU loss and a binary cross-entropy loss, where the losses are defined as:
L_reg = −Σ_i ln(IoU(P_reg, T*))   (4)
where P_reg denotes the prediction and i indexes the training samples; the classification loss L_cls is defined as:
L_cls = −Σ_j [p*·log(p) + (1 − p*)·log(1 − p)]   (5)
where p is the classification score map, j indexes the classification training samples, and p* denotes the true label. More specifically, p* is a binary label in which pixels near the center of the object are labeled 1, defined as:
p* = 1 if the pixel lies near the center of the object, and p* = 0 otherwise   (6)
The whole network is jointly trained to optimize the following objective:
Loss_T = L_reg + L_cls   (7)
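The tracking losses of equations (4), (5) and (7) might be implemented as follows, with boxes expressed in the (d1, d2, d3, d4) edge-distance form defined above; the reductions and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, target, eps=1e-7):
    """Equation (4): -sum(ln IoU) between predicted and ground-truth boxes,
    both given as (N, 4) distances (d1, d2, d3, d4) to the four box edges."""
    pl, pt, pr, pb = pred.unbind(dim=1)
    gl, gt, gr, gb = target.unbind(dim=1)
    inter = (torch.min(pl, gl) + torch.min(pr, gr)) * \
            (torch.min(pt, gt) + torch.min(pb, gb))
    union = (pl + pr) * (pt + pb) + (gl + gr) * (gt + gb) - inter
    return -torch.log(inter / (union + eps) + eps).sum()

def cls_loss(score, label):
    """Equation (5): binary cross-entropy between the score map p (already in
    (0, 1), e.g. after a sigmoid) and the binary label p*."""
    return F.binary_cross_entropy(score, label, reduction="sum")

def tracking_loss(pred_box, gt_box, score, label):
    """Equation (7): joint objective Loss_T = L_reg + L_cls."""
    return iou_loss(pred_box, gt_box) + cls_loss(score, label)
```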
The fusion network model is trained without the tracking network: its input is a pair of aligned RGB and thermal infrared images, and the network parameters are adjusted according to the mixed loss function. During tracking-network training, the fusion network is not updated; the input of the whole network is likewise a pair of aligned RGB and thermal infrared images.
Specifically, training the fusion network model comprises the following steps:
1. obtaining the loss for a batch of images using formulas (1) to (3);
2. updating the fusion network parameters by back-propagation with the SGD (stochastic gradient descent) algorithm according to the loss obtained in step 1;
3. adjusting the hyper-parameter λ using the validation set;
Training the tracking network model specifically comprises the following steps (a brief training-loop sketch is given after the steps):
1. computing the classification loss from the true labels defined by formula (6) using formula (5);
2. computing the IoU loss between the network prediction and the ground truth using formula (4);
3. obtaining the total loss according to formula (7) and updating the tracking network parameters by back-propagation with the SGD (stochastic gradient descent) algorithm;
4. evaluating the network performance on the test set and adjusting the learning rate.
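A bare-bones sketch of the tracking-network training steps above: each batch is scored with the joint loss of equation (7) (for example, the `tracking_loss` sketch given earlier, passed in as `loss_fn`) and the parameters are updated with SGD. The model interface, data-loader fields and hyper-parameter values are assumptions.

```python
import torch

def train_tracking_network(model, loader, loss_fn, epochs=20, lr=1e-3):
    """Steps 1-3: compute the joint loss on each batch and back-propagate with SGD."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.5)
    for _ in range(epochs):
        for template, search, gt_dist, gt_label in loader:
            score, pred_dist = model(template, search)       # assumed model outputs
            loss = loss_fn(pred_dist, gt_dist, score, gt_label)
            opt.zero_grad()
            loss.backward()                                   # back-propagation
            opt.step()                                        # SGD parameter update
        # step 4: held-out performance would normally drive the learning-rate
        # adjustment; a fixed schedule stands in for it in this sketch
        scheduler.step()
    return model
```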
The embodiment of the invention has the following advantages:
First, a multi-modal target tracking image fusion framework is provided that fuses multi-modal information adaptively through an unsupervised algorithm, giving the network more salient information and greatly improving the robustness of the tracker. Second, an anchor-free regression method is provided; the trained model performs well, and weak predictions (cases where the anchor overlap is small) can be corrected to a certain degree. Third, a deformable-convolution feature alignment method is provided: without affecting model accuracy, the model uses deformable convolution to change the sampling positions so that they align with the predicted bounding box and the classification confidence corresponds to the target object, making the confidence more reliable; the running speed is real-time.
In conclusion, the unsupervised fusion method is applied to multi-modal tracking so that the fused image is more salient; the feature combination module improves the reliability of the classification network; tracking performance and robustness are greatly improved; the anchor-free regression network can correct weak predictions; and deformable convolution improves the classification confidence. The network tracks in real time.
Example 3
Example 1 of the embodiments of the invention is shown in fig. 3, which shows the fusion result when the parameter λ is 0.01: the left is the RGB image, the middle is the thermal image, and the right is the fused image, which contains a large amount of thermal information and texture detail.
Embodiment 2 of the invention is shown in figs. 2 and 3, the PR and SR score plots of 12 trackers on the RGBT234 dataset. PR is the percentage of frames whose output position lies within a given threshold distance of the ground truth, and SR is the ratio of successful frames whose overlap exceeds a threshold. The tracker using this approach is clearly better than the others: in PR it exceeds the second-best tracker MANet (77.8%) by more than 3.2%, and in SR it exceeds the second-best MANet (54.4%) by more than 6.1%.
In the embodiments of the invention, the models of the devices are not limited except where specifically stated, as long as the devices can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1. An object-aware image fusion method for multi-modal target tracking, the method comprising the steps of:
acquiring an adaptive fused image: inputting an RGB image and a thermal-modality image into two channels respectively, each channel comprising three aggregation modules; performing saliency detection on the images and cascading and connecting them with the outputs of different layers in the network; and, from the concatenated deep features, reconstructing the fused image with an adaptive guidance network by evaluating, through a sliding window, the image gray values, pixel intensity, similarity measure and consistency loss;
combining, by a feature combination module, the features extracted from the template and search images using a depth-wise cross-correlation operation to generate corresponding similarity features for subsequent target localization, followed by classification and regression for target localization;
performing tracking training with an anchor-free regression network, wherein all pixels inside the bounding box in the ground truth are used as training samples; and changing the sampling positions using deformable convolution so that they align with the predicted bounding box and the classification confidence corresponds to the target object.
2. The object-aware image fusion method for multi-modal target tracking according to claim 1, wherein the similarity measure in the network in the step of obtaining the adaptive fusion image is specifically:
SSIM(X, Y | W) = ((2·μ_X·μ_Y + C) · (2·σ_XY + C)) / ((μ_X² + μ_Y² + C) · (σ_X² + σ_Y² + C))
where X and Y denote the two modality images, μ_X and μ_Y are their average pixel intensities within the window, C = 9 × 10⁴, W is the sliding window, σ is the standard deviation, and σ_XY is the cross-correlation between X and Y; the sliding window size is 11 × 11, and the SSIM score is measured by calculating the average intensity of the pixels in each window;
E(I | W) = (1 / |W|) Σ_{i∈W} P_i
L_SSIM = 1 − (1/N) Σ_W Score(W), where Score(W) = SSIM(I_F, I_t | W) if E(I_t | W) ≥ E(I_rgb | W), and SSIM(I_F, I_rgb | W) otherwise;
where P_i is the value of pixel i; when E(I_t | W) ≥ E(I_rgb | W), the thermal infrared image has more texture information and SSIM guides the network to retain the thermal infrared features, so that I_F is more similar to I_t, and vice versa.
3. The object-aware image fusion method for multi-modal target tracking according to claim 1, wherein the changing the sampling position by using deformable convolution is specifically:
F′(c) = Σ_{Δτ∈T} w(Δτ) · F(c + Δτ)
performing a spatial transformation T on the sampling network G to change the fixed sampling positions into the predicted positions, where F is the input feature map, w denotes the learned convolution weights, c is a position in the feature map, F′ is the output feature map, and the spatial transformation Δτ ∈ T represents the distance vector that aligns the original sampling point with the predicted bounding box.
4. The object-aware image fusion method for multi-modal target tracking according to claim 3, wherein changing the sampling positions by deformable convolution comprises: aligning the sampling positions with the predicted bounding box and making the classification confidence correspond to the target object.
5. The object-aware image fusion method for multi-modal target tracking according to claim 4, wherein making the classification confidence correspond to the target object specifically comprises: for each position (c_x, c_y) in the classification map, there is an object bounding box M = (m_x, m_y, m_w, m_h) predicted by the regression network, where (m_x, m_y) denotes the box center and (m_w, m_h) its width and height.
6. The object-aware image fusion method for multi-modal target tracking according to claim 1, wherein the tracking training with the anchor-free regression network specifically comprises: if a pixel coordinate (x, y) falls within the ground-truth box B, it is regarded as a regression sample; let B = (x0, y0, x1, y1) ∈ R⁴ denote the upper-left and lower-right corners of the ground-truth box, and the sample label is computed as T = (d1, d2, d3, d4),
d1=x-x0,d2=y-y0,
d3=x1-x,d4=y1-y,
Representing the distance from the location to the four edges of the bounding box.
CN202110169737.6A 2021-02-07 2021-02-07 Object perception image fusion method for multi-mode target tracking Active CN112862860B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110169737.6A CN112862860B (en) 2021-02-07 2021-02-07 Object perception image fusion method for multi-mode target tracking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110169737.6A CN112862860B (en) 2021-02-07 2021-02-07 Object perception image fusion method for multi-mode target tracking

Publications (2)

Publication Number Publication Date
CN112862860A true CN112862860A (en) 2021-05-28
CN112862860B CN112862860B (en) 2023-08-01

Family

ID=75989032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110169737.6A Active CN112862860B (en) 2021-02-07 2021-02-07 Object perception image fusion method for multi-mode target tracking

Country Status (1)

Country Link
CN (1) CN112862860B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240994A (en) * 2021-11-04 2022-03-25 北京工业大学 Target tracking method and device, electronic equipment and storage medium
CN117893873A (en) * 2024-03-18 2024-04-16 安徽大学 Active tracking method based on multi-mode information fusion

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679677A (en) * 2013-12-12 2014-03-26 杭州电子科技大学 Dual-model image decision fusion tracking method based on mutual updating of models
CN107808167A (en) * 2017-10-27 2018-03-16 深圳市唯特视科技有限公司 A kind of method that complete convolutional network based on deformable segment carries out target detection
CN108875465A (en) * 2017-05-26 2018-11-23 北京旷视科技有限公司 Multi-object tracking method, multiple target tracking device and non-volatile memory medium
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN111797716A (en) * 2020-06-16 2020-10-20 电子科技大学 Single target tracking method based on Siamese network
CN111861888A (en) * 2020-07-27 2020-10-30 上海商汤智能科技有限公司 Image processing method, image processing device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679677A (en) * 2013-12-12 2014-03-26 杭州电子科技大学 Dual-model image decision fusion tracking method based on mutual updating of models
CN108875465A (en) * 2017-05-26 2018-11-23 北京旷视科技有限公司 Multi-object tracking method, multiple target tracking device and non-volatile memory medium
CN107808167A (en) * 2017-10-27 2018-03-16 深圳市唯特视科技有限公司 A kind of method that complete convolutional network based on deformable segment carries out target detection
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN111797716A (en) * 2020-06-16 2020-10-20 电子科技大学 Single target tracking method based on Siamese network
CN111861888A (en) * 2020-07-27 2020-10-30 上海商汤智能科技有限公司 Image processing method, image processing device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Hui Li et al.: "DenseFuse: A Fusion Approach to Infrared and Visible Images", arXiv, pages 1-11 *
Xu Zhengmei et al.: "Research on an Image Saliency Detection Algorithm Based on Multi-modal Information Fusion", Journal of Shaoguan University (Natural Science), vol. 39, no. 12, pages 13-17 *
Wang Kai et al.: "Target Tracking Based on Infrared and Visible Light Fusion", Computer Systems & Applications, vol. 27, no. 1, page 149 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240994A (en) * 2021-11-04 2022-03-25 北京工业大学 Target tracking method and device, electronic equipment and storage medium
CN117893873A (en) * 2024-03-18 2024-04-16 安徽大学 Active tracking method based on multi-mode information fusion
CN117893873B (en) * 2024-03-18 2024-06-07 安徽大学 Active tracking method based on multi-mode information fusion

Also Published As

Publication number Publication date
CN112862860B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
CN110276316B (en) Human body key point detection method based on deep learning
CN107832672B (en) Pedestrian re-identification method for designing multi-loss function by utilizing attitude information
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN109598196B (en) Multi-form multi-pose face sequence feature point positioning method
CN111046734B (en) Multi-modal fusion sight line estimation method based on expansion convolution
CN113673510B (en) Target detection method combining feature point and anchor frame joint prediction and regression
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN111915571A (en) Image change detection method, device, storage medium and equipment fusing residual error network and U-Net network
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
WO2024032010A1 (en) Transfer learning strategy-based real-time few-shot object detection method
CN109086659A (en) A kind of Human bodys' response method and apparatus based on multimode road Fusion Features
CN115797736B (en) Training method, device, equipment and medium for target detection model and target detection method, device, equipment and medium
Liu et al. Pose-adaptive hierarchical attention network for facial expression recognition
CN112862860A (en) Object perception image fusion method for multi-modal target tracking
CN112084952B (en) Video point location tracking method based on self-supervision training
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN112036260A (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN111582154A (en) Pedestrian re-identification method based on multitask skeleton posture division component
CN116824625A (en) Target re-identification method based on generation type multi-mode image fusion
CN114170537A (en) Multi-mode three-dimensional visual attention prediction method and application thereof
Zhang et al. Unsupervised depth estimation from monocular videos with hybrid geometric-refined loss and contextual attention
Jiang et al. Application of a fast RCNN based on upper and lower layers in face recognition
CN116682140A (en) Three-dimensional human body posture estimation algorithm based on attention mechanism multi-mode fusion
CN116977674A (en) Image matching method, related device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant