CN112862860A - Object perception image fusion method for multi-modal target tracking - Google Patents

Object perception image fusion method for multi-modal target tracking

Info

Publication number
CN112862860A
CN112862860A
Authority
CN
China
Prior art keywords
image
network
target
modal
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110169737.6A
Other languages
Chinese (zh)
Other versions
CN112862860B (en)
Inventor
朱鹏飞
王童
胡清华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110169737.6A priority Critical patent/CN112862860B/en
Publication of CN112862860A publication Critical patent/CN112862860A/en
Application granted granted Critical
Publication of CN112862860B publication Critical patent/CN112862860B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an object perception image fusion method for multi-modal target tracking, which comprises the following steps: an adaptive fused image is acquired by inputting an RGB image and a thermal-modality image into two channels, each channel containing three aggregation modules, performing saliency detection on the images, and cascading them with the outputs of different layers in the network; from the concatenated deep features, an adaptive guidance network reconstructs the fused image by evaluating the image gray values, pixel intensity, a similarity measure and a consistency loss. A feature combination module combines the features extracted from the template and search images with a depth-wise cross-correlation operation to generate corresponding similarity features for subsequent target localization, followed by classification and regression. Tracking training uses an anchor-free regression network that takes all pixels inside the ground-truth bounding box as training samples, so that weak predictions can be corrected to the proper position to a certain extent. Deformable convolution is adopted to change the sampling positions so that they align with the predicted bounding box and the classification confidence corresponds to the target object, making the confidence more reliable.

Description

Object perception image fusion method for multi-modal target tracking
Technical Field
The invention relates to the field of target tracking, and in particular to an object perception image fusion method for multi-modal target tracking.
Background
Target tracking is widely used in video surveillance, autonomous driving and robotics, and has long been a focus of computer vision. It is defined as follows: given the size and position of an object in the initial frame of a video sequence, predict its size and position in subsequent frames. The main challenges of tracking are that the target object may be severely occluded, greatly deformed, or subject to illumination variation.
In recent years it has been found that thermal infrared provides a more stable signal, and the popularization of thermal infrared cameras has driven progress in fields such as object segmentation, person re-identification (Re-ID) and pedestrian detection; multi-modal (RGBT) tracking has therefore attracted increasing research attention. Multi-modal tracking can be seen as an extension of video tracking that aims to estimate the state of an object by exploiting the complementary advantages of RGB information and thermal images. How to fully exploit RGB and thermal images for robust multi-modal tracking remains an open problem.
Existing work focuses primarily on integrating modality-specific information from two directions. Most methods still use sparse representations (usually with hand-crafted features) for multi-modal tracking. For example, one approach incorporates the modality weights and the sparse-representation computation into a single model and performs online object tracking in a Bayesian filtering framework; another proposes a cross-modal ranking algorithm to improve the robustness of the computation and thereby generate a more robust RGBT feature representation. These methods weight the two modalities equally, whereas in practice one modality may carry more value than the other. On the other hand, several baseline RGBT trackers have been designed by extending single-modality trackers to multi-modality ones: the features of the RGB and thermal modalities are directly concatenated into a vector that is fed into the tracker, and weights are used to fuse the complementary features of the RGB and thermal images so that modality-specific properties can be exploited effectively. However, this neglects the potential value of modality-shared and object information, which is crucial for multi-modality tracking.
These methods rely on hand-crafted features or single-structure adapter deep networks for target localization, and struggle with the appearance changes caused by target deformation, sudden motion, background clutter, occlusion and so on.
Disclosure of Invention
The invention provides an object perception image fusion method for multi-modal target tracking. An unsupervised fusion method is applied to multi-modal tracking to make the fused image more salient, a feature combination module is introduced to improve the reliability of the classification network, and the tracking performance and robustness are greatly improved, as described in detail below:
a method of object-aware image fusion for multi-modal target tracking, the method comprising the steps of:
acquiring an adaptive fused image: the RGB image and the thermal-modality image are input into two channels respectively, each channel comprising three aggregation modules; saliency detection is performed on the images, which are cascaded and connected with the outputs of different layers in the network; from the concatenated deep features, the adaptive guidance network reconstructs the fused image by evaluating, through a sliding window, the image gray values, pixel intensity, similarity measure and consistency loss;
training the adaptive fusion network with a paired training set, adjusting the network parameters according to a mixed loss function, testing the fusion network model with a validation set, and improving the network weights by tuning the hyper-parameters;
wherein the mixed loss function is specifically:
SSIM(X, Y | W) = ((2·μ_X·μ_Y + C) · (2·σ_XY + C)) / ((μ_X² + μ_Y² + C) · (σ_X² + σ_Y² + C))
SSIM is a measure of the structural similarity between two images, where X and Y denote the two modality images, μ_X and μ_Y are their average pixel intensities within the window, C = 9 × 10⁴, W is the sliding window, σ is the standard deviation, and σ_XY is the cross-correlation between X and Y. The sliding window size is 11 × 11, and the SSIM score is measured by calculating the average intensity of the pixels in each window.
E(I | W) = (1 / |W|) Σ_{i∈W} P_i
Score(W) = SSIM(I_F, I_t | W) if E(I_t | W) ≥ E(I_rgb | W), and SSIM(I_F, I_rgb | W) otherwise
L_SSIM = 1 − (1/N) Σ_{W=1}^{N} Score(W)
where N is the total number of sliding windows in a single image, I_rgb is the RGB image, I_t is the thermal image, and I_F is the fused image. When E(I_t | W) ≥ E(I_rgb | W), the thermal infrared image has more texture information, and SSIM will guide the network to retain the thermal infrared features, which makes I_F more similar to I_t, and vice versa; P_i is the value of pixel i.
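As an illustration of how this window-based selection might be computed, the following is a minimal NumPy sketch: it box-filters each image to obtain the window means E(I|W), evaluates a single-constant SSIM per window, keeps whichever source modality has the larger mean intensity, and averages. The function and variable names, the small default value of the stabilizing constant C, and the 1 − mean(score) reduction are illustrative assumptions rather than the patent's exact formulation.

```python
import numpy as np
from scipy.signal import convolve2d

def _window_mean(img, win=11):
    """Mean pixel intensity E(I|W) over every win x win sliding window."""
    k = np.ones((win, win)) / (win * win)
    return convolve2d(img, k, mode="valid")

def ssim_map(a, b, C=9e-4, win=11):
    """Window-wise SSIM between images a and b (single-constant form)."""
    mu_a, mu_b = _window_mean(a, win), _window_mean(b, win)
    var_a = np.clip(_window_mean(a * a, win) - mu_a ** 2, 0, None)
    var_b = np.clip(_window_mean(b * b, win) - mu_b ** 2, 0, None)
    cov = _window_mean(a * b, win) - mu_a * mu_b
    return ((2 * mu_a * mu_b + C) * (2 * cov + C)) / (
        (mu_a ** 2 + mu_b ** 2 + C) * (var_a + var_b + C))

def ssim_fusion_loss(i_rgb, i_t, i_f, win=11):
    """Per window, keep the modality with the larger mean intensity E(.|W)
    and score the fused image against it; return 1 - mean(score)."""
    prefer_thermal = _window_mean(i_t, win) >= _window_mean(i_rgb, win)
    score = np.where(prefer_thermal,
                     ssim_map(i_f, i_t, win=win),
                     ssim_map(i_f, i_rgb, win=win))
    return 1.0 - score.mean()
```

The `mode="valid"` convolution means only fully covered 11 × 11 windows contribute to the score.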
In experiments with anchor-based trackers, we find that when the predicted box is inaccurate the tracker quickly loses the target. The fundamental reason is that, in training, these methods rely on anchor boxes whose IoU with the ground-truth box is greater than a threshold (i.e. IoU ≥ 0.6), so they cannot accurately locate the target region when, for example, the anchor overlap is small. To solve this problem, we use an anchor-free regression network for tracking training and take all pixels inside the bounding box in the ground truth as training samples, so that weak predictions can be corrected to the proper position to a certain extent. If a pixel coordinate (x, y) falls within the ground-truth box B, it is regarded as a regression sample. Let B = (x0, y0, x1, y1) ∈ R⁴ denote the upper-left and lower-right corners of the ground-truth box; the sample label is computed as T = (d1, d2, d3, d4), where
d1=x-x0,d2=y-y0,
d3=x1-x,d4=y1-y,
which represent the distances from the location to the four edges of the bounding box; the regression network learns them through four 3 × 3 convolutional layers. In most Siamese tracking methods, the classification confidence is estimated by sampling features from a fixed region of the feature map; such features describe a fixed local region of the image and cannot be rescaled to accommodate changes in object scale, with the result that the classification confidence is not reliable for distinguishing the target object from a complex background.
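A small sketch of how these per-pixel regression labels could be generated for all pixels falling inside the ground-truth box; the array layout and function name are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def regression_targets(box, height, width):
    """Per-pixel anchor-free regression labels.

    box = (x0, y0, x1, y1) is the ground-truth bounding box; every pixel
    (x, y) inside it is a positive sample with label
    (x - x0, y - y0, x1 - x, y1 - y), i.e. the distances to the four edges.
    Returns targets of shape (H, W, 4) and a boolean positive-sample mask.
    """
    x0, y0, x1, y1 = box
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    targets = np.stack([xs - x0, ys - y0, x1 - xs, y1 - ys], axis=-1)
    inside = targets.min(axis=-1) > 0          # all four distances positive
    return targets, inside
```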
Deformable convolution is adopted to change the sampling positions so that they align with the predicted bounding box and the classification confidence corresponds to the target object, making the confidence more reliable. Specifically, for each position (c_x, c_y) in the classification map there is an object bounding box M = (m_x, m_y, m_w, m_h) predicted by the regression network, where (m_x, m_y) denotes the box center and (m_w, m_h) its width and height. The goal is to estimate the classification confidence at each position by sampling features in the corresponding candidate region M. A spatial transformation T is applied to the sampling network G, changing the fixed sampling positions into the predicted positions, which can be expressed as:
F′(c) = Σ_{Δτ∈T} w(Δτ) · F(c + Δτ)
where F is the input feature map, w denotes the learned convolution weights, c is a position in the feature map, F′ is the output feature map, and the spatial transformation Δτ ∈ T represents the distance vector that aligns the original sampling point with the predicted bounding box.
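To illustrate the alignment idea, the sketch below samples a 3 × 3 grid of features spanning each predicted box with `torch.nn.functional.grid_sample`; it is a simplified stand-in for the deformable-convolution sampling described above (one box per batch element, normalized box coordinates, and all names are assumptions).

```python
import torch
import torch.nn.functional as F

def box_aligned_features(feat, boxes, k=3):
    """Sample a k*k grid of features spanning each predicted box.

    feat:  (B, C, H, W) feature map.
    boxes: (B, 4) predicted boxes (m_x, m_y, m_w, m_h) in normalized [0, 1]
           feature-map coordinates.
    Returns (B, C, k, k) box-aligned features.
    """
    b = feat.size(0)
    mx, my, mw, mh = boxes.unbind(dim=1)                      # each (B,)
    lin = torch.linspace(-0.5, 0.5, k, device=feat.device, dtype=feat.dtype)
    gy, gx = torch.meshgrid(lin, lin, indexing="ij")          # relative (k, k) grid
    # grid_sample expects (x, y) coordinates in [-1, 1]
    x = (mx.view(b, 1, 1) + gx * mw.view(b, 1, 1)) * 2 - 1
    y = (my.view(b, 1, 1) + gy * mh.view(b, 1, 1)) * 2 - 1
    grid = torch.stack([x, y], dim=-1)                        # (B, k, k, 2)
    return F.grid_sample(feat, grid, align_corners=False)
```

The sampled features would then feed the classification head, so that the confidence at each location reflects the region covered by its predicted box rather than a fixed neighborhood.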
The technical scheme provided by the invention has the following beneficial effects:
1. The invention provides a multi-modal image fusion target tracking framework with a salient target: multi-modal image fusion with salient targets can effectively enrich the texture information of the fused image; an unsupervised fusion method is proposed that enhances network robustness and removes redundant information from the fused image; a method of training the model according to the ground-truth bounding box is provided, and the fusion network model trained in this way performs excellently;
2. The invention provides a feature combination module that improves the reliability of the classification network, and the trained tracking network model performs excellently;
3. The method introduces an anchor-free regression network for tracking training and takes all pixels inside the bounding box in the ground truth as training samples, so that weak predictions can be corrected to the proper position to a certain extent.
Drawings
FIG. 1 is a schematic diagram of a converged network architecture;
FIG. 2 is a graph of RGBT234 performance results;
FIG. 3 is a diagram of the component ablation experiment;
fig. 4 is a flowchart of an object-aware image fusion method for multi-modal target tracking.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
To solve the problems in the background art, the embodiments of the invention provide an object perception image fusion method for multi-modal target tracking. Its highlight is that an unsupervised algorithm is used to fuse the images adaptively, which makes the target more salient and benefits the tracking network; the network can fuse adaptively according to the picture information, which avoids losing the information of one modality due to biased fusion weights and achieves modality information sharing. A feature combination module is also provided to improve the reliability of the classification network.
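The abstract and claims describe the feature combination module as combining the features of the template (sample) and search images with a depth-wise cross-correlation. A minimal PyTorch sketch of that operation, as commonly implemented in Siamese trackers, is given below; tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    """Depth-wise cross-correlation between search-region and template features.

    search:   (B, C, Hs, Ws) features of the search image.
    template: (B, C, Ht, Wt) features of the exemplar/template image.
    Returns a (B, C, Hs-Ht+1, Ws-Wt+1) similarity map, one channel per feature
    channel, which the classification and regression heads then consume.
    """
    b, c, hs, ws = search.shape
    # fold the batch into the channel axis and use grouped convolution so each
    # template channel correlates only with its matching search channel
    search = search.reshape(1, b * c, hs, ws)
    kernel = template.reshape(b * c, 1, template.size(2), template.size(3))
    out = F.conv2d(search, kernel, groups=b * c)
    return out.reshape(b, c, out.size(2), out.size(3))
```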
To address the hidden risks of anchor-box detection, an anchor-free regression network is introduced for tracking training, which can correct weak predictions to a certain degree and improves the robustness of the network.
Example 1
The embodiment of the invention provides an object perception image fusion method for multi-modal target tracking, which comprises the following steps:
101: acquiring an adaptive fused image: the RGB image and the thermal-modality image are input into two channels respectively, each channel comprising three aggregation modules; saliency detection is performed on the images, which are cascaded and connected with the outputs of different layers in the network; from the concatenated deep features, the adaptive guidance network reconstructs the fused image by evaluating, through a sliding window, the image gray values, pixel intensity, similarity measure and consistency loss;
Further, the network structure is shown in fig. 1; the hyper-parameters are adjusted using the validation set.
102: passing the training set through the adaptive fusion network to obtain a fused training set, training the tracking network with the fused training set, dividing the template image and the search region according to the ground-truth box, testing the target tracking model with the validation set, and selecting the optimal tracking network;
103: testing the whole tracking framework with the trained tracking network model and the trained fusion model.
In conclusion, the unsupervised fusion method is applied to multi-modal tracking so that the fused image is more salient; the feature combination module improves the reliability of the classification network; tracking performance and robustness are greatly improved; the anchor-free regression network can correct weak predictions; and deformable convolution improves the classification confidence. The network tracks in real time.
Example 2
The scheme in example 1 is further described below with reference to specific examples and calculation formulas, which are described in detail below:
During training the invention adopts RGBT234, the largest publicly available multi-modal target tracking dataset. It is extended from RGBT210 and comprises 234 aligned RGB and thermal-modality video sequences with about 200,000 frames in total; the longest video sequence reaches 4,000 frames.
The fusion task generates an informative image containing a large amount of thermal information and texture detail. As shown in fig. 1, the fusion network consists of three main components: feature extraction, feature fusion and feature reconstruction. Saliency detection is performed on the whole image, and the outputs of the different layers in the network are cascaded and fully connected. The RGB image and the thermal image are fed into two channels respectively; each channel consists of one layer C1 and one dense block containing D1, D2 and D3. The first layer C1 contains a 3 × 3 convolution to extract low-level features, and each dense-block layer also uses 3 × 3 convolutions. In the feature-fusion part, the deep features are directly concatenated. Finally, the result of the fusion layer passes through another five convolutional layers (C2, C3, C4, C5 and C6), which reconstruct the fused image from the fused features. Table 1 summarizes the network architecture in more detail. After the fusion network, the fused image contains abundant thermal information and texture detail.
TABLE 1 (detailed architecture of the fusion network)
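Since Table 1's exact layer configuration is given only as an image, the following PyTorch sketch shows one plausible realization of the described structure (C1 plus a dense block D1–D3 per modality branch, concatenation fusion, and a five-layer C2–C6 reconstruction decoder); the channel widths, activations and single-channel inputs are assumptions.

```python
import torch
import torch.nn as nn

class DenseBranch(nn.Module):
    """One modality branch: C1 followed by a dense block of D1-D3 (all 3x3 convs)."""
    def __init__(self, in_ch=1, ch=16):
        super().__init__()
        self.c1 = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.dense = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch * (i + 1), ch, 3, padding=1), nn.ReLU(inplace=True))
            for i in range(3))

    def forward(self, x):
        feats = [self.c1(x)]
        for layer in self.dense:             # dense connections: reuse all earlier outputs
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)       # deep features, 4*ch channels

class FusionNet(nn.Module):
    """Two branches, concatenation fusion, and a C2-C6 reconstruction decoder."""
    def __init__(self, in_ch=1, ch=16):
        super().__init__()
        self.rgb = DenseBranch(in_ch, ch)
        self.thermal = DenseBranch(in_ch, ch)
        widths = [8 * ch, 4 * ch, 2 * ch, ch, ch]        # assumed channel widths
        layers = []
        for cin, cout in zip(widths, widths[1:] + [1]):  # C2 ... C6
            layers += [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True)]
        layers[-1] = nn.Sigmoid()                        # fused image in [0, 1]
        self.decode = nn.Sequential(*layers)

    def forward(self, i_rgb, i_t):
        fused = torch.cat([self.rgb(i_rgb), self.thermal(i_t)], dim=1)
        return self.decode(fused)

# e.g. FusionNet()(torch.rand(2, 1, 128, 128), torch.rand(2, 1, 128, 128)) -> (2, 1, 128, 128)
```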
In tests of anchor-based trackers, when the predicted box is inaccurate the tracker quickly loses the target. The fundamental reason is that these methods rely on anchor boxes whose IoU with the ground-truth box is greater than a threshold (i.e. IoU ≥ 0.6), so they cannot accurately locate the target region when, for example, the anchor overlap is small. To solve this problem, an anchor-free regression network is used for tracking training, and all pixels inside the bounding box in the ground truth are taken as training samples, so that weak predictions can be corrected to the proper position to a certain extent.
In most Siamese tracking methods, the classification confidence is estimated by sampling features from a fixed region of the feature map; such features describe a fixed local region of the image and cannot be rescaled to accommodate changes in object scale, with the result that the classification confidence is not reliable for distinguishing the target object from a complex background. Deformable convolution is adopted to change the sampling positions so that they align with the predicted bounding box and the classification confidence corresponds to the target object, making the confidence more reliable.
In the adaptive fusion network we use the SSIM loss, a measure of the structural similarity between two images. To account for gradient changes and suppress some noise, we introduce an additional objective function and design a mixed loss function, described as follows:
OA(i, j) = I_t(i, j) + I_rgb(i, j) − 2·I_F(i, j)   (1)
L_OA = Σ_{i,j} ‖OA(i, j)‖₂   (2)
where OA is the difference between the original images and the fused image, and ‖·‖₂ is the ℓ₂ distance. Since the two types of loss are not of the same order of magnitude, a hyper-parameter λ is set, and the loss function is described as follows:
Loss_F = L_SSIM + λ·L_OA   (3)
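A short sketch of the mixed fusion loss of equation (3), assuming a mean-squared reduction for the ℓ₂ term of equation (2) and taking the SSIM-guided term from the earlier sketch; the default λ = 0.01 follows the value mentioned in Example 3.

```python
import torch

def fusion_loss(i_rgb, i_t, i_f, l_ssim, lam=0.01):
    """Mixed fusion loss of equation (3): Loss_F = L_SSIM + lambda * L_OA.

    i_rgb, i_t, i_f: (B, 1, H, W) source images and fused image in [0, 1].
    l_ssim: the pre-computed SSIM-guided loss term.
    """
    oa = i_t + i_rgb - 2.0 * i_f      # equation (1): OA(i, j)
    l_oa = oa.pow(2).mean()           # assumed mean-squared reduction of the l2 term (2)
    return l_ssim + lam * l_oa
```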
To optimize the tracking network, the regression and classification networks are trained with an IoU loss and a binary cross-entropy loss, where the losses are defined as:
L_reg = −Σ_i ln(IoU(P_reg, T*))   (4)
where P_reg denotes the prediction and i indexes the training samples; the classification loss L_cls is defined as:
L_cls = −Σ_j [p*·log(p) + (1 − p*)·log(1 − p)]   (5)
where p is the classification score map, j indexes the classification training samples, and p* denotes the true label. More specifically, p* is a binary label in which pixels near the center of the object are labeled 1, defined as:
p* = 1 if the pixel lies near the center of the object, and p* = 0 otherwise   (6)
The whole network is jointly trained to optimize the following objective:
Loss_T = L_reg + L_cls   (7)
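The tracking losses of equations (4), (5) and (7) might be implemented as follows, with boxes expressed in the (d1, d2, d3, d4) edge-distance form defined above; the reductions and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, target, eps=1e-7):
    """Equation (4): -sum(ln IoU) between predicted and ground-truth boxes,
    both given as (N, 4) distances (d1, d2, d3, d4) to the four box edges."""
    pl, pt, pr, pb = pred.unbind(dim=1)
    gl, gt, gr, gb = target.unbind(dim=1)
    inter = (torch.min(pl, gl) + torch.min(pr, gr)) * \
            (torch.min(pt, gt) + torch.min(pb, gb))
    union = (pl + pr) * (pt + pb) + (gl + gr) * (gt + gb) - inter
    return -torch.log(inter / (union + eps) + eps).sum()

def cls_loss(score, label):
    """Equation (5): binary cross-entropy between the score map p (already in
    (0, 1), e.g. after a sigmoid) and the binary label p*."""
    return F.binary_cross_entropy(score, label, reduction="sum")

def tracking_loss(pred_box, gt_box, score, label):
    """Equation (7): joint objective Loss_T = L_reg + L_cls."""
    return iou_loss(pred_box, gt_box) + cls_loss(score, label)
```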
The fusion network model is trained without the tracking network: its input is a pair of aligned RGB and thermal infrared images, and the network parameters are adjusted according to the mixed loss function. During tracking-network training, the fusion network is not updated; the input of the whole network is likewise a pair of aligned RGB and thermal infrared images.
Specifically, training the fusion network model comprises the following steps:
1. obtaining the loss for a batch of images using formulas (1) to (3);
2. updating the fusion network parameters by back-propagation with the SGD (stochastic gradient descent) algorithm according to the loss obtained in step 1;
3. adjusting the hyper-parameter λ using the validation set;
Training the tracking network model specifically comprises the following steps (a brief training-loop sketch is given after the steps):
1. computing the classification loss from the true labels defined by formula (6) using formula (5);
2. computing the IoU loss between the network prediction and the ground truth using formula (4);
3. obtaining the total loss according to formula (7) and updating the tracking network parameters by back-propagation with the SGD (stochastic gradient descent) algorithm;
4. evaluating the network performance on the test set and adjusting the learning rate.
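A bare-bones sketch of the tracking-network training steps above: each batch is scored with the joint loss of equation (7) (for example, the `tracking_loss` sketch given earlier, passed in as `loss_fn`) and the parameters are updated with SGD. The model interface, data-loader fields and hyper-parameter values are assumptions.

```python
import torch

def train_tracking_network(model, loader, loss_fn, epochs=20, lr=1e-3):
    """Steps 1-3: compute the joint loss on each batch and back-propagate with SGD."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.5)
    for _ in range(epochs):
        for template, search, gt_dist, gt_label in loader:
            score, pred_dist = model(template, search)       # assumed model outputs
            loss = loss_fn(pred_dist, gt_dist, score, gt_label)
            opt.zero_grad()
            loss.backward()                                   # back-propagation
            opt.step()                                        # SGD parameter update
        # step 4: held-out performance would normally drive the learning-rate
        # adjustment; a fixed schedule stands in for it in this sketch
        scheduler.step()
    return model
```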
The embodiment of the invention has the following advantages:
First, a multi-modal target tracking image fusion framework is provided that fuses multi-modal information adaptively through an unsupervised algorithm, giving the network more salient information and greatly improving the robustness of the tracker. Second, an anchor-free regression method is provided; the trained model performs well, and weak predictions (cases where the anchor overlap is small) can be corrected to a certain degree. Third, a deformable-convolution feature alignment method is provided: without affecting model accuracy, the model uses deformable convolution to change the sampling positions so that they align with the predicted bounding box and the classification confidence corresponds to the target object, making the confidence more reliable; the running speed is real-time.
In conclusion, the unsupervised fusion method is applied to multi-modal tracking so that the fused image is more salient; the feature combination module improves the reliability of the classification network; tracking performance and robustness are greatly improved; the anchor-free regression network can correct weak predictions; and deformable convolution improves the classification confidence. The network tracks in real time.
Example 3
Example 1 of the embodiments of the invention is shown in fig. 3, which shows the fusion result when the parameter λ is 0.01: the left is the RGB image, the middle is the thermal image, and the right is the fused image, which contains a large amount of thermal information and texture detail.
Embodiment 2 of the invention is shown in figs. 2 and 3, the PR and SR score plots of 12 trackers on the RGBT234 dataset. PR is the percentage of frames whose output position lies within a given threshold distance of the ground truth, and SR is the ratio of successful frames whose overlap exceeds a threshold. The tracker using this approach is clearly better than the others: in PR it exceeds the second-best tracker MANet (77.8%) by more than 3.2%, and in SR it exceeds the second-best MANet (54.4%) by more than 6.1%.
In the embodiments of the invention, the models of the devices are not limited except where specifically stated, as long as the devices can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1. An object-aware image fusion method for multi-modal target tracking, the method comprising the steps of:
acquiring an adaptive fused image: inputting an RGB image and a thermal-modality image into two channels respectively, each channel comprising three aggregation modules; performing saliency detection on the images and cascading and connecting them with the outputs of different layers in the network; and, from the concatenated deep features, reconstructing the fused image with an adaptive guidance network by evaluating, through a sliding window, the image gray values, pixel intensity, similarity measure and consistency loss;
combining, by a feature combination module, the features extracted from the template and search images using a depth-wise cross-correlation operation to generate corresponding similarity features for subsequent target localization, followed by classification and regression for target localization;
performing tracking training with an anchor-free regression network, wherein all pixels inside the bounding box in the ground truth are used as training samples; and changing the sampling positions using deformable convolution so that they align with the predicted bounding box and the classification confidence corresponds to the target object.
2. The object-aware image fusion method for multi-modal target tracking according to claim 1, wherein the similarity measure in the network in the step of obtaining the adaptive fusion image is specifically:
SSIM(X, Y | W) = ((2·μ_X·μ_Y + C) · (2·σ_XY + C)) / ((μ_X² + μ_Y² + C) · (σ_X² + σ_Y² + C))
where X and Y denote the two modality images, μ_X and μ_Y are their average pixel intensities within the window, C = 9 × 10⁴, W is the sliding window, σ is the standard deviation, and σ_XY is the cross-correlation between X and Y; the sliding window size is 11 × 11, and the SSIM score is measured by calculating the average intensity of the pixels in each window;
E(I | W) = (1 / |W|) Σ_{i∈W} P_i
L_SSIM = 1 − (1/N) Σ_W Score(W), where Score(W) = SSIM(I_F, I_t | W) if E(I_t | W) ≥ E(I_rgb | W), and SSIM(I_F, I_rgb | W) otherwise;
where P_i is the value of pixel i; when E(I_t | W) ≥ E(I_rgb | W), the thermal infrared image has more texture information and SSIM guides the network to retain the thermal infrared features, so that I_F is more similar to I_t, and vice versa.
3. The object-aware image fusion method for multi-modal target tracking according to claim 1, wherein the changing the sampling position by using deformable convolution is specifically:
F′(c) = Σ_{Δτ∈T} w(Δτ) · F(c + Δτ)
performing a spatial transformation T on the sampling network G to change the fixed sampling positions into the predicted positions, where F is the input feature map, w denotes the learned convolution weights, c is a position in the feature map, F′ is the output feature map, and the spatial transformation Δτ ∈ T represents the distance vector that aligns the original sampling point with the predicted bounding box.
4. The object-aware image fusion method for multi-modal target tracking according to claim 3, wherein changing the sampling positions by deformable convolution comprises: aligning the sampling positions with the predicted bounding box and making the classification confidence correspond to the target object.
5. The object-aware image fusion method for multi-modal target tracking according to claim 4, wherein making the classification confidence correspond to the target object specifically comprises: for each position (c_x, c_y) in the classification map, there is an object bounding box M = (m_x, m_y, m_w, m_h) predicted by the regression network, where (m_x, m_y) denotes the box center and (m_w, m_h) its width and height.
6. The object-aware image fusion method for multi-modal target tracking according to claim 1, wherein the tracking training with the anchor-free regression network specifically comprises: if a pixel coordinate (x, y) falls within the ground-truth box B, it is regarded as a regression sample; let B = (x0, y0, x1, y1) ∈ R⁴ denote the upper-left and lower-right corners of the ground-truth box, and the sample label is computed as T = (d1, d2, d3, d4),
d1=x-x0,d2=y-y0,
d3=x1-x,d4=y1-y,
Representing the distance from the location to the four edges of the bounding box.
CN202110169737.6A 2021-02-07 2021-02-07 Object perception image fusion method for multi-mode target tracking Active CN112862860B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110169737.6A CN112862860B (en) 2021-02-07 2021-02-07 Object perception image fusion method for multi-mode target tracking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110169737.6A CN112862860B (en) 2021-02-07 2021-02-07 Object perception image fusion method for multi-mode target tracking

Publications (2)

Publication Number Publication Date
CN112862860A true CN112862860A (en) 2021-05-28
CN112862860B CN112862860B (en) 2023-08-01

Family

ID=75989032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110169737.6A Active CN112862860B (en) 2021-02-07 2021-02-07 Object perception image fusion method for multi-mode target tracking

Country Status (1)

Country Link
CN (1) CN112862860B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240994A (en) * 2021-11-04 2022-03-25 北京工业大学 Target tracking method and device, electronic equipment and storage medium
CN117893873A (en) * 2024-03-18 2024-04-16 安徽大学 Active tracking method based on multi-mode information fusion

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679677A (en) * 2013-12-12 2014-03-26 杭州电子科技大学 Dual-model image decision fusion tracking method based on mutual updating of models
CN107808167A (en) * 2017-10-27 2018-03-16 深圳市唯特视科技有限公司 A kind of method that complete convolutional network based on deformable segment carries out target detection
CN108875465A (en) * 2017-05-26 2018-11-23 北京旷视科技有限公司 Multi-object tracking method, multiple target tracking device and non-volatile memory medium
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN111797716A (en) * 2020-06-16 2020-10-20 电子科技大学 Single target tracking method based on Siamese network
CN111861888A (en) * 2020-07-27 2020-10-30 上海商汤智能科技有限公司 Image processing method, image processing device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679677A (en) * 2013-12-12 2014-03-26 杭州电子科技大学 Dual-model image decision fusion tracking method based on mutual updating of models
CN108875465A (en) * 2017-05-26 2018-11-23 北京旷视科技有限公司 Multi-object tracking method, multiple target tracking device and non-volatile memory medium
CN107808167A (en) * 2017-10-27 2018-03-16 深圳市唯特视科技有限公司 A kind of method that complete convolutional network based on deformable segment carries out target detection
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN111797716A (en) * 2020-06-16 2020-10-20 电子科技大学 Single target tracking method based on Siamese network
CN111861888A (en) * 2020-07-27 2020-10-30 上海商汤智能科技有限公司 Image processing method, image processing device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Hui Li et al.: "DenseFuse: A Fusion Approach to Infrared and Visible Images", arXiv, pages 1-11 *
Xu Zhengmei et al.: "Research on an Image Saliency Detection Algorithm Based on Multi-modal Information Fusion", Journal of Shaoguan University (Natural Science), vol. 39, no. 12, pages 13-17 *
Wang Kai et al.: "Target Tracking Based on Infrared and Visible Light Fusion", Computer Systems & Applications, vol. 27, no. 1, page 149 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240994A (en) * 2021-11-04 2022-03-25 北京工业大学 Target tracking method and device, electronic equipment and storage medium
CN117893873A (en) * 2024-03-18 2024-04-16 安徽大学 Active tracking method based on multi-mode information fusion
CN117893873B (en) * 2024-03-18 2024-06-07 安徽大学 Active tracking method based on multi-mode information fusion

Also Published As

Publication number Publication date
CN112862860B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
CN110276316B (en) Human body key point detection method based on deep learning
CN107832672B (en) Pedestrian re-identification method for designing multi-loss function by utilizing attitude information
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN109598196B (en) Multi-form multi-pose face sequence feature point positioning method
CN111046734B (en) Multi-modal fusion sight line estimation method based on expansion convolution
CN113673510B (en) Target detection method combining feature point and anchor frame joint prediction and regression
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN111915571A (en) Image change detection method, device, storage medium and equipment fusing residual error network and U-Net network
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
WO2024032010A1 (en) Transfer learning strategy-based real-time few-shot object detection method
CN109086659A (en) A kind of Human bodys' response method and apparatus based on multimode road Fusion Features
CN115797736B (en) Training method, device, equipment and medium for target detection model and target detection method, device, equipment and medium
Liu et al. Pose-adaptive hierarchical attention network for facial expression recognition
CN112862860A (en) Object perception image fusion method for multi-modal target tracking
CN112084952B (en) Video point location tracking method based on self-supervision training
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN112036260A (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN111582154A (en) Pedestrian re-identification method based on multitask skeleton posture division component
CN116824625A (en) Target re-identification method based on generation type multi-mode image fusion
CN114170537A (en) Multi-mode three-dimensional visual attention prediction method and application thereof
Zhang et al. Unsupervised depth estimation from monocular videos with hybrid geometric-refined loss and contextual attention
Jiang et al. Application of a fast RCNN based on upper and lower layers in face recognition
CN116682140A (en) Three-dimensional human body posture estimation algorithm based on attention mechanism multi-mode fusion
CN116977674A (en) Image matching method, related device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant