CN112766102B - Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion - Google Patents

Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion

Info

Publication number
CN112766102B
CN112766102B (application CN202110018918.9A)
Authority
CN
China
Prior art keywords
frame
branch
hyperspectral
tracking
video
Prior art date
Legal status
Active
Application number
CN202110018918.9A
Other languages
Chinese (zh)
Other versions
CN112766102A (en)
Inventor
王心宇
刘桢杞
钟燕飞
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University (WHU)
Priority to CN202110018918.9A
Publication of CN112766102A
Application granted
Publication of CN112766102B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V 20/13: Image or video recognition or understanding; Scenes; Terrestrial scenes; Satellite images
    • G06F 18/253: Electric digital data processing; Pattern recognition; Fusion techniques of extracted features
    • G06N 3/04: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/084: Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
    • G06V 20/48: Scene-specific elements in video content; Matching video sequences
    • Y02A 40/10: Technologies for adaptation to climate change; Adaptation technologies in agriculture

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion. Building on the cycle-consistency principle, the invention designs a deep-learning-based hyperspectral target tracking method that can train a hyperspectral target tracking deep learning model without supervision, saving the cost of manual labeling. On the basis of the Siamese tracking framework, an RGB branch (spatial branch) and a hyperspectral branch are designed; the spatial branch is trained with RGB video data, the trained RGB model is then loaded into the network with its parameters fixed, and the hyperspectral branch is trained at the same time, so that the fused features are more robust and discriminative. The fused features are finally input into a discriminative correlation filter (DCF) to obtain the tracking result. The method addresses the manual-labeling problem of hyperspectral video data and the scarcity of hyperspectral training samples for deep learning models, and can effectively improve the accuracy and speed of hyperspectral video tracking models.

Description

Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion
Technical Field
The invention relates to the field of computer vision, in particular to an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion.
Background
Hyperspectral video (high spatial resolution, high temporal resolution and high spectral resolution) object tracking is an emerging direction that aims to predict the state of an object in subsequent frames from the object information given in an initial frame of a hyperspectral video. Compared with RGB video object tracking, hyperspectral video object tracking provides, in addition to spatial information, spectral information that can distinguish different materials. Even if two targets have the same shape, a target can still be tracked in hyperspectral video as long as the materials differ, an advantage that RGB video object tracking does not possess. Hyperspectral video target tracking can therefore play an important role in fields such as camouflaged target tracking and small target tracking, and on this basis it has attracted increasing attention from researchers.
At the same time, hyperspectral video object tracking is a difficult task. First, existing hyperspectral video target tracking algorithms represent the target with traditional hand-crafted features, which limits their performance. Second, hyperspectral video must be captured by a dedicated hyperspectral video camera and training samples are limited, so no deep-learning-based hyperspectral video target tracking algorithm currently exists. Third, supervised deep learning algorithms require a large number of manually annotated samples, and video annotation in particular is time-consuming and laborious. Owing to these problems, current hyperspectral video object tracking algorithms tend to perform poorly.
Disclosure of Invention
The invention aims to provide an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion.
The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion has the following three notable characteristics. First, using the cycle-consistency principle, the whole deep-learning-based hyperspectral target tracking algorithm is trained without supervision and requires no manual labeling. Second, a correlation-filtering hyperspectral video target tracking framework that fuses spatial and spectral features is designed, which alleviates the scarcity of hyperspectral video training samples to a certain extent while fusing RGB and hyperspectral features into features that are more robust and discriminative. Third, a channel attention module is designed, and the weights of the feature channels are calculated only in the initial frame, so that the network can dynamically aggregate different feature-channel weights for different targets.
The invention provides an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion, comprising the following steps:
Step 1, preprocessing video data;
Step 2, randomly initializing a bounding box, and acquiring a template frame Z_i and subsequent search frames Z_i+x through the initialized bounding box, wherein the template frame Z_i and the search frames Z_i+x are RGB video frames or hyperspectral video frames;
Step 3, training the RGB branch, also called the spatial branch, without supervision using the cycle-consistency principle, finally obtaining an optimized spatial branch model;
The spatial branch comprises template branch 1 and search branch 1, where template branch 1 takes a template frame Z_i containing the tracking target as input image frame (Z_i here being an RGB video frame) and search branch 1 takes a search frame Z_i+x, i.e. a subsequent video frame, as input image frame, x > 0; the hyperspectral branch is removed when the spatial branch is trained, so only the spatial branch is trained;
Template branch 1 and search branch 1 have the same structure, comprising a convolution layer, a nonlinear activation layer, a convolution layer and a local response normalization layer;
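For illustration, a minimal PyTorch sketch of this shared template/search branch follows; the channel counts and kernel sizes are assumptions, as the patent does not specify them.

```python
import torch.nn as nn

class SpatialBranch(nn.Module):
    """Shared backbone of template branch 1 and search branch 1:
    convolution -> nonlinear activation -> convolution -> local response
    normalization, as described above. Layer widths are assumptions."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),  # convolution layer
            nn.ReLU(inplace=True),                                 # nonlinear activation layer
            nn.Conv2d(32, 32, kernel_size=3, padding=1),           # convolution layer
            nn.LocalResponseNorm(size=5),                          # local response normalization layer
        )

    def forward(self, x):
        return self.features(x)
```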
Step 4, training the hyperspectral branch without supervision using the cycle-consistency principle, finally obtaining an optimized spatial-hyperspectral model;
The hyperspectral branch comprises template branch 2 and search branch 2, where template branch 2 takes a template frame Z_i containing the tracking target as input image frame and search branch 2 takes a search frame Z_i+x, i.e. a subsequent video frame, as input image frame, x > 0; the trained spatial branch model is loaded when training the hyperspectral branch, and the frozen spatial branch parameters do not participate in back propagation;
Template branch 2 comprises several spectral feature extraction modules connected in series and a channel attention module. The first two spectral feature extraction modules each comprise a convolution layer, a batch normalization layer and a nonlinear activation layer; the third spectral feature extraction module comprises a convolution layer, a batch normalization layer, a nonlinear activation layer and a convolution layer. The channel attention module comprises a global average pooling layer, a fully connected layer, a nonlinear activation layer, a fully connected layer and a Softmax. Search branch 2 only comprises the spectral feature extraction modules connected in series and does not include the channel attention module;
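A minimal PyTorch sketch of such a channel attention module is given below; the channel count and reduction ratio are assumptions. Per the description of step 4, the weights a are computed once on the template frame and reused for all search frames.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention of template branch 2: global average pooling ->
    fully connected -> nonlinear activation -> fully connected -> Softmax,
    yielding one weight per channel of F_t_hsi."""
    def __init__(self, channels=32, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # global average pooling layer
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # fully connected layer
            nn.ReLU(inplace=True),                       # nonlinear activation layer
            nn.Linear(channels // reduction, channels),  # fully connected layer
            nn.Softmax(dim=1),                           # normalized channel weights
        )

    def forward(self, f_t_hsi):
        b, c, _, _ = f_t_hsi.shape
        a = self.fc(self.pool(f_t_hsi).view(b, c)).view(b, c, 1, 1)
        return a * f_t_hsi, a   # weighted feature aF_t_hsi and the weights a
```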
Step 5, inputting the hyperspectral video frame X_1 containing the target to be tracked into the template branch of the optimized spatial-hyperspectral model, and inputting the subsequent frames X_2, X_3, X_4 ... X_i into the search branch of the model to obtain the tracking result of each frame.
Further, the specific implementation of step 1 is as follows:
First, the video data is converted into a sequence of continuous image frames X_i, where X_i is an RGB video frame or a hyperspectral video frame;
Then all unlabeled video image frames X_i are resized to video image frames Y_i of a fixed size.
Further, the implementation of step 2 is as follows:
On the basis of step 1, a region of 90×90 pixels centered on coordinates [x, y] is selected on the unlabeled video frame Y_i as the target to be tracked; this region is the initialized BBOX. The 90×90 region is resized to Z_i of 125×125 pixels. Two frames Y_i+a and Y_i+b (10 >= a > 0, 10 >= b > 0, a > b or a < b) are randomly selected from the 10 frames Y_i+1 to Y_i+10, and the 90×90 pixel regions, again centered on coordinates [x, y], are resized to Z_i+a and Z_i+b of 125×125 pixels.
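As an illustrative sketch of this sampling step (the helper name and the use of OpenCV are assumptions; frames are the resized Y_i frames, 200×200 in the embodiment below):

```python
import numpy as np
import cv2

def sample_training_triplet(frames, i, rng=np.random):
    """Crop one random 90x90 box centered at [x, y] from frame Y_i and the
    same box from two frames chosen among the next 10, then resize each crop
    to 125x125, yielding (Z_i, Z_i+a, Z_i+b). Requires i + 10 < len(frames)."""
    h, w = frames[i].shape[:2]
    x = rng.randint(45, w - 45)   # box center, kept fully inside the frame
    y = rng.randint(45, h - 45)
    a, b = rng.choice(np.arange(1, 11), size=2, replace=False)

    def crop(frame):
        patch = frame[y - 45:y + 45, x - 45:x + 45]
        return cv2.resize(patch, (125, 125))

    return crop(frames[i]), crop(frames[i + a]), crop(frames[i + b])
```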
Further, the specific implementation of step 3 is as follows:
Step 3.1, template branch 1 takes the template frame Z_i as input image frame and search branch 1 takes the search frame Z_i+x as input image frame; the hyperspectral branch is removed when training the spatial branch, so only the spatial branch is trained;
Step 3.2, the template frame Z_i, an RGB video frame, is input into template branch 1 and passes sequentially through convolution layer - nonlinear activation layer - convolution layer - local response normalization layer to obtain the feature F_t;
Step 3.3, Z_i+a, an RGB video frame, is input into search branch 1 and passes sequentially through convolution layer - nonlinear activation layer - convolution layer - local response normalization layer to obtain the feature F_s;
Step 3.4, solving the ridge regression loss function

$\min_{w} \|w \ast F_t - H\|_2^2 + \lambda \|w\|_2^2$

to obtain the filter w, where H is an ideal Gaussian response and λ is a constant; in the Fourier domain the closed-form solution is

$\hat{w} = \dfrac{\hat{H}^{*} \odot \hat{F}_t}{\hat{F}_t^{*} \odot \hat{F}_t + \lambda}$

where $\hat{w}$ is the Fourier transform of w, $\hat{F}_t$ is the Fourier transform of F_t, $\hat{H}$ is the Fourier transform of H, * denotes the complex conjugate, and ⊙ denotes element-wise multiplication;
Step 3.5, calculating the final response R from the filter w and the feature F_s of the subsequent frame:

$R = \mathcal{F}^{-1}\left(\hat{w}^{*} \odot \hat{F}_s\right)$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform;
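For illustration, a minimal PyTorch sketch of this closed-form DCF step, mirroring the reconstructed formulas above (function names are assumptions, and conjugation conventions vary between DCF implementations):

```python
import torch

def dcf_filter(f_t, h, lam=1e-4):
    """Solve the ridge regression in the Fourier domain.
    f_t: template feature of shape (C, H, W); h: ideal Gaussian response (H, W)."""
    F_t = torch.fft.fft2(f_t)
    H = torch.fft.fft2(h)
    return (torch.conj(H) * F_t) / (torch.conj(F_t) * F_t + lam)  # per-channel filter w-hat

def dcf_response(w_hat, f_s):
    """Correlate the learned filter with the search feature f_s (C, H, W)."""
    F_s = torch.fft.fft2(f_s)
    return torch.fft.ifft2(torch.conj(w_hat) * F_s).real.sum(dim=0)  # response summed over channels
```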
Step 3.6, tracking forward in the sequence Z_i → Z_i+a → Z_i+b (the three frames form a training pair, with b > a), obtaining tracking responses R_i+a and R_i+b; then tracking backward in the sequence Z_i+b → Z_i, obtaining the tracking response R_i;
Step 3.7, calculating the movement weight

$M_{motion} = \|R_{i+a} - H_{i+a}\|_2^2 + \|R_i - H_i\|_2^2$,

normalized over the m different training pairs, where H_i is the ideal Gaussian output of the initial frame Z_i and H_i+a is the ideal Gaussian output of Z_i+a; whether the randomly initialized bounding box contains a dynamic target is judged through M_motion: if a dynamic target is present, M_motion is larger than when no dynamic target is present;
Step 3.8, constructing the loss function

$L = \dfrac{1}{n} \sum_{j=1}^{n} M_{motion}^{j}\,\|R_i^{j} - H_i^{j}\|_2^2$

where n is the batch size, R_i is the tracking response from Z_i+b back to Z_i, H_i is the ideal Gaussian response of the initial frame Z_i, and M mini-batches are used simultaneously for training; each mini-batch is a set of training pairs, a set of training pairs contains three frames of images, and the weight parameter M_motion reduces the influence of training pairs without a dynamic target on the network training;
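A sketch of one plausible reading of this motion-weighted loss follows; the exact normalization of M_motion is not recoverable from the text, so the batch-level normalization shown here is an assumption.

```python
import torch

def motion_weighted_loss(r_i, h_i, r_ia, h_ia):
    """Cycle-consistency loss over a batch of training pairs.
    r_i:  backward-tracked responses on Z_i, shape (N, H, W)
    h_i:  ideal Gaussian labels of Z_i
    r_ia: forward responses on Z_i+a; h_ia: their ideal Gaussian labels."""
    n = r_i.shape[0]
    # per-pair movement weight: larger when the box holds a moving target
    m_motion = ((r_ia - h_ia) ** 2).flatten(1).sum(1) \
             + ((r_i - h_i) ** 2).flatten(1).sum(1)
    m_motion = n * m_motion / m_motion.sum()          # normalize over the batch (assumption)
    err = ((r_i - h_i) ** 2).flatten(1).sum(1)
    return (m_motion.detach() * err).mean()           # weights act as constants
```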
Step 3.9, the loss value, i.e. the loss function value L in step 3.8, is back-propagated to update the network model parameters; the network parameters in step 3.2 are updated with the stochastic gradient descent (SGD) algorithm, finally obtaining the optimized spatial branch model.
Further, the implementation of step 4 is as follows:
Step 4.1, template branch 2 takes the template frame Z_i as input image frame and search branch 2 takes the search frame Z_i+x as input image frame, both hyperspectral video frames here; the trained spatial branch model is loaded when training the hyperspectral branch, and the frozen spatial branch parameters do not participate in back propagation;
Step 4.2, the template frame Z_i is input into template branch 2; three bands are selected from Z_i to compose a pseudo-color video frame, which passes through the spatial branch model to obtain the feature F_t_rgb; meanwhile, Z_i passes sequentially through a network formed by 3 spectral feature extraction modules connected in series to obtain the feature F_t_hsi (the structure of the 3 spectral feature extraction modules is shown in figure 2); F_t_hsi then passes sequentially through the global average pooling layer, fully connected layer, nonlinear activation layer, fully connected layer and Softmax to calculate the weight function a of the F_t_hsi feature channels, finally obtaining the weighted hyperspectral feature aF_t_hsi;
Step 4.3, Z_i+a is input into search branch 2; three bands are selected from Z_i+a to compose a pseudo-color video frame, which passes through the spatial branch model to obtain the feature F_s_rgb; similarly, Z_i+a passes sequentially through the network formed by the 3 spectral feature extraction modules connected in series (figure 2) to obtain the feature F_s_hsi; using the weights a calculated in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained;
Step 4.4, solving the ridge regression loss function

$\min_{w_f} \|w_f \ast F_{t\_f} - H\|_2^2 + \lambda \|w_f\|_2^2$

to obtain the filter w_f, where F_t_f = aF_t_hsi + F_t_rgb, H is an ideal Gaussian response and λ is a constant; in the Fourier domain the closed-form solution is

$\hat{w}_f = \dfrac{\hat{H}^{*} \odot \hat{F}_{t\_f}}{\hat{F}_{t\_f}^{*} \odot \hat{F}_{t\_f} + \lambda}$

where $\hat{w}_f$ is the Fourier transform of w_f, $\hat{F}_{t\_f}$ is the Fourier transform of F_t_f, $\hat{H}$ is the Fourier transform of H, and * denotes the complex conjugate;
Step 4.5, calculating the final response R_f from the filter w_f and the fused feature F_s_f = F_s_rgb + aF_s_hsi of the subsequent frame:

$R_f = \mathcal{F}^{-1}\left(\hat{w}_f^{*} \odot \hat{F}_{s\_f}\right)$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform;
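A short sketch of this fused-feature tracking step, reusing dcf_filter and dcf_response from the earlier sketch (the function name is an assumption):

```python
def fused_response(f_t_rgb, f_t_hsi, f_s_rgb, f_s_hsi, a, h, lam=1e-4):
    """Fuse spatial and spectral features as F_rgb + a * F_hsi, where a are
    the channel weights from the template frame, then run the same DCF as in
    step 3 on the fused features."""
    f_t_f = f_t_rgb + a * f_t_hsi      # F_t_f = F_t_rgb + aF_t_hsi
    f_s_f = f_s_rgb + a * f_s_hsi      # F_s_f = F_s_rgb + aF_s_hsi
    w_hat = dcf_filter(f_t_f, h, lam)
    return dcf_response(w_hat, f_s_f)  # final response R_f
```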
Step 4.6, tracking forward in the sequence Z_i → Z_i+a → Z_i+b, with b > a, obtaining tracking responses R_f_i+a and R_f_i+b; then tracking backward in the sequence Z_i+b → Z_i, obtaining the tracking response R_f_i;
Step 4.7, calculating the movement weight

$M_{f\_motion} = \|R_{f\_i+a} - H_{i+a}\|_2^2 + \|R_{f\_i} - H_i\|_2^2$,

where H_i is the ideal Gaussian output of the initial frame Z_i and H_i+a is the ideal Gaussian output of Z_i+a; whether the randomly initialized bounding box contains a dynamic target is judged through M_f_motion: if a dynamic target is present, M_f_motion is larger than when no dynamic target is present;
Step 4.8, constructing the loss function

$L_f = \dfrac{1}{n} \sum_{j=1}^{n} M_{f\_motion}^{j}\,\|R_{f\_i}^{j} - H_i^{j}\|_2^2$

where n is the batch size, R_f_i is the tracking response from Z_i+b back to Z_i, H_i is the ideal Gaussian response of the initial frame Z_i, and M mini-batches are used simultaneously for training; each mini-batch is a set of training pairs, a set of training pairs contains three frames of images, and the weight parameter M_f_motion reduces the influence of training pairs without a dynamic target on the network training;
Step 4.9, the loss value, i.e. the loss function value L_f in step 4.8, is back-propagated to update the network model parameters; the network parameters in step 4.2 are updated, finally obtaining the optimized spatial-hyperspectral model.
The method of the invention has the following notable effects: (1) the unsupervised training of the network based on the cycle-consistency principle saves labor cost; (2) the tracking model fusing RGB features and hyperspectral features is trained end to end with deep learning, so the inference speed is high, tens of times faster than that of traditional hand-crafted feature methods; (3) a channel attention mechanism is used in the initial frame to aggregate features that are more effective for the target to be tracked, increasing the network's ability to discriminate the target.
Drawings
FIG. 1 is a schematic diagram of cycle consistency in step 4 of embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of the hyperspectral branch in step 4 of embodiment 1 of the present invention.
Fig. 3 is a schematic diagram of the spatial branch in step 3 of embodiment 1 of the present invention.
Fig. 4 is a schematic diagram of the tracking result in step 3 of embodiment 1 of the present invention, where the numerals denote the 4th and 12th frames and the box marks the position and size of the tracked target; the box moves and changes with the movement and deformation of the target (as the target becomes larger the box becomes larger, and as the target becomes smaller the box becomes smaller).
Fig. 5 is a flowchart of embodiment 1 of the present invention.
Detailed Description
The technical scheme of the invention is further described in detail below through an embodiment with reference to the accompanying drawings.
Example 1:
The embodiment of the invention provides an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion, comprising the following steps:
Step 1, preprocessing video data; this step further comprises:
Step 1.1, converting the video data into a sequence of continuous image frames X_i (RGB video frames or hyperspectral video frames).
Step 1.2, resizing all unlabeled video image frames X_i to video image frames Y_i of 200×200 pixels.
Step 2, randomly initializing a Bounding Box (BBOX); this step further comprises:
On the basis of step 1, a region of 90×90 pixels centered on coordinates [x, y] is randomly selected on the unlabeled video frame Y_i as the target to be tracked (this region is the initialized BBOX). The 90×90 region is resized to Z_i of 125×125 pixels. Two frames Y_i+a and Y_i+b (10 >= a > 0, 10 >= b > 0, a > b or a < b) are randomly selected from the 10 frames Y_i+1 to Y_i+10, and the 90×90 pixel regions, again centered on coordinates [x, y], are resized to Z_i+a and Z_i+b of 125×125 pixels.
Step 3, training the RGB branch (spatial branch) without supervision using the cycle-consistency principle; this step further comprises:
Step 3.1, the whole network is built on a Siamese network and divided into a template branch and a search branch. The template branch takes the template frame Z_i (containing the target to be tracked; here Z_i denotes an RGB video frame) as input image frame and is divided into a spatial branch and a hyperspectral branch. The search branch takes the search frame Z_i+x (a subsequent video frame, x > 0) as input image frame and is likewise divided into a spatial branch and a hyperspectral branch. The hyperspectral branch is removed when the spatial branch is trained, so only the spatial branch is trained.
Step 3.2, the template frame Z_i, an RGB video frame, is input into the template branch and passes sequentially through convolution layer - nonlinear activation layer - convolution layer - local response normalization layer to obtain the feature F_t.
Step 3.3, Z_i+a (assuming b > a) is input into the search branch, where Z_i+a is an RGB video frame, and passes sequentially through convolution layer - nonlinear activation layer - convolution layer - local response normalization layer to obtain the feature F_s.
Step 3.4, solving the ridge regression loss function

$\min_{w} \|w \ast F_t - H\|_2^2 + \lambda \|w\|_2^2$

to obtain the filter w, where H is an ideal Gaussian response and λ is a constant. In the Fourier domain the closed-form solution is

$\hat{w} = \dfrac{\hat{H}^{*} \odot \hat{F}_t}{\hat{F}_t^{*} \odot \hat{F}_t + \lambda}$

where $\hat{w}$ is the Fourier transform of w, $\hat{F}_t$ is the Fourier transform of F_t, $\hat{H}$ is the Fourier transform of H, * denotes the complex conjugate, and ⊙ denotes element-wise multiplication.
Step 3.5, the final response R is calculated from the filter w and the feature F_s of the subsequent frame:

$R = \mathcal{F}^{-1}\left(\hat{w}^{*} \odot \hat{F}_s\right)$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform.
Step 3.6, tracking forward in the sequence Z_i → Z_i+a → Z_i+b (the three frames form a training pair) to obtain the tracking responses R_i+a and R_i+b; then tracking backward in the sequence Z_i+b → Z_i to obtain the tracking response R_i.
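As an illustration, a sketch of this forward-backward pass, reusing dcf_filter and dcf_response from the earlier sketch; phi stands for the spatial branch network, and using each predicted response as the pseudo-label for the next filter is an assumption consistent with cycle-consistency tracking.

```python
def cycle_track(phi, z_i, z_ia, z_ib, h_i, lam=1e-4):
    """Forward-backward tracking over one triplet (Z_i, Z_i+a, Z_i+b).
    phi: feature extractor (spatial branch); h_i: ideal Gaussian label of Z_i."""
    w = dcf_filter(phi(z_i), h_i, lam)     # learn filter on the template Z_i
    r_ia = dcf_response(w, phi(z_ia))      # forward: Z_i -> Z_i+a
    w = dcf_filter(phi(z_ia), r_ia, lam)   # the response serves as pseudo-label
    r_ib = dcf_response(w, phi(z_ib))      # forward: Z_i+a -> Z_i+b
    w = dcf_filter(phi(z_ib), r_ib, lam)
    r_i = dcf_response(w, phi(z_i))        # backward: Z_i+b -> Z_i
    return r_i, r_ia, r_ib                 # r_i is compared against h_i in the loss
```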
Step 3.7, calculating the movement weight

$M_{motion} = \|R_{i+a} - H_{i+a}\|_2^2 + \|R_i - H_i\|_2^2$,

normalized over the m different training pairs, where H_i is the ideal Gaussian output of the initial frame Z_i and H_i+a is the ideal Gaussian output of Z_i+a. Whether the randomly initialized bounding box contains a dynamic target is judged by calculating the movement weight M_motion (if a dynamic target is present, M_motion is larger than when no dynamic target is present).
Step 3.8, constructing the loss function

$L = \dfrac{1}{n} \sum_{j=1}^{n} M_{motion}^{j}\,\|R_i^{j} - H_i^{j}\|_2^2$

where n is the batch size, R_i is the tracking response from Z_i+b back to Z_i, H_i is the ideal Gaussian response of the initial frame Z_i, and M mini-batches are used simultaneously for training; each mini-batch is a set of training pairs, a set of training pairs contains three frames of images, and the weight parameter M_motion reduces the influence of training pairs without a dynamic target on the network training.
Step 3.9, the loss value is back-propagated to update the network model parameters; the network parameters in step 3.2 are updated with the stochastic gradient descent (SGD) algorithm, finally obtaining the optimized spatial branch model.
Step 4, training the hyperspectral branch without supervision using the cycle-consistency principle; this step further comprises:
Step 4.1, the whole network is built on a Siamese network and divided into a template branch and a search branch. The template branch takes the template frame Z_i (containing the target to be tracked; all input video frames Z_i etc. here denote hyperspectral video frames) as input image frame and is divided into a spatial branch and a hyperspectral branch. The search branch takes the search frame Z_i+x (a subsequent video frame, x > 0) as input image frame and is likewise divided into a spatial branch and a hyperspectral branch. The trained spatial branch model is loaded when training the hyperspectral branch, and the frozen spatial branch parameters do not participate in back propagation.
Step 4.2, the template frame Z_i is input into the template branch. Three bands are selected from Z_i to compose a pseudo-color video frame, which passes through the spatial branch to obtain the feature F_t_rgb. Meanwhile, Z_i passes sequentially through a network formed by 3 spectral feature extraction modules connected in series to obtain the feature F_t_hsi; the structure of the 3 spectral feature extraction modules is shown in fig. 2. F_t_hsi then passes sequentially through the global average pooling layer, fully connected layer, nonlinear activation layer, fully connected layer and Softmax (the channel attention mechanism) to calculate the weight function a of the F_t_hsi feature channels (the weight function is calculated only in the template frame, and subsequent frames use a directly), finally obtaining the weighted hyperspectral feature aF_t_hsi.
Step 4.3, Z_i+a (assuming b > a) is input into the search branch. Three bands (the same band composition as in step 4.2) are selected from Z_i+a to compose a pseudo-color video frame, which passes through the spatial branch to obtain the feature F_s_rgb. Similarly, Z_i+a passes sequentially through the network consisting of the 3 spectral feature extraction modules connected in series (fig. 2) to obtain the feature F_s_hsi. Using the weights a calculated in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained.
Step 4.4, solving the ridge regression loss function

$\min_{w_f} \|w_f \ast F_{t\_f} - H\|_2^2 + \lambda \|w_f\|_2^2$

to obtain the filter w_f, where F_t_f = aF_t_hsi + F_t_rgb, H is an ideal Gaussian response and λ is a constant. In the Fourier domain the closed-form solution is

$\hat{w}_f = \dfrac{\hat{H}^{*} \odot \hat{F}_{t\_f}}{\hat{F}_{t\_f}^{*} \odot \hat{F}_{t\_f} + \lambda}$

where $\hat{w}_f$ is the Fourier transform of w_f, $\hat{F}_{t\_f}$ is the Fourier transform of F_t_f, and $\hat{H}$ is the Fourier transform of H, with * denoting the complex conjugate.
Step 4.5, the final response R_f is calculated from the filter w_f and the fused feature F_s_f = F_s_rgb + aF_s_hsi of the subsequent frame:

$R_f = \mathcal{F}^{-1}\left(\hat{w}_f^{*} \odot \hat{F}_{s\_f}\right)$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform.
Step 4.6, tracking forward in the sequence Z_i → Z_i+a → Z_i+b to obtain the tracking responses R_f_i+a and R_f_i+b; then tracking backward in the sequence Z_i+b → Z_i to obtain the tracking response R_f_i.
Step 4.7, calculating the movement weight

$M_{f\_motion} = \|R_{f\_i+a} - H_{i+a}\|_2^2 + \|R_{f\_i} - H_i\|_2^2$,

where H_i is the ideal Gaussian output of the initial frame Z_i and H_i+a is the ideal Gaussian output of Z_i+a. Whether the randomly initialized bounding box contains a dynamic target is judged by calculating the weight parameter M_f_motion (if a dynamic target is present, M_f_motion is larger than when no dynamic target is present).
Step 4.8, constructing the loss function

$L_f = \dfrac{1}{n} \sum_{j=1}^{n} M_{f\_motion}^{j}\,\|R_{f\_i}^{j} - H_i^{j}\|_2^2$

where n is the batch size, R_f_i is the tracking response from Z_i+b back to Z_i, H_i is the ideal Gaussian response of the initial frame Z_i, and M mini-batches are used simultaneously for training; each mini-batch is a set of training pairs, a set of training pairs contains three frames of images, and the weight parameter M_f_motion reduces the influence of training pairs without a dynamic target on the network training.
Step 4.9, the loss value is back-propagated to update the network model parameters; the network parameters in step 4.2 are updated, finally obtaining the optimized spatial-hyperspectral model.
Step 5, the hyperspectral video frame X_1 containing the target to be tracked is input into the template branch of the optimized spatial-hyperspectral model, and the subsequent frames X_2, X_3, X_4 ... X_i are input into the search branch of the model to obtain the tracking result of each frame.
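A sketch of this inference loop is shown below; model.init and model.update are assumed method names for illustration, not an API defined by the patent.

```python
import torch

def track_video(model, frames, init_bbox):
    """Step 5 inference: the first hyperspectral frame X_1 initializes the
    template branch (filter and channel weights a); every subsequent frame
    X_2, X_3, ... goes through the search branch."""
    model.eval()
    results = [init_bbox]
    with torch.no_grad():
        model.init(frames[0], init_bbox)   # template branch on X_1
        for frame in frames[1:]:
            bbox = model.update(frame)     # search branch: response peak -> bbox
            results.append(bbox)
    return results
```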
The method of the invention has the following notable effects: (1) the unsupervised training of the network based on the cycle-consistency principle saves labor cost; (2) the tracking model fusing RGB features and hyperspectral features is trained end to end with deep learning, so the inference speed is high, tens of times faster than that of traditional hand-crafted feature methods; (3) a channel attention mechanism is used in the initial frame to aggregate features that are more effective for the target to be tracked, increasing the network's ability to discriminate the target. The specific embodiments described herein are offered by way of example only to illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the invention or exceeding the scope defined in the accompanying claims.

Claims (5)

1. An unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion, characterized by comprising the following steps:
Step 1, preprocessing video data;
Step 2, randomly initializing a bounding box, and acquiring a template frame Z_i and subsequent search frames Z_i+x through the initialized bounding box, wherein the template frame Z_i and the search frames Z_i+x are RGB video frames or hyperspectral video frames;
Step 3, training the RGB branch, also called the spatial branch, without supervision using the cycle-consistency principle, finally obtaining an optimized spatial branch model;
the spatial branch comprises template branch 1 and search branch 1, wherein template branch 1 takes a template frame Z_i containing the tracking target as input image frame, the template frame Z_i here being an RGB video frame, and search branch 1 takes a search frame Z_i+x, i.e. a subsequent video frame, as input image frame, x > 0; the hyperspectral branch is removed when the spatial branch is trained, so only the spatial branch is trained;
template branch 1 and search branch 1 have the same structure, comprising a convolution layer, a nonlinear activation layer, a convolution layer and a local response normalization layer;
Step 4, training the hyperspectral branch without supervision using the cycle-consistency principle, finally obtaining an optimized spatial-hyperspectral model;
the hyperspectral branch comprises template branch 2 and search branch 2, wherein template branch 2 takes a template frame Z_i containing the tracking target as input image frame, the template frame Z_i being a hyperspectral video frame, and search branch 2 takes a search frame Z_i+x, i.e. a subsequent video frame, as input image frame, x > 0; the trained spatial branch model is loaded when training the hyperspectral branch, and the frozen spatial branch parameters do not participate in back propagation; template branch 2 comprises several spectral feature extraction modules connected in series and a channel attention module, wherein the first two spectral feature extraction modules each comprise a convolution layer, a batch normalization layer and a nonlinear activation layer, the third spectral feature extraction module comprises a convolution layer, a batch normalization layer, a nonlinear activation layer and a convolution layer, and the channel attention module comprises a global average pooling layer, a fully connected layer, a nonlinear activation layer, a fully connected layer and a Softmax; search branch 2 only comprises the spectral feature extraction modules connected in series and does not include the channel attention module;
the specific implementation of step 4 is as follows:
Step 4.1, template branch 2 takes the template frame Z_i as input image frame and search branch 2 takes the search frame Z_i+x as input image frame; the trained spatial branch model is loaded when training the hyperspectral branch, and the frozen spatial branch parameters do not participate in back propagation;
Step 4.2, the template frame Z_i is input into template branch 2; three bands are selected from Z_i to compose a pseudo-color video frame, which passes through the spatial branch model to obtain the feature F_t_rgb; meanwhile, Z_i passes sequentially through a network formed by 3 spectral feature extraction modules connected in series to obtain the feature F_t_hsi; F_t_hsi passes sequentially through the global average pooling layer, fully connected layer, nonlinear activation layer, fully connected layer and Softmax to calculate the weight function a of the F_t_hsi feature channels, finally obtaining the weighted hyperspectral feature aF_t_hsi;
Step 4.3, Z_i+a is input into search branch 2; three bands are selected from Z_i+a to compose a pseudo-color video frame, which passes through the spatial branch model to obtain the feature F_s_rgb; similarly, Z_i+a passes sequentially through the network formed by the 3 spectral feature extraction modules connected in series to obtain the feature F_s_hsi; using the weights a calculated in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained;
Step 4.4, obtaining the filter w_f by solving the ridge regression loss function;
Step 4.5, calculating the final response R_f from the filter w_f and the fused feature F_s_f = F_s_rgb + aF_s_hsi:

$R_f = \mathcal{F}^{-1}\left(\hat{w}_f^{*} \odot \hat{F}_{s\_f}\right)$

wherein $\mathcal{F}^{-1}$ denotes the inverse Fourier transform and $\hat{w}_f$ is the Fourier transform of w_f;
Step 4.6, tracking forward in the sequence Z_i → Z_i+a → Z_i+b, with b > a, obtaining tracking responses R_f_i+a and R_f_i+b; then tracking backward in the sequence Z_i+b → Z_i, obtaining the tracking response R_f_i;
Step 4.7, calculating the movement weight

$M_{f\_motion} = \|R_{f\_i+a} - H_{i+a}\|_2^2 + \|R_{f\_i} - H_i\|_2^2$,

wherein H_i is the ideal Gaussian output of the initial frame Z_i and H_i+a is the ideal Gaussian output of Z_i+a; whether the randomly initialized bounding box contains a dynamic target is judged by calculating the weight parameter M_f_motion, and if a dynamic target is present, M_f_motion is larger than when no dynamic target is present;
Step 4.8, constructing the loss function

$L_f = \dfrac{1}{n} \sum_{j=1}^{n} M_{f\_motion}^{j}\,\|R_{f\_i}^{j} - H_i^{j}\|_2^2$

wherein n is the batch size, R_f_i is the tracking response from Z_i+b back to Z_i, H_i is the ideal Gaussian response of the initial frame Z_i, and M mini-batches are used simultaneously for training; each mini-batch is a set of training pairs, a set of training pairs contains three frames of images, and the weight parameter M_f_motion reduces the influence of training pairs without a dynamic target on the network training;
Step 4.9, the loss value, i.e. the loss function value L_f in step 4.8, is back-propagated to update the network model parameters; the network parameters in step 4.2 are updated, finally obtaining the optimized spatial-hyperspectral model;
Step 5, the hyperspectral video frame X_1 containing the target to be tracked is input into the template branch of the optimized spatial-hyperspectral model, and the subsequent video frames X_2, X_3, X_4 ... X_i are sequentially input into the search branch of the model to obtain the tracking result of each frame.
2. The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion according to claim 1, characterized in that the specific implementation of step 1 is as follows:
first, the video data is converted into a sequence of continuous image frames X_i, where X_i is an RGB video frame or a hyperspectral video frame;
then all unlabeled video image frames X_i are resized to video image frames Y_i.
3. The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion according to claim 1, characterized in that the implementation of step 2 is as follows:
on the basis of step 1, a region of 90×90 pixels centered on coordinates [x, y] is selected on the unlabeled video frame Y_i as the target to be tracked, this region being the initialized BBOX; the 90×90 region is resized to Z_i of 125×125 pixels; two frames Y_i+a and Y_i+b (10 >= a > 0, 10 >= b > 0, a > b or a < b) are randomly selected from the 10 frames Y_i+1 to Y_i+10, and the 90×90 pixel regions, again centered on coordinates [x, y], are resized to Z_i+a and Z_i+b of 125×125 pixels.
4. The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion according to claim 1, characterized in that the specific implementation of step 3 is as follows:
Step 3.1, template branch 1 takes the template frame Z_i as input image frame and search branch 1 takes the search frame Z_i+x as input image frame; the hyperspectral branch is removed when training the spatial branch, so only the spatial branch is trained;
Step 3.2, the template frame Z_i, an RGB video frame, is input into template branch 1 and passes sequentially through convolution layer - nonlinear activation layer - convolution layer - local response normalization layer to obtain the feature F_t;
Step 3.3, Z_i+a, an RGB video frame, is input into search branch 1 and passes sequentially through convolution layer - nonlinear activation layer - convolution layer - local response normalization layer to obtain the feature F_s;
Step 3.4, solving the ridge regression loss function

$\min_{w} \|w \ast F_t - H\|_2^2 + \lambda \|w\|_2^2$

to obtain the filter w, wherein H is an ideal Gaussian response and λ is a constant; in the Fourier domain the closed-form solution is

$\hat{w} = \dfrac{\hat{H}^{*} \odot \hat{F}_t}{\hat{F}_t^{*} \odot \hat{F}_t + \lambda}$

wherein $\hat{w}$ is the Fourier transform of w, $\hat{F}_t$ is the Fourier transform of F_t, $\hat{H}$ is the Fourier transform of H, * denotes the complex conjugate, and ⊙ denotes element-wise multiplication;
Step 3.5, calculating the final response R from the filter w and the feature F_s of the subsequent frame:

$R = \mathcal{F}^{-1}\left(\hat{w}^{*} \odot \hat{F}_s\right)$

wherein $\mathcal{F}^{-1}$ denotes the inverse Fourier transform;
Step 3.6, tracking forward in the sequence Z_i → Z_i+a → Z_i+b, the three frames forming a training pair with b > a, obtaining tracking responses R_i+a and R_i+b; then tracking backward in the sequence Z_i+b → Z_i, obtaining the tracking response R_i;
Step 3.7, calculating the movement weight

$M_{motion} = \|R_{i+a} - H_{i+a}\|_2^2 + \|R_i - H_i\|_2^2$,

normalized over the m different training pairs, wherein H_i is the ideal Gaussian output of the initial frame Z_i and H_i+a is the ideal Gaussian output of Z_i+a; whether the randomly initialized bounding box contains a dynamic target is judged by calculating the movement weight M_motion, and if a dynamic target is present, M_motion is larger than when no dynamic target is present;
Step 3.8, constructing the loss function

$L = \dfrac{1}{n} \sum_{j=1}^{n} M_{motion}^{j}\,\|R_i^{j} - H_i^{j}\|_2^2$

wherein n is the batch size, R_i is the tracking response from Z_i+b back to Z_i, H_i is the ideal Gaussian response of the initial frame Z_i, and M mini-batches are used simultaneously for training; each mini-batch is a set of training pairs, a set of training pairs contains three frames of images, and the weight parameter M_motion reduces the influence of training pairs without a dynamic target on the network training;
Step 3.9, the loss value, i.e. the loss function value L in step 3.8, is back-propagated to update the network model parameters; the network parameters in step 3.2 are updated with the stochastic gradient descent algorithm, finally obtaining the optimized spatial branch model.
5. The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion according to claim 1, characterized in that the ridge regression loss function in step 4.4 is calculated as follows:

$\min_{w_f} \|w_f \ast F_{t\_f} - H\|_2^2 + \lambda \|w_f\|_2^2$

solving the ridge regression loss function yields the filter w_f, wherein F_t_f = aF_t_hsi + F_t_rgb, H is an ideal Gaussian response and λ is a constant; in the Fourier domain the closed-form solution is

$\hat{w}_f = \dfrac{\hat{H}^{*} \odot \hat{F}_{t\_f}}{\hat{F}_{t\_f}^{*} \odot \hat{F}_{t\_f} + \lambda}$

wherein $\hat{w}_f$ is the Fourier transform of w_f, $\hat{F}_{t\_f}$ is the Fourier transform of F_t_f, and $\hat{H}$ is the Fourier transform of H, * denoting the complex conjugate.
CN202110018918.9A 2021-01-07 2021-01-07 Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion Active CN112766102B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110018918.9A CN112766102B (en) 2021-01-07 2021-01-07 Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110018918.9A CN112766102B (en) 2021-01-07 2021-01-07 Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion

Publications (2)

Publication Number Publication Date
CN112766102A CN112766102A (en) 2021-05-07
CN112766102B true CN112766102B (en) 2024-04-26

Family

ID=75700670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110018918.9A Active CN112766102B (en) 2021-01-07 2021-01-07 Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion

Country Status (1)

Country Link
CN (1) CN112766102B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344932B (en) * 2021-06-01 2022-05-03 电子科技大学 Semi-supervised single-target video segmentation method
CN113628244B (en) * 2021-07-05 2023-11-28 上海交通大学 Target tracking method, system, terminal and medium based on label-free video training
CN117689692A (en) * 2023-12-20 2024-03-12 中国人民解放军海军航空大学 Attention mechanism guiding matching associated hyperspectral and RGB video fusion tracking method


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060274A (en) * 2019-04-12 2019-07-26 北京影谱科技股份有限公司 The visual target tracking method and device of neural network based on the dense connection of depth

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038684A (en) * 2017-04-10 2017-08-11 南京信息工程大学 A kind of method for lifting TMI spatial resolution
CN108765280A (en) * 2018-03-30 2018-11-06 徐国明 A kind of high spectrum image spatial resolution enhancement method
WO2020199205A1 (en) * 2019-04-04 2020-10-08 合刃科技(深圳)有限公司 Hybrid hyperspectral image reconstruction method and system
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity
CN111062888A (en) * 2019-12-16 2020-04-24 武汉大学 Hyperspectral image denoising method based on multi-target low-rank sparsity and spatial-spectral total variation
CN111325116A (en) * 2020-02-05 2020-06-23 武汉大学 Remote sensing image target detection method capable of evolving based on offline training-online learning depth
CN111724411A (en) * 2020-05-26 2020-09-29 浙江工业大学 Multi-feature fusion tracking method based on hedging algorithm
CN111797716A (en) * 2020-06-16 2020-10-20 电子科技大学 Single target tracking method based on Siamese network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RGB and hyperspectral image fusion based on superpixel segmentation; 洪科 (Hong Ke); Electronic Technology & Software Engineering, Issue 03; full text *

Also Published As

Publication number Publication date
CN112766102A (en) 2021-05-07


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant