CN112766102B - Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion - Google Patents
- Publication number
- CN112766102B (application CN202110018918.9A)
- Authority
- CN
- China
- Prior art keywords
- frame
- branch
- hyperspectral
- tracking
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/48—Matching video sequences
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A40/00—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
- Y02A40/10—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture
Abstract
The invention relates to an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion. Combining the cycle consistency principle, the invention designs a deep-learning-based hyperspectral target tracking method that can train a hyperspectral target tracking deep learning model without supervision, saving the cost of manual labeling. On the basis of the Siamese tracking framework, an RGB branch (spatial branch) and a hyperspectral branch are designed; the spatial branch is trained with RGB video data, the trained RGB model is then loaded into the network with its parameters fixed, and the hyperspectral branch is trained at the same time, so that the fused features are more robust and discriminative. The fused features are finally input into a discriminative correlation filter (DCF) to obtain the tracking result. The method solves the problem of manual labeling of hyperspectral video data and the problem of scarce hyperspectral training samples for training deep learning models, and can effectively improve the precision and speed of hyperspectral video tracking models.
Description
Technical Field
The invention relates to the field of computer vision processing, in particular to an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion.
Background
Hyperspectral video (high spatial resolution, high temporal resolution, high spectral resolution) object tracking is an emerging direction that aims to predict the state of an object in subsequent frames using the object information of a given initial frame in a hyperspectral video. Compared with RGB video object tracking, hyperspectral video object tracking provides, in addition to spatial information, spectral information that can distinguish different materials. Even if two targets have the same shape, a target can still be tracked in hyperspectral video as long as the materials differ, an advantage that RGB video target tracking does not possess. Hyperspectral video target tracking can therefore play an important role in fields such as camouflaged target tracking and small target tracking, and it has attracted increasing attention from researchers.
At the same time, hyperspectral video object tracking is a difficult task. Firstly, existing hyperspectral video target tracking algorithms use traditional hand-crafted features to represent the target, which limits their performance. Secondly, hyperspectral video must be shot with a dedicated hyperspectral camera and training samples are scarce, so no deep-learning-based hyperspectral video tracking algorithm currently exists. Thirdly, supervised deep learning algorithms require large numbers of manually annotated samples, and video annotation in particular is time-consuming and laborious. Because of these problems, current hyperspectral video object tracking algorithms tend to perform poorly.
Disclosure of Invention
The invention aims to provide an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion.
The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion has the following three notable characteristics. Firstly, by using the cycle consistency principle, the whole deep-learning-based hyperspectral target tracking algorithm is trained without supervision and without any manual labeling. Secondly, a correlation-filter-based hyperspectral video target tracking framework fusing spatial and spectral features is designed, which alleviates to some extent the scarcity of hyperspectral video training samples, while fusing RGB and hyperspectral features into features that are more robust and discriminative. Thirdly, a channel attention module is designed whose feature-channel weights are computed only on the initial frame, so that the network can dynamically aggregate different feature-channel weights for different targets.
The invention provides an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion, which comprises the following implementation steps:
Step 1, preprocessing video data;
Step 2, randomly initializing a bounding box, and acquiring a template frame Z_i and subsequent search frames Z_{i+x} through the initialized bounding box, wherein the template frame Z_i and the search frames Z_{i+x} are RGB video frames or hyperspectral video frames;
Step 3, training the RGB branch, also called the spatial branch, without supervision using the cycle consistency principle, finally obtaining an optimized spatial branch model, denoted M_rgb below;
The spatial branch comprises a template branch 1 and a search branch 1, wherein template branch 1 takes the template frame Z_i containing the tracking target as the input image frame (Z_i here is an RGB video frame) and search branch 1 takes the search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the hyperspectral branch is removed when the spatial branch is trained, and only the spatial branch is trained;
Template branch 1 and search branch 1 have the same structure, comprising a convolution layer, a nonlinear activation layer, a convolution layer and a local response normalization layer;
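For illustration, a minimal PyTorch sketch of this shared two-convolution structure is given below. Only the layer types are specified by the method; the channel counts, kernel sizes and LRN neighborhood here are assumptions.

```python
import torch.nn as nn

class SpatialBranch(nn.Module):
    """Conv -> ReLU -> Conv -> LocalResponseNorm, shared (Siamese weights)
    by template branch 1 and search branch 1. Channel counts and kernel
    sizes are illustrative assumptions, not values from the patent."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.LocalResponseNorm(size=5),
        )

    def forward(self, x):
        return self.features(x)
```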
Step 4, training the hyperspectral branch without supervision using the cycle consistency principle, finally obtaining an optimized space-hyperspectral model, denoted M_f below;
The hyperspectral branch comprises a template branch 2 and a search branch 2, wherein template branch 2 takes the template frame Z_i containing the tracking target as the input image frame and search branch 2 takes the search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the model M_rgb of the spatial branch is loaded when training the hyperspectral branch, and the spatial branch parameters are frozen and do not participate in back propagation;
Template branch 2 comprises several serially connected spectral feature extraction modules and a channel attention module: the first two spectral feature extraction modules each comprise a convolution layer, a batch normalization layer and a nonlinear activation layer; the third spectral feature extraction module comprises a convolution layer, a batch normalization layer, a nonlinear activation layer and a second convolution layer; the channel attention module comprises a global average pooling layer, a full connection layer, a nonlinear activation layer, a full connection layer and a Softmax; search branch 2 comprises only the serially connected spectral feature extraction modules and no channel attention module;
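For illustration, a minimal PyTorch sketch of the spectral feature extraction module and the channel attention module follows; the channel widths and the reduction ratio are assumptions, since the text only names the layer types.

```python
import torch.nn as nn

class SpectralFeatureModule(nn.Module):
    """Conv -> BatchNorm -> ReLU; the third module in the series appends a
    second convolution, per the description. Widths are assumptions."""
    def __init__(self, cin, cout, extra_conv=False):
        super().__init__()
        layers = [nn.Conv2d(cin, cout, 3, padding=1),
                  nn.BatchNorm2d(cout),
                  nn.ReLU(inplace=True)]
        if extra_conv:
            layers.append(nn.Conv2d(cout, cout, 3, padding=1))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class ChannelAttention(nn.Module):
    """Global average pooling -> FC -> ReLU -> FC -> Softmax over channels,
    producing the per-channel weight function a (template branch 2 only)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Softmax(dim=1),
        )

    def forward(self, f):                      # f: (B, C, H, W)
        a = self.fc(self.pool(f).flatten(1))   # (B, C) channel weights
        return a.view(f.size(0), -1, 1, 1)     # broadcastable over H, W
```

Under these assumptions, a B-band cube could pass through `SpectralFeatureModule(B, 32)`, `SpectralFeatureModule(32, 32)` and `SpectralFeatureModule(32, 32, extra_conv=True)` in series, followed by `ChannelAttention(32)` in template branch 2 only.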
Step 5, inputting the hyperspectral video frame X_1 containing the target to be tracked into the template branch of the network model M_f, and inputting the subsequent frames X_2, X_3, X_4, ..., X_i into the search branch of M_f to obtain the tracking result of each frame.
Further, the specific implementation manner of step 1 is as follows,
Firstly, the video data are converted into a sequence of frames X_i, where X_i is an RGB video frame or a hyperspectral video frame;
The unlabeled video image frames X_i are then all resized to 200×200-pixel video image frames Y_i.
Further, the implementation manner of step 2 is as follows,
On the basis of step 1, a region of 90×90 pixels centered on coordinates [x, y] is selected on the unlabeled video frame Y_i as the target to be tracked; this region is the initialized BBOX; the 90×90 region is resized to Z_i of 125×125 pixels; two frames Y_{i+a} and Y_{i+b} (10 ≥ a > 0, 10 ≥ b > 0, a ≠ b) are simultaneously randomly selected from the 10 frames Y_{i+1} to Y_{i+10}, and the selected 90×90-pixel regions, again centered on [x, y], are resized to Z_{i+a} and Z_{i+b} of 125×125 pixels.
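A small sketch of this sampling step, assuming NumPy and OpenCV are available; the random choice of the crop center is our addition, while the crop and resize geometry follows the text.

```python
import numpy as np
import cv2

def sample_training_triplet(frames, i):
    """Build one training triplet (Z_i, Z_{i+a}, Z_{i+b}) from the resized
    200x200 frames Y. Requires at least 10 frames after index i."""
    h, w = frames[i].shape[:2]
    # random center [x, y] such that the 90x90 box stays inside the frame
    x = np.random.randint(45, w - 45)
    y = np.random.randint(45, h - 45)
    # distinct offsets a, b with 10 >= a, b > 0 and a != b
    a, b = np.random.choice(np.arange(1, 11), size=2, replace=False)

    def crop(img):
        patch = img[y - 45:y + 45, x - 45:x + 45]
        return cv2.resize(patch, (125, 125))

    return crop(frames[i]), crop(frames[i + a]), crop(frames[i + b])
```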
Further, the specific implementation manner of step 3 is as follows,
Step 3.1, template branch 1 takes the template frame Z_i as the input image frame and search branch 1 takes the search frame Z_{i+x} as the input image frame; the hyperspectral branch is removed when training the spatial branch, and only the spatial branch is trained;
Step 3.2, inputting the template frame Z_i (an RGB video frame) into template branch 1; Z_i sequentially passes through the convolution layer, nonlinear activation layer, convolution layer and local response normalization layer to obtain the feature F_t;
Step 3.3, inputting Z_{i+a} (an RGB video frame) into search branch 1; Z_{i+a} sequentially passes through the convolution layer, nonlinear activation layer, convolution layer and local response normalization layer to obtain the feature F_s;
Step 3.4, solving the ridge regression loss function

$$\min_{w} \left\| w \star F_t - H \right\|^{2} + \lambda \left\| w \right\|^{2}$$

to obtain the filter w, wherein H is an ideal Gaussian response and λ is a constant; the closed-form solution in the Fourier domain is

$$\hat{w} = \frac{\hat{H} \odot \hat{F}_t^{\,*}}{\hat{F}_t^{\,*} \odot \hat{F}_t + \lambda},$$

wherein $\hat{w}$ is the Fourier transform of w, $\hat{F}_t$ is the Fourier transform of F_t, $\hat{H}$ is the Fourier transform of H, $*$ denotes the complex conjugate, and $\odot$ denotes element-wise multiplication;
Step 3.5, calculating the final response R from the filter w and the feature F_s of the subsequent frame:

$$R = \mathcal{F}^{-1}\!\left( \hat{w}^{\,*} \odot \hat{F}_s \right),$$

wherein $\mathcal{F}^{-1}$ denotes the inverse Fourier transform;
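The two steps above are the standard closed-form DCF computation; for illustration, a NumPy sketch under that reading follows (the value of λ and the independent per-channel treatment are assumptions).

```python
import numpy as np

def dcf_solve(F_t, H, lam=1e-4):
    """Steps 3.4-3.5 in the Fourier domain. F_t: template features (C, H, W),
    H: ideal Gaussian label (H, W). Channels are treated independently here;
    a common variant instead sums the denominator over channels."""
    Ft_hat = np.fft.fft2(F_t, axes=(-2, -1))
    H_hat = np.fft.fft2(H)
    return (H_hat * np.conj(Ft_hat)) / (np.conj(Ft_hat) * Ft_hat + lam)

def dcf_response(w_hat, F_s):
    """Apply the learned filter to the search features F_s and sum the
    per-channel responses into the final real-valued response map R."""
    Fs_hat = np.fft.fft2(F_s, axes=(-2, -1))
    R = np.fft.ifft2(np.conj(w_hat) * Fs_hat, axes=(-2, -1)).real
    return R.sum(axis=0)
```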
Step 3.6, tracking forward in the order Z_i - Z_{i+a} - Z_{i+b} (the three frames form a training pair; assume b > a), obtaining the tracking responses R_{i+a} and R_{i+b}; then tracking backward in the order Z_{i+b} - Z_i, obtaining the tracking response R_i;
Step 3.7, calculating the movement weight M_motion:

$$M_{motion} = \left\| R_{i+a} - H_i \right\|^{2} + \left\| R_{i+b} - H_{i+a} \right\|^{2},$$

where H_i is the ideal Gaussian output of the initial frame Z_i, H_{i+a} is the ideal Gaussian output of Z_{i+a}, and m denotes the m different training pairs over which the weights are computed; whether the randomly initialized bounding box contains a dynamic target is judged by calculating the movement weight M_motion, wherein if a dynamic target is present, M_motion is larger than it would be without one;
Step 3.8, constructing the loss function:

$$L = \frac{1}{n} \sum_{i=1}^{n} M_{motion}^{\,i} \left\| R_i - H_i \right\|^{2},$$

where n is the maximum batch size, R_i is the tracking response from Z_{i+b} back to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are used simultaneously in training, each mini-batch being a set of training pairs and each set of training pairs containing three frames of images; the weight parameter M_motion reduces the influence of training pairs without dynamic targets on network training;
Step 3.9, updating the network model parameters by back propagation: the loss value, i.e. the loss function value L of step 3.8, is back-propagated and the network parameters of step 3.2 are updated with the stochastic gradient descent (SGD) algorithm, finally obtaining the optimized spatial branch model M_rgb.
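A compact PyTorch sketch of the cost-sensitive loss of steps 3.7-3.8, under our reading of the motion weight (the exact response/label pairing and the normalization are assumptions):

```python
import torch

def motion_weight(R_a, R_b, H_i, H_ia):
    """Step 3.7: deviation of the forward-tracking responses from the ideal
    Gaussian outputs, one scalar per training pair, normalized over pairs."""
    m = ((R_a - H_i) ** 2).flatten(1).sum(1) + ((R_b - H_ia) ** 2).flatten(1).sum(1)
    return m / m.sum() * m.numel()

def cycle_tracking_loss(R_i, H_i, M_motion):
    """Step 3.8: the backward response R_i (tracking Z_{i+b} -> Z_i) should
    match the initial Gaussian label H_i, weighted by the motion weight."""
    per_pair = ((R_i - H_i) ** 2).flatten(1).sum(1)
    return (M_motion.detach() * per_pair).mean()  # weights treated as constants
```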
Further, the implementation manner of step 4 is as follows,
Step 4.1, template branch 2 takes the template frame Z_i as the input image frame and search branch 2 takes the search frame Z_{i+x} as the input image frame; the model M_rgb of the spatial branch is loaded when training the hyperspectral branch, and the spatial branch parameters are frozen and do not participate in back propagation;
Step 4.2, inputting the template frame Z_i into template branch 2; three bands are selected from Z_i to compose a pseudo-color video frame, which passes through M_rgb to obtain the feature F_t_rgb; meanwhile, Z_i sequentially passes through the network formed by the 3 serially connected spectral feature extraction modules, whose structure is shown in FIG. 2, to obtain the feature F_t_hsi; F_t_hsi then sequentially passes through the global average pooling layer, full connection layer, nonlinear activation layer, full connection layer and Softmax to compute the weight function a of the F_t_hsi feature channels, finally giving the weighted hyperspectral feature aF_t_hsi;
Step 4.3, inputting Z_{i+a} into search branch 2; three bands are selected from Z_{i+a} to compose a pseudo-color video frame, which passes through M_rgb to obtain the feature F_s_rgb; similarly, Z_{i+a} sequentially passes through the network formed by the 3 serially connected spectral feature extraction modules (FIG. 2) to obtain the feature F_s_hsi; using the weight a computed in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained;
Step 4.4, solving the ridge regression loss function

$$\min_{w_f} \left\| w_f \star F_{t\_f} - H \right\|^{2} + \lambda \left\| w_f \right\|^{2}$$

to obtain the filter w_f, where F_t_f = aF_t_hsi + F_t_rgb, H is an ideal Gaussian response and λ is a constant; in the Fourier domain,

$$\hat{w}_f = \frac{\hat{H} \odot \hat{F}_{t\_f}^{\,*}}{\hat{F}_{t\_f}^{\,*} \odot \hat{F}_{t\_f} + \lambda},$$

wherein $\hat{w}_f$ is the Fourier transform of w_f, $\hat{F}_{t\_f}$ is the Fourier transform of F_t_f, $\hat{H}$ is the Fourier transform of H, $*$ denotes the complex conjugate, and $\odot$ denotes element-wise multiplication;
Step 4.5, calculating the final response R_f from the filter w_f and the feature F_s_f = F_s_rgb + aF_s_hsi of the subsequent frame:

$$R_f = \mathcal{F}^{-1}\!\left( \hat{w}_f^{\,*} \odot \hat{F}_{s\_f} \right),$$

wherein $\mathcal{F}^{-1}$ denotes the inverse Fourier transform;
Step 4.6, tracking forward in the order Z_i - Z_{i+a} - Z_{i+b} (b > a), obtaining the tracking responses R_{f,i+a} and R_{f,i+b}; then tracking backward in the order Z_{i+b} - Z_i, obtaining the tracking response R_{f,i};
Step 4.7, calculating the movement weight M_f_motion:

$$M_{f\_motion} = \left\| R_{f,i+a} - H_i \right\|^{2} + \left\| R_{f,i+b} - H_{i+a} \right\|^{2},$$

where H_i is the ideal Gaussian output of the initial frame Z_i and H_{i+a} is the ideal Gaussian output of Z_{i+a}; whether the randomly initialized bounding box contains a dynamic target is judged by calculating the weight parameter M_f_motion, wherein if a dynamic target is present, M_f_motion is larger than it would be without one;
Step 4.8, constructing the loss function:

$$L_f = \frac{1}{n} \sum_{i=1}^{n} M_{f\_motion}^{\,i} \left\| R_{f,i} - H_i \right\|^{2},$$

where n is the maximum batch size, R_{f,i} is the tracking response from Z_{i+b} back to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are used simultaneously in training, each mini-batch being a set of training pairs and each set of training pairs containing three frames of images; the weight parameter M_f_motion reduces the influence of training pairs without dynamic targets on network training;
Step 4.9, updating the network model parameters by back propagation: the loss value, i.e. the loss function value L_f of step 4.8, is back-propagated and the network parameters of step 4.2 are updated, finally obtaining the optimized space-hyperspectral model M_f.
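For illustration, the fusion forward pass of steps 4.2-4.5 can be sketched as follows; the function and argument names are ours, and the shapes of F_rgb and F_hsi are assumed to match.

```python
def fused_feature(cube, bands, rgb_net, hsi_net, attn, a=None):
    """Spatial-spectral fusion F_f = a * F_hsi + F_rgb. `bands` indexes the
    3 bands forming the pseudo-color frame; `a` is computed once on the
    template frame by the channel attention and reused for search frames."""
    F_rgb = rgb_net(cube[:, bands])     # frozen spatial branch M_rgb
    F_hsi = hsi_net(cube)               # 3 spectral feature extraction modules
    if a is None:                       # template frame only
        a = attn(F_hsi)
    return a * F_hsi + F_rgb, a
```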
The method of the invention has the following notable effects: (1) the unsupervised training network based on the cycle consistency principle saves labor cost; (2) the tracking model fusing RGB and hyperspectral features is trained end to end by deep learning, so inference is fast, tens of times faster than traditional hand-crafted-feature methods; (3) a channel attention mechanism is used on the initial frame to aggregate features that are more effective for the target to be tracked, increasing the network's ability to discriminate the target.
Drawings
FIG. 1 is a schematic diagram of the cycle consistency in step 4 of embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of the hyperspectral branch in step 4 of embodiment 1 of the present invention.
Fig. 3 is a schematic diagram of the spatial branch in step 3 of embodiment 1 of the present invention.
Fig. 4 is a schematic diagram of the tracking result in step 3 of embodiment 1 of the present invention, wherein the numerals denote the 4th and 12th frames and the box indicates the position and size of the tracked target; the box moves and changes with the movement and deformation of the target (when the target grows the box grows, and when the target shrinks the box shrinks).
Fig. 5 is a flowchart of embodiment 1 of the present invention.
Detailed Description
The technical scheme of the invention is further specifically described below through examples and with reference to the accompanying drawings.
Example 1:
The embodiment of the invention provides an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion, which comprises the following steps:
Step 1, preprocessing video data, wherein the step further comprises the following steps:
In step 1.1, the video data are converted into a sequence of frames X_i (RGB video frames or hyperspectral video frames).
In step 1.2, the unlabeled video image frames X_i are all resized to 200×200-pixel video image frames Y_i.
Step 2, randomly initializing a Bounding Box (BBOX), the step further comprising:
On the basis of step 1, a region of 90×90 pixels centered on coordinates [x, y] is randomly selected on the unlabeled video frame Y_i as the target to be tracked (this region is the initialized BBOX). The 90×90 region is resized to Z_i of 125×125 pixels. Two frames Y_{i+a} and Y_{i+b} (10 ≥ a > 0, 10 ≥ b > 0, a ≠ b) are simultaneously randomly selected from the 10 frames Y_{i+1} to Y_{i+10}, and the selected 90×90-pixel regions, again centered on [x, y], are resized to Z_{i+a} and Z_{i+b} of 125×125 pixels.
Step 3, unsupervised training of the RGB branch (spatial branch) using the cycle consistency principle, the step further comprising:
Step 3.1, the whole network is built on a Siamese network and divided into a template branch and a search branch. The template branch takes the template frame Z_i (containing the target to be tracked; here Z_i denotes an RGB video frame) as the input image frame and is divided into a spatial branch and a hyperspectral branch. The search branch takes the search frame Z_{i+x} (a subsequent video frame, x > 0) as the input image frame and is likewise divided into a spatial branch and a hyperspectral branch. The hyperspectral branch is removed when the spatial branch is trained, and only the spatial branch is trained.
Step 3.2, the template frame Z_i is input to the template branch, where Z_i is an RGB video frame. Z_i sequentially passes through the convolution layer, nonlinear activation layer, convolution layer and local response normalization layer to obtain the feature F_t.
Step 3.3, Z_{i+a} (assuming b > a) is input to the search branch, where Z_{i+a} is an RGB video frame. Z_{i+a} sequentially passes through the convolution layer, nonlinear activation layer, convolution layer and local response normalization layer to obtain the feature F_s.
Step 3.4, solving the ridge regression loss function:

$$\min_{w} \left\| w \star F_t - H \right\|^{2} + \lambda \left\| w \right\|^{2}$$

The resulting filter is w; H is an ideal Gaussian response and λ is a constant. The closed-form solution in the Fourier domain is

$$\hat{w} = \frac{\hat{H} \odot \hat{F}_t^{\,*}}{\hat{F}_t^{\,*} \odot \hat{F}_t + \lambda},$$

where $\hat{w}$ is the Fourier transform of w, $\hat{F}_t$ is the Fourier transform of F_t, $\hat{H}$ is the Fourier transform of H, $*$ denotes the complex conjugate, and $\odot$ denotes element-wise multiplication.
Step 3.5, the final response R is calculated from the filter w and the feature F_s of the subsequent frame:

$$R = \mathcal{F}^{-1}\!\left( \hat{w}^{\,*} \odot \hat{F}_s \right),$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform.
Step 3.6, tracking forward in the order Z_i - Z_{i+a} - Z_{i+b} (the three frames form a training pair) to obtain the tracking responses R_{i+a} and R_{i+b}; then tracking backward in the order Z_{i+b} - Z_i to obtain the tracking response R_i.
Step 3.7, calculating the movement weight M_motion:

$$M_{motion} = \left\| R_{i+a} - H_i \right\|^{2} + \left\| R_{i+b} - H_{i+a} \right\|^{2},$$

where H_i is the ideal Gaussian output of the initial frame Z_i, H_{i+a} is the ideal Gaussian output of Z_{i+a}, and m denotes the m different training pairs over which the weights are computed. Whether the randomly initialized bounding box contains a dynamic target is determined by calculating the movement weight M_motion (if a dynamic target is present, M_motion is larger than it would be without one).
Step 3.8, constructing the loss function:

$$L = \frac{1}{n} \sum_{i=1}^{n} M_{motion}^{\,i} \left\| R_i - H_i \right\|^{2},$$

where n is the maximum batch size, R_i is the tracking response from Z_{i+b} back to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are used simultaneously in training, each mini-batch being a set of training pairs and each set of training pairs containing three frames of images. The weight parameter M_motion reduces the influence of pairs without dynamic targets on network training.
Step 3.9, the network model parameters are updated by back propagation: the loss value is back-propagated and the network parameters of step 3.2 are updated with the stochastic gradient descent (SGD) algorithm, finally giving the optimized spatial branch model M_rgb.
Step 4, unsupervised training of the hyperspectral branch using the cycle consistency principle, the step further comprising:
Step 4.1, the whole network is built on a Siamese network and divided into a template branch and a search branch. The template branch takes the template frame Z_i (containing the target to be tracked; all input video frames Z_i etc. now denote hyperspectral video frames) as the input image frame and is divided into a spatial branch and a hyperspectral branch. The search branch takes the search frame Z_{i+x} (a subsequent video frame, x > 0) as the input image frame and is likewise divided into a spatial branch and a hyperspectral branch. The model M_rgb of the spatial branch is loaded when training the hyperspectral branch, and the frozen spatial branch parameters do not participate in back propagation.
Step 4.2, the template frame Z_i is input to the template branch. Three bands are selected from Z_i to compose a pseudo-color video frame, which passes through M_rgb (the spatial branch) to obtain the feature F_t_rgb. Meanwhile, Z_i sequentially passes through the network formed by the 3 serially connected spectral feature extraction modules to obtain the feature F_t_hsi; the structure of the 3 spectral feature extraction modules is shown in FIG. 2. F_t_hsi then sequentially passes through the global average pooling layer, full connection layer, nonlinear activation layer, full connection layer and Softmax (the channel attention mechanism) to compute the weight function a of the F_t_hsi feature channels (the weight function is computed only on the template frame; subsequent frames directly reuse a), finally giving the weighted hyperspectral feature aF_t_hsi.
Step 4.3, Z_{i+a} (assuming b > a) is input to the search branch. Three bands are selected from Z_{i+a} (the band composition is the same as in step 4.2) to compose a pseudo-color video frame, which passes through M_rgb to obtain the feature F_s_rgb. Similarly, Z_{i+a} sequentially passes through the network formed by the 3 serially connected spectral feature extraction modules (FIG. 2) to obtain the feature F_s_hsi. Using the weight a computed in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained.
Step 4.4, solving the ridge regression loss function:

$$\min_{w_f} \left\| w_f \star F_{t\_f} - H \right\|^{2} + \lambda \left\| w_f \right\|^{2}$$

The resulting filter is w_f, with F_t_f = aF_t_hsi + F_t_rgb; H is an ideal Gaussian response and λ is a constant. In the Fourier domain,

$$\hat{w}_f = \frac{\hat{H} \odot \hat{F}_{t\_f}^{\,*}}{\hat{F}_{t\_f}^{\,*} \odot \hat{F}_{t\_f} + \lambda},$$

where $\hat{w}_f$ is the Fourier transform of w_f, $\hat{F}_{t\_f}$ is the Fourier transform of F_t_f, $\hat{H}$ is the Fourier transform of H, $*$ denotes the complex conjugate, and $\odot$ denotes element-wise multiplication.
In step 4.5, the final response R_f is calculated from the filter w_f and the feature F_s_f = F_s_rgb + aF_s_hsi of the subsequent frame:

$$R_f = \mathcal{F}^{-1}\!\left( \hat{w}_f^{\,*} \odot \hat{F}_{s\_f} \right),$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform.
Step 4.6, tracking forward in the order Z_i - Z_{i+a} - Z_{i+b}, obtaining the tracking responses R_{f,i+a} and R_{f,i+b}; then tracking backward in the order Z_{i+b} - Z_i, obtaining the tracking response R_{f,i}.
Step 4.7, calculating the movement weight M_f_motion:

$$M_{f\_motion} = \left\| R_{f,i+a} - H_i \right\|^{2} + \left\| R_{f,i+b} - H_{i+a} \right\|^{2},$$

where H_i is the ideal Gaussian output of the initial frame Z_i and H_{i+a} is the ideal Gaussian output of Z_{i+a}. Whether the randomly initialized bounding box contains a dynamic target is determined by calculating the weight parameter M_f_motion (if a dynamic target is present, M_f_motion is larger than it would be without one).
Step 4.8, constructing the loss function:

$$L_f = \frac{1}{n} \sum_{i=1}^{n} M_{f\_motion}^{\,i} \left\| R_{f,i} - H_i \right\|^{2},$$

where n is the maximum batch size, R_{f,i} is the tracking response from Z_{i+b} back to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are used simultaneously in training, each mini-batch being a set of training pairs and each set of training pairs containing three frames of images. The weight parameter M_f_motion reduces the influence of pairs without dynamic targets on network training.
Step 4.9, the network model parameters are updated by back propagation: the loss value is back-propagated and the network parameters of step 4.2 are updated, finally giving the optimized space-hyperspectral model M_f.
Step 5, the hyperspectral video frame X_1 containing the target to be tracked is input into the template branch of the network model M_f, and the subsequent frames X_2, X_3, X_4, ..., X_i are input into the search branch of M_f, giving the tracking result of each frame.
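A sketch of this inference loop around an assumed `tracker` object that wraps M_f and the DCF (the object and method names are ours):

```python
import numpy as np

def track_sequence(frames, tracker):
    """Step 5: X_1 initializes the template branch and the filter; each
    later frame goes through the search branch, and the peak of the
    response map gives that frame's tracking result."""
    tracker.init_template(frames[0])          # X_1 -> template branch
    results = []
    for frame in frames[1:]:                  # X_2, X_3, ...
        response = tracker.search(frame)      # search branch + DCF response
        dy, dx = np.unravel_index(response.argmax(), response.shape)
        results.append((dx, dy))              # peak location = target center
    return results
```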
The method of the invention has the following notable effects: (1) the unsupervised training network based on the cycle consistency principle saves labor cost; (2) the tracking model fusing RGB and hyperspectral features is trained end to end by deep learning, so inference is fast, tens of times faster than traditional hand-crafted-feature methods; (3) a channel attention mechanism is used on the initial frame to aggregate features that are more effective for the target to be tracked, increasing the network's ability to discriminate the target. The specific embodiments described herein merely illustrate the spirit of the invention by way of example. Those skilled in the art may make various modifications, additions or substitutions to the described embodiments without departing from the spirit of the invention or exceeding the scope defined in the accompanying claims.
Claims (5)
1. An unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion is characterized by comprising the following steps:
Step 1, preprocessing video data;
Step 2, randomly initializing a bounding box, and acquiring a template frame Z_i and subsequent search frames Z_{i+x} through the initialized bounding box, wherein the template frame Z_i and the search frames Z_{i+x} are RGB video frames or hyperspectral video frames;
Step 3, training the RGB branch, also called the spatial branch, without supervision using the cycle consistency principle, finally obtaining an optimized spatial branch model M_rgb;
the spatial branch comprises a template branch 1 and a search branch 1, wherein template branch 1 takes the template frame Z_i containing the tracking target as the input image frame (Z_i here is an RGB video frame) and search branch 1 takes the search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the hyperspectral branch is removed when the spatial branch is trained, and only the spatial branch is trained;
The template branch 1 and the search branch 1 have the same structure and comprise a convolution layer, a nonlinear activation layer, a convolution layer and a local response normalization layer;
Step 4, training the hyperspectral branch without supervision using the cycle consistency principle, finally obtaining an optimized space-hyperspectral model M_f;
the hyperspectral branch comprises a template branch 2 and a search branch 2, wherein template branch 2 takes the template frame Z_i containing the tracking target as the input image frame (the template frame Z_i is a hyperspectral video frame) and search branch 2 takes the search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the model M_rgb of the spatial branch is loaded when training the hyperspectral branch, and the spatial branch parameters are frozen and do not participate in back propagation; template branch 2 comprises several serially connected spectral feature extraction modules and a channel attention module: the first two spectral feature extraction modules each comprise a convolution layer, a batch normalization layer and a nonlinear activation layer; the third spectral feature extraction module comprises a convolution layer, a batch normalization layer, a nonlinear activation layer and a second convolution layer; the channel attention module comprises a global average pooling layer, a full connection layer, a nonlinear activation layer, a full connection layer and a Softmax; search branch 2 comprises only the serially connected spectral feature extraction modules and no channel attention module;
the specific implementation manner of step 4 is as follows:
Step 4.1, template branch 2 takes the template frame Z_i as the input image frame and search branch 2 takes the search frame Z_{i+x} as the input image frame; the model M_rgb of the spatial branch is loaded when training the hyperspectral branch, and the spatial branch parameters are frozen and do not participate in back propagation;
Step 4.2, inputting the template frame Z_i into template branch 2; three bands are selected from Z_i to compose a pseudo-color video frame, which passes through M_rgb to obtain the feature F_t_rgb; meanwhile, Z_i sequentially passes through the network formed by the 3 serially connected spectral feature extraction modules to obtain the feature F_t_hsi; F_t_hsi sequentially passes through the global average pooling layer, full connection layer, nonlinear activation layer, full connection layer and Softmax to compute the weight function a of the F_t_hsi feature channels, finally giving the weighted hyperspectral feature aF_t_hsi;
Step 4.3, inputting Z_{i+a} into search branch 2; three bands are selected from Z_{i+a} to compose a pseudo-color video frame, which passes through M_rgb to obtain the feature F_s_rgb; similarly, Z_{i+a} sequentially passes through the network formed by the 3 serially connected spectral feature extraction modules to obtain the feature F_s_hsi; using the a computed in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained;
Step 4.4, obtaining a filter w_f by solving a ridge regression loss function;
Step 4.5, calculating the final response R_f from the filter w_f and the feature F_s_f = F_s_rgb + aF_s_hsi:

$$R_f = \mathcal{F}^{-1}\!\left( \hat{w}_f^{\,*} \odot \hat{F}_{s\_f} \right),$$

wherein $\mathcal{F}^{-1}$ denotes the inverse Fourier transform and $\hat{w}_f$ is the Fourier transform of w_f;
Step 4.6, tracking forward in the order Z_i - Z_{i+a} - Z_{i+b} (b > a), obtaining the tracking responses R_{f,i+a} and R_{f,i+b}; then tracking backward in the order Z_{i+b} - Z_i, obtaining the tracking response R_{f,i};
Step 4.7, calculating the movement weight M_f_motion:

$$M_{f\_motion} = \left\| R_{f,i+a} - H_i \right\|^{2} + \left\| R_{f,i+b} - H_{i+a} \right\|^{2},$$

where H_i is the ideal Gaussian output of the initial frame Z_i and H_{i+a} is the ideal Gaussian output of Z_{i+a}; whether the randomly initialized bounding box contains a dynamic target is judged by calculating the weight parameter M_f_motion, wherein if a dynamic target is present, the weight M_f_motion is larger than it would be without one;
Step 4.8, constructing the loss function:

$$L_f = \frac{1}{n} \sum_{i=1}^{n} M_{f\_motion}^{\,i} \left\| R_{f,i} - H_i \right\|^{2},$$

where n is the maximum batch size, R_{f,i} is the tracking response from Z_{i+b} back to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are used simultaneously in training, each mini-batch being a set of training pairs and each set of training pairs containing three frames of images; the weight parameter M_f_motion reduces the influence of training pairs without dynamic targets on network training;
Step 4.9, updating the network model parameters by back propagation: the loss value, i.e. the loss function value L_f of step 4.8, is back-propagated and the network parameters of step 4.2 are updated, finally obtaining the optimized space-hyperspectral model M_f;
Step 5, inputting the hyperspectral video frame X_1 containing the target to be tracked into the template branch of the network model M_f, and sequentially inputting the subsequent video frames X_2, X_3, X_4, ..., X_i into the search branch of M_f to obtain the tracking result of each frame.
2. The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion as set forth in claim 1, wherein the specific implementation of step 1 is as follows,
Firstly, the video data are converted into a sequence of frames X_i, where X_i is an RGB video frame or a hyperspectral video frame;
the unlabeled video image frames X_i are then all resized to video image frames Y_i.
3. The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion as set forth in claim 1, wherein the implementation of step 2 is as follows,
On the basis of step 1, a region of 90×90 pixels centered on coordinates [x, y] is selected on the unlabeled video frame Y_i as the target to be tracked; this region is the initialized BBOX; the 90×90 region is resized to Z_i of 125×125 pixels; two frames Y_{i+a} and Y_{i+b} (10 ≥ a > 0, 10 ≥ b > 0, a ≠ b) are randomly selected from the 10 frames Y_{i+1} to Y_{i+10}, and the selected 90×90-pixel regions, again centered on [x, y], are resized to Z_{i+a} and Z_{i+b} of 125×125 pixels.
4. The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion as set forth in claim 1, wherein the specific implementation manner of step 3 is as follows,
Step 3.1, template branch 1 takes the template frame Z_i as the input image frame and search branch 1 takes the search frame Z_{i+x} as the input image frame; the hyperspectral branch is removed when training the spatial branch, and only the spatial branch is trained;
Step 3.2, inputting the template frame Z_i (an RGB video frame) into template branch 1; Z_i sequentially passes through the convolution layer, nonlinear activation layer, convolution layer and local response normalization layer to obtain the feature F_t;
Step 3.3, inputting Z_{i+a} (an RGB video frame) into search branch 1; Z_{i+a} sequentially passes through the convolution layer, nonlinear activation layer, convolution layer and local response normalization layer to obtain the feature F_s;
Step 3.4, solving the ridge regression loss function

$$\min_{w} \left\| w \star F_t - H \right\|^{2} + \lambda \left\| w \right\|^{2}$$

to obtain the filter w, wherein H is an ideal Gaussian response and λ is a constant; the closed-form solution in the Fourier domain is

$$\hat{w} = \frac{\hat{H} \odot \hat{F}_t^{\,*}}{\hat{F}_t^{\,*} \odot \hat{F}_t + \lambda},$$

wherein $\hat{w}$ is the Fourier transform of w, $\hat{F}_t$ is the Fourier transform of F_t, $\hat{H}$ is the Fourier transform of H, $*$ denotes the complex conjugate, and $\odot$ denotes element-wise multiplication;
Step 3.5, calculating the final response R from the filter w and the feature F_s of the subsequent frame:

$$R = \mathcal{F}^{-1}\!\left( \hat{w}^{\,*} \odot \hat{F}_s \right),$$

wherein $\mathcal{F}^{-1}$ denotes the inverse Fourier transform;
Step 3.6, tracking forward in the order Z_i - Z_{i+a} - Z_{i+b} (the three frames form a training pair; b > a), obtaining the tracking responses R_{i+a} and R_{i+b}; then tracking backward in the order Z_{i+b} - Z_i, obtaining the tracking response R_i;
Step 3.7, calculating the movement weight M_motion:

$$M_{motion} = \left\| R_{i+a} - H_i \right\|^{2} + \left\| R_{i+b} - H_{i+a} \right\|^{2},$$

where H_i is the ideal Gaussian output of the initial frame Z_i, H_{i+a} is the ideal Gaussian output of Z_{i+a}, and m denotes the m different training pairs over which the weights are computed; whether the randomly initialized bounding box contains a dynamic target is judged by calculating the movement weight M_motion, wherein if a dynamic target is present, M_motion is larger than it would be without one;
Step 3.8, constructing the loss function:

$$L = \frac{1}{n} \sum_{i=1}^{n} M_{motion}^{\,i} \left\| R_i - H_i \right\|^{2},$$

where n is the maximum batch size, R_i is the tracking response from Z_{i+b} back to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are used simultaneously in training, each mini-batch being a set of training pairs and each set of training pairs containing three frames of images; the weight parameter M_motion reduces the influence of training pairs without dynamic targets on network training;
Step 3.9, updating the network model parameters by back propagation: the loss value, i.e. the loss function value L of step 3.8, is back-propagated and the network parameters of step 3.2 are updated with the stochastic gradient descent algorithm, finally obtaining the optimized spatial branch model M_rgb.
5. The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion as set forth in claim 1, wherein:
The calculation formula of the ridge regression loss function in step 4.4 is as follows:

$$\min_{w_f} \left\| w_f \star F_{t\_f} - H \right\|^{2} + \lambda \left\| w_f \right\|^{2}$$

Solving the ridge regression loss function gives the filter w_f, in the Fourier domain:

$$\hat{w}_f = \frac{\hat{H} \odot \hat{F}_{t\_f}^{\,*}}{\hat{F}_{t\_f}^{\,*} \odot \hat{F}_{t\_f} + \lambda},$$

wherein F_t_f = aF_t_hsi + F_t_rgb, H is an ideal Gaussian response, λ is a constant, $\hat{w}_f$ is the Fourier transform of w_f, $\hat{F}_{t\_f}$ is the Fourier transform of F_t_f, $\hat{H}$ is the Fourier transform of H, and $*$ denotes the complex conjugate.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110018918.9A (CN112766102B) | 2021-01-07 | 2021-01-07 | Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112766102A | 2021-05-07 |
| CN112766102B | 2024-04-26 |
Family
Family ID: 75700670
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110018918.9A (CN112766102B, active) | 2021-01-07 | 2021-01-07 | Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN112766102B (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113344932B * | 2021-06-01 | 2022-05-03 | University of Electronic Science and Technology of China | Semi-supervised single-target video segmentation method |
| CN113628244B * | 2021-07-05 | 2023-11-28 | Shanghai Jiao Tong University | Target tracking method, system, terminal and medium based on label-free video training |
| CN117689692A * | 2023-12-20 | 2024-03-12 | Naval Aviation University of the Chinese People's Liberation Army | Attention mechanism guiding matching associated hyperspectral and RGB video fusion tracking method |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107038684A * | 2017-04-10 | 2017-08-11 | Nanjing University of Information Science and Technology | A kind of method for lifting TMI spatial resolution |
| CN108765280A * | 2018-03-30 | 2018-11-06 | Xu Guoming | A kind of high spectrum image spatial resolution enhancement method |
| CN110210551A * | 2019-05-28 | 2019-09-06 | Beijing University of Technology | A kind of visual target tracking method based on adaptive main body sensitivity |
| CN111062888A * | 2019-12-16 | 2020-04-24 | Wuhan University | Hyperspectral image denoising method based on multi-target low-rank sparsity and spatial-spectral total variation |
| CN111325116A * | 2020-02-05 | 2020-06-23 | Wuhan University | Remote sensing image target detection method capable of evolving based on offline training-online learning depth |
| CN111724411A * | 2020-05-26 | 2020-09-29 | Zhejiang University of Technology | Multi-feature fusion tracking method based on hedging algorithm |
| WO2020199205A1 * | 2019-04-04 | 2020-10-08 | 合刃科技(深圳)有限公司 | Hybrid hyperspectral image reconstruction method and system |
| CN111797716A * | 2020-06-16 | 2020-10-20 | University of Electronic Science and Technology of China | Single target tracking method based on Siamese network |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110060274A * | 2019-04-12 | 2019-07-26 | 北京影谱科技股份有限公司 | The visual target tracking method and device of neural network based on the dense connection of depth |
Non-Patent Citations (1)
| Title |
|---|
| RGB and hyperspectral image fusion based on superpixel segmentation; Hong Ke; Electronic Technology & Software Engineering, No. 03; full text * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112766102A | 2021-05-07 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |