CN112766102B - Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion - Google Patents
- Publication number
- CN112766102B (application CN202110018918.9A)
- Authority
- CN
- China
- Prior art keywords
- frame
- branch
- hyperspectral
- tracking
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/48—Matching video sequences
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A40/00—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
- Y02A40/10—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture
Abstract
The invention relates to an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion. Combining the cycle consistency principle, the invention designs a deep-learning-based hyperspectral target tracking method that can train a hyperspectral target tracking deep learning model without supervision, saving the cost of manual labeling. On the basis of the Siamese tracking framework, an RGB branch (spatial branch) and a hyperspectral branch are designed; the spatial branch is trained with RGB video data, the trained RGB model is then loaded into the network with its parameters fixed, and the hyperspectral branch is trained at the same time, so that the fused features are more robust and discriminative. The fused features are finally input into a discriminative correlation filter (DCF) to obtain the tracking result. The method solves the problem of manual labeling of hyperspectral video data and the problem of scarce hyperspectral training samples for training deep learning models, and can effectively improve the precision and speed of hyperspectral video tracking models.
Description
Technical Field
The invention relates to the field of computer vision processing, in particular to an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion.
Background
Hyperspectral video (high spatial resolution, high temporal resolution, high spectral resolution) object tracking is an emerging direction that aims to predict the state of an object in subsequent frames using the object information of a given initial frame in a hyperspectral video. Compared with RGB video object tracking, hyperspectral video object tracking provides, in addition to spatial information, spectral information that can distinguish different materials. Even if two targets have the same shape, a target can still be tracked in hyperspectral video as long as the materials differ, an advantage that RGB video target tracking does not possess. Hyperspectral video target tracking can therefore play an important role in fields such as camouflaged target tracking and small target tracking, and it has attracted increasing attention from researchers.
At the same time, hyperspectral video object tracking is a difficult task. Firstly, existing hyperspectral video target tracking algorithms use traditional hand-crafted features to represent the target, which limits their performance. Secondly, hyperspectral video must be shot with a dedicated hyperspectral camera and training samples are scarce, so no deep-learning-based hyperspectral video tracking algorithm currently exists. Thirdly, supervised deep learning algorithms require large numbers of manually annotated samples, and video annotation in particular is time-consuming and laborious. Because of these problems, current hyperspectral video object tracking algorithms tend to perform poorly.
Disclosure of Invention
The invention aims to provide an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion.
The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion has the following three notable characteristics. Firstly, by using the cycle consistency principle, the whole deep-learning-based hyperspectral target tracking algorithm is trained without supervision and without any manual labeling. Secondly, a correlation-filter-based hyperspectral video target tracking framework fusing spatial and spectral features is designed, which alleviates to some extent the scarcity of hyperspectral video training samples, while fusing RGB and hyperspectral features into features that are more robust and discriminative. Thirdly, a channel attention module is designed whose feature-channel weights are computed only on the initial frame, so that the network can dynamically aggregate different feature-channel weights for different targets.
The invention provides an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion, which comprises the following implementation steps:
Step 1, preprocessing video data;
Step 2, randomly initializing a bounding box, and acquiring a template frame Z_i and subsequent search frames Z_{i+x} through the initialized bounding box, wherein the template frame Z_i and the search frames Z_{i+x} are RGB video frames or hyperspectral video frames;
Step 3, training the RGB branch, also called the spatial branch, without supervision using the cycle consistency principle, finally obtaining an optimized spatial branch model, denoted M_rgb below;
The spatial branch comprises a template branch 1 and a search branch 1, wherein template branch 1 takes the template frame Z_i containing the tracking target as the input image frame (Z_i here is an RGB video frame) and search branch 1 takes the search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the hyperspectral branch is removed when the spatial branch is trained, and only the spatial branch is trained;
Template branch 1 and search branch 1 have the same structure, comprising a convolution layer, a nonlinear activation layer, a convolution layer and a local response normalization layer;
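For illustration, a minimal PyTorch sketch of this shared two-convolution structure is given below. Only the layer types are specified by the method; the channel counts, kernel sizes and LRN neighborhood here are assumptions.

```python
import torch.nn as nn

class SpatialBranch(nn.Module):
    """Conv -> ReLU -> Conv -> LocalResponseNorm, shared (Siamese weights)
    by template branch 1 and search branch 1. Channel counts and kernel
    sizes are illustrative assumptions, not values from the patent."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.LocalResponseNorm(size=5),
        )

    def forward(self, x):
        return self.features(x)
```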
Step 4, training the hyperspectral branch without supervision using the cycle consistency principle, finally obtaining an optimized space-hyperspectral model, denoted M_f below;
The hyperspectral branch comprises a template branch 2 and a search branch 2, wherein template branch 2 takes the template frame Z_i containing the tracking target as the input image frame and search branch 2 takes the search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the model M_rgb of the spatial branch is loaded when training the hyperspectral branch, and the spatial branch parameters are frozen and do not participate in back propagation;
Template branch 2 comprises several serially connected spectral feature extraction modules and a channel attention module: the first two spectral feature extraction modules each comprise a convolution layer, a batch normalization layer and a nonlinear activation layer; the third spectral feature extraction module comprises a convolution layer, a batch normalization layer, a nonlinear activation layer and a second convolution layer; the channel attention module comprises a global average pooling layer, a full connection layer, a nonlinear activation layer, a full connection layer and a Softmax; search branch 2 comprises only the serially connected spectral feature extraction modules and no channel attention module;
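For illustration, a minimal PyTorch sketch of the spectral feature extraction module and the channel attention module follows; the channel widths and the reduction ratio are assumptions, since the text only names the layer types.

```python
import torch.nn as nn

class SpectralFeatureModule(nn.Module):
    """Conv -> BatchNorm -> ReLU; the third module in the series appends a
    second convolution, per the description. Widths are assumptions."""
    def __init__(self, cin, cout, extra_conv=False):
        super().__init__()
        layers = [nn.Conv2d(cin, cout, 3, padding=1),
                  nn.BatchNorm2d(cout),
                  nn.ReLU(inplace=True)]
        if extra_conv:
            layers.append(nn.Conv2d(cout, cout, 3, padding=1))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class ChannelAttention(nn.Module):
    """Global average pooling -> FC -> ReLU -> FC -> Softmax over channels,
    producing the per-channel weight function a (template branch 2 only)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Softmax(dim=1),
        )

    def forward(self, f):                      # f: (B, C, H, W)
        a = self.fc(self.pool(f).flatten(1))   # (B, C) channel weights
        return a.view(f.size(0), -1, 1, 1)     # broadcastable over H, W
```

Under these assumptions, a B-band cube could pass through `SpectralFeatureModule(B, 32)`, `SpectralFeatureModule(32, 32)` and `SpectralFeatureModule(32, 32, extra_conv=True)` in series, followed by `ChannelAttention(32)` in template branch 2 only.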
Step 5, inputting the hyperspectral video frame X_1 containing the target to be tracked into the template branch of the network model M_f, and inputting the subsequent frames X_2, X_3, X_4, ..., X_i into the search branch of M_f to obtain the tracking result of each frame.
Further, the specific implementation manner of step 1 is as follows,
Firstly, the video data are converted into a sequence of frames X_i, where X_i is an RGB video frame or a hyperspectral video frame;
The unlabeled video image frames X_i are then all resized to 200×200-pixel video image frames Y_i.
Further, the implementation manner of step 2 is as follows,
On the basis of step 1, a region of 90×90 pixels centered on coordinates [x, y] is selected on the unlabeled video frame Y_i as the target to be tracked; this region is the initialized BBOX; the 90×90 region is resized to Z_i of 125×125 pixels; two frames Y_{i+a} and Y_{i+b} (10 ≥ a > 0, 10 ≥ b > 0, a ≠ b) are simultaneously randomly selected from the 10 frames Y_{i+1} to Y_{i+10}, and the selected 90×90-pixel regions, again centered on [x, y], are resized to Z_{i+a} and Z_{i+b} of 125×125 pixels.
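A small sketch of this sampling step, assuming NumPy and OpenCV are available; the random choice of the crop center is our addition, while the crop and resize geometry follows the text.

```python
import numpy as np
import cv2

def sample_training_triplet(frames, i):
    """Build one training triplet (Z_i, Z_{i+a}, Z_{i+b}) from the resized
    200x200 frames Y. Requires at least 10 frames after index i."""
    h, w = frames[i].shape[:2]
    # random center [x, y] such that the 90x90 box stays inside the frame
    x = np.random.randint(45, w - 45)
    y = np.random.randint(45, h - 45)
    # distinct offsets a, b with 10 >= a, b > 0 and a != b
    a, b = np.random.choice(np.arange(1, 11), size=2, replace=False)

    def crop(img):
        patch = img[y - 45:y + 45, x - 45:x + 45]
        return cv2.resize(patch, (125, 125))

    return crop(frames[i]), crop(frames[i + a]), crop(frames[i + b])
```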
Further, the specific implementation manner of step 3 is as follows,
Step 3.1, template branch 1 takes the template frame Z_i as the input image frame and search branch 1 takes the search frame Z_{i+x} as the input image frame; the hyperspectral branch is removed when training the spatial branch, and only the spatial branch is trained;
Step 3.2, inputting the template frame Z_i (an RGB video frame) into template branch 1; Z_i sequentially passes through the convolution layer, nonlinear activation layer, convolution layer and local response normalization layer to obtain the feature F_t;
Step 3.3, inputting Z_{i+a} (an RGB video frame) into search branch 1; Z_{i+a} sequentially passes through the convolution layer, nonlinear activation layer, convolution layer and local response normalization layer to obtain the feature F_s;
Step 3.4, solving the ridge regression loss function

$$\min_{w} \left\| w \star F_t - H \right\|^{2} + \lambda \left\| w \right\|^{2}$$

to obtain the filter w, wherein H is an ideal Gaussian response and λ is a constant; the closed-form solution in the Fourier domain is

$$\hat{w} = \frac{\hat{H} \odot \hat{F}_t^{\,*}}{\hat{F}_t^{\,*} \odot \hat{F}_t + \lambda},$$

wherein $\hat{w}$ is the Fourier transform of w, $\hat{F}_t$ is the Fourier transform of F_t, $\hat{H}$ is the Fourier transform of H, $*$ denotes the complex conjugate, and $\odot$ denotes element-wise multiplication;
Step 3.5, calculating the final response R from the filter w and the feature F_s of the subsequent frame:

$$R = \mathcal{F}^{-1}\!\left( \hat{w}^{\,*} \odot \hat{F}_s \right),$$

wherein $\mathcal{F}^{-1}$ denotes the inverse Fourier transform;
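The two steps above are the standard closed-form DCF computation; for illustration, a NumPy sketch under that reading follows (the value of λ and the independent per-channel treatment are assumptions).

```python
import numpy as np

def dcf_solve(F_t, H, lam=1e-4):
    """Steps 3.4-3.5 in the Fourier domain. F_t: template features (C, H, W),
    H: ideal Gaussian label (H, W). Channels are treated independently here;
    a common variant instead sums the denominator over channels."""
    Ft_hat = np.fft.fft2(F_t, axes=(-2, -1))
    H_hat = np.fft.fft2(H)
    return (H_hat * np.conj(Ft_hat)) / (np.conj(Ft_hat) * Ft_hat + lam)

def dcf_response(w_hat, F_s):
    """Apply the learned filter to the search features F_s and sum the
    per-channel responses into the final real-valued response map R."""
    Fs_hat = np.fft.fft2(F_s, axes=(-2, -1))
    R = np.fft.ifft2(np.conj(w_hat) * Fs_hat, axes=(-2, -1)).real
    return R.sum(axis=0)
```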
Step 3.6, tracking forward in the order Z_i - Z_{i+a} - Z_{i+b} (the three frames form a training pair; assume b > a), obtaining the tracking responses R_{i+a} and R_{i+b}; then tracking backward in the order Z_{i+b} - Z_i, obtaining the tracking response R_i;
Step 3.7, calculating the movement weight M_motion:

$$M_{motion} = \left\| R_{i+a} - H_i \right\|^{2} + \left\| R_{i+b} - H_{i+a} \right\|^{2},$$

where H_i is the ideal Gaussian output of the initial frame Z_i, H_{i+a} is the ideal Gaussian output of Z_{i+a}, and m denotes the m different training pairs over which the weights are computed; whether the randomly initialized bounding box contains a dynamic target is judged by calculating the movement weight M_motion, wherein if a dynamic target is present, M_motion is larger than it would be without one;
Step 3.8, constructing the loss function:

$$L = \frac{1}{n} \sum_{i=1}^{n} M_{motion}^{\,i} \left\| R_i - H_i \right\|^{2},$$

where n is the maximum batch size, R_i is the tracking response from Z_{i+b} back to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are used simultaneously in training, each mini-batch being a set of training pairs and each set of training pairs containing three frames of images; the weight parameter M_motion reduces the influence of training pairs without dynamic targets on network training;
Step 3.9, updating the network model parameters by back propagation: the loss value, i.e. the loss function value L of step 3.8, is back-propagated and the network parameters of step 3.2 are updated with the stochastic gradient descent (SGD) algorithm, finally obtaining the optimized spatial branch model M_rgb.
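A compact PyTorch sketch of the cost-sensitive loss of steps 3.7-3.8, under our reading of the motion weight (the exact response/label pairing and the normalization are assumptions):

```python
import torch

def motion_weight(R_a, R_b, H_i, H_ia):
    """Step 3.7: deviation of the forward-tracking responses from the ideal
    Gaussian outputs, one scalar per training pair, normalized over pairs."""
    m = ((R_a - H_i) ** 2).flatten(1).sum(1) + ((R_b - H_ia) ** 2).flatten(1).sum(1)
    return m / m.sum() * m.numel()

def cycle_tracking_loss(R_i, H_i, M_motion):
    """Step 3.8: the backward response R_i (tracking Z_{i+b} -> Z_i) should
    match the initial Gaussian label H_i, weighted by the motion weight."""
    per_pair = ((R_i - H_i) ** 2).flatten(1).sum(1)
    return (M_motion.detach() * per_pair).mean()  # weights treated as constants
```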
Further, the implementation manner of step 4 is as follows,
Step 4.1, template branch 2 takes the template frame Z_i as the input image frame and search branch 2 takes the search frame Z_{i+x} as the input image frame; the model M_rgb of the spatial branch is loaded when training the hyperspectral branch, and the spatial branch parameters are frozen and do not participate in back propagation;
Step 4.2, inputting the template frame Z_i into template branch 2; three bands are selected from Z_i to compose a pseudo-color video frame, which passes through M_rgb to obtain the feature F_t_rgb; meanwhile, Z_i sequentially passes through the network formed by the 3 serially connected spectral feature extraction modules, whose structure is shown in FIG. 2, to obtain the feature F_t_hsi; F_t_hsi then sequentially passes through the global average pooling layer, full connection layer, nonlinear activation layer, full connection layer and Softmax to compute the weight function a of the F_t_hsi feature channels, finally giving the weighted hyperspectral feature aF_t_hsi;
Step 4.3, inputting Z_{i+a} into search branch 2; three bands are selected from Z_{i+a} to compose a pseudo-color video frame, which passes through M_rgb to obtain the feature F_s_rgb; similarly, Z_{i+a} sequentially passes through the network formed by the 3 serially connected spectral feature extraction modules (FIG. 2) to obtain the feature F_s_hsi; using the weight a computed in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained;
Step 4.4, solving the ridge regression loss function

$$\min_{w_f} \left\| w_f \star F_{t\_f} - H \right\|^{2} + \lambda \left\| w_f \right\|^{2}$$

to obtain the filter w_f, where F_t_f = aF_t_hsi + F_t_rgb, H is an ideal Gaussian response and λ is a constant; in the Fourier domain,

$$\hat{w}_f = \frac{\hat{H} \odot \hat{F}_{t\_f}^{\,*}}{\hat{F}_{t\_f}^{\,*} \odot \hat{F}_{t\_f} + \lambda},$$

wherein $\hat{w}_f$ is the Fourier transform of w_f, $\hat{F}_{t\_f}$ is the Fourier transform of F_t_f, $\hat{H}$ is the Fourier transform of H, $*$ denotes the complex conjugate, and $\odot$ denotes element-wise multiplication;
Step 4.5, calculating the final response R_f from the filter w_f and the feature F_s_f = F_s_rgb + aF_s_hsi of the subsequent frame:

$$R_f = \mathcal{F}^{-1}\!\left( \hat{w}_f^{\,*} \odot \hat{F}_{s\_f} \right),$$

wherein $\mathcal{F}^{-1}$ denotes the inverse Fourier transform;
Step 4.6, tracking forward in the order Z_i - Z_{i+a} - Z_{i+b} (b > a), obtaining the tracking responses R_{f,i+a} and R_{f,i+b}; then tracking backward in the order Z_{i+b} - Z_i, obtaining the tracking response R_{f,i};
Step 4.7, calculating the movement weight M_f_motion:

$$M_{f\_motion} = \left\| R_{f,i+a} - H_i \right\|^{2} + \left\| R_{f,i+b} - H_{i+a} \right\|^{2},$$

where H_i is the ideal Gaussian output of the initial frame Z_i and H_{i+a} is the ideal Gaussian output of Z_{i+a}; whether the randomly initialized bounding box contains a dynamic target is judged by calculating the weight parameter M_f_motion, wherein if a dynamic target is present, M_f_motion is larger than it would be without one;
Step 4.8, constructing the loss function:

$$L_f = \frac{1}{n} \sum_{i=1}^{n} M_{f\_motion}^{\,i} \left\| R_{f,i} - H_i \right\|^{2},$$

where n is the maximum batch size, R_{f,i} is the tracking response from Z_{i+b} back to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are used simultaneously in training, each mini-batch being a set of training pairs and each set of training pairs containing three frames of images; the weight parameter M_f_motion reduces the influence of training pairs without dynamic targets on network training;
Step 4.9, updating the network model parameters by back propagation: the loss value, i.e. the loss function value L_f of step 4.8, is back-propagated and the network parameters of step 4.2 are updated, finally obtaining the optimized space-hyperspectral model M_f.
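For illustration, the fusion forward pass of steps 4.2-4.5 can be sketched as follows; the function and argument names are ours, and the shapes of F_rgb and F_hsi are assumed to match.

```python
def fused_feature(cube, bands, rgb_net, hsi_net, attn, a=None):
    """Spatial-spectral fusion F_f = a * F_hsi + F_rgb. `bands` indexes the
    3 bands forming the pseudo-color frame; `a` is computed once on the
    template frame by the channel attention and reused for search frames."""
    F_rgb = rgb_net(cube[:, bands])     # frozen spatial branch M_rgb
    F_hsi = hsi_net(cube)               # 3 spectral feature extraction modules
    if a is None:                       # template frame only
        a = attn(F_hsi)
    return a * F_hsi + F_rgb, a
```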
The method of the invention has the following notable effects: (1) the unsupervised training network based on the cycle consistency principle saves labor cost; (2) the tracking model fusing RGB and hyperspectral features is trained end to end by deep learning, so inference is fast, tens of times faster than traditional hand-crafted-feature methods; (3) a channel attention mechanism is used on the initial frame to aggregate features that are more effective for the target to be tracked, increasing the network's ability to discriminate the target.
Drawings
FIG. 1 is a schematic diagram of the cycle consistency in step 4 of embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of the hyperspectral branch in step 4 of embodiment 1 of the present invention.
Fig. 3 is a schematic diagram of the spatial branch in step 3 of embodiment 1 of the present invention.
Fig. 4 is a schematic diagram of the tracking result in step 3 of embodiment 1 of the present invention, wherein the numerals denote the 4th and 12th frames and the box indicates the position and size of the tracked target; the box moves and changes with the movement and deformation of the target (when the target grows the box grows, and when the target shrinks the box shrinks).
Fig. 5 is a flowchart of embodiment 1 of the present invention.
Detailed Description
The technical scheme of the invention is further specifically described below through examples and with reference to the accompanying drawings.
Example 1:
The embodiment of the invention provides an unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion, which comprises the following steps:
Step 1, preprocessing video data, wherein the step further comprises the following steps:
In step 1.1, the video data are converted into a sequence of frames X_i (RGB video frames or hyperspectral video frames).
In step 1.2, the unlabeled video image frames X_i are all resized to 200×200-pixel video image frames Y_i.
Step 2, randomly initializing a Bounding Box (BBOX), the step further comprising:
On the basis of step 1, a region of 90×90 pixels centered on coordinates [x, y] is randomly selected on the unlabeled video frame Y_i as the target to be tracked (this region is the initialized BBOX). The 90×90 region is resized to Z_i of 125×125 pixels. Two frames Y_{i+a} and Y_{i+b} (10 ≥ a > 0, 10 ≥ b > 0, a ≠ b) are simultaneously randomly selected from the 10 frames Y_{i+1} to Y_{i+10}, and the selected 90×90-pixel regions, again centered on [x, y], are resized to Z_{i+a} and Z_{i+b} of 125×125 pixels.
Step 3, unsupervised training of the RGB branch (spatial branch) using the cycle consistency principle, the step further comprising:
Step 3.1, the whole network is built on a Siamese network and divided into a template branch and a search branch. The template branch takes the template frame Z_i (containing the target to be tracked; here Z_i denotes an RGB video frame) as the input image frame and is divided into a spatial branch and a hyperspectral branch. The search branch takes the search frame Z_{i+x} (a subsequent video frame, x > 0) as the input image frame and is likewise divided into a spatial branch and a hyperspectral branch. The hyperspectral branch is removed when the spatial branch is trained, and only the spatial branch is trained.
Step 3.2, the template frame Z_i is input to the template branch, where Z_i is an RGB video frame. Z_i sequentially passes through the convolution layer, nonlinear activation layer, convolution layer and local response normalization layer to obtain the feature F_t.
Step 3.3, Z_{i+a} (assuming b > a) is input to the search branch, where Z_{i+a} is an RGB video frame. Z_{i+a} sequentially passes through the convolution layer, nonlinear activation layer, convolution layer and local response normalization layer to obtain the feature F_s.
Step 3.4, solving the ridge regression loss function:

$$\min_{w} \left\| w \star F_t - H \right\|^{2} + \lambda \left\| w \right\|^{2}$$

The resulting filter is w; H is an ideal Gaussian response and λ is a constant. The closed-form solution in the Fourier domain is

$$\hat{w} = \frac{\hat{H} \odot \hat{F}_t^{\,*}}{\hat{F}_t^{\,*} \odot \hat{F}_t + \lambda},$$

where $\hat{w}$ is the Fourier transform of w, $\hat{F}_t$ is the Fourier transform of F_t, $\hat{H}$ is the Fourier transform of H, $*$ denotes the complex conjugate, and $\odot$ denotes element-wise multiplication.
Step 3.5, the final response R is calculated from the filter w and the feature F_s of the subsequent frame:

$$R = \mathcal{F}^{-1}\!\left( \hat{w}^{\,*} \odot \hat{F}_s \right),$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform.
Step 3.6, tracking forward in the order Z_i - Z_{i+a} - Z_{i+b} (the three frames form a training pair) to obtain the tracking responses R_{i+a} and R_{i+b}; then tracking backward in the order Z_{i+b} - Z_i to obtain the tracking response R_i.
Step 3.7, calculating the movement weight M_motion:

$$M_{motion} = \left\| R_{i+a} - H_i \right\|^{2} + \left\| R_{i+b} - H_{i+a} \right\|^{2},$$

where H_i is the ideal Gaussian output of the initial frame Z_i, H_{i+a} is the ideal Gaussian output of Z_{i+a}, and m denotes the m different training pairs over which the weights are computed. Whether the randomly initialized bounding box contains a dynamic target is determined by calculating the movement weight M_motion (if a dynamic target is present, M_motion is larger than it would be without one).
Step 3.8, constructing the loss function:

$$L = \frac{1}{n} \sum_{i=1}^{n} M_{motion}^{\,i} \left\| R_i - H_i \right\|^{2},$$

where n is the maximum batch size, R_i is the tracking response from Z_{i+b} back to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are used simultaneously in training, each mini-batch being a set of training pairs and each set of training pairs containing three frames of images. The weight parameter M_motion reduces the influence of pairs without dynamic targets on network training.
Step 3.9, the network model parameters are updated by back propagation: the loss value is back-propagated and the network parameters of step 3.2 are updated with the stochastic gradient descent (SGD) algorithm, finally giving the optimized spatial branch model M_rgb.
Step 4, unsupervised training of the hyperspectral branch using the cycle consistency principle, the step further comprising:
Step 4.1, the whole network is built on a Siamese network and divided into a template branch and a search branch. The template branch takes the template frame Z_i (containing the target to be tracked; all input video frames Z_i etc. now denote hyperspectral video frames) as the input image frame and is divided into a spatial branch and a hyperspectral branch. The search branch takes the search frame Z_{i+x} (a subsequent video frame, x > 0) as the input image frame and is likewise divided into a spatial branch and a hyperspectral branch. The model M_rgb of the spatial branch is loaded when training the hyperspectral branch, and the frozen spatial branch parameters do not participate in back propagation.
Step 4.2, the template frame Z_i is input to the template branch. Three bands are selected from Z_i to compose a pseudo-color video frame, which passes through M_rgb (the spatial branch) to obtain the feature F_t_rgb. Meanwhile, Z_i sequentially passes through the network formed by the 3 serially connected spectral feature extraction modules to obtain the feature F_t_hsi; the structure of the 3 spectral feature extraction modules is shown in FIG. 2. F_t_hsi then sequentially passes through the global average pooling layer, full connection layer, nonlinear activation layer, full connection layer and Softmax (the channel attention mechanism) to compute the weight function a of the F_t_hsi feature channels (the weight function is computed only on the template frame; subsequent frames directly reuse a), finally giving the weighted hyperspectral feature aF_t_hsi.
Step 4.3, Z_{i+a} (assuming b > a) is input to the search branch. Three bands are selected from Z_{i+a} (the band composition is the same as in step 4.2) to compose a pseudo-color video frame, which passes through M_rgb to obtain the feature F_s_rgb. Similarly, Z_{i+a} sequentially passes through the network formed by the 3 serially connected spectral feature extraction modules (FIG. 2) to obtain the feature F_s_hsi. Using the weight a computed in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained.
Step 4.4, solving the ridge regression loss function:

$$\min_{w_f} \left\| w_f \star F_{t\_f} - H \right\|^{2} + \lambda \left\| w_f \right\|^{2}$$

The resulting filter is w_f, with F_t_f = aF_t_hsi + F_t_rgb; H is an ideal Gaussian response and λ is a constant. In the Fourier domain,

$$\hat{w}_f = \frac{\hat{H} \odot \hat{F}_{t\_f}^{\,*}}{\hat{F}_{t\_f}^{\,*} \odot \hat{F}_{t\_f} + \lambda},$$

where $\hat{w}_f$ is the Fourier transform of w_f, $\hat{F}_{t\_f}$ is the Fourier transform of F_t_f, $\hat{H}$ is the Fourier transform of H, $*$ denotes the complex conjugate, and $\odot$ denotes element-wise multiplication.
In step 4.5, the final response R_f is calculated from the filter w_f and the feature F_s_f = F_s_rgb + aF_s_hsi of the subsequent frame:

$$R_f = \mathcal{F}^{-1}\!\left( \hat{w}_f^{\,*} \odot \hat{F}_{s\_f} \right),$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform.
Step 4.6, tracking forward in the order Z_i - Z_{i+a} - Z_{i+b}, obtaining the tracking responses R_{f,i+a} and R_{f,i+b}; then tracking backward in the order Z_{i+b} - Z_i, obtaining the tracking response R_{f,i}.
Step 4.7, calculating the movement weight M_f_motion:

$$M_{f\_motion} = \left\| R_{f,i+a} - H_i \right\|^{2} + \left\| R_{f,i+b} - H_{i+a} \right\|^{2},$$

where H_i is the ideal Gaussian output of the initial frame Z_i and H_{i+a} is the ideal Gaussian output of Z_{i+a}. Whether the randomly initialized bounding box contains a dynamic target is determined by calculating the weight parameter M_f_motion (if a dynamic target is present, M_f_motion is larger than it would be without one).
Step 4.8, constructing the loss function:

$$L_f = \frac{1}{n} \sum_{i=1}^{n} M_{f\_motion}^{\,i} \left\| R_{f,i} - H_i \right\|^{2},$$

where n is the maximum batch size, R_{f,i} is the tracking response from Z_{i+b} back to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are used simultaneously in training, each mini-batch being a set of training pairs and each set of training pairs containing three frames of images. The weight parameter M_f_motion reduces the influence of pairs without dynamic targets on network training.
Step 4.9, the network model parameters are updated by back propagation: the loss value is back-propagated and the network parameters of step 4.2 are updated, finally giving the optimized space-hyperspectral model M_f.
Step 5, the hyperspectral video frame X_1 containing the target to be tracked is input into the template branch of the network model M_f, and the subsequent frames X_2, X_3, X_4, ..., X_i are input into the search branch of M_f, giving the tracking result of each frame.
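A sketch of this inference loop around an assumed `tracker` object that wraps M_f and the DCF (the object and method names are ours):

```python
import numpy as np

def track_sequence(frames, tracker):
    """Step 5: X_1 initializes the template branch and the filter; each
    later frame goes through the search branch, and the peak of the
    response map gives that frame's tracking result."""
    tracker.init_template(frames[0])          # X_1 -> template branch
    results = []
    for frame in frames[1:]:                  # X_2, X_3, ...
        response = tracker.search(frame)      # search branch + DCF response
        dy, dx = np.unravel_index(response.argmax(), response.shape)
        results.append((dx, dy))              # peak location = target center
    return results
```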
The method of the invention has the following notable effects: (1) the unsupervised training network based on the cycle consistency principle saves labor cost; (2) the tracking model fusing RGB and hyperspectral features is trained end to end by deep learning, so inference is fast, tens of times faster than traditional hand-crafted-feature methods; (3) a channel attention mechanism is used on the initial frame to aggregate features that are more effective for the target to be tracked, increasing the network's ability to discriminate the target. The specific embodiments described herein merely illustrate the spirit of the invention by way of example. Those skilled in the art may make various modifications, additions or substitutions to the described embodiments without departing from the spirit of the invention or exceeding the scope defined in the accompanying claims.
Claims (5)
1. An unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion is characterized by comprising the following steps:
Step 1, preprocessing video data;
Step 2, randomly initializing a bounding box, and acquiring a template frame Z_i and subsequent search frames Z_{i+x} through the initialized bounding box, wherein the template frame Z_i and the search frames Z_{i+x} are RGB video frames or hyperspectral video frames;
Step 3, training the RGB branch, also called the spatial branch, without supervision using the cycle consistency principle, finally obtaining an optimized spatial branch model M_rgb;
the spatial branch comprises a template branch 1 and a search branch 1, wherein template branch 1 takes the template frame Z_i containing the tracking target as the input image frame (Z_i here is an RGB video frame) and search branch 1 takes the search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the hyperspectral branch is removed when the spatial branch is trained, and only the spatial branch is trained;
The template branch 1 and the search branch 1 have the same structure and comprise a convolution layer, a nonlinear activation layer, a convolution layer and a local response normalization layer;
Step 4, training the hyperspectral branch without supervision using the cycle consistency principle, finally obtaining an optimized space-hyperspectral model M_f;
the hyperspectral branch comprises a template branch 2 and a search branch 2, wherein template branch 2 takes the template frame Z_i containing the tracking target as the input image frame (the template frame Z_i is a hyperspectral video frame) and search branch 2 takes the search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the model M_rgb of the spatial branch is loaded when training the hyperspectral branch, and the spatial branch parameters are frozen and do not participate in back propagation; template branch 2 comprises several serially connected spectral feature extraction modules and a channel attention module: the first two spectral feature extraction modules each comprise a convolution layer, a batch normalization layer and a nonlinear activation layer; the third spectral feature extraction module comprises a convolution layer, a batch normalization layer, a nonlinear activation layer and a second convolution layer; the channel attention module comprises a global average pooling layer, a full connection layer, a nonlinear activation layer, a full connection layer and a Softmax; search branch 2 comprises only the serially connected spectral feature extraction modules and no channel attention module;
the specific implementation manner of step 4 is as follows:
Step 4.1, template branch 2 takes the template frame Z_i as the input image frame and search branch 2 takes the search frame Z_{i+x} as the input image frame; the model M_rgb of the spatial branch is loaded when training the hyperspectral branch, and the spatial branch parameters are frozen and do not participate in back propagation;
Step 4.2, inputting the template frame Z_i into template branch 2; three bands are selected from Z_i to compose a pseudo-color video frame, which passes through M_rgb to obtain the feature F_t_rgb; meanwhile, Z_i sequentially passes through the network formed by the 3 serially connected spectral feature extraction modules to obtain the feature F_t_hsi; F_t_hsi sequentially passes through the global average pooling layer, full connection layer, nonlinear activation layer, full connection layer and Softmax to compute the weight function a of the F_t_hsi feature channels, finally giving the weighted hyperspectral feature aF_t_hsi;
Step 4.3, inputting Z_{i+a} into search branch 2; three bands are selected from Z_{i+a} to compose a pseudo-color video frame, which passes through M_rgb to obtain the feature F_s_rgb; similarly, Z_{i+a} sequentially passes through the network formed by the 3 serially connected spectral feature extraction modules to obtain the feature F_s_hsi; using the a computed in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained;
Step 4.4, obtaining a filter w_f by solving a ridge regression loss function;
Step 4.5, calculating the final response R_f from the filter w_f and the feature F_s_f = F_s_rgb + aF_s_hsi:

$$R_f = \mathcal{F}^{-1}\!\left( \hat{w}_f^{\,*} \odot \hat{F}_{s\_f} \right),$$

wherein $\mathcal{F}^{-1}$ denotes the inverse Fourier transform and $\hat{w}_f$ is the Fourier transform of w_f;
Step 4.6, tracking forward in the order Z_i - Z_{i+a} - Z_{i+b} (b > a), obtaining the tracking responses R_{f,i+a} and R_{f,i+b}; then tracking backward in the order Z_{i+b} - Z_i, obtaining the tracking response R_{f,i};
Step 4.7, calculating the movement weight M_f_motion:

$$M_{f\_motion} = \left\| R_{f,i+a} - H_i \right\|^{2} + \left\| R_{f,i+b} - H_{i+a} \right\|^{2},$$

where H_i is the ideal Gaussian output of the initial frame Z_i and H_{i+a} is the ideal Gaussian output of Z_{i+a}; whether the randomly initialized bounding box contains a dynamic target is judged by calculating the weight parameter M_f_motion, wherein if a dynamic target is present, the weight M_f_motion is larger than it would be without one;
Step 4.8, constructing the loss function:

$$L_f = \frac{1}{n} \sum_{i=1}^{n} M_{f\_motion}^{\,i} \left\| R_{f,i} - H_i \right\|^{2},$$

where n is the maximum batch size, R_{f,i} is the tracking response from Z_{i+b} back to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are used simultaneously in training, each mini-batch being a set of training pairs and each set of training pairs containing three frames of images; the weight parameter M_f_motion reduces the influence of training pairs without dynamic targets on network training;
Step 4.9, updating the network model parameters by back propagation: the loss value, i.e. the loss function value L_f of step 4.8, is back-propagated and the network parameters of step 4.2 are updated, finally obtaining the optimized space-hyperspectral model M_f;
Step 5, inputting the hyperspectral video frame X_1 containing the target to be tracked into the template branch of the network model M_f, and sequentially inputting the subsequent video frames X_2, X_3, X_4, ..., X_i into the search branch of M_f to obtain the tracking result of each frame.
2. The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion as set forth in claim 1, wherein the specific implementation of step 1 is as follows,
Firstly, the video data are converted into a sequence of frames X_i, where X_i is an RGB video frame or a hyperspectral video frame;
the unlabeled video image frames X_i are then all resized to video image frames Y_i.
3. The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion as set forth in claim 1, wherein the implementation of step 2 is as follows,
On the basis of step 1, a region of 90×90 pixels centered on coordinates [x, y] is selected on the unlabeled video frame Y_i as the target to be tracked; this region is the initialized BBOX; the 90×90 region is resized to Z_i of 125×125 pixels; two frames Y_{i+a} and Y_{i+b} (10 ≥ a > 0, 10 ≥ b > 0, a ≠ b) are randomly selected from the 10 frames Y_{i+1} to Y_{i+10}, and the selected 90×90-pixel regions, again centered on [x, y], are resized to Z_{i+a} and Z_{i+b} of 125×125 pixels.
4. The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion as set forth in claim 1, wherein the specific implementation manner of step 3 is as follows,
Step 3.1, template branch 1 takes the template frame Z_i as the input image frame and search branch 1 takes the search frame Z_{i+x} as the input image frame; the hyperspectral branch is removed when training the spatial branch, and only the spatial branch is trained;
Step 3.2, inputting the template frame Z_i (an RGB video frame) into template branch 1; Z_i sequentially passes through the convolution layer, nonlinear activation layer, convolution layer and local response normalization layer to obtain the feature F_t;
Step 3.3, inputting Z_{i+a} (an RGB video frame) into search branch 1; Z_{i+a} sequentially passes through the convolution layer, nonlinear activation layer, convolution layer and local response normalization layer to obtain the feature F_s;
Step 3.4, solving the ridge regression loss function

$$\min_{w} \left\| w \star F_t - H \right\|^{2} + \lambda \left\| w \right\|^{2}$$

to obtain the filter w, wherein H is an ideal Gaussian response and λ is a constant; the closed-form solution in the Fourier domain is

$$\hat{w} = \frac{\hat{H} \odot \hat{F}_t^{\,*}}{\hat{F}_t^{\,*} \odot \hat{F}_t + \lambda},$$

wherein $\hat{w}$ is the Fourier transform of w, $\hat{F}_t$ is the Fourier transform of F_t, $\hat{H}$ is the Fourier transform of H, $*$ denotes the complex conjugate, and $\odot$ denotes element-wise multiplication;
Step 3.5, calculating the final response R from the filter w and the feature F_s of the subsequent frame:

$$R = \mathcal{F}^{-1}\!\left( \hat{w}^{\,*} \odot \hat{F}_s \right),$$

wherein $\mathcal{F}^{-1}$ denotes the inverse Fourier transform;
Step 3.6, tracking forward in the order Z_i - Z_{i+a} - Z_{i+b} (the three frames form a training pair; b > a), obtaining the tracking responses R_{i+a} and R_{i+b}; then tracking backward in the order Z_{i+b} - Z_i, obtaining the tracking response R_i;
Step 3.7, calculating the movement weight M_motion:

$$M_{motion} = \left\| R_{i+a} - H_i \right\|^{2} + \left\| R_{i+b} - H_{i+a} \right\|^{2},$$

where H_i is the ideal Gaussian output of the initial frame Z_i, H_{i+a} is the ideal Gaussian output of Z_{i+a}, and m denotes the m different training pairs over which the weights are computed; whether the randomly initialized bounding box contains a dynamic target is judged by calculating the movement weight M_motion, wherein if a dynamic target is present, M_motion is larger than it would be without one;
Step 3.8, constructing the loss function:

$$L = \frac{1}{n} \sum_{i=1}^{n} M_{motion}^{\,i} \left\| R_i - H_i \right\|^{2},$$

where n is the maximum batch size, R_i is the tracking response from Z_{i+b} back to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are used simultaneously in training, each mini-batch being a set of training pairs and each set of training pairs containing three frames of images; the weight parameter M_motion reduces the influence of training pairs without dynamic targets on network training;
Step 3.9, updating the network model parameters by back propagation: the loss value, i.e. the loss function value L of step 3.8, is back-propagated and the network parameters of step 3.2 are updated with the stochastic gradient descent algorithm, finally obtaining the optimized spatial branch model M_rgb.
5. The unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion as set forth in claim 1, wherein:
The calculation formula of the ridge regression loss function in step 4.4 is as follows:

$$\min_{w_f} \left\| w_f \star F_{t\_f} - H \right\|^{2} + \lambda \left\| w_f \right\|^{2}$$

Solving the ridge regression loss function gives the filter w_f, in the Fourier domain:

$$\hat{w}_f = \frac{\hat{H} \odot \hat{F}_{t\_f}^{\,*}}{\hat{F}_{t\_f}^{\,*} \odot \hat{F}_{t\_f} + \lambda},$$

wherein F_t_f = aF_t_hsi + F_t_rgb, H is an ideal Gaussian response, λ is a constant, $\hat{w}_f$ is the Fourier transform of w_f, $\hat{F}_{t\_f}$ is the Fourier transform of F_t_f, $\hat{H}$ is the Fourier transform of H, and $*$ denotes the complex conjugate.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110018918.9A (CN112766102B) | 2021-01-07 | 2021-01-07 | Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112766102A | 2021-05-07 |
| CN112766102B | 2024-04-26 |
Family
Family ID: 75700670
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110018918.9A (CN112766102B, active) | 2021-01-07 | 2021-01-07 | Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN112766102B (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113344932B * | 2021-06-01 | 2022-05-03 | University of Electronic Science and Technology of China | Semi-supervised single-target video segmentation method |
| CN113628244B * | 2021-07-05 | 2023-11-28 | Shanghai Jiao Tong University | Target tracking method, system, terminal and medium based on label-free video training |
| CN117689692A * | 2023-12-20 | 2024-03-12 | Naval Aviation University of the Chinese People's Liberation Army | Attention mechanism guiding matching associated hyperspectral and RGB video fusion tracking method |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107038684A * | 2017-04-10 | 2017-08-11 | Nanjing University of Information Science and Technology | A kind of method for lifting TMI spatial resolution |
| CN108765280A * | 2018-03-30 | 2018-11-06 | Xu Guoming | A kind of high spectrum image spatial resolution enhancement method |
| CN110210551A * | 2019-05-28 | 2019-09-06 | Beijing University of Technology | A kind of visual target tracking method based on adaptive main body sensitivity |
| CN111062888A * | 2019-12-16 | 2020-04-24 | Wuhan University | Hyperspectral image denoising method based on multi-target low-rank sparsity and spatial-spectral total variation |
| CN111325116A * | 2020-02-05 | 2020-06-23 | Wuhan University | Remote sensing image target detection method capable of evolving based on offline training-online learning depth |
| CN111724411A * | 2020-05-26 | 2020-09-29 | Zhejiang University of Technology | Multi-feature fusion tracking method based on hedging algorithm |
| WO2020199205A1 * | 2019-04-04 | 2020-10-08 | 合刃科技(深圳)有限公司 | Hybrid hyperspectral image reconstruction method and system |
| CN111797716A * | 2020-06-16 | 2020-10-20 | University of Electronic Science and Technology of China | Single target tracking method based on Siamese network |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110060274A * | 2019-04-12 | 2019-07-26 | 北京影谱科技股份有限公司 | The visual target tracking method and device of neural network based on the dense connection of depth |
Non-Patent Citations (1)
| Title |
|---|
| RGB and hyperspectral image fusion based on superpixel segmentation; Hong Ke; Electronic Technology & Software Engineering, No. 03; full text * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112766102A | 2021-05-07 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |