CN112766102A - Unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion - Google Patents
- Publication number
- CN112766102A (application CN202110018918.9A)
- Authority
- CN
- China
- Prior art keywords
- frame
- hyperspectral
- branch
- tracking
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/48—Matching video sequences
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A40/00—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
- Y02A40/10—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture
Abstract
The invention relates to an unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion. A deep-learning hyperspectral target tracking method is designed around the cycle consistency principle, so that the hyperspectral tracking model can be trained without supervision, saving the cost of manual labeling. On the basis of a Siamese tracking framework, an RGB branch (spatial branch) and a hyperspectral branch are designed: the spatial branch is first trained with RGB video data; the trained RGB model is then loaded into the network with fixed parameters while the hyperspectral branch is trained, yielding fused features with higher robustness and discrimination. Finally, the fused features are input into a discriminative correlation filter (DCF) to obtain the tracking result. The method addresses both the manual-labeling burden of hyperspectral video data and the scarcity of hyperspectral training samples for deep-learning models, and effectively improves the precision and speed of hyperspectral video tracking.
Description
Technical Field
The invention relates to the field of computer vision, in particular to an unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion.
Background
Target tracking in hyperspectral video (high spatial, high temporal and high spectral resolution) is a new research direction: given the target information in an initial frame, the state of the target is predicted in subsequent frames. Compared with RGB video target tracking, hyperspectral video target tracking provides, in addition to spatial information, spectral information that can distinguish different materials. Even when two objects have the same shape, a target can still be tracked in hyperspectral video as long as the materials differ, an advantage that RGB video target tracking does not have. Hyperspectral video target tracking can therefore play an important role in fields such as camouflaged target tracking and small target tracking, and on this basis it has attracted the attention of more and more researchers.
At the same time, hyperspectral video target tracking is a difficult task. First, existing hyperspectral tracking algorithms represent the target with traditional hand-crafted features, which limits their performance. Second, hyperspectral video must be captured with a dedicated hyperspectral video camera, so training samples are scarce; as a result, no truly deep-learning-based hyperspectral video tracking algorithm exists at present. Third, supervised deep learning requires a large number of manually labeled samples, and video annotation in particular is time-consuming and labor-intensive. Because of these problems, existing hyperspectral video target tracking algorithms perform poorly.
Disclosure of Invention
The invention aims to provide an unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion.
The unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion has three notable characteristics. First, using the cycle consistency principle, the entire deep-learning-based hyperspectral tracking algorithm is trained without supervision, so no manual annotation is needed. Second, a correlation-filter hyperspectral tracking framework with spatial-spectral feature fusion is designed, which alleviates the scarcity of hyperspectral training samples to a certain extent while fusing RGB and hyperspectral features into features with higher robustness and discriminative capability. Third, a channel attention module is designed that computes the feature-channel weights only in the initial frame, so the network can dynamically aggregate different channel weights for different targets.
The invention provides an unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion, which comprises the following steps:
step 1, preprocessing the video data;
step 2, randomly initializing a bounding box and obtaining from it a template frame Z_i and subsequent search frames Z_{i+x}, the frames being RGB video frames or hyperspectral video frames;
step 3, unsupervised training of the RGB branch, also called the spatial branch, using the cycle consistency principle, finally obtaining an optimized spatial-branch model;
the spatial branch comprises a template branch 1 and a search branch 1, wherein the template branch 1 takes the template frame Z_i containing the tracking target, here an RGB video frame, as the input image frame, and the search branch 1 takes a search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the hyperspectral branch is removed when training the spatial branch, so only the spatial branch is trained;
the template branch 1 and the search branch 1 have the same structure, comprising a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer;
step 4, unsupervised training of the hyperspectral branch using the cycle consistency principle, finally obtaining an optimized spatial-hyperspectral model;
the hyperspectral branch comprises a template branch 2 and a search branch 2, wherein the template branch 2 takes the template frame Z_i of the tracked target, here a hyperspectral video frame, as the input image frame, and the search branch 2 takes a search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the trained model of the spatial branch is loaded when training the hyperspectral branch, and the frozen spatial-branch parameters do not participate in back-propagation;
the template branch 2 comprises a plurality of serially connected spectral feature extraction modules and a channel attention module, wherein the first two spectral feature extraction modules each comprise a convolutional layer, a batch normalization layer and a nonlinear activation layer, the third comprises a convolutional layer, a batch normalization layer, a nonlinear activation layer and a convolutional layer, and the channel attention module comprises a global average pooling layer, a fully connected layer, a nonlinear activation layer, a fully connected layer and Softmax; the search branch 2 comprises only the serially connected spectral feature extraction modules, without the channel attention module;
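For illustration, the spatial branch described above can be sketched in PyTorch as follows. This is a minimal sketch, not the claimed implementation: the patent fixes only the layer order (convolution, nonlinear activation, convolution, local response normalization), so the channel widths and kernel sizes below are assumptions.

```python
import torch
import torch.nn as nn

class SpatialBranch(nn.Module):
    """Shared backbone of template branch 1 and search branch 1 (Siamese weights)."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(inplace=True),                                 # nonlinear activation layer
            nn.Conv2d(32, 32, kernel_size=3, padding=1),           # convolutional layer
            nn.LocalResponseNorm(size=5),                          # local response normalization
        )

    def forward(self, x):
        return self.features(x)

spatial = SpatialBranch()
f_t = spatial(torch.randn(1, 3, 125, 125))  # template frame Z_i   -> feature F_t
f_s = spatial(torch.randn(1, 3, 125, 125))  # search frame Z_{i+x} -> feature F_s
```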
Further, step 1 is implemented as follows:
first, the video data is converted into a sequence of image frames X_i, where X_i is an RGB video frame or a hyperspectral video frame;
then every unlabeled video image frame X_i is resized to a uniform, fixed-size video image frame Y_i.
Further, step 2 is implemented as follows:
on the basis of step 1, in the unlabeled video frame Y_i, a region of 90 x 90 pixels centered at coordinates [x, y] is selected as the target to be tracked; this region is the initialized BBOX. The 90 x 90 region is resized to a 125 x 125 pixel Z_i. At the same time, two frames Y_{i+a} and Y_{i+b} are randomly selected among the 10 frames Y_{i+1} to Y_{i+10} (10 >= a > 0, 10 >= b > 0, a > b or a < b), and likewise the 90 x 90 pixel region centered at [x, y] is resized to the 125 x 125 pixel Z_{i+a} and Z_{i+b}.
Further, step 3 is implemented as follows:
step 3.1, the template branch 1 takes the template frame Z_i as the input image frame and the search branch 1 takes the search frame Z_{i+x} as the input image frame; the hyperspectral branch is removed when training the spatial branch, so only the spatial branch is trained;
step 3.2, the template frame Z_i, here an RGB video frame, enters the template branch 1; Z_i passes through a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer in sequence, yielding the feature F_t;
step 3.3, Z_{i+a} is input into the search branch 1; here Z_{i+a} is an RGB video frame; Z_{i+a} passes through a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer in sequence, yielding the feature F_s;
step 3.4, the ridge regression loss function is solved to obtain the filter w, where H is the ideal Gaussian response and λ is a constant:
ŵ = (F̂_t* ⊙ Ĥ) / (F̂_t* ⊙ F̂_t + λ)
where ŵ is the Fourier transform of w, F̂_t likewise the Fourier transform of F_t, and Ĥ the Fourier transform of H; * denotes the conjugate and ⊙ the dot product;
step 3.5, the final response R is computed from the filter w and the feature F_s of the subsequent frame:
R = F⁻¹(ŵ* ⊙ F̂_s)
where F⁻¹ denotes the inverse Fourier transform;
step 3.6, forward tracking is performed first, with tracking sequence Z_i → Z_{i+a} → Z_{i+b}, the three frames constituting a training pair with b > a, obtaining the tracking responses R_{i+a} and R_{i+b}; then backward tracking is performed with tracking sequence Z_{i+b} → Z_i, obtaining the tracking response R_i;
step 3.7, the moving weight M_motion is calculated, where H_i is the ideal Gaussian output of the initial frame Z_i, H_{i+a} is the ideal Gaussian output of Z_{i+a}, and m indexes the m different training pairs; the moving weight M_motion is used to determine whether the randomly initialized bounding box contains a dynamic target, and if it does, M_motion is weighted higher than when no dynamic target is present;
step 3.8, the loss function is constructed:
L = (1/n) Σ M_motion ⊙ ‖R_i − H_i‖²
where n denotes the batch size, R_i is the tracking response from Z_{i+b} to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are trained simultaneously, each mini-batch being a training pair of three frames; the weight parameter M_motion reduces the influence of non-dynamic targets on network training;
step 3.9, the loss value, i.e. the loss function value L of step 3.8, is back-propagated, the network parameters of step 3.2 are updated with the stochastic gradient descent (SGD) algorithm, and finally the optimized spatial-branch model is obtained.
Further, step 4 is implemented as follows:
step 4.1, the template branch 2 takes the template frame Z_i, here a hyperspectral video frame, as the input image frame, and the search branch 2 takes the search frame Z_{i+x}, also a hyperspectral video frame, as the input image frame; the trained model of the spatial branch is loaded when training the hyperspectral branch, and the frozen spatial-branch parameters do not participate in back-propagation;
step 4.2, the template frame Z_i enters the template branch 2; three bands are selected from Z_i to form a pseudo-color video frame, from which the feature F_t_rgb is obtained; at the same time, Z_i passes through a network of 3 serially connected spectral feature extraction modules, whose structure is shown in FIG. 2, yielding the feature F_t_hsi; the weight function a of the F_t_hsi feature channels is computed through a global average pooling layer, a fully connected layer, a nonlinear activation layer, a fully connected layer and Softmax in sequence, finally obtaining the weighted hyperspectral feature aF_t_hsi;
step 4.3, Z_{i+a} is input into the search branch 2; three bands are selected from Z_{i+a} to form a pseudo-color video frame, from which the feature F_s_rgb is obtained; likewise, Z_{i+a} passes through the network of 3 serially connected spectral feature extraction modules of FIG. 2, yielding the feature F_s_hsi; using the weight a computed in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained;
step 4.4, the ridge regression loss function is solved to obtain the filter w_f, where F_t_f = aF_t_hsi + F_t_rgb, H is the ideal Gaussian response and λ is a constant:
ŵ_f = (F̂_t_f* ⊙ Ĥ) / (F̂_t_f* ⊙ F̂_t_f + λ)
where ŵ_f is the Fourier transform of w_f, F̂_t_f likewise the Fourier transform of F_t_f, and Ĥ the Fourier transform of H; * denotes the conjugate;
step 4.5, the final response R_f is computed from the filter w_f and the fused feature F_s_f = F_s_rgb + aF_s_hsi of the subsequent frame:
R_f = F⁻¹(ŵ_f* ⊙ F̂_s_f)
where F⁻¹ denotes the inverse Fourier transform;
step 4.6, forward tracking is performed first, with tracking sequence Z_i → Z_{i+a} → Z_{i+b} and b > a, obtaining the tracking responses R_f_{i+a} and R_f_{i+b}; then backward tracking is performed with tracking sequence Z_{i+b} → Z_i, obtaining the tracking response R_f_i;
step 4.7, the moving weight M_f_motion is calculated, where H_i is the ideal Gaussian output of the initial frame Z_i and H_{i+a} is the ideal Gaussian output of Z_{i+a}; the weight parameter M_f_motion is used to determine whether the randomly initialized bounding box contains a dynamic target, and if it does, M_f_motion is weighted higher than when no dynamic target is present;
step 4.8, the loss function is constructed:
L_f = (1/n) Σ M_f_motion ⊙ ‖R_f_i − H_i‖²
where n denotes the batch size, R_f_i is the tracking response from Z_{i+b} to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are trained simultaneously, each mini-batch being a training pair of three frames; the weight parameter M_f_motion reduces the influence of non-dynamic targets on network training;
step 4.9, the loss value, i.e. the loss function value L_f of step 4.8, is back-propagated, the network parameters of step 4.2 are updated, and finally the optimized spatial-hyperspectral model is obtained.
The method of the invention has the following notable effects: (1) the unsupervised training network based on the cycle consistency principle saves labeling cost; (2) a tracking model fusing RGB and hyperspectral features is trained end to end with deep learning and infers quickly, tens of times faster than traditional hand-crafted-feature methods; (3) a channel attention mechanism aggregates, in the initial frame, the features most effective for the target to be tracked, increasing the network's ability to discriminate the target.
Drawings
FIG. 1 is a schematic view of the cycle consistency in step 4 of embodiment 1 of the present invention.
FIG. 2 is a schematic diagram of the hyperspectral branch in step 4 of embodiment 1 of the present invention.
FIG. 3 is a schematic diagram of the spatial branch in step 3 of embodiment 1 of the present invention.
FIG. 4 is a schematic diagram of the tracking result in step 3 of embodiment 1 of the present invention; the numbers indicate the 4th and 12th frames, and the box marks the position and size of the tracked target, moving and scaling with the target (the box grows as the target grows and shrinks as the target shrinks).
Fig. 5 is a flowchart of embodiment 1 of the present invention.
Detailed Description
The technical scheme of the invention is further specifically described by the following embodiments and the accompanying drawings.
Embodiment 1:
The embodiment of the invention provides an unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion, which comprises the following steps:
Step 1.1, convert the video data into a sequence of image frames X_i (RGB video frames or hyperspectral video frames).
Step 1.2, resize every unlabeled video image frame X_i to a 200 x 200 pixel video image frame Y_i.
On the basis of step 1, a region of 90 x 90 pixels is randomly selected in the unlabeled video frame Y_i (a 90 x 90 pixel region centered at coordinates [x, y]) as the target to be tracked; this region is the initialized BBOX. The 90 x 90 region is resized to a 125 x 125 pixel Z_i. At the same time, two frames Y_{i+a} and Y_{i+b} are randomly selected among the 10 frames Y_{i+1} to Y_{i+10} (10 >= a > 0, 10 >= b > 0, a > b or a < b), and likewise the 90 x 90 pixel region centered at [x, y] is resized to the 125 x 125 pixel Z_{i+a} and Z_{i+b}.
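The sampling of steps 1 and 2 can be sketched as below. This OpenCV-based helper is illustrative only: the frame list, coordinate ranges and interpolation are assumptions, not the claimed implementation.

```python
import random
import cv2

def sample_training_pair(frames, i):
    """Build one training pair (Z_i, Z_{i+a}, Z_{i+b}) from unlabeled frames."""
    # Step 1: resize the unlabeled frames to 200 x 200 pixels (Y_i .. Y_{i+10}).
    Y = [cv2.resize(f, (200, 200)) for f in frames[i:i + 11]]
    # Step 2: a 90 x 90 BBOX centered at a random [x, y]; kept inside the frame.
    x, y = random.randint(45, 155), random.randint(45, 155)
    crop = lambda img: cv2.resize(img[y - 45:y + 45, x - 45:x + 45], (125, 125))
    a, b = random.sample(range(1, 11), 2)  # two random frames among Y_{i+1}..Y_{i+10}
    return crop(Y[0]), crop(Y[a]), crop(Y[b])  # Z_i, Z_{i+a}, Z_{i+b}
```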
Step 3.1, the whole network is built on a Siamese structure and divided into a template branch and a search branch. The template branch takes the template frame Z_i (containing the target to be tracked; here Z_i is an RGB video frame) as the input image frame, and is further divided into a spatial branch and a hyperspectral branch. The search branch takes the search frame Z_{i+x} (a subsequent video frame, x > 0) as the input image frame and is likewise divided into a spatial branch and a hyperspectral branch. The hyperspectral branch is removed when training the spatial branch, so only the spatial branch is trained.
Step 3.2, the template frame Z_i (here an RGB video frame) enters the template branch. Z_i passes through a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer in sequence, yielding the feature F_t.
Step 3.3, Z_{i+a} (assume b > a) enters the search branch; here Z_{i+a} is an RGB video frame. Z_{i+a} passes through a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer in sequence, yielding the feature F_s.
Step 3.4, solve the ridge regression loss function to obtain the filter w, where H is the ideal Gaussian response and λ is a constant:

ŵ = (F̂_t* ⊙ Ĥ) / (F̂_t* ⊙ F̂_t + λ)

where ŵ is the Fourier transform of w, F̂_t likewise the Fourier transform of F_t, and Ĥ the Fourier transform of H; * denotes the conjugate and ⊙ the dot product.
Step 3.5, the final response R is computed from the filter w and the feature F_s of the subsequent frame:

R = F⁻¹(ŵ* ⊙ F̂_s)

where F⁻¹ denotes the inverse Fourier transform.
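The two formulas above are the standard closed-form discriminative correlation filter solution. A PyTorch sketch under that reading follows; the per-channel filter with the spectral energy summed over channels is a common DCF convention and is an assumption here, not text from the patent.

```python
import torch

def dcf_filter(F_t, H, lam=1e-4):
    """Ridge regression in the Fourier domain: w_hat = conj(F_t_hat)*H_hat / (energy + lambda)."""
    F_t_hat = torch.fft.fft2(F_t)                          # template features, shape (C, H, W)
    H_hat = torch.fft.fft2(H)                              # ideal Gaussian response, shape (H, W)
    energy = (torch.conj(F_t_hat) * F_t_hat).sum(0).real   # spectral energy summed over channels
    return (torch.conj(F_t_hat) * H_hat) / (energy + lam)

def dcf_response(w_hat, F_s):
    """Final response R = F^{-1}( conj(w_hat) * F_s_hat ), summed over feature channels."""
    F_s_hat = torch.fft.fft2(F_s)
    return torch.fft.ifft2((torch.conj(w_hat) * F_s_hat).sum(0)).real
```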
Step 3.6, forward tracking is performed first, with tracking sequence Z_i → Z_{i+a} → Z_{i+b} (the three frames constitute a training pair), obtaining the tracking responses R_{i+a} and R_{i+b}; then backward tracking is performed with tracking sequence Z_{i+b} → Z_i, obtaining the tracking response R_i.
Step 3.7, calculate the moving weight M_motion, where H_i is the ideal Gaussian output of the initial frame Z_i, H_{i+a} is the ideal Gaussian output of Z_{i+a}, and m indexes the m different training pairs. The moving weight M_motion indicates whether the randomly initialized bounding box contains a dynamic target: if it does, M_motion is weighted higher than when no dynamic target is present.
Step 3.8, construct the loss function:

L = (1/n) Σ M_motion ⊙ ‖R_i − H_i‖²

where n denotes the batch size and the sum runs over the training pairs in the batch, R_i is the tracking response from Z_{i+b} to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are trained simultaneously, each mini-batch being a training pair of three frames. The weight parameter M_motion reduces the influence of non-dynamic targets on network training.
Step 3.9, the loss value is back-propagated to update the network model parameters: the network parameters of step 3.2 are updated with the stochastic gradient descent (SGD) algorithm, finally yielding the optimized spatial-branch model.
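Steps 3.6 through 3.9 can be condensed into the sketch below. The tensors standing in for the forward-backward tracking outputs are placeholders (assumptions), and the SGD step follows the standard PyTorch pattern.

```python
import torch

def consistency_loss(R_i, H_i, M_motion):
    """Step 3.8: L = (1/n) * sum over training pairs of M_motion * ||R_i - H_i||^2."""
    per_pair = (R_i - H_i).pow(2).flatten(1).sum(-1)   # squared error per training pair
    return (M_motion * per_pair).mean()                # weighted mean over the batch

# Placeholders standing in for the outputs of forward-backward tracking:
R_i = torch.randn(8, 125, 125, requires_grad=True)  # responses of backward tracking Z_{i+b} -> Z_i
H_i = torch.randn(8, 125, 125)                      # ideal Gaussian labels of the initial frames
M_motion = torch.ones(8)                            # moving weights (uniform placeholder)

loss = consistency_loss(R_i, H_i, M_motion)
loss.backward()  # step 3.9: in the full network the gradient flows back into the
                 # feature extractor, whose parameters are then updated with SGD
```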
Step 4.1, the whole network is again built on a Siamese structure and divided into a template branch and a search branch. The template branch takes the template frame Z_i (containing the target to be tracked; here Z_i and all input video frames are hyperspectral video frames) as the input image frame, and is divided into a spatial branch and a hyperspectral branch. The search branch takes the search frame Z_{i+x} (a subsequent video frame, x > 0) as the input image frame and is likewise divided into a spatial branch and a hyperspectral branch. The trained model of the spatial branch is loaded when training the hyperspectral branch, and the frozen spatial-branch parameters do not participate in back-propagation.
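Loading and freezing the spatial branch in step 4.1 corresponds to the usual PyTorch pattern. A minimal sketch, with a hypothetical checkpoint path, is:

```python
import torch

spatial = SpatialBranch()  # structure from the sketch after step 3.1
spatial.load_state_dict(torch.load("spatial_branch.pth"))  # hypothetical checkpoint path
for p in spatial.parameters():
    p.requires_grad = False  # frozen: excluded from back-propagation
spatial.eval()
```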
Step 4.2, the template frame Z_i enters the template branch. Three bands are selected from Z_i to form a pseudo-color video frame, from which the feature F_t_rgb is obtained (spatial branch). At the same time, Z_i passes through a network of 3 serially connected spectral feature extraction modules, whose structure is shown in FIG. 2, yielding the feature F_t_hsi. The weight function a of the F_t_hsi feature channels is computed through a global average pooling layer, a fully connected layer, a nonlinear activation layer, a fully connected layer and Softmax in sequence (channel attention mechanism); the weight function is computed only on the template frame, and a is reused directly in subsequent frames. This finally yields the weighted hyperspectral feature aF_t_hsi.
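An illustrative sketch of the spectral feature extraction modules and the channel attention module of FIG. 2 follows. Only the layer order comes from the text; the channel counts and the reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class SpectralFeatureNet(nn.Module):
    """Three serially connected spectral feature extraction modules: two
    (conv - batch norm - nonlinearity) modules, then (conv - bn - nonlinearity - conv)."""
    def __init__(self, bands=16, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(bands, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU(True),
            nn.Conv2d(width, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU(True),
            nn.Conv2d(width, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU(True),
            nn.Conv2d(width, width, 3, padding=1),  # the third module ends with a conv layer
        )
    def forward(self, x):
        return self.net(x)

class ChannelAttention(nn.Module):
    """Global average pooling - FC - nonlinearity - FC - Softmax; computed on the
    template frame only, then reused unchanged for every search frame."""
    def __init__(self, channels=32, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(True),
            nn.Linear(channels // reduction, channels), nn.Softmax(dim=1),
        )
    def forward(self, f):
        a = self.fc(f.mean(dim=(2, 3)))       # per-channel weights from pooled features
        return a.unsqueeze(-1).unsqueeze(-1)  # shape (N, C, 1, 1) for broadcasting
```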
Step 4.3, Z_{i+a} (assume b > a) enters the search branch. Three bands (the same band composition as in step 4.2) are selected from Z_{i+a} to form a pseudo-color video frame, from which the feature F_s_rgb is obtained. Likewise, Z_{i+a} passes through the network of 3 serially connected spectral feature extraction modules of FIG. 2, yielding the feature F_s_hsi. Using the weight a computed in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained.
Step 4.4, solve the ridge regression loss function to obtain the filter w_f, where F_t_f = aF_t_hsi + F_t_rgb, H is the ideal Gaussian response and λ is a constant:

ŵ_f = (F̂_t_f* ⊙ Ĥ) / (F̂_t_f* ⊙ F̂_t_f + λ)

where ŵ_f is the Fourier transform of w_f, F̂_t_f likewise the Fourier transform of F_t_f, and Ĥ the Fourier transform of H; * denotes the conjugate.
Step 4.5, the final response R_f is computed from the filter w_f and the fused feature F_s_f = F_s_rgb + aF_s_hsi of the subsequent frame:

R_f = F⁻¹(ŵ_f* ⊙ F̂_s_f)

where F⁻¹ denotes the inverse Fourier transform.
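Putting steps 4.2 through 4.5 together, and reusing the sketches above, a hedged end-to-end fragment might look like this; the number of bands, the band indices for the pseudo-color frame and the Gaussian label parameters are all assumptions.

```python
import torch

hsi_t = torch.randn(1, 16, 125, 125)   # hyperspectral template frame Z_i
hsi_s = torch.randn(1, 16, 125, 125)   # hyperspectral search frame Z_{i+a}
bands = [2, 8, 14]                     # hypothetical pseudo-color band selection

extract, attend = SpectralFeatureNet(), ChannelAttention()

# Stand-in ideal Gaussian response H centered on the template:
ys, xs = torch.meshgrid(torch.arange(125), torch.arange(125), indexing="ij")
H = torch.exp(-((xs - 62) ** 2 + (ys - 62) ** 2) / (2 * 5.0 ** 2))

F_t_hsi = extract(hsi_t)
a = attend(F_t_hsi)                    # weight a: computed on the template frame only (step 4.2)
F_t_rgb = spatial(hsi_t[:, bands])     # pseudo-color frame through the frozen spatial branch
F_t_f = a * F_t_hsi + F_t_rgb          # fused template feature

F_s_f = spatial(hsi_s[:, bands]) + a * extract(hsi_s)  # fused search feature (step 4.3)

w_f = dcf_filter(F_t_f[0], H)          # ridge regression in the Fourier domain (step 4.4)
R_f = dcf_response(w_f, F_s_f[0])      # final response map (step 4.5)
```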
Step 4.6, forward tracking is performed first, with tracking sequence Z_i → Z_{i+a} → Z_{i+b}, obtaining the tracking responses R_f_{i+a} and R_f_{i+b}; then backward tracking is performed with tracking sequence Z_{i+b} → Z_i, obtaining the tracking response R_f_i.
Step 4.7, calculate the moving weight M_f_motion, where H_i is the ideal Gaussian output of the initial frame Z_i and H_{i+a} is the ideal Gaussian output of Z_{i+a}. The weight parameter M_f_motion indicates whether the randomly initialized bounding box contains a dynamic target: if it does, M_f_motion is weighted higher than when no dynamic target is present.
Step 4.8, construct the loss function:

L_f = (1/n) Σ M_f_motion ⊙ ‖R_f_i − H_i‖²

where n denotes the batch size and the sum runs over the training pairs in the batch, R_f_i is the tracking response from Z_{i+b} to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are trained simultaneously, each mini-batch being a training pair of three frames. The weight parameter M_f_motion reduces the influence of non-dynamic targets on network training.
Step 4.9, the loss value is back-propagated to update the network model parameters: the network parameters of step 4.2 are updated, finally yielding the optimized spatial-hyperspectral model.
The method of the invention has the following notable effects: (1) the unsupervised training network based on the cycle consistency principle saves labeling cost; (2) a tracking model fusing RGB and hyperspectral features is trained end to end with deep learning and infers quickly, tens of times faster than traditional hand-crafted-feature methods; (3) a channel attention mechanism aggregates, in the initial frame, the features most effective for the target to be tracked, increasing the network's ability to discriminate the target. The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art may make various modifications, additions or substitutions to the described embodiments without departing from the spirit of the invention or the scope defined in the appended claims.
Claims (5)
1. An unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion, characterized by comprising the following steps:
step 1, preprocessing video data;
step 2, randomly initializing a bounding box and obtaining from it a template frame Z_i and subsequent search frames Z_{i+x}, the template frame Z_i and search frames Z_{i+x} being RGB video frames or hyperspectral video frames;
step 3, unsupervised training of the RGB branch, also called the spatial branch, using the cycle consistency principle, finally obtaining an optimized spatial-branch model;
the spatial branch comprises a template branch 1 and a search branch 1, wherein the template branch 1 takes the template frame Z_i containing the tracking target, here an RGB video frame, as the input image frame, and the search branch 1 takes a search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the hyperspectral branch is removed when training the spatial branch, so only the spatial branch is trained;
the template branch 1 and the search branch 1 have the same structure, comprising a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer;
step 4, unsupervised training of the hyperspectral branch using the cycle consistency principle, finally obtaining an optimized spatial-hyperspectral model;
the hyperspectral branch comprises a template branch 2 and a search branch 2, wherein the template branch 2 takes the template frame Z_i of the tracked target, here a hyperspectral video frame, as the input image frame, and the search branch 2 takes a search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the trained model of the spatial branch is loaded when training the hyperspectral branch, and the frozen spatial-branch parameters do not participate in back-propagation;
the template branch 2 comprises a plurality of serially connected spectral feature extraction modules and a channel attention module, wherein the first two spectral feature extraction modules each comprise a convolutional layer, a batch normalization layer and a nonlinear activation layer, the third spectral feature extraction module comprises a convolutional layer, a batch normalization layer, a nonlinear activation layer and a convolutional layer, and the channel attention module comprises a global average pooling layer, a fully connected layer, a nonlinear activation layer, a fully connected layer and Softmax; the search branch 2 comprises only the serially connected spectral feature extraction modules, without the channel attention module.
2. The unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion of claim 1, characterized in that step 1 is implemented as follows:
first, the video data is converted into a sequence of image frames X_i, where X_i is an RGB video frame or a hyperspectral video frame;
then every unlabeled video image frame X_i is resized to a uniform, fixed-size video image frame Y_i.
3. The unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion of claim 1, characterized in that step 2 is implemented as follows:
on the basis of step 1, in the unlabeled video frame Y_i, a region of 90 x 90 pixels centered at coordinates [x, y] is selected as the target to be tracked; this region is the initialized BBOX; the 90 x 90 region is resized to a 125 x 125 pixel Z_i; at the same time, two frames Y_{i+a} and Y_{i+b} are randomly selected among the 10 frames Y_{i+1} to Y_{i+10} (10 >= a > 0, 10 >= b > 0, a > b or a < b), and likewise the 90 x 90 pixel region centered at [x, y] is resized to the 125 x 125 pixel Z_{i+a} and Z_{i+b}.
4. The unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion of claim 1, characterized in that step 3 is implemented as follows:
step 3.1, the template branch 1 takes the template frame Z_i as the input image frame and the search branch 1 takes the search frame Z_{i+x} as the input image frame; the hyperspectral branch is removed when training the spatial branch, so only the spatial branch is trained;
step 3.2, the template frame Z_i, here an RGB video frame, enters the template branch 1; Z_i passes through a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer in sequence, yielding the feature F_t;
step 3.3, Z_{i+a} is input into the search branch 1; here Z_{i+a} is an RGB video frame; Z_{i+a} passes through a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer in sequence, yielding the feature F_s;
step 3.4, the ridge regression loss function is solved to obtain the filter w, where H is the ideal Gaussian response and λ is a constant:
ŵ = (F̂_t* ⊙ Ĥ) / (F̂_t* ⊙ F̂_t + λ)
where ŵ is the Fourier transform of w, F̂_t likewise the Fourier transform of F_t, and Ĥ the Fourier transform of H; * denotes the conjugate and ⊙ the dot product;
step 3.5, the final response R is computed from the filter w and the feature F_s of the subsequent frame:
R = F⁻¹(ŵ* ⊙ F̂_s)
where F⁻¹ denotes the inverse Fourier transform;
step 3.6, forward tracking is performed first, with tracking sequence Z_i → Z_{i+a} → Z_{i+b}, the three frames constituting a training pair with b > a, obtaining the tracking responses R_{i+a} and R_{i+b}; then backward tracking is performed with tracking sequence Z_{i+b} → Z_i, obtaining the tracking response R_i;
step 3.7, the moving weight M_motion is calculated, where H_i is the ideal Gaussian output of the initial frame Z_i, H_{i+a} is the ideal Gaussian output of Z_{i+a}, and m indexes the m different training pairs; the moving weight M_motion is used to determine whether the randomly initialized bounding box contains a dynamic target, and if it does, M_motion is weighted higher than when no dynamic target is present;
step 3.8, the loss function is constructed:
L = (1/n) Σ M_motion ⊙ ‖R_i − H_i‖²
where n denotes the batch size, R_i is the tracking response from Z_{i+b} to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are trained simultaneously, each mini-batch being a training pair of three frames; the weight parameter M_motion reduces the influence of non-dynamic targets on network training;
step 3.9, the loss value, i.e. the loss function value L of step 3.8, is back-propagated, the network parameters of step 3.2 are updated based on the stochastic gradient descent algorithm, and finally the optimized spatial-branch model is obtained.
5. The unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion of claim 1, characterized in that step 4 is implemented as follows:
step 4.1, the template branch 2 takes the template frame Z_i as the input image frame and the search branch 2 takes the search frame Z_{i+x} as the input image frame; the trained model of the spatial branch is loaded when training the hyperspectral branch, and the frozen spatial-branch parameters do not participate in back-propagation;
step 4.2, the template frame Z_i enters the template branch 2; three bands are selected from Z_i to form a pseudo-color video frame, from which the feature F_t_rgb is obtained; at the same time, Z_i passes through a network of 3 serially connected spectral feature extraction modules, yielding the feature F_t_hsi; the weight function a of the F_t_hsi feature channels is computed through a global average pooling layer, a fully connected layer, a nonlinear activation layer, a fully connected layer and Softmax in sequence, finally obtaining the weighted hyperspectral feature aF_t_hsi;
step 4.3, Z_{i+a} is input into the search branch 2; three bands are selected from Z_{i+a} to form a pseudo-color video frame, from which the feature F_s_rgb is obtained; likewise, Z_{i+a} passes through the network of 3 serially connected spectral feature extraction modules, yielding the feature F_s_hsi; using the weight a computed in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained;
step 4.4, the ridge regression loss function is solved to obtain the filter w_f, where F_t_f = aF_t_hsi + F_t_rgb, H is the ideal Gaussian response and λ is a constant:
ŵ_f = (F̂_t_f* ⊙ Ĥ) / (F̂_t_f* ⊙ F̂_t_f + λ)
where ŵ_f is the Fourier transform of w_f, F̂_t_f likewise the Fourier transform of F_t_f, and Ĥ the Fourier transform of H; * denotes the conjugate;
step 4.5, the final response R_f is computed from the filter w_f and the fused feature F_s_f = F_s_rgb + aF_s_hsi of the subsequent frame:
R_f = F⁻¹(ŵ_f* ⊙ F̂_s_f)
where F⁻¹ denotes the inverse Fourier transform;
step 4.6, forward tracking is performed first, with tracking sequence Z_i → Z_{i+a} → Z_{i+b} and b > a, obtaining the tracking responses R_f_{i+a} and R_f_{i+b}; then backward tracking is performed with tracking sequence Z_{i+b} → Z_i, obtaining the tracking response R_f_i;
step 4.7, the moving weight M_f_motion is calculated, where H_i is the ideal Gaussian output of the initial frame Z_i and H_{i+a} is the ideal Gaussian output of Z_{i+a}; the weight parameter M_f_motion is used to determine whether the randomly initialized bounding box contains a dynamic target, and if it does, M_f_motion is weighted higher than when no dynamic target is present;
step 4.8, the loss function is constructed:
L_f = (1/n) Σ M_f_motion ⊙ ‖R_f_i − H_i‖²
where n denotes the batch size, R_f_i is the tracking response from Z_{i+b} to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are trained simultaneously, each mini-batch being a training pair of three frames; the weight parameter M_f_motion reduces the influence of non-dynamic targets on network training;
step 4.9, the loss value, i.e. the loss function value L_f of step 4.8, is back-propagated, the network parameters of step 4.2 are updated, and finally the optimized spatial-hyperspectral model is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110018918.9A CN112766102B (en) | 2021-01-07 | 2021-01-07 | Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110018918.9A CN112766102B (en) | 2021-01-07 | 2021-01-07 | Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112766102A true CN112766102A (en) | 2021-05-07 |
CN112766102B CN112766102B (en) | 2024-04-26 |
Family
ID=75700670
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110018918.9A Active CN112766102B (en) | 2021-01-07 | 2021-01-07 | Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112766102B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107038684A (en) * | 2017-04-10 | 2017-08-11 | 南京信息工程大学 | A kind of method for lifting TMI spatial resolution |
CN108765280A (en) * | 2018-03-30 | 2018-11-06 | 徐国明 | A kind of high spectrum image spatial resolution enhancement method |
WO2020199205A1 (en) * | 2019-04-04 | 2020-10-08 | 合刃科技(深圳)有限公司 | Hybrid hyperspectral image reconstruction method and system |
US20200327679A1 (en) * | 2019-04-12 | 2020-10-15 | Beijing Moviebook Science and Technology Co., Ltd. | Visual target tracking method and apparatus based on deeply and densely connected neural network |
CN110210551A (en) * | 2019-05-28 | 2019-09-06 | 北京工业大学 | A kind of visual target tracking method based on adaptive main body sensitivity |
CN111062888A (en) * | 2019-12-16 | 2020-04-24 | 武汉大学 | Hyperspectral image denoising method based on multi-target low-rank sparsity and spatial-spectral total variation |
CN111325116A (en) * | 2020-02-05 | 2020-06-23 | 武汉大学 | Remote sensing image target detection method capable of evolving based on offline training-online learning depth |
CN111724411A (en) * | 2020-05-26 | 2020-09-29 | 浙江工业大学 | Multi-feature fusion tracking method based on hedging algorithm |
CN111797716A (en) * | 2020-06-16 | 2020-10-20 | 电子科技大学 | Single target tracking method based on Siamese network |
Non-Patent Citations (1)
Title |
---|
HONG KE: "RGB and hyperspectral image fusion based on superpixel segmentation", Electronic Technology & Software Engineering, No. 03 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113344932A (en) * | 2021-06-01 | 2021-09-03 | 电子科技大学 | Semi-supervised single-target video segmentation method |
CN113628244A (en) * | 2021-07-05 | 2021-11-09 | 上海交通大学 | Target tracking method, system, terminal and medium based on label-free video training |
CN113628244B (en) * | 2021-07-05 | 2023-11-28 | 上海交通大学 | Target tracking method, system, terminal and medium based on label-free video training |
CN117689692A (en) * | 2023-12-20 | 2024-03-12 | 中国人民解放军海军航空大学 | Attention mechanism guiding matching associated hyperspectral and RGB video fusion tracking method |
Also Published As
Publication number | Publication date |
---|---|
CN112766102B (en) | 2024-04-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||