CN112766102A - Unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion - Google Patents
- Publication number
- CN112766102A (application CN202110018918.9A)
- Authority
- CN
- China
- Prior art keywords
- frame
- hyperspectral
- branch
- tracking
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/48—Matching video sequences
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A40/00—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
- Y02A40/10—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture
Abstract
The invention relates to an unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion. A deep-learning hyperspectral target tracking method is designed around the cycle consistency principle, so that the hyperspectral tracking model can be trained without supervision, saving the cost of manual labeling. On the basis of a Siamese tracking framework, an RGB branch (spatial branch) and a hyperspectral branch are designed: the spatial branch is first trained with RGB video data; the trained RGB model is then loaded into the network with fixed parameters while the hyperspectral branch is trained, yielding fused features with higher robustness and discrimination. Finally, the fused features are input into a discriminative correlation filter (DCF) to obtain the tracking result. The method addresses both the manual-labeling burden of hyperspectral video data and the scarcity of hyperspectral training samples for deep-learning models, and effectively improves the precision and speed of hyperspectral video tracking.
Description
Technical Field
The invention relates to the field of computer vision, in particular to an unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion.
Background
Target tracking in hyperspectral video (high spatial, high temporal and high spectral resolution) is a new research direction: given the target information in an initial frame, the state of the target is predicted in subsequent frames. Compared with RGB video target tracking, hyperspectral video target tracking provides, in addition to spatial information, spectral information that can distinguish different materials. Even when two objects have the same shape, a target can still be tracked in hyperspectral video as long as the materials differ, an advantage that RGB video target tracking does not have. Hyperspectral video target tracking can therefore play an important role in fields such as camouflaged target tracking and small target tracking, and on this basis it has attracted the attention of more and more researchers.
At the same time, hyperspectral video target tracking is a difficult task. First, existing hyperspectral tracking algorithms represent the target with traditional hand-crafted features, which limits their performance. Second, hyperspectral video must be captured with a dedicated hyperspectral video camera, so training samples are scarce; as a result, no truly deep-learning-based hyperspectral video tracking algorithm exists at present. Third, supervised deep learning requires a large number of manually labeled samples, and video annotation in particular is time-consuming and labor-intensive. Because of these problems, existing hyperspectral video target tracking algorithms perform poorly.
Disclosure of Invention
The invention aims to provide an unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion.
The unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion has three notable characteristics. First, using the cycle consistency principle, the entire deep-learning-based hyperspectral tracking algorithm is trained without supervision, so no manual annotation is needed. Second, a correlation-filter hyperspectral tracking framework with spatial-spectral feature fusion is designed, which alleviates the scarcity of hyperspectral training samples to a certain extent while fusing RGB and hyperspectral features into features with higher robustness and discriminative capability. Third, a channel attention module is designed that computes the feature-channel weights only in the initial frame, so the network can dynamically aggregate different channel weights for different targets.
The invention provides an unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion, which comprises the following steps:
step 1, preprocessing the video data;
step 2, randomly initializing a bounding box and obtaining from it a template frame Z_i and subsequent search frames Z_{i+x}, the frames being RGB video frames or hyperspectral video frames;
step 3, unsupervised training of the RGB branch, also called the spatial branch, using the cycle consistency principle, finally obtaining an optimized spatial-branch model;
the spatial branch comprises a template branch 1 and a search branch 1, wherein the template branch 1 takes the template frame Z_i containing the tracking target, here an RGB video frame, as the input image frame, and the search branch 1 takes a search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the hyperspectral branch is removed when training the spatial branch, so only the spatial branch is trained;
the template branch 1 and the search branch 1 have the same structure, comprising a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer;
step 4, unsupervised training of the hyperspectral branch using the cycle consistency principle, finally obtaining an optimized spatial-hyperspectral model;
the hyperspectral branch comprises a template branch 2 and a search branch 2, wherein the template branch 2 takes the template frame Z_i of the tracked target, here a hyperspectral video frame, as the input image frame, and the search branch 2 takes a search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the trained model of the spatial branch is loaded when training the hyperspectral branch, and the frozen spatial-branch parameters do not participate in back-propagation;
the template branch 2 comprises a plurality of serially connected spectral feature extraction modules and a channel attention module, wherein the first two spectral feature extraction modules each comprise a convolutional layer, a batch normalization layer and a nonlinear activation layer, the third comprises a convolutional layer, a batch normalization layer, a nonlinear activation layer and a convolutional layer, and the channel attention module comprises a global average pooling layer, a fully connected layer, a nonlinear activation layer, a fully connected layer and Softmax; the search branch 2 comprises only the serially connected spectral feature extraction modules, without the channel attention module;
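For illustration, the spatial branch described above can be sketched in PyTorch as follows. This is a minimal sketch, not the claimed implementation: the patent fixes only the layer order (convolution, nonlinear activation, convolution, local response normalization), so the channel widths and kernel sizes below are assumptions.

```python
import torch
import torch.nn as nn

class SpatialBranch(nn.Module):
    """Shared backbone of template branch 1 and search branch 1 (Siamese weights)."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(inplace=True),                                 # nonlinear activation layer
            nn.Conv2d(32, 32, kernel_size=3, padding=1),           # convolutional layer
            nn.LocalResponseNorm(size=5),                          # local response normalization
        )

    def forward(self, x):
        return self.features(x)

spatial = SpatialBranch()
f_t = spatial(torch.randn(1, 3, 125, 125))  # template frame Z_i   -> feature F_t
f_s = spatial(torch.randn(1, 3, 125, 125))  # search frame Z_{i+x} -> feature F_s
```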
Further, step 1 is implemented as follows:
first, the video data is converted into a sequence of image frames X_i, where X_i is an RGB video frame or a hyperspectral video frame;
then every unlabeled video image frame X_i is resized to a uniform, fixed-size video image frame Y_i.
Further, step 2 is implemented as follows:
on the basis of step 1, in the unlabeled video frame Y_i, a region of 90 x 90 pixels centered at coordinates [x, y] is selected as the target to be tracked; this region is the initialized BBOX. The 90 x 90 region is resized to a 125 x 125 pixel Z_i. At the same time, two frames Y_{i+a} and Y_{i+b} are randomly selected among the 10 frames Y_{i+1} to Y_{i+10} (10 >= a > 0, 10 >= b > 0, a > b or a < b), and likewise the 90 x 90 pixel region centered at [x, y] is resized to the 125 x 125 pixel Z_{i+a} and Z_{i+b}.
Further, step 3 is implemented as follows:
step 3.1, the template branch 1 takes the template frame Z_i as the input image frame and the search branch 1 takes the search frame Z_{i+x} as the input image frame; the hyperspectral branch is removed when training the spatial branch, so only the spatial branch is trained;
step 3.2, the template frame Z_i, here an RGB video frame, enters the template branch 1; Z_i passes through a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer in sequence, yielding the feature F_t;
step 3.3, Z_{i+a} is input into the search branch 1; here Z_{i+a} is an RGB video frame; Z_{i+a} passes through a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer in sequence, yielding the feature F_s;
step 3.4, the ridge regression loss function is solved to obtain the filter w, where H is the ideal Gaussian response and λ is a constant:
ŵ = (F̂_t* ⊙ Ĥ) / (F̂_t* ⊙ F̂_t + λ)
where ŵ is the Fourier transform of w, F̂_t likewise the Fourier transform of F_t, and Ĥ the Fourier transform of H; * denotes the conjugate and ⊙ the dot product;
step 3.5, the final response R is computed from the filter w and the feature F_s of the subsequent frame:
R = F⁻¹(ŵ* ⊙ F̂_s)
where F⁻¹ denotes the inverse Fourier transform;
step 3.6, forward tracking is performed first, with tracking sequence Z_i → Z_{i+a} → Z_{i+b}, the three frames constituting a training pair with b > a, obtaining the tracking responses R_{i+a} and R_{i+b}; then backward tracking is performed with tracking sequence Z_{i+b} → Z_i, obtaining the tracking response R_i;
step 3.7, the moving weight M_motion is calculated, where H_i is the ideal Gaussian output of the initial frame Z_i, H_{i+a} is the ideal Gaussian output of Z_{i+a}, and m indexes the m different training pairs; the moving weight M_motion is used to determine whether the randomly initialized bounding box contains a dynamic target, and if it does, M_motion is weighted higher than when no dynamic target is present;
step 3.8, the loss function is constructed:
L = (1/n) Σ M_motion ⊙ ‖R_i − H_i‖²
where n denotes the batch size, R_i is the tracking response from Z_{i+b} to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are trained simultaneously, each mini-batch being a training pair of three frames; the weight parameter M_motion reduces the influence of non-dynamic targets on network training;
step 3.9, the loss value, i.e. the loss function value L of step 3.8, is back-propagated, the network parameters of step 3.2 are updated with the stochastic gradient descent (SGD) algorithm, and finally the optimized spatial-branch model is obtained.
Further, step 4 is implemented as follows:
step 4.1, the template branch 2 takes the template frame Z_i, here a hyperspectral video frame, as the input image frame, and the search branch 2 takes the search frame Z_{i+x}, also a hyperspectral video frame, as the input image frame; the trained model of the spatial branch is loaded when training the hyperspectral branch, and the frozen spatial-branch parameters do not participate in back-propagation;
step 4.2, the template frame Z_i enters the template branch 2; three bands are selected from Z_i to form a pseudo-color video frame, from which the feature F_t_rgb is obtained; at the same time, Z_i passes through a network of 3 serially connected spectral feature extraction modules, whose structure is shown in FIG. 2, yielding the feature F_t_hsi; the weight function a of the F_t_hsi feature channels is computed through a global average pooling layer, a fully connected layer, a nonlinear activation layer, a fully connected layer and Softmax in sequence, finally obtaining the weighted hyperspectral feature aF_t_hsi;
step 4.3, Z_{i+a} is input into the search branch 2; three bands are selected from Z_{i+a} to form a pseudo-color video frame, from which the feature F_s_rgb is obtained; likewise, Z_{i+a} passes through the network of 3 serially connected spectral feature extraction modules of FIG. 2, yielding the feature F_s_hsi; using the weight a computed in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained;
step 4.4, the ridge regression loss function is solved to obtain the filter w_f, where F_t_f = aF_t_hsi + F_t_rgb, H is the ideal Gaussian response and λ is a constant:
ŵ_f = (F̂_t_f* ⊙ Ĥ) / (F̂_t_f* ⊙ F̂_t_f + λ)
where ŵ_f is the Fourier transform of w_f, F̂_t_f likewise the Fourier transform of F_t_f, and Ĥ the Fourier transform of H; * denotes the conjugate;
step 4.5, the final response R_f is computed from the filter w_f and the fused feature F_s_f = F_s_rgb + aF_s_hsi of the subsequent frame:
R_f = F⁻¹(ŵ_f* ⊙ F̂_s_f)
where F⁻¹ denotes the inverse Fourier transform;
step 4.6, forward tracking is performed first, with tracking sequence Z_i → Z_{i+a} → Z_{i+b} and b > a, obtaining the tracking responses R_f_{i+a} and R_f_{i+b}; then backward tracking is performed with tracking sequence Z_{i+b} → Z_i, obtaining the tracking response R_f_i;
step 4.7, the moving weight M_f_motion is calculated, where H_i is the ideal Gaussian output of the initial frame Z_i and H_{i+a} is the ideal Gaussian output of Z_{i+a}; the weight parameter M_f_motion is used to determine whether the randomly initialized bounding box contains a dynamic target, and if it does, M_f_motion is weighted higher than when no dynamic target is present;
step 4.8, the loss function is constructed:
L_f = (1/n) Σ M_f_motion ⊙ ‖R_f_i − H_i‖²
where n denotes the batch size, R_f_i is the tracking response from Z_{i+b} to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are trained simultaneously, each mini-batch being a training pair of three frames; the weight parameter M_f_motion reduces the influence of non-dynamic targets on network training;
step 4.9, the loss value, i.e. the loss function value L_f of step 4.8, is back-propagated, the network parameters of step 4.2 are updated, and finally the optimized spatial-hyperspectral model is obtained.
The method of the invention has the following notable effects: (1) the unsupervised training network based on the cycle consistency principle saves labeling cost; (2) a tracking model fusing RGB and hyperspectral features is trained end to end with deep learning and infers quickly, tens of times faster than traditional hand-crafted-feature methods; (3) a channel attention mechanism aggregates, in the initial frame, the features most effective for the target to be tracked, increasing the network's ability to discriminate the target.
Drawings
FIG. 1 is a schematic view of the cycle consistency in step 4 of embodiment 1 of the present invention.
FIG. 2 is a schematic diagram of the hyperspectral branch in step 4 of embodiment 1 of the present invention.
FIG. 3 is a schematic diagram of the spatial branch in step 3 of embodiment 1 of the present invention.
FIG. 4 is a schematic diagram of the tracking result in step 3 of embodiment 1 of the present invention; the numbers indicate the 4th and 12th frames, and the box marks the position and size of the tracked target, moving and scaling with the target (the box grows as the target grows and shrinks as the target shrinks).
Fig. 5 is a flowchart of embodiment 1 of the present invention.
Detailed Description
The technical scheme of the invention is further specifically described by the following embodiments and the accompanying drawings.
Embodiment 1:
The embodiment of the invention provides an unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion, which comprises the following steps:
Step 1.1, convert the video data into a sequence of image frames X_i (RGB video frames or hyperspectral video frames).
Step 1.2, resize every unlabeled video image frame X_i to a 200 x 200 pixel video image frame Y_i.
On the basis of step 1, a region of 90 x 90 pixels is randomly selected in the unlabeled video frame Y_i (a 90 x 90 pixel region centered at coordinates [x, y]) as the target to be tracked; this region is the initialized BBOX. The 90 x 90 region is resized to a 125 x 125 pixel Z_i. At the same time, two frames Y_{i+a} and Y_{i+b} are randomly selected among the 10 frames Y_{i+1} to Y_{i+10} (10 >= a > 0, 10 >= b > 0, a > b or a < b), and likewise the 90 x 90 pixel region centered at [x, y] is resized to the 125 x 125 pixel Z_{i+a} and Z_{i+b}.
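The sampling of steps 1 and 2 can be sketched as below. This OpenCV-based helper is illustrative only: the frame list, coordinate ranges and interpolation are assumptions, not the claimed implementation.

```python
import random
import cv2

def sample_training_pair(frames, i):
    """Build one training pair (Z_i, Z_{i+a}, Z_{i+b}) from unlabeled frames."""
    # Step 1: resize the unlabeled frames to 200 x 200 pixels (Y_i .. Y_{i+10}).
    Y = [cv2.resize(f, (200, 200)) for f in frames[i:i + 11]]
    # Step 2: a 90 x 90 BBOX centered at a random [x, y]; kept inside the frame.
    x, y = random.randint(45, 155), random.randint(45, 155)
    crop = lambda img: cv2.resize(img[y - 45:y + 45, x - 45:x + 45], (125, 125))
    a, b = random.sample(range(1, 11), 2)  # two random frames among Y_{i+1}..Y_{i+10}
    return crop(Y[0]), crop(Y[a]), crop(Y[b])  # Z_i, Z_{i+a}, Z_{i+b}
```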
Step 3.1, the whole network is built on a Siamese structure and divided into a template branch and a search branch. The template branch takes the template frame Z_i (containing the target to be tracked; here Z_i is an RGB video frame) as the input image frame, and is further divided into a spatial branch and a hyperspectral branch. The search branch takes the search frame Z_{i+x} (a subsequent video frame, x > 0) as the input image frame and is likewise divided into a spatial branch and a hyperspectral branch. The hyperspectral branch is removed when training the spatial branch, so only the spatial branch is trained.
Step 3.2, the template frame Z_i (here an RGB video frame) enters the template branch. Z_i passes through a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer in sequence, yielding the feature F_t.
Step 3.3, Z_{i+a} (assume b > a) enters the search branch; here Z_{i+a} is an RGB video frame. Z_{i+a} passes through a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer in sequence, yielding the feature F_s.
Step 3.4, solve the ridge regression loss function to obtain the filter w, where H is the ideal Gaussian response and λ is a constant:

ŵ = (F̂_t* ⊙ Ĥ) / (F̂_t* ⊙ F̂_t + λ)

where ŵ is the Fourier transform of w, F̂_t likewise the Fourier transform of F_t, and Ĥ the Fourier transform of H; * denotes the conjugate and ⊙ the dot product.
Step 3.5, the final response R is computed from the filter w and the feature F_s of the subsequent frame:

R = F⁻¹(ŵ* ⊙ F̂_s)

where F⁻¹ denotes the inverse Fourier transform.
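The two formulas above are the standard closed-form discriminative correlation filter solution. A PyTorch sketch under that reading follows; the per-channel filter with the spectral energy summed over channels is a common DCF convention and is an assumption here, not text from the patent.

```python
import torch

def dcf_filter(F_t, H, lam=1e-4):
    """Ridge regression in the Fourier domain: w_hat = conj(F_t_hat)*H_hat / (energy + lambda)."""
    F_t_hat = torch.fft.fft2(F_t)                          # template features, shape (C, H, W)
    H_hat = torch.fft.fft2(H)                              # ideal Gaussian response, shape (H, W)
    energy = (torch.conj(F_t_hat) * F_t_hat).sum(0).real   # spectral energy summed over channels
    return (torch.conj(F_t_hat) * H_hat) / (energy + lam)

def dcf_response(w_hat, F_s):
    """Final response R = F^{-1}( conj(w_hat) * F_s_hat ), summed over feature channels."""
    F_s_hat = torch.fft.fft2(F_s)
    return torch.fft.ifft2((torch.conj(w_hat) * F_s_hat).sum(0)).real
```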
Step 3.6, forward tracking is performed first, with tracking sequence Z_i → Z_{i+a} → Z_{i+b} (the three frames constitute a training pair), obtaining the tracking responses R_{i+a} and R_{i+b}; then backward tracking is performed with tracking sequence Z_{i+b} → Z_i, obtaining the tracking response R_i.
Step 3.7, calculate the moving weight M_motion, where H_i is the ideal Gaussian output of the initial frame Z_i, H_{i+a} is the ideal Gaussian output of Z_{i+a}, and m indexes the m different training pairs. The moving weight M_motion indicates whether the randomly initialized bounding box contains a dynamic target: if it does, M_motion is weighted higher than when no dynamic target is present.
Step 3.8, construct the loss function:

L = (1/n) Σ M_motion ⊙ ‖R_i − H_i‖²

where n denotes the batch size and the sum runs over the training pairs in the batch, R_i is the tracking response from Z_{i+b} to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are trained simultaneously, each mini-batch being a training pair of three frames. The weight parameter M_motion reduces the influence of non-dynamic targets on network training.
Step 3.9, the loss value is back-propagated to update the network model parameters: the network parameters of step 3.2 are updated with the stochastic gradient descent (SGD) algorithm, finally yielding the optimized spatial-branch model.
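Steps 3.6 through 3.9 can be condensed into the sketch below. The tensors standing in for the forward-backward tracking outputs are placeholders (assumptions), and the SGD step follows the standard PyTorch pattern.

```python
import torch

def consistency_loss(R_i, H_i, M_motion):
    """Step 3.8: L = (1/n) * sum over training pairs of M_motion * ||R_i - H_i||^2."""
    per_pair = (R_i - H_i).pow(2).flatten(1).sum(-1)   # squared error per training pair
    return (M_motion * per_pair).mean()                # weighted mean over the batch

# Placeholders standing in for the outputs of forward-backward tracking:
R_i = torch.randn(8, 125, 125, requires_grad=True)  # responses of backward tracking Z_{i+b} -> Z_i
H_i = torch.randn(8, 125, 125)                      # ideal Gaussian labels of the initial frames
M_motion = torch.ones(8)                            # moving weights (uniform placeholder)

loss = consistency_loss(R_i, H_i, M_motion)
loss.backward()  # step 3.9: in the full network the gradient flows back into the
                 # feature extractor, whose parameters are then updated with SGD
```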
Step 4.1, the whole network is again built on a Siamese structure and divided into a template branch and a search branch. The template branch takes the template frame Z_i (containing the target to be tracked; here Z_i and all input video frames are hyperspectral video frames) as the input image frame, and is divided into a spatial branch and a hyperspectral branch. The search branch takes the search frame Z_{i+x} (a subsequent video frame, x > 0) as the input image frame and is likewise divided into a spatial branch and a hyperspectral branch. The trained model of the spatial branch is loaded when training the hyperspectral branch, and the frozen spatial-branch parameters do not participate in back-propagation.
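Loading and freezing the spatial branch in step 4.1 corresponds to the usual PyTorch pattern. A minimal sketch, with a hypothetical checkpoint path, is:

```python
import torch

spatial = SpatialBranch()  # structure from the sketch after step 3.1
spatial.load_state_dict(torch.load("spatial_branch.pth"))  # hypothetical checkpoint path
for p in spatial.parameters():
    p.requires_grad = False  # frozen: excluded from back-propagation
spatial.eval()
```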
Step 4.2, the template frame Z_i enters the template branch. Three bands are selected from Z_i to form a pseudo-color video frame, from which the feature F_t_rgb is obtained (spatial branch). At the same time, Z_i passes through a network of 3 serially connected spectral feature extraction modules, whose structure is shown in FIG. 2, yielding the feature F_t_hsi. The weight function a of the F_t_hsi feature channels is computed through a global average pooling layer, a fully connected layer, a nonlinear activation layer, a fully connected layer and Softmax in sequence (channel attention mechanism); the weight function is computed only on the template frame, and a is reused directly in subsequent frames. This finally yields the weighted hyperspectral feature aF_t_hsi.
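An illustrative sketch of the spectral feature extraction modules and the channel attention module of FIG. 2 follows. Only the layer order comes from the text; the channel counts and the reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class SpectralFeatureNet(nn.Module):
    """Three serially connected spectral feature extraction modules: two
    (conv - batch norm - nonlinearity) modules, then (conv - bn - nonlinearity - conv)."""
    def __init__(self, bands=16, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(bands, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU(True),
            nn.Conv2d(width, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU(True),
            nn.Conv2d(width, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU(True),
            nn.Conv2d(width, width, 3, padding=1),  # the third module ends with a conv layer
        )
    def forward(self, x):
        return self.net(x)

class ChannelAttention(nn.Module):
    """Global average pooling - FC - nonlinearity - FC - Softmax; computed on the
    template frame only, then reused unchanged for every search frame."""
    def __init__(self, channels=32, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(True),
            nn.Linear(channels // reduction, channels), nn.Softmax(dim=1),
        )
    def forward(self, f):
        a = self.fc(f.mean(dim=(2, 3)))       # per-channel weights from pooled features
        return a.unsqueeze(-1).unsqueeze(-1)  # shape (N, C, 1, 1) for broadcasting
```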
Step 4.3, Z_{i+a} (assume b > a) enters the search branch. Three bands (the same band composition as in step 4.2) are selected from Z_{i+a} to form a pseudo-color video frame, from which the feature F_s_rgb is obtained. Likewise, Z_{i+a} passes through the network of 3 serially connected spectral feature extraction modules of FIG. 2, yielding the feature F_s_hsi. Using the weight a computed in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained.
Step 4.4, solve the ridge regression loss function to obtain the filter w_f, where F_t_f = aF_t_hsi + F_t_rgb, H is the ideal Gaussian response and λ is a constant:

ŵ_f = (F̂_t_f* ⊙ Ĥ) / (F̂_t_f* ⊙ F̂_t_f + λ)

where ŵ_f is the Fourier transform of w_f, F̂_t_f likewise the Fourier transform of F_t_f, and Ĥ the Fourier transform of H; * denotes the conjugate.
Step 4.5, the final response R_f is computed from the filter w_f and the fused feature F_s_f = F_s_rgb + aF_s_hsi of the subsequent frame:

R_f = F⁻¹(ŵ_f* ⊙ F̂_s_f)

where F⁻¹ denotes the inverse Fourier transform.
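Putting steps 4.2 through 4.5 together, and reusing the sketches above, a hedged end-to-end fragment might look like this; the number of bands, the band indices for the pseudo-color frame and the Gaussian label parameters are all assumptions.

```python
import torch

hsi_t = torch.randn(1, 16, 125, 125)   # hyperspectral template frame Z_i
hsi_s = torch.randn(1, 16, 125, 125)   # hyperspectral search frame Z_{i+a}
bands = [2, 8, 14]                     # hypothetical pseudo-color band selection

extract, attend = SpectralFeatureNet(), ChannelAttention()

# Stand-in ideal Gaussian response H centered on the template:
ys, xs = torch.meshgrid(torch.arange(125), torch.arange(125), indexing="ij")
H = torch.exp(-((xs - 62) ** 2 + (ys - 62) ** 2) / (2 * 5.0 ** 2))

F_t_hsi = extract(hsi_t)
a = attend(F_t_hsi)                    # weight a: computed on the template frame only (step 4.2)
F_t_rgb = spatial(hsi_t[:, bands])     # pseudo-color frame through the frozen spatial branch
F_t_f = a * F_t_hsi + F_t_rgb          # fused template feature

F_s_f = spatial(hsi_s[:, bands]) + a * extract(hsi_s)  # fused search feature (step 4.3)

w_f = dcf_filter(F_t_f[0], H)          # ridge regression in the Fourier domain (step 4.4)
R_f = dcf_response(w_f, F_s_f[0])      # final response map (step 4.5)
```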
Step 4.6, forward tracking is performed first, with tracking sequence Z_i → Z_{i+a} → Z_{i+b}, obtaining the tracking responses R_f_{i+a} and R_f_{i+b}; then backward tracking is performed with tracking sequence Z_{i+b} → Z_i, obtaining the tracking response R_f_i.
Step 4.7, calculate the moving weight M_f_motion, where H_i is the ideal Gaussian output of the initial frame Z_i and H_{i+a} is the ideal Gaussian output of Z_{i+a}. The weight parameter M_f_motion indicates whether the randomly initialized bounding box contains a dynamic target: if it does, M_f_motion is weighted higher than when no dynamic target is present.
Step 4.8, construct the loss function:

L_f = (1/n) Σ M_f_motion ⊙ ‖R_f_i − H_i‖²

where n denotes the batch size and the sum runs over the training pairs in the batch, R_f_i is the tracking response from Z_{i+b} to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are trained simultaneously, each mini-batch being a training pair of three frames. The weight parameter M_f_motion reduces the influence of non-dynamic targets on network training.
Step 4.9, the loss value is back-propagated to update the network model parameters: the network parameters of step 4.2 are updated, finally yielding the optimized spatial-hyperspectral model.
The method of the invention has the following notable effects: (1) the unsupervised training network based on the cycle consistency principle saves labeling cost; (2) a tracking model fusing RGB and hyperspectral features is trained end to end with deep learning and infers quickly, tens of times faster than traditional hand-crafted-feature methods; (3) a channel attention mechanism aggregates, in the initial frame, the features most effective for the target to be tracked, increasing the network's ability to discriminate the target. The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art may make various modifications, additions or substitutions to the described embodiments without departing from the spirit of the invention or the scope defined in the appended claims.
Claims (5)
1. An unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion, characterized by comprising the following steps:
step 1, preprocessing video data;
step 2, randomly initializing a bounding box and obtaining from it a template frame Z_i and subsequent search frames Z_{i+x}, the template frame Z_i and search frames Z_{i+x} being RGB video frames or hyperspectral video frames;
step 3, unsupervised training of the RGB branch, also called the spatial branch, using the cycle consistency principle, finally obtaining an optimized spatial-branch model;
the spatial branch comprises a template branch 1 and a search branch 1, wherein the template branch 1 takes the template frame Z_i containing the tracking target, here an RGB video frame, as the input image frame, and the search branch 1 takes a search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the hyperspectral branch is removed when training the spatial branch, so only the spatial branch is trained;
the template branch 1 and the search branch 1 have the same structure, comprising a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer;
step 4, unsupervised training of the hyperspectral branch using the cycle consistency principle, finally obtaining an optimized spatial-hyperspectral model;
the hyperspectral branch comprises a template branch 2 and a search branch 2, wherein the template branch 2 takes the template frame Z_i of the tracked target, here a hyperspectral video frame, as the input image frame, and the search branch 2 takes a search frame Z_{i+x}, i.e. a subsequent video frame with x > 0, as the input image frame; the trained model of the spatial branch is loaded when training the hyperspectral branch, and the frozen spatial-branch parameters do not participate in back-propagation;
the template branch 2 comprises a plurality of serially connected spectral feature extraction modules and a channel attention module, wherein the first two spectral feature extraction modules each comprise a convolutional layer, a batch normalization layer and a nonlinear activation layer, the third spectral feature extraction module comprises a convolutional layer, a batch normalization layer, a nonlinear activation layer and a convolutional layer, and the channel attention module comprises a global average pooling layer, a fully connected layer, a nonlinear activation layer, a fully connected layer and Softmax; the search branch 2 comprises only the serially connected spectral feature extraction modules, without the channel attention module.
2. The unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion of claim 1, characterized in that step 1 is implemented as follows:
first, the video data is converted into a sequence of image frames X_i, where X_i is an RGB video frame or a hyperspectral video frame;
then every unlabeled video image frame X_i is resized to a uniform, fixed-size video image frame Y_i.
3. The unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion of claim 1, characterized in that step 2 is implemented as follows:
on the basis of step 1, in the unlabeled video frame Y_i, a region of 90 x 90 pixels centered at coordinates [x, y] is selected as the target to be tracked; this region is the initialized BBOX; the 90 x 90 region is resized to a 125 x 125 pixel Z_i; at the same time, two frames Y_{i+a} and Y_{i+b} are randomly selected among the 10 frames Y_{i+1} to Y_{i+10} (10 >= a > 0, 10 >= b > 0, a > b or a < b), and likewise the 90 x 90 pixel region centered at [x, y] is resized to the 125 x 125 pixel Z_{i+a} and Z_{i+b}.
4. The unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion of claim 1, characterized in that step 3 is implemented as follows:
step 3.1, the template branch 1 takes the template frame Z_i as the input image frame and the search branch 1 takes the search frame Z_{i+x} as the input image frame; the hyperspectral branch is removed when training the spatial branch, so only the spatial branch is trained;
step 3.2, the template frame Z_i, here an RGB video frame, enters the template branch 1; Z_i passes through a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer in sequence, yielding the feature F_t;
step 3.3, Z_{i+a} is input into the search branch 1; here Z_{i+a} is an RGB video frame; Z_{i+a} passes through a convolutional layer, a nonlinear activation layer, a convolutional layer and a local response normalization layer in sequence, yielding the feature F_s;
step 3.4, the ridge regression loss function is solved to obtain the filter w, where H is the ideal Gaussian response and λ is a constant:
ŵ = (F̂_t* ⊙ Ĥ) / (F̂_t* ⊙ F̂_t + λ)
where ŵ is the Fourier transform of w, F̂_t likewise the Fourier transform of F_t, and Ĥ the Fourier transform of H; * denotes the conjugate and ⊙ the dot product;
step 3.5, the final response R is computed from the filter w and the feature F_s of the subsequent frame:
R = F⁻¹(ŵ* ⊙ F̂_s)
where F⁻¹ denotes the inverse Fourier transform;
step 3.6, forward tracking is performed first, with tracking sequence Z_i → Z_{i+a} → Z_{i+b}, the three frames constituting a training pair with b > a, obtaining the tracking responses R_{i+a} and R_{i+b}; then backward tracking is performed with tracking sequence Z_{i+b} → Z_i, obtaining the tracking response R_i;
step 3.7, the moving weight M_motion is calculated, where H_i is the ideal Gaussian output of the initial frame Z_i, H_{i+a} is the ideal Gaussian output of Z_{i+a}, and m indexes the m different training pairs; the moving weight M_motion is used to determine whether the randomly initialized bounding box contains a dynamic target, and if it does, M_motion is weighted higher than when no dynamic target is present;
step 3.8, the loss function is constructed:
L = (1/n) Σ M_motion ⊙ ‖R_i − H_i‖²
where n denotes the batch size, R_i is the tracking response from Z_{i+b} to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are trained simultaneously, each mini-batch being a training pair of three frames; the weight parameter M_motion reduces the influence of non-dynamic targets on network training;
step 3.9, the loss value, i.e. the loss function value L of step 3.8, is back-propagated, the network parameters of step 3.2 are updated based on the stochastic gradient descent algorithm, and finally the optimized spatial-branch model is obtained.
5. The unsupervised hyperspectral video target tracking method based on spatial-spectral feature fusion of claim 1, characterized in that step 4 is implemented as follows:
step 4.1, the template branch 2 takes the template frame Z_i as the input image frame and the search branch 2 takes the search frame Z_{i+x} as the input image frame; the trained model of the spatial branch is loaded when training the hyperspectral branch, and the frozen spatial-branch parameters do not participate in back-propagation;
step 4.2, the template frame Z_i enters the template branch 2; three bands are selected from Z_i to form a pseudo-color video frame, from which the feature F_t_rgb is obtained; at the same time, Z_i passes through a network of 3 serially connected spectral feature extraction modules, yielding the feature F_t_hsi; the weight function a of the F_t_hsi feature channels is computed through a global average pooling layer, a fully connected layer, a nonlinear activation layer, a fully connected layer and Softmax in sequence, finally obtaining the weighted hyperspectral feature aF_t_hsi;
step 4.3, Z_{i+a} is input into the search branch 2; three bands are selected from Z_{i+a} to form a pseudo-color video frame, from which the feature F_s_rgb is obtained; likewise, Z_{i+a} passes through the network of 3 serially connected spectral feature extraction modules, yielding the feature F_s_hsi; using the weight a computed in step 4.2, the weighted hyperspectral feature aF_s_hsi is finally obtained;
step 4.4, the ridge regression loss function is solved to obtain the filter w_f, where F_t_f = aF_t_hsi + F_t_rgb, H is the ideal Gaussian response and λ is a constant:
ŵ_f = (F̂_t_f* ⊙ Ĥ) / (F̂_t_f* ⊙ F̂_t_f + λ)
where ŵ_f is the Fourier transform of w_f, F̂_t_f likewise the Fourier transform of F_t_f, and Ĥ the Fourier transform of H; * denotes the conjugate;
step 4.5, the final response R_f is computed from the filter w_f and the fused feature F_s_f = F_s_rgb + aF_s_hsi of the subsequent frame:
R_f = F⁻¹(ŵ_f* ⊙ F̂_s_f)
where F⁻¹ denotes the inverse Fourier transform;
step 4.6, forward tracking is performed first, with tracking sequence Z_i → Z_{i+a} → Z_{i+b} and b > a, obtaining the tracking responses R_f_{i+a} and R_f_{i+b}; then backward tracking is performed with tracking sequence Z_{i+b} → Z_i, obtaining the tracking response R_f_i;
step 4.7, the moving weight M_f_motion is calculated, where H_i is the ideal Gaussian output of the initial frame Z_i and H_{i+a} is the ideal Gaussian output of Z_{i+a}; the weight parameter M_f_motion is used to determine whether the randomly initialized bounding box contains a dynamic target, and if it does, M_f_motion is weighted higher than when no dynamic target is present;
step 4.8, the loss function is constructed:
L_f = (1/n) Σ M_f_motion ⊙ ‖R_f_i − H_i‖²
where n denotes the batch size, R_f_i is the tracking response from Z_{i+b} to Z_i, and H_i is the ideal Gaussian response of the initial frame Z_i; M mini-batches are trained simultaneously, each mini-batch being a training pair of three frames; the weight parameter M_f_motion reduces the influence of non-dynamic targets on network training;
step 4.9, the loss value, i.e. the loss function value L_f of step 4.8, is back-propagated, the network parameters of step 4.2 are updated, and finally the optimized spatial-hyperspectral model is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110018918.9A CN112766102B (en) | 2021-01-07 | 2021-01-07 | Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110018918.9A CN112766102B (en) | 2021-01-07 | 2021-01-07 | Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112766102A true CN112766102A (en) | 2021-05-07 |
CN112766102B CN112766102B (en) | 2024-04-26 |
Family
ID=75700670
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110018918.9A Active CN112766102B (en) | 2021-01-07 | 2021-01-07 | Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112766102B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107038684A (en) * | 2017-04-10 | 2017-08-11 | 南京信息工程大学 | A kind of method for lifting TMI spatial resolution |
CN108765280A (en) * | 2018-03-30 | 2018-11-06 | 徐国明 | A kind of high spectrum image spatial resolution enhancement method |
WO2020199205A1 (en) * | 2019-04-04 | 2020-10-08 | 合刃科技(深圳)有限公司 | Hybrid hyperspectral image reconstruction method and system |
US20200327679A1 (en) * | 2019-04-12 | 2020-10-15 | Beijing Moviebook Science and Technology Co., Ltd. | Visual target tracking method and apparatus based on deeply and densely connected neural network |
CN110210551A (en) * | 2019-05-28 | 2019-09-06 | 北京工业大学 | A kind of visual target tracking method based on adaptive main body sensitivity |
CN111062888A (en) * | 2019-12-16 | 2020-04-24 | 武汉大学 | Hyperspectral image denoising method based on multi-target low-rank sparsity and spatial-spectral total variation |
CN111325116A (en) * | 2020-02-05 | 2020-06-23 | 武汉大学 | Remote sensing image target detection method capable of evolving based on offline training-online learning depth |
CN111724411A (en) * | 2020-05-26 | 2020-09-29 | 浙江工业大学 | Multi-feature fusion tracking method based on hedging algorithm |
CN111797716A (en) * | 2020-06-16 | 2020-10-20 | 电子科技大学 | Single target tracking method based on Siamese network |
Non-Patent Citations (1)
Title |
---|
HONG KE: "RGB and hyperspectral image fusion based on superpixel segmentation", Electronic Technology & Software Engineering, No. 03 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113344932A (en) * | 2021-06-01 | 2021-09-03 | 电子科技大学 | Semi-supervised single-target video segmentation method |
CN113628244A (en) * | 2021-07-05 | 2021-11-09 | 上海交通大学 | Target tracking method, system, terminal and medium based on label-free video training |
CN113628244B (en) * | 2021-07-05 | 2023-11-28 | 上海交通大学 | Target tracking method, system, terminal and medium based on label-free video training |
CN117689692A (en) * | 2023-12-20 | 2024-03-12 | 中国人民解放军海军航空大学 | Attention mechanism guiding matching associated hyperspectral and RGB video fusion tracking method |
Also Published As
Publication number | Publication date |
---|---|
CN112766102B (en) | 2024-04-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||