CN112329784A - Correlation filtering tracking method based on space-time perception and multimodal response

Correlation filtering tracking method based on space-time perception and multimodal response

Info

Publication number
CN112329784A
CN112329784A (application CN202011323988.7A)
Authority
CN
China
Prior art keywords
target
frame image
tracking
characteristic
size
Prior art date
Legal status
Pending
Application number
CN202011323988.7A
Other languages
Chinese (zh)
Inventor
牛军浩
王文胜
苏金操
骆薇羽
许川佩
朱爱军
陈涛
殷贤华
张本鑫
Current Assignee
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202011323988.7A
Publication of CN112329784A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/56 - Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a correlation filtering tracking method based on space-time perception and multimodal response, which comprises the steps of: first, determining the position and the size of a tracking target on the first frame image of a tracking video; then, training a target prediction model for the t-th frame image by using the position and size of the tracking target determined on the (t-1)-th frame image; and finally, determining the position and size of the tracking target on the t-th frame image by using that target prediction model. The invention tracks continuously and robustly under illumination changes and posture changes around the target, and avoids erroneous model updates caused by occlusion of the target or interference from similar objects, so that the model is always kept in a good state and the target tracking accuracy is high; meanwhile, real-time data can be processed rapidly during tracking, so the method can be applied in practical settings.

Description

Correlation filtering tracking method based on space-time perception and multimodal response
Technical Field
The invention relates to the technical field of target tracking, in particular to a correlation filtering tracking method based on space-time perception and multimodal response.
Background
Target tracking is one of the research hotspots in the field of computer vision and is widely applied in face recognition, robot vision, intelligent monitoring and other fields. Target tracking methods based on deep learning have shown increasingly robust performance thanks to the strong feature-learning ability of deep neural networks, but they are computationally expensive, track inefficiently, cannot meet real-time requirements, demand substantial hardware resources and are therefore ill-suited to engineering products; deep learning reaches real-time performance only by using partial networks and continuous optimization. Discriminative tracking methods, which explicitly distinguish background from foreground, have strong discriminative ability and currently occupy the mainstream position in the target tracking field. In recent years, correlation filters have been introduced into the discriminative tracking framework, and correlation-filter-based target tracking methods have achieved good results. The MOSSE (Minimum Output Sum of Squared Error) filter introduced correlation operations into target tracking and greatly accelerated computation by exploiting the fact that spatial-domain convolution becomes element-wise multiplication in the Fourier domain. Later, the CSK algorithm, based on a circulant structure with kernels, used a circulant matrix to increase the number of samples and thereby improved the classifier. As an extension of CSK, the kernelized correlation filter KCF uses histogram-of-oriented-gradients features, a Gaussian kernel and ridge regression. For target scale changes, the discriminative scale space tracker DSST solves scale estimation by learning a correlation filter over a scale pyramid. However, these trackers do not handle target occlusion well, or only address partial occlusion and short-term full occlusion, and existing occlusion criteria do not integrate well with the tracking algorithms, so occlusion is often misjudged, which seriously degrades tracker performance. Therefore, although existing target tracking algorithms have achieved great progress, accurate target tracking still faces many problems caused by factors such as posture change, illumination change, partial occlusion, rapid motion, scale change and background complexity.
Disclosure of Invention
The invention aims to solve the problem that existing target tracking algorithms track poorly, and provides a correlation filtering tracking method based on space-time perception and multimodal response.
In order to solve the problems, the invention is realized by the following technical scheme:
a correlation filtering tracking method based on space-time perception and multimodal response comprises the following steps:
step 1, determining the position and the size of a tracking target on a first frame image of a tracking video;
step 2, training a target prediction model of the t frame image by using the position and the size of the tracking target determined by the t-1 frame image;
step 2.1, firstly, based on the position and the size of the tracking target determined by the t-1 frame image, taking the position of the tracking target as the center and the size of the tracking target as the size of a cell, and selecting a candidate area containing more than 2 cells on the t-1 frame image; then, carrying out sample cyclic shift on the candidate region to obtain a training sample set for training a traditional characteristic correlation filter model and a depth characteristic correlation filter model;
step 2.2, firstly, respectively extracting the HOG characteristic (gradient characteristic), the CN characteristic (color-name characteristic) and the COLOR characteristic (color histogram characteristic) of each training sample in the training sample set, and performing vector superposition on the HOG, CN and COLOR characteristics extracted from each training sample to obtain its multi-dimensional fusion characteristic; then, taking all the multi-dimensional fusion features as training samples, and training a traditional feature correlation filter model by using a ridge regression algorithm;
step 2.3, firstly, a training sample set is sent to a convolution network of deep learning to extract CNN characteristics (deep characteristics); then, combining and screening the extracted CNN characteristics and training samples in the training sample set by using a GMM model; finally, the merged and screened CNN features are used as training samples, and a ridge regression algorithm is used for training a depth feature correlation filter model;
step 2.4, taking the traditional characteristic correlation filter model obtained in the step 2.2 and the depth characteristic correlation filter model obtained in the step 2.3 as the target prediction model of the t-th frame image;
step 3, determining the position and the size of a tracking target on the t frame image by using a target prediction model of the t frame image;
3.1, based on the position and the size of the tracking target determined by the t-1 th frame image, taking the corresponding position of the tracking target as the center and the corresponding size of the tracking target as the size of a cell, and selecting a target search area containing more than 2 cells on the t-th frame image;
step 3.2, the target search area is sent into a traditional feature correlation filter model of a target prediction model of the t frame image, and traditional feature fusion response values of each cell of the target search area of the t frame image are predicted;
step 3.3, the target search area is sent into a depth feature correlation filter model of a target prediction model of the t frame image, and depth feature response values of each cell of the target search area of the t frame image are predicted;
step 3.4, respectively carrying out weighted fusion on the traditional characteristic fusion response value of each cell obtained in the step 3.2 and the depth characteristic response value of each cell obtained in the step 3.3 to obtain a target response value of each cell, and regarding the cell with the maximum target response value as the position of the tracking target on the t-th frame image;
step 3.5, with the position of the tracked target tracked in the step 3.4 as the center, constructing a size pyramid by scaling according to the proportion to predict the size of the tracked target on the t frame image;
step 4, repeating the steps 2 and 3 to realize target tracking of all frames of the tracking video;
t is 2,3, ….
In the step 2.1, the candidate area is obtained by scaling, by a preset multiple, a cross-shaped area formed by the tracking target together with 4 areas of the same size located above, below, to the left and to the right of it.
In the step 3.1, the target search area is obtained in the same way: a cross-shaped area formed by the tracking target together with 4 areas of the same size located above, below, to the left and to the right of it, scaled by a preset multiple.
The specific process of the step 3.2 is as follows:
step 3.2.1, respectively extracting HOG characteristics, CN characteristics and COLOR characteristics of each cell in the target search area of the t-th frame image, and performing vector addition on the extracted HOG characteristics, CN characteristics and COLOR characteristics of each cell to form a traditional characteristic sample;
and 3.2.2, sending the tracking target determined by the traditional characteristic sample and the t-1 frame image into a traditional characteristic correlation filter model of a target prediction model of the t-frame image to obtain a traditional characteristic fusion response value of each cell of the target search area.
The specific process of the step 3.3 is as follows:
step 3.3.1, respectively sending each cell of the target search area into a deep learning convolution network to extract CNN characteristics, and finally selecting outputs of three layers of conv3, conv4 and conv5 as characteristic samples of CNN;
and 3.3.2, sending the tracking target determined by the CNN characteristic sample and the t-1 frame image into a depth characteristic correlation filter model of a target prediction model of the t-frame image to obtain depth characteristic response values of each cell of a target search area.
As an improvement, the correlation filtering tracking method based on spatio-temporal perception and multimodal response further includes the following steps:
step 3.6, after the position and the size of the tracking target of the t-th frame image are obtained, multi-peak target detection is used to judge whether the determined tracking target is occluded: if no occlusion exists, the obtained position and size of the tracking target of the t-th frame image are output directly; otherwise, the obtained position and size of the tracking target of the t-th frame image are discarded.
Compared with the prior art, the invention has the following characteristics:
1. The invention uses the context-related spatio-temporal area formed by the four neighbouring blocks of the target, so that a large number of samples are available and tracking is improved.
2. Fusing traditional features with depth features, together with adaptive weighting of their response values, detects the target more accurately; assigning weights to the two models adaptively through normalization of the feature training losses better guarantees model accuracy.
3. Merging samples with a GMM modelling method during deep-sample extraction and storage preserves the diversity of the sample information to a great extent and yields stronger robustness.
4. Applying a scale pool once the target position is obtained further determines the true size of the target.
5. Multi-peak analysis of the target response, with the multi-peak response mapped to one dimension, allows occlusion to be judged and erroneous model updates caused by occlusion to be better avoided.
6. Updating with a new learning-rate formula keeps the model updates in a more stable state.
Drawings
FIG. 1 is a flow chart of a correlation filtering tracking method based on spatiotemporal perception and multimodal response.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to specific examples.
A correlation filtering tracking method based on spatio-temporal perception and multimodal response is disclosed, as shown in FIG. 1, which specifically includes the following steps:
step 1, determining the position and the size of a tracking target on a first frame image of a tracking video.
And 2, training a target prediction model of the t frame image by using the position and the size of the tracking target determined by the t-1 frame image. t is 2,3, ….
Step 2.1, firstly, based on the position and the size of the tracking target determined by the t-1 frame image, taking the position of the tracking target as the center and the size of the tracking target as the size of a cell, and selecting a candidate area containing more than 2 cells on the t-1 frame image. And then, carrying out sample cyclic shift on the candidate region to obtain a training sample set for training a traditional feature correlation filter model and a depth feature correlation filter model.
The candidate area is set according to design requirements. It can be a circular area of a preset radius centred on the position of the tracking target; or a cross-shaped area, centred on the position of the tracking target, formed by 4 areas of the same size as the tracking target on its upper, lower, left and right sides; or such a cross-shaped area further scaled by a preset multiple (e.g., 2 or 2.5 times). In the invention, the candidate area is the cross-shaped area centred on the tracking target and formed by the 4 equal-sized areas on its upper, lower, left and right sides, scaled by a preset multiple. In this embodiment, the specified tracking target is first extended by four equal-sized areas (upper, lower, left and right) to form a target area, and the target area is then padded by 1.5 times, i.e., enlarged 2.5 times, to form the candidate area.
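As an illustrative aid only (not part of the patent text), the following Python sketch shows one way the cross-shaped candidate area and the cyclically shifted training samples described above could be assembled; the function names, border handling and the coarse shift grid are assumptions, while the 2.5-times enlargement follows the embodiment.

```python
import numpy as np

def crop_patch(frame, cx, cy, pw, ph):
    """Crop a pw x ph patch centred at (cx, cy), clamping to the image border."""
    x0 = int(round(cx - pw / 2)); y0 = int(round(cy - ph / 2))
    x0 = max(0, min(x0, frame.shape[1] - pw))
    y0 = max(0, min(y0, frame.shape[0] - ph))
    return frame[y0:y0 + ph, x0:x0 + pw]

def candidate_area(frame, cx, cy, w, h, padding=2.5):
    """Target cell plus its 4 equal-sized neighbours (a cross), enlarged `padding` times."""
    centre = crop_patch(frame, cx, cy, w, h)
    neighbours = [crop_patch(frame, cx + dx * w, cy + dy * h, w, h)
                  for dx, dy in [(0, -1), (0, 1), (-1, 0), (1, 0)]]   # up, down, left, right
    search = crop_patch(frame, cx, cy, int(w * padding), int(h * padding))
    return centre, neighbours, search

def cyclic_shift_samples(patch, step=8):
    """Cyclic shifts of `patch` on a coarse grid - the implicit circulant sample set."""
    h, w = patch.shape[:2]
    return [np.roll(np.roll(patch, dy, axis=0), dx, axis=1)
            for dy in range(0, h, step) for dx in range(0, w, step)]
```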
And 2.2, firstly, respectively extracting the HOG characteristic, the CN characteristic and the COLOR characteristic of each training sample in the training sample set, and carrying out vector superposition on the HOG characteristic, the CN characteristic and the COLOR characteristic extracted by the training sample to obtain the multi-dimensional fusion characteristic of each training sample. Then, all the multi-dimensional fusion features are used as training samples, and a ridge regression algorithm is used for training a traditional feature correlation filter model.
In this embodiment, for each training sample in the training sample set, first, 31-dimensional HOG features, 11-dimensional CN features, and 1-dimensional COLOR features of the sample are extracted, and during calculation, the 11-dimensional CN features are reduced to 2-dimensional by using a PCA algorithm, and then, the 31-dimensional HOG features, the 2-dimensional CN features, and the 1-dimensional gray COLOR features are vector-superimposed by using a multi-channel technique to obtain 34-dimensional conventional fusion features.
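The 34-dimensional fusion described above (31-D HOG, 2-D PCA-reduced CN and 1-D grey) can be illustrated with the minimal sketch below; it assumes per-cell feature maps of shape H x W x C are already available, and the PCA here is fitted on the current patch only, which is a simplification.

```python
import numpy as np

def pca_reduce(cn_feats, out_dim=2):
    """Reduce per-cell 11-D colour-name features to `out_dim` dimensions with PCA."""
    flat = cn_feats.reshape(-1, cn_feats.shape[-1])
    flat = flat - flat.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    reduced = flat @ vt[:out_dim].T
    return reduced.reshape(*cn_feats.shape[:-1], out_dim)

def fuse_traditional_features(hog31, cn11, gray1):
    """Stack 31-D HOG, PCA-reduced 2-D CN and 1-D grey features channel-wise (34-D)."""
    cn2 = pca_reduce(cn11, out_dim=2)
    return np.concatenate([hog31, cn2, gray1[..., None]], axis=-1)
```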
1) Extracting HOG features
Firstly, the image is converted to grayscale, and the input image is normalized in colour space with a Gamma correction method; this adjusts the contrast of the image, reduces the influence of local shadows and illumination changes, and also suppresses noise interference. Then the gradient of each pixel of the target image is computed, which mainly captures basic contour information and further reduces the influence of illumination. Finally, the image is divided into small cell units of 4 × 4 pixels and a gradient histogram is built for each cell unit; every 3 × 3 cell units form a block, and the feature vectors of the cell units in a block are concatenated to obtain the gradient histogram of the block. Likewise, the gradient histograms of all blocks in the target image are concatenated to form the histogram of oriented gradients of the whole image, i.e., the concatenated feature vector used by the discriminative classifier.
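For illustration only, a minimal HOG extraction with the cell and block sizes mentioned above could look like the sketch below, using scikit-image; note that correlation-filter trackers usually use the 31-dimensional fHOG variant, which differs in detail from scikit-image's descriptor, and the gamma value here is an assumption.

```python
import numpy as np
from skimage import color, exposure
from skimage.feature import hog

def extract_hog(image_rgb):
    """Grey + gamma-normalise the input, then build an orientation-histogram descriptor."""
    gray = color.rgb2gray(image_rgb)
    gray = exposure.adjust_gamma(gray, gamma=0.5)   # Gamma correction step
    return hog(gray,
               orientations=9,
               pixels_per_cell=(4, 4),              # 4 x 4 pixel cell units
               cells_per_block=(3, 3),              # 3 x 3 cells per block
               block_norm='L2-Hys',
               feature_vector=True)
```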
2) Extracting CN features
The CN feature adapts effectively to photometric (illumination) variation by mapping RGB colour values to an 11-dimensional colour-name probability representation: each RGB value is used to index the corresponding entry of the discrete 11-dimensional colour-name space. Because computing with the full 11-dimensional representation is expensive, it is reduced to 2 dimensions, which still describes the colour names well.
3) Extracting COLOR features
The COLOR feature is represented by a COLOR histogram, which is a statistic of the COLOR distribution on the surface of the moving object and is not affected by changes in the shape, posture and the like of the object. Therefore, the histogram is used as the characteristic of the target, matching is carried out according to color distribution, and the method has the characteristics of good stability, partial occlusion resistance, simple calculation method and small calculation amount, and is an ideal target color characteristic.
4) Performing vector superposition calculation
The D1-dimensional local context feature x_p of the current frame is reduced to a D2-dimensional feature x_p(m, n), and the multi-channel fused kernel correlation is computed as

$$k^{xx'} = \exp\left(-\frac{1}{\sigma^{2}}\Big(\|x\|^{2}+\|x'\|^{2}-2\,\mathcal{F}^{-1}\big(\textstyle\sum_{c}\hat{x}_{c}^{*}\odot\hat{x}'_{c}\big)\Big)\right)$$

where k is the Gaussian kernel function, x_c is the fusion feature of the c-th channel after vector superposition of the HOG, CN and COLOR features, ⊙ denotes the element-wise (dot) product, σ is the bandwidth of the Gaussian kernel function and F^{-1} is the inverse Fourier transform. After the features are fused, a weight w_p is computed for every pixel P_i inside the target box according to this feature; this spatial distance weight decays with the distance from the target centre, equals 1 at the centre and approaches 0 as the distance from the centre grows.
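A compact numerical sketch of the multi-channel Gaussian kernel correlation above is given here for illustration; the normalisation by the feature size follows common KCF implementations and the bandwidth value is an assumption.

```python
import numpy as np

def gaussian_kernel_correlation(x, xp, sigma=0.5):
    """Multi-channel Gaussian kernel correlation k(x, x') for all cyclic shifts.

    x, xp: H x W x C feature maps; the channel-summed cross-correlation is
    computed in the Fourier domain, then mapped through the Gaussian kernel.
    """
    xf = np.fft.fft2(x, axes=(0, 1))
    xpf = np.fft.fft2(xp, axes=(0, 1))
    cross = np.fft.ifft2(np.sum(np.conj(xf) * xpf, axis=2), axes=(0, 1)).real
    dist = np.sum(x ** 2) + np.sum(xp ** 2) - 2.0 * cross
    return np.exp(-np.maximum(dist, 0) / (sigma ** 2 * x.size))
```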
5) Training the traditional feature correlation filter model with a ridge regression algorithm on the multi-dimensional fusion features of all the training samples.
After these computations inside the target area, the correlation filter can be trained on the extracted HOG, CN and COLOR features. The training objective takes the context-aware form

$$\min_{w}\ \big\|X_{0}w-y\big\|^{2}+\lambda_{1}\|w\|^{2}+\lambda_{2}\sum_{m=1}^{4}\big\|X_{m}w\big\|^{2},$$

i.e., compared with training without the upper, lower, left and right context blocks, penalty terms on the background regions are added, so that the response of the template w to be trained is minimized when it is correlated with the background. Here f(x_i) = w^T x_i denotes the output response for the target of the i-th frame; X_0 is the circulant matrix built from x_i ⊙ w_p, the product of the samples in the search area of the true target and the weight of each sample relative to the centre of the selected target; y_i is the expected (Gaussian-shaped) output response for the target of the i-th frame; λ_1 and λ_2 are regularization factors; α_i denotes the classifier parameters of the i-th frame in the dual space; w represents the appearance model (template) of the object; μ_i and σ_i² are the expectation and variance of the Gaussian distribution of the output response of the i-th frame target; and X_m, m = 1, …, 4, are the circulant matrices of the four candidate target search windows above, below, to the left and to the right of the template region. The classifier parameters are trained from the target features x_i of the large number of training samples of the i-th frame. The problem is then converted into the form of solving for α: a Gaussian kernel is applied to compute the self kernel-correlation of the target features,

$$k^{xx}=\exp\!\left(-\frac{1}{\sigma^{2}}\Big(2\|x\|^{2}-2\,\mathcal{F}^{-1}\big(\textstyle\sum_{c}\hat{x}_{c}^{*}\odot\hat{x}_{c}\big)\Big)\right),$$

and the dual solution in the Fourier domain is obtained as

$$\hat{\alpha}_{i}=\frac{\hat{y}}{\hat{k}^{x_{0}x_{0}}+\lambda_{1}+\lambda_{2}\sum_{m=1}^{4}\hat{k}^{x_{m}x_{m}}},$$

where \hat{x}^{*} denotes the complex conjugate of the Fourier transform of x. α_i is then obtained by the inverse FFT, which completes the training of the classifier parameters of the i-th frame.
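Purely as an illustration of the context-aware training just described, the sketch below solves the dual variables in the Fourier domain for the per-channel linear-kernel case; the regularization values and the linear-kernel simplification (instead of the Gaussian kernel in the text) are assumptions.

```python
import numpy as np

def train_context_aware_filter(target, contexts, y, lam1=1e-4, lam2=25.0):
    """Closed-form context-aware filter (linear-kernel sketch) in the Fourier domain.

    target:   H x W x C features of the target cell
    contexts: list of H x W x C features of the up/down/left/right context cells
    y:        H x W desired Gaussian response
    Returns the Fourier-domain dual variables alpha_hat.
    """
    yf = np.fft.fft2(y)
    tf = np.fft.fft2(target, axes=(0, 1))
    den = np.sum(np.conj(tf) * tf, axis=2) + lam1          # target self-correlation
    for c in contexts:                                      # context penalty terms
        cf = np.fft.fft2(c, axes=(0, 1))
        den = den + lam2 * np.sum(np.conj(cf) * cf, axis=2)
    return yf / den
```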
And 2.3, firstly, sending the training sample set into a deep learning convolution network to extract CNN characteristic samples. The extracted CNN feature samples are then merged and screened with the samples in the sample set using the GMM model. And finally, taking the combined and screened CNN features as training samples, and training a depth feature correlation filter model by using a ridge regression algorithm.
Traditional features are all extracted at a single resolution and mainly capture the appearance of the target, so they are strongly affected by target deformation, whereas deep CNN features mainly capture semantic information and are highly robust to appearance deformation. A deep layer is therefore selected and data-enhancement processing is applied. The existing VGG-19 network is chosen as the deep network: the sample is fed into the network for convolution, and the outputs of the conv3-4, conv4-4 and conv5-4 layers are taken as the CNN feature samples of the object under consideration. The results of these three layers are then used as CNN feature samples to train the depth filters corresponding to the three layers. Finally, the three depth filters are correlated with the image and their outputs are fused into the final confidence map. If the tracking result is judged correct, the sample is saved; this procedure is repeated for the following frames while samples keep being stored. To preserve sample diversity, GMM modelling is first applied to the samples to be stored and highly similar samples are merged; when the number of stored samples exceeds a certain limit, the sample with the smallest weight is deleted.
Deep networks are computationally complex, i.e., the amount of data to be computed is large, which lowers the processing speed and hinders practical use. To address this, the method extracts only three layers of deep feature samples and uses a GMM sample-modelling strategy, which greatly reduces the impact of the deep network on computational efficiency.
Because traditional features have a single resolution, CNN feature sampling is also performed on the target area: a deep convolution network is used to extract the convolution features of the initial target and of the context background information. Shallow convolution layers contain more position information, while deep convolution layers contain more semantic information that identifies the appearance details of the target, so the trained VGG-19 model is used to extract the conv3-4, conv4-4 and conv5-4 features. The features extracted from these three convolution layers are each trained with a correlation filter, yielding different convolution templates, and the extracted CNN features are used to train a correlation filter model f_cnn. In addition, samples are stored at every frame while the deep network is trained, with the maximum number of stored samples set to 400. As time passes, however, computing with a large number of samples becomes slow, and because many stored samples are similar they become redundant, which makes later model training poor: environmental influences easily cause overfitting and the target is lost. The GMM is modelled as:
$$p(x)=\sum_{m=1}^{L}\pi_{m}\,\mathcal{N}(x;\,\mu_{m},\,I),$$

where L is the number of components; the original M samples are reduced to L components, with M = 400 and L = 50. The update proceeds as follows: when a new sample x_j does not update an existing component, a new component m is initialized with π_m = γ and μ_m = x_j. If the number of components then exceeds L, the component with the smallest weight is discarded; otherwise the two closest components k and l are merged into a new component n:

$$\pi_{n}=\pi_{k}+\pi_{l},\qquad \mu_{n}=\frac{\pi_{k}\mu_{k}+\pi_{l}\mu_{l}}{\pi_{k}+\pi_{l}}.$$

When the filter is computed, the input samples are the averages represented by the L components.
The training sample set obtained by cyclic shifting is fed into the deep convolution network to extract CNN features; the outputs of the three convolution layers are taken as target feature samples and fed into ridge regression to train the filter models corresponding to the three layers. Training the depth filter needs a large number of samples, so the results of frames that were tracked successfully must be taken into account: each successfully tracked result is stored, the candidate sample set is given a fixed capacity, and the successful samples are added to the set to be trained. Since adjacent frames differ little, directly adding samples from similar frames cannot guarantee sample diversity and easily causes overfitting. Therefore, when a successfully tracked result is added to the fixed-size training sample set, the GMM model is first used to judge the similarity between the sample to be added and the samples already stored: if it is not similar, it is added directly as a new sample; at the same time a weight is assigned to the candidate sample according to its similarity after correlation with the current sample. Subsequent frames are processed in turn, and, taking the capacity of the sample set into account, when the set is already full at the time a new sample is added, the sample with the smallest weight is removed.
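A simplified sketch of the GMM-style sample space described above is given below for illustration; the similarity measure, the merge threshold and the learning rate γ are assumptions, and merging is reduced to folding a new sample into its closest component.

```python
import numpy as np

class GMMSampleSpace:
    """Keeps at most L sample components; a new sample either merges into the
    closest component or becomes a new one, and the weakest component is
    dropped when capacity is exceeded (a simplified sketch of the strategy)."""

    def __init__(self, max_components=50, learning_rate=0.01, merge_thresh=0.1):
        self.L, self.gamma, self.merge_thresh = max_components, learning_rate, merge_thresh
        self.weights, self.means = [], []

    def add(self, sample):
        flat = sample.ravel().astype(np.float64)
        if self.means:
            dists = [np.linalg.norm(flat - m) / (np.linalg.norm(m) + 1e-12)
                     for m in self.means]
            k = int(np.argmin(dists))
            if dists[k] < self.merge_thresh:        # similar: merge into component k
                wk = self.weights[k]
                self.means[k] = (wk * self.means[k] + self.gamma * flat) / (wk + self.gamma)
                self.weights[k] = wk + self.gamma
                return
        self.weights.append(self.gamma)             # otherwise start a new component
        self.means.append(flat)
        if len(self.means) > self.L:                 # over capacity: drop weakest
            j = int(np.argmin(self.weights))
            self.weights.pop(j); self.means.pop(j)

    def training_set(self):
        w = np.asarray(self.weights)
        return w / w.sum(), np.stack(self.means)
```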
And 2.4, taking the traditional characteristic correlation filter model obtained in the step 2.2 and the depth characteristic correlation filter model obtained in the step 2.3 as the target prediction model of the t-th frame image.
Because the HOG gradient feature, the CN feature and the COLOR feature cope well with rotation, translation, illumination change and partial occlusion, they are integrated in the invention to comprehensively handle interference on the target in the image caused by these situations. Because the deep CNN feature comprehensively represents the image at multiple resolutions and is therefore more robust to appearance changes of the target, fusing the traditional features with the CNN features allows the constructed filter to track more accurately.
And 3, determining the position and the size of the tracking target on the t frame image by using the target prediction model of the t frame image. t is 2,3, ….
And 3.1, based on the position and the size of the tracking target determined by the t-1 th frame image, taking the corresponding position of the tracking target as the center and the corresponding size of the tracking target as the size of a cell, and selecting a target search area containing more than 2 cells on the t-th frame image.
The target search area is set according to design requirements, and can be a circular area surrounded by a preset radius by taking the position of a tracking target as a center; or a cross-shaped area formed by 4 areas with the same size as the tracking target on the upper, lower, left and right sides of the position of the tracking target by taking the position of the tracking target as a center; or a cross-shaped area formed by 4 areas with the same size as the tracking target on the upper, lower, left, and right sides of the position of the tracking target, and obtained by scaling the cross-shaped area by a preset multiple (e.g., 2 times or 2.5 times). In the invention, the target search area is a cross-shaped area formed by 4 areas with the same size as the tracking target on the upper, lower, left and right sides of the position of the tracking target by taking the position of the tracking target as the center, and the cross-shaped area is obtained by zooming the cross-shaped area by preset times. In this embodiment, four regions of the same size, i.e., the upper, lower, left, and right blocks, are selected as cells centered on the prediction target position of the previous frame on the t-frame image, and the cells are enlarged by 2.5 times to be used as the target search region.
And 3.2, sending the target search area into a traditional feature correlation filter model of the target prediction model of the t frame image, and predicting traditional feature fusion response values of each cell of the target search area of the t frame image.
Step 3.2.1, respectively extracting HOG characteristics, CN characteristics and COLOR characteristics of each cell in the target search area of the t-th frame image, and performing vector superposition on the extracted HOG, CN and COLOR characteristics of each cell to obtain a 34-dimensional traditional characteristic sample (the same dimensionality as in step 2.2);
and 3.2.2, sending the traditional feature samples of the cells and the tracking target determined by the t-1 th frame image into a traditional feature correlation filter model of a target prediction model of the t-th frame image to obtain a traditional feature fusion response value of each cell of the target search area.
And 3.3, sending the target search area into a depth feature correlation filter model of a target prediction model of the t frame image, and predicting depth feature response values of each cell of the target search area of the t frame image.
And 3.3.1, respectively sending each cell of the target search area into a deep learning convolution network to extract the CNN characteristics, and finally selecting outputs of three layers of conv3, conv4 and conv5 as characteristic samples of the CNN.
And 3.3.2, sending the CNN characteristic samples of all the cells and the tracking target determined by the t-1 frame image into a current depth characteristic correlation filter model to obtain depth characteristic response values of all the cells.
And 3.4, respectively carrying out weighted fusion on the traditional characteristic fusion response value of each cell obtained in the step 3.2 and the depth characteristic response value of each cell obtained in the step 3.3 to obtain a target response value of each cell, and regarding the cell with the maximum target response value as the position of the tracking target on the t-th frame image.
In the training of the t-th frame for the two kinds of features, the training loss of each feature is computed as

$$L_{f}^{t}=\operatorname{sum}\big((R_{f}^{t}-y)^{2}\big),\qquad f\in F=\{\mathrm{trad},\,\mathrm{cnn}\},$$

where sum(·) sums every term of the matrix, R_f^t is the training response of feature f, and F = {trad, cnn} denotes the set of the traditional and the deep features. The normalized weight corresponding to feature f is

$$w_{f}=\frac{L_{F-f}^{t}}{L_{\mathrm{trad}}^{t}+L_{\mathrm{cnn}}^{t}},$$

where F − f denotes the feature of F other than f, so the feature with the smaller training loss receives the larger weight. The original feature weights are then updated for the tracking of the next frame:

$$w_{f}^{t}=(1-\tau)\,w_{f}^{t-1}+\tau\,w_{f},$$

where τ is the update coefficient with an initial value of 0.2, and the trad and cnn feature weights of the first frame are both initialized to 0.5. In the detection stage of the (t+1)-th frame, the response maps obtained with the trad and cnn feature filters are denoted R_trad^{t+1} and R_cnn^{t+1}, respectively, and they are fused by the weighting

$$R^{t+1}=w_{\mathrm{trad}}^{t}\,R_{\mathrm{trad}}^{t+1}+w_{\mathrm{cnn}}^{t}\,R_{\mathrm{cnn}}^{t+1},$$

where w_trad^t is the weight of the traditional feature and w_cnn^t is the depth feature weight.
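The weighted response fusion and the loss-based weight update can be illustrated with the short sketch below; the cross-normalisation of the two losses and the small epsilon are assumptions consistent with the description (a smaller training loss leads to a larger weight), while τ = 0.2 follows the text.

```python
import numpy as np

def fuse_responses(resp_trad, resp_cnn, w_trad, w_cnn):
    """Weighted fusion of the traditional-feature and deep-feature response maps."""
    fused = w_trad * resp_trad + w_cnn * resp_cnn
    peak = np.unravel_index(np.argmax(fused), fused.shape)   # predicted target cell
    return fused, peak

def update_feature_weights(loss_trad, loss_cnn, w_trad, w_cnn, tau=0.2):
    """Give the feature with the smaller training loss the larger weight,
    then blend with the previous weights using the update coefficient tau."""
    total = loss_trad + loss_cnn + 1e-12
    new_trad, new_cnn = loss_cnn / total, loss_trad / total  # cross-normalisation
    w_trad = (1 - tau) * w_trad + tau * new_trad
    w_cnn = (1 - tau) * w_cnn + tau * new_cnn
    return w_trad, w_cnn
```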
And 3.5, with the position of the tracked target tracked in the step 3.4 as the center, constructing a size pyramid according to the scaling and predicting to obtain the size of the tracked target on the t frame image.
A scale filter is added to predict the scale change of the target at the obtained position, and a scale pool is sampled at the previously obtained tracking position. Given the detected target position P_t and the target scale S_{t-1} = w_{t-1} × h_{t-1} detected in the previous frame, scale candidate regions centred at P_t are extracted as

$$a^{n}w_{t-1}\times a^{n}h_{t-1},\qquad n\in\left\{-\frac{S-1}{2},\ldots,\frac{S-1}{2}\right\},$$

which constructs the scale pyramid, where w_{t-1} and h_{t-1} are the width and height of the target in the previous frame, a is the scale factor and S is the total number of scale levels. The target samples obtained at the different scales are then uniformly resized to w × h, features are extracted from the target areas at each scale, and correlation with a one-dimensional scale correlation filter yields a scale response map; the position of the maximum of this response map gives the optimal scale S_t of the corresponding template.
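For illustration, a DSST-style scale pool as described above could be sampled like this; the scale factor a = 1.02 and the 33 scale levels are assumptions (the patent leaves them unspecified), OpenCV is assumed for resizing, and `score_fn` stands in for the one-dimensional scale correlation filter.

```python
import numpy as np
import cv2

def scale_pyramid_sizes(w_prev, h_prev, a=1.02, num_scales=33):
    """Candidate sizes a^n * (w, h) for n = -(S-1)/2 ... (S-1)/2."""
    exps = np.arange(num_scales) - (num_scales - 1) / 2.0
    return [(int(round(w_prev * a ** n)), int(round(h_prev * a ** n))) for n in exps]

def best_scale(frame, cx, cy, w_prev, h_prev, score_fn, a=1.02, num_scales=33):
    """Crop each candidate size around (cx, cy), resize to the template size and
    let `score_fn` (the scale correlation filter) score it; return the best size."""
    sizes, scores = scale_pyramid_sizes(w_prev, h_prev, a, num_scales), []
    for (w, h) in sizes:
        x0, y0 = max(0, int(cx - w // 2)), max(0, int(cy - h // 2))
        patch = frame[y0:y0 + h, x0:x0 + w]
        patch = cv2.resize(patch, (int(w_prev), int(h_prev)))
        scores.append(score_fn(patch))
    return sizes[int(np.argmax(scores))]
```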
Step 3.6, after the position and the size of the tracking target of the t-th frame image are obtained, multi-peak target detection is used to judge whether the determined tracking target is occluded: if no occlusion exists, the obtained position and size of the tracking target of the t-th frame image are output directly. Otherwise, the obtained position and size of the tracking target of the t-th frame image are discarded.
And 4, repeating the steps 2 and 3 to realize target tracking of all frames of the tracking video.
Because occlusion of the target often causes tracking failure, the multi-peak target detection adopted in the invention detects well whether the target is occluded or interfered with, so that the filter is updated in time with purer samples and remains usable over long periods. How the multi-peak detection arises: when the filter is applied to the samples in the target search area, it produces a response value centred at the position of the previous frame, and it also produces response results for the samples generated by cyclically shifting the samples in the search area, i.e., many response results are generated. If the target search area contains features similar to the target, several peaks may appear in its response map. Directly taking the highest peak as the target area may then be inaccurate: a similar object or an occluded object may be selected, a correct sample is then unavailable for the later update, and the model is corrupted by updating on wrong samples. Therefore, the multi-peak response confidences in the search area are projected onto one dimension to obtain a peak group Peak = (peak_1, peak_2, …, peak_n), and the height of each peak is computed. Two thresholds are set: a peak-height threshold threshold1 and a peak-number threshold threshold2. First, the peaks in the group whose height exceeds threshold1 are selected as the target peak group; if the number of peaks in the target peak group is greater than threshold2, it is considered that interference from a similar object or occlusion has occurred, and a second detection is performed: a response value is recomputed with each peak position taken in turn as the image centre, and the target is finally defined as the highest point of all these responses.
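As a rough illustration of the multi-peak occlusion criterion described above, the sketch below projects the response map to one dimension and counts the peaks that exceed a height threshold; the column-wise maximum projection and the use of SciPy's find_peaks are assumptions, since the patent does not fix the projection.

```python
import numpy as np
from scipy.signal import find_peaks

def occlusion_suspected(response, height_thresh, count_thresh):
    """Project the 2-D response map to 1-D, count peaks above `height_thresh`
    and flag possible occlusion / similar-object interference when more than
    `count_thresh` peaks remain (a sketch of the multi-peak criterion)."""
    profile = response.max(axis=0)                    # one possible 1-D projection
    peaks, props = find_peaks(profile, height=height_thresh)
    return len(peaks) > count_thresh, peaks, props["peak_heights"]
```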
The invention tracks continuously and robustly under illumination changes and posture changes around the target, and avoids erroneous model updates caused by occlusion of the target or interference from similar objects, so that the model is always kept in a good state and the target tracking accuracy is high; meanwhile, real-time data can be processed rapidly during tracking, so the method can be applied in practical settings.
It should be noted that, although the above-mentioned embodiments of the present invention are illustrative, the present invention is not limited thereto, and thus the present invention is not limited to the above-mentioned embodiments. Other embodiments, which can be made by those skilled in the art in light of the teachings of the present invention, are considered to be within the scope of the present invention without departing from its principles.

Claims (6)

1. A correlation filtering tracking method based on space-time perception and multimodal response is characterized by comprising the following steps:
step 1, determining the position and the size of a tracking target on a first frame image of a tracking video;
step 2, training a target prediction model of the t frame image by using the position and the size of the tracking target determined by the t-1 frame image;
step 2.1, firstly, based on the position and the size of the tracking target determined by the t-1 frame image, taking the position of the tracking target as the center and the size of the tracking target as the size of a cell, and selecting a candidate area containing more than 2 cells on the t-1 frame image; then, carrying out sample cyclic shift on the candidate region to obtain a training sample set for training a traditional characteristic correlation filter model and a depth characteristic correlation filter model;
step 2.2, firstly, respectively extracting the HOG characteristic, the CN characteristic and the COLOR characteristic of each training sample in the training sample set, and carrying out vector superposition on the HOG characteristic, the CN characteristic and the COLOR characteristic extracted by the training sample to obtain the multi-dimensional fusion characteristic of each training sample; then, taking all the multi-dimensional fusion features as training samples, and training a traditional feature correlation filter model by using a ridge regression algorithm;
step 2.3, firstly, a training sample set is sent to a deep learning convolution network to extract CNN characteristics; then, combining and screening the extracted CNN characteristics and training samples in the training sample set by using a GMM model; finally, the merged and screened CNN features are used as training samples, and a ridge regression algorithm is used for training a depth feature correlation filter model;
step 2.4, taking the traditional characteristic correlation filter model obtained in the step 2.2 and the depth characteristic correlation filter model obtained in the step 2.3 as a target prediction model of the t frame image;
step 3, determining the position and the size of a tracking target on the t frame image by using a target prediction model of the t frame image;
3.1, based on the position and the size of the tracking target determined by the t-1 th frame image, taking the corresponding position of the tracking target as the center and the corresponding size of the tracking target as the size of a cell, and selecting a target search area containing more than 2 cells on the t-th frame image;
step 3.2, the target search area is sent into a traditional feature correlation filter model of a target prediction model of the t frame image, and traditional feature fusion response values of each cell of the target search area of the t frame image are predicted;
step 3.3, the target search area is sent into a depth feature correlation filter model of a target prediction model of the t frame image, and depth feature response values of each cell of the target search area of the t frame image are predicted;
step 3.4, respectively carrying out weighted fusion on the traditional characteristic fusion response value of each cell obtained in the step 3.2 and the depth characteristic response value of each cell obtained in the step 3.3 to obtain a target response value of each cell, and regarding the cell with the maximum target response value as the position of the tracking target on the t-th frame image;
step 3.5, with the position of the tracked target tracked in the step 3.4 as the center, constructing a size pyramid by scaling according to the proportion to predict the size of the tracked target on the t frame image;
step 4, repeating the steps 2 and 3 to realize target tracking of all frames of the tracking video;
t is 2,3, ….
2. The correlation filtering tracking method based on spatio-temporal perception and multimodal response as claimed in claim 1, wherein in step 2.1, the candidate region is a cross-shaped region formed by 4 regions with the same size as the tracked target, which are located above, below, to the left and to the right of the tracked target, and the cross-shaped region is scaled by a preset multiple.
3. The correlation filtering tracking method based on spatio-temporal perception and multimodal response as claimed in claim 1, wherein in step 3.1, the target search region is a cross-shaped region formed by 4 regions with the same size as the tracked target, which are located above, below, to the left and to the right of the tracked target, and the cross-shaped region is scaled by a preset multiple.
4. The correlation filtering tracking method based on spatio-temporal perception and multi-peak response as claimed in claim 1, wherein the specific process of step 3.2 is as follows:
step 3.2.1, respectively extracting HOG characteristics, CN characteristics and COLOR characteristics of each cell in the target search area of the t-th frame image, and performing vector addition on the extracted HOG characteristics, CN characteristics and COLOR characteristics of each cell to form a traditional characteristic sample;
and 3.2.2, sending the tracking target determined by the traditional characteristic sample and the t-1 frame image into a traditional characteristic correlation filter model of a target prediction model of the t-frame image to obtain a traditional characteristic fusion response value of each cell of the target search area.
5. The correlation filtering tracking method based on spatio-temporal perception and multi-peak response as claimed in claim 1, wherein the specific process of step 3.3 is as follows:
step 3.3.1, respectively sending each cell of the target search area into a deep learning convolution network to extract CNN characteristics, and finally selecting the output of the 3 rd to 5 th layers of convolution layers as CNN characteristic samples;
and 3.3.2, sending the tracking target determined by the CNN characteristic sample and the t-1 frame image into a depth characteristic correlation filter model of a target prediction model of the t-frame image to obtain depth characteristic response values of each cell of a target search area.
6. The correlation filtering tracking method based on spatio-temporal perception and multi-peak response as claimed in claim 1, further comprising the steps of:
step 3.6, after the position and the size of the tracking target of the t-th frame image are obtained, whether the determined tracking target is shielded or not needs to be judged by multi-peak target detection: if the occlusion does not exist, directly outputting the position and the size of the tracking target of the obtained t frame image; otherwise, the position and the size of the tracking target of the obtained t-th frame image are discarded.
CN202011323988.7A 2020-11-23 2020-11-23 Correlation filtering tracking method based on space-time perception and multimodal response Pending CN112329784A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011323988.7A CN112329784A (en) 2020-11-23 2020-11-23 Correlation filtering tracking method based on space-time perception and multimodal response

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011323988.7A CN112329784A (en) 2020-11-23 2020-11-23 Correlation filtering tracking method based on space-time perception and multimodal response

Publications (1)

Publication Number Publication Date
CN112329784A true CN112329784A (en) 2021-02-05

Family

ID=74321099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011323988.7A Pending CN112329784A (en) 2020-11-23 2020-11-23 Correlation filtering tracking method based on space-time perception and multimodal response

Country Status (1)

Country Link
CN (1) CN112329784A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177970A (en) * 2021-04-29 2021-07-27 燕山大学 Multi-scale filtering target tracking method based on self-adaptive feature fusion
CN113222060A (en) * 2021-05-31 2021-08-06 四川轻化工大学 Visual tracking method based on convolution feature and manual feature integration
CN113269809A (en) * 2021-05-07 2021-08-17 桂林电子科技大学 Multi-feature fusion related filtering target tracking method and computer equipment
CN113538509A (en) * 2021-06-02 2021-10-22 天津大学 Visual tracking method and device based on adaptive correlation filtering feature fusion learning

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570486A (en) * 2016-11-09 2017-04-19 华南理工大学 Kernel correlation filtering target tracking method based on feature fusion and Bayesian classification
CN107578423A (en) * 2017-09-15 2018-01-12 杭州电子科技大学 The correlation filtering robust tracking method of multiple features hierarchical fusion
CN107748873A (en) * 2017-10-31 2018-03-02 河北工业大学 A kind of multimodal method for tracking target for merging background information
CN108053425A (en) * 2017-12-25 2018-05-18 北京航空航天大学 A kind of high speed correlation filtering method for tracking target based on multi-channel feature
CN108734723A (en) * 2018-05-11 2018-11-02 江南大学 A kind of correlation filtering method for tracking target based on adaptive weighting combination learning
CN109816693A (en) * 2019-01-28 2019-05-28 中国地质大学(武汉) Anti- based on multimodal response blocks correlation filtering tracking and systems/devices
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN110110160A (en) * 2017-12-29 2019-08-09 阿里巴巴集团控股有限公司 Determine the method and device of data exception
CN111105436A (en) * 2018-10-26 2020-05-05 曜科智能科技(上海)有限公司 Target tracking method, computer device, and storage medium
CN111260738A (en) * 2020-01-08 2020-06-09 天津大学 Multi-scale target tracking method based on relevant filtering and self-adaptive feature fusion
CN111612817A (en) * 2020-05-07 2020-09-01 桂林电子科技大学 Target tracking method based on depth feature adaptive fusion and context information
CN111931722A (en) * 2020-09-23 2020-11-13 杭州视语智能视觉系统技术有限公司 Correlated filtering tracking method combining color ratio characteristics

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570486A (en) * 2016-11-09 2017-04-19 华南理工大学 Kernel correlation filtering target tracking method based on feature fusion and Bayesian classification
CN107578423A (en) * 2017-09-15 2018-01-12 杭州电子科技大学 The correlation filtering robust tracking method of multiple features hierarchical fusion
CN107748873A (en) * 2017-10-31 2018-03-02 河北工业大学 A kind of multimodal method for tracking target for merging background information
CN108053425A (en) * 2017-12-25 2018-05-18 北京航空航天大学 A kind of high speed correlation filtering method for tracking target based on multi-channel feature
CN110110160A (en) * 2017-12-29 2019-08-09 阿里巴巴集团控股有限公司 Determine the method and device of data exception
CN108734723A (en) * 2018-05-11 2018-11-02 江南大学 A kind of correlation filtering method for tracking target based on adaptive weighting combination learning
CN111105436A (en) * 2018-10-26 2020-05-05 曜科智能科技(上海)有限公司 Target tracking method, computer device, and storage medium
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN109816693A (en) * 2019-01-28 2019-05-28 中国地质大学(武汉) Anti- based on multimodal response blocks correlation filtering tracking and systems/devices
CN111260738A (en) * 2020-01-08 2020-06-09 天津大学 Multi-scale target tracking method based on relevant filtering and self-adaptive feature fusion
CN111612817A (en) * 2020-05-07 2020-09-01 桂林电子科技大学 Target tracking method based on depth feature adaptive fusion and context information
CN111931722A (en) * 2020-09-23 2020-11-13 杭州视语智能视觉系统技术有限公司 Correlated filtering tracking method combining color ratio characteristics

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MUELLER M et al.: "Context-aware correlation filter tracking", Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition *
LIU Wanjun et al.: "Multi-scale correlation filter tracking algorithm under occlusion discrimination", Journal of Image and Graphics *
WANG Chunping et al.: "Correlation filter target tracking combining feature fusion and adaptive model updating", Optics and Precision Engineering *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177970A (en) * 2021-04-29 2021-07-27 燕山大学 Multi-scale filtering target tracking method based on self-adaptive feature fusion
CN113269809A (en) * 2021-05-07 2021-08-17 桂林电子科技大学 Multi-feature fusion related filtering target tracking method and computer equipment
CN113269809B (en) * 2021-05-07 2022-06-21 桂林电子科技大学 Multi-feature fusion related filtering target tracking method and computer equipment
CN113222060A (en) * 2021-05-31 2021-08-06 四川轻化工大学 Visual tracking method based on convolution feature and manual feature integration
CN113538509A (en) * 2021-06-02 2021-10-22 天津大学 Visual tracking method and device based on adaptive correlation filtering feature fusion learning

Similar Documents

Publication Publication Date Title
CN106599883B (en) CNN-based multilayer image semantic face recognition method
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN107633226B (en) Human body motion tracking feature processing method
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN110175649B (en) Rapid multi-scale estimation target tracking method for re-detection
CN108154118A (en) A kind of target detection system and method based on adaptive combined filter with multistage detection
CN110097028B (en) Crowd abnormal event detection method based on three-dimensional pyramid image generation network
CN112837344B (en) Target tracking method for generating twin network based on condition countermeasure
CN107133496B (en) Gene feature extraction method based on manifold learning and closed-loop deep convolution double-network model
CN109033978B (en) Error correction strategy-based CNN-SVM hybrid model gesture recognition method
CN114241548A (en) Small target detection algorithm based on improved YOLOv5
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN111178208A (en) Pedestrian detection method, device and medium based on deep learning
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN106157330B (en) Visual tracking method based on target joint appearance model
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN111860587B (en) Detection method for small targets of pictures
CN114758288A (en) Power distribution network engineering safety control detection method and device
CN113327272B (en) Robustness long-time tracking method based on correlation filtering
CN111640138A (en) Target tracking method, device, equipment and storage medium
Song et al. Feature extraction and target recognition of moving image sequences
CN111242971B (en) Target tracking method based on improved double-center particle swarm optimization algorithm
CN112613565B (en) Anti-occlusion tracking method based on multi-feature fusion and adaptive learning rate updating
CN113763417B (en) Target tracking method based on twin network and residual error structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210205