CN112329784A

CN112329784A - Correlation filtering tracking method based on space-time perception and multimodal response

Info

Publication number: CN112329784A
Application number: CN202011323988.7A
Authority: CN
Inventors: 牛军浩; 王文胜; 苏金操; 骆薇羽; 许川佩; 朱爱军; 陈涛; 殷贤华; 张本鑫
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2020-11-23
Filing date: 2020-11-23
Publication date: 2021-02-05

Abstract

The invention discloses a relevant filtering tracking method based on space-time perception and multimodal response, which comprises the steps of firstly determining the position and the size of a tracking target on a first frame image of a tracking video; then, training a target prediction model of the t frame image by using the position and the size of the tracking target determined by the t-1 frame image; and finally, determining the position and the size of the tracking target on the t frame image by using the target prediction model of the t frame image. The invention can not only generate illumination change around the target and continuously and robustly track the posture change, but also can not influence the error update of the model because the target is shielded or the interference of similar objects, so that the model is always kept in a better state, and the target tracking accuracy rate is higher; meanwhile, real-time data can be rapidly processed in the tracking process, so that the method can be applied to actual life.

Description

Correlation filtering tracking method based on space-time perception and multimodal response

Technical Field

The invention relates to the technical field of target tracking, in particular to a correlation filtering tracking method based on space-time perception and multimodal response.

Background

Target tracking is one of research hotspots in the field of computer vision, and has wide application in the fields of face recognition, robot vision, intelligent monitoring and the like. The target tracking method based on deep learning gradually shows robust tracking performance due to strong characteristic learning capability of a deep neural network, but the method has large calculation amount, low algorithm tracking efficiency, incapability of meeting real-time requirements, high requirements on hardware resources and unsuitability for application in engineering products, so that the deep learning can achieve real-time effect only by using partial networks and continuous optimization. The discriminant tracking method has strong discriminant capability because of obviously distinguishing information of a background and a foreground, and occupies a mainstream position in the field of target tracking at present. In recent years, a correlation filter is introduced into a discriminant tracking framework, and a target tracking method based on the correlation filter achieves good effects. The MOOSE filter with the minimum square error output introduces correlation operation into target tracking, and greatly accelerates calculation through the theory that the spatial domain convolution becomes a Fourier domain. After that, the CSK algorithm of the nuclear detection tracking loop structure adopts a loop matrix to increase the number of samples, thereby improving the effect of the classifier. As an extension to CSK, oriented gradient features, gaussian kernel and ridge regression are used for the kernel correlation filter KCF. Aiming at the scale change of the target, the identification scale space tracking DSST solves the problem of scale estimation through a scale pyramid learning related filter. However, the above trackers do not solve the problem of target occlusion well or only aim at partial occlusion and short-time full occlusion of a target, and the existing occlusion criterion cannot be fused with the tracking algorithm well, so that the occlusion criterion can be judged wrongly in many times, which seriously affects the performance of the trackers. Therefore, although the existing target tracking algorithm has achieved great achievement, many problems still exist in accurate target tracking due to factors such as posture change, illumination change, partial occlusion, rapid movement, scale change and background complexity.

Disclosure of Invention

The invention aims to solve the problem that the tracking effect of the existing target tracking algorithm is poor, and provides a relevant filtering tracking method based on space-time perception and multimodal response.

In order to solve the problems, the invention is realized by the following technical scheme:

a correlation filtering tracking method based on space-time perception and multimodal response comprises the following steps:

step 1, determining the position and the size of a tracking target on a first frame image of a tracking video;

step 2, training a target prediction model of the t frame image by using the position and the size of the tracking target determined by the t-1 frame image;

step 2.1, firstly, based on the position and the size of the tracking target determined by the t-1 frame image, taking the position of the tracking target as the center and the size of the tracking target as the size of a cell, and selecting a candidate area containing more than 2 cells on the t-1 frame image; then, carrying out sample cyclic shift on the candidate region to obtain a training sample set for training a traditional characteristic correlation filter model and a depth characteristic correlation filter model;

step 2.2, firstly, respectively extracting the HOG characteristic (gradient characteristic), the CN characteristic (COLOR attribute characteristic) and the COLOR characteristic (COLOR characteristic) of each training sample in the training sample set, and performing vector superposition on the HOG characteristic, the CN characteristic and the COLOR characteristic extracted by the training sample to obtain the multi-dimensional fusion characteristic of each training sample; then, taking all the multi-dimensional fusion features as training samples, and training a traditional feature correlation filter model by using a ridge regression algorithm;

step 2.3, firstly, a training sample set is sent to a convolution network of deep learning to extract CNN characteristics (deep characteristics); then, combining and screening the extracted CNN characteristics and training samples in the training sample set by using a GMM model; finally, the merged and screened CNN features are used as training samples, and a ridge regression algorithm is used for training a depth feature correlation filter model;

step 2.4, taking the traditional characteristic correlation filter model obtained in the step 2.2 and the depth characteristic correlation filter model obtained in the step 3 as a target prediction model of the t frame image;

step 3, determining the position and the size of a tracking target on the t frame image by using a target prediction model of the t frame image;

3.1, based on the position and the size of the tracking target determined by the t-1 th frame image, taking the corresponding position of the tracking target as the center and the corresponding size of the tracking target as the size of a cell, and selecting a target search area containing more than 2 cells on the t-1 th frame image;

step 3.2, the target search area is sent into a traditional feature correlation filter model of a target prediction model of the t frame image, and traditional feature fusion response values of each cell of the target search area of the t frame image are predicted;

step 3.3, the target search area is sent into a depth feature correlation filter model of a target prediction model of the t frame image, and depth feature response values of each cell of the target search area of the t frame image are predicted;

step 3.4, respectively carrying out weighted fusion on the traditional characteristic fusion response value of each cell obtained in the step 3.2 and the depth characteristic response value of each cell obtained in the step 3.3 to obtain a target response value of each cell, and regarding the cell with the maximum target response value as the position of the tracking target on the t-th frame image;

step 3.5, with the position of the tracked target tracked in the step 3.4 as the center, constructing a size pyramid by scaling according to the proportion to predict the size of the tracked target on the t frame image;

step 4, repeating the steps 2 and 3 to realize target tracking of all frames of the tracking video;

t is 2,3, ….

In the step 2.1, the candidate area is a cross-shaped area formed by 4 areas with the same size as the tracking target, which are located above, below, on the left, and on the right of the tracking target, and the cross-shaped area is obtained by scaling the cross-shaped area by a preset multiple.

In the step 3.1, the target search area is a cross-shaped area formed by 4 areas with the same size as the tracking target, which are located above, below, on the left, and on the right of the tracking target, and the cross-shaped area is obtained by scaling the cross-shaped area by a preset multiple.

The specific process of the step 3.2 is as follows:

step 3.2.1, respectively extracting HOG characteristics, CN characteristics and COLOR characteristics of each cell in the target search area of the t-th frame image, and performing vector addition on the extracted HOG characteristics, CN characteristics and COLOR characteristics of each cell to form a traditional characteristic sample;

and 3.2.2, sending the tracking target determined by the traditional characteristic sample and the t-1 frame image into a traditional characteristic correlation filter model of a target prediction model of the t-frame image to obtain a traditional characteristic fusion response value of each cell of the target search area.

The specific process of the step 3.3 is as follows:

step 3.3.1, respectively sending each cell of the target search area into a deep learning convolution network to extract CNN characteristics, and finally selecting outputs of three layers of conv3, conv4 and conv5 as characteristic samples of CNN;

and 3.3.2, sending the tracking target determined by the CNN characteristic sample and the t-1 frame image into a depth characteristic correlation filter model of a target prediction model of the t-frame image to obtain depth characteristic response values of each cell of a target search area.

As an improvement, the correlation filtering tracking method based on spatio-temporal perception and multimodal response further includes the following steps:

step 3.6, after the position and the size of the tracking target of the t-th frame image are obtained, whether the determined tracking target is shielded or not needs to be judged by multi-peak target detection: if the occlusion does not exist, directly outputting the position and the size of the tracking target of the obtained t frame image; otherwise, the position and the size of the tracking target of the obtained t-th frame image are discarded.

Compared with the prior art, the invention has the following characteristics:

1. the invention uses the space-time area of the target neighborhood four blocks related to the context to better track by using a large number of samples.

2. The target can be detected more accurately by using a traditional feature and depth feature fusion mode and adaptive response value weight fusion, and the accuracy of the model is better ensured by adaptively distributing weights for the two models by using normalization and feature training loss.

3. Samples are merged by using a GMM modeling method in the deep sample extraction and storage processes, so that the information diversity of the samples can be ensured to a great extent, and the robustness is stronger.

4. The method of simultaneously using the scale pool after the target position is obtained can further determine the real size of the target.

5. And performing multimodal response consideration on target response, mapping the multimodal target into one dimension, and further judging occlusion to better avoid wrong model updating caused by occlusion.

6. Updating the formula using the new learning rate enables the model to be updated in a more stable state.

Drawings

FIG. 1 is a flow chart of a correlation filtering tracking method based on spatiotemporal perception and multimodal response.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to specific examples.

A correlation filtering tracking method based on spatio-temporal perception and multimodal response is disclosed, as shown in FIG. 1, which specifically includes the following steps:

step 1, determining the position and the size of a tracking target on a first frame image of a tracking video.

And 2, training a target prediction model of the t frame image by using the position and the size of the tracking target determined by the t-1 frame image. t is 2,3, ….

Step 2.1, firstly, based on the position and the size of the tracking target determined by the t-1 frame image, taking the position of the tracking target as the center and the size of the tracking target as the size of a cell, and selecting a candidate area containing more than 2 cells on the t-1 frame image. And then, carrying out sample cyclic shift on the candidate region to obtain a training sample set for training a traditional feature correlation filter model and a depth feature correlation filter model.

The candidate area is set according to design requirements, and can be a circular area surrounded by a preset radius by taking the position of the tracking target as a center; or a cross-shaped area formed by 4 areas with the same size as the tracking target on the upper, lower, left and right sides of the position of the tracking target by taking the position of the tracking target as a center; or a cross-shaped area formed by 4 areas with the same size as the tracking target on the upper, lower, left, and right sides of the position of the tracking target, and obtained by scaling the cross-shaped area by a preset multiple (e.g., 2 times or 2.5 times). In the invention, the candidate area is a cross-shaped area formed by 4 areas with the same size as the tracking target on the upper, lower, left and right sides of the position of the tracking target by taking the position of the tracking target as the center, and the cross-shaped area is obtained by zooming the cross-shaped area by preset times. In this embodiment, a specified tracking target is first increased by four equal-sized areas, i.e., upper, lower, left, and right, to be a target area, and then the target area is filled with padding of 1.5 times, i.e., the target area is enlarged by 2.5 times to be a candidate area.

And 2.2, firstly, respectively extracting the HOG characteristic, the CN characteristic and the COLOR characteristic of each training sample in the training sample set, and carrying out vector superposition on the HOG characteristic, the CN characteristic and the COLOR characteristic extracted by the training sample to obtain the multi-dimensional fusion characteristic of each training sample. Then, all the multi-dimensional fusion features are used as training samples, and a ridge regression algorithm is used for training a traditional feature correlation filter model.

In this embodiment, for each training sample in the training sample set, first, 31-dimensional HOG features, 11-dimensional CN features, and 1-dimensional COLOR features of the sample are extracted, and during calculation, the 11-dimensional CN features are reduced to 2-dimensional by using a PCA algorithm, and then, the 31-dimensional HOG features, the 2-dimensional CN features, and the 1-dimensional gray COLOR features are vector-superimposed by using a multi-channel technique to obtain 34-dimensional conventional fusion features.

1) Extracting HOG features

Firstly, the image is grayed, and the input image to be detected is normalized in color space by using a Gamma correction method, so that the purposes of adjusting the contrast of the image and reducing the influence caused by local shadow and illumination change are achieved, and meanwhile, the interference caused by noise can be inhibited. Then, the gradient of each pixel of the target image is calculated, so that basic contour information is mainly captured, and the influence caused by illumination is further reduced. Finally, dividing the image into small cell units each containing 4 × 4 pixel points, constructing a gradient histogram for each cell unit, then forming a block by 3 × 3 cell units, and connecting the feature vectors of the cell units contained in the block in series to obtain the gradient histogram of the block, and similarly connecting the feature vectors of the gradient histograms of all the blocks in the target image in series to form the directional gradient histogram of the target image, which is also a feature vector connected in series for use by the discriminant classifier.

2) Extracting CN features

The CN is characterized in that the optical invariance can be effectively self-adapted by mapping RGB color characteristics to 11-dimensional color representation probability, specifically, a subscript of a value mapped to 11-dimensional color space is obtained by calculating each dimension value of RGB, and discrete 11-dimensional color representation is selected, but 11-dimensional color representation can be better described by reducing the dimension of the 11-dimensional color representation to 2-dimensional due to the fact that calculation is complex.

3) Extracting COLOR features

The COLOR feature is represented by a COLOR histogram, which is a statistic of the COLOR distribution on the surface of the moving object and is not affected by changes in the shape, posture and the like of the object. Therefore, the histogram is used as the characteristic of the target, matching is carried out according to color distribution, and the method has the characteristics of good stability, partial occlusion resistance, simple calculation method and small calculation amount, and is an ideal target color characteristic.

4) Performing vector superposition calculation

D1 dimensional local context color feature x in current frame p^pColor feature x with dimension (m, n) reduced to D2^p(m, n), and the multi-channel fused kernel correlation calculation mode is as follows

Wherein

Is kernel function, Xc is the fusion feature of the C-th channel after vector superposition of HOG feature, CN feature and COLOR feature, which indicates a dot product operation, σ is the bandwidth of Gaussian kernel function, F^-1For inverse Fourier transform, after the characteristics are fused, a weighted value wp of any pixel point Pi in the target frame is calculated according to the characteristic, the weighted value can be called as a space distance weighted value, the weighted value is attenuated along with the distance from the center of the target, the weighted value is 1 at the center of the target, and the value approaches to 0 as the distance from the center point is farther.

5) And training the multi-dimensional fusion features of all the training samples to train the traditional feature correlation filter model by using a ridge regression algorithm.

After the calculation of the interior of the target area, the relevant filter can be trained according to the extracted HOG, CN and COLOR characteristics, and the specific training filter formula is as follows.

Wherein

That is, a penalty term of the upper and lower regions of the background is added more than that when the upper, lower, left and right blocks are not added, so that the template w to be trained and the background alpha are_iThe response is minimized when correlating, where f (x)_i)＝w^Tx_iRepresenting the output response of the target of the ith frame, wherein xi-wp represents the product of the cyclic matrix in the search area of the positive target and the weight corresponding to the sample in the cyclic matrix relative to the center of the selected target, yi represents the expected output response of the target of the ith frame, and lambda₁In order to be a factor for the regularization,

representing classifier parameters, α_iIth frame classifier parameters representing dual space

Representing the appearance model of the object, mu_i,

The expectation and variance of the response gaussian distribution are output separately for the ith frame target,

a cyclic matrix representing the ith frame, wherein m is 1, and 4 respectively corresponds to four candidate target search windows, namely the upper part, the lower part, the left part and the right part of a template region, and target features xi of a large number of training samples of the ith frame are determined according to the target features xi

Training classifier parameters wdec

Then the method is converted into a form of solving alpha, and Gaussian kernels are applied to calculate self-kernel correlation matrixes of target features xi of a large number of training samples in the ith frame

Further obtain

Wherein the content of the first and second substances,

is that

To obtain a conjugate matrix of the Fourier transform of

Then obtaining alpha through FFT inverse transformation_iAnd finishing the training of the classifier parameters of the ith frame.

And 2.3, firstly, sending the training sample set into a deep learning convolution network to extract CNN characteristic samples. The extracted CNN feature samples are then merged and screened with the samples in the sample set using the GMM model. And finally, taking the combined and screened CNN features as training samples, and training a depth feature correlation filter model by using a ridge regression algorithm.

The traditional features are all extracted based on single resolution, mainly concern the appearance of a target, but are greatly influenced by target deformation, the deep CNN features mainly extract semantic features and have strong robustness on appearance deformation, a deep layer is selected to be subjected to data enhancement processing, the existing VGG-19 network is selected as a deep network, a sample to be sampled is sent into the network to be subjected to convolution operation, output results of several layers of conv3-4, conv4-4 and conv5-4 after operation are obtained and are used as object CNN feature samples considered, then the results of the three layers are used as CNN feature samples to be trained to obtain depth filters corresponding to the three layers, finally, the three layers of depth filters are used for carrying out relevant operation on images, the output results of the three layers are fused to form a final confidence map to be output, and the sample is saved if the tracking effect is not wrong, and then, the steps are carried out all the time, the samples are stored, meanwhile, in order to consider the diversity of the samples, GMM modeling is carried out on the samples to be stored firstly, the samples with higher similarity degree are merged, and when the number of the stored samples exceeds a certain number, the samples with the smallest weight are deleted.

Aiming at the problem that the complexity of deep network calculation is high, namely the data calculation amount is large, so that the processing speed is reduced and the use is not facilitated, the method only extracts three layers of deep characteristic samples and uses a GMM sample modeling strategy, so that the influence of the calculation efficiency brought by the deep network can be greatly reduced.

As the resolution of the traditional features is single, CNN feature sampling is simultaneously carried out on a target area, the convolution features of an initial target and context background information are extracted by using a deep learning convolution network, the shallow convolution contains more position information, and the deep convolution contains more semantic information which can be used for identifying the appearance details of the target, so that the features of several layers including conv3-4, conv4-4 and conv5-4 are extracted by using a trained VGG-19 model, the features extracted from the three convolution layers are respectively trained by using a correlation filter, then different convolution templates are obtained, and a correlation filter model f is established by using the extracted CNN features through training_cnnIn addition, samples are stored in each frame when the deep network is trained, the maximum number of the stored samples is set to be 400, and therefore, along with the passage of time, on one hand, the calculation speed is low due to the calculation of a large number of samples, on the other hand, many calculation and storage samples are fused, the samples are single, and a large number of redundancy situations occur, so that model training in the later period is very bad, overfitting is easily caused by environmental influence and the target is lost. The GMM is modeled as:

wherein L is the number of components, the original M samples are reduced to L, where M is 400 and L is 50, the updating process is to initialize one component M and pi is to initialize one sample without updating one sample_m＝γandμ_m＝x_jIf the number of components exceeds L, one component with the smallest weight is discarded, otherwise, the two nearest components k and L are combined

And the input sample is the average of L components when calculating the filter.

And (3) sending the training sample set obtained through cyclic shift into a deep learning convolution network to extract CNN characteristics, taking the output of the three-layer convolution as a target characteristic sample, and sending the target characteristic sample into a ridge regression to train filter models corresponding to the three-layer samples. Because a large number of samples are needed in the process of training the depth filter, the result of successful tracking in the previous frame needs to be considered, the result of successful tracking in the previous frame is stored, a fixed value is determined for a candidate sample set, the successful samples are added into the sample set to be trained each time, the difference between the samples of adjacent frames is considered to be not large, if the samples of the similar frames are directly added, the diversity of the samples cannot be ensured, overfitting is easy, the result of successful tracking is added into the fixed number of training sample sets, the GMM model is firstly used for judging the similarity between the samples to be added and the stored samples in the sample set, if the samples are not similar, the samples are directly added as new samples, meanwhile, a certain weight is added to the candidate samples according to the similarity after correlation operation is carried out on the current samples, and the subsequent frames are sequentially processed, and considering the capacity of the sample set, when the sample set is full when a new sample is added, removing the sample with the minimum weight.

And 2.4, taking the traditional characteristic correlation filter model obtained in the step 2.2 and the depth characteristic correlation filter model obtained in the step 3 as a target prediction model of the t frame image.

Because the HOG gradient characteristic, the CN characteristic and the COLOR characteristic can better cope with the influence of rotary translation, illumination and partial shielding, the HOG gradient characteristic, the CN characteristic and the COLOR characteristic can be integrated into the invention to comprehensively cope with the interference of the target in the image due to the above situation. Because the depth feature CNN can comprehensively represent the multi-resolution condition of the image, namely the appearance change of the target is more robust, the constructed filter can be tracked more accurately by using the mode of fusing the traditional feature and the CNN feature.

And 3, determining the position and the size of the tracking target on the t frame image by using the target prediction model of the t frame image. t is 2,3, ….

And 3.1, based on the position and the size of the tracking target determined by the t-1 th frame image, taking the corresponding position of the tracking target as the center and the corresponding size of the tracking target as the size of a cell, and selecting a target search area containing more than 2 cells on the t-1 th frame image.

The target search area is set according to design requirements, and can be a circular area surrounded by a preset radius by taking the position of a tracking target as a center; or a cross-shaped area formed by 4 areas with the same size as the tracking target on the upper, lower, left and right sides of the position of the tracking target by taking the position of the tracking target as a center; or a cross-shaped area formed by 4 areas with the same size as the tracking target on the upper, lower, left, and right sides of the position of the tracking target, and obtained by scaling the cross-shaped area by a preset multiple (e.g., 2 times or 2.5 times). In the invention, the target search area is a cross-shaped area formed by 4 areas with the same size as the tracking target on the upper, lower, left and right sides of the position of the tracking target by taking the position of the tracking target as the center, and the cross-shaped area is obtained by zooming the cross-shaped area by preset times. In this embodiment, four regions of the same size, i.e., the upper, lower, left, and right blocks, are selected as cells centered on the prediction target position of the previous frame on the t-frame image, and the cells are enlarged by 2.5 times to be used as the target search region.

And 3.2, sending the target search area into a traditional feature correlation filter model of the target prediction model of the t frame image, and predicting traditional feature fusion response values of each cell of the target search area of the t frame image.

Step 3.2.1, respectively extracting HOG characteristics, CN characteristics and COLOR characteristics of each cell in the target search area of the t-th frame image, and performing vector addition on the extracted HOG characteristics, CN characteristics and COLOR characteristics of each cell to obtain a 35-dimensional traditional characteristic sample;

and 3.2.2, sending the traditional feature samples of the cells and the tracking target determined by the t-1 th frame image into a traditional feature correlation filter model of a target prediction model of the t-th frame image to obtain a traditional feature fusion response value of each cell of the target search area.

And 3.3, sending the target search area into a depth feature correlation filter model of a target prediction model of the t frame image, and predicting depth feature response values of each cell of the target search area of the t frame image.

And 3.3.1, respectively sending each cell of the target search area into a deep learning convolution network to extract the CNN characteristics, and finally selecting outputs of three layers of conv3, conv4 and conv5 as characteristic samples of the CNN.

And 3.3.2, sending the CNN characteristic samples of all the cells and the tracking target determined by the t-1 frame image into a current depth characteristic correlation filter model to obtain depth characteristic response values of all the cells.

And 3.4, respectively carrying out weighted fusion on the traditional characteristic fusion response value of each cell obtained in the step 3.2 and the depth characteristic response value of each cell obtained in the step 3.3 to obtain a target response value of each cell, and regarding the cell with the maximum target response value as the position of the tracking target on the t-th frame image.

Wherein in the training of the t-th frame corresponding to two features, the summary of the computation of the loss difference is as follows

Where sum (·) denotes summing each term in the matrix, and F ═ trad, cnn denotes the set of legacy features and depth features. The feature f corresponds to a normalized weight of

F-F represents a feature of F other than F.

Updating the original feature weights

For next frame tracking

Tau is an updating coefficient and takes an initial value of 0.2, and the trad and cnn characteristic weights of the first frame are initialized to be

In the detection stage of the t +1 th frame, response graphs obtained by using two characteristic filters of trad and cnn are respectively shown as

And

the fusion of the response map takes the following weighting approach,

in the formula (I), the compound is shown in the specification,

in order to be a weight of the conventional feature,

is the depth feature weight.

And 3.5, with the position of the tracked target tracked in the step 3.4 as the center, constructing a size pyramid according to the scaling and predicting to obtain the size of the tracked target on the t frame image.

Adding a scale filter to predict the scale change of the target according to the obtained position, sampling a scale pool at the previously obtained tracking position, and detecting the target scale S according to the detected target position Pt and the target scale S detected in the previous frame_t-1＝w_t-1×h_t-1Extracting a scale candidate region with Pt as the center is as follows:

constructing a scale pyramid by the above method, wherein w-1 and h-1 are the width and height of a target in the last frame, a is a scale factor and S is the total stage number of the scale, then uniformly scaling the obtained target samples with different scales into the size of w multiplied by h, simultaneously extracting the features of target areas with different scales, and then performing related operation with a one-dimensional scale related filter to obtain a scale response graph, wherein the position of the maximum response graph is the optimal scale S of the corresponding template_t。

Step 3.6, after the position and the size of the tracking target of the t frame image are obtained, whether the determined tracking target is shielded or not is judged by utilizing multimodal target detection: and if the occlusion does not exist, directly outputting the position and the size of the tracking target of the obtained t-th frame image. Otherwise, the position and the size of the tracking target of the obtained t-th frame image are discarded.

And 4, repeating the steps 2 and 3 to realize target tracking of all frames of the tracking video.

Because the target often causes the tracking failure condition due to shielding, the multimodal target detection adopted in the invention can better detect whether the target is shielded or not and can better detect the shielded or interfered condition of the target, so that the filter can obtain purer samples to update in time, and the filter can be practical for a long time. Generation of multimodal target detection: when the filter responds to the sample in the target search area, the filter is a response value which is centered at the previous frame position, and meanwhile, the filter also generates a response result by performing correlation operation on the sample generated after the sample in the search area is circularly shifted, namely, a plurality of response results are generated. If the target search area has target similar features, a plurality of peaks may appear in the response value of the target search area, if the highest peak is directly selected as the target area, the selection may be inaccurate, a target similar object or an occluded object is selected, so that a correct sample cannot be provided when the target is updated later, and the model may be mistaken according to the update of an error sample, so we select to perform one-dimensional projection on the response confidence of the multi-peak target in the search area to obtain a peak group, peak (peak 1, peak2, peak), and simultaneously, we calculate the specific size of each peak, and set two thresholds, namely, a peak size threshold of 1 and a number threshold of thresholded 2, first select the number of peaks in the peak group greater than threshold of 1 as the target peak group, and consider that the target similar interference or the occluded similar interference occurs if the number of the target peak groups is greater than threshold of 2, and carrying out second detection, wherein the second detection respectively uses each peak position as an image center to recalculate a response value, and finally the target is defined as the highest point of all responses.

The invention can not only generate illumination change around the target and continuously and robustly track the posture change, but also can not influence the error update of the model because the target is shielded or the interference of similar objects, so that the model is always kept in a better state, and the target tracking accuracy rate is higher; meanwhile, real-time data can be rapidly processed in the tracking process, so that the method can be applied to actual life.

It should be noted that, although the above-mentioned embodiments of the present invention are illustrative, the present invention is not limited thereto, and thus the present invention is not limited to the above-mentioned embodiments. Other embodiments, which can be made by those skilled in the art in light of the teachings of the present invention, are considered to be within the scope of the present invention without departing from its principles.

Claims

1. A correlation filtering tracking method based on space-time perception and multimodal response is characterized by comprising the following steps:

step 2.2, firstly, respectively extracting the HOG characteristic, the CN characteristic and the COLOR characteristic of each training sample in the training sample set, and carrying out vector superposition on the HOG characteristic, the CN characteristic and the COLOR characteristic extracted by the training sample to obtain the multi-dimensional fusion characteristic of each training sample; then, taking all the multi-dimensional fusion features as training samples, and training a traditional feature correlation filter model by using a ridge regression algorithm;

step 2.3, firstly, a training sample set is sent to a deep learning convolution network to extract CNN characteristics; then, combining and screening the extracted CNN characteristics and training samples in the training sample set by using a GMM model; finally, the merged and screened CNN features are used as training samples, and a ridge regression algorithm is used for training a depth feature correlation filter model;

t is 2,3, ….

2. The correlation filtering tracking method based on spatio-temporal perception and multimodal response as claimed in claim 1, wherein in step 2.1, the candidate region is a cross-shaped region formed by 4 regions with the same size as the tracked target, which are located above, below, left, right, left, and right of the tracked target, and the cross-shaped region is scaled by a preset multiple.

3. The correlation filtering tracking method based on spatio-temporal perception and multimodal response as claimed in claim 1, wherein in step 3.1, the target search region is a cross-shaped region formed by 4 blocks of regions with the same size as the tracked target, which are located above, below, left, right, and left of the tracked target, and the cross-shaped region is scaled by a preset multiple.

4. The correlation filtering tracking method based on spatio-temporal perception and multi-peak response as claimed in claim 1, wherein the specific process of step 3.2 is as follows:

5. The correlation filtering tracking method based on spatio-temporal perception and multi-peak response as claimed in claim 1, wherein the specific process of step 3.3 is as follows:

step 3.3.1, respectively sending each cell of the target search area into a deep learning convolution network to extract CNN characteristics, and finally selecting the output of the 3 rd to 5 th layers of convolution layers as CNN characteristic samples;

6. The correlation filtering tracking method based on spatio-temporal perception and multi-peak response as claimed in claim 1, further comprising the steps of: