CN109993775A - Single target tracking method based on feature compensation - Google Patents

Single target tracking method based on feature compensation Download PDF

Info

Publication number
CN109993775A
Authority
CN
China
Prior art keywords
target
pixel
image
histogram
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910258571.8A
Other languages
Chinese (zh)
Other versions
CN109993775B (en)
Inventor
杨云
白杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University YNU filed Critical Yunnan University YNU
Priority to CN201910258571.8A priority Critical patent/CN109993775B/en
Publication of CN109993775A publication Critical patent/CN109993775A/en
Application granted granted Critical
Publication of CN109993775B publication Critical patent/CN109993775B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20021 Dividing image into blocks, subimages or windows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20076 Probabilistic image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20112 Image segmentation details
    • G06T2207/20132 Image cropping

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video target tracking method based on feature compensation among a posterior pixel color histogram, a histogram of oriented gradients and a convolutional neural network: simple features are used in simple scenes to guarantee real-time performance, and complex features are used in complex scenes to guarantee accuracy. By combining the two features of the posterior pixel histogram and the histogram of oriented gradients, the resulting response map adapts well to fairly simple video scenes; a classifier is trained to judge when the response obtained by fusing the two features is not credible. According to the judgement of the classifier, the method then chooses whether to switch in a slower but more robust convolutional neural network tracker, which corrects a target that has begun to drift or recovers a target that has been lost. The invention improves the precision with which the size and position of the target in the video are judged, and adapts well to long-duration target tracking tasks, so as to reach practical application scenarios.

Description

Single target tracking method based on feature compensation
Technical Field
The invention belongs to the technical field of single-target tracking in computer vision, and particularly relates to a single-target tracking method based on feature compensation.
Background
In the field of computer vision, tracking has always been a core problem and is widely applied in video surveillance, human-computer interaction, robot visual perception, military guidance and many other areas. In single-target tracking, the position and size of the tracked target are marked manually with a rectangular frame in the first frame of a video, and the tracking method then only has to keep that rectangular frame closely on the manually marked object in the subsequent frames. By contrast, target detection scans and searches for targets over the whole frame of a still image or a dynamic video; in short, detection focuses on localization and classification, whereas target tracking focuses on locking onto a person or object in real time without knowing by itself what it is tracking. Because of the real-time requirement of tracking, searching the whole frame is computationally far too expensive and clearly unsuitable; since the tracked object is continuous in time and space, the search range of the tracking method can be greatly reduced. However, precisely because of this continuity, complex scenes in the tracking process contain interference factors such as illumination change, appearance deformation, fast motion, occlusion and background similarity, and most tracking models must continuously update themselves during the tracking task; once a model learns background information, errors are easily produced and keep accumulating, and the target is finally lost.
Currently, most of the mainstream tracking algorithms are short-term tracking (short-term tracking), and mainly have the following defects:
(1) poor robustness
The tracking model cannot recover the target once it has been lost; these algorithms focus mainly on the precision of the position and size of the tracked target, their robustness is not high, they cannot adapt to long-duration tracking tasks, and the models cannot be applied well to real scenes.
(2) Low speed
Both end-to-end neural-network tracking models and tracking models that combine deep convolutional feature maps with correlation filtering can obtain high accuracy, but they spend a great deal of computation time, so they are rarely applied in real scenes. Other traditional tracking models based on correlation filtering can achieve high speed but do not perform well enough in accuracy and robustness.
(3) Error accumulation
Because of the various interference factors in a video scene, it is difficult for the model to track the target correctly in every frame, so background or other interference information is learned when the template is updated; the error accumulates, and this is an irreversible process.
To address these defects and make the method applicable to real scenes, the entry point is placed on long-term tracking: robustness is improved as much as possible, so that the method suits long-duration tracking tasks while keeping the speed real-time.
Disclosure of Invention
The invention aims to provide a feature-compensation video target tracking method based on a posterior pixel color histogram, a histogram of oriented gradients and a convolutional neural network, so that the method is robust and fast: while accuracy and robustness are improved, the model is still guaranteed a high frame rate. The method improves the accuracy with which the size and position of the target in the video are judged, and adapts better to long-duration target tracking tasks, so as to reach practical application scenarios.
The technical scheme adopted by the invention provides a feature-compensation video target tracking method based on feature fusion, comprising the following steps:
S1, establishing a target tracking model branch of the color histogram feature:
S11, before a target tracking task starts, calling an OpenCV toolkit, and cutting out a target sub-image E with background information on the basis of a target image which is manually marked;
S12, separating the foreground area from the background area of the target sub-image with the background information according to the size of the target and a certain proportion; meanwhile, the pixels are subjected to scale compression within an integer range of 0-32 pixel values, and pixel proportions relative to each pixel value in a corresponding foreground region and a corresponding background region, namely a foreground pixel proportion ρ(O) and a background pixel proportion ρ(B), are calculated by respectively depending on a foreground mask and a background mask which are the same in size, wherein the expression of the pixel proportions ρ is as follows:
ρ(O)=N(O)/|O|; (1-1)
ρ(B)=N(B)/|B|; (1-2)
wherein O denotes the image area of the foreground O, B denotes the image area of the background B, N(O) represents the number of non-zero pixel values in the image area of the foreground O, N(B) represents the number of non-zero pixel values in the image area of the background B, |O| represents the total number of pixel values in the image area of the foreground O, and |B| represents the total number of pixel values in the image area of the background B; the weight β_t of the posterior pixel color histogram template of the current frame is then calculated based on formulas (1-1) and (1-2):
Wherein t represents the current frame, and lambda is a hyper-parameter;
S13, in the next frame of the video, within the image range centered on the target center of the previous frame as the search area, a sub-image e is cut out as in S12 and its pixels are scale-compressed to obtain ψ; the weight β_{t-1} of the posterior pixel color histogram template of the previous frame is obtained according to formulas (1-1), (1-2) and (2), and the color histogram response f_hist is finally obtained using the integral image formula;
wherein ψ is the sub-image after M-channel pixel compression, defined on the current-frame cropped picture e; ψ_t is the sub-image after M-channel pixel compression of the current frame; H represents each pixel point of the picture over the corresponding integer range; u represents each one of the H grids, ψ[u] is the corresponding pixel point on ψ, and the superscript T denotes the matrix transpose;
S14, each time the tracking task of one frame is completed, the weight β_t of the posterior pixel histogram template is updated at the position predicted in the current frame, that is, the foreground pixel proportion ρ(O) and the background pixel proportion ρ(B) are updated respectively, giving the updated pixel proportion ρ_t(O) of the foreground O of the current frame and the updated pixel proportion ρ_t(B) of the background B of the current frame:
ρ_t(O) = (1-η_hist)ρ_{t-1}(O) + η_hist ρ'_t(O)
ρ_t(B) = (1-η_hist)ρ_{t-1}(B) + η_hist ρ'_t(B); (4)
wherein ρ'_t(O) is the pixel proportion in the image area of the foreground O of the current frame, ρ'_t(B) is the pixel proportion in the image area of the background B of the current frame, ρ_{t-1}(O) is the pixel proportion in the image area of the foreground O of the previous frame, ρ_{t-1}(B) is the pixel proportion in the image area of the background B of the previous frame, and η_hist is the weight for the pixel-proportion update;
S2, establishing a target tracking model branch of the histogram of oriented gradients feature:
S21, on the basis of the target image to be tracked selected with the rectangular frame in S11, another target region sub-image E' of a different size but likewise carrying background information is cut out, the three-dimensional histogram of oriented gradients features Φ_k of K channels are extracted and multiplied by the cosine window function in the OpenCV package, and the template of the histogram of oriented gradients features is calculated:
wherein all variables are defined in the frequency domain and are obtained through the discrete Fourier transform; u represents each grid of Γ, where Γ is the integer range of grids corresponding to Φ_k; the superscript i denotes each of the K channels; one factor is the conjugate of the Gaussian signal after the Fourier transform, the conjugate being denoted by *; ⊙ denotes element-wise multiplication; the other factor is each channel element of the histogram of oriented gradients feature Φ_k obtained by the Fourier transform; and K is the number of channels;
S22, the histogram of oriented gradients template obtained in S21 is inverse-Fourier-transformed to obtain h[u]; in the next frame of the video, within the image range centered on the target center of the previous frame as the search area, a sub-image e' is cut out, the histogram of oriented gradients feature φ of the current sub-image is extracted, and the histogram of oriented gradients score f_hog of the current frame is obtained by calculating the linear function
f_hog(φ, h) = Σ_{u∈Γ} h[u]^T φ[u]; (7)
S23, after the tracking task of each frame is completed, the template of the histogram of oriented gradients feature is updated at the position predicted in the current frame, that is, the two updated final signals of formula (8) are obtained respectively;
wherein the two signals of the current frame are each calculated from formula (6) and are interpolated with the corresponding signals of the previous frame to give the updated final signals, and η_hog is the weight for the update of the histogram of oriented gradients template;
S3, feature fusion and classifier establishment:
S31, the color histogram response f_hist obtained in S13 and the histogram of oriented gradients score f_hog obtained in S22 are fused by defining a linear function f(x), giving:
f(x) = γ_hog f_hog(x) + γ_hist f_hist(x); (9)
wherein γ_hog is the weight of the histogram of oriented gradients response and γ_hist is the weight of the color histogram response; the coordinate of the point at which f(x) attains its maximum is taken as the center coordinate of the target;
S32, a classifier is trained from f and f_hog: a batch of video sequences is selected, the fused output f from S31 and f_hog are recorded, the input of the data set is set to X = [max(f_hog); max(f)], and the output label h'_θ denotes the true value of the data set, an integer of 0 or 1, where 0 indicates that the tracking box of the model has deviated from the target and 1 indicates no deviation from the target; the logistic regression function h_θ represents the output of the classifier:
the data are divided into a training set and a validation set in the proportion 7:3, and the parameter θ of the logistic regression model in formula (10) is obtained after the training set converges under a cross-entropy loss function and a gradient descent algorithm over multiple iterations; the hyper-parameters are then fine-tuned with the validation data by evaluating the classification accuracy under different parameter values and selecting the value with the highest accuracy as the final parameter value, so that the classifier achieves a better classification result on the validation set;
S4, judging whether the convolutional neural network structure tracker needs to be accessed:
S41, f and f_hog are input into the classifier obtained in S32 to obtain the output; 0.5 is selected as a threshold on the continuous values output by the classifier of S32 (i.e. formula (10)); when the output is greater than 0.5, the result of the fusion model is trustworthy and there is no need to switch in the convolutional neural network tracker; when the output is less than 0.5, the result of the fusion model is not accepted, and the convolutional neural network tracker needs to be switched in;
S42, when the target response score predicted by the convolutional neural network tracker for the current frame is high, S14 and S23 are applied again, that is, the posterior pixel histogram template and the histogram of oriented gradients template are updated respectively; the tracking task of the next frame is then entered, until all video frames are finished.
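For orientation, the per-frame control flow of steps S1-S4 can be sketched in Python as below. Every callable passed in (hist_response, hog_response, classifier_prob, cnn_track, update_templates) is a hypothetical placeholder for the corresponding step, and the 0.7/0.3 fusion weights and 0.9 response-score threshold are assumed example values, not values stated in the text.

```python
import numpy as np

def track_one_frame(frame, hist_response, hog_response, classifier_prob,
                    cnn_track, update_templates, gamma_hog=0.7, gamma_hist=0.3):
    """Sketch of the per-frame control flow of S1-S4; all callables are placeholders."""
    f_hist = hist_response(frame)                       # S13: color histogram response map
    f_hog = hog_response(frame)                         # S22: HOG correlation response map
    f = gamma_hog * f_hog + gamma_hist * f_hist         # S31: fusion, formula (9)

    if classifier_prob(f_hog.max(), f.max()) > 0.5:     # S41: fusion result is trusted
        cy, cx = np.unravel_index(np.argmax(f), f.shape)
        center, reliable = (cx, cy), True
    else:                                               # S41: switch in the CNN tracker
        center, score = cnn_track(frame)
        reliable = score > 0.9                          # S42: assumed score threshold

    if reliable:                                        # S42: update only when confident
        update_templates(frame, center)                 # formula (4) and formula (8) updates
    return center
```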
The invention has the beneficial effects that:
(1) The method of the invention fuses multiple features and, by combining their characteristics, can cope well with video scenes involving illumination change, motion blur, object deformation, occlusion and the like: in simple scenes the tracking task is completed quickly with simple features, and in complex scenes the influence of interference information is reduced by switching to more robust features.
(2) The invention adds a self-detection classifier, so that the model behaves more intelligently when features are switched and is discouraged from learning invalid information when the feature templates are updated, thereby reducing the accumulation of errors; at the same time the classifier is simple and does not require much computational overhead.
(3) The tracker with the neural network structure selected by the invention does not need to update its template, so it does not learn interference information and performs well when the target is occluded.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of foreground and background masks.
FIG. 2 is a graph of a posterior pixel histogram and response.
Fig. 3 is a histogram of directional gradients and a response plot.
FIG. 4 is a schematic diagram of a single target tracking algorithm based on feature compensation.
Fig. 5 is a distribution diagram of accuracy and robustness of each algorithm under the reset mechanism.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the field of target tracking, the main problems are deformation, illumination change, fast motion, similar background, in-plane rotation, scale change, occlusion, out-of-view targets and the like.
The specific process is as follows:
S1, establishing a target tracking model branch of the color histogram feature:
S11, the method is established for the single-target tracking task: before the target tracking task starts, the OpenCV toolkit is called, the target to be tracked is selected with a rectangular frame by manual annotation, and a target sub-image E carrying background information is cut out; the model can then distinguish the selected target from its background according to their characteristics, so as to complete the subsequent tracking task. In such a scenario, whatever features are used, the model generates an initial feature template in its own way from the image inside the target frame selected in the first frame of the video, and matches it against candidate regions of the images of subsequent frames so as to predict the position and size of the target.
S12, on the basis of the first frame in which the tracked target has been selected, the color histogram model separates the target sub-image E carrying background information into a foreground region and a background region according to the size of the target and a certain proportion. Because the pixel value range is 0-255, calculating with the original pixel values would consume a great deal of time, so the pixels are scale-compressed; the selected scale is 8, i.e. the calculation is performed over the integer range 0-32, which greatly improves the model speed. Relying on a foreground mask (as shown in figure 1-a, a single-channel image in which the white target region has the value 1 and the black background region the value 0) and a background mask (as shown in figure 1-b, a single-channel image in which the black target region has the value 0 and the white background region the value 1) of the same size, the pixel proportions of the two regions relative to each pixel value, namely the foreground pixel proportion ρ(O) and the background pixel proportion ρ(B), are calculated respectively:
ρ(O)=N(O)/|O|; (1-1)
ρ(B)=N(B)/|B|; (1-2)
wherein O denotes the image area of the foreground O, B denotes the image area of the background B, N(O) represents the number of non-zero pixel values in the image area of the foreground O, N(B) represents the number of non-zero pixel values in the image area of the background B, |O| represents the total number of pixel values in the image area of the foreground O, and |B| represents the total number of pixel values in the image area of the background B; after the pixel proportions of the foreground and the background are obtained respectively, the weight β_t of the posterior pixel color histogram template of the current frame can be calculated:
t denotes the current frame and λ is the hyper-parameter.
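As an illustration of S12, the pixel-proportion and template-weight computation can be sketched with NumPy as below. Formula (2) appears only as an image in the original, so the per-bin weight ρ(O)/(ρ(O)+ρ(B)+λ) used here is an assumed, Staple-style posterior form consistent with the surrounding definitions; the function name and the single-channel simplification are illustrative only.

```python
import numpy as np

def posterior_pixel_weights(patch, fg_mask, bg_mask, n_bins=32, lam=1e-3):
    """Per-bin posterior weights beta_t for one channel of the target sub-image E (S12)."""
    bins = (patch.astype(np.int32) * n_bins) // 256        # scale compression to 0..31
    fg_hist = np.bincount(bins[fg_mask > 0], minlength=n_bins)
    bg_hist = np.bincount(bins[bg_mask > 0], minlength=n_bins)
    rho_o = fg_hist / max(int((fg_mask > 0).sum()), 1)     # formula (1-1): N(O)/|O|
    rho_b = bg_hist / max(int((bg_mask > 0).sum()), 1)     # formula (1-2): N(B)/|B|
    beta = rho_o / (rho_o + rho_b + lam)                   # assumed form of formula (2)
    return beta, rho_o, rho_b
```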
S13, after the posterior pixel histogram template has been established, in the next frame of the video, within the image range centered on the target center of the previous frame as the search area, a sub-image e is likewise cut out and its pixels are scale-compressed to obtain ψ; the weight β_{t-1} of the posterior pixel color histogram template of the previous frame is obtained according to formulas (1-1) and (1-2), and the resulting color histogram response f_hist is obtained using the integral image formula, as shown in figs. 2-a, 2-b and 2-c:
wherein ψ is the sub-image after M-channel pixel compression, defined on the current-frame cropped picture e; ψ_t is the sub-image after M-channel pixel compression of the current frame; H represents each pixel point of the picture over the corresponding integer range; u represents each one of the H grids, ψ[u] is the corresponding pixel point on ψ, and the superscript T denotes the matrix transpose;
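A sketch of the response computation of S13: each pixel of the search sub-image is scored by looking up its compressed bin in β_{t-1}, and the per-box score is then obtained with a box filter, which OpenCV implements via an integral image. The function name and box-size arguments are illustrative placeholders.

```python
import cv2
import numpy as np

def hist_response(search_patch, beta, box_w, box_h, n_bins=32):
    """Color histogram response f_hist over a single-channel search sub-image e (S13)."""
    bins = (search_patch.astype(np.int32) * n_bins) // 256    # same scale compression as S12
    per_pixel = beta[bins].astype(np.float32)                 # psi[u] scored by a beta lookup
    # Mean of per-pixel scores over every box of size (box_w, box_h);
    # cv2.boxFilter computes this with an integral image internally.
    return cv2.boxFilter(per_pixel, -1, (box_w, box_h))
```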
S14, during online tracking the scene in the video may change slightly or drastically at any time, and interference factors such as illumination change and motion blur affect the color histogram feature particularly severely; therefore, in order to better adapt to the various changes in the video scene, each time the tracking task of one frame is completed the weight β_t of the posterior pixel histogram template must be updated at the position predicted in the current frame, i.e. the pixel proportions ρ(O) and ρ(B) of the foreground and background are updated respectively:
ρ_t(O) = (1-η_hist)ρ_{t-1}(O) + η_hist ρ'_t(O)
ρ_t(B) = (1-η_hist)ρ_{t-1}(B) + η_hist ρ'_t(B); (4)
wherein ρ'_t(O) is the pixel proportion in the image area of the foreground O of the current frame, ρ'_t(B) is the pixel proportion in the image area of the background B of the current frame, ρ_{t-1}(O) is the pixel proportion in the image area of the foreground O of the previous frame, ρ_{t-1}(B) is the pixel proportion in the image area of the background B of the previous frame, and η_hist is the weight for the pixel-proportion update;
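A minimal sketch of the update of formula (4); η_hist = 0.04 is an assumed example learning rate, not a value stated in the text.

```python
def update_pixel_ratios(rho_prev, rho_curr, eta_hist=0.04):
    """Linear interpolation update of a pixel-proportion vector, formula (4)."""
    return (1.0 - eta_hist) * rho_prev + eta_hist * rho_curr
```

The same call serves both ρ(O) and ρ(B); β_t is then recomputed from the updated proportions as in S12.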
S2, establishing a target tracking model branch of the histogram of oriented gradients feature:
S21, on the basis of the manually annotated tracking target selected in the first frame, a target region sub-image E' carrying background information is cut out, and the three-dimensional histogram of oriented gradients features Φ_k of K channels are extracted, as shown in figs. 3-a and 3-b; the influence of the peripheral part of the sub-image is suppressed by multiplying by the cosine window function in the OpenCV package, and the template of the histogram of oriented gradients features is calculated:
wherein all variables are defined in the frequency domain and are obtained through the discrete Fourier transform; because the cross-correlation operation in correlation-filtering models consumes a great deal of computation time, Fourier-transforming the variables converts the convolution in the time domain into an element-wise product in the frequency domain, which greatly reduces the computation time. u represents each grid of Γ, where Γ is the integer range of grids corresponding to Φ_k; the superscript i denotes each of the K channels; one factor is the conjugate of the Gaussian signal after the Fourier transform, the conjugate being denoted by *; ⊙ denotes element-wise multiplication; the other factor is each channel element of the histogram of oriented gradients feature Φ_k obtained by the Fourier transform; and K is the number of channels.
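Since the template formula itself appears only as an image above, the sketch below assumes the standard multi-channel ridge-regression (MOSSE/DCF-style) closed form, which matches the description of a conjugated Gaussian numerator and a summed-channel denominator; the function and variable names are illustrative.

```python
import numpy as np

def train_hog_template(features, gaussian_label, lam=1e-4):
    """Frequency-domain HOG template of S21.

    features: (H, W, K) cosine-windowed HOG features Phi_k.
    gaussian_label: (H, W) desired Gaussian response g.
    """
    F = np.fft.fft2(features, axes=(0, 1))                  # per-channel DFT
    G = np.fft.fft2(gaussian_label)
    numerator = np.conj(G)[..., None] * F                   # conj(g_hat) ⊙ F^i, per channel
    denominator = (np.conj(F) * F).sum(axis=2).real + lam   # sum_k |F^k|^2 + assumed regularizer
    return numerator / denominator[..., None]               # template h_hat^i
```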
S22, after the histogram of oriented gradients template has been established, an inverse Fourier transform is carried out to obtain h[u]; in the next frame of the video, within the image range centered on the target center of the previous frame as the search area, a search sub-image e' is cut out, the histogram of oriented gradients feature φ of the current sub-image is extracted, and the histogram of oriented gradients score f_hog of the current frame can be obtained by calculating the linear function below; the effect is shown in fig. 3-c:
f_hog(φ, h) = Σ_{u∈Γ} h[u]^T φ[u]; (7)
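Evaluating formula (7) at every cyclic shift of the search sub-image is equivalent to an element-wise product in the Fourier domain summed over the K channels; a sketch under that standard equivalence follows (the function name is illustrative).

```python
import numpy as np

def hog_response(template_hat, features):
    """HOG response map f_hog of formula (7), computed in the frequency domain (S22)."""
    F = np.fft.fft2(features, axes=(0, 1))                   # DFT of phi, per channel
    resp_hat = (np.conj(template_hat) * F).sum(axis=2)       # sum over the K channels
    return np.real(np.fft.ifft2(resp_hat))                   # back to the spatial domain
```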
S23, in the online tracking stage, the histogram of oriented gradients is likewise disturbed by changes of the target in the scene, and the influence of object deformation in particular is large; therefore, after the tracking task of each frame is completed, the template of the histogram of oriented gradients feature is updated at the position predicted in the current frame:
wherein the two signals of the current frame are each calculated from formula (6) and are interpolated with the corresponding signals of the previous frame to give the updated final signals, and η_hog is the weight for the update of the histogram of oriented gradients template.
S3, feature fusion and classifier establishment:
S31, the color histogram feature is strongly affected when interference factors such as illumination change and image blur are present in the scene, while the histogram of oriented gradients feature is strongly affected when interference factors such as object deformation and fast motion are present. Fusing the two features therefore reduces the interference from these factors to a certain degree and improves the accuracy and robustness of the tracking model, so that a more accurate position and size of the target can be predicted in the tracking task and the target is not easily lost. Here the color histogram response f_hist obtained in S13 and the histogram of oriented gradients score f_hog obtained in S22 are fused by defining a linear function f(x), giving:
f(x) = γ_hog f_hog(x) + γ_hist f_hist(x); (9)
wherein γ_hog is the weight of the histogram of oriented gradients response and γ_hist is the weight of the color histogram response; the coordinate of the point at which f(x) attains its maximum is taken as the center coordinate of the target.
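The fusion and localization of S31 in code form; the 0.7/0.3 split of γ_hog and γ_hist is an assumed example, since the text does not state the values.

```python
import numpy as np

def fuse_and_locate(f_hog, f_hist, gamma_hog=0.7, gamma_hist=0.3):
    """Formula (9): fuse the two response maps and take the peak as the target center."""
    f = gamma_hog * f_hog + gamma_hist * f_hist
    cy, cx = np.unravel_index(np.argmax(f), f.shape)   # row/column of the maximum of f(x)
    return f, (cx, cy)
```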
S32, although the fusion of the two features performs well in most scenes, there is still considerable room for improvement in complicated video scenes with similar background, occlusion or out-of-view targets. Therefore another, more robust and more effective tracker with a neural network structure is added to improve the performance of the model. Considering that the running speed of a neural network is slow and that commonly available hardware cannot meet the real-time requirement, the neural network tracker is used only when the model fusing the former two features cannot complete the tracking task of the current frame well; only in this way can the performance of the model be exploited to the greatest extent. The key requirement is to let the feature-fusion model know when the neural network tracker needs to be switched in. Analysing how the three scores f, f_hist and f_hog (the argument (x) is written only where a mapping appears in a formula, and is omitted otherwise) change in different scenes shows that f and f_hog fluctuate strongly when the target deforms greatly or is occluded, so a classifier can be trained with these two values to provide the switching flag for the tracker.
A batch of video sequences is selected, and the fused output f from S31 and f_hog are recorded; the input of the data set is set to X = [max(f_hog); max(f)], and the output label h'_θ denotes the true value of the data set, an integer of 0 or 1, where 0 indicates that the tracking box of the model has deviated from the target and 1 indicates no deviation from the target. Target detection has the concept of Intersection-over-Union (IoU), which represents the overlap rate between the predicted bounding box and the ground-truth box, and IoU is used here as the basis for measuring whether the target has deviated; after experimental trials of several values, 0.35 proved a suitable boundary value, i.e. when IoU > 0.35, h'_θ = 1, and when IoU < 0.35, h'_θ = 0. The logistic regression function h_θ represents the output of the classifier:
The data are divided into a training set and a validation set in the proportion 7:3, and the weight θ in formula (10) is obtained after the training set converges under a cross-entropy loss function and a gradient descent algorithm over multiple iterations. The hyper-parameters are then fine-tuned with the validation data by evaluating the classification accuracy under different parameter values and selecting the value with the highest accuracy as the final parameter value, so that the classifier achieves a better classification result on the validation set.
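A sketch of the data construction and training of S32: labels are produced from IoU against the ground-truth box with the 0.35 boundary, and a two-input logistic regression is fitted by gradient descent on the cross-entropy loss. The helper names and the learning-rate/iteration values are illustrative assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-Union of two (x, y, w, h) boxes, used to label h'_theta (0.35 boundary)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    yb = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def train_switch_classifier(X, y, lr=0.1, iters=2000):
    """Logistic regression h_theta of formula (10) on inputs X = [max(f_hog); max(f)],
    trained with cross-entropy and gradient descent on a 7:3 train/validation split."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(X)
    idx = np.random.permutation(n)
    tr, va = idx[: int(0.7 * n)], idx[int(0.7 * n):]
    Xb = np.hstack([X, np.ones((n, 1))])                  # add a bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb[tr] @ theta))         # sigmoid
        theta -= lr * Xb[tr].T @ (p - y[tr]) / len(tr)    # gradient of the cross-entropy loss
    p_va = 1.0 / (1.0 + np.exp(-Xb[va] @ theta))
    val_acc = float(((p_va > 0.5) == (y[va] > 0.5)).mean())   # used for hyper-parameter tuning
    return theta, val_acc
```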
S4, judging whether the convolutional neural network structure tracker needs to be accessed:
S41, after the classifier model has been trained and fine-tuned in the previous step, it can judge, during the tracking stage, whether the model fusing the color histogram feature and the histogram of oriented gradients feature can adapt to the current video scene, and therefore whether the neural network tracker needs to be switched in. During tracking, f_hog obtained from formula (7) and f obtained from formula (9) are used as the input of the classifier, i.e. formula (10), and the obtained output is the switching flag. When the validation data in the previous step were used for parameter tuning, 0.5 was selected as a suitable threshold; when the output is greater than this threshold, the result of the fusion model is trustworthy and no switching is needed; when the output is less than this threshold, the result of the fusion model is not accepted, and the neural network tracker is switched in. The neural network tracker selected here is DaSiamRPN, which combines the idea and structure of the Region Proposal Network (RPN) from target detection; it copes better with some complex scenes, fits the size of the deformed target more accurately, and does not need to update the target template online, so there is no template pollution caused by accumulated errors.
S42, in the online tracking stage, the posterior pixel histogram template and the histogram of oriented gradients template must be updated by formulas (4) and (8) respectively, in order to adapt to changes of the scene in the video. Likewise, after the switched-in DaSiamRPN tracker completes the tracking task of the current frame, the updates with these two formulas are still required. However, because DaSiamRPN can also fail to track, the templates are updated only when the target response score predicted by the tracker for the current frame is high; the tracking task of the next frame then begins, until all video frames are finished. The whole tracking process is shown in fig. 4.
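The switching and gated-update logic of S41-S42 can be condensed as below; the bias convention follows the training sketch above, and the 0.9 response-score threshold for gating the updates after DaSiamRPN runs is an assumed example, not a value given in the text.

```python
import numpy as np

def should_switch(theta, f_hog_max, f_max, thresh=0.5):
    """S41: return True when the CNN tracker (DaSiamRPN) should be switched in, i.e. h_theta < 0.5."""
    x = np.array([f_hog_max, f_max, 1.0])                 # same [max(f_hog), max(f), bias] layout as training
    h = 1.0 / (1.0 + np.exp(-x @ theta))
    return h < thresh

def maybe_update_templates(cnn_response_score, do_update, score_thresh=0.9):
    """S42: apply the formula (4) and formula (8) updates only when the CNN tracker's
    response score for the current frame is high (assumed threshold)."""
    if cnn_response_score > score_thresh:
        do_update()
```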
Examples
To evaluate the performance of the present invention, experiments were performed on a test set of video sequences. The evaluation method, data set and evaluation system of the VOT (Visual Object Tracking) challenge were selected for the experiments. The data set comprises 60 video sequences involving scenes such as occlusion, illumination change, target motion, scale change, camera motion and out-of-view targets; several attributes can appear in one video sequence, and the visual attributes of different frames differ, so the model can be evaluated more accurately. Before VOT was proposed, the popular evaluation system initialized the tracker at the first frame of a sequence and let it run to the last frame. However, because a tracker may fail early due to one or two factors, such an evaluation ends up using only a small part of the sequence, which is wasteful. VOT proposes that the evaluation system should detect a failure when the tracker loses the target and reinitialize the tracker 5 frames after the failure, so as to make full use of the data set.
The experimental scores under the reset mechanism are examined first, as shown in Table 1:
Table 1 Scores of different algorithms under the reset mechanism
In Table 1, A-R rank represents the Accuracy and Robustness ranking index; Overlap is equivalent to accuracy and represents the overlap rate between the target predicted by the tracking method and the manually annotated ground-truth target, and the larger the Overlap the more accurate the prediction; Failure is used to evaluate the stability of the tracking method, and the smaller the value the better the stability. Compared with the 7 other tracking methods, the present method ranks first in accuracy and third in stability. The scoring trend of all algorithms in the table can also be seen more intuitively in Fig. 5. However, in a real scene it is impossible to reset after a tracking failure, so the earlier evaluation system without resets clearly has more reference value for real scenes; its experimental scores are shown in Table 2:
Table 2 Scores of different algorithms without the reset mechanism
In Table 2, AUC (Area Under the Curve) is an index for evaluating the performance of the algorithm; the larger the value, the better the performance. The speed indicator FPS (Frames Per Second) is likewise better when larger. It can be seen that, without the reset mechanism, i.e. when the scoring system does not re-locate the target after a tracking failure, the accuracy of the present method is the highest compared with the other 7 methods, and among the three methods with the highest accuracy it is the fastest. In addition, on a hardware configuration of CPU: Intel Core i7-6700 and GPU: GeForce GT730, experiments show that SiamFC, the fastest of the other methods in the table, reaches only 3 FPS, whereas the present method reaches 30 FPS. Compared with the other methods, the present method therefore has higher accuracy while keeping a respectable speed, and is more suitable for real scenes.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (1)

1. A single target tracking method based on feature compensation is characterized by comprising the following steps:
S1, establishing a target tracking model branch of the color histogram feature:
S11, before a target tracking task starts, calling an OpenCV toolkit, and cutting out a target sub-image E with background information on the basis of a target image which is manually marked;
S12, separating the foreground area from the background area of the target sub-image with the background information according to the size of the target and a certain proportion; meanwhile, the pixels are subjected to scale compression within an integer range of 0-32 pixel values, and pixel proportions relative to each pixel value in a corresponding foreground region and a corresponding background region, namely a foreground pixel proportion ρ(O) and a background pixel proportion ρ(B), are calculated by respectively depending on a foreground mask and a background mask which are the same in size, wherein the expression of the pixel proportions ρ is as follows:
ρ(O)=N(O)/|O|; (1-1)
ρ(B)=N(B)/|B|; (1-2)
wherein O denotes the image area of the foreground O, B denotes the image area of the background B, N(O) represents the number of non-zero pixel values in the image area of the foreground O, N(B) represents the number of non-zero pixel values in the image area of the background B, |O| represents the total number of pixel values in the image area of the foreground O, and |B| represents the total number of pixel values in the image area of the background B; the weight β_t of the posterior pixel color histogram template of the current frame is then calculated based on formulas (1-1) and (1-2):
Wherein t represents the current frame, and lambda is a hyper-parameter;
S13, in the next frame of the video, within the image range centered on the target center of the previous frame as the search area, a sub-image e is cut out as in S12 and its pixels are scale-compressed to obtain ψ; the weight β_{t-1} of the posterior pixel color histogram template of the previous frame is obtained according to formulas (1-1), (1-2) and (2), and the color histogram response f_hist is finally obtained using the integral image formula;
wherein ψ is the sub-image after M-channel pixel compression, defined on the current-frame cropped picture e; ψ_t is the sub-image after M-channel pixel compression of the current frame; H represents each pixel point of the picture over the corresponding integer range; u represents each one of the H grids, ψ[u] is the corresponding pixel point on ψ, and the superscript T denotes the matrix transpose;
S14, each time the tracking task of one frame is completed, the weight β_t of the posterior pixel histogram template is updated at the position predicted in the current frame, that is, the foreground pixel proportion ρ(O) and the background pixel proportion ρ(B) are updated respectively, giving the updated pixel proportion ρ_t(O) of the foreground O of the current frame and the updated pixel proportion ρ_t(B) of the background B of the current frame:
ρ_t(O) = (1-η_hist)ρ_{t-1}(O) + η_hist ρ'_t(O)
ρ_t(B) = (1-η_hist)ρ_{t-1}(B) + η_hist ρ'_t(B); (4)
wherein ρ'_t(O) is the pixel proportion in the image area of the foreground O of the current frame, ρ'_t(B) is the pixel proportion in the image area of the background B of the current frame, ρ_{t-1}(O) is the pixel proportion in the image area of the foreground O of the previous frame, ρ_{t-1}(B) is the pixel proportion in the image area of the background B of the previous frame, and η_hist is the weight for the pixel-proportion update;
S2, establishing a target tracking model branch of the histogram of oriented gradients feature:
S21, on the basis of the target image to be tracked selected with the rectangular frame in S11, another target region sub-image E' of a different size but likewise carrying background information is cut out, the three-dimensional histogram of oriented gradients features Φ_k of K channels are extracted and multiplied by the cosine window function in the OpenCV package, and the template of the histogram of oriented gradients features is calculated:
wherein all variables are defined in the frequency domain and are obtained through the discrete Fourier transform; u represents each grid of Γ, where Γ is the integer range of grids corresponding to Φ_k; the superscript i denotes each of the K channels; one factor is the conjugate of the Gaussian signal after the Fourier transform, the conjugate being denoted by *; ⊙ denotes element-wise multiplication; the other factor is each channel element of the histogram of oriented gradients feature Φ_k obtained by the Fourier transform; and K is the number of channels;
S22, the histogram of oriented gradients template obtained in S21 is inverse-Fourier-transformed to obtain h[u]; in the next frame of the video, within the image range centered on the target center of the previous frame as the search area, a sub-image e' is cut out, the histogram of oriented gradients feature φ of the current sub-image is extracted, and the histogram of oriented gradients score f_hog of the current frame is obtained by calculating the linear function
f_hog(φ, h) = Σ_{u∈Γ} h[u]^T φ[u]; (7)
S23, after the tracking task of each frame is completed, the template of the histogram of oriented gradients feature is updated at the position predicted in the current frame, that is, the two updated final signals of formula (8) are obtained respectively;
wherein the two signals of the current frame are each calculated from formula (6) and are interpolated with the corresponding signals of the previous frame to give the updated final signals, and η_hog is the weight for the update of the histogram of oriented gradients template;
S3, feature fusion and classifier establishment:
S31, the color histogram response f_hist obtained in S13 and the histogram of oriented gradients score f_hog obtained in S22 are fused by defining a linear function f(x), giving:
f(x) = γ_hog f_hog(x) + γ_hist f_hist(x); (9)
wherein γ_hog is the weight of the histogram of oriented gradients response and γ_hist is the weight of the color histogram response; the coordinate of the point at which f(x) attains its maximum is taken as the center coordinate of the target;
S32, a classifier is trained from f and f_hog: a batch of video sequences is selected, the fused output f from S31 and f_hog are recorded, the input of the data set is set to X = [max(f_hog); max(f)], and the output label h'_θ denotes the true value of the data set, an integer of 0 or 1, where 0 indicates that the tracking box of the model has deviated from the target and 1 indicates no deviation from the target; the logistic regression function h_θ represents the output of the classifier:
the data are divided into a training set and a validation set in the proportion 7:3, and the parameter θ of the logistic regression model in formula (10) is obtained after the training set converges under a cross-entropy loss function and a gradient descent algorithm over multiple iterations; the hyper-parameters are then fine-tuned with the validation data by evaluating the classification accuracy under different parameter values and selecting the value with the highest accuracy as the final parameter value, so that the classifier achieves a better classification result on the validation set;
S4, judging whether the convolutional neural network structure tracker needs to be accessed:
S41, f and f_hog are input into the classifier obtained in S32 to obtain the output; 0.5 is selected as a threshold on the continuous values output by the classifier of S32; when the output is greater than 0.5, the result of the fusion model is trustworthy and there is no need to switch in the convolutional neural network tracker; when the output is less than 0.5, the result of the fusion model is not accepted, and the convolutional neural network tracker needs to be switched in;
S42, when the target response score predicted by the convolutional neural network tracker for the current frame is high, S14 and S23 are applied again, that is, the posterior pixel histogram template and the histogram of oriented gradients template are updated respectively; the tracking task of the next frame is then entered, until all video frames are finished.
CN201910258571.8A 2019-04-01 2019-04-01 Single target tracking method based on characteristic compensation Active CN109993775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910258571.8A CN109993775B (en) 2019-04-01 2019-04-01 Single target tracking method based on characteristic compensation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910258571.8A CN109993775B (en) 2019-04-01 2019-04-01 Single target tracking method based on characteristic compensation

Publications (2)

Publication Number Publication Date
CN109993775A true CN109993775A (en) 2019-07-09
CN109993775B CN109993775B (en) 2023-03-21

Family

ID=67132176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910258571.8A Active CN109993775B (en) 2019-04-01 2019-04-01 Single target tracking method based on characteristic compensation

Country Status (1)

Country Link
CN (1) CN109993775B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490148A (en) * 2019-08-22 2019-11-22 四川自由健信息科技有限公司 A kind of recognition methods for behavior of fighting
CN110647836A (en) * 2019-09-18 2020-01-03 中国科学院光电技术研究所 Robust single-target tracking method based on deep learning
CN110675423A (en) * 2019-08-29 2020-01-10 电子科技大学 Unmanned aerial vehicle tracking method based on twin neural network and attention model
CN110738149A (en) * 2019-09-29 2020-01-31 深圳市优必选科技股份有限公司 Target tracking method, terminal and storage medium
CN111046796A (en) * 2019-12-12 2020-04-21 哈尔滨拓博科技有限公司 Low-cost space gesture control method and system based on double-camera depth information
CN111260686A (en) * 2020-01-09 2020-06-09 滨州学院 Target tracking method and system for anti-shielding multi-feature fusion of self-adaptive cosine window
CN112991395A (en) * 2021-04-28 2021-06-18 山东工商学院 Vision tracking method based on foreground condition probability optimization scale and angle
CN115063449A (en) * 2022-07-06 2022-09-16 西北工业大学 Hyperspectral video-oriented three-channel video output method for target tracking


Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0795385A (en) * 1993-09-21 1995-04-07 Dainippon Printing Co Ltd Method and device for clipping picture
EP0951182A1 (en) * 1998-04-14 1999-10-20 THOMSON multimedia S.A. Method for detecting static areas in a sequence of video pictures
EP1126414A2 (en) * 2000-02-08 2001-08-22 The University Of Washington Video object tracking using a hierarchy of deformable templates
WO2010001364A2 (en) * 2008-07-04 2010-01-07 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Complex wavelet tracker
DE102009038364A1 (en) * 2009-08-23 2011-02-24 Friedrich-Alexander-Universität Erlangen-Nürnberg Method and system for automatic object recognition and subsequent object tracking according to the object shape
US20130083192A1 (en) * 2011-09-30 2013-04-04 Siemens Industry, Inc. Methods and System for Stabilizing Live Video in the Presence of Long-Term Image Drift
CN102750708A (en) * 2012-05-11 2012-10-24 天津大学 Affine motion target tracing algorithm based on fast robust feature matching
CN103426178A (en) * 2012-05-17 2013-12-04 深圳中兴力维技术有限公司 Target tracking method and system based on mean shift in complex scene
US20150146022A1 (en) * 2013-11-25 2015-05-28 Canon Kabushiki Kaisha Rapid shake detection using a cascade of quad-tree motion detectors
CN103793926A (en) * 2014-02-27 2014-05-14 西安电子科技大学 Target tracking method based on sample reselecting
CN104299247A (en) * 2014-10-15 2015-01-21 云南大学 Video object tracking method based on self-adaptive measurement matrix
CN104361611A (en) * 2014-11-18 2015-02-18 南京信息工程大学 Group sparsity robust PCA-based moving object detecting method
WO2017088050A1 (en) * 2015-11-26 2017-06-01 Sportlogiq Inc. Systems and methods for object tracking and localization in videos with adaptive image representation
WO2017132830A1 (en) * 2016-02-02 2017-08-10 Xiaogang Wang Methods and systems for cnn network adaption and object online tracking
WO2017143589A1 (en) * 2016-02-26 2017-08-31 SZ DJI Technology Co., Ltd. Systems and methods for visual target tracking
US20180372499A1 (en) * 2017-06-25 2018-12-27 Invensense, Inc. Method and apparatus for characterizing platform motion
CN108346159A (en) * 2018-01-28 2018-07-31 北京工业大学 A kind of visual target tracking method based on tracking-study-detection
CN108447078A (en) * 2018-02-28 2018-08-24 长沙师范学院 The interference of view-based access control model conspicuousness perceives track algorithm
CN109360223A (en) * 2018-09-14 2019-02-19 天津大学 A kind of method for tracking target of quick spatial regularization

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Dai Fengzhi et al.: "Survey of research progress on video tracking based on deep learning", Computer Engineering and Applications *
Li Jie et al.: "Template matching tracking algorithm based on particle swarm optimization", Journal of Computer Applications *
Wu Xing et al.: "Robust feature recognition and accurate path tracking for vision-guided AGVs", Transactions of the Chinese Society for Agricultural Machinery *
Lu Weijian et al.: "Robust moving target tracking method based on multiple templates", Transducer and Microsystem Technologies *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490148A (en) * 2019-08-22 2019-11-22 四川自由健信息科技有限公司 A kind of recognition methods for behavior of fighting
CN110675423A (en) * 2019-08-29 2020-01-10 电子科技大学 Unmanned aerial vehicle tracking method based on twin neural network and attention model
CN110647836A (en) * 2019-09-18 2020-01-03 中国科学院光电技术研究所 Robust single-target tracking method based on deep learning
CN110738149A (en) * 2019-09-29 2020-01-31 深圳市优必选科技股份有限公司 Target tracking method, terminal and storage medium
CN111046796A (en) * 2019-12-12 2020-04-21 哈尔滨拓博科技有限公司 Low-cost space gesture control method and system based on double-camera depth information
CN111260686A (en) * 2020-01-09 2020-06-09 滨州学院 Target tracking method and system for anti-shielding multi-feature fusion of self-adaptive cosine window
CN111260686B (en) * 2020-01-09 2023-11-10 滨州学院 Target tracking method and system for anti-shielding multi-feature fusion of self-adaptive cosine window
CN112991395A (en) * 2021-04-28 2021-06-18 山东工商学院 Vision tracking method based on foreground condition probability optimization scale and angle
CN112991395B (en) * 2021-04-28 2022-04-15 山东工商学院 Vision tracking method based on foreground condition probability optimization scale and angle
CN115063449A (en) * 2022-07-06 2022-09-16 西北工业大学 Hyperspectral video-oriented three-channel video output method for target tracking

Also Published As

Publication number Publication date
CN109993775B (en) 2023-03-21

Similar Documents

Publication Publication Date Title
CN109993775B (en) Single target tracking method based on characteristic compensation
CN111797716B (en) Single target tracking method based on Siamese network
CN110335290B (en) Twin candidate region generation network target tracking method based on attention mechanism
CN108986140B (en) Target scale self-adaptive tracking method based on correlation filtering and color detection
CN107424177B (en) Positioning correction long-range tracking method based on continuous correlation filter
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN110135500B (en) Target tracking method under multiple scenes based on self-adaptive depth characteristic filter
CN108062531A (en) A kind of video object detection method that convolutional neural networks are returned based on cascade
CN111260688A (en) Twin double-path target tracking method
CN109859241B (en) Adaptive feature selection and time consistency robust correlation filtering visual tracking method
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN108364305B (en) Vehicle-mounted camera video target tracking method based on improved DSST
CN111091583B (en) Long-term target tracking method
CN110569706A (en) Deep integration target tracking algorithm based on time and space network
CN110009663B (en) Target tracking method, device, equipment and computer readable storage medium
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN111429485A (en) Cross-modal filtering tracking method based on self-adaptive regularization and high-reliability updating
CN109544584B (en) Method and system for realizing inspection image stabilization precision measurement
CN113129332A (en) Method and apparatus for performing target object tracking
CN110544267A (en) correlation filtering tracking method for self-adaptive selection characteristics
CN117274314A (en) Feature fusion video target tracking method and system
CN117058192A (en) Long-time tracking method integrating space-time constraint and adjacent area re-detection
CN110991565A (en) Target tracking optimization algorithm based on KCF
CN112614158B (en) Sampling frame self-adaptive multi-feature fusion online target tracking method
CN110827324B (en) Video target tracking method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant