CN104537686A - Tracking method and device based on target space-time consistency and local sparse representation - Google Patents

Tracking method and device based on target space-time consistency and local sparse representation

Info

Publication number
CN104537686A
CN104537686A (application number CN201410770556.9A)
Authority
CN
China
Prior art keywords
image blocks
candidate
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410770556.9A
Other languages
Chinese (zh)
Other versions
CN104537686B (en)
Inventor
张文生
杨叶辉
谢源
胡文锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201410770556.9A priority Critical patent/CN104537686B/en
Publication of CN104537686A publication Critical patent/CN104537686A/en
Application granted granted Critical
Publication of CN104537686B publication Critical patent/CN104537686B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a tracking method and device based on target space-time consistency and local sparse representation. The method includes the following steps: first, a positive sample set and a candidate sample set are acquired according to the known tracking result; second, the samples are divided into blocks with a fixed grid and each block is labeled; third, sparse coding is performed on the local sample image blocks, and the average of the codes of the positive-sample image blocks with the same label is calculated; fourth, a target space-time consistency measure is defined, and all candidate samples are scored according to this measure; fifth, the candidate with the highest space-time consistency score is selected from the candidate sample set as the tracking result. By computing the space-time consistency of the samples, the method and device effectively handle occlusion and improve tracking accuracy. In addition, through sparse representation of local image blocks, the influence of external illumination changes and target posture changes on tracking is reduced, and target tracking efficiency is also improved.

Description

Tracking method and device based on target space-time consistency and local sparse representation
Technical Field
The invention belongs to the technical field of image target tracking, and relates to a tracking method and a tracking device based on target space-time consistency and local sparse representation.
Background
The target representation problem is the first problem a tracking method must consider. At present, target representations can be broadly classified into the following categories: ① pixel-level representation, in which the target is represented by image pixel-level features, such as taking the target pixel values directly as input, or representations based on target color, texture, and so on; ② hand-crafted feature-level representation, in which artificially designed feature operators (such as SIFT features, HOG features, Haar-like features and the like) are used to extract features from the target to represent it; ③ representation based on a descriptive model, such as subspace representation, sparse representation, and so on.
However, pixel-level representation is redundant; representing the target directly by its pixels [Duffner, S., Garcia, C.: PixelTrack: a fast adaptive algorithm for tracking non-rigid objects, in: ICCV, 2480-2487, 2013] not only makes the discriminability of the target hard to guarantee but also incurs a correspondingly large computational cost. Description models using hand-crafted features place specific requirements on the characteristics of the target, that is, different targets suit different feature descriptors, and the recognition effect degrades when the same feature representation is applied to different targets. The hand-crafted feature-level representation therefore depends heavily on subjective human choices, so its generalization and robustness are poor. Compared with the first two representation methods, the descriptive model can better overcome the problems set forth above. However, the subspace representation [D. Ross, J. Lim, R. S. Lin, and M. H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vis. (IJCV), (2008) 125-141] is sensitive to partial occlusion and outliers. Research has found that this problem can be handled relatively conveniently by representing the target sparsely.
The L1 tracker (L1T) [X. Mei and H. Ling, Robust visual tracking using ℓ1 minimization, in: ICCV, 1436-1443, 2009] is a currently popular sparse-representation-based method that handles occlusion by adding trivial templates to the coding dictionary. However, the method encodes the whole target, so it only uses the global information of the target and does not use the local detail information of the image. Moreover, because the dimension of the raw target is high, coding the whole target is very time-consuming.
Disclosure of Invention
Aiming at the above problems, the invention provides a tracking method based on target space-time consistency and local sparse representation. According to the change characteristics of the target during tracking, the method defines the assumption of space-time consistency of the target between two consecutive frames: (1) Temporal consistency: the target does not change much in two consecutive frames, so the target is considered consistent in appearance in two consecutive frames. (2) Spatial consistency: since the target is similar in two consecutive frames, the position of each part of the target is substantially fixed across the two frames. For example, if the samples are divided into image blocks that do not overlap with each other, then in a sample, once an image block is judged to be a target image block, and several image blocks adjacent to it are also judged to be target blocks, the sample is a target with high confidence. Based on the space-time consistency characteristics defined above, the invention provides a measure for calculating the space-time consistency score of a sample, and thereby addresses the target occlusion problem.
The invention addresses illumination changes, target posture changes and similar problems by sparsely coding the local image blocks of a sample; on this basis, it addresses local occlusion by dividing the target into blocks and defining the space-time consistency of the target image blocks; and it improves coding speed by performing sparse coding only on image blocks of relatively small dimension.
The invention provides a tracking method based on target space-time consistency and local sparse representation, which comprises the following steps:
step one: acquiring a positive sample set for the current frame by Gaussian disturbance according to the position of the tracking result of the previous frame;
step two: dividing each positive sample in the positive sample set into image blocks and marking, and learning a dictionary by using the image blocks obtained by division;
step three: carrying out sparse coding on image blocks obtained by dividing each positive sample in the positive sample set by using the learning dictionary, and calculating average vectors of sparse coding of the image blocks with the same marks;
step four: acquiring a candidate sample set of a current frame from a tracking result of a previous frame by using a motion model with affine transformation state change, and segmenting each candidate sample in the candidate sample set into image blocks and marking the image blocks;
step five: calculating a space-time consistency score of each candidate sample according to the average vector of the image block sparse codes corresponding to the image blocks with the same mark in the positive sample set and the image block sparse code of each candidate sample;
step six: and determining the tracking result of the current frame according to the space-time consistency score of each candidate sample in the candidate sample set based on Bayesian inference.
The invention also provides a tracking device based on the space-time consistency and the local sparse representation of the target, which comprises the following components:
a positive sample collection module: acquiring a positive sample set for the current frame by Gaussian disturbance according to the position of the tracking result of the previous frame;
a segmentation and labeling module: dividing each positive sample in the positive sample set into image blocks and marking, and learning a dictionary by using the image blocks obtained by division;
a sparse coding module: Carrying out sparse coding on the image blocks obtained by dividing each positive sample in the positive sample set by using the learned dictionary, and calculating the average vectors of the sparse codes of the image blocks with the same marks;
a candidate sample segmentation module: acquiring a candidate sample set of a current frame from a tracking result of a previous frame by using a motion model with affine transformation state change, and segmenting each candidate sample in the candidate sample set into image blocks and marking the image blocks;
a scoring module: calculating a space-time consistency score of each candidate sample according to the average vector of the image block sparse codes corresponding to the image blocks with the same mark in the positive sample set and the image block sparse code of each candidate sample;
a tracking module: and determining the tracking result of the current frame according to the space-time consistency score of each candidate sample in the candidate sample set based on Bayesian inference.
Through sparse coding, the tracking method of the invention is more robust to illumination changes, target posture changes and the like, and by dividing the target into blocks and defining the space-time consistency of the target image blocks, the method can effectively handle occlusion. Since the speed of sparse coding is strongly affected by the dimension of the signal to be coded, performing sparse coding on local image blocks of relatively small dimension is faster than tracking methods that code the whole image.
Drawings
FIG. 1 is a schematic diagram of the sampling area and unified sample size in the present invention;
FIG. 2 is a schematic diagram of sample blocking and labeling according to the present invention;
FIG. 3 is a schematic diagram of the steps of calculating a sample spatiotemporal consistency score according to the present invention;
FIG. 4 is a flow chart of a complete tracking method of the present invention;
FIG. 5 is a diagram illustrating the tracking of key frames on a public data set according to the present invention.
Detailed Description
In order to make the technical solution, implementation steps and tracking effect of the present invention more clearly understood, the following detailed description of the embodiments of the present invention is provided in conjunction with the technical solution and the accompanying drawings.
The invention provides a tracking method based on target space-time consistency and local sparse representation. According to the change characteristics of the target during tracking, the method defines the assumption of space-time consistency of the target between two consecutive frames: (1) Temporal consistency: the target does not change much in two consecutive frames, so the target is considered consistent in appearance in two consecutive frames. (2) Spatial consistency: since the target is similar in two consecutive frames, the position of each part of the target is substantially fixed across the two frames. For example, if the samples are divided into image blocks that do not overlap with each other, then in a sample, once an image block is judged to be a target image block, and several image blocks adjacent to it are also judged to be target blocks, the sample is a target with high confidence. Based on the space-time consistency characteristics defined above, the invention provides a measure for calculating the space-time consistency score of a sample, and thereby addresses the target occlusion problem.
The invention addresses illumination changes, target posture changes and similar problems by sparsely coding the local image blocks of a sample; on this basis, it addresses local occlusion by dividing the target into blocks and defining the space-time consistency of the target image blocks; and it improves coding speed by performing sparse coding only on image blocks of relatively small dimension. The method comprises the following concrete steps:
step S1: and acquiring a sample, namely adding Gaussian disturbance to the central position of the tracking result of the previous frame in the step to obtain a positive sample central position set, acquiring the positive sample set on the image of the previous frame according to the central position set, and using the positive sample set for the current frame. The size of the collected positive sample is the same as the target size of the previous frame. Specifically, the method comprises the following steps: since the present invention is a generative model, only a positive sample needs to be collected. The invention utilizes Gaussian disturbance to collect the positive sample around the target tracking result of the previous frame, and the center of the collected positive sample meets LposS-Lpos||<CinnerWherein L ispos=[xpos,ypos]Is the center position, x, of the previous frame tracking resultpos,yposRespectively as abscissa and ordinate; l isposSFor the central position of the collected sample, CinnerThe size of the sample is consistent with the target size of the previous frame. The sample set collected is represented as:where each column is a sample and d and N are the sample dimension and number of samples, respectively. Step S2: division of the sampleAnd cutting and marking. As shown in fig. 2, the target position is defined by a rectangular box. The conventional sample blocking method is different from the present invention in that a series of overlapping image blocks are acquired by sliding a sub-window over a sample region. The invention divides the sample by using the fixed grid, obtains non-overlapped image blocks and can completely represent the sample. By sample segmentation with an h x h grid, for each sample, h is obtained2And each image block. As shown in fig. 2, the image blocks are labeled {1, 2.,. h, according to different positions in the grid2In the following steps, the invention uses these markers to indicate the position of the image block in the sample.
Step S3: and local dictionary learning. The present invention acquires image blocks by sliding on the positive samples acquired in step S1 through a fixed-size sub-window, and learns the dictionary D using these image blocks as training samples. The size of the sub-window for acquiring the training image block is consistent with the size of the image block obtained by using the grid in step S2. The dictionary learning method of the invention is an Online dictionary learning method [ J.Mairal and F.Bach, Online dictionary learning for dictionary coding, in: ICML 689-]. Step S4: the t-th frame candidate sample is acquired based on the resulting position of the known t-1 tracking. The invention uses affine transformation to describe the position relation between two continuous frame targets. The state of the tracking result at frame t-1 can be determined using six parameters of the affine transformation: st-1={s(t-1)x,s(t-1)y,θt-1,sct-1t-1,ψt-1In which(s)(t-1)x,s(t-1)y) Coordinates of x-direction and y-direction in the image plane, theta, for the center of the t-1 th tracking result(t-1)、sc(t-1)(t-1)、ψ(t-1)The t-1 tracking result rotation angle, the scaling scale, the target width-length ratio and the inclination angle parameter are respectively. The candidate sample state for the t-th frame may be determined by counting the number of samples in st-1And adding Gaussian disturbance to obtain the target. Definition ofp(st|st-1)=N(st;st-1Sigma) as a state-change movementModels with which the motion model can be basedst-1Acquiring a tth candidate sample state stAnd obtaining a candidate sample X on the t frame image according to the candidate sample state*Where Σ is the diagonal matrix, and the elements on the diagonal are the variances of each parameter of the previous t-1 frame tracking target affine state. The method acquires P candidate samples as a candidate sample set in each frame and records the candidate sample set as a candidate sample setWherein the 1 st item in the small brackets is a candidate sample, and the 2 nd item is the corresponding affine transformation state.
Step S5: and carrying out sparse coding on the image blocks obtained by dividing the positive sample set and the candidate sample set. Through step S1, a positive sample set is obtained. For any sample X in the sample setiH is obtained by step S22And each image block. Through step S3, an overcomplete sparse dictionary D may be learned. Through step S4, a candidate sample is obtained. In this step, the present invention performs sparse coding on the image blocks in the positive sample set and the candidate sample set (the candidate samples are divided by step S2) by using the dictionary D, and the sparse coding solving model is as follows:
wherein,for image blocks at the jth position of the ith sample in the sample set, dictionary To correspond to image block pijI ∈ {1, 2., N }, j ∈ {1, 2., h }, h ∈2}. λ is control l1A regularized weighting factor. Tong (Chinese character of 'tong')By solving this Lasso regression equation, the sample p is calculatedijConversion to sparse coding alphaij. The present invention utilizes the open source tool LARS [ B.Efron, T.Hastie, and I.Johnstone, Least angle regression, Ann.Stat., 32(2), (2004)407 499.]And solving the model (1).
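A minimal sketch of step S5, assuming scikit-learn's sparse_encode with a LARS-based Lasso solver as a stand-in for the LARS tool cited above; the dictionary layout (one atom per row) follows the sketch of step S3.

from sklearn.decomposition import sparse_encode

def encode_blocks(blocks, dictionary, lam=0.1):
    # Step S5 (sketch): solve the Lasso model (1) for every image block p_ij,
    # min_a 0.5 * ||p - D a||_2^2 + lam * ||a||_1, with a LARS-based solver.
    # blocks: (n_blocks, block_dim); dictionary D: (n_atoms, block_dim).
    return sparse_encode(blocks, dictionary, algorithm='lasso_lars', alpha=lam)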
Step S6: and (4) calculating the average sparse coding of positive samples. Sparse coding of the image blocks by converting the positive sample image blocks into image blocks { a } through step S5ijWhere i ═ 1, 2.., N }, j ═ 1, 2.., h2}. In this step, the average vector of image blocks having the same label is calculated according to the following formula.
ᾱ_j = (1/N) Σ_{i=1}^{N} α_ij    (2)
where ᾱ_j is the average vector of the sparse codes of the image blocks labeled j.
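A one-function sketch of step S6: given one sparse code per positive sample and per grid position, equation (2) is an average over the positive samples. The array layout is an assumption of this sketch.

def average_positive_codes(pos_codes):
    # Step S6 (sketch): pos_codes has shape (N, h*h, n_atoms), one sparse code per
    # positive sample and per grid position; equation (2) averages over the N samples,
    # giving the mean code alpha_bar_j for every grid position j.
    return pos_codes.mean(axis=0)  # shape (h*h, n_atoms)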
Step S7: and calculating the consistency score of the candidate sample. The average vector of sparse coding at each position of the positive sample image block in the grid is obtained through step S6, and then any candidate sample X obtained through step S4 is used*The candidate sample image blocks are blocked and marked by the same grid in step S1, and sparse coded by step S5 to obtain h2Sparse coding of image blocksAccording to the assumption of time domain consistency, the invention designs the following formula to calculate the similarity of the average vectors with the same mark between the sparse codes of the image blocks at different positions and the previous frame, and the similarity is used as the time domain confidence of the candidate samples:
T(X*) = (1/h²) Σ_{j=1}^{h²} sim(ᾱ_j, α*_j)    (3)
where sim(·,·) can be designed as any similarity measure; in the invention, a vector similarity is used to measure the similarity between the sparse code α*_j of the image block at the j-th position of the candidate sample and the average code ᾱ_j of the image blocks at the j-th position of the previous-frame positive samples, where ᾱ_j is calculated as in step S6.
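The following sketch computes the temporal confidence of equation (3). The invention leaves the concrete form of sim(·,·) open; cosine similarity is used here as one admissible choice, which is an assumption of this sketch.

import numpy as np

def temporal_confidence(cand_codes, mean_codes, eps=1e-12):
    # Equation (3) (sketch): average per-position similarity between the candidate's
    # block codes alpha*_j and the previous-frame mean codes alpha_bar_j; cosine
    # similarity is used here as one admissible choice of sim(., .).
    num = (cand_codes * mean_codes).sum(axis=1)
    den = np.linalg.norm(cand_codes, axis=1) * np.linalg.norm(mean_codes, axis=1) + eps
    sims = num / den
    return sims.mean(), sims  # T(X*) and the per-block similarities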
Analyzing equation (3), we find that the more similar a candidate sample is to the previous-frame positive samples, the larger T(X*) is. Thus, formula (3) enforces the temporal consistency of the target image blocks. However, if the target is occluded or changes its posture during tracking, a local part of the target may become very dissimilar to the previous-frame samples, and a large error may occur if the similarity calculated by formula (3) is used alone as the discrimination index. Accordingly, the invention defines the spatial consistency of the target image blocks, which is based on the following assumption: if an image block of a candidate sample is regarded as a target image block, then the more of the image blocks surrounding it are also regarded as target image blocks, the higher the spatial confidence of the sample. The formula is defined as follows:

S(X*) = (1/h²) Σ_{j=1}^{h²} o*_j · (1/|N(j)|) Σ_{k∈N(j)} 1(o*_k, o*_j)    (4)

where N(j) is the neighbor image block set of the j-th image block of candidate sample X*, |N(j)| is the number of elements in the neighbor image block set, and 1(·,·) is an indicator function whose output is 1 if its two inputs are the same and 0 otherwise. o*_j is used for judging whether the j-th image block of candidate sample X* is a target image block, and is defined as follows:

o*_j = 1 if sim(ᾱ_j, α*_j) > τ, and o*_j = 0 otherwise    (5)

where o*_j = 1 means that the j-th image block of X* is regarded as a target image block with high confidence, and τ is a threshold controlling the confidence of the above similarity.
Formula (4) ensures that if the j-th image block is judged to be a target block, then the more of its neighbors are also considered target blocks, the larger S(X*) is; if the j-th image block is judged not to be a target block, its neighbors have no influence on S(X*). This 'no-penalty' treatment guarantees both the spatial consistency and the connection between an image block and its neighboring blocks, so the spatial information among the sample image blocks can compensate when the local temporal consistency fails.
By combining the time domain confidence coefficient and the space domain confidence coefficient, the invention obtains the final space-time consistency scoring measure of the candidate sample:
f(X*)=T(X*)+S(X*) (6)
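A sketch of formulas (4)-(6) under one plausible reading of the spatial term: a block whose similarity exceeds τ is judged a target block (formula (5)), a target block scores the fraction of its neighbors that agree with it, non-target blocks contribute nothing, and the final score is the sum of the temporal and spatial confidences (formula (6)). The exact normalization of formula (4) is an assumption of this sketch.

def spatial_confidence(sims, neighbors, tau=0.55):
    # Formulas (4)-(5) (sketch, one plausible reading): a block whose similarity
    # exceeds tau is judged a target block; a target block scores the fraction of
    # its neighbors that agree with it, and non-target blocks contribute nothing.
    is_target = [1 if s > tau else 0 for s in sims]
    score = 0.0
    for j, nbrs in enumerate(neighbors):
        if is_target[j] == 1:
            agree = sum(1 for k in nbrs if is_target[k] == 1)
            score += agree / len(nbrs)
    return score / len(sims)  # S(X*)

def spatiotemporal_score(t_conf, s_conf):
    # Formula (6): f(X*) = T(X*) + S(X*).
    return t_conf + s_conf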
step S8: and determining a tracking result from the candidate sample set based on Bayesian inference. The tracking results obtained from the 1 st frame to the t-th frame are shown ass1:t={s1,s2,...,st-1And the tracking result is corresponding to the affine transformation state. Let us say that P candidate samples are obtained at the t-th frame using the state transition motion model in step S4WhereinFor the ith candidate sample of the tth frame,for its corresponding affine transformation state. The invention determines the target of the t frame from the candidate sample set omega according to the maximum posterior probability
<math> <mrow> <munder> <mi>max</mi> <mrow> <mo>{</mo> <msub> <mover> <mi>X</mi> <mo>^</mo> </mover> <mi>t</mi> </msub> <mo>,</mo> <msub> <mi>x</mi> <mi>t</mi> </msub> <mo>}</mo> <mo>&Element;</mo> <mi>&Omega;</mi> </mrow> </munder> <mi>p</mi> <mrow> <mo>(</mo> <msub> <mi>s</mi> <mi>t</mi> </msub> <mo>|</mo> <msub> <mover> <mi>X</mi> <mo>^</mo> </mover> <mrow> <mn>1</mn> <mo>:</mo> <mi>t</mi> </mrow> </msub> <mo>)</mo> </mrow> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>7</mn> <mo>)</mo> </mrow> </mrow> </math>
According to the Bayesian inference criterion,the solution can be recursively solved in the following way:
<math> <mrow> <mi>p</mi> <mrow> <mo>(</mo> <msub> <mi>s</mi> <mi>t</mi> </msub> <mo>|</mo> <msub> <mover> <mi>X</mi> <mo>^</mo> </mover> <mrow> <mn>1</mn> <mo>:</mo> <mi>t</mi> </mrow> </msub> <mo>)</mo> </mrow> <mo>&Proportional;</mo> <mi>p</mi> <mrow> <mo>(</mo> <msub> <mover> <mi>X</mi> <mo>^</mo> </mover> <mi>t</mi> </msub> <mo>|</mo> <msub> <mi>s</mi> <mrow> <mo>-</mo> <mn>1</mn> </mrow> </msub> <mo>)</mo> </mrow> <mi>p</mi> <mrow> <mo>(</mo> <msub> <mi>s</mi> <mrow> <mi>t</mi> <mo>-</mo> <mn>1</mn> </mrow> </msub> <mo>|</mo> <msub> <mover> <mi>X</mi> <mo>^</mo> </mover> <mrow> <mn>1</mn> <mo>:</mo> <mi>t</mi> <mo>-</mo> <mn>1</mn> </mrow> </msub> <mo>)</mo> </mrow> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>8</mn> <mo>)</mo> </mrow> </mrow> </math>
wherein <math> <mrow> <mi>p</mi> <mrow> <mo>(</mo> <msub> <mi>s</mi> <mi>t</mi> </msub> <mo>|</mo> <msub> <mover> <mi>X</mi> <mo>^</mo> </mover> <mrow> <mn>1</mn> <mo>:</mo> <mi>t</mi> <mo>-</mo> <mn>1</mn> </mrow> </msub> <mo>)</mo> </mrow> <mo>=</mo> <mo>&Integral;</mo> <mi>p</mi> <mrow> <mo>(</mo> <msub> <mi>s</mi> <mi>t</mi> </msub> <mo>|</mo> <msub> <mi>s</mi> <mrow> <mi>t</mi> <mo>-</mo> <mn>1</mn> </mrow> </msub> <mo>)</mo> </mrow> <mi>p</mi> <mrow> <mo>(</mo> <msub> <mi>s</mi> <mrow> <mi>t</mi> <mo>-</mo> <mn>1</mn> </mrow> </msub> <mo>|</mo> <msub> <mover> <mi>X</mi> <mo>^</mo> </mover> <mrow> <mn>1</mn> <mo>:</mo> <mi>t</mi> <mo>-</mo> <mn>1</mn> </mrow> </msub> <mo>)</mo> </mrow> <mi>d</mi> <msub> <mi>s</mi> <mrow> <mi>t</mi> <mo>-</mo> <mn>1</mn> </mrow> </msub> <mo>,</mo> </mrow> </math> According to step S4, the motion model is p (S)t|st-1)=N(st;st-1Sigma). In the inventionSpatiotemporal consistency score defined as the goal:
p ( X ^ t | s t - 1 ) = f ( X ^ t ) - - - ( 9 )
thus p(s)t|st-1) Substituting equation (9) into equation (8) and solving recursively, equation (7) can be converted into
<math> <mrow> <munder> <mi>max</mi> <mrow> <mo>{</mo> <msub> <mover> <mi>X</mi> <mo>^</mo> </mover> <mi>t</mi> </msub> <mo>,</mo> <msub> <mi>s</mi> <mi>t</mi> </msub> <mo>}</mo> <mo>&Element;</mo> <mi>&Omega;</mi> </mrow> </munder> <mi>f</mi> <mrow> <mo>(</mo> <msub> <mover> <mi>X</mi> <mo>^</mo> </mover> <mi>t</mi> </msub> <mo>)</mo> </mrow> <munderover> <mi>&Pi;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mrow> <mi>t</mi> <mo>-</mo> <mn>1</mn> </mrow> </munderover> <mi>p</mi> <mrow> <mo>(</mo> <msub> <mi>s</mi> <mrow> <mi>i</mi> <mo>+</mo> <mn>1</mn> </mrow> </msub> <mo>|</mo> <msub> <mi>s</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>10</mn> <mo>)</mo> </mrow> </mrow> </math>
The candidate sample that maximizes equation (10) may be selected from the candidate sample set Ω as the tracking result of the t-th frame by equation (10).
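A sketch of step S8 under one reading of equation (10): the transition factors p(s_{i+1} | s_i) for i < t-1 involve only past, already-fixed states and are therefore common to every candidate, so the maximisation over the candidate set reduces to f(X̂_t) · p(s_t | s_{t-1}); both this reduction and the diagonal-Gaussian density below are assumptions of the sketch, not statements of the invention.

import numpy as np

def gaussian_transition(s_t, s_prev, sigma):
    # p(s_t | s_{t-1}) = N(s_t; s_{t-1}, Sigma) with diagonal Sigma (sketch).
    d = (s_t - s_prev) / sigma
    return float(np.exp(-0.5 * np.dot(d, d)) / np.prod(sigma * np.sqrt(2.0 * np.pi)))

def select_tracking_result(candidates, states, scores, s_prev, sigma):
    # Equation (10) (sketch): the transition factors over past frames are shared by
    # all candidates, so the maximisation reduces to f(X_t) * p(s_t | s_{t-1}).
    posterior = [f * gaussian_transition(s, s_prev, sigma)
                 for f, s in zip(scores, states)]
    best = int(np.argmax(posterior))
    return candidates[best], states[best]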
In the above method, the acquisition and segmentation of the candidate sample set in step 4 are completed before step 5, and the order relationship with steps 1-3 is not particularly limited, and in different embodiments, may be selected according to actual situations.
The effectiveness of the method is illustrated using 8 image sequence data sets publicly available on the Internet; the 8 sequences together contain 5707 frames and cover difficulties that may be encountered in practical applications, such as external illumination changes, target posture changes, occlusion, target scale changes and fast target motion. The data set can be downloaded at the following link: http://cvlab.hanyang.ac.kr/tracker_benchmark_v10.html.
FIG. 1 corresponds to step S1. In each frame, the invention collects N = 60 positive samples around the tracking target using Gaussian disturbance; the centers of the positive samples are restricted to lie inside the circle of radius C_inner = 3 pixels centered at the previous-frame target center position (shown as the circular shaded region in FIG. 1). The size of the acquisition window is consistent with the target size of the previous frame. All collected positive samples are normalized to a uniform size of 32 × 32.
The way the samples are segmented and marked is shown in FIG. 2. The invention divides the samples with a 4 × 4 grid, so each sample is divided into 16 non-overlapping image blocks, and each image block is labeled with an Arabic numeral from 1 to 16 according to its position in the grid.
FIG. 3 shows the calculation process of the space-time consistency score, which is mainly divided into the following steps: ① collecting positive samples; ② dividing and labeling the positive samples; ③ training the dictionary with the positive-sample image blocks, coding the image blocks with the dictionary, and calculating the average vector of the sparse codes of the positive-sample image blocks with the same label; ④ acquiring the next-frame target candidate sample set through the state-change model of step S4, and dividing, labeling and coding all candidate samples in the same way; ⑤ calculating the space-time consistency scores of all candidate samples in the candidate sample set using formulas (3) to (6) of step S7.
The neighbor sets of the image blocks are illustrated in the space-time consistency score computation stage of FIG. 3. As shown by the arrows in the spatial-confidence box of the figure, the neighbor set of the image block labeled 1 consists of the image blocks labeled 2, 5 and 6, while the neighbor set of the image block labeled 11 consists of the image blocks labeled 6, 7, 8, 10, 12, 14, 15 and 16. The neighbor sets of the other image blocks are defined analogously.
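The neighbor sets described above correspond to 8-connectivity on the 4 × 4 grid; a small sketch that reproduces the example of FIG. 3 is given below.

def grid_neighbors(h_grid=4):
    # Neighbor sets (sketch): 8-connected blocks on the h x h grid, matching the
    # example of FIG. 3 (block 1 -> {2, 5, 6}; block 11 -> {6, 7, 8, 10, 12, 14, 15, 16}).
    # Blocks are labelled 1..h^2 row by row; the returned indices are 0-based.
    neighbors = []
    for r in range(h_grid):
        for c in range(h_grid):
            nbrs = [rr * h_grid + cc
                    for rr in range(max(0, r - 1), min(h_grid, r + 2))
                    for cc in range(max(0, c - 1), min(h_grid, c + 2))
                    if (rr, cc) != (r, c)]
            neighbors.append(nbrs)
    return neighbors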
τ in equation (5) is set to 0.55.
FIG. 4 is the tracking flowchart of the invention. Positive samples are first collected according to the known target position in frame 1 (step S1) and the dictionary is initialized (steps S2-S3); the candidate sample set is then collected on frame 2 according to step S4; the space-time consistency scores of all candidate samples are calculated according to steps S5-S7; and finally Bayesian inference (step S8) is used to select, from the candidate samples, the one that maximizes formula (10) as the tracking result of frame 2. Once the tracking result of a new frame is known, the operations of collecting positive samples and so on are carried out again. By repeating these steps, the tracking result of every subsequent frame can be obtained.
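Putting the earlier sketches together, the following loop follows the flow of FIG. 4. It reuses the sketch functions defined above, trains the dictionary directly on the grid blocks for brevity (the invention uses a sliding sub-window), and crops candidates using only the translation and scale components of the affine state; these simplifications and the parameter values are assumptions of the sketch, not part of the invention.

import numpy as np

def crop_candidate(frame, s, base_size, out_size=32):
    # Simplified candidate cropping (sketch): uses only the translation and scale
    # components of the affine state and ignores rotation, aspect ratio and skew.
    w = max(2, int(round(base_size[0] * s[3])))
    h = max(2, int(round(base_size[1] * s[3])))
    x0 = int(np.clip(round(s[0] - w / 2), 0, frame.shape[1] - w))
    y0 = int(np.clip(round(s[1] - h / 2), 0, frame.shape[0] - h))
    return resize_to(frame[y0:y0 + h, x0:x0 + w], out_size)

def track_sequence(frames, init_center, init_size, h_grid=4, tau=0.55):
    # End-to-end loop following FIG. 4, composed from the sketches above;
    # returns one affine state per tracked frame.
    state = np.array([init_center[0], init_center[1], 0.0, 1.0, 1.0, 0.0])
    sigma = np.array([4.0, 4.0, 0.01, 0.01, 0.002, 0.001])  # illustrative
    neighbors = grid_neighbors(h_grid)
    results = []
    for t in range(1, len(frames)):
        pos = collect_positive_samples(frames[t - 1], state[:2], init_size)        # steps S1-S2
        pos_blocks = np.stack([split_into_blocks(s, h_grid) for s in pos])
        D = learn_local_dictionary(pos_blocks.reshape(-1, pos_blocks.shape[-1]))    # step S3
        mean_codes = average_positive_codes(                                        # steps S5-S6
            np.stack([encode_blocks(b, D) for b in pos_blocks]))
        cand_states = sample_candidate_states(state, sigma=sigma)                   # step S4
        scores, cands = [], []
        for s in cand_states:
            cand = crop_candidate(frames[t], s, init_size, out_size=pos.shape[1])
            codes = encode_blocks(split_into_blocks(cand, h_grid), D)
            t_conf, sims = temporal_confidence(codes, mean_codes)                   # step S7
            scores.append(spatiotemporal_score(t_conf,
                          spatial_confidence(sims, neighbors, tau)))
            cands.append(cand)
        _, state = select_tracking_result(cands, cand_states, scores, state, sigma)  # step S8
        results.append(state.copy())
    return results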
FIG. 5 is an experimental result of key frames on a public image dataset sequence with tracking results represented by boxes according to the present invention. Since the target or external environment changes most drastically in these key frames, tracking drift is most likely to occur in these key frames, resulting in tracking failure. The representation of the invention in these key frames shows the robustness and effectiveness of the method.
In FIG. 5 (a) and (g), external illumination change is the main difficulty in tracking these two data sets, especially in (a) where the target enters a shadow around frame 188 and leaves it around frame 230; experiments show that the invention is robust when tracking under similar illumination changes.
In (c), (f) and (h) of FIG. 5, the target undergoes various posture changes. For example, in (c) there are the side face at frame #146, the side of the body at frame #171, the glasses being taken off around frame #294, and the expression change at frame #305. The target in (f) undergoes an almost complete change of pose during the turn, and there is also a large amount of non-planar motion of the target in (h). Nevertheless, the space-time consistency defined by the invention over consecutive frames is still satisfied, because the posture changes occur slowly, that is, the target posture does not change much between consecutive frames. The invention therefore demonstrates the effectiveness of the tracking method under target posture changes.
Sequences (b), (e) and (f) of FIG. 5 exhibit a serious occlusion problem (for example near frame #456); by dividing the target into image blocks and exploiting the spatial information between the blocks, the invention can deal well with such target occlusion.
The sequence (d) of fig. 5 mainly shows that the present invention can effectively deal with the problem of fast motion of the target. The target position change between two consecutive frames in (d) of fig. 5 is relatively large, but the present invention can solve this problem well by adjusting the displacement parameters of affine transformation in the motion model. The target scale change problem, which also exists in fig. 5 (a), (c) and (f), can also be overcome by using the scale change parameters in the motion model.
The above-mentioned embodiments and experimental examples have described the technical solutions, implementation details and effectiveness of the methods of the present invention in detail. It should be understood that the above description is only exemplary of the present invention, and is not intended to limit the present invention, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A tracking method based on space-time consistency and local sparse representation of a target comprises the following steps:
step one: acquiring a positive sample set for the current frame by Gaussian disturbance according to the position of the tracking result of the previous frame;
step two: dividing each positive sample in the positive sample set into image blocks and marking, and learning a dictionary by using the image blocks obtained by division;
step three: carrying out sparse coding on image blocks obtained by dividing each positive sample in the positive sample set by using the learning dictionary, and calculating average vectors of sparse coding of the image blocks with the same marks;
step four: acquiring a candidate sample set of a current frame from a tracking result of a previous frame by using a motion model with affine transformation state change, and segmenting each candidate sample in the candidate sample set into image blocks and marking the image blocks;
step five: calculating a space-time consistency score of each candidate sample according to the average vector of the image block sparse codes corresponding to the image blocks with the same mark in the positive sample set and the image block sparse code of each candidate sample;
step six: and determining the tracking result of the current frame according to the space-time consistency score of each candidate sample in the candidate sample set based on Bayesian inference.
2. The tracking method as claimed in claim 1, wherein the image block of each candidate sample in step four is partitioned in the same way and in the same number as the image block of the positive sample.
3. The tracking method according to claim 2, wherein in step five, the space-time consistency score is the sum of a time domain confidence degree and a space domain confidence degree; the time domain confidence coefficient is in direct proportion to the average vector corresponding to the sparse coding of the image blocks with the same mark in the positive sample set and the similarity of the sparse coding of the image blocks with the same mark in the candidate samples; the spatial confidence is positively correlated with the number of neighboring image blocks of each image block in the candidate sample that are considered target image blocks.
4. The tracking method according to claim 3, characterized in that the time-domain confidence is calculated as follows:
T(X*) = (1/h²) Σ_{j=1}^{h²} sim(ᾱ_j, α*_j);
wherein T(X*) is the time-domain confidence of candidate sample X*, h² is the number of image blocks of the candidate sample, ᾱ_j is the average vector of the sparse codes of the image blocks labeled j in the positive sample set, and α*_j is the sparse code of the image block labeled j in candidate sample X*;
the spatial confidence is calculated as follows:
S(X*) = (1/h²) Σ_{j=1}^{h²} o*_j · (1/|N(j)|) Σ_{k∈N(j)} 1(o*_k, o*_j);

wherein S(X*) is the spatial confidence of candidate sample X*, N(j) is the neighbor image block set of the j-th image block of candidate sample X*, |N(j)| is the number of elements in the neighbor image block set, and 1(·,·) is an indicator function whose output is 1 if its two inputs are the same and 0 otherwise; o*_j indicates whether the j-th image block of candidate sample X* is judged to be a target image block, with o*_j = 1 if sim(ᾱ_j, α*_j) > τ and o*_j = 0 otherwise, where o*_j = 1 represents that the j-th image block of candidate sample X* is a target image block; τ is the control threshold.
5. The tracking method according to claim 1, wherein in step six, the tracking result of the current frame is determined from the candidate sample set by:
max_{ {X̂_t, s_t} ∈ Ω } f(X̂_t) Π_{i=1}^{t-1} p(s_{i+1} | s_i)
wherein X̂_t denotes a candidate sample of the current frame t, f(X̂_t) is the space-time consistency score of candidate sample X̂_t, s_t is the affine transformation state corresponding to candidate sample X̂_t of the current frame t, and Ω denotes the candidate sample set; p(s_t | s_{t-1}) = N(s_t; s_{t-1}, Σ) is the motion model of affine transformation state change, Σ is a diagonal matrix, and the elements on its diagonal are the variances of the parameters of the affine transformation states of the previous t-1 frames.
6. The tracking method according to claim 1, wherein the center of each positive sample collected in step one satisfies ||L_posS - L_pos|| < C_inner, where L_pos = [x_pos, y_pos] is the center position of the previous-frame tracking result, x_pos and y_pos are the abscissa and ordinate respectively, L_posS is the center position of the collected sample, and C_inner is the radius of the circle of the sampling area; the size of the sampled positive sample is consistent with the size of the tracking target in the previous frame.
7. The tracking method according to claim 1, wherein the sparse coding of image blocks in step three is obtained according to the following model:
min_{α_ij} (1/2) ||p_ij - D α_ij||₂² + λ ||α_ij||₁

wherein p_ij is the image block marked with the j-th position of the i-th positive sample in the positive sample set, D is the learned dictionary, α_ij is the sparse code corresponding to image block p_ij, i ∈ {1, 2, ..., N}, j ∈ {1, 2, ..., h²}, N is the number of positive samples in the positive sample set, h² is the number of image blocks of each positive sample in the positive sample set, and λ is the regularization weighting factor.
8. The tracking method according to claim 1, wherein in step two, the positive samples are divided by using a grid with a fixed size, and the divided image blocks are marked according to positions of the grid, and the divided image blocks of each positive sample do not overlap with each other.
9. The tracking method according to claim 1, wherein in the first step, a set of center positions of the positive samples is obtained by adding gaussian disturbance to the center position of the previous frame of tracking result, and a set of positive samples is acquired on the previous frame of image according to the set of center positions and is used for the current frame; the size of the positive sample obtained by sampling is the same as the target size of the previous frame.
10. A tracking apparatus based on target spatiotemporal consistency and local sparse representation, comprising:
a positive sample collection module: acquiring a positive sample set for the current frame by Gaussian disturbance according to the position of the tracking result of the previous frame;
a segmentation and labeling module: dividing each positive sample in the positive sample set into image blocks and marking, and learning a dictionary by using the image blocks obtained by division;
a sparse coding module: Carrying out sparse coding on the image blocks obtained by dividing each positive sample in the positive sample set by using the learned dictionary, and calculating the average vectors of the sparse codes of the image blocks with the same marks;
a candidate sample segmentation module: acquiring a candidate sample set of a current frame from a tracking result of a previous frame by using a motion model with affine transformation state change, and segmenting each candidate sample in the candidate sample set into image blocks and marking the image blocks;
a scoring module: calculating a space-time consistency score of each candidate sample according to the average vector of the image block sparse codes corresponding to the image blocks with the same mark in the positive sample set and the image block sparse code of each candidate sample;
a tracking module: and determining the tracking result of the current frame according to the space-time consistency score of each candidate sample in the candidate sample set based on Bayesian inference.
CN201410770556.9A 2014-12-12 2014-12-12 Tracking method and device based on target space-time consistency and local sparse representation Active CN104537686B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410770556.9A CN104537686B (en) 2014-12-12 2014-12-12 Tracking method and device based on target space-time consistency and local sparse representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410770556.9A CN104537686B (en) 2014-12-12 2014-12-12 Tracking method and device based on target space-time consistency and local sparse representation

Publications (2)

Publication Number Publication Date
CN104537686A true CN104537686A (en) 2015-04-22
CN104537686B CN104537686B (en) 2017-10-03

Family

ID=52853205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410770556.9A Active CN104537686B (en) 2014-12-12 2014-12-12 Tracking method and device based on target space-time consistency and local sparse representation

Country Status (1)

Country Link
CN (1) CN104537686B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933417A (en) * 2015-06-26 2015-09-23 苏州大学 Behavior recognition method based on sparse spatial-temporal characteristics
CN105096339A (en) * 2015-07-08 2015-11-25 温州大学 Object tracking method based on posterior template dictionary learning
CN107273873A (en) * 2017-07-13 2017-10-20 武汉大学 Pedestrian based on irregular video sequence recognition methods and system again
CN107424171A (en) * 2017-07-21 2017-12-01 华中科技大学 A kind of anti-shelter target tracking based on piecemeal
CN109614933A (en) * 2018-12-11 2019-04-12 闽江学院 A kind of motion segmentation method based on certainty fitting
CN110502968A (en) * 2019-07-01 2019-11-26 西安理工大学 The detection method of infrared small dim moving target based on tracing point space-time consistency
CN113419214A (en) * 2021-06-22 2021-09-21 桂林电子科技大学 Indoor positioning method for target without carrying equipment
CN115018875A (en) * 2022-06-22 2022-09-06 浙江大华技术股份有限公司 Data augmentation method and device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130245429A1 (en) * 2012-02-28 2013-09-19 Siemens Aktiengesellschaft Robust multi-object tracking using sparse appearance representation and online sparse appearance dictionary update
CN103985143A (en) * 2014-05-30 2014-08-13 上海交通大学 Discriminative online target tracking method based on videos in dictionary learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130245429A1 (en) * 2012-02-28 2013-09-19 Siemens Aktiengesellschaft Robust multi-object tracking using sparse appearance representation and online sparse appearance dictionary update
CN103985143A (en) * 2014-05-30 2014-08-13 上海交通大学 Discriminative online target tracking method based on videos in dictionary learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
孙浩 (Sun Hao): "Research on moving object detection technology in close-range video from a motion imaging platform", China Doctoral Dissertations Full-text Database *
戴平阳 (Dai Pingyang) et al.: "A discriminative object tracking algorithm based on sparse representation", Journal of Xiamen University (Natural Science Edition) *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933417A (en) * 2015-06-26 2015-09-23 苏州大学 Behavior recognition method based on sparse spatial-temporal characteristics
CN104933417B (en) * 2015-06-26 2019-03-15 苏州大学 A kind of Activity recognition method based on sparse space-time characteristic
CN105096339A (en) * 2015-07-08 2015-11-25 温州大学 Object tracking method based on posterior template dictionary learning
CN105096339B (en) * 2015-07-08 2017-12-29 温州大学 A kind of method for tracking target based on posteriority template dictionary learning
CN107273873B (en) * 2017-07-13 2019-09-17 武汉大学 Pedestrian based on irregular video sequence recognition methods and system again
CN107273873A (en) * 2017-07-13 2017-10-20 武汉大学 Pedestrian based on irregular video sequence recognition methods and system again
CN107424171B (en) * 2017-07-21 2020-01-03 华中科技大学 Block-based anti-occlusion target tracking method
CN107424171A (en) * 2017-07-21 2017-12-01 华中科技大学 A kind of anti-shelter target tracking based on piecemeal
CN109614933A (en) * 2018-12-11 2019-04-12 闽江学院 A kind of motion segmentation method based on certainty fitting
CN110502968A (en) * 2019-07-01 2019-11-26 西安理工大学 The detection method of infrared small dim moving target based on tracing point space-time consistency
CN110502968B (en) * 2019-07-01 2022-03-25 西安理工大学 Method for detecting infrared small and weak moving target based on track point space-time consistency
CN113419214A (en) * 2021-06-22 2021-09-21 桂林电子科技大学 Indoor positioning method for target without carrying equipment
CN113419214B (en) * 2021-06-22 2022-08-30 桂林电子科技大学 Indoor positioning method for target without carrying equipment
CN115018875A (en) * 2022-06-22 2022-09-06 浙江大华技术股份有限公司 Data augmentation method and device and storage medium

Also Published As

Publication number Publication date
CN104537686B (en) 2017-10-03

Similar Documents

Publication Publication Date Title
CN104537686B (en) Tracking method and device based on target space-time consistency and local sparse representation
Feng et al. Water body extraction from very high-resolution remote sensing imagery using deep U-Net and a superpixel-based conditional random field model
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
US20220067335A1 (en) Method for dim and small object detection based on discriminant feature of video satellite data
CN112597941B (en) Face recognition method and device and electronic equipment
Zhou et al. BOMSC-Net: Boundary optimization and multi-scale context awareness based building extraction from high-resolution remote sensing imagery
CN112966691B (en) Multi-scale text detection method and device based on semantic segmentation and electronic equipment
Nguyen et al. Fast and robust spatially constrained Gaussian mixture model for image segmentation
EP3819859B1 (en) Sky filter method for panoramic images and portable terminal
CN108062525B (en) Deep learning hand detection method based on hand region prediction
Li et al. Adaptive deep convolutional neural networks for scene-specific object detection
CN107633226B (en) Human body motion tracking feature processing method
CN111368846B (en) Road ponding identification method based on boundary semantic segmentation
Ding et al. Adversarial shape learning for building extraction in VHR remote sensing images
CN113052873B (en) Single-target tracking method for on-line self-supervision learning scene adaptation
Zhang et al. Polar coordinate sampling-based segmentation of overlapping cervical cells using attention U-Net and random walk
CN110084201B (en) Human body action recognition method based on convolutional neural network of specific target tracking in monitoring scene
CN113920148B (en) Building boundary extraction method and equipment based on polygon and storage medium
CN106780450A (en) A kind of image significance detection method based on low-rank Multiscale Fusion
Zhou et al. Large-scale road extraction from high-resolution remote sensing images based on a weakly-supervised structural and orientational consistency constraint network
US11367206B2 (en) Edge-guided ranking loss for monocular depth prediction
Wang et al. High-resolution remote sensing image semantic segmentation based on a deep feature aggregation network
Oreski YOLO* C—Adding context improves YOLO performance
CN103065302B (en) Image significance detection method based on stray data mining
CN110751670B (en) Target tracking method based on fusion

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant