CN101610412B - Visual tracking method based on multi-cue fusion - Google Patents

Visual tracking method based on multi-cue fusion Download PDF

Info

Publication number
CN101610412B
CN101610412B CN2009100888784A CN200910088878A
Authority
CN
China
Prior art keywords
probability distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009100888784A
Other languages
Chinese (zh)
Other versions
CN101610412A (en
Inventor
杨戈 (Yang Ge)
刘宏 (Liu Hong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN2009100888784A priority Critical patent/CN101610412B/en
Publication of CN101610412A publication Critical patent/CN101610412A/en
Application granted granted Critical
Publication of CN101610412B publication Critical patent/CN101610412B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a visual tracking method based on multi-cue fusion, which belongs to the technical field of information. The method comprises the following steps: a) determining, in the first frame of a video sequence, a tracking window comprising a target region and a background region; b) from the second frame onward, obtaining the color feature probability distribution map, the position feature probability distribution map, and the motion continuity feature probability distribution map of the previous frame; c) weighting and summing the three probability distribution maps to obtain a total probability distribution map; and d) obtaining the coordinates of the center point of the tracking window of the current frame in the total probability distribution map with the CAMSHIFT algorithm. The method can be used in human-computer interaction, intelligent visual surveillance, intelligent robots, virtual reality, model-based image coding, content retrieval of streaming media, and other fields.

Description

Visual tracking method based on multi-cue fusion
Technical Field
The invention relates to visual tracking, and in particular to a visual tracking method that fuses multiple cues; it belongs to the technical field of information.
Background
With the rapid development of information technology and intelligent science, computer vision, which uses computers to realize human visual functions, has become one of the most active research directions in the computer field. Visual tracking, one of the core problems of computer vision, aims to find the position of a moving object of interest in each frame of an image sequence, and its study is both necessary and urgent.
Hong Liu et al. published a paper at the IEEE 14th International Conference on Image Processing (2007) that combines color, position, and prediction cues, dynamically updates the weight of each cue according to the background, and implements a collaborative Mean Shift visual tracking method that uses auxiliary objects. However, that method assumes the background obeys a single-Gaussian model and must be trained in advance on a video sequence containing no moving objects to obtain the initial background model, which limits its applicability. In its cue reliability evaluation function it uses a rectangle larger than the target to represent the region of interest, and the region between this rectangle and the tracking window is defined as the background region; the size of the background region therefore directly affects the reliability value of a given cue, i.e., the larger the tracking window, the smaller the reliability evaluation value, so the method lacks generality.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a visual tracking method that fuses multiple cues. It is particularly applicable to visual tracking of human motion, so that when a computer automatically tracks a target (such as a human body), both accuracy and real-time requirements are met.
The invention combines multiple cues of a video image (a color feature, a position feature, and a motion continuity feature) to realize visual tracking by means of the CAMSHIFT (Continuously Adaptive Mean Shift) method, as shown in FIG. 1. The color features preferably adopt the hue and saturation features, the red channel feature, the green channel feature, and the blue channel feature, which gives better robustness to occlusion and pose changes; the position feature is realized with a frame-difference technique; the motion continuity feature is obtained from inter-frame continuity.
The invention adopts a fixed tracking window. Although this limits the handling of appearance changes and occlusion, it does not risk regarding areas similar to the background as part of the target, and the tracking effect can still be achieved.
The invention is realized by the following technical scheme, which comprises the following steps:
a) determining a tracking window in the first frame of the video sequence, wherein the tracking window comprises a target region and a background region, and the target region contains the tracked object; preferably, the tracking window is a rectangle equally divided into three parts, the middle part being the target region and the two side parts being the background region, as shown in FIG. 2.
b) For each frame from the second frame, obtaining a color feature probability distribution map, a position feature probability distribution map and a motion continuity feature probability distribution map of the previous frame;
c) weighting and adding the three probability distribution maps to obtain a total probability distribution map;
d) obtaining the coordinates of the center point of the tracking window of the current frame in the total probability distribution map through the CAMSHIFT algorithm.
The cues used by the invention and their fusion are described in detail below.
Color features
The color features preferably include the Hue and Saturation features, the R (Red) channel feature, the G (Green) channel feature, and the B (Blue) channel feature of the image, achieving better robustness to occlusion and pose changes.
Assume the invention uses a histogram of m bins and the image has n pixel points; their positions are \{x_i\}_{i=1,\dots,n} and the corresponding histogram values are \{q_u\}_{u=1,\dots,m} (R channel, G channel, and B channel features) or \{q_{u(v)}\}_{u,v=1,\dots,m} (hue and saturation features). Define a function b: R^2 \to \{1, \dots, m\} that maps each pixel's color information to its discrete bin index. The histogram value corresponding to a given color bin can then be expressed by Eqs. (1) and (2) or Eqs. (1') and (2'):
q_{u(v)} = \sum_{i=1}^{n} \delta[b(x_i) - u(v)]    (1)        p_{u(v)} = \min\!\left( \frac{255}{\max(q_{u(v)})} \, q_{u(v)},\ 255 \right)    (2)
or    q_u = \sum_{i=1}^{n} \delta[b(x_i) - u]    (1')        p_u = \min\!\left( \frac{255}{\max(q_u)} \, q_u,\ 255 \right)    (2')
The color feature probability distribution map can be established by the following method:
firstly, extracting the R (Red), G (Green), and B (Blue) channels from the RGB (Red, Green, Blue) image, then converting the RGB image into an HSV (Hue, Saturation, Value) image and extracting the Hue and Saturation channels, and calculating the hue and saturation probability distribution, the red probability distribution, the green probability distribution, and the blue probability distribution of the pixels in the tracking window by histogram back-projection, as in Eq. (1) or Eq. (1').
Secondly, the value ranges of the hue and saturation probability distribution, the red probability distribution, the green probability distribution, and the blue probability distribution are rescaled by Eq. (2) or Eq. (2'), so that they are projected from [0, max(q_{u(v)})] or [0, max(q_u)] onto [0, 255].
Thirdly, suitable features are selected from the hue and saturation features, the red feature, the green feature, and the blue feature according to the rule described below and used as the color features of the visual tracking algorithm, forming the final color probability distribution map p(x, y).
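As an illustration of the back-projection and rescaling in the first two steps (Eqs. (1') and (2')), the sketch below builds a one-channel histogram over the tracking window and back-projects it onto the whole frame, then rescales the result to [0, 255]. It is a minimal NumPy sketch under the assumption of an 8-bit channel and m bins; the function and argument names are illustrative, not the patent's implementation.

```python
import numpy as np

def channel_backprojection(channel, window_mask, m=16):
    """Back-project one 8-bit channel (e.g. R, G, B, hue or saturation).

    channel     : 2-D uint8 array, one color channel of the frame
    window_mask : boolean array, True inside the tracking window
    m           : number of histogram bins
    Returns a probability map rescaled to [0, 255] as in Eq. (2').
    """
    # b(x_i): map each pixel's value to its bin index
    bins = (channel.astype(np.int32) * m) // 256
    # Eq. (1'): histogram q_u accumulated over the tracking window only
    q = np.bincount(bins[window_mask].ravel(), minlength=m).astype(np.float64)
    # Eq. (2'): project the range [0, max(q_u)] onto [0, 255]
    p_u = np.minimum(255.0 / max(q.max(), 1.0) * q, 255.0)
    # Back-projection: every pixel receives the value of its bin
    return p_u[bins]
```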
Among the above four features, the method of the present invention preferably selects, dynamically, the one or more features that best distinguish the target region from the background region, as follows:
for feature k, let i be the value of feature k, H1 k(i) Histogram showing feature values in target area A, H2 k(i) Histograms, p, representing the characteristic values in the background areas B and Ck(i) Is the discrete probability distribution of the target area A, qk(i) Is the discrete probability distribution, L, of background regions B and Ci kIs the log likelihood of feature k, as in equation (10), taking a very small number with δ > 0, which is mainly to prevent equation (10) from appearing where the denominator is 0 or log 0. var (L)i k;pk(i) Is a relative target class distribution pk(i) L ofi kVariance of (2), formula (11), var (L)i k;qk(i) Is a relative background class distribution qk(i) L ofi kVariance of (2), formula (12), var (L)i k;Rk(i) L is a distribution of relative object and background classesi kThe variance of (A) is as shown in formula (13), V (L; p)k(i),qk(i) Is Li kThe variance of (A) is as shown in formula (14), V (L; p)k(i),qk(i) Represents the ability of feature k to separate target and background, V (L; p is a radical ofk(i),qk(i) Larger the feature k indicates that the object is easier to separate from the background, the more likely the feature isThe more suitable the feature as a tracking target.
L_i^k = \log \frac{\max\{p_k(i),\ \delta\}}{\max\{q_k(i),\ \delta\}}    (10)
\mathrm{var}(L_i^k;\, p_k(i)) = E[(L_i^k)^2] - (E[L_i^k])^2 = \sum_i p_k(i)\,(L_i^k)^2 - \left[\sum_i p_k(i)\,L_i^k\right]^2    (11)
\mathrm{var}(L_i^k;\, q_k(i)) = \sum_i q_k(i)\,(L_i^k)^2 - \left[\sum_i q_k(i)\,L_i^k\right]^2    (12)
\mathrm{var}(L_i^k;\, R_k(i)) = \sum_i R_k(i)\,(L_i^k)^2 - \left[\sum_i R_k(i)\,L_i^k\right]^2    (13)
where R_k(i) = [p_k(i) + q_k(i)]/2;
V(L_i^k;\, p_k(i), q_k(i)) = \frac{\mathrm{var}(L_i^k;\, R_k(i))}{\mathrm{var}(L_i^k;\, p_k(i)) + \mathrm{var}(L_i^k;\, q_k(i))}    (14)
During video tracking, the reliability of the hue and saturation features, the R (Red) channel feature, the G (Green) channel feature, and the B (Blue) channel feature is continuously monitored. When the reliability changes, it is recomputed according to Eq. (14), the features are re-ranked by reliability, and the W features with the largest V(L_i^k; p_k(i), q_k(i)) are taken as the color features of the tracking target. The value of W is preferably 1, 2, or 3.
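A minimal sketch of the reliability measure in Eqs. (10)-(14): given bin histograms of one feature over target region A and over background regions B and C, it returns the variance ratio V used to rank the color features. The input histograms, the δ value, and the small stabilizer added to the denominator are illustrative assumptions.

```python
import numpy as np

def variance_ratio(hist_target, hist_background, delta=1e-6):
    """Eqs. (10)-(14): variance ratio of the log-likelihood of one feature.

    hist_target     : H1^k(i), histogram of the feature over target region A
    hist_background : H2^k(i), histogram over background regions B and C
    """
    p = hist_target / max(hist_target.sum(), 1)           # p_k(i)
    q = hist_background / max(hist_background.sum(), 1)   # q_k(i)
    L = np.log(np.maximum(p, delta) / np.maximum(q, delta))  # Eq. (10)
    R = (p + q) / 2.0                                      # R_k(i)

    def var(dist):                                         # Eqs. (11)-(13)
        return np.sum(dist * L * L) - np.sum(dist * L) ** 2

    # Eq. (14): larger V means the feature separates target from background better
    return var(R) / (var(p) + var(q) + 1e-12)
```

The W features with the largest returned value would then be kept as the color cues for the next frame.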
Position feature
For the position feature, the invention uses the frame difference to compute the grayscale difference of each pixel between two consecutive frames and then determines which pixels are motion points by thresholding: a pixel whose difference exceeds the threshold is a motion point. Setting the frame-difference threshold purely by experience is somewhat blind and suits only certain specific scenes, so the invention preferably uses the Otsu method to determine this frame-difference threshold F dynamically. The basic idea of the Otsu algorithm is to find a threshold F that minimizes the intra-class scatter, which is equivalent to finding a threshold F that maximizes the inter-class scatter; i.e., the frame-difference image is divided into two classes by the threshold F such that the variance between the two classes is maximized. The intra-class scatter measures the dispersion of sample points around their class mean, and the inter-class scatter measures the dispersion between classes: a smaller intra-class scatter means the samples within each class are more compact, and a larger inter-class scatter means the classes are better separated.
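The frame-difference threshold F can be determined with Otsu's criterion as described above. The following self-contained sketch maximizes the between-class variance over an 8-bit frame-difference image; it is a plain NumPy illustration of the idea, with no claim to match the patent's exact procedure.

```python
import numpy as np

def otsu_frame_difference(prev_gray, curr_gray):
    """Position feature: threshold |I_t - I_{t-1}| with a dynamically chosen F."""
    diff = np.abs(curr_gray.astype(np.int32) - prev_gray.astype(np.int32)).astype(np.uint8)
    hist = np.bincount(diff.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    best_F, best_sigma = 0, -1.0
    for F in range(1, 256):
        w0, w1 = prob[:F].sum(), prob[F:].sum()        # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = np.dot(np.arange(F), prob[:F]) / w0      # class means
        mu1 = np.dot(np.arange(F, 256), prob[F:]) / w1
        sigma_b = w0 * w1 * (mu0 - mu1) ** 2           # between-class variance
        if sigma_b > best_sigma:
            best_sigma, best_F = sigma_b, F
    motion_mask = diff >= best_F                       # pixels above F are motion points
    return best_F, motion_mask
```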
Motion continuity feature
For the motion continuity feature, the invention estimates the velocity of the tracked target from the images of the previous frames and, from the target positions obtained by tracking in those frames, estimates the target center position at the current time. Over a short interval (between video frames) the motion of the target is strongly continuous and its velocity can be treated as constant, which is what makes this prediction possible.
X(t,row)=X(t-1,row)±(X(t-1,row)-X(t-2,row)) (3)
X(t,col)=X(t-1,col)±(X(t-1,col)-X(t-2,col)) (4)
Let X (t, row) represent the line coordinate of the current target center position at time t, as formula (3), X (t, col) represent the ordinate of the current target center position at time t, as formula (4), row is the maximum line number of the image, col is the maximum vertical number of the image, and the current position is predicted by using a linear predictor in consideration of the continuity of the target motion. Therefore, X (t, row), X (t-1, row) and X (t-2, row) are related to the formula (5), and X (t, col), X (t-1, col) and X (t-2, col) are related to the formula (6).
X(t,row)∈[max(X(t-1,row)-(X(t-1,row)-X(t-2,row)),1),min(X(t-1,row)+(X(t-1,row)-X(t-2,row)),rows)] (5)
X(t,col)∈[max(X(t-1,col)-(X(t-1,col)-X(t-2,col)),1),min(X(t-1,col)+(X(t-1,col)-X(t-2,col)),cols)] (6)
Let the tracking window have width width and length length; then the row coordinate of the target at the current time satisfies Eq. (7) and the column coordinate satisfies Eq. (8), i.e., the target lies within this rectangular range.
Y(t,row)∈[max(X(t,row)-width,1),min(X(t,row)+width,rows)] (7)
Y(t,col)∈[max(X(t,col)-length,1),min(X(t,col)+length,cols)] (8)
Let B' (x, y, t) denote the probability distribution of motion continuity features, as in equation (9), where (x, y, t) denotes the pixel of the coordinate (x, y) at time t, 1 denotes the tracked object, and 0 denotes the background pixel.
B'(x, y, t) = \begin{cases} 1, & (x, y, t) \in Y(t, \mathrm{row}) \cap Y(t, \mathrm{col}) \\ 0, & (x, y, t) \notin Y(t, \mathrm{row}) \cap Y(t, \mathrm{col}) \end{cases}    (9)
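A sketch of the linear prediction in Eqs. (3)-(9): from the window centers at times t-1 and t-2 it derives the admissible row/column range and sets B'(x, y, t) = 1 inside the corresponding rectangle. Variable names and the exact handling of the window extents are illustrative assumptions.

```python
import numpy as np

def motion_continuity_map(center_t1, center_t2, shape, width, length):
    """Build B'(x, y, t) of Eq. (9) from the centres at t-1 and t-2.

    center_t1, center_t2 : (row, col) window centres at times t-1 and t-2
    shape                : (rows, cols) of the frame
    width, length        : tracking-window extents used in Eqs. (7)-(8)
    """
    rows, cols = shape
    d_row = center_t1[0] - center_t2[0]                 # constant-velocity step
    d_col = center_t1[1] - center_t2[1]
    # Eqs. (5)-(6): interval containing the predicted centre
    row_lo = max(center_t1[0] - abs(d_row), 1)
    row_hi = min(center_t1[0] + abs(d_row), rows)
    col_lo = max(center_t1[1] - abs(d_col), 1)
    col_hi = min(center_t1[1] + abs(d_col), cols)
    # Eqs. (7)-(8): enlarge by the window size to get the admissible rectangle
    r0, r1 = max(row_lo - width, 1), min(row_hi + width, rows)
    c0, c1 = max(col_lo - length, 1), min(col_hi + length, cols)
    # Eq. (9): 1 inside the rectangle (possible target), 0 elsewhere (background)
    B = np.zeros((rows, cols), dtype=np.uint8)
    B[r0 - 1:r1, c0 - 1:c1] = 1
    return B
```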
Cue fusion
Let P_k(row, colu, t) be the probability distribution of pixel (row, colu) at time t under feature k; it represents the probability that pixel (row, colu) belongs to the target region under feature k. P(row, colu, t) is the final probability distribution after fusing the W + 2 features (W color features, one predicted target position feature, and one motion continuity feature) at time t, and characterizes the probability that each pixel (row, colu) belongs to the target region, as in Eq. (15). The reliability computed in the previous frame serves as the basis for competition among the W color features: if a feature's reliability is high, it dominates the visual tracking system and provides more information to the tracker; when its reliability is low, its information is down-weighted or ignored.
P(\mathrm{row}, \mathrm{colu}, t) = \sum_{k=1}^{W+2} r_k \, P_k(\mathrm{row}, \mathrm{colu}, t)    (15)
where r_k is the weight of feature k: r_1, r_2, \dots, r_W are the weights of the selected color features, r_{W+1} is the weight of the predicted target position feature, and r_{W+2} is the weight of the motion continuity feature, with \sum_{k=1}^{W+2} r_k = 1. To project the value ranges of P_{W+1}(row, colu, t) and P_{W+2}(row, colu, t) onto [0, 255], take P_{W+1}(row, colu, t) = B(x, y, t) \cdot 255 and P_{W+2}(row, colu, t) = B'(x, y, t) \cdot 255.
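A minimal sketch of the fusion in Eq. (15): the per-cue maps, already scaled to [0, 255], are combined with weights that sum to 1. The argument names (W color maps plus one position map and one motion-continuity map) are illustrative.

```python
import numpy as np

def fuse_cues(color_maps, position_map, motion_map, weights):
    """Eq. (15): P = sum_k r_k * P_k over W color cues + position + motion cues.

    color_maps   : list of W arrays in [0, 255] (selected color features)
    position_map : array in [0, 255] (B  * 255, frame-difference cue)
    motion_map   : array in [0, 255] (B' * 255, motion-continuity cue)
    weights      : length W + 2 sequence of r_k, summing to 1
    """
    maps = list(color_maps) + [position_map, motion_map]
    assert len(weights) == len(maps) and abs(sum(weights) - 1.0) < 1e-6
    total = np.zeros_like(maps[0], dtype=np.float64)
    for r_k, P_k in zip(weights, maps):
        total += r_k * P_k
    return total   # total probability distribution map, values remain in [0, 255]
```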
Compared with the prior art, the method is simple and effective: it requires neither a background model assumption nor advance training on a video sequence without moving targets. Its key point is the fusion of multiple cues, so it adapts to different scenes and achieves good tracking results; it is particularly suitable for video sequences in which the color saturation of the target's environment is low or the target is partially occluded.
The output of the visual tracking can be used directly as a tracking result or as an intermediate result for subsequent visual understanding. The invention has broad application prospects in the information field and can be applied to Human Robot Interaction (HRI), intelligent visual surveillance, intelligent robots, virtual reality, model-based image coding, content retrieval of streaming media, and other fields. Video surveillance is used for community safety monitoring, fire monitoring, traffic violation and flow control, and security in public places such as military facilities, banks, shopping malls, airports, and subways. Existing video surveillance systems usually only record video as after-the-fact evidence and do not fully exploit real-time active monitoring. Upgrading them to intelligent video surveillance systems would greatly enhance monitoring capability, reduce potential safety hazards, and save manpower, material resources, and investment. Intelligent video analysis solves two problems: it frees security operators from the tedious and boring task of staring at screens by letting machines do that work, and it quickly finds the desired images in massive video data, i.e., tracks a target; for example, on Line 13 of the Beijing subway a thief was caught through video analysis. Pudong Airport, the Capital Airport, and many railway projects already under construction all expect to use video analysis technology, and the visual tracking method of the invention is one of its core and key technologies.
Drawings
FIG. 1 is a schematic diagram of the cue fusion of the method of the present invention;
FIG. 2 is a schematic view of the tracking window of the present invention, where A is the target area and B and C are the background areas;
fig. 3-5 are schematic diagrams of visual tracking for 50 frames, 100 frames and 120 frames of a video sequence with a resolution of 640 × 512, respectively, where a represents a motion continuity feature probability distribution diagram, b represents a position feature probability distribution diagram, c represents a total probability distribution diagram, and d represents a current frame tracking result diagram.
Fig. 6a-d are graphs of visual tracking results for 50 frames, 90 frames, 120 frames and 164 frames, respectively, of a video sequence with a resolution of 640 x 480.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The scope of protection of the invention is not limited to the examples described below.
The visual tracking of the present embodiment is performed according to the following steps:
Firstly, a tracking window is set in the 1st frame of the video sequence; its length and width are chosen by the operator according to the size of the tracked target and are kept unchanged during tracking. The tracking window is divided into three parts, the middle part (A) being the target region and the left and right parts (B and C) being the background regions, as shown in FIG. 2.
Secondly, starting from the 2nd frame, the W = 2 most reliable color features (e.g., the R channel and the B channel) are selected based on the previous frame, and the color feature probability distribution map M1 is calculated.
Thirdly, the position feature probability distribution map M2 is calculated.
Fourthly, the motion continuity feature probability distribution map M3 is calculated.
Fifthly, the three probability distribution maps (M1, M2, M3) obtained above are weighted by their corresponding r_k and summed to obtain the final probability distribution map M. In this embodiment the weights of M1, M2, and M3 are 3/7 (with the R and B channels weighted 2/7 and 1/7, respectively), 2/7, and 2/7.
Sixthly, in the probability distribution map M, the center point coordinates of the tracking window of the current frame are obtained by the CAMSHIFT algorithm. The core of the CAMSHIFT algorithm is: compute the zeroth-order moment (Eq. (16) below) and the first-order moments (Eqs. (17) and (18) below) of the tracking window, and iteratively update the (x, y) coordinates via Eqs. (19) and (20) until the coordinates no longer move significantly (the change in x and y is less than 2) or the maximum of 15 iterations is reached; the resulting coordinates are the tracking window center of the current frame.
M_{00} = \sum_x \sum_y p(x, y)    (16)
M_{10} = \sum_x \sum_y x \, p(x, y)    (17)
M_{01} = \sum_x \sum_y y \, p(x, y)    (18)
x = \frac{M_{10}}{M_{00}}    (19)
y = \frac{M_{01}}{M_{00}}    (20)
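A sketch of the iteration in Eqs. (16)-(20): within the fixed-size window, the zeroth- and first-order moments give the centroid, and the window center is moved there until it shifts by less than 2 pixels or 15 iterations are reached, as in the embodiment above. This is a simplified illustration of that core step, not a full CAMSHIFT implementation; function and parameter names are assumptions.

```python
import numpy as np

def camshift_center(prob_map, center, half_w, half_h, max_iter=15, eps=2):
    """Iterate Eqs. (16)-(20) on the total probability map with a fixed window."""
    rows, cols = prob_map.shape
    x, y = center                                    # (col, row) of the window centre
    for _ in range(max_iter):
        x0, x1 = max(int(x) - half_w, 0), min(int(x) + half_w + 1, cols)
        y0, y1 = max(int(y) - half_h, 0), min(int(y) + half_h + 1, rows)
        win = prob_map[y0:y1, x0:x1]
        M00 = win.sum()                              # Eq. (16)
        if M00 == 0:
            break
        xs = np.arange(x0, x1)
        ys = np.arange(y0, y1)
        M10 = (win * xs[np.newaxis, :]).sum()        # Eq. (17)
        M01 = (win * ys[:, np.newaxis]).sum()        # Eq. (18)
        new_x, new_y = M10 / M00, M01 / M00          # Eqs. (19)-(20)
        if abs(new_x - x) < eps and abs(new_y - y) < eps:
            x, y = new_x, new_y
            break
        x, y = new_x, new_y
    return x, y
```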
Fig. 3-5 are schematic views of visual tracking for 50 frames, 100 frames and 120 frames, respectively, of a video sequence with a resolution of 640 x 512.
Fig. 6 shows the visual tracking results for frames 50, 90, 120, and 164 of a video sequence with a resolution of 640 x 480. Although the saturation of this second video sequence is low, the target is still tracked by comprehensively considering the reliability of the color features and fusing multiple cues.

Claims (6)

1. A visual tracking method based on multi-cue fusion comprises the following steps:
a) determining a tracking window in a first frame of a video sequence, wherein the tracking window comprises a target area and a background area, and the target area contains a tracked object;
b) for each frame from the second frame, obtaining a color feature probability distribution map, a position feature probability distribution map and a motion continuity feature probability distribution map of the previous frame; the color features in the color feature probability distribution map include one or more of hue and saturation features, R channel features, G channel features, and B channel features;
c) weighting and adding the three probability distribution maps to obtain a total probability distribution map;
d) obtaining the coordinates of the center point of the tracking window of the current frame in the total probability distribution map through the CAMSHIFT algorithm.
2. The visual tracking method of claim 1, wherein the tracking window is a rectangle equally divided into three parts, the middle part being the target region and the two parts being the background region.
3. The visual tracking method of claim 1, wherein the value V(L_i^k; p_k(i), q_k(i)) of each of the hue and saturation features, the R channel feature, the G channel feature, and the B channel feature is calculated by the following formula, and the color features in the color feature probability distribution map comprise the one, two, or three features with the largest V value:
V(L_i^k;\, p_k(i), q_k(i)) = \frac{\mathrm{var}(L_i^k;\, R_k(i))}{\mathrm{var}(L_i^k;\, p_k(i)) + \mathrm{var}(L_i^k;\, q_k(i))}, wherein,
p_k(i) represents the discrete probability distribution of the target region;
q_k(i) represents the discrete probability distribution of the background region;
L_i^k = \log \frac{\max\{p_k(i),\ \delta\}}{\max\{q_k(i),\ \delta\}},
where δ is used to ensure that no cases occur where the denominator is 0 or log 0;
R_k(i) = [p_k(i) + q_k(i)]/2;
\mathrm{var}(L_i^k;\, p_k(i)) = E[(L_i^k)^2] - (E[L_i^k])^2 = \sum_i p_k(i)\,(L_i^k)^2 - \left[\sum_i p_k(i)\,L_i^k\right]^2,
wherein E represents the mean and var represents the variance;
\mathrm{var}(L_i^k;\, q_k(i)) = E[(L_i^k)^2] - (E[L_i^k])^2 = \sum_i q_k(i)\,(L_i^k)^2 - \left[\sum_i q_k(i)\,L_i^k\right]^2;
\mathrm{var}(L_i^k;\, R_k(i)) = E[(L_i^k)^2] - (E[L_i^k])^2 = \sum_i R_k(i)\,(L_i^k)^2 - \left[\sum_i R_k(i)\,L_i^k\right]^2.
4. the visual tracking method of claim 1, wherein the location feature probability distribution map is obtained by: and calculating the gray difference value of each pixel point of the tracking window in the current frame and the previous frame, wherein if the difference value is greater than a set threshold value, the pixel point is a motion point, and the position characteristic probability distribution map comprises all the motion points.
5. The visual tracking method of claim 4, wherein the threshold is dynamically determined by an Otsu method.
6. The visual tracking method of claim 1, wherein when the three probability distribution maps are weighted and added to obtain the total probability distribution map, the sum of the weights of the respective probability distribution maps is 1.
CN2009100888784A 2009-07-21 2009-07-21 Visual tracking method based on multi-cue fusion Expired - Fee Related CN101610412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100888784A CN101610412B (en) 2009-07-21 2009-07-21 Visual tracking method based on multi-cue fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100888784A CN101610412B (en) 2009-07-21 2009-07-21 Visual tracking method based on multi-cue fusion

Publications (2)

Publication Number Publication Date
CN101610412A CN101610412A (en) 2009-12-23
CN101610412B true CN101610412B (en) 2011-01-19

Family

ID=41483954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100888784A Expired - Fee Related CN101610412B (en) 2009-07-21 2009-07-21 Visual tracking method based on multi-cue fusion

Country Status (1)

Country Link
CN (1) CN101610412B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI497450B (en) * 2013-10-28 2015-08-21 Univ Ming Chuan Visual object tracking method
JP2016033759A (en) * 2014-07-31 2016-03-10 セイコーエプソン株式会社 Display device, method for controlling display device, and program
CN105547635B (en) * 2015-12-11 2018-08-24 浙江大学 A kind of contactless structural dynamic response measurement method for wind tunnel test
CN107403439B (en) * 2017-06-06 2020-07-24 沈阳工业大学 Cam-shift-based prediction tracking method
CN107833240B (en) * 2017-11-09 2020-04-17 华南农业大学 Target motion trajectory extraction and analysis method guided by multiple tracking clues
CN113378616A (en) * 2020-03-09 2021-09-10 华为技术有限公司 Video analysis method, video analysis management method and related equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1619593A (en) * 2004-12-09 2005-05-25 上海交通大学 Video frequency motion target adaptive tracking method based on multicharacteristic information fusion
CN1932846A (en) * 2006-10-12 2007-03-21 上海交通大学 Visual frequency humary face tracking identification method based on appearance model
CN1992911A (en) * 2005-12-31 2007-07-04 中国科学院计算技术研究所 Target tracking method of sports video

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1619593A (en) * 2004-12-09 2005-05-25 上海交通大学 Video frequency motion target adaptive tracking method based on multicharacteristic information fusion
CN1992911A (en) * 2005-12-31 2007-07-04 中国科学院计算技术研究所 Target tracking method of sports video
CN1932846A (en) * 2006-10-12 2007-03-21 上海交通大学 Visual frequency humary face tracking identification method based on appearance model

Also Published As

Publication number Publication date
CN101610412A (en) 2009-12-23

Similar Documents

Publication Publication Date Title
WO2020173226A1 (en) Spatial-temporal behavior detection method
Park et al. Continuous localization of construction workers via integration of detection and tracking
CN103098076B (en) Gesture recognition system for TV control
Chen et al. Survey of pedestrian action recognition techniques for autonomous driving
US9652863B2 (en) Multi-mode video event indexing
Senior et al. Appearance models for occlusion handling
Brown et al. Performance evaluation of surveillance systems under varying conditions
CN101610412B (en) Visual tracking method based on multi-cue fusion
CN101470809B (en) Moving object detection method based on expansion mixed gauss model
JP2007128513A (en) Scene analysis
CN103530640B (en) Unlicensed vehicle checking method based on AdaBoost Yu SVM
CN104378582A (en) Intelligent video analysis system and method based on PTZ video camera cruising
CN107038411A (en) A kind of Roadside Parking behavior precise recognition method based on vehicle movement track in video
CN115082855A (en) Pedestrian occlusion detection method based on improved YOLOX algorithm
Song et al. Depth driven people counting using deep region proposal network
Jeyabharathi et al. Vehicle Tracking and Speed Measurement system (VTSM) based on novel feature descriptor: Diagonal Hexadecimal Pattern (DHP)
Fradi et al. Spatio-temporal crowd density model in a human detection and tracking framework
CN111476089A (en) Pedestrian detection method, system and terminal based on multi-mode information fusion in image
CN104239854B (en) A kind of pedestrian&#39;s feature extraction and method for expressing based on region sparse integral passage
Wang et al. Two-branch fusion network with attention map for crowd counting
Peihua A clustering-based color model and integral images for fast object tracking
CN116935304A (en) Self-adaptive detection and tracking method based on crowd concentration
Muniruzzaman et al. Deterministic algorithm for traffic detection in free-flow and congestion using video sensor
CN112906456B (en) Crowd abnormal behavior detection method and system based on inter-frame characteristics
Zhang et al. An accurate algorithm for head detection based on XYZ and HSV hair and skin color models

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110119

Termination date: 20140721

EXPY Termination of patent right or utility model