KR101741758B1 - A Real-time Face Tracking Method Robust to Occlusion Based on Improved CamShift with Depth Information - Google Patents

A Real-time Face Tracking Method Robust to Occlusion Based on Improved CamShift with Depth Information Download PDF

Info

Publication number
KR101741758B1
Authority
KR
South Korea
Prior art keywords
image
current frame
tracking
camshift
tracked
Prior art date
Application number
KR1020160007749A
Other languages
Korean (ko)
Inventor
정현조
유지상
이준환
Original Assignee
광운대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 광운대학교 산학협력단 filed Critical 광운대학교 산학협력단
Priority to KR1020160007749A priority Critical patent/KR101741758B1/en
Application granted granted Critical
Publication of KR101741758B1 publication Critical patent/KR101741758B1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/40Image enhancement or restoration using histogram techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • G06T2207/10021Stereoscopic video; Stereoscopic image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a CamShift-based real-time face tracking method that receives a color image and a depth image from a Kinect and tracks a face object, the method comprising: (a) setting an initial value for the CamShift; (b) receiving the color image and the depth image of the current frame; (c) tracking the object; (d) obtaining the Bhattacharyya distance; (e) storing the tracked object as a template according to the Bhattacharyya distance; and (f) re-tracing the object if tracking fails or if the Bhattacharyya distance is too large.
By applying depth information and a skin detection method to the CamShift tracking method as described above, the object can be tracked accurately even when an object of similar color to the tracked object exists in the background.

Description

Technical Field [0001] The present invention relates to a real-time face tracking method based on CamShift.

The present invention relates to a CamShift-based real-time face tracking method that compensates for the disadvantage of the existing CamShift algorithm, which uses only the color distribution, by exploiting the depth information of a Kinect, which can acquire distance information on a pixel-by-pixel basis during face tracking, together with a skin detection algorithm that extracts a skin-color candidate group in the hue-saturation-value (HSV) color space.

Generally, when an object is tracked in a captured image, a face or the like is identified based on its color distribution using the CamShift algorithm (Non-Patent Document 3). That is, CamShift tracks an object based on hue information: it extracts the color information of a desired region and compares that color information in subsequent images to follow the region.

However, the CamShift algorithm is very unstable because it cannot accurately identify a face when objects of similar color to the tracked object exist in the image background. The present invention proposes a tracking method that solves this instability by using depth information and that can recover the tracked object even if tracking fails.

[Non-Patent Document 1] Edward Rosten and Tom Drummond, "Machine learning for high-speed corner detection," in ECCV 2006.
[Non-Patent Document 2] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary robust independent elementary features," in ECCV 2010.
[Non-Patent Document 3] Gary R. Bradski, "Computer Vision Face Tracking for Use in a Perceptual User Interface," Intel Technology Journal, 1998.
[Non-Patent Document 4] P. Viola and M. J. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features," in CVPR 2001.

SUMMARY OF THE INVENTION An object of the present invention is to solve the above-mentioned problems by providing a CamShift-based real-time face tracking method that compensates for the disadvantage of the existing CamShift algorithm, which uses only the color distribution, using depth information together with a skin detection algorithm that extracts a skin-color candidate group.

It is another object of the present invention to provide a CamShift-based real-time face tracking method that tracks an object through feature-point-based matching between images during face tracking.

In order to achieve the above object, the present invention provides a CamShift-based real-time face tracking method that receives a color image and a depth image from a Kinect and tracks a face object, the method comprising: (a) setting an initial object for tracking a face in the image; (b) receiving the color image and the depth image of the current frame from the Kinect; (c) generating a back projection image based on the color distribution of the face from the color image, generating a binary mask image from the depth image, and setting the region corresponding to the intersection of the back projection image and the binary mask image as the tracked object; and (d) repeating steps (b) to (c) with the next frame as the current frame and the current frame as the previous frame.

The present invention also provides a CamShift-based real-time face tracking method that receives a color image and a depth image from a Kinect and tracks a face object, the method comprising: (a) setting an initial position for tracking a face in the image; (b) receiving the color image and the depth image of the current frame from the Kinect; (c) generating a back projection image based on the color distribution of the face from the color image, generating a binary mask image from the depth image, and setting the region corresponding to the intersection of the back projection image and the binary mask image as the tracked object; (e) calculating the Bhattacharyya distance between the object tracked in the previous frame and the object tracked in the current frame; (f) storing the object tracked in the current frame as a template according to the calculated Bhattacharyya distance; (g) if the object cannot be tracked in step (c), or if the Bhattacharyya distance calculated in step (e) is greater than or equal to a predetermined second threshold value, re-tracing the object through feature-point matching between the stored templates and the current frame; and (h) repeating steps (b) to (g) with the next frame as the current frame and the current frame as the previous frame.

The present invention also provides a CamShift-based real-time face tracking method wherein, in step (a), the face region is detected by Haar detection and the detected face region is set as the initial object.

According to another aspect of the present invention, there is provided a CamShift-based real-time face tracking method wherein, in step (c), the similarity of each pixel of the current frame to the histogram of the object tracked in the previous frame is expressed as a brightness value from 0 to 255 to generate the back projection image.

According to another aspect of the present invention, there is provided a CamShift-based real-time face tracking method wherein, in step (c), the depth value at the center of the tracked object is read, and only pixels at a distance equal to that depth value are set to 1 while the rest are set to 0, generating the binarized mask.

According to another aspect of the present invention, there is provided a CamShift-based real-time face tracking method wherein, in step (c), each pixel value of the back projection image is masked by multiplying it by the corresponding pixel value in the binary mask image.

According to another aspect of the present invention, there is provided a CamShift-based real-time face tracking method wherein, in step (f), if the calculated Bhattacharyya distance is greater than a predetermined first threshold value and smaller than the second threshold value, the object tracked in the current frame is stored as a template, wherein the first threshold value is smaller than the second threshold value.

According to another aspect of the present invention, there is provided a real-time face tracking method based on camshift, wherein the first threshold value is set to 0.15, and the second threshold value is set to 0.5.

According to another aspect of the present invention, there is provided a CamShift-based real-time face tracking method wherein, in step (g), feature points are detected in the stored template and the current frame by applying the FAST (Features from Accelerated Segment Test) method, a BRIEF (Binary Robust Independent Elementary Features) descriptor is computed for each detected feature point, feature-point matching between the template and the current frame is performed, and a homography matrix is obtained to verify the matching result.

According to the present invention, in the CamShift-based real-time face tracking method, when the feature points are extracted by applying the FAST method in step (g), the brightness values of the sixteen surrounding pixels defined around a reference pixel P are classified as much brighter than P, similar to P, or much darker than P; the brightness distribution of the neighboring pixels is thereby represented by a 16-dimensional ternary vector, which is input to a decision tree to determine whether P is a feature point.

According to another aspect of the present invention, there is provided a CamShift-based real-time face tracking method wherein, in step (g), the homography H between the stored template and the current frame is calculated using at least four pairs of matched feature points, as in [Equation 1] below; the discriminants D, X<sub>s</sub>, Y<sub>s</sub>, and P are then computed from the calculated homography using [Equation 2] below, and it is checked whether each discriminant falls within a predetermined range of values.

[Equation 1]

$$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix}$$

[Equation 2]

$$D = h_{11}h_{22} - h_{12}h_{21}, \quad X_s = \sqrt{h_{11}^2 + h_{21}^2}, \quad Y_s = \sqrt{h_{12}^2 + h_{22}^2}, \quad P = \sqrt{h_{31}^2 + h_{32}^2}$$

Further, in the CamShift-based real-time face tracking method according to the present invention, the discriminants are evaluated by a logical expression such as [Equation 3] below.

[Equation 3]

$$D < 0 \;\lor\; X_s > \theta_{s,\max} \;\lor\; X_s < \theta_{s,\min} \;\lor\; Y_s > \theta_{s,\max} \;\lor\; Y_s < \theta_{s,\min} \;\lor\; P > \theta_P$$

(reconstructed structure of the six conditions; the original threshold values appear only in the equation image)

According to another aspect of the present invention, there is provided a CamShift-based real-time face tracking method wherein, in step (g), when there are a plurality of stored templates, one of the templates is selected and the object is re-traced through feature-point matching; if matching fails, another template is selected and the object is re-traced through feature-point matching.

The present invention also relates to a computer-readable recording medium on which a program for performing a camshift-based real time face tracking method is recorded.

As described above, according to the CamShift-based real-time face tracking method of the present invention, applying depth information and skin detection to the CamShift tracking method makes it possible to track an object accurately even when an object of similar color exists in the background.

In addition, according to the CamShift-based real-time face tracking method of the present invention, matching can be performed based on feature points to track a face, so that even if the tracked face or object disappears or is occluded, it can be found again; this provides a face tracking method that is robust to occlusion.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram showing the configuration of the overall system for carrying out the present invention.
FIG. 2 illustrates the HSV color space modeled as a cylinder according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a CamShift-based real-time face tracking method according to an embodiment of the present invention.
FIG. 4 is a detailed flowchart of the CamShift-based real-time face tracking method according to an embodiment of the present invention.
FIG. 5 is a view illustrating a normalized depth image according to an embodiment of the present invention.
FIG. 6 shows the back projection process using the CamShift result and depth information according to an embodiment of the present invention: (a) the CamShift result image, (b) the back projection image, (c) the binarized mask image showing the same depth as the center of the object, and (d) the AND operation of the back projection image and the binary mask image.
FIG. 7 compares the detection result of the CamShift tracking method using depth information with that of the conventional technique, according to an embodiment of the present invention: (a) the result of the conventional CamShift tracking method, (b) the result using depth information.
FIG. 8 shows re-tracking, through feature-point template matching, of an object whose tracking has failed, according to an embodiment of the present invention: (a) an image in which feature-point matching with the stored templates is performed, (b) an image in which the successfully matched, tracked object is found again.
FIG. 9 is a view showing the 16 pixels on a circle at a distance of 3 from the center pixel P according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Hereinafter, the present invention will be described in detail with reference to the drawings.

In the description of the present invention, the same parts are denoted by the same reference numerals, and repetitive description thereof will be omitted.

First, an example configuration of the overall system for carrying out the present invention will be described with reference to FIG. 1.

As shown in FIG. 1, the real-time face tracking method based on CamShift according to the present invention may be implemented as a program system on a computer terminal 30 that receives the depth image 61 photographed by the depth camera 21 of the Kinect 20 and the color image 62 photographed by its color camera (or RGB camera) 22, and tracks a face. That is, the face tracking method may be implemented as a program, installed on the computer terminal 30, and executed. The program installed on the computer terminal 30 operates as one program system 40.

Meanwhile, as another embodiment, the face tracking method may be implemented as a single electronic circuit, such as an ASIC (application-specific integrated circuit), instead of running on a general-purpose computer, or as a dedicated computer terminal 30 used only for detecting and tracking faces in images. This is called the face tracking device 40. Other forms may also be practiced.

The Kinect 20 includes a depth camera 21 and a color camera 22.

The depth camera 21 is a camera for measuring the depth of the object 10, and measures the depth information to output a depth image.

Preferably, the depth camera 21 is the depth camera built into the Kinect, which measures depth information using an infrared pattern. The depth camera 21 consists of an infrared transmitter and an infrared receiver: infrared light emitted from the transmitter is reflected by the object 10, received by the receiver, and used to measure the depth of the object 10.

The photographed depth image 61 is a depth image photographed by the depth camera 21.

The color camera 22 is a conventional RGB camera and captures the color of the object 10. Preferably, the color camera 22 is the RGB camera built into the Kinect. The photographed color image 62 is an RGB image photographed by the color camera 22.

The depth image 61 and the color image 62 are input directly to the computer terminal 30, stored there, and processed by the face tracking device 40. Alternatively, the depth image 61 and the color image 62 may be stored in advance on the storage medium of the computer terminal 30 and read from storage by the face tracking device 40.

The image consists of temporally consecutive frames. For example, if the frame at the current time t is the current frame, the frame at the immediately preceding time t-1 is the previous frame, and the frame at time t+1 is the next frame. Each frame has a color image (or color information) and a depth image (or depth information).

That is, the depth image 61 and the color image 62 each consist of temporally consecutive frames, and one frame corresponds to one image. The images 61 and 62 may also consist of a single frame (or image); in that case each corresponds to one image.

Detecting a face in the depth image and the color image means detecting it in each depth/color frame (or image), but the term "image" is used below unless a special distinction is needed.

Next, the HSV color model, the cam shift tracking method, and the skin detection method used in the present invention will be described.

First, the HSV color model will be described.

The present invention uses color information, which can be represented in various color spaces. The RGB color space, which mixes the three primary colors of light (red, green, and blue), is the most common; this color model is suitable for devices such as displays and printers.

The HSV color space is more suitable for a tracking algorithm. The HSV color model consists of three elements: Hue, Saturation, and Value. Hue, the color component, is defined so that red, the longest wavelength of the visible spectrum, lies at 0°, with the angle increasing as the wavelength becomes shorter; at the maximum angle of 360° it returns to red. Saturation represents, for a particular hue, the vividness of the color: 100% is the most vivid state and 0% is achromatic. Value indicates the lightness of the color; when the value is 0%, the color is black.

Next, the cam shift tracking method (CamShift) will be described.

As its name suggests, CamShift (Continuously Adaptive Mean SHIFT) is an improvement of the MeanShift algorithm for tracking objects effectively: it compensates for MeanShift's disadvantages by adapting the size of the search window itself.

CamShift tracks objects at high speed based on the histogram of the chosen color information. Once the user inputs the size and position of the initial search area, the object is tracked by iteratively comparing histograms. The biggest difference from MeanShift is that the variable search window makes it easy to track objects that change in size.

The CamShift method basically operates in the following order.

1. Set the initial position and size of the search window.

2. Compute the probability distribution of color information and then perform the MeanShift algorithm to find the center of the search window.

3. The search window is reset using the center position and the size of the color distribution obtained by computing the image moments.

4. The MeanShift algorithm is performed repeatedly with the reset search window until it converges, or until steps 2 to 4 have been repeated a predetermined number of times. The position and size of the search window in step 3 are obtained by calculating the zeroth, first, and second moments of the color distribution within the search window. The zeroth, first, and second moments are given by Equations 1, 2, and 3, respectively.

[Equation 1]

$$M_{00} = \sum_x \sum_y I(x, y)$$

[Equation 2]

$$M_{10} = \sum_x \sum_y x\, I(x, y), \qquad M_{01} = \sum_x \sum_y y\, I(x, y)$$

[Equation 3]

$$M_{20} = \sum_x \sum_y x^2\, I(x, y), \qquad M_{02} = \sum_x \sum_y y^2\, I(x, y)$$

In the above equations, I(x, y) is the value of the pixel at coordinate (x, y) within the search window. The center position (x_c, y_c) of the search window is then calculated from the moments above as follows.

&Quot; (4) "

Figure 112016007163761-pat00007
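As an illustration, the moment computation above can be sketched directly in NumPy (a minimal sketch; the variable names are illustrative, and in practice OpenCV's cv2.CamShift performs this iteration internally):

```python
import numpy as np

def window_center(back_proj, x, y, w, h):
    """Centroid of a search window from image moments (Equations 1, 2, 4)."""
    roi = back_proj[y:y + h, x:x + w].astype(np.float64)
    ys, xs = np.mgrid[0:h, 0:w]          # per-pixel coordinates in the window
    m00 = roi.sum()                      # zeroth moment (Equation 1)
    if m00 == 0:
        return None                      # empty window: no centroid
    m10 = (xs * roi).sum()               # first moment in x (Equation 2)
    m01 = (ys * roi).sum()               # first moment in y (Equation 2)
    return x + m10 / m00, y + m01 / m00  # Equation 4: (xc, yc)
```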

Next, the skin detection algorithm (Skin Detection Algorithm) will be described.

In the present invention, a candidate group of human skin colors is extracted from the camera's RGB image, exploiting the fact that the object to be tracked is a human face. Accurately detecting a person's skin color is not easy: some people have darker skin, some lighter, and skin colors differ so widely that they are difficult to generalize. However, all human skin contains a red component, regardless of skin color, because red blood flows beneath it in common.

This red component is used to extract skin-color candidates from the image. The HSV color space, in which the brightness component is separated from the color components, is used instead of the general RGB color space, and candidate skin-color regions are extracted using a specific range of the hue value. In the present invention, the skin-color candidate region is used for Haar detection to detect the face, for CamShift to track the detected face, and for feature-point matching against the stored templates when the tracked object disappears.

In the candidate image, skin-color candidates keep their color information while the rest of the area is blacked out, which gives a great advantage in computation speed over a general RGB image. For Haar detection, the whole image need not be searched, and false positives are also reduced. For CamShift, the histogram of the face, the tracked object in the present invention, is compared and tracked within the skin-color candidates. Also, when the tracked face or object disappears and is searched for again by feature-point matching, fewer feature points are extracted than when matching over the full RGB image, which shortens the time needed to match against the stored templates.
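A minimal sketch of such hue-based skin-candidate extraction with OpenCV follows; the hue and saturation bounds are assumed illustrative values, since the patent does not state its exact range:

```python
import cv2

def skin_candidates(bgr):
    """Keep pixels whose hue falls in a reddish skin range; the bounds
    below are an illustrative choice, not the patent's values."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    # OpenCV hue spans 0..179; skin tones cluster near red (low hue).
    mask = cv2.inRange(hsv, (0, 40, 60), (20, 255, 255))
    return cv2.bitwise_and(bgr, bgr, mask=mask)  # non-skin pixels go black
```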

Next, a CamShift-based real-time face tracking method according to an embodiment of the present invention will be described with reference to FIG. 3, which is a flowchart of the method according to the present invention. FIG. 4 shows a detailed flowchart.

As shown in FIG. 3, the CamShift-based real-time face tracking method according to the present invention includes setting the initial value of the CamShift by detecting a face region (S10), receiving a color image and a depth image from the Kinect (S20), tracking the object (S30), obtaining the Bhattacharyya distance (S50), storing the tracked object as a template (S60), and re-tracking the object (S70).

A fatal problem with the existing CamShift is that a tracked object can no longer be tracked once it moves at high speed or occlusion occurs. In addition, since the conventional CamShift is a color-based tracking method using a variable-size search window, if a color similar to the tracked object exists in the background, the search window tends to grow to include both foreground and background. The CamShift tracking result therefore becomes very large and tracking errors occur.

The real-time face tracking method according to the present invention is an enhanced CamShift method for face tracking that uses Microsoft's Kinect v2, with its built-in depth sensor, to solve the two problems mentioned above: the user having to set the initial coordinates directly, and the weaknesses of CamShift itself.

In the present invention, the object to be tracked is a face; however, since a generic CamShift is a method of tracking objects, "face" and "object" are used interchangeably below.

Hereinafter, each step will be described in detail.

First, the step of detecting the face region and setting the initial value (or initial object) of the CamShift (S10) will be described. That is, the face region is detected for face tracking and set as the initial value of the CamShift (S10). Preferably, the region of the face to be tracked is detected through Haar detection, and the detected face region is set as the initial coordinates for CamShift.

The face tracking method according to the present invention supplements the problems of the existing CamShift to track a person's face more robustly. The existing CamShift is purely a tracking method, so the user must first manually input the initial coordinates, size, width, and height of the object to be tracked.

The present invention removes this inconvenience through Haar detection. Haar detection is one of the representative methods for finding a specific type of object in an image: it detects an object based on data learned over Haar features, so correctly learned data appropriate to the target object is needed. In the present invention the object to be detected is a person's face, so the face is detected based on data learned from faces. When the RGB input image from the Kinect arrives, it is compared against the Haar features and the face is found. Since there are many kinds of Haar features and the size of the object is unknown, searching all possible sizes over the whole image requires a large number of operations; in the present invention, however, the speed is improved by searching only the skin-color candidate region generated earlier. By setting the face area detected in this way as the initial coordinates of the object to be tracked by CamShift, tracking starts automatically after the face is detected.

Here, the initial coordinates of the object or face to be tracked by CamShift refer to the initial position of the CamShift search window.
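A minimal sketch of Haar-cascade face detection for the initial search window, using OpenCV's stock frontal-face cascade (the cascade choice and parameter values are illustrative assumptions):

```python
import cv2

# Stock OpenCV frontal-face Haar cascade, shipped with opencv-python.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def initial_window(bgr):
    """Detect a face and return its bounding box as the CamShift seed."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return tuple(faces[0]) if len(faces) else None  # (x, y, w, h) or None
```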

Next, the color image and the depth image (the current frame of each) are input from the Kinect 20 (S20). As described above, these are the depth image 61 photographed by the depth camera 21 of the Kinect 20 and the color image 62 photographed by its color camera (or RGB camera) 22.

Next, step (S30) of tracking an object for the current frame will be described.

That is, the back projection image is obtained based on the color distribution of the object from the color image (S31), and the binary mask image is generated from the depth image (S32). Then, the region corresponding to the intersection of the back projection image and the binary mask image is set as the tracked object (S33).

That is, a face object is tracked through CamShift on the color image.

CamShift tracks an object by comparing the histogram, i.e. the color distribution of the tracked object in the previous frame, with the histogram of the next frame. If an object of similar color exists, or if the tracked object moves in front of a background of similar color, the tracked area tends to include the similar object or even the background.

Depth information is used to solve this problem. Microsoft's Kinect is a camera that can acquire depth information in addition to the basic RGB image, with the advantage that a single device replaces a separate camera and depth sensor. The Kinect is used to extract both the ordinary RGB image and the depth image.

FIG. 5 is a picture in which the pixel values of the acquired depth information are normalized from 0 to 255 according to distance, so that the depth image can be viewed easily.

Based on the acquired depth information, it is possible to solve the problem of erroneous tracking caused by merging with a background of similar color, which is one of the biggest problems of CamShift.

FIG. 6 illustrates the process of improving CamShift with depth information so that object tracking becomes more robust when similar colors are present.

FIG. 6 (a) shows the CamShift result tracked in the Kinect's RGB image.

FIG. 6 (b) is the back projection, which visually expresses the probability based on the color distribution of the face being tracked. The back projection expresses, for every pixel of the current frame, the similarity to the histogram of the object tracked in the previous frame as a brightness value from 0 to 255: the brighter a pixel (the larger its value), the more likely it belongs to the tracked object, and the darker, the less likely. The CamShift result is obtained by computing the center moment over this back projection, which serves as the similarity map for the tracked object.

FIG. 6 (c) is the binarized mask in which the depth value at the center of the tracked object is read and only pixels at the same distance are kept bright, with the rest darkened. That is, the binarized mask is an image of 0s and 1s, where a pixel value of 1 indicates a region at a depth similar to the object's.

FIG. 6 (d) brightly shows only the pixels shared by FIG. 6 (b) and FIG. 6 (c), i.e. their intersection or product: each pixel value of the back projection image is masked by multiplying it by the corresponding pixel value of the binary mask.

In the conventional CamShift tracking method, the object is tracked by computing the center in each frame based only on the back projection of FIG. 6 (b). In the method according to the present invention, however, the object is tracked using the back projection combined with the depth information of the object.
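The combination of back projection and depth mask described above might be sketched with OpenCV as follows; the depth tolerance of 100 units and the histogram parameters are assumptions, not values from the patent:

```python
import cv2
import numpy as np

def track_with_depth(hsv, depth, roi_hist, track_window):
    """One tracking step: hue back projection gated by a binary depth mask
    around the object's center depth (tolerance is an assumed value)."""
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    x, y, w, h = track_window
    center_depth = int(depth[y + h // 2, x + w // 2])
    # Binary mask: 1 where the pixel depth matches the object's depth.
    mask = np.abs(depth.astype(np.int32) - center_depth) < 100
    back_proj = back_proj * mask.astype(np.uint8)   # intersection (AND)
    crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    rot_rect, track_window = cv2.CamShift(back_proj, track_window, crit)
    return rot_rect, track_window
```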

FIG. 7 compares the detection result of the conventional CamShift tracking method with that of the CamShift tracking method using depth information according to the present invention. As FIG. 7 shows, CamShift with depth information tracks the object stably and correctly even against a background similar to the tracked object.

Next, it is determined whether the object could be tracked in step S33. If object tracking succeeds, the Bhattacharyya distance is obtained (S50); if it fails, the object re-tracking step (S70) is performed.

If the object cannot be tracked, it has moved at high speed or become occluded. The existing CamShift can no longer track an object once it moves at high speed or occlusion occurs. The method according to the present invention, by contrast, performs feature-point-based matching on the image when CamShift fails, so that the failed tracking can be resumed.

Next, the similarity is determined by calculating the Bhattacharyya distance between the object tracked in the previous frame and the object tracked in the current frame (S50). Depending on the Bhattacharyya distance D, the object is re-tracked (S70), or the tracked object (the object of the current frame) is stored as a template (S60), or the next frame is received (S20) and the object is tracked again (S30). At this point, the next frame becomes the current frame, and the current frame becomes the previous frame.

In the present invention, while the object is recognized and tracked, the similarity between the object tracked in each frame and the object tracked in the previous frame is calculated. If the current frame is the first frame, the initial object determined earlier is taken as the object of the previous frame.

There are various methods for calculating the similarity between two frames; the present invention calculates the Bhattacharyya distance. The Bhattacharyya distance has the advantage that it can be computed even when the compared images differ in size, and the amount of computation is small.

&Quot; (5) "

Figure 112016007163761-pat00008

Bhattacharyya Distance D is defined as above.

H_1 is the histogram of the first image being compared and H_2 is the histogram of the second. The Bhattacharyya distance is 0 when the two compared images match completely, and approaches 1 as the similarity decreases.
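In OpenCV this comparison is available directly; a minimal sketch, assuming hue histograms with 32 bins (an illustrative choice):

```python
import cv2

def bhattacharyya(img1, img2):
    """Bhattacharyya distance between hue histograms of two patches:
    0 means identical distributions, values near 1 mean dissimilar."""
    def hue_hist(img):
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0], None, [32], [0, 180])
        return cv2.normalize(h, h).flatten()
    return cv2.compareHist(hue_hist(img1), hue_hist(img2),
                           cv2.HISTCMP_BHATTACHARYYA)
```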

When the distance is larger than a certain threshold value (the first threshold value), the image of the tracked face or object is stored as a template. A Bhattacharyya distance larger than this threshold means that the object is not completely static: a change has occurred through rotation or movement.

This check prevents over-redundant storage of tracking results for static objects that barely change, which both uses memory effectively and reduces the time needed for later re-tracking through stored-template feature matching.

Thus, in addition to tracking the object, the method also learns it, for robust re-tracking in case of later high-speed movement or tracking failure due to the object's disappearance.

Because of this, the method according to the present invention becomes more robust to tracking failure as the number of successful tracking steps, and therefore the number of stored templates, increases. In other words, more robust tracking performance is obtained once sufficient learning of the object has been done after the method starts running. However, if the number of stored templates becomes too large, re-tracking takes a long time because too many templates must be matched by feature points. Adjusting the number of stored templates is therefore an important factor in determining the time and performance of re-tracking.

Also, if the Bhattacharyya distance is too large, it is judged that object tracking has failed. That is, if the Bhattacharyya distance is greater than or equal to the predetermined second threshold value, tracking has failed because the object moved quickly or was occluded, and the object is tracked again based on feature matching; that is, the object is re-tracked (S70).

Therefore, as shown in FIG. 3, if the Bhattacharyya distance is less than or equal to the first threshold value d1, object tracking for the next frame proceeds immediately without storing a template. If the distance is larger than the first threshold value and smaller than the second, the tracked object is stored as a template. And if the distance is greater than or equal to the second threshold value, the object re-tracking step (S70) is performed.

At this time, the first threshold value d1 is smaller than the second threshold value d2. Preferably, the first threshold value is set to 0.15, and the second threshold value is set to 0.5.
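This three-way decision can be summarized in a short sketch (the step labels follow FIG. 3; the function and list names are illustrative):

```python
D1, D2 = 0.15, 0.5  # first and second thresholds, as given above

def handle_distance(d, tracked_patch, templates):
    if d >= D2:
        return "retrack"                 # tracking failed: re-trace (S70)
    if d > D1:
        templates.append(tracked_patch)  # meaningful change: store template (S60)
    return "continue"                    # go on to the next frame (S20)
```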

Next, the object re-tracking step (S70), based on feature-point matching, will be described.

When object tracking fails due to high-speed movement, occlusion, or disappearance of the object, feature-point matching is performed between the template images stored according to the Bhattacharyya distance above and the current frame, and the tracked object is found in the current frame.

There are various techniques for extracting feature points and describing them. Since the present invention must run in real time, techniques with high computational cost are avoided. Preferably, feature points are extracted using the FAST technique, which is fast and has excellent repeatability. As for descriptors, the SIFT technique describes each feature point in 128 dimensions and can be expected to be highly accurate, but its high dimensionality requires a large amount of computation, making it unsuitable for a real-time system like the present invention. Thus, in the present invention, feature points are matched using the BRIEF descriptor, which has a great advantage in speed.

FIG. 8 shows the object being re-tracked after tracking failed because the object was occluded. In FIG. 8 (a) and (b), the small area at the upper left is one of the stored templates and the area on the right is the current frame. FIG. 8 (a) shows feature-point matching being performed with one of the stored templates, and FIG. 8 (b) shows the successfully matched, tracked object being found again.

First, the FAST method is applied to the current frame and to each template image to extract feature points (S71).

FAST (Features from Accelerated Segment Test) is a method suited to real-time feature-point extraction: feature points can be extracted faster than with the Harris corner detector, SIFT, or SURF. FIG. 9 shows the relationship between a center pixel and its surrounding pixels used to decide whether it is a feature point.

The FAST method defines a circle of radius 3 around a reference pixel P in the image, giving 16 surrounding pixels x_k on the circle. The brightness value I_{x_k} of each surrounding pixel is compared with the brightness value I_P of the reference pixel plus or minus a threshold: if, among the 16 brightness values, there are N or more consecutive pixels brighter than I_P plus the threshold, or N or more consecutive pixels darker than I_P minus the threshold, P is taken as a feature point. N is usually 9, 10, 11, or 12; when N is 9, the repeatability of the feature points is highest [Non-Patent Document 1]. Like N, the threshold can be set by the user: a low threshold extracts many feature points, while a large one selects few.

In the FAST algorithm, whether the reference pixel P is a feature point could be determined by adding or subtracting the threshold from P's brightness and comparing the 16 pixels one by one, but a decision tree is used for greater speed. Each of the 16 pixels defined around the reference pixel P is classified as much brighter than P, similar to P, or much darker than P, so the brightness distribution of the surrounding pixels is expressed as a 16-dimensional ternary vector. This 16-dimensional vector is then input to the decision tree to determine whether P is a feature point.

Since feature points extracted this way use only brightness differences with surrounding pixels, many of them cluster together. Clustered similar feature points are replaced by a single representative using NMS (Non-Maximum Suppression), which selects the representative point among them. The FAST algorithm extracts many feature points from a large image and few from a small one, so it is very sensitive to image size; NMS mitigates this problem and makes the result robust (to some extent) to changes in size, because the feature points extracted from a large image are thinned by replacing each cluster with its representative.
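A minimal sketch of FAST detection with non-maximum suppression in OpenCV (the threshold value and the file name are illustrative assumptions; TYPE_9_16 corresponds to the N=9 variant discussed above):

```python
import cv2

fast = cv2.FastFeatureDetector_create(
    threshold=20,                        # lower -> more keypoints
    nonmaxSuppression=True,              # keep one representative per cluster
    type=cv2.FastFeatureDetector_TYPE_9_16)

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder input
keypoints = fast.detect(gray, None)
```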

Next, the BRIEF (Binary Robust Independent Elementary Features) descriptor is computed for each feature point and feature-point matching between the two images is performed (S72). That is, if two BRIEF descriptors are similar, the feature point in the template is judged to match the one in the frame.

As its name suggests, BRIEF, unlike other descriptors, is a binary descriptor, which gives it a great advantage in memory use. It compares descriptors in a simple way using relatively few bits, yet shows good performance. In the matching process, BRIEF descriptors can be compared using the Hamming distance instead of the Euclidean distance used for other common descriptors. For example, the Hamming distance between 1111 and 0000 is 4, and between 1010 and 1110 is 1. Measuring distance this way greatly reduces computation time compared with the Euclidean distance, which requires squares and square roots.

Although BRIEF is a binary descriptor consisting only of 0s and 1s, similar or better recognition performance is achieved at a much faster rate than SURF, which uses a 64-dimensional float descriptor [Non-Patent Document 2].
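A sketch of FAST-plus-BRIEF matching with the Hamming norm; note that the BRIEF extractor lives in the opencv-contrib package (pip install opencv-contrib-python), and all names here are illustrative:

```python
import cv2

fast = cv2.FastFeatureDetector_create()
brief = cv2.xfeatures2d.BriefDescriptorExtractor_create()  # opencv-contrib

def match_template(template_gray, frame_gray):
    """FAST keypoints + BRIEF binary descriptors, compared by Hamming distance."""
    kp1, des1 = brief.compute(template_gray, fast.detect(template_gray, None))
    kp2, des2 = brief.compute(frame_gray, fast.detect(frame_gray, None))
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    return kp1, kp2, matcher.match(des1, des2)
```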

There are cases in which the next step, homography calculation (S73), cannot be performed because feature-point matching fails. In that case, matching with another stored face template is attempted: another template is selected and the feature-point extraction step (S71) and matching step (S72) are repeated. Since a number of templates are stored, this iterates until the feature points match. If no template matches, the frame is skipped and the next frame is processed.

Next, the homography is calculated (S73).

A homography is a 3 × 3 matrix representing the projective relationship between two corresponding images. A minimum of four matched feature-point pairs is required to obtain the homography matrix. If there are more than four correct matching pairs from the feature-point-based matching between the stored template and the current frame, the homography is calculated between the matched feature points.

At this time, RANSAC (RANdom SAmple Consensus), a technique that reduces error by estimating the correct model from data mixed with error and noise, is used to remove outliers. The transformation obtained is still not always correct, because RANSAC sometimes fails to remove outliers, i.e. false matching pairs. It is therefore necessary to verify that the calculated homography is correct.

&Quot; (6) "

Figure 112016007163761-pat00009

&Quot; (7) "

Figure 112016007163761-pat00010

The homography matrix in Equation 6 is normalized so that the element in the third row and third column is 1, and it is examined through several factors as shown in Equation 7.

D is the determinant of the upper-left 2 × 2 submatrix of H; if D is less than 0, the orientation is not preserved. It is the most important factor for identifying a false homography in which distortion or flipping has occurred.

X_s is a scale factor indicating how much the length in the X-axis direction has changed; it is a discriminant that checks whether the X axis has become too large or too small compared with the size of the original object. Y_s is the scale factor in the Y-axis direction and works the same way as X_s.

P is a perspective factor, which indicates how trapezoidal a rectangle becomes. If P is 0, the shape is a perfect rectangle; the larger P is, the further the trapezoid departs from a rectangle.

A wrongly estimated homography is thus identified using the factors above. If any one of the six discrimination conditions in Equation 8 is satisfied, the transformation is judged abnormal.

&Quot; (8) "

Figure 112016007163761-pat00011
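A hedged sketch of the homography estimation and discriminant check; the threshold bounds below are illustrative assumptions, since the patent's exact values are given only in the equation image:

```python
import cv2
import numpy as np

def valid_homography(src_pts, dst_pts):
    """Estimate H with RANSAC from Nx2 float32 point arrays, then apply
    the D / Xs / Ys / P sanity checks; thresholds are illustrative."""
    H, inliers = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)
    if H is None:
        return None
    H = H / H[2, 2]                              # normalize so h33 = 1
    D = H[0, 0] * H[1, 1] - H[0, 1] * H[1, 0]    # 2x2 determinant
    Xs = np.hypot(H[0, 0], H[1, 0])              # x-axis scale factor
    Ys = np.hypot(H[0, 1], H[1, 1])              # y-axis scale factor
    P = np.hypot(H[2, 0], H[2, 1])               # perspective (trapezoid) factor
    if D < 0 or not (0.1 < Xs < 4) or not (0.1 < Ys < 4) or P > 0.002:
        return None                              # abnormal transform: reject
    return H
```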

If a correct homography is calculated normally, the position of the tracked face in the current frame, mapped through the homography, is set as the initial coordinates of the face to be tracked by CamShift, and tracking starts again.

In summary, homography calculation is the process of finding where a stored template matches within the current frame. For a successfully matched template, the homography tells where the warped template lands in the current frame when image warping is performed. Since the stored template is a rectangle, if the calculated homography stays close to a rectangle, the warped region in the current frame is inserted as the initial value of CamShift and tracking resumes.

If the template warped into the current frame by the homography calculated from the matched feature points is seriously distorted, skewed, or inverted, it fails the validation by the discriminants. In this case, matching is performed with another stored face template and the homography-based examination is repeated: another template is selected and the feature-point extraction step (S71), matching step (S72), and homography calculation step (S73) are repeated. Since a number of templates are stored, this iterates until the feature points match; if no template matches, the frame is skipped and the next frame is processed.

Although the present invention has been described in detail with reference to the above embodiments, it is needless to say that the present invention is not limited to the above-described embodiments, and various modifications may be made without departing from the spirit of the present invention.

10: Face 20: Kinect
21: depth camera 22: color camera
30: computer terminal 40: program system
61: depth image 62: color image

Claims (13)

1. (deleted)

2. A CamShift-based real-time face tracking method for receiving a color image and a depth image from a Kinect and tracking a face, the method comprising:
(a) setting an initial object to track a face in an image;
(b) receiving a color image and a depth image of a current frame from the Kinect;
(c) tracking the object by the CamShift tracking method: generating a back projection image from the color image, generating a binary mask image from the depth image, and masking the back projection image;
(e) calculating the Bhattacharyya distance between the object tracked in the previous frame and the object tracked in the current frame;
(f) storing the object tracked in the current frame as a template according to the calculated Bhattacharyya distance;
(g) if the object cannot be tracked in step (c), or if the Bhattacharyya distance calculated in step (e) is greater than or equal to a predetermined second threshold value, re-tracing the object through feature-point matching between the stored templates and the current frame; and
(h) repeating steps (b) to (g) with the next frame as the current frame and the current frame as the previous frame,
wherein in step (g), feature points are detected in the stored template and the current frame by applying the FAST (Features from Accelerated Segment Test) method, a BRIEF (Binary Robust Independent Elementary Features) descriptor is computed for each detected feature point, feature-point matching between the template and the current frame is performed, and a homography matrix is obtained to verify the matching result.
3. The method of claim 2, wherein in step (a) the face region is detected by Haar detection and the detected face region is set as the initial object.
4. The method of claim 2, wherein in step (c) a back projection image is generated by expressing, for every pixel of the current frame, the similarity to the histogram of the object tracked in the previous frame as a brightness value from 0 to 255.
5. The method of claim 4, wherein in step (c) the depth value at the center of the tracked object is read and only pixels at a distance equal to that depth value are set to 1, with the rest set to 0, generating the binarized mask.
6. The method of claim 5, wherein in step (c) each pixel value of the back projection image is masked by multiplying it by the corresponding pixel value in the binary mask image.
7. The method of claim 2, wherein in step (f), if the calculated Bhattacharyya distance is greater than a predetermined first threshold value and smaller than the second threshold value, the object tracked in the current frame is stored as a template, the first threshold value being smaller than the second threshold value.
8. (deleted)

9. The method of claim 2, wherein in step (g), when the feature points are extracted by applying the FAST method, the brightness values of the sixteen surrounding pixels defined around a reference pixel P are classified as much brighter than P, similar to P, or much darker than P; the brightness distribution of the surrounding pixels is represented by a 16-dimensional ternary vector, and the 16-dimensional vector is input to a decision tree to determine whether P is a feature point.
10. The method of claim 2, wherein in step (g) a homography H, as in [Equation 1] below, is calculated from at least four pairs of feature points matched between the stored template and the current frame; the discriminants D, X_s, Y_s, and P are computed from the calculated homography using [Equation 2] below, and it is checked whether each discriminant falls within a predetermined range of values.

[Equation 1]

$$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix}$$

[Equation 2]

$$D = h_{11}h_{22} - h_{12}h_{21}, \quad X_s = \sqrt{h_{11}^2 + h_{21}^2}, \quad Y_s = \sqrt{h_{12}^2 + h_{22}^2}, \quad P = \sqrt{h_{31}^2 + h_{32}^2}$$
11. The method of claim 10, wherein the discriminants are evaluated by a logical expression such as [Equation 3] below.

[Equation 3]

$$D < 0 \;\lor\; X_s > \theta_{s,\max} \;\lor\; X_s < \theta_{s,\min} \;\lor\; Y_s > \theta_{s,\max} \;\lor\; Y_s < \theta_{s,\min} \;\lor\; P > \theta_P$$

(reconstructed structure of the six conditions; the original threshold values appear only in the equation image)
12. The method of claim 2, wherein in step (g), if there are a plurality of stored templates, one of them is selected and the object is re-traced through feature-point matching; if the feature points do not match, another template is selected and the re-tracing through feature-point matching is repeated.
13. A computer-readable recording medium on which is recorded a program for performing the CamShift-based real-time face tracking method according to any one of claims 2 to 7 and 9 to 12.
KR1020160007749A 2016-01-21 2016-01-21 A Real-time Face Tracking Method Robust to Occlusion Based on Improved CamShift with Depth Information KR101741758B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020160007749A KR101741758B1 (en) 2016-01-21 2016-01-21 A Real-time Face Tracking Method Robust to Occlusion Based on Improved CamShift with Depth Information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020160007749A KR101741758B1 (en) 2016-01-21 2016-01-21 A Real-time Face Tracking Method Robust to Occlusion Based on Improved CamShift with Depth Information

Publications (1)

Publication Number Publication Date
KR101741758B1 true KR101741758B1 (en) 2017-05-30

Family

ID=59053200

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160007749A KR101741758B1 (en) 2016-01-21 2016-01-21 A Real-time Face Tracking Method Robust to Occlusion Based on Improved CamShift with Depth Information

Country Status (1)

Country Link
KR (1) KR101741758B1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107403439A (en) * 2017-06-06 2017-11-28 沈阳工业大学 Predicting tracing method based on Cam shift
CN110211160A (en) * 2019-05-30 2019-09-06 华南理工大学 A kind of face tracking method based on improvement Camshift algorithm
KR20200070185A (en) * 2018-01-08 2020-06-17 현대모비스 주식회사 Apparatus and method tracking object based on 3 dimension images

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002216129A (en) * 2001-01-22 2002-08-02 Honda Motor Co Ltd Face area detector, its method and computer readable recording medium
KR101279561B1 (en) * 2012-01-19 2013-06-28 광운대학교 산학협력단 A fast and accurate face detection and tracking method by using depth information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002216129A (en) * 2001-01-22 2002-08-02 Honda Motor Co Ltd Face area detector, its method and computer readable recording medium
KR101279561B1 (en) * 2012-01-19 2013-06-28 광운대학교 산학협력단 A fast and accurate face detection and tracking method by using depth information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li et al., "Object tracking using improved Camshift with SURF method," IEEE OSSC, 2011, pp. 136-141

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107403439A (en) * 2017-06-06 2017-11-28 沈阳工业大学 Predicting tracing method based on Cam shift
CN107403439B (en) * 2017-06-06 2020-07-24 沈阳工业大学 Cam-shift-based prediction tracking method
KR20200070185A (en) * 2018-01-08 2020-06-17 현대모비스 주식회사 Apparatus and method tracking object based on 3 dimension images
KR102424664B1 (en) 2018-01-08 2022-07-25 현대모비스 주식회사 Apparatus and method tracking object based on 3 dimension images
CN110211160A (en) * 2019-05-30 2019-09-06 华南理工大学 A kind of face tracking method based on improvement Camshift algorithm
CN110211160B (en) * 2019-05-30 2022-03-25 华南理工大学 Face tracking method based on improved Camshift algorithm

Similar Documents

Publication Publication Date Title
Sanin et al. Shadow detection: A survey and comparative evaluation of recent methods
US8781221B2 (en) Hand gesture recognition system
Noh et al. A new framework for background subtraction using multiple cues
Ajmal et al. A comparison of RGB and HSV colour spaces for visual attention models
US9898686B2 (en) Object re-identification using self-dissimilarity
Chung et al. Efficient shadow detection of color aerial images based on successive thresholding scheme
US8923564B2 (en) Face searching and detection in a digital image acquisition device
KR101303877B1 (en) Method and apparatus for serving prefer color conversion of skin color applying face detection and skin area detection
CN103035013B (en) A kind of precise motion shadow detection method based on multi-feature fusion
KR20190094352A (en) System and method for performing fingerprint based user authentication using a captured image using a mobile device
CN112287867B (en) Multi-camera human body action recognition method and device
CN109993086A (en) Method for detecting human face, device, system and terminal device
CN110555464A (en) Vehicle color identification method based on deep learning model
Huerta et al. Chromatic shadow detection and tracking for moving foreground segmentation
CN105825168A (en) Golden snub-nosed monkey face detection and tracking algorithm based on S-TLD
KR101741758B1 (en) A Real-time Face Tracking Method Robust to Occlusion Based on Improved CamShift with Depth Information
JP2018120445A (en) Car number recognition apparatus
Lin et al. PRNU-based content forgery localization augmented with image segmentation
Lecca et al. Comprehensive evaluation of image enhancement for unsupervised image description and matching
CN104765440A (en) Hand detecting method and device
CN111402185B (en) Image detection method and device
US9946918B2 (en) Symbol detection for desired image reconstruction
JP2006323779A (en) Image processing method and device
Zou et al. Statistical analysis of signal-dependent noise: application in blind localization of image splicing forgery
KR102139932B1 (en) A Method of Detecting Character Data through a Adaboost Learning Method

Legal Events

Date Code Title Description
E701 Decision to grant or registration of patent right
GRNT Written decision to grant