CN110210306B - Face tracking method and camera - Google Patents

Face tracking method and camera

Info

Publication number
CN110210306B
CN110210306B (application CN201910361317.0A)
Authority
CN
China
Prior art keywords
facial feature
face
frame image
feature point
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910361317.0A
Other languages
Chinese (zh)
Other versions
CN110210306A (en)
Inventor
刘子伟
吴涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Xiaoniao Kankan Technology Co Ltd
Original Assignee
Qingdao Xiaoniao Kankan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Xiaoniao Kankan Technology Co Ltd filed Critical Qingdao Xiaoniao Kankan Technology Co Ltd
Priority to CN201910361317.0A priority Critical patent/CN110210306B/en
Publication of CN110210306A publication Critical patent/CN110210306A/en
Application granted granted Critical
Publication of CN110210306B publication Critical patent/CN110210306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/166Detection; Localisation; Normalisation using acquisition arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a face tracking method and a camera. The method comprises the following steps: acquiring the initial positions of the facial feature points of a first frame image in an acquired image frame sequence; taking the initial positions of the facial feature points as the input of a convolutional neural network, which outputs a position probability heat map of the facial feature points, and performing iterative regression on the position probability heat map with a point distribution model to obtain the pixel positions of the facial feature points in the first frame image; and acquiring the initial positions of the facial feature points of the second frame image in the image frame sequence from the pixel positions of the facial feature points in the first frame image, thereby realizing face tracking. The method and the device realize inter-frame face alignment tracking, avoid instability and jitter of the facial feature points between frames, and improve tracking accuracy and speed.

Description

Face tracking method and camera
Technical Field
The invention relates to the technical field of machine learning, in particular to a face tracking method and a camera.
Background
Face alignment detects a face in an image and labels each specific feature point. Face alignment techniques are commonly used in video processing applications such as live streaming and short video.
Real-time tracking of face alignment greatly helps video processing. However, while face alignment itself is actively researched, its real-time tracking has received far less attention, and the point positions determined by most tracking algorithms jitter between frames, so the tracking result shows obvious distortion.
Disclosure of Invention
The present invention provides a face tracking method and camera to at least partially solve the above problems.
In a first aspect, the present invention provides a face tracking method, including: acquiring an initial position of a facial feature point of a first frame image in an acquired image frame sequence; taking the initial positions of the facial feature points as the input of a convolutional neural network, outputting a position probability heat map of the facial feature points by the convolutional neural network, and performing iterative regression processing on the position probability heat map of the facial feature points by adopting a point distribution model to obtain the pixel positions of the facial feature points in the first frame image; wherein the location probability heatmap represents a probability that the facial feature point is at a pixel location in the first frame image; and acquiring the initial position of the facial feature point of the second frame image in the image frame sequence by utilizing the pixel position of the facial feature point in the first frame image, thereby realizing face tracking.
In some embodiments, obtaining the initial position of the facial feature point of the second frame image in the sequence of image frames using the pixel position of the facial feature point in the first frame image comprises: performing face detection on the second frame image by using a multi-task cascaded convolutional network or the machine learning tool Dlib; and when a face is detected in the second frame image, obtaining the initial position of the facial feature point in the second frame image according to the position information of the facial feature point in the first frame image and a preset relaxation amount, wherein the preset relaxation amount represents the position change of the same feature point between adjacent frame images.
In some embodiments, obtaining the initial position of the facial feature point in the second frame image according to the position information of the facial feature point in the first frame image and the preset relaxation amount comprises: acquiring the initial position of the facial feature point in the second frame image according to

x_i^(2) = x_i^(1) + α · dx_mon

where x_i^(2) is the initial position of facial feature point i in the second frame image, x_i^(1) is the position information of facial feature point i in the first frame image, i is a natural number indexing the feature points, α is a preset adjustment factor with 0 < α < 1, and dx_mon is the preset relaxation amount.
In some embodiments, the performing face detection on the second frame image by using a multi-task cascaded convolutional network or the machine learning tool Dlib further includes: when no face is detected in the second frame image, extracting facial feature points from a pre-constructed average face image; and determining the point locations of the facial feature points on the average face as the initial positions of the facial feature points in the second frame image.
In some embodiments, obtaining the initial position of the facial feature point of the first frame image in the sequence of image frames comprises: extracting facial feature points from a pre-constructed average face image; and determining the point locations of the facial feature points on the average face as the initial positions of the facial feature points in the first frame image.
In some embodiments, extracting facial feature points from the pre-constructed average face image comprises: acquiring facial feature points of each face training sample in a face training sample set, wherein the facial feature points of each face training sample in the face training sample set are calibrated; constructing a mapping matrix from the face training samples to an average face model according to the face feature points of each face training sample; and respectively superposing the facial feature points of all the face training samples to the average face model according to the mapping matrix to obtain the average face image, and determining the superposed facial feature points as the facial feature points of the average face image.
In a second aspect, the present invention provides a camera comprising: a camera and a processor; the camera collects an image frame sequence of the face of the user and sends the image frame sequence to the processor; the processor is used for acquiring the initial position of the facial feature point of the first frame image in the image frame sequence; taking the initial positions of the facial feature points as the input of a convolutional neural network, outputting a position probability heat map of the facial feature points by the convolutional neural network, and performing iterative regression processing on the position probability heat map of the facial feature points by adopting a point distribution model to obtain position information of the facial feature points in the first frame image; wherein the location probability heat map represents a probability that the facial feature point is at each pixel location in the first frame image; and acquiring the initial position of the facial feature point of the second frame image in the image frame sequence by utilizing the position information of the facial feature point in the first frame image, thereby realizing the face tracking.
In some embodiments, the processor further performs face detection on the second frame image using a multitask cascaded convolutional network or using a machine learning tool Dlib; when a face is detected in the second frame image, the initial position of the facial feature point in the second frame image is obtained according to the position information of the facial feature point in the first frame image and a preset relaxation amount, wherein the preset relaxation amount represents the position change of the same feature point in the adjacent frame image.
In some embodiments, the processor extracts facial feature points from a pre-constructed average face image when a face in the second frame image is not detected; determining a point location of a facial feature point on the average face as an initial location of the facial feature point in the second frame image.
In some embodiments, the processor obtains facial feature points of each face training sample in a face training sample set, wherein the facial feature points of each face training sample in the face training sample set are calibrated; constructing a mapping matrix from the face training samples to an average face model according to the face feature points of each face training sample; and respectively superposing the facial feature points of all the face training samples to the average face model according to the mapping matrix to obtain the average face image, and determining the superposed facial feature points as the facial feature points of the average face image.
The method and the device predict the initial positions of the facial feature points of the second frame image from the facial feature points of the first frame image and, once the initial positions in each frame image are obtained, identify the specific pixel positions of the facial feature points by combining a convolutional neural network with a PDM (point distribution model), thereby realizing inter-frame face alignment tracking, avoiding instability and jitter of the facial feature points between frames, and improving tracking accuracy and speed.
Drawings
FIG. 1 is a flow chart of a face tracking method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a camera according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings. It is to be understood that such description is merely illustrative and not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. Furthermore, the terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Some block diagrams and/or flow diagrams are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, which execute via the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
Thus, the techniques of the present invention may be implemented in hardware and/or in software (including firmware, microcode, etc.). Furthermore, the techniques of this disclosure may take the form of a computer program product on a computer-readable storage medium having instructions stored thereon for use by or in connection with an instruction execution system. In the context of the present invention, a computer-readable storage medium may be any medium that can contain, store, communicate, propagate, or transport the instructions. For example, a computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. Specific examples of the computer-readable storage medium include: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and/or wired/wireless communication links.
Face tracking predicts the positions of the facial feature points in the current frame from the information of the previous frame. This matters greatly for per-frame face alignment: it reduces the search range in the current frame and improves tracking accuracy. This embodiment provides a jitter-free inter-frame tracking method that addresses the distortion of face alignment caused by inter-frame tracking jitter.
Fig. 1 is a flowchart of a face tracking method according to an embodiment of the present invention. As shown in Fig. 1, the method of this embodiment includes:
s110, acquiring the initial position of the facial feature point of the first frame image in the acquired image frame sequence.
The facial feature points include feature points for identifying eyes, nose, mouth, eyebrows, face contours and the like.
S120, taking the initial positions of the facial feature points as the input of a convolutional neural network, outputting a position probability heat map of the facial feature points by the convolutional neural network, and performing iterative regression processing on the position probability heat map of the facial feature points by adopting a point distribution model to obtain the pixel positions of the facial feature points in the first frame image; wherein the location probability heat map represents a probability that the facial feature point is at each pixel location in the first frame image.
In this embodiment, the positions of the facial feature points are first processed by a convolutional neural network; a Point Distribution Model (PDM) then performs iterative regression on the position probabilities output by the network to obtain the specific pixel position information of the facial feature points. Combining the convolutional neural network with the PDM improves the recognition accuracy of the facial feature points.
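The patent does not disclose the network architecture or the PDM parameters. As a rough illustration only, the following Python sketch shows the shape of this stage: per-landmark heat maps are reduced to candidate positions, which are then regressed onto a point distribution model (mean shape plus PCA basis). Here `heatmaps`, `mean_shape`, `basis`, and `stds` are hypothetical names standing in for the trained network's output and offline-learned PDM parameters.

```python
import numpy as np

def soft_argmax(heatmap):
    """Expected (x, y) pixel position under one landmark's probability heat map."""
    h, w = heatmap.shape
    p = heatmap / (heatmap.sum() + 1e-12)          # normalize to a probability map
    ys, xs = np.mgrid[0:h, 0:w]
    return np.array([(xs * p).sum(), (ys * p).sum()])

def pdm_regress(points, mean_shape, basis, stds, n_iters=5, clip=3.0):
    """Iteratively regress raw landmark estimates onto the point distribution
    model x = mean_shape + basis @ b, clipping each shape mode to +/- clip
    standard deviations so the result stays a plausible face shape."""
    x = points.reshape(-1)                          # flatten to a (2N,) shape vector
    for _ in range(n_iters):
        b = basis.T @ (x - mean_shape)              # project onto the shape modes
        b = np.clip(b, -clip * stds, clip * stds)   # constrain implausible modes
        x = mean_shape + basis @ b                  # reconstruct the constrained shape
    return x.reshape(-1, 2)

heatmaps = np.random.rand(68, 64, 64)               # stand-in for the CNN's output
pts = np.stack([soft_argmax(h) for h in heatmaps])
# pixel_positions = pdm_regress(pts, mean_shape, basis, stds)  # PDM parameters from training
```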
S130, acquiring the initial position of the facial feature point of the second frame image in the image frame sequence by using the pixel position of the facial feature point in the first frame image, and realizing face tracking.
The initial positions of the facial feature points in the second frame image are predicted from their pixel positions in the first frame image, and the pixel positions in the second frame image are then obtained by the same combination of convolutional neural network and PDM. The pixel positions of the facial feature points in each subsequent frame of the image frame sequence are predicted in turn, realizing tracking and identification of the human face.
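Putting steps S110 to S130 together, the per-frame control flow might look like the following sketch, where `refine` stands for the CNN-plus-PDM stage, `predict_initial` for the relaxation-based prediction described below, and `average_face_points` for the average-face initialization; all three are hypothetical helper names, not the patent's own.

```python
def track(frames, refine, predict_initial, average_face_points):
    """Sequential face tracking: the first frame starts from the average face,
    every later frame starts from the previous frame's refined result."""
    results = []
    prev = None
    for frame in frames:
        init = average_face_points() if prev is None else predict_initial(prev)
        prev = refine(frame, init)      # CNN heat maps + PDM iterative regression
        results.append(prev)
    return results
```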
In this embodiment, the initial positions of the facial feature points of the second frame image are predicted from the facial feature points of the first frame image, and once the initial positions in each frame image are obtained, the specific pixel positions of the facial feature points are identified by combining a convolutional neural network with a PDM (point distribution model). This realizes inter-frame face alignment tracking, avoids instability and jitter of the facial feature points between frames, and improves tracking accuracy and speed.
The above steps S110 to S130 will be described in detail.
First, step S110 is performed, i.e., an initial position of a facial feature point of a first frame image in the captured image frame sequence is acquired.
In some embodiments, the initial position of the facial feature point in the first frame image is obtained as follows: extracting face characteristic points from a pre-constructed average face image; determining a point location of a facial feature point on the average face as an initial location of the facial feature point in the first frame image.
The average face image is constructed as follows: acquire the facial feature points of each face training sample in a face training sample set, where the facial feature points of every sample are pre-calibrated; construct a mapping matrix from each face training sample to an average face model according to its facial feature points; and superpose the facial feature points of all the training samples onto the average face model according to the mapping matrices to obtain the average face image, taking the superposed feature points as the facial feature points of the average face image.
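The patent calls for a "mapping matrix" from each training sample to the average face model without specifying its form; a common choice is a least-squares similarity transform (scale, rotation, translation). The sketch below works under that assumption, iterating the superposition a few times so the mean shape stabilizes.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares similarity transform (scale, rotation, translation)
    mapping src landmarks onto dst; the reflection case is ignored for faces."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    u, s, vt = np.linalg.svd(src_c.T @ dst_c)
    r = (u @ vt).T                                   # optimal 2x2 rotation
    scale = s.sum() / (src_c ** 2).sum()
    t = dst.mean(0) - scale * src.mean(0) @ r.T
    return scale, r, t

def build_average_face(samples, n_iters=3):
    """Superpose every sample's calibrated landmarks onto a running mean shape."""
    mean = samples[0].astype(float)
    for _ in range(n_iters):
        aligned = []
        for pts in samples:                          # one mapping matrix per sample
            scale, r, t = similarity_transform(pts, mean)
            aligned.append(scale * pts @ r.T + t)
        mean = np.mean(aligned, axis=0)              # updated average face landmarks
    return mean
```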
After acquiring the initial positions of the facial feature points of the first frame image, continuing to execute step S120, that is, taking the initial positions of the facial feature points as the input of a convolutional neural network, wherein the convolutional neural network outputs a position probability heat map of the facial feature points, and performing iterative regression processing on the position probability heat map of the facial feature points by using a point distribution model to obtain position information of the facial feature points in the first frame image; wherein the location probability heat map represents a probability that the facial feature point is at each pixel location in the first frame image.
This embodiment first uses a convolutional neural network to compute a position probability heat map of the facial feature points; the convolutional neural network improves computation speed. The PDM is a model that obtains the specific position of each point by iterative regression over its position probability heat map. After the convolutional neural network computes the position probability heat map of the facial feature points, the PDM performs iterative regression on it to determine the specific pixel positions of the facial feature points.
After the position information of the facial feature points in the first frame image is obtained, step S130 is continuously executed, that is, the initial positions of the facial feature points in the second frame image in the image frame sequence are obtained by using the position information of the facial feature points in the first frame image, so as to implement face tracking.
In some embodiments, the initial positions of the facial feature points in the second frame image are obtained as follows: perform face detection on the second frame image using a multi-task cascaded convolutional network or the machine learning tool Dlib; when a face is detected in the second frame image, obtain the initial position of the facial feature point in the second frame image according to the position information of the facial feature point in the first frame image and a preset relaxation amount, where the preset relaxation amount represents the position change of the same feature point between adjacent frame images.
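On the Dlib path, the detection step maps directly onto Dlib's stock frontal face detector (an MTCNN detector would be a drop-in alternative). A sketch follows, with the two fallback helpers being hypothetical names for the steps described in this section.

```python
import dlib

detector = dlib.get_frontal_face_detector()        # Dlib's HOG-based face detector

def initial_points(frame_gray, prev_positions, predict_initial, average_face_points):
    """Use the previous frame's landmarks when a face is detected,
    otherwise fall back to the average-face initialization."""
    rects = detector(frame_gray, 1)                 # upsample once for small faces
    if len(rects) > 0:
        return predict_initial(prev_positions)      # relaxation-based prediction
    return average_face_points()                    # average-face fallback
```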
In one example of this embodiment, the initial positions of the facial feature points in the second frame image may be acquired according to

x_i^(2) = x_i^(1) + α · dx_mon

where x_i^(2) is the initial position of facial feature point i in the second frame image, x_i^(1) is the position information of facial feature point i in the first frame image, i is a natural number indexing the feature points, α is a preset adjustment factor with 0 < α < 1, and dx_mon is the preset relaxation amount.
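In code, the formula is a one-line vector update. A sketch, with the α value and the shape of dx_mon as assumed placeholders, since the patent leaves both preset but unspecified:

```python
import numpy as np

ALPHA = 0.5                  # preset adjustment factor, 0 < alpha < 1 (assumed value)
DX_MON = np.zeros((68, 2))   # preset relaxation amount per feature point (assumed shape)

def predict_initial(prev_positions):
    """Initial positions for the next frame: the previous frame's pixel
    positions nudged by the preset relaxation term."""
    return prev_positions + ALPHA * DX_MON
```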
If no face is detected in the second frame image during face detection with the multi-task cascaded convolutional network or the machine learning tool Dlib, facial feature points are extracted from a pre-constructed average face image, and the point locations of the facial feature points on the average face are determined as the initial positions of the facial feature points in the second frame image.
Here, the average face image is constructed in the same way: acquire the facial feature points of each face training sample in a face training sample set, where the facial feature points of every sample are pre-calibrated; construct a mapping matrix from each face training sample to an average face model according to its facial feature points; superpose the facial feature points of all the training samples onto the average face model according to the mapping matrices to obtain the average face image, taking the superposed feature points as the facial feature points of the average face image; and determine the point locations of the facial feature points on the average face as the initial positions of the facial feature points in the second frame image.
The invention also provides a camera.
Fig. 2 is a block diagram of a camera according to an embodiment of the present invention. As shown in Fig. 2, the camera of this embodiment includes: a camera and a processor, wherein:
the camera is used for collecting an image frame sequence of the face of the user and sending the image frame sequence to the processor;
the processor is used for acquiring the initial position of the facial feature point of the first frame image in the image frame sequence; taking the initial positions of the facial feature points as the input of a convolutional neural network, outputting a position probability heat map of the facial feature points by the convolutional neural network, and performing iterative regression processing on the position probability heat map of the facial feature points by adopting a point distribution model to obtain position information of the facial feature points in the first frame image; wherein the location probability heat map represents a probability that the facial feature point is at each pixel location in the first frame image; and acquiring the initial position of the facial feature point of the second frame image in the image frame sequence by utilizing the position information of the facial feature point in the first frame image, thereby realizing the face tracking.
In this embodiment, the initial positions of the facial feature points of the second frame image are predicted from the facial feature points of the first frame image, and once the initial positions in each frame image are obtained, the specific pixel positions of the facial feature points are identified by combining a convolutional neural network with a PDM (point distribution model). This realizes inter-frame face alignment tracking, avoids instability and jitter of the facial feature points between frames, and improves tracking accuracy and speed.
In some embodiments, the processor further performs face detection on the second frame image by using a multi-task cascaded convolutional network or the machine learning tool Dlib; when a face is detected in the second frame image, the initial position of the facial feature point in the second frame image is obtained according to the position information of the facial feature point in the first frame image and a preset relaxation amount, where the preset relaxation amount represents the position change of the same feature point between adjacent frame images.
In one example of the embodiment, the processor acquires the initial positions of the facial feature points in the second frame image according to

x_i^(2) = x_i^(1) + α · dx_mon

where x_i^(2) is the initial position of facial feature point i in the second frame image, x_i^(1) is the position information of facial feature point i in the first frame image, i is a natural number indexing the feature points, α is a preset adjustment factor with 0 < α < 1, and dx_mon is the preset relaxation amount.
In some embodiments, when no face is detected in the second frame image, the processor extracts facial feature points from a pre-constructed average face image and determines the point locations of the facial feature points on the average face as the initial positions of the facial feature points in the second frame image.
In some embodiments, the processor extracts facial feature points from a pre-constructed average face image, and determines point locations of the facial feature points on the average face as initial positions of the facial feature points in the first frame image.
Specifically, the processor acquires the facial feature points of each face training sample in a face training sample set, where the facial feature points of every sample are pre-calibrated; constructs a mapping matrix from each face training sample to an average face model according to its facial feature points; and superposes the facial feature points of all the training samples onto the average face model according to the mapping matrices to obtain the average face image, taking the superposed feature points as the facial feature points of the average face image.
As the camera embodiment substantially corresponds to the method embodiment, reference may be made to the description of the method embodiment for the relevant details. The camera embodiment described above is merely illustrative: units described as separate parts may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of this embodiment's solution. Those of ordinary skill in the art can understand and implement the embodiment without inventive effort.
For the convenience of clearly describing the technical solutions of the embodiments of the present invention, in the embodiments of the present invention, the words "first", "second", and the like are used to distinguish the same items or similar items with basically the same functions and actions, and those skilled in the art can understand that the words "first", "second", and the like do not limit the quantity and execution order.
While the foregoing is directed to embodiments of the present invention, other modifications and variations of the present invention may be devised by those skilled in the art in light of the above teachings. It should be understood by those skilled in the art that the foregoing detailed description is for the purpose of better explaining the present invention, and the scope of the present invention should be determined by the scope of the appended claims.

Claims (10)

1. A face tracking method, comprising:
acquiring an initial position of a facial feature point of a first frame image in an acquired image frame sequence;
taking the initial positions of the facial feature points as the input of a convolutional neural network, outputting a position probability heat map of the facial feature points by the convolutional neural network, and performing iterative regression processing on the position probability heat map of the facial feature points by adopting a point distribution model to obtain the pixel positions of the facial feature points in the first frame image; wherein the location probability heatmap represents a probability that the facial feature point is at a pixel location in the first frame image;
acquiring the initial position of the facial feature point of the second frame image in the image frame sequence by using the pixel position of the facial feature point in the first frame image to realize face tracking, specifically: and obtaining the initial position of the facial feature point in the second frame image according to the position information of the facial feature point in the first frame image and a preset relaxation amount, wherein the preset relaxation amount represents the position change of the same feature point in the adjacent frame images.
2. The method of claim 1, further comprising, before obtaining an initial position of a facial feature point of a second frame image in the sequence of image frames using a pixel position of the facial feature point in a first frame image:
performing face detection on the second frame image by using a multi-task cascaded convolutional network or by using the machine learning tool Dlib;
and when a human face is detected in the second frame image, acquiring the initial position of the facial feature point in the second frame image according to the position information of the facial feature point in the first frame image and a preset relaxation amount.
3. The method according to claim 2, wherein the obtaining of the initial position of the facial feature point in the second frame image according to the position information of the facial feature point in the first frame image and the preset relaxation amount comprises:
acquiring the initial position of the facial feature point in the second frame image according to

x_i^(2) = x_i^(1) + α · dx_mon

wherein x_i^(2) is the initial position of facial feature point i in the second frame image, x_i^(1) is the position information of facial feature point i in the first frame image, i is a natural number indexing the feature points, α is a preset adjustment factor with 0 < α < 1, and dx_mon is the preset relaxation amount.
4. The method according to claim 2, wherein the performing face detection on the second frame image by using a multi-task cascaded convolutional network or by using the machine learning tool Dlib further comprises:
when no face is detected in the second frame image, extracting facial feature points from a pre-constructed average face image;
determining a point location of a facial feature point on the average face as an initial location of the facial feature point in the second frame image.
5. The method of claim 1, wherein obtaining the initial position of the facial feature point of the first frame image in the sequence of image frames comprises:
extracting facial feature points from a pre-constructed average face image;
determining a point location of a facial feature point on the average face as an initial location of the facial feature point in the first frame image.
6. The method according to claim 4 or 5, wherein the extracting facial feature points from the pre-constructed average face image comprises:
acquiring facial feature points of each face training sample in a face training sample set, wherein the facial feature points of each face training sample in the face training sample set are calibrated;
constructing a mapping matrix from the face training samples to an average face model according to the face feature points of each face training sample;
and respectively superposing the facial feature points of all the face training samples to the average face model according to the mapping matrix to obtain the average face image, and determining the superposed facial feature points as the facial feature points of the average face image.
7. A camera, comprising: a camera and a processor;
the camera collects an image frame sequence of the face of the user and sends the image frame sequence to the processor;
the processor is used for acquiring the initial position of the facial feature point of the first frame image in the image frame sequence; taking the initial positions of the facial feature points as the input of a convolutional neural network, outputting a position probability heat map of the facial feature points by the convolutional neural network, and performing iterative regression processing on the position probability heat map of the facial feature points by adopting a point distribution model to obtain position information of the facial feature points in the first frame image; wherein the location probability heat map represents a probability that the facial feature point is at each pixel location in the first frame image; acquiring the initial position of the facial feature point of the second frame image in the image frame sequence by using the position information of the facial feature point in the first frame image, specifically: and obtaining the initial position of the facial feature point in the second frame image according to the position information of the facial feature point in the first frame image and a preset relaxation amount to realize face tracking, wherein the preset relaxation amount represents the position change of the same feature point in the adjacent frame images.
8. The camera according to claim 7, wherein the processor further performs face detection on the second frame image by using a multi-task cascaded convolutional network or by using the machine learning tool Dlib; and when a human face is detected in the second frame image, acquires the initial position of the facial feature point in the second frame image according to the position information of the facial feature point in the first frame image and a preset relaxation amount.
9. The camera according to claim 8, wherein the processor extracts facial feature points from a pre-constructed average face image when a human face is not detected in the second frame image; determining a point location of a facial feature point on the average face as an initial location of the facial feature point in the second frame image.
10. The camera according to claim 7, wherein the processor obtains facial feature points of each face training sample in a face training sample set, and the facial feature points of each face training sample in the face training sample set are calibrated; constructing a mapping matrix from the face training samples to an average face model according to the face feature points of each face training sample; and respectively superposing the facial feature points of all the face training samples to the average face model according to the mapping matrix to obtain the average face image, and determining the superposed facial feature points as the facial feature points of the average face image.
CN201910361317.0A 2019-04-30 2019-04-30 Face tracking method and camera Active CN110210306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910361317.0A CN110210306B (en) 2019-04-30 2019-04-30 Face tracking method and camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910361317.0A CN110210306B (en) 2019-04-30 2019-04-30 Face tracking method and camera

Publications (2)

Publication Number Publication Date
CN110210306A CN110210306A (en) 2019-09-06
CN110210306B (en) 2021-09-14

Family

ID=67786832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910361317.0A Active CN110210306B (en) 2019-04-30 2019-04-30 Face tracking method and camera

Country Status (1)

Country Link
CN (1) CN110210306B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2672425A1 (en) * 2012-06-08 2013-12-11 Realeyes OÜ Method and apparatus with deformable model fitting using high-precision approximation
CN103714331A (en) * 2014-01-10 2014-04-09 南通大学 Facial expression feature extraction method based on point distribution model
CN105512627A (en) * 2015-12-03 2016-04-20 腾讯科技(深圳)有限公司 Key point positioning method and terminal
CN109241910A (en) * 2018-09-07 2019-01-18 高新兴科技集团股份有限公司 A kind of face key independent positioning method returned based on the cascade of depth multiple features fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10896318B2 (en) * 2017-09-09 2021-01-19 Apple Inc. Occlusion detection for facial recognition processes

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2672425A1 (en) * 2012-06-08 2013-12-11 Realeyes OÜ Method and apparatus with deformable model fitting using high-precision approximation
CN103714331A (en) * 2014-01-10 2014-04-09 南通大学 Facial expression feature extraction method based on point distribution model
CN105512627A (en) * 2015-12-03 2016-04-20 腾讯科技(深圳)有限公司 Key point positioning method and terminal
CN109241910A (en) * 2018-09-07 2019-01-18 高新兴科技集团股份有限公司 A kind of face key independent positioning method returned based on the cascade of depth multiple features fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Constrained Local Neural Fields for robust facial landmark detection in the wild; Tadas Baltrusaitis, et al.; ICCV 2013; Dec. 31, 2013; pp. 354-361 *
Deep Alignment Network: A convolutional neural network for robust face alignment; Marek Kowalski, et al.; arXiv:1706.01789v2; Aug. 10, 2017; pp. 1-10 *
引入全局约束的精简人脸关键点检测网络 [A compact facial landmark detection network with global constraints]; 张伟 et al.; 信号处理 (Journal of Signal Processing); Mar. 2019; Vol. 35, No. 3; pp. 507-515 *

Also Published As

Publication number Publication date
CN110210306A (en) 2019-09-06

Similar Documents

Publication Publication Date Title
JP7236545B2 (en) Video target tracking method and apparatus, computer apparatus, program
CN109214343B (en) Method and device for generating face key point detection model
KR102150776B1 (en) Face location tracking method, apparatus and electronic device
CN107274433B (en) Target tracking method and device based on deep learning and storage medium
WO2018188453A1 (en) Method for determining human face area, storage medium, and computer device
US11238272B2 (en) Method and apparatus for detecting face image
JP6694829B2 (en) Rule-based video importance analysis
US20230030267A1 (en) Method and apparatus for selecting face image, device, and storage medium
WO2020024484A1 (en) Method and device for outputting data
CN109308469B (en) Method and apparatus for generating information
US10620826B2 (en) Object selection based on region of interest fusion
US8903130B1 (en) Virtual camera operator
US20150269739A1 (en) Apparatus and method for foreground object segmentation
CN111104925B (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN109271929B (en) Detection method and device
CN112132847A (en) Model training method, image segmentation method, device, electronic device and medium
CN113887547B (en) Key point detection method and device and electronic equipment
CN109767453A (en) Information processing unit, background image update method and non-transient computer readable storage medium
US20200401811A1 (en) Systems and methods for target identification in video
CN112149615A (en) Face living body detection method, device, medium and electronic equipment
CN112101109B (en) Training method and device for face key point detection model, electronic equipment and medium
CN110856014B (en) Moving image generation method, moving image generation device, electronic device, and storage medium
CN110633630B (en) Behavior identification method and device and terminal equipment
CN112732553A (en) Image testing method and device, electronic equipment and storage medium
CN110210306B (en) Face tracking method and camera

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant