CN113450386B - Face tracking method and device


Info

Publication number
CN113450386B
Authority
CN
China
Prior art keywords
frame image
detection
image
current frame
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111010539.1A
Other languages
Chinese (zh)
Other versions
CN113450386A
Inventor
李博贤
彭丽江
周朋
郑鹏程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Meishe Network Technology Co ltd
Original Assignee
Beijing Meishe Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Meishe Network Technology Co ltd
Priority to CN202111010539.1A
Publication of CN113450386A
Application granted
Publication of CN113450386B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20112 - Image segmentation details
    • G06T 2207/20132 - Image cropping
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30196 - Human being; Person
    • G06T 2207/30201 - Face

Abstract

The embodiment of the application provides a face tracking method and a face tracking device, relating to the technical field of image processing. A detector performs face detection on the previous frame image of the current frame image to obtain a reference detection frame pre-stored for the current frame image; the current frame image is cropped with this reference detection frame to obtain a local image of the current frame image; and a tracker tracks the face in the local image according to the reference detection frame, inferring the detection frame of the face in the current frame image and thereby realizing face tracking. By combining the detector and the tracker, the face can be detected quickly and accurately in real time, the occupation of device computing power is effectively reduced, delay and stuttering are avoided while the face is tracked in real time, and the user experience is improved.

Description

Face tracking method and device
Technical Field
The present application relates to the field of image processing, and in particular, to a method and an apparatus for tracking a human face.
Background
With the development of image processing technology, more and more applications involve detecting, identifying, and tracking a target. For example, tasks such as beautification, portrait special effects, and video analysis often need to locate and track a human face in a scene captured in real time.
In the prior art, a detector is usually used to detect every frame in a video stream. Although the detection accuracy of the detector is high, its inference time is long, and it is limited by device computing power and cache bandwidth. A traditional face tracking algorithm therefore struggles to locate a face effectively and accurately in real time, and usually introduces a large delay that appears visually as stuttering, seriously harming the user experience.
Disclosure of Invention
The application provides a face tracking method and a face tracking device to solve the problems that a traditional face tracking algorithm struggles to locate a face effectively and accurately in real time and is prone to delay and stuttering.
To solve these problems, the following technical solutions are adopted:
in a first aspect, an embodiment of the present application provides a face tracking method, where the method includes:
taking each frame image in the video stream as a current frame image, determining whether the current frame image is a detection frame image or a tracking frame image;
when the current frame image is a tracking frame image, cutting the current frame image by using a reference detection frame prestored for the current frame image to obtain a local image of the current frame image, wherein the reference detection frame prestored for the current frame image is obtained by using a detector to detect a face of a previous frame image of the current frame image and is used for representing the position of the face in the previous frame image;
and tracking the face in the local image by using a tracker according to the reference detection frame to obtain a detection frame of the face in the current frame image.
In an embodiment of the present application, the method further includes:
when the previous frame image contains a plurality of faces, cropping the current frame image according to the reference detection frames pre-stored for the current frame image to obtain local images of the current frame image, including:
cutting the current frame image according to respective reference detection frames of a plurality of faces prestored in the current frame image to obtain a plurality of local images of the current frame image;
and starting a corresponding number of trackers, tracking the faces in the local images to obtain detection frames of the faces in the current frame image, wherein each tracker correspondingly tracks the face in one local image.
In an embodiment of the present application, the method further includes:
detecting whether the first frame image in the video stream contains a human face by using the detector;
under the condition that the first frame image contains a human face, obtaining the detection frame output by the detector, cropping the first frame image according to the detection frame output by the detector to obtain a cropping region of the detection frame output by the detector, and storing the cropping region of the detection frame output by the detector as a reference detection frame of the next frame image of the first frame image;
under the condition that the first frame image does not contain a human face, detecting, by using the detector, whether an nth-multiple frame image in the video stream (a frame image whose index is a multiple of n) contains a human face;
and under the condition that the nth-multiple frame image contains a human face, obtaining the detection frame output by the detector, cropping the nth-multiple frame image according to the detection frame output by the detector to obtain a cropping region of the detection frame output by the detector, and storing the cropping region of the detection frame output by the detector as a reference detection frame of the next frame image of the nth-multiple frame image.
In an embodiment of the present application, cropping the first frame image or the nth-multiple frame image according to the detection frame output by the detector to obtain a cropping region of the detection frame output by the detector includes:
under the condition that the first frame image or the nth-multiple frame image contains a human face, obtaining the detection frame of the first frame image or the nth-multiple frame image output by the detector;
determining the size information and the coordinate information of the detection frame of the first frame image or the nth-multiple frame image according to that detection frame;
expanding the detection frame of the first frame image or the nth-multiple frame image according to the size information and the coordinate information of that detection frame, the size information of each frame image in the video stream, and a preset completion coefficient, so as to obtain the coordinate information of the region to be cropped of that detection frame;
and cropping the region to be cropped according to its coordinate information to obtain the cropping region of the detection frame of the first frame image or the nth-multiple frame image.
In an embodiment of the present application, expanding the detection frame of the first frame image or the nth-multiple frame image according to the size information and the coordinate information of that detection frame, the size information of each frame image in the video stream, and a preset completion coefficient to obtain the coordinate information of the region to be cropped includes:
expanding the detection frame of the first frame image or the nth-multiple frame image according to the following formula to obtain the coordinate information of the region to be cropped:

nx1 = max(0, x1 - pad * w1)
ny1 = max(0, y1 - pad * h1)
nx2 = min(W, x1 + w1 + pad * w1)
ny2 = min(H, y1 + h1 + pad * h1)

wherein: pad is the completion coefficient; x1 and y1 are the abscissa and ordinate of the upper-left corner of the detection frame of the first frame image or the nth-multiple frame image; w1 and h1 are the width and height of that detection frame; W and H are the width and height of each frame image in the video stream; nx1 and ny1 are the abscissa and ordinate of the upper-left corner of the region to be cropped; and nx2 and ny2 are the abscissa and ordinate of the lower-right corner of the region to be cropped.
In an embodiment of the present application, when the current frame image is a tracking frame image, clipping the current frame image by using a reference detection frame pre-stored for the current frame image to obtain a local image of the current frame image, includes:
determining a cutting area of a previous frame image of the current frame image according to a reference detection frame prestored for the current frame image;
dividing a cutting area of a previous frame image of the current frame image into two non-overlapping areas from inside to outside as an inertia area and a damping area of the current frame image, wherein the inertia area is located inside the cutting area of the previous frame image of the current frame image;
determining the clipping region of the current frame image according to the position relationship between the detection frame of the face in the previous frame image and the inertia region and damping region of the current frame image, or according to the area ratio of the detection frame of the face in the previous frame image to the clipping region of the previous frame image being smaller than an area ratio threshold;
and cutting the current frame image according to the cutting area of the current frame image to obtain a local image of the current frame image.
In an embodiment of the present application, dividing a clipping region of a previous frame image of a current frame image from inside to outside into two non-overlapping regions as an inertia region and a damping region of the current frame image includes:
dividing the clipping region of the previous frame image of the current frame image into two non-overlapping regions from inside to outside according to the following formula, where any point (x, y) satisfying the inequalities lies in the inertia region and the remaining points of the clipping region form the damping region:

a1 + pad * w2 <= x <= a2 - pad * w2
b1 + pad * h2 <= y <= b2 - pad * h2

wherein: pad is the completion coefficient; x and y are the abscissa and ordinate of any point in the inertia region; w2 and h2 are the width and height of the clipping region of the previous frame image of the current frame image; a1 and b1 are the abscissa and ordinate of the upper-left corner of that clipping region; and a2 and b2 are the abscissa and ordinate of its lower-right corner.
In an embodiment of the present application, determining a clipping region of the current frame image according to a position relationship between a detection frame of a face in the previous frame image and an inertia region and a damping region of the current frame image includes:
under the condition that a detection frame of the face in the previous frame image falls into an inertia area of the current frame image, determining a cutting area of the previous frame image as a cutting area of the current frame image;
under the condition that a plurality of edges of a detection frame of the face in the previous frame image fall into a damping region of the current frame image, translating the cutting region of the previous frame image along the moving direction and distance of the plurality of edges to obtain the cutting region of the current frame image;
and under the condition that the detection frame of the face in the previous frame image exceeds the inertia area of the current frame image, or under the condition that the opposite two sides of the detection frame of the face in the previous frame image fall into the damping area of the current frame image, cutting the detection frame of the face in the previous frame image according to the size information and the coordinate information of the detection frame of the face in the previous frame image, the size information of each frame image in the video stream and a preset completion coefficient to obtain the cutting area of the current frame image.
In an embodiment of the present application, the method further includes:
when the current frame image is an nth-multiple frame image in the video stream (so that both the tracker and the detector run), comparing the detection frame output by the tracker with the detection frame output by the detector;
and if the number of detection frames output by the detector is inconsistent with the number of detection frames output by the tracker, and/or the difference between the detection frames output by the detector and the tracker for the same face exceeds a difference threshold, determining that the output of the detector is valid; otherwise, determining that the output of the tracker is valid.
In an embodiment of the present application, the method further includes:
determining the distance between the detection frames for the same face in every two adjacent frame images;
and when the distance is smaller than a distance threshold, taking the detection frame of the face in the previous frame image of the two adjacent frame images as the detection frame of the same face in the current frame image.
In a second aspect, an embodiment of the present application provides a face tracking apparatus, including:
the first determining module is used for determining whether each frame image in the video stream is a detection frame image or a tracking frame image by taking each frame image as a current frame image;
the first cropping module is used for cropping the current frame image by using a reference detection frame prestored for the current frame image under the condition that the current frame image is a tracking frame image to obtain a local image of the current frame image, wherein the reference detection frame prestored for the current frame image is obtained by using a detector to detect the face of a previous frame image of the current frame image and is used for representing the position of the face in the previous frame image;
and the first tracking module is used for tracking the face in the local image by using a tracker according to the reference detection frame to obtain a detection frame of the face in the current frame image.
In an embodiment of the present application, the apparatus further includes:
the second cutting module is used for cutting the current frame image according to respective reference detection frames of a plurality of faces prestored in the current frame image to obtain a plurality of local images of the current frame image;
and the second tracking module is used for starting a corresponding number of trackers to track the faces in the local images to obtain detection frames of the faces in the current frame image, wherein each tracker correspondingly tracks the face in one local image.
In an embodiment of the present application, the apparatus further includes:
the first detection module is used for detecting, by using the detector, whether the first frame image in the video stream contains a human face, and, under the condition that the first frame image does not contain a human face, for detecting whether an nth-multiple frame image in the video stream contains a human face;
a third cropping module, configured to, when the first frame image contains a human face, obtain the detection frame output by the detector, crop the first frame image according to that detection frame to obtain a cropping region of the detection frame, and store that cropping region as a reference detection frame of the next frame image of the first frame image; and, when the nth-multiple frame image contains a human face, obtain the detection frame output by the detector, crop the nth-multiple frame image according to that detection frame to obtain a cropping region of the detection frame, and store that cropping region as a reference detection frame of the next frame image of the nth-multiple frame image.
In an embodiment of the present application, the third clipping module includes:
the first obtaining sub-module is used for obtaining the detection frame of the first frame image or the nth-multiple frame image output by the detector under the condition that the first frame image or the nth-multiple frame image contains a human face;
the first determining sub-module is used for determining the size information and the coordinate information of the detection frame of the first frame image or the nth-multiple frame image according to that detection frame;
a second determining sub-module, configured to expand the detection frame of the first frame image or the nth-multiple frame image according to the size information and the coordinate information of that detection frame, the size information of each frame image in the video stream, and a preset completion coefficient, so as to obtain the coordinate information of the region to be cropped of that detection frame;
and the first cropping sub-module is used for cropping the region to be cropped according to its coordinate information to obtain the cropping region of the detection frame of the first frame image or the nth-multiple frame image.
In an embodiment of the present application, the first clipping module includes:
a third determining submodule, configured to determine, according to a reference detection frame pre-stored for the current frame image, a clipping region of a previous frame image of the current frame image;
the division submodule is used for dividing the cutting area of the previous frame image of the current frame image into two non-overlapping areas from inside to outside as an inertia area and a damping area of the current frame image, and the inertia area is positioned in the cutting area of the previous frame image of the current frame image;
a fourth determining submodule, configured to determine the clipping region of the current frame image according to a position relationship between a detection frame of a face in the previous frame image and the inertia region and the damping region of the current frame image, or according to that an area ratio of the detection frame of the face in the previous frame image to the clipping region of the previous frame image is smaller than an area ratio threshold;
and the second cutting submodule is used for cutting the current frame image according to the cutting area of the current frame image to obtain a local image of the current frame image.
In an embodiment of the present application, the fourth determining sub-module includes:
the first judgment submodule is used for determining the cutting area of the previous frame image as the cutting area of the current frame image under the condition that the detection frame of the face in the previous frame image falls into the inertia area of the current frame image;
the second judgment submodule is used for translating the cutting area of the previous frame image along the moving direction and distance of the edges under the condition that the edges of the detection frame of the face in the previous frame image fall into the damping area of the current frame image to obtain the cutting area of the current frame image;
and the third judgment sub-module is used for cutting the detection frame of the face in the previous frame image according to the size information and the coordinate information of the detection frame of the face in the previous frame image, the size information of each frame image in the video stream and a preset completion coefficient under the condition that the detection frame of the face in the previous frame image exceeds the inertia area of the current frame image or under the condition that the opposite sides of the detection frame of the face in the previous frame image fall into the damping area of the current frame image, so as to obtain the cutting area of the current frame image.
In an embodiment of the present application, the apparatus further includes:
the comparison module is used for comparing the detection frame output by the tracker with the detection frame output by the detector when the current frame image is an nth-multiple frame image in the video stream;
and the second determining module is used for determining that the output of the detector is valid if the number of the detection frames output by the detector is inconsistent with the number of the detection frames output by the tracker and/or the difference between the detection frames output by the detector and the tracker for the same face exceeds a difference threshold, otherwise, determining that the output of the tracker is valid.
In an embodiment of the present application, the apparatus further includes:
the third determining module is used for determining the distance between the detection frames for the same face in every two adjacent frame images;
and the updating module is used for taking the detection frame of the face of the previous frame image in the two adjacent frame images as the detection frame of the same face in the current frame image when the distance is smaller than the distance threshold value.
Compared with the prior art, the method has the following advantages:
In the embodiment of the application, a detector performs face detection on the previous frame image of the current frame image to obtain a reference detection frame pre-stored for the current frame image; the current frame image is cropped with this reference detection frame to obtain a local image of the current frame image; and a tracker tracks the face in the local image according to the reference detection frame, inferring the detection frame of the face in the current frame image. The detector has high detection accuracy but long inference time, while the tracker infers quickly but has weaker inference capability; the detector is therefore invoked only on detection frame images, and the remaining frames are handled by the tracker. By combining the detector and the tracker, the face can be detected quickly and accurately in real time, the occupation of device computing power is effectively reduced, delay and stuttering are avoided while the face is tracked in real time, and the user experience is improved.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating steps of a face tracking method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a face tracking device according to an embodiment of the present application.
Reference numerals: 200 - face tracking device; 201 - first determining module; 202 - first cropping module; 203 - first tracking module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the application provides a lightweight and efficient face tracking method. It aims to solve the technical problems that a traditional face tracking algorithm, limited by device computing power and cache bandwidth, struggles to locate a face effectively and accurately in real time and suffers from delay and stuttering. By pairing a detector with high detection accuracy but long inference time with a tracker with fast inference but weaker inference capability, the method effectively reduces the occupation of device computing power, avoids delay and stuttering while tracking the face in real time, and improves the user experience.
Referring to fig. 1, a face tracking method according to the present application is shown, where the method specifically includes the following steps:
step S101: and determining whether the current frame image is a detection frame image or a tracking frame image by taking each frame image in the video stream as the current frame image.
It should be noted that in this embodiment, a detection frame image is the first frame image of the video stream or any nth-multiple frame image (a frame image whose index is a multiple of n), which is detected by the detector; a tracking frame image is any frame image after the first that is handled by the tracker in its tracking state. That is, the detector is started only on the first frame image and the nth-multiple frame images of the video stream, and is not started on the remaining frames, as sketched below.
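A minimal sketch of this scheduling; the interval n, its value of 10, and the function name are illustrative assumptions rather than values fixed by the patent:

```python
def is_detection_frame(frame_index: int, n: int = 10) -> bool:
    """The detector runs only on the first frame (index 0) and on
    frames whose index is a multiple of n; all other frames are
    tracking frame images handled by the tracker."""
    return frame_index % n == 0  # index 0 is also a multiple of n
```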
In the embodiment, the detector is used to detect whether the first frame image in the video stream contains a human face. When the first frame image contains a human face, the detection frame output by the detector is obtained, the first frame image is cropped according to that detection frame to obtain a cropping region of the detection frame, and that cropping region is stored as the reference detection frame of the next frame image of the first frame image.
Under the condition that the first frame image does not contain a human face, the detector is used to detect whether an nth-multiple frame image in the video stream contains a human face. When the nth-multiple frame image contains a human face, the detection frame output by the detector is obtained, the nth-multiple frame image is cropped according to that detection frame to obtain a cropping region of the detection frame, and that cropping region is stored as the reference detection frame of the next frame image of the nth-multiple frame image.
In this embodiment, before image detection, each frame image in the video stream is preprocessed. Each frame image is scaled to a uniform input size, which may be, but is not limited to, 256 pixels x 256 pixels; each frame image is converted from RGB format to YUV format with only the Y channel retained; and the converted image is then normalized. It should be noted that retaining only the Y channel not only reduces the computation of the detector and the tracker but also discards useless color information, and the normalization converts each frame image to a uniform standard form without changing the original information, so that the data distribution is more uniform and detection by the detector and the tracker is easier.
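A sketch of this preprocessing with OpenCV and NumPy; the 256 x 256 input size follows the example above, while the divide-by-255 normalization is an assumption, since the patent does not fix a normalization formula:

```python
import cv2
import numpy as np

def preprocess(frame_bgr: np.ndarray, size: int = 256) -> np.ndarray:
    """Scale to a uniform input size, keep only the Y (luma) channel,
    and normalize the result."""
    resized = cv2.resize(frame_bgr, (size, size))
    # Convert to YUV and keep only Y, discarding the color information.
    y = cv2.cvtColor(resized, cv2.COLOR_BGR2YUV)[:, :, 0]
    # Normalize to [0, 1] without changing the underlying information.
    return y.astype(np.float32) / 255.0
```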
Step S102: under the condition that the current frame image is a tracking frame image, crop the current frame image by using the reference detection frame pre-stored for the current frame image to obtain a local image of the current frame image, where the reference detection frame pre-stored for the current frame image is obtained by using the detector to perform face detection on the previous frame image of the current frame image and represents the position of the face in the previous frame image.
In this embodiment, after the current frame is detected to contain a human face, the detection frame in the next frame image can be obtained from the detection frame output by the detector, and the detection frame in the frame after that can be obtained from the detection frame in the next frame image. That is, each frame image uses only the data of its previous frame image, and this real-time updating mode effectively saves memory. Meanwhile, because the detection of each frame image takes the detection frame of the previous frame image as reference, the occupation of device computing power is reduced.
In this embodiment, if the previous frame image contains multiple faces, the current frame image is cropped according to the reference detection frames of the multiple faces pre-stored for the current frame image, so as to obtain multiple local images of the current frame image; a corresponding number of trackers are started to track the faces in the local images and obtain the detection frames of the faces in the current frame image, where each tracker tracks the face in one local image.
In this embodiment, the tracker is not started when no target face was detected in the previous frame image or the tracking target has been lost. It should be noted that both the face detector and the face tracker output a number between 0 and 1 as a detection confidence during inference; when the detection confidence is smaller than a detection confidence threshold, the face is considered not detected or the tracking target is considered lost, which usually happens when a face moves out of the frame or moves at high speed. Preferably, the detection confidence threshold may be set to 0.5.
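A sketch of this confidence gate, using the preferred threshold of 0.5; the names are illustrative:

```python
CONFIDENCE_THRESHOLD = 0.5  # preferred detection confidence threshold

def target_present(confidence: float) -> bool:
    """Detector and tracker both emit a confidence in [0, 1]; below
    the threshold the face is treated as not detected or the tracking
    target as lost, and no tracker is started for that face."""
    return confidence >= CONFIDENCE_THRESHOLD
```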
Step S103: track the face in the local image by using the tracker according to the reference detection frame to obtain the detection frame of the face in the current frame image.
In this embodiment, the tracker performs inference on the face in the local image according to the reference detection frame to obtain the detection frame of the face in the current frame image, and the detection frame includes key point information of the face. When the tracker is used for face detection, it outputs, for example but not limited to, 5 face key points, corresponding to the two eyes, the nose tip, and the two mouth corners.
In the embodiment of the application, the combination of the detector and the tracker detects the face quickly and accurately in real time and effectively reduces the occupation of device computing power, thereby avoiding delay and stuttering while tracking the face in real time and improving the user experience.
In a possible implementation manner, the step in step S101 of cropping the first frame image or the nth-multiple frame image according to the detection frame output by the detector to obtain the cropping region of the detection frame output by the detector may include:
step S101-1: and under the condition that the first frame image or the multiple frame image of the nth contains the human face, obtaining a detection frame of the first frame image or the multiple frame image of the nth output by the detector.
Step S101-2: and determining the size information and the coordinate information of the detection frame of the first frame image or the nth frame image according to the detection frame of the first frame image or the nth frame image.
It should be noted that each frame of image generally includes a detection frame of the target human face and a tag file corresponding to the detection frame, where the tag file is used to record position information of each face included in the image, and the position information includes size information and coordinate information of the detection frame.
Step S101-3: expand the detection frame of the first frame image or the nth-multiple frame image according to the size information and the coordinate information of that detection frame, the size information of each frame image in the video stream, and a preset completion coefficient, to obtain the coordinate information of the region to be cropped of that detection frame.
Preferably, the detection frame of the first frame image or the nth-multiple frame image is expanded according to the following formula to obtain the coordinate information of the region to be cropped:

nx1 = max(0, x1 - pad * w1)
ny1 = max(0, y1 - pad * h1)
nx2 = min(W, x1 + w1 + pad * w1)
ny2 = min(H, y1 + h1 + pad * h1)

wherein: pad is the completion coefficient; x1 and y1 are the abscissa and ordinate of the upper-left corner of the detection frame of the first frame image or the nth-multiple frame image; w1 and h1 are the width and height of that detection frame; W and H are the width and height of each frame image in the video stream; nx1 and ny1 are the abscissa and ordinate of the upper-left corner of the region to be cropped; and nx2 and ny2 are the abscissa and ordinate of the lower-right corner of the region to be cropped.
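A minimal sketch of this expansion, following the formula above; clamping the expanded region to the frame borders with W and H is an assumption drawn from the variable definitions:

```python
def expand_box(x1, y1, w1, h1, W, H, pad):
    """Expand a detector box (top-left x1, y1, width w1, height h1)
    by the completion coefficient pad on every side, clamped to the
    W x H frame; returns the region to be cropped as two corners."""
    nx1 = max(0.0, x1 - pad * w1)
    ny1 = max(0.0, y1 - pad * h1)
    nx2 = min(float(W), x1 + w1 + pad * w1)
    ny2 = min(float(H), y1 + h1 + pad * h1)
    return nx1, ny1, nx2, ny2
```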
Step S101-4: crop the region to be cropped according to its coordinate information to obtain the cropping region of the detection frame of the first frame image or the nth-multiple frame image.
In this embodiment, the cropping region of the detection frame of the first frame image or the nth-multiple frame image obtained after cropping contains the target face more completely and truthfully. After cropping, the region is sent to the corresponding tracker, so that the face tracker can perform inference on the cropped region, improving the tracking accuracy.
In a possible implementation manner, in step S102, in the case that the current frame image is a tracking frame image, the step of cropping the current frame image by using a reference detection frame pre-stored for the current frame image to obtain a local image of the current frame image may include:
step S102-1: and determining a cutting area of a previous frame image of the current frame image according to a reference detection frame prestored for the current frame image.
In this embodiment, based on the same cropping strategy, the cropping region of the previous frame image of the current frame image is determined by expanding the reference detection frame pre-stored for the current frame image according to its size information and coordinate information, the size information of each frame image in the video stream, and a preset completion coefficient.
Step S102-2: and dividing the cutting area of the previous frame image of the current frame image into two non-overlapping areas from inside to outside as an inertia area and a damping area of the current frame image, wherein the inertia area is located inside the cutting area of the previous frame image of the current frame image.
Preferably, the cropping region of the previous frame image of the current frame image is divided into two non-overlapping regions from inside to outside according to the following formula, where any point (x, y) satisfying the inequalities lies in the inertia region and the remaining points of the cropping region form the damping region:

a1 + pad * w2 <= x <= a2 - pad * w2
b1 + pad * h2 <= y <= b2 - pad * h2

wherein: pad is the completion coefficient; x and y are the abscissa and ordinate of any point in the inertia region; w2 and h2 are the width and height of the cropping region of the previous frame image of the current frame image; a1 and b1 are the abscissa and ordinate of the upper-left corner of that cropping region; and a2 and b2 are the abscissa and ordinate of its lower-right corner.
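A sketch of this partition as a membership test, following the inequalities above; the helper name is an illustrative assumption:

```python
def in_inertia_region(x, y, a1, b1, a2, b2, pad):
    """True when point (x, y) lies in the inertia region, i.e. the
    previous frame's crop region (a1, b1)-(a2, b2) shrunk inward by
    pad; points of the crop region outside it form the damping region."""
    w2, h2 = a2 - a1, b2 - b1
    return (a1 + pad * w2 <= x <= a2 - pad * w2
            and b1 + pad * h2 <= y <= b2 - pad * h2)
```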
Step S102-3: determine the cropping region of the current frame image according to the position relationship between the detection frame of the face in the previous frame image and the inertia region and damping region of the current frame image, or according to the area ratio of the detection frame of the face in the previous frame image to the cropping region of the previous frame image being smaller than an area ratio threshold.
In this embodiment, from the position relationship between the detection frame of the face in the previous frame image and the inertia region and damping region, it can be inferred whether the corresponding target face in the current frame image has moved, enlarged, or shrunk relative to the previous frame image; a corresponding cropping strategy is applied for each case, finally determining the cropping region of the current frame image.
Step S102-4: crop the current frame image according to the cropping region of the current frame image to obtain the local image of the current frame image.
In this embodiment, if the cropping region of the current frame image exceeds the image boundary of the current frame image, the part of the cropping region beyond the boundary is filled with pixels to ensure the integrity of the cropping region.
In a possible embodiment, step S102-3 may specifically include the following steps:
step S102-3-1: and under the condition that the detection frame of the face in the previous frame image falls into the inertia area of the current frame image, determining the cutting area of the previous frame image as the cutting area of the current frame image.
At this time, it is described that the face in the current frame image does not move relative to the previous frame image or the movement amplitude can be ignored, so that the cropping area of the previous frame image can be directly determined as the cropping area of the current frame image.
Step S102-3-2: and under the condition that the plurality of edges of the detection frame of the face in the previous frame image fall into the damping area of the current frame image, translating the cutting area of the previous frame image along the moving direction and distance of the plurality of edges to obtain the cutting area of the current frame image.
At this time, it is described that the face in the current frame image is translated relative to the previous frame image, so that the cropping area of the previous frame image is translated along the direction and distance of the movement of the plurality of edges, and the cropping area of the current frame image can be obtained.
Step S102-3-3: and under the condition that the detection frame of the face in the previous frame image exceeds the inertia area of the current frame image, or under the condition that the opposite two sides of the detection frame of the face in the previous frame image fall into the damping area of the current frame image, cutting the detection frame of the face in the previous frame image according to the size information and the coordinate information of the detection frame of the face in the previous frame image, the size information of each frame image in the video stream and a preset completion coefficient to obtain the cutting area of the current frame image.
At this time, it is described that the face in the current frame image is enlarged or reduced relative to the previous frame image, and the macroscopically corresponding target face approaches or leaves the lens, and at this time, based on the same clipping strategy, the detection frame of the face in the previous frame image is clipped according to the size information and the coordinate information of the detection frame of the face in the previous frame image, the size information of each frame image in the video stream, and a preset completion coefficient, so as to obtain a clipping region of the current frame image.
In this embodiment, when the current frame image is a tracking frame image, that is, when the tracker is enabled, a corresponding cropping strategy is adopted for each of the situations above, so that the tracker reduces the occupation of device computing power during tracking; a sketch of the three-case decision follows.
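A sketch of steps S102-3-1 to S102-3-3 as one decision function; the corner-box convention, the containment tests, the assumed frame size, and the reuse of the expand_box helper above are illustrative assumptions about how the cases are detected:

```python
FRAME_W, FRAME_H = 1280, 720  # assumed size of each frame image

def next_crop_region(face_box, prev_crop, pad):
    """Choose the current frame's crop region from the previous
    frame's face box; both boxes are (x1, y1, x2, y2) corners."""
    fx1, fy1, fx2, fy2 = face_box
    a1, b1, a2, b2 = prev_crop
    w2, h2 = a2 - a1, b2 - b1
    # Inner (inertia) rectangle of the previous crop region.
    ix1, iy1 = a1 + pad * w2, b1 + pad * h2
    ix2, iy2 = a2 - pad * w2, b2 - pad * h2

    if ix1 <= fx1 and iy1 <= fy1 and fx2 <= ix2 and fy2 <= iy2:
        return prev_crop  # S102-3-1: face barely moved, reuse the crop

    beyond_crop = fx1 < a1 or fy1 < b1 or fx2 > a2 or fy2 > b2
    opposite_edges = (fx1 < ix1 and fx2 > ix2) or (fy1 < iy1 and fy2 > iy2)
    if beyond_crop or opposite_edges:
        # S102-3-3: face scaled or left the crop, recrop around it.
        return expand_box(fx1, fy1, fx2 - fx1, fy2 - fy1,
                          FRAME_W, FRAME_H, pad)

    # S102-3-2: edges drifted into the damping ring on one side only,
    # so translate the previous crop region by the overshoot.
    dx = min(0.0, fx1 - ix1) + max(0.0, fx2 - ix2)
    dy = min(0.0, fy1 - iy1) + max(0.0, fy2 - iy2)
    return (a1 + dx, b1 + dy, a2 + dx, b2 + dy)
```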
In one possible embodiment, the method further comprises:
step S104-1: when the tracker detects the nth multiple frame image in the video stream, comparing the detection frame output by the tracker with the detection frame output by the detector.
Because each tracker tracks the face in a local image correspondingly, when the multiple frame image of the nth in the video stream is tracked, the detector and the tracker both output detection frames aiming at the same target face, and the effective detection frames can be determined by comparing the two detection frames, so that the detection precision is ensured.
Step S104-2: if the number of detection frames output by the detector is inconsistent with the number of detection frames output by the tracker, and/or the difference between the detection frames output by the detector and the tracker for the same face exceeds a difference threshold, determine that the output of the detector is valid; otherwise, determine that the output of the tracker is valid.
It should be noted that the tracker accumulates errors during tracking. When the number of detection frames output by the detector is inconsistent with the number output by the tracker, and/or the difference between the detection frames output by the detector and the tracker for the same face exceeds a difference threshold, the two differ substantially and must be corrected by the detection frame output by the detector; the detector's output is then directly determined to be valid, eliminating the accumulated error.
Specifically, the difference between the detection frames output by the detector and the tracker for the same face can be evaluated by the IOU (intersection-over-union) method: the intersection-over-union ratio of the two detection frames is computed, yielding a decimal between 0 and 1. When this value exceeds the difference threshold, the difference between the two detection frames is not significant and the detection frame output by the tracker can be used directly; otherwise the difference is considered large, and the detection frame output by the detector is adopted to eliminate the error accumulated by the tracker and avoid large jitter. Preferably, the difference threshold may be set to 0.5.
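A sketch of this IOU check; the corner-box convention and function names are illustrative:

```python
DIFF_THRESHOLD = 0.5  # preferred difference threshold

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes, in [0, 1]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def pick_valid_box(tracker_box, detector_box):
    """On an nth-multiple frame, keep the tracker's box while it still
    agrees with the detector; otherwise correct with the detector's box
    to discard the tracker's accumulated error."""
    if iou(tracker_box, detector_box) > DIFF_THRESHOLD:
        return tracker_box
    return detector_box
```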
In one possible embodiment, the method further comprises:
step S105-1: and determining the distance between the detection frames of every two adjacent frames of images aiming at the same face.
Step S105-2: and when the distance is smaller than a distance threshold value, taking the detection frame of the face of the previous frame image in the two adjacent frame images as the detection frame of the same face in the current frame image, wherein the distance threshold value can be set to be 1 pixel.
It should be noted that each detection frame includes its corresponding coordinate information, and the coordinates must be integers for the image, so that there is an rounding process. In the rounding process, no matter rounding is performed downwards or rounding is performed, error shaking of one pixel is caused near a critical point, for example, the coordinate of a certain point of the previous frame image is 0.49, the coordinate of the point of the current frame image is 0.51, and after the rounding is performed, the coordinate of the point of the current frame image is changed from 0 to 1, so that the error is amplified, and a shaking phenomenon is generated.
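A sketch of this jitter suppression; measuring the distance on the top-left corner is an illustrative assumption, since the patent does not fix the distance metric:

```python
def stabilize_box(prev_box, cur_box, dist_threshold=1.0):
    """Reuse the previous frame's box for the same face when it moved
    by less than the threshold (1 pixel), so that rounding near a
    critical point (e.g. 0.49 -> 0.51) cannot flip a coordinate."""
    dx = abs(cur_box[0] - prev_box[0])
    dy = abs(cur_box[1] - prev_box[1])
    return prev_box if max(dx, dy) < dist_threshold else cur_box
```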
Based on the same inventive concept, referring to fig. 2, an embodiment of the present application provides a face tracking apparatus 200, including:
the first determining module 201 is configured to use each frame image in the video stream as a current frame image, and determine whether the current frame image is a detection frame image or a tracking frame image.
A first cropping module 202, configured to crop the current frame image to obtain a local image of the current frame image by using a reference detection frame pre-stored for the current frame image when the current frame image is a tracking frame image, where the reference detection frame pre-stored for the current frame image is obtained by using a detector to perform face detection on a previous frame image of the current frame image, and is used to represent a position of a face in the previous frame image.
The first tracking module 203 is configured to track the face in the local image by using a tracker according to the reference detection frame, so as to obtain a detection frame of the face in the current frame image.
In an embodiment of the present application, the apparatus further includes:
the second clipping module 202 is configured to clip the current frame image according to the reference detection frames of the multiple faces pre-stored in the current frame image, so as to obtain multiple local images of the current frame image.
The second tracking module 203 is configured to enable a corresponding number of trackers, track the faces in the multiple partial images, and obtain detection frames of the multiple faces in the current frame image, where each tracker tracks the face in one partial image correspondingly.
In an embodiment of the present application, the apparatus further includes:
the first detection module is used for detecting whether the first frame image in the video stream contains a human face by using the detector, and is also used for detecting whether the multiple frame image in the nth frame image in the video stream contains the human face by using the detector under the condition that the first frame image does not contain the human face.
A third cropping module 202, configured to, when the first frame image includes a human face, obtain a detection frame output by a detector, crop the first frame image according to the detection frame output by the detector, to obtain a cropped area of the detection frame output by the detector, and store the cropped area of the detection frame output by the detector as a reference detection frame of a next frame image of the first frame image. And under the condition that the multiple frame image of the nth contains a human face, obtaining a detection frame output by a detector, cutting the multiple frame image of the nth according to the detection frame output by the detector to obtain a cutting area of the detection frame output by the detector, and storing the cutting area of the detection frame output by the detector as a reference detection frame of a next frame image of the multiple frame image of the nth.
In an embodiment of the present application, the third cropping module includes:
the first obtaining sub-module is used for obtaining a detection frame of the first frame image or the nth frame multiple image output by the detector under the condition that the first frame image or the nth frame multiple image contains a human face.
The first determining submodule is used for determining the size information and the coordinate information of the detection frame of the first frame image or the nth multiple frame image according to the detection frame of the first frame image or the nth multiple frame image.
And the second determining submodule is used for expanding the detection frame of the first frame image or the nth multiple frame image according to the size information and the coordinate information of the detection frame of the first frame image or the nth multiple frame image, the size information of each frame image in the video stream and a preset completion coefficient to obtain the coordinate information of the region to be cropped of the detection frame of the first frame image or the nth multiple frame image.
And the first clipping submodule is used for clipping the region to be clipped of the detection frame of the first frame image or the nth multiple frame image according to the coordinate information of the region to be clipped of the detection frame of the first frame image or the nth multiple frame image to obtain the clipping region of the detection frame of the first frame image or the nth multiple frame image.
In an embodiment of the present application, the first cropping module 202 includes:
and the third determining submodule is used for determining a cutting area of the previous frame image of the current frame image according to a reference detection frame prestored for the current frame image.
And the division submodule is used for dividing the cutting area of the previous frame image of the current frame image into two non-overlapping areas from inside to outside as an inertia area and a damping area of the current frame image, and the inertia area is positioned in the cutting area of the previous frame image of the current frame image.
The fourth determining sub-module is configured to determine the cropping region of the current frame image according to the position relationship between the detection frame of the face in the previous frame image and the inertia region and damping region of the current frame image, or according to the area ratio of the detection frame of the face in the previous frame image to the cropping region of the previous frame image being smaller than an area ratio threshold.
And the second cutting submodule is used for cutting the current frame image according to the cutting area of the current frame image to obtain a local image of the current frame image.
In an embodiment of the present application, the fourth determining sub-module includes:
and the first judgment sub-module is used for determining the cutting area of the previous frame image as the cutting area of the current frame image under the condition that the detection frame of the face in the previous frame image falls into the inertia area of the current frame image.
And the second judgment submodule is used for translating the cutting area of the previous frame image along the moving direction and distance of the edges under the condition that the edges of the detection frame of the face in the previous frame image fall into the damping area of the current frame image, so as to obtain the cutting area of the current frame image.
And the third judgment sub-module is used for cutting the detection frame of the face in the previous frame image according to the size information and the coordinate information of the detection frame of the face in the previous frame image, the size information of each frame image in the video stream and a preset completion coefficient under the condition that the detection frame of the face in the previous frame image exceeds the inertia area of the current frame image or under the condition that the opposite sides of the detection frame of the face in the previous frame image fall into the damping area of the current frame image, so as to obtain the cutting area of the current frame image.
In an embodiment of the present application, the apparatus further includes:
and the comparison module is used for comparing a detection frame output by the tracker with a detection frame output by the detector when the tracker detects the nth multiple frame image in the video stream.
The second determining module is configured to determine that the output of the detector is valid if the number of detection frames output by the detector is inconsistent with the number output by the tracker, and/or the difference between the detection frames output by the detector and by the tracker for the same face exceeds a difference threshold; otherwise, the output of the tracker is determined to be valid.
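The comparison can be sketched as below. The embodiment does not fix a difference measure, so an IoU-based difference and its threshold are assumptions here, as is the per-face pairing of the two box lists.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def detector_supersedes(det_boxes, trk_boxes, diff_threshold=0.5):
    """On an nth-multiple frame, decide whether the detector's output is
    valid: the counts disagree, or some matched pair differs too much."""
    if len(det_boxes) != len(trk_boxes):
        return True
    return any(1.0 - iou(d, t) > diff_threshold
               for d, t in zip(det_boxes, trk_boxes))  # pairs assumed matched per face
```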
In an embodiment of the present application, the apparatus further includes:
The third determining module is configured to determine, for the same face, the distance between the detection frames of every two adjacent frame images.
The updating module is configured to, when that distance is smaller than a distance threshold, take the detection frame of the face in the earlier of the two adjacent frame images as the detection frame of the same face in the current frame image.
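A sketch of this jitter-suppression step follows. Measuring the distance between detection frames by their center points is an assumption, since the embodiment does not fix the metric.

```python
import math

def stabilize_box(prev_box, cur_box, dist_threshold=4.0):
    """If the same face's box moved less than `dist_threshold` between
    adjacent frames, keep the previous box to suppress jitter."""
    pcx, pcy = (prev_box[0] + prev_box[2]) / 2, (prev_box[1] + prev_box[3]) / 2
    ccx, ccy = (cur_box[0] + cur_box[2]) / 2, (cur_box[1] + cur_box[3]) / 2
    moved = math.hypot(ccx - pcx, ccy - pcy)   # center-to-center distance
    return prev_box if moved < dist_threshold else cur_box
```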
In the embodiment of the present application, the first determining module 201 first determines whether the current frame image is a detection frame image or a tracking frame image. The first cropping module 202 then crops the current frame image using the reference detection frame pre-stored for it, which was obtained by running the detector on the previous frame image, yielding a local image of the current frame image. Finally, the first tracking module 203 uses the tracker to track the face in the local image according to the reference detection frame and infers the detection frame of the face in the current frame image, thereby realizing face tracking. Combining the detector and the tracker in this way allows faces to be detected quickly and accurately in real time while markedly reducing the computational load on the device, avoiding delay and stutter during real-time tracking and improving the user experience.
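Putting the modules together, an end-to-end loop might be sketched as follows. The detect/track callables, the detection-frame period n, and the numpy-style frame indexing are all assumptions made for exposition; expand_detection_box is reused from the earlier sketch.

```python
def track_faces(frames, detect, track, n=10, pad=0.3):
    """Alternate detector and tracker over a frame sequence: run `detect`
    on detection frames (every nth frame), crop tracking frames around
    the stored reference boxes, and let `track` infer the new boxes.
    Frames are numpy-style arrays indexed [y, x]."""
    ref_boxes = []
    for i, frame in enumerate(frames):
        h, w = frame.shape[:2]
        if i % n == 0 or not ref_boxes:           # detection frame
            ref_boxes = detect(frame)
        else:                                     # tracking frame
            new_boxes = []
            for box in ref_boxes:
                x1, y1, x2, y2 = expand_detection_box(
                    box[0], box[1], box[2] - box[0], box[3] - box[1], w, h, pad)
                local = frame[int(y1):int(y2), int(x1):int(x2)]  # local image
                new_boxes.append(track(local, box))
            ref_boxes = new_boxes
        yield ref_boxes
```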
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The face tracking method and apparatus provided by the present invention have been described in detail above, and specific examples have been applied herein to explain the principle and implementation of the present invention; the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (9)

1. A method for face tracking, the method comprising:
taking each frame image in a video stream in turn as a current frame image, determining whether the current frame image is a detection frame image or a tracking frame image;
when the current frame image is a tracking frame image, cropping the current frame image using a reference detection frame pre-stored for the current frame image to obtain a local image of the current frame image, wherein the reference detection frame pre-stored for the current frame image is obtained by performing face detection on the previous frame image of the current frame image with a detector and represents the position of the face in the previous frame image;
tracking the face in the local image with a tracker according to the reference detection frame to obtain a detection frame of the face in the current frame image;
wherein, when the current frame image is a tracking frame image, cropping the current frame image using the reference detection frame pre-stored for the current frame image to obtain the local image of the current frame image comprises:
determining a cropping region of the previous frame image of the current frame image according to the reference detection frame pre-stored for the current frame image;
dividing the cropping region of the previous frame image of the current frame image, from inside to outside, into two non-overlapping regions serving as an inertia region and a damping region of the current frame image, wherein the inertia region is located inside the cropping region of the previous frame image of the current frame image;
determining a cropping region of the current frame image according to the positional relationship between a detection frame of the face in the previous frame image and the inertia region and the damping region of the current frame image;
cropping the current frame image according to the cropping region of the current frame image to obtain the local image of the current frame image;
wherein dividing the cropping region of the previous frame image of the current frame image, from inside to outside, into two non-overlapping regions serving as the inertia region and the damping region of the current frame image comprises:
dividing the cropping region of the previous frame image of the current frame image, from inside to outside, into two non-overlapping regions according to the following formula:
a1 + pad·w2 ≤ x ≤ a2 − pad·w2
b1 + pad·h2 ≤ y ≤ b2 − pad·h2
wherein: pad is the completion coefficient, x is the abscissa of any point in the inertia region, y is the ordinate of any point in the inertia region, w2 is the width of the cropping region of the previous frame image of the current frame image, h2 is the height of the cropping region of the previous frame image of the current frame image, a1 is the abscissa of the upper-left corner coordinate of the cropping region of the previous frame image of the current frame image, b1 is the ordinate of that upper-left corner coordinate, a2 is the abscissa of the lower-right corner coordinate of the cropping region of the previous frame image of the current frame image, and b2 is the ordinate of that lower-right corner coordinate.
2. The method according to claim 1, wherein, when the previous frame image contains a plurality of faces, cropping the current frame image according to the reference detection frame pre-stored for the current frame image to obtain a local image of the current frame image comprises:
cropping the current frame image according to the respective reference detection frames of the plurality of faces pre-stored for the current frame image to obtain a plurality of local images of the current frame image;
starting a corresponding number of trackers and tracking the faces in the local images to obtain detection frames of the faces in the current frame image, wherein each tracker tracks the face in one corresponding local image.
3. The method of claim 1, further comprising:
detecting, with a detector, whether a first frame image in the video stream contains a face;
when the first frame image contains a face, obtaining the detection frame output by the detector, cropping the first frame image according to that detection frame to obtain a cropping region of the detection frame output by the detector, and storing that cropping region as the reference detection frame for the frame image following the first frame image;
when the first frame image does not contain a face, detecting, with the detector, whether an nth-multiple frame image in the video stream contains a face;
when the nth-multiple frame image contains a face, obtaining the detection frame output by the detector, cropping the nth-multiple frame image according to that detection frame to obtain a cropping region of the detection frame output by the detector, and storing that cropping region as the reference detection frame for the frame image following the nth-multiple frame image.
4. The method according to claim 3, wherein cropping the first frame image or the nth-multiple frame image according to the detection frame output by the detector to obtain the cropping region of the detection frame output by the detector comprises:
when the first frame image or the nth-multiple frame image contains a face, obtaining the detection frame of the first frame image or the nth-multiple frame image output by the detector;
determining size information and coordinate information of the detection frame of the first frame image or the nth-multiple frame image from that detection frame;
expanding the detection frame of the first frame image or the nth-multiple frame image according to the size information and coordinate information of that detection frame, the size information of each frame image in the video stream, and a preset completion coefficient, to obtain coordinate information of a region to be cropped for that detection frame;
cropping the region to be cropped according to its coordinate information to obtain the cropping region of the detection frame of the first frame image or the nth-multiple frame image.
5. The method according to claim 4, wherein expanding the detection frame of the first frame image or the nth-multiple frame image according to the size information and coordinate information of that detection frame, the size information of each frame image in the video stream, and the preset completion coefficient to obtain the coordinate information of the region to be cropped comprises:
expanding the detection frame of the first frame image or the nth-multiple frame image according to the following formula to obtain the coordinate information of the region to be cropped for that detection frame:
nx1 = max(0, x1 − pad·w1)
ny1 = max(0, y1 − pad·h1)
nx2 = min(W, x1 + w1 + pad·w1)
ny2 = min(H, y1 + h1 + pad·h1)
wherein: pad is the completion coefficient, x1 is the abscissa of the upper-left corner coordinate of the detection frame of the first frame image or the nth-multiple frame image, y1 is the ordinate of that upper-left corner coordinate, w1 is the width of the detection frame of the first frame image or the nth-multiple frame image, h1 is the height of that detection frame, W is the width of each frame image in the video stream, H is the height of each frame image in the video stream, nx1 is the abscissa of the upper-left corner coordinate of the region to be cropped, ny1 is the ordinate of the upper-left corner coordinate of the region to be cropped, nx2 is the abscissa of the lower-right corner coordinate of the region to be cropped, and ny2 is the ordinate of the lower-right corner coordinate of the region to be cropped.
6. The method according to claim 1, wherein determining the cropping region of the current frame image according to the positional relationship between the detection frame of the face in the previous frame image and the inertia region and the damping region of the current frame image comprises:
when the detection frame of the face in the previous frame image falls inside the inertia region of the current frame image, determining the cropping region of the previous frame image as the cropping region of the current frame image;
when a plurality of edges of the detection frame of the face in the previous frame image fall into the damping region of the current frame image, translating the cropping region of the previous frame image along the direction and by the distance those edges moved, to obtain the cropping region of the current frame image;
when the detection frame of the face in the previous frame image extends beyond the inertia region of the current frame image, or when two opposite edges of that detection frame fall into the damping region of the current frame image, re-cropping around the detection frame of the face in the previous frame image according to its size information and coordinate information, the size information of each frame image in the video stream, and the preset completion coefficient, to obtain the cropping region of the current frame image.
7. The method of claim 3, further comprising:
when the tracker processes an nth-multiple frame image in the video stream, comparing the detection frame output by the tracker with the detection frame output by the detector;
if the number of detection frames output by the detector is inconsistent with the number output by the tracker, and/or the difference between the detection frames output by the detector and by the tracker for the same face exceeds a difference threshold, determining that the output of the detector is valid; otherwise, determining that the output of the tracker is valid.
8. The method according to any one of claims 1-7, further comprising:
determining, for the same face, the distance between the detection frames of every two adjacent frame images;
when the distance is smaller than a distance threshold, taking the detection frame of the face in the earlier of the two adjacent frame images as the detection frame of the same face in the current frame image.
9. An apparatus for face tracking, the apparatus comprising:
a first determining module, configured to take each frame image in a video stream in turn as a current frame image and determine whether the current frame image is a detection frame image or a tracking frame image;
a cropping module, configured to, when the current frame image is a tracking frame image, crop the current frame image using a reference detection frame pre-stored for the current frame image to obtain a local image of the current frame image, wherein the reference detection frame pre-stored for the current frame image is obtained by performing face detection on the previous frame image of the current frame image with a detector and represents the position of the face in the previous frame image;
a tracking module, configured to track the face in the local image with a tracker according to the reference detection frame to obtain a detection frame of the face in the current frame image;
wherein the cropping module comprises:
a third determining sub-module, configured to determine a cropping region of the previous frame image of the current frame image according to the reference detection frame pre-stored for the current frame image;
a division sub-module, configured to divide the cropping region of the previous frame image of the current frame image, from inside to outside, into two non-overlapping regions serving as an inertia region and a damping region of the current frame image, the inertia region being located inside the cropping region of the previous frame image of the current frame image;
a fourth determining sub-module, configured to determine a cropping region of the current frame image according to the positional relationship between the detection frame of the face in the previous frame image and the inertia region and the damping region of the current frame image;
a second cropping sub-module, configured to crop the current frame image according to the cropping region of the current frame image to obtain the local image of the current frame image;
wherein the division sub-module divides the cropping region of the previous frame image of the current frame image, from inside to outside, into two non-overlapping regions according to the following formula:
a1 + pad·w2 ≤ x ≤ a2 − pad·w2
b1 + pad·h2 ≤ y ≤ b2 − pad·h2
wherein: pad is the completion coefficient, x is the abscissa of any point in the inertia region, y is the ordinate of any point in the inertia region, w2 is the width of the cropping region of the previous frame image of the current frame image, h2 is the height of the cropping region of the previous frame image of the current frame image, a1 is the abscissa of the upper-left corner coordinate of the cropping region of the previous frame image of the current frame image, b1 is the ordinate of that upper-left corner coordinate, a2 is the abscissa of the lower-right corner coordinate of the cropping region of the previous frame image of the current frame image, and b2 is the ordinate of that lower-right corner coordinate.
CN202111010539.1A 2021-08-31 2021-08-31 Face tracking method and device Active CN113450386B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111010539.1A CN113450386B (en) 2021-08-31 2021-08-31 Face tracking method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111010539.1A CN113450386B (en) 2021-08-31 2021-08-31 Face tracking method and device

Publications (2)

Publication Number Publication Date
CN113450386A CN113450386A (en) 2021-09-28
CN113450386B (en) 2021-12-03

Family

ID=77819260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111010539.1A Active CN113450386B (en) 2021-08-31 2021-08-31 Face tracking method and device

Country Status (1)

Country Link
CN (1) CN113450386B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7104916B2 (en) * 2018-08-24 2022-07-22 国立大学法人岩手大学 Moving object detection device and moving object detection method
CN110717403B (en) * 2019-09-16 2023-10-24 国网江西省电力有限公司电力科学研究院 Face multi-target tracking method
CN111242973A (en) * 2020-01-06 2020-06-05 上海商汤临港智能科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN111462174B (en) * 2020-03-06 2023-10-31 北京百度网讯科技有限公司 Multi-target tracking method and device and electronic equipment
CN112488064B (en) * 2020-12-18 2023-12-22 平安科技(深圳)有限公司 Face tracking method, system, terminal and storage medium

Also Published As

Publication number Publication date
CN113450386A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
US11003893B2 (en) Face location tracking method, apparatus, and electronic device
JP3461626B2 (en) Specific image region extraction method and specific image region extraction device
US6226388B1 (en) Method and apparatus for object tracking for automatic controls in video devices
TW202011733A (en) Method and device for performing target sampling on pictures
CN113286194A (en) Video processing method and device, electronic equipment and readable storage medium
CN108961318B (en) Data processing method and computing device
EP2151801A1 (en) Image processing apparatus, method, and storage medium
US11070729B2 (en) Image processing apparatus capable of detecting moving objects, control method thereof, and image capture apparatus
CN112511767B (en) Video splicing method and device, and storage medium
CN109543534B (en) Method and device for re-detecting lost target in target tracking
CN108229281B (en) Neural network generation method, face detection device and electronic equipment
CN111986229A (en) Video target detection method, device and computer system
CN113450386B (en) Face tracking method and device
CN111798422B (en) Checkerboard corner recognition method, device, equipment and storage medium
WO2024027634A1 (en) Running distance estimation method and apparatus, electronic device, and storage medium
CN110223320B (en) Object detection tracking method and detection tracking device
CN104123716A (en) Image stability detection method, device and terminal
CN113660420B (en) Video frame processing method and video frame processing device
JP7315002B2 (en) Object tracking device, object tracking method, and program
CN109033924A (en) The method and device of humanoid detection in a kind of video
US11985421B2 (en) Device and method for predicted autofocus on an object
CN112288774B (en) Mobile detection method, mobile detection device, electronic equipment and storage medium
KR101070448B1 (en) The method for tracking object and the apparatus thereof
US20230114785A1 (en) Device and method for predicted autofocus on an object
CN115049968B (en) Dynamic programming video automatic cutting method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant