CN110688930B - Face detection method and device, mobile terminal and storage medium - Google Patents

Info

Publication number
CN110688930B
CN110688930B (application CN201910893159.3A)
Authority
CN
China
Prior art keywords
face
target
key point
frames
frame
Prior art date
Legal status
Active
Application number
CN201910893159.3A
Other languages
Chinese (zh)
Other versions
CN110688930A (en)
Inventor
王多民
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201910893159.3A priority Critical patent/CN110688930B/en
Publication of CN110688930A publication Critical patent/CN110688930A/en
Application granted granted Critical
Publication of CN110688930B publication Critical patent/CN110688930B/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 — Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 — Detection; Localisation; Normalisation
    • G06V40/166 — Detection; Localisation; Normalisation using acquisition arrangements
    • G06V40/168 — Feature extraction; Face representation
    • G06V40/172 — Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application provide a face detection method, a face detection device, a mobile terminal and a storage medium. The method is applied to the mobile terminal and comprises the following steps: acquiring a current video frame of a video stream, and identifying whether the current video frame contains M target face frames; if so, detecting, through a trained face key point detection model, first target face key point information corresponding to a first target face frame; calculating the matching degree between the first target face key point information and first target tracked face key point information, and determining whether to update a face detection result pool according to that matching degree; the first target tracked face key point information is the tracked face key point information corresponding to the first target tracked face frame that has the highest overlap degree (IOU) with the first target face frame in the face detection result pool of the video stream. The embodiments of the present application can reduce the error of the face feature point detection results in a video stream.

Description

Face detection method and device, mobile terminal and storage medium
Technical Field
The application relates to the technical field of face recognition, in particular to a face detection method, a face detection device, a mobile terminal and a storage medium.
Background
Currently, face features need to be detected during face recognition. Face feature detection is typically performed using an active shape model (Active Shape Model, ASM) scheme or a convolutional neural network. When face detection is performed on a video stream, video jitter causes the face feature point detection results of the ASM scheme or the convolutional neural network to jitter severely, so the error of the face feature point detection results is large.
Disclosure of Invention
The embodiment of the application provides a face detection method, a face detection device, a mobile terminal and a storage medium, which can reduce errors of face feature point detection results in video streams.
A first aspect of an embodiment of the present application provides a face detection method, including:
acquiring a current video frame of a video stream, and identifying whether the current video frame contains M target face frames, wherein the target face frames are the same face frames detected by the current video frame and the previous video frame, and M is a positive integer;
if so, detecting, through a trained face key point detection model, first target face key point information corresponding to a first target face frame, wherein the first target face frame is any one of the M target face frames;
calculating the matching degree between the first target face key point information and first target tracked face key point information, and determining whether to update a face detection result pool according to the matching degree between the first target face key point information and the first target tracked face key point information; the first target tracked face key point information is the tracked face key point information corresponding to the first target tracked face frame that has the highest overlap degree (IOU) with the first target face frame in the face detection result pool of the video stream.
A second aspect of an embodiment of the present application provides a face detection apparatus, including:
an obtaining unit, configured to obtain a current video frame of a video stream;
the identification unit is used for identifying whether the current video frame contains M target face frames, wherein the target face frames are the same face frames detected by the current video frame and the previous video frame, and M is a positive integer;
the detection unit is used for detecting first target face key point information corresponding to a first target face frame through a trained face key point detection model when the identification unit identifies that the current video frame contains M target face frames, and the first target face frame is any one of the M target face frames;
the computing unit is used for calculating the matching degree between the first target face key point information and first target tracked face key point information;
the processing unit is used for determining whether to update the face detection result pool according to the matching degree between the first target face key point information and the first target tracked face key point information; the first target tracked face key point information is the tracked face key point information corresponding to the first target tracked face frame that has the highest overlap degree (IOU) with the first target face frame in the face detection result pool of the video stream.
A third aspect of the embodiments of the present application provides a mobile terminal comprising a processor and a memory for storing a computer program comprising program instructions, the processor being configured to invoke the program instructions to execute the step instructions as in the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform part or all of the steps as described in the first aspect of the embodiments of the present application.
A fifth aspect of the embodiments of the present application provides a computer program product, wherein the computer program product comprises a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps as described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
In the embodiment of the application, when detecting a face in a video stream, a current video frame of the video stream is acquired, and whether the current video frame contains M target face frames is identified, wherein a target face frame is a face frame detected in both the current video frame and the previous video frame, and M is a positive integer; if so, first target face key point information corresponding to a first target face frame is detected through a trained face key point detection model, wherein the first target face frame is any one of the M target face frames; the matching degree between the first target face key point information and first target tracked face key point information is calculated, and whether to update the face detection result pool is determined according to that matching degree; the first target tracked face key point information is the tracked face key point information corresponding to the first target tracked face frame that has the highest overlap degree (IOU) with the first target face frame in the face detection result pool of the video stream. The method and the device can update the face key point information of the same face frame detected in two consecutive frames, ensuring that the tracked face key point information in the face detection result pool of the video stream is the latest face key point information with minimal jitter, so that the error of the face feature point detection results in the video stream can be reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a face detection method provided in an embodiment of the present application;
fig. 2a is a schematic diagram of a face key point numbering in a face picture according to an embodiment of the present application;
fig. 2b is a schematic view of a face frame in two consecutive video frames according to an embodiment of the present application;
fig. 3 is a flowchart of another face detection method according to an embodiment of the present application;
fig. 4 is a flowchart of another face detection method according to an embodiment of the present application;
fig. 5 is a schematic view of the Euler angles of a user's head in three-dimensional space according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a face detection apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a mobile terminal according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly understand that the embodiments described herein may be combined with other embodiments.
The mobile terminal according to the embodiments of the present application may include various handheld devices, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem, as well as various forms of User Equipment (UE), Mobile Station (MS), terminal devices, and so on. For convenience of description, the above-mentioned devices are collectively referred to as a mobile terminal.
Referring to fig. 1, fig. 1 is a flow chart of a face detection method according to an embodiment of the present application. As shown in fig. 1, the face detection method is applied to a mobile terminal, and may include the following steps.
101, the mobile terminal acquires a current video frame of a video stream, and identifies whether the current video frame contains M target face frames, wherein the target face frames are the same face frames detected by the current video frame and the previous video frame, and M is a positive integer.
The face detection method can be applied to face detection in video streams. The video stream may contain one or more faces, with new faces added or old faces dropped out. The face detection method can track the stable face frames and the stable face key point information in the video stream, can be used for face processing applications (such as face makeup, face mapping and the like), and can also be used for face unlocking and the like.
In this embodiment of the present application, the mobile terminal may obtain the video stream through a camera (for example, a front camera and a rear camera), or may obtain the video stream through local or cloud data.
The number of video frames contained by the video stream within one second may be determined based on the frame rate of the video stream. For example, if the frame rate is 30, the number of video frames contained in one second by the video stream is 30 frames.
The mobile terminal can perform face frame recognition on each frame of the video stream to identify whether each frame of the video stream contains a face frame. The face frames in the video stream may include target face frames and newly added face frames. A target face frame is a face frame detected in both the current video frame and the previous video frame, and a newly added face frame is a face frame detected in the current video frame but not detected in the previous video frame.
The mobile terminal may identify the face frames of the current video frame using a face frame recognition algorithm. The face frame recognition algorithm is used to identify whether the video frame contains face key point data. The face key point data may include the face key point data corresponding to each region of the face, and the regions of the face may include a left eyebrow region, a right eyebrow region, a left eye region, a right eye region, a nose region and a mouth region. If a certain region in the video frame is identified as containing the face key point data corresponding to each region of the face (for example, the left eyebrow region, the right eyebrow region, the left eye region, the right eye region, the nose region and the mouth region), that region is determined to contain a face frame.
Specifically, referring to fig. 2a, fig. 2a is a schematic diagram of the face key point numbering in a face picture according to an embodiment of the present application. As shown in fig. 2a, there are 106 face key points on the face picture: the key points numbered 33-37 and 64-67 form the left eyebrow region, the key points numbered 38-42 and 68-71 form the right eyebrow region, the key points numbered 43-51 and 80-83 form the nose region, and the key points numbered 84-106 form the mouth region. The mobile terminal can use a face frame recognition algorithm to identify whether the current video frame contains the face key points corresponding to these face key point numbers, and if so, a face frame is considered to be recognized.
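As a small illustration, the region grouping described above can be written as a lookup table; a minimal sketch in Python, assuming the key point numbers map directly to array indices (the index ranges for the eye regions are not listed in the text and are therefore omitted rather than guessed):

```python
# Key point numbers per face region, as listed above for the 106-point scheme.
# Whether the stored key point array is 0-based or 1-based is an assumption
# left to the caller.
FACE_REGIONS = {
    "left_eyebrow":  list(range(33, 38)) + list(range(64, 68)),   # 33-37, 64-67
    "right_eyebrow": list(range(38, 43)) + list(range(68, 72)),   # 38-42, 68-71
    "nose":          list(range(43, 52)) + list(range(80, 84)),   # 43-51, 80-83
    "mouth":         list(range(84, 107)),                        # 84-106
}
```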
Because the time interval between two consecutive video frames is short, the positions of the face frames in two consecutive video frames generally do not change greatly. Referring to fig. 2b, fig. 2b is a schematic diagram of face frames in two consecutive video frames according to an embodiment of the present application. As shown in fig. 2b, four face frames in total are detected in the previous video frame: face frame 1, face frame 2, face frame 3 and face frame 4; and four face frames in total are detected in the current video frame: face frame a, face frame b, face frame c and face frame d. The embodiment of the application can calculate the overlap degree (Intersection over Union, IOU) between all face frames in the previous video frame and all face frames in the current video frame. The overlap degree may also be referred to as the overlap ratio. The calculation formula of the overlap degree may be:
IOU = 2·S_O / (S1 + S2);
wherein S1 is the area of one face frame, S2 is the area of the other face frame, S_O is the area of the overlapping part of the two face frames, and IOU is the overlap degree of the two face frames.
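A minimal sketch of this overlap computation, assuming each face frame is given as (x1, y1, x2, y2) pixel coordinates (the box format is an assumption, and the formula is the one stated above rather than the usual intersection-over-union ratio):

```python
def face_frame_overlap(box_a, box_b):
    """Overlap degree of two face frames: IOU = 2 * S_O / (S1 + S2), as defined above."""
    s1 = max(0.0, box_a[2] - box_a[0]) * max(0.0, box_a[3] - box_a[1])  # S1
    s2 = max(0.0, box_b[2] - box_b[0]) * max(0.0, box_b[3] - box_b[1])  # S2
    # S_O: area of the overlapping part of the two face frames
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    s_o = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return 0.0 if s1 + s2 == 0 else 2.0 * s_o / (s1 + s2)
```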
Taking fig. 2b as an example, after overlaying the previous video frame and the current video frame, it can be calculated that the overlap degree of face frame a and face frame 1 is very high (almost 100%), the overlap degree of face frame b and face frame 3 is very high (almost 100%), the overlap degree of face frame c and face frame 4 is very high (almost 100%), the overlap degree of face frame d with face frame 1, face frame 2, face frame 3 and face frame 4 is 0, and the overlap degree of face frame 2 with face frame a, face frame b, face frame c and face frame d is 0. Therefore, face frame 1 and face frame a can be determined to be the same face frame, face frame 3 and face frame b the same face frame, and face frame 4 and face frame c the same face frame. Face frame a, face frame b and face frame c in the current video frame can be judged to be stably tracked face frames, face frame d in the current video frame can be judged to be a newly added face frame, and face frame 2 can be judged to be a face frame whose tracking has failed. Therefore, face frame a, face frame b and face frame c in the current video frame are target face frames.
Optionally, in step 101, the mobile terminal identifies whether the current video frame includes M target face frames, specifically:
(11) The mobile terminal identifies whether the current video frame contains P face frames, wherein P is a positive integer greater than or equal to M;
(12) if P face frames are contained, the mobile terminal calculates the IOU of every pairwise face frame pair between the P face frames and the N tracked face frames in the face detection result pool, wherein N is a natural number;
(13) the mobile terminal determines whether there are M pairwise face frame pairs whose IOU is larger than a first preset threshold, wherein the M pairwise face frame pairs comprise M target face frames and M tracked face frames corresponding one-to-one to the M target face frames, the M target face frames belong to the P face frames, and the M tracked face frames belong to the N tracked face frames.
In the embodiment of the application, the mobile terminal can identify the face frames of the current video frame by using a face frame recognition algorithm. If the current video frame is identified as containing P face frames, the IOU of every pairwise face frame pair between the P face frames and the N tracked face frames in the face detection result pool can be calculated, giving P x N IOUs, each IOU corresponding to one face frame pair.
If there are M pairwise face frame pairs whose IOU is larger than the first preset threshold, it is determined that the current video frame contains M target face frames; if no pairwise face frame pair has an IOU larger than the first preset threshold, it is determined that the current video frame does not contain a target face frame.
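A minimal sketch of steps (11)-(13), reusing the overlap function sketched in step 101; the threshold value is illustrative, and a real implementation would also enforce a one-to-one pairing (e.g. keep only the best tracked frame per face frame):

```python
def find_target_face_frames(current_frames, tracked_frames, first_threshold=0.5):
    """Return the pairwise face frame pairs whose IOU exceeds the first preset threshold."""
    pairs = []
    for i, frame in enumerate(current_frames):        # P face frames of the current frame
        for j, tracked in enumerate(tracked_frames):  # N tracked face frames in the pool
            iou = face_frame_overlap(frame, tracked)
            if iou > first_threshold:
                pairs.append((i, j, iou))             # one of the M pairwise face frame pairs
    return pairs
```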
Optionally, after performing step (13), the following steps may also be performed:
(21) If there are M pairwise face frame pairs whose IOU is larger than the first preset threshold, the mobile terminal determines whether there are m1 pairwise face frame pairs whose IOU is larger than a second preset threshold among the M pairwise face frame pairs, wherein the m1 pairwise face frame pairs comprise m1 target face frames and m1 tracked face frames corresponding one-to-one to the m1 target face frames, the m1 target face frames belong to the M target face frames, the m1 tracked face frames belong to the M tracked face frames, the second preset threshold is larger than the first preset threshold, and m1 is a positive integer;
(22) if so, the mobile terminal does not update the m1 tracked face frames in the face detection result pool;
(23) if not, the mobile terminal updates the M tracked face frames to the M target face frames corresponding to them one-to-one.
In this embodiment of the present application, whether to update the tracked face frames in the face detection result pool may be determined according to whether there are m1 pairwise face frame pairs whose IOU is greater than the second preset threshold among the M pairwise face frame pairs.
For example, P equals 3, N equals 4, M equals 2, and m1 equals 1. Suppose the current frame detects face frame 1, face frame 2 and face frame 3, and the face detection result pool contains tracked face frame 1, tracked face frame 2, tracked face frame 3 and tracked face frame 4. The mobile terminal may calculate the IOU of face frame 1 with each of tracked face frames 1 to 4, the IOU of face frame 2 with each of tracked face frames 1 to 4, and the IOU of face frame 3 with each of tracked face frames 1 to 4, obtaining 12 IOU results. If the IOU of face frame 1 and tracked face frame 1 and the IOU of face frame 3 and tracked face frame 2 are both detected to be larger than the first preset threshold, it is further judged whether the IOU of face frame 1 and tracked face frame 1 is larger than the second preset threshold and whether the IOU of face frame 3 and tracked face frame 2 is larger than the second preset threshold. If the IOU of face frame 3 and tracked face frame 2 is larger than the second preset threshold, while the IOU of face frame 1 and tracked face frame 1 is larger than the first preset threshold but smaller than the second preset threshold, then tracked face frame 1 in the face detection result pool is updated to face frame 1 (that is, tracked face frame 1 is replaced by face frame 1) and tracked face frame 2 in the face detection result pool is not updated. In this way, whether the same face frame in the face detection result pool is updated is decided by the IOU between the face frame detected in the current frame and the historically tracked face frame in the face detection result pool: if the face frame of the current frame jitters strongly, the IOU of the same face frame may be larger than the first preset threshold but smaller than the second preset threshold, and the face detection result pool is updated; if the face frame of the current frame jitters only slightly, the IOU of the same face frame may be larger than the second preset threshold, and the face detection result pool does not need to be updated. For the same face frame with small jitter over two consecutive frames, the tracked face frame in the face detection result pool can still be used, which improves the face tracking speed.
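A minimal sketch of the two-threshold update rule in steps (21)-(23); the threshold value and the list-based pool representation are assumptions for illustration:

```python
def update_tracked_frames(pool, current_frames, pairs, second_threshold=0.9):
    """For each matched pair: keep the tracked face frame when the IOU also exceeds the
    second (stricter) preset threshold, otherwise replace it with the face frame of the
    current frame."""
    for frame_idx, tracked_idx, iou in pairs:
        if iou > second_threshold:
            continue                                   # small jitter: pool is not updated
        pool[tracked_idx] = current_frames[frame_idx]  # larger jitter: update the pool
    return pool
```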
102, if M target face frames are contained, the mobile terminal detects first target face key point information corresponding to a first target face frame through the trained face key point detection model, where the first target face frame is any one of the M target face frames.
In the embodiment of the application, the trained face key point detection model is used for detecting the first target face key point information corresponding to the first target face frame. Specifically, a first target face picture corresponding to the first target face frame may be cropped from the current video frame and input into the trained face key point detection model to obtain the first target face key point information of the first target face picture.
The face key point information may include the coordinates of the key points corresponding to each region of the face, for example, the coordinates of a plurality of key points corresponding to the left eyebrow region, the right eyebrow region, the left eye region, the right eye region, the nose region and the mouth region. For example, it may be detected that the coordinates of the nose tip key point of the nose region are (0.0, 0.0, 0.0), the coordinates of the lowest point of the chin are (0.0, -330.0, 65.0), the coordinates of the key point corresponding to the left eye corner of the left eye region are (-225.0, 170.0, -135.0), the coordinates of the key point corresponding to the right eye corner of the right eye region are (225.0, 170.0, -135.0), the coordinates of the key point corresponding to the left mouth corner of the mouth region are (-150.0, -150.0, -125.0), and the coordinates of the key point corresponding to the right mouth corner of the mouth region are (150.0, -150.0, -125.0).
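A minimal sketch of step 102, assuming the video frame is a NumPy image array, the face frame is given as (x1, y1, x2, y2), and the trained model is a callable returning 106 (x, y) coordinates (the model interface is an assumption):

```python
def detect_keypoints(video_frame, face_frame, keypoint_model):
    """Crop the first target face picture and run the trained face key point detection model."""
    x1, y1, x2, y2 = face_frame
    face_picture = video_frame[y1:y2, x1:x2]      # first target face picture
    keypoints = keypoint_model(face_picture)      # e.g. array of shape (106, 2)
    return keypoints                              # first target face key point information
```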
103, the mobile terminal calculates the matching degree between the first target face key point information and first target tracked face key point information, and decides whether to update the face detection result pool according to that matching degree; the first target tracked face key point information is the tracked face key point information corresponding to the first target tracked face frame that has the highest overlap degree (IOU) with the first target face frame in the face detection result pool of the video stream.
In this embodiment of the present application, the face detection result pool may include the tracked face frames from all previous video frames of the video stream and the face key point information corresponding to those tracked face frames. Each tracked face frame corresponds to a different face frame number, and the tracked face key point information corresponding to each tracked face frame corresponds to a different face key point information number. The face frame number of a face frame may be the same as the number of the face key point information corresponding to that face frame, which facilitates unified management of the face frames and the face key point information in the face detection result pool.
The tracked face frames from all previous video frames of the video stream may be referred to as target tracked face frames. The first target tracked face key point information corresponding to the first target tracked face frame may include the face key point coordinates corresponding to the left eyebrow region, the right eyebrow region, the left eye region, the right eye region, the nose region and the mouth region of the first target tracked face picture corresponding to the first target tracked face frame.
The first target face key point information may include the face key point coordinates corresponding to the left eyebrow region, the right eyebrow region, the left eye region, the right eye region, the nose region and the mouth region of the first target face picture.
The mobile terminal calculates the matching degree of the key point information of the first target face and the key point information of the first target tracked face, and specifically comprises the following steps:
the mobile terminal calculates a first Euclidean distance between the face key point coordinates corresponding to the left eyebrow region of the first target face picture and those corresponding to the left eyebrow region of the first target tracked face picture, a second Euclidean distance between the face key point coordinates corresponding to the right eyebrow region of the first target face picture and those of the first target tracked face picture, a third Euclidean distance between the face key point coordinates corresponding to the left eye region of the first target face picture and those of the first target tracked face picture, a fourth Euclidean distance between the face key point coordinates corresponding to the right eye region of the first target face picture and those of the first target tracked face picture, a fifth Euclidean distance between the face key point coordinates corresponding to the nose region of the first target face picture and those of the first target tracked face picture, and a sixth Euclidean distance between the face key point coordinates corresponding to the mouth region of the first target face picture and those of the first target tracked face picture;
if the first Euclidean distance and the second Euclidean distance are both larger than or equal to a first preset distance threshold, the third Euclidean distance and the fourth Euclidean distance are both larger than or equal to a second preset distance threshold, the fifth Euclidean distance is larger than or equal to a third preset distance threshold, and the sixth Euclidean distance is larger than or equal to a fourth preset distance threshold, it is determined that the matching degree between the first target face key point information and the first target tracked face key point information is larger than or equal to the preset matching degree threshold, that is, the matching of the first target face key point information and the first target tracked face key point information succeeds, and the face detection result pool is not updated;
if the first Euclidean distance is smaller than the first preset distance threshold, or the second Euclidean distance is smaller than the first preset distance threshold, or the third Euclidean distance is smaller than the second preset distance threshold, or the fourth Euclidean distance is smaller than the second preset distance threshold, or the fifth Euclidean distance is smaller than the third preset distance threshold, or the sixth Euclidean distance is smaller than the fourth preset distance threshold, it is determined that the matching degree between the first target face key point information and the first target tracked face key point information is smaller than the preset matching degree threshold, that is, the matching of the first target face key point information and the first target tracked face key point information fails, and the face detection result pool is updated.
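A minimal sketch of the per-region distance computation used in step 103; the region index table can follow the grouping sketched in step 101 (e.g. FACE_REGIONS), the key point arrays are assumed to be NumPy arrays of shape (106, 2), and the comparison of each distance with its preset distance threshold is left to the caller as described above:

```python
import numpy as np

def region_distances(current_kps, tracked_kps, regions):
    """Euclidean distance, per face region, between the first target face key point
    coordinates and the first target tracked face key point coordinates."""
    return {name: float(np.linalg.norm(current_kps[idx] - tracked_kps[idx]))
            for name, idx in regions.items()}
```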
Optionally, in step 103, the determining, by the mobile terminal, whether to update the face detection result pool according to the matching degree between the first target face key point information and the first target tracked face key point information includes:
if the matching degree of the first target face key point information and the first target tracked face key point information is smaller than a preset matching degree threshold value, updating the first target tracked face key point information in the face detection result pool into first target face key point information, and updating the first target tracked face frame in the face detection result pool into a first target face frame;
if the matching degree of the first target face key point information and the first target tracked face key point information is greater than or equal to a preset matching degree threshold value, the face detection result pool is not updated.
In this embodiment of the present application, the preset matching degree threshold may be preset and stored in a memory (for example, a nonvolatile memory) of the mobile terminal. If the matching degree of the first target face key point information and the first target tracked face key point information is smaller than a preset matching degree threshold value, which indicates that the difference between the first target face of the current frame and the first target tracked face in the face detection result pool is large, updating the face detection result pool, updating the first target tracked face key point information in the face detection result pool into the first target face key point information, and updating the first target tracked face frame in the face detection result pool into the first target face frame. If the matching degree of the first target face key point information and the first target tracked face key point information is larger than or equal to a preset matching degree threshold value, the difference between the first target face of the current frame and the first target tracked face in the face detection result pool is smaller, and in order to improve the face tracking speed, the face detection result pool is not updated.
In the embodiment of the application, the face key point information of the same face frame detected in two consecutive frames can be updated, ensuring that the tracked face key point information in the face detection result pool of the video stream is the latest face key point information with minimal jitter, so that the error of the face feature point detection results in the video stream can be reduced.
Referring to fig. 3, fig. 3 is a flowchart of another face detection method according to an embodiment of the present application. Fig. 3 is a view of further optimizing on the basis of fig. 1, and as shown in fig. 3, the face detection method is applied to a mobile terminal, and may include the following steps.
301, the mobile terminal acquires a current video frame of a video stream, identifies whether the current video frame contains M target face frames, and identifies whether the current video frame contains Q newly added face frames, wherein a target face frame is a face frame detected in both the current video frame and the previous video frame, a newly added face frame is a face frame detected in the current video frame but not detected in the previous video frame, or a face frame detected in the first frame of the video stream, M is a positive integer, and Q is a positive integer.
In this embodiment of the present application, the mobile terminal may simultaneously identify whether the current video frame contains M target face frames and whether it contains Q newly added face frames. A newly added face frame is a face frame that appears in the current video frame but was not detected in the previous video frame; alternatively, when the current video frame is the first frame of the video stream, all face frames detected in the current video frame are newly added face frames.
Suppose P face frames exist in the current video frame and N tracked face frames exist in the face detection result pool. The mobile terminal can calculate the IOU of every pairwise face frame pair between the P face frames and the N tracked face frames in the face detection result pool; if one of the P face frames has all N of its IOUs with the N tracked face frames smaller than the first preset threshold, that face frame is determined to be a newly added face frame.
Optionally, the following steps may be performed while step 301 is performed:
the mobile terminal identifies whether the face detection result pool comprises a face frame with failed tracking, and if so, the face frame with failed tracking is deleted from the face detection result pool.
The face frames which fail to track are face frames which exist in a face detection result pool but do not exist in the current video frame.
Specifically, if all P IOUs between a first face frame in the face detection result pool and the P face frames in the current video frame are smaller than the first preset threshold, the first face frame is determined to be a face frame whose tracking has failed. In the embodiment of the application, face frames whose tracking has failed can be deleted from the face detection result pool, so that only continuously tracked face frames are stored in the pool; this prevents face frames that were tracked only for a short time from being stored in the pool and improves the efficiency of processing the face frames in the face detection result pool (for example, face recognition verification in scenarios such as unlocking and payment). If all historically tracked face frames were stored in the face detection result pool, then when a video stream is captured in a scene with heavy foot traffic (such as a large mall, a gymnasium or a square), the number of faces in the face detection result pool of the video stream would become extremely large, the amount of computation for subsequent face matching verification would be extremely large, and the subsequent processing efficiency would be reduced.
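A minimal sketch of this pruning step, reusing the overlap function from step 101; the threshold value and the list-based pool representation are illustrative:

```python
def prune_failed_tracking(pool_frames, pool_keypoints, current_frames, first_threshold=0.5):
    """Remove tracked face frames whose IOU with every face frame of the current video
    frame is below the first preset threshold, together with their key point information."""
    kept_frames, kept_keypoints = [], []
    for tracked, kps in zip(pool_frames, pool_keypoints):
        if any(face_frame_overlap(tracked, f) >= first_threshold for f in current_frames):
            kept_frames.append(tracked)
            kept_keypoints.append(kps)
    return kept_frames, kept_keypoints
```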
Optionally, the mobile terminal may delete the face key point information corresponding to the face frame with failed tracking in the face detection result pool while deleting the face frame with failed tracking from the face detection result pool.
302, if M target face frames are included, the mobile terminal detects first target face key point information corresponding to a first target face frame through a trained face key point detection model, where the first target face frame is any one of the M target face frames.
303, the mobile terminal calculates the matching degree between the first target face key point information and the first target tracked face key point information, and decides whether to update the face detection result pool according to that matching degree; the first target tracked face key point information is the tracked face key point information corresponding to the first target tracked face frame that has the highest overlap degree (IOU) with the first target face frame in the face detection result pool of the video stream.
Optionally, the specific implementation of step 302 and step 303 may refer to step 102 and step 103 in fig. 1, which are not described herein.
304, if Q newly added face frames are contained, the mobile terminal adds the Q newly added face frames to the face detection result pool.
In the embodiment of the application, a newly added face frame can be added directly to the face detection result pool, ensuring that the face frame results in the face detection result pool are up to date.
Optionally, when the current video frame is identified as containing Q newly added face frames, the mobile terminal may further detect, through the trained face key point detection model, Q pieces of newly added face key point information corresponding to the Q newly added face frames, and add them to the face detection result pool. The Q newly added face frames correspond one-to-one to the Q pieces of newly added face key point information.
Each piece of newly added face key point information may include the coordinates of the key points corresponding to each region of the face, for example, the coordinates of a plurality of key points corresponding to the left eyebrow region, the right eyebrow region, the left eye region, the right eye region, the nose region and the mouth region.
In the embodiment of the application, newly added face frames are added to the face detection result pool, which keeps the face frame results in the face detection result pool up to date and improves the accuracy of face frame detection.
Referring to fig. 4, fig. 4 is a flowchart of another face detection method according to an embodiment of the present application. Fig. 4 is a view of further optimizing on the basis of fig. 1, and as shown in fig. 4, the face detection method is applied to a mobile terminal, and may include the following steps.
401, the mobile terminal acquires an initial training data set for training, where the initial training data set includes a first preset number of initial face images and face key point information corresponding to the first preset number of initial face images.
In this embodiment, steps 401 to 405 are training processes of the face keypoint detection model.
The initial face pictures in the initial training data set may be collected in advance. Specifically, the initial face pictures may cover a plurality of large categories, for example 7 large categories: beard, black-framed glasses, cap, mask, no decoration, scarf and sunglasses. There may be multiple subcategories under each large category. For example, the no-decoration category may have 23 subcategories with 25000 pictures in total, which may specifically include: no expression (1000 pictures), eyes closed (1000), toothy smile (1000), wide-open mouth (1000), pursed mouth (1000), blowing air (1000), eyebrows raised (1000), frowning (1000), one eye closed (1000), 0-degree rotation (1000), 45-degree left rotation (1500), 15-degree left rotation (1500), 45-degree right rotation (1500), 15-degree right rotation (1500), 15-degree head tilt (1000), 45-degree head tilt (1000), upper-left 45-degree tilt (1000), lower-left 45-degree tilt (1000) and lower-right 45-degree tilt (1000). The remaining 6 large categories each have 12 subcategories, namely 200 frontal faces, 200 faces turned right, 200 faces turned left, 200 heads lowered, 200 heads raised, 200 squinting, 200 eyes closed, 200 frowning, 200 puffed cheeks, 200 with stickers, 200 yawning and 200 laughing.
Each initial face picture can be marked corresponding to 106 face key points. Referring specifically to fig. 2a, the 106 face key point labels may be obtained by manual labeling, so as to obtain coordinates of the 106 face key points of each initial face picture. The face key point information corresponding to the initial face picture may include coordinates of 106 face key points of the initial face picture.
402, the mobile terminal analyzes pose distribution information of a first preset number of initial face pictures based on face key point information corresponding to the first preset number of initial face pictures.
403, the mobile terminal determines whether the pose distribution information of the first preset number of initial face pictures meets a preset pose distribution rule.
404, if the pose distribution information of the first preset number of initial face images does not meet the preset pose distribution rule, the mobile terminal amplifies the number of face images of the initial training dataset based on the pose distribution information of the first preset number of initial face images to obtain an amplified training dataset meeting the preset pose distribution rule, wherein the amplified training dataset comprises the first preset number of initial face images, face key point information corresponding to the first preset number of initial face images, the second preset number of amplified face images and face key point information corresponding to the second preset number of amplified face images.
405, the mobile terminal trains the face key point detection model based on the augmented training data set to obtain a trained face key point detection model.
In the embodiment of the present application, the implementation of steps 402 to 405 may include the following steps.
S1: on a preset standard face template, the three-dimensional coordinates of six sampling key points are analyzed; specifically, the coordinate positions of the 6 sampling key points of the standard face template, namely the nose tip, the chin, the left eye corner, the right eye corner, the left mouth corner and the right mouth corner, are analyzed, and three-dimensional coordinates are set for the 6 points respectively: nose tip (0.0, 0.0, 0.0), chin (0.0, -330.0, 65.0), left eye corner (-225.0, 170.0, -135.0), right eye corner (225.0, 170.0, -135.0), left mouth corner (-150.0, -150.0, -125.0), right mouth corner (150.0, -150.0, -125.0). The three-dimensional coordinates are in the order (X, Y, Z).
S2: set the camera intrinsic parameters: the center point of the camera is (h/2, w/2), where h and w are respectively the height and width of the face picture. Setting the focal length of the camera to h, the camera matrix is:
[ h   0   h/2 ]
[ 0   h   w/2 ]
[ 0   0    1  ]
s3: coordinate positions of 6 sampling key points of each initial face picture in the initial training data set and position coordinates of 6 corresponding sampling key points on a preset template are obtained, on the 6 corresponding sampling key points (such as, for example), camera internal parameters are combined, pnP (perspective n points, perspective n-point projection) is solved, and a rotation vector and a translation vector of each initial face picture are obtained.
PnP solves for the rotation vector and translation vector of each initial face picture.
S4: the Rodrigues transform (which can be computed with the OpenCV built-in function Rodrigues) is applied to the rotation vector of each initial face picture to obtain the rotation matrix of each initial face picture, and the rotation matrix is spliced (hstack) with the translation vector obtained above to obtain the projection matrix of each initial face picture.
S5: projection matrix decomposition is carried out on the projection matrix of each initial face picture to obtain the Euler angles representing the face pose: pitch angle (x-axis), yaw angle (y-axis) and roll angle (z-axis).
The Euler angle calculation may use OpenCV built-in functions.
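A minimal sketch of steps S2-S5 with OpenCV; the exact layout of the camera matrix (which entry takes h/2 and which takes w/2) and the zero distortion coefficients are assumptions based on the parameters stated above:

```python
import cv2
import numpy as np

# S1: three-dimensional coordinates of the 6 sampling key points on the standard face template
MODEL_POINTS = np.array([
    (0.0,    0.0,    0.0),     # nose tip
    (0.0, -330.0,   65.0),     # chin
    (-225.0, 170.0, -135.0),   # left eye corner
    (225.0,  170.0, -135.0),   # right eye corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0),   # right mouth corner
], dtype=np.float64)

def face_euler_angles(image_points, h, w):
    """image_points: the 6 sampling key point coordinates of one face picture,
    in the same order as MODEL_POINTS. Returns (pitch, yaw, roll) in degrees."""
    camera_matrix = np.array([[h, 0, h / 2.0],
                              [0, h, w / 2.0],
                              [0, 0, 1]], dtype=np.float64)          # S2: camera matrix
    dist_coeffs = np.zeros((4, 1))                                   # no lens distortion assumed
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS,
                                  np.asarray(image_points, dtype=np.float64),
                                  camera_matrix, dist_coeffs)        # S3: solve PnP
    rmat, _ = cv2.Rodrigues(rvec)                                    # S4: rotation matrix
    proj = np.hstack((rmat, tvec))                                   # S4: projection matrix
    euler = cv2.decomposeProjectionMatrix(proj)[6]                   # S5: Euler angles
    return euler.flatten()                                           # pitch, yaw, roll
```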
S6: the pose of each initial face picture in the initial training data set is analyzed by calculating the pitch angle, the yaw angle and the roll angle of each initial face picture. Referring specifically to fig. 5, fig. 5 is a schematic diagram of the Euler angles of a user's head in three-dimensional space according to an embodiment of the present application. The pitch angle (the face looking up or down) has a distribution range of -45 degrees to 45 degrees and is divided into intervals of 15 degrees, giving 6 intervals (bins); the yaw angle (the face turning left or right) has a distribution range of -90 degrees to 90 degrees and is divided into intervals of 15 degrees, giving 12 intervals (bins); the roll angle (the face tilting left or right) has a distribution range of -45 degrees to 45 degrees and is divided into intervals of 15 degrees, giving 6 intervals (bins). Since the pitch angle can be divided into 6 intervals, the yaw angle into 12 intervals and the roll angle into 6 intervals, the initial face pictures can have 6 x 12 x 6 = 432 poses in total.
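A minimal sketch of the binning in S6; clipping angles that fall outside the stated ranges is an assumption not made explicit in the text:

```python
import numpy as np

def pose_bin(pitch, yaw, roll):
    """Map one face picture's Euler angles to one of the 6 x 12 x 6 = 432 pose bins
    (15-degree intervals; pitch and roll over [-45, 45], yaw over [-90, 90])."""
    p = int(np.clip((pitch + 45) // 15, 0, 5))   # 6 pitch bins
    y = int(np.clip((yaw + 90) // 15, 0, 11))    # 12 yaw bins
    r = int(np.clip((roll + 45) // 15, 0, 5))    # 6 roll bins
    return p, y, r
```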
S7: the distribution of the initial face pictures in the initial training data set over the Euler angles is counted. In general, most pictures in the initial training data set have an upright, frontal face pose, so the pose distribution is extremely unbalanced. According to the pose distribution of the face pictures, data enhancement is performed on the poses with few samples, and the number of pictures is amplified by cropping, adding noise, changing image brightness and the like, until the pose distribution reaches a balanced state, for example every pose has face pictures and the ratio of face pictures in any two different poses is 1:1.
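A minimal sketch of one augmentation pass for S7; the crop margin, noise level and brightness range are illustrative, and the corresponding key point coordinates would have to be shifted by the crop offset (omitted here):

```python
import numpy as np

def augment_picture(picture, rng=None):
    """Amplify one face picture by cropping, adding noise and changing brightness."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = picture.shape[:2]
    dy = int(rng.integers(0, max(1, h // 20)))           # small random crop margin
    dx = int(rng.integers(0, max(1, w // 20)))
    out = picture[dy:h - dy, dx:w - dx].astype(np.float32)
    out += rng.normal(0.0, 3.0, out.shape)               # additive noise
    out *= rng.uniform(0.8, 1.2)                         # brightness change
    return np.clip(out, 0, 255).astype(np.uint8)
```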
S8: after the training data set has been augmented over the poses, the face key point detection model can be trained with the augmented training data set. In the training process, a simple 5-layer convolutional network followed by a 2-layer fully connected network can be used to directly regress the key point position coordinates, which improves the speed at which the face key point detection model computes the key point position coordinates.
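A minimal sketch of such a network in PyTorch; the framework, channel widths, strides and 112 x 112 input size are assumptions, since the text only specifies 5 convolution layers, 2 fully connected layers and direct regression of the 106 key point coordinates:

```python
import torch.nn as nn

class KeypointNet(nn.Module):
    """5 convolution layers followed by 2 fully connected layers that directly
    regress the 106 face key point coordinates."""
    def __init__(self, num_points=106):
        super().__init__()
        self.num_points = num_points
        chans = [3, 16, 32, 64, 64, 128]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):          # 5 conv layers
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*layers)
        self.regressor = nn.Sequential(                          # 2 fully connected layers
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),                         # 112x112 input -> 4x4 feature map
            nn.ReLU(inplace=True),
            nn.Linear(256, num_points * 2),                      # (x, y) per key point
        )

    def forward(self, x):                                        # x: (N, 3, 112, 112)
        return self.regressor(self.features(x)).view(-1, self.num_points, 2)
```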
When the number of training steps of the face key point detection model reaches a certain value (for example, 300,000 steps), the training error of the face key point detection model converges. The training error of the face key point detection model can be measured with the NME, which is the distance between the key points predicted by the model and the annotated key points, normalized by the distance between the two eyes. For example, when the NME is smaller than a preset error threshold, the face key point detection model is determined to be a trained face key point detection model; for example, when the NME is smaller than 3.2%, the model is considered trained. The smaller the NME, the better the accuracy of the model.
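A minimal sketch of the NME measure described above; which key point indices give the two eye positions is left to the caller, since the text does not fix them:

```python
import numpy as np

def nme(pred, gt, left_eye_idx, right_eye_idx):
    """Mean distance between predicted and annotated key points, normalised by the
    distance between the two eyes. pred and gt have shape (106, 2)."""
    inter_ocular = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)) / inter_ocular)
```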
It should be noted that, steps 401 to 405 are training phases of the face key point detection model, and steps 406 to 408 are face key point recognition phases. When the mobile terminal executes steps 406 to 408, the trained face key point detection model in steps 401 to 405 is needed. Steps 401 to 405 need to be performed before steps 406 to 408. In the case where the data amount of the augmentation training data set is large, the training phase of the face key point detection model may take a long time. Therefore, steps 401 to 405 may be performed by the mobile terminal in advance, and after the trained face key point detection model is obtained, no training is required. That is, the mobile terminal does not need to perform the above steps 401 to 405 every time the steps 406 to 408 are performed.
Alternatively, the training phase of the face keypoint detection model of steps 401 to 405 may be implemented using a machine dedicated to model training, and the mobile terminal itself is not used for model training. The mobile terminal only needs to use the face key point detection model trained by the machine.
406, the mobile terminal acquires a current video frame of the video stream, and identifies whether the current video frame contains M target face frames, wherein the target face frames are the same face frames detected by the current video frame and the previous video frame, and M is a positive integer.
407, if M target face frames are contained, the mobile terminal detects first target face key point information corresponding to the first target face frame through the trained face key point detection model, where the first target face frame is any one of the M target face frames.
Optionally, in step 407, the mobile terminal detects, through the trained face key point detection model, first target face key point information corresponding to the first target face frame, specifically:
the mobile terminal detects the first target face key point information of the first target face picture corresponding to the first target face frame through the trained face key point detection model.
408, the mobile terminal calculates the matching degree between the first target face key point information and the first target tracked face key point information, and decides whether to update the face detection result pool according to that matching degree; the first target tracked face key point information is the tracked face key point information corresponding to the first target tracked face frame that has the highest overlap degree (IOU) with the first target face frame in the face detection result pool of the video stream.
Optionally, after performing step 408, the following steps may also be performed:
(31) The mobile terminal analyzes the pose information of the first target face picture based on the first target face key point information;
(32) The mobile terminal corrects the first target face picture according to the pose information of the first target face picture to obtain a first target face correction picture;
the mobile terminal corrects the first target face picture, which may specifically be: correcting the first face picture into a face picture of which the pitch angle, the yaw angle and the roll angle are all in an angle interval corresponding to the pre-stored face template. For example, the pitch angle, the yaw angle and the roll angle of the pre-stored face template are all in a 0-15 degree interval, and the pitch angle, the yaw angle and the roll angle of the first target face picture are not in a 0-15 degree interval, so that the rotation vector and the translation vector of the first target face picture can be determined according to the pitch angle, the yaw angle and the roll angle of the first target face picture and the pitch angle, the yaw angle and the roll angle of the pre-stored face template, and the rotation and translation operations are performed on the first target face picture to obtain the first target face correction picture, so that the pitch angle of the first target face correction picture and the pitch angle of the pre-stored face template are in the same interval, the yaw angle of the first target face correction picture and the roll angle of the pre-stored face template are in the same interval, and the roll angle of the first target face correction picture and the pre-stored face template are in the same interval.
Because the pitch angle, the yaw angle and the roll angle of the first target face correction picture lie in the same intervals as those of the pre-stored face template, the success rate of face template comparison can be increased, and the accuracy of face recognition is improved.
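As a hedged illustration of the correction step, the sketch below performs only the in-plane (roll) part of the correction by rotating the picture so the eye line is level; the full pitch and yaw correction described above would additionally require a 3D rotation and translation derived from the estimated pose. The use of OpenCV and the eye-coordinate inputs are assumptions, not part of this description.

import cv2
import numpy as np

def correct_roll(face_img, left_eye, right_eye, target_roll_deg=0.0):
    """Rotates the face picture around the eye midpoint so that its roll
    angle matches target_roll_deg (e.g. the roll angle of the pre-stored
    face template). Pitch and yaw are not changed by this in-plane warp."""
    dy = right_eye[1] - left_eye[1]
    dx = right_eye[0] - left_eye[0]
    roll_deg = np.degrees(np.arctan2(dy, dx))
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    rot = cv2.getRotationMatrix2D(center, roll_deg - target_roll_deg, 1.0)
    h, w = face_img.shape[:2]
    return cv2.warpAffine(face_img, rot, (w, h))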
(33) And the mobile terminal compares the first target face correction picture with a pre-stored face template, and determines whether the first target face correction picture is matched with the pre-stored face template according to a comparison result.
Specifically, the first target face correction picture can be fed into a face comparison model to generate a face code of the first target face correction picture, and this face code is compared with the face codes of the user's pre-stored face templates in the mobile terminal; if the comparison result is smaller than a preset threshold, the comparison succeeds and a match is determined. The face comparison model may be a general-purpose face comparison model. According to the embodiments of the present application, unlocking, payment and other operations can then be executed according to the matching result.
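A minimal sketch of the comparison step follows, assuming the face code is a numeric embedding vector and the comparison result is the Euclidean distance between codes; the threshold value and the dictionary of pre-stored template codes are illustrative assumptions.

import numpy as np

def is_match(face_code, stored_codes, threshold=1.0):
    """Compares the face code of the corrected face picture with the face
    codes of the pre-stored face templates; a comparison result smaller
    than the preset threshold counts as a successful match."""
    face_code = np.asarray(face_code, dtype=np.float64)
    for template_id, code in stored_codes.items():
        distance = np.linalg.norm(face_code - np.asarray(code, dtype=np.float64))
        if distance < threshold:
            return True, template_id
    return False, None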
In this embodiment of the present application, the first target face picture is the target face picture corresponding to any one of the M target face frames detected in the current video frame. The face pictures corresponding to the target face frames detected in the video stream can be compared with the pre-stored face templates, and whether they match is determined according to the comparison results. This can be applied, for example, to face unlocking scenarios.
For a first target face picture detected in a video stream, the present application can detect the first target face key point information corresponding to the first target face frame through the trained face key point detection model, and analyze the pose information of the first target face picture based on the first target face key point information. The analysis of the pose information of the first target face picture may refer to steps S3 to S6 and is not repeated here.
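Steps S3 to S6 are not reproduced here; as one common approach (an assumption, not a quotation of those steps), the sketch below estimates pitch, yaw and roll from a few detected key points by solving a PnP problem against a generic 3D face model with OpenCV. The 3D model coordinates, the six-landmark choice and the focal-length guess are all illustrative.

import cv2
import numpy as np

# Generic 3D reference positions for six facial landmarks (nose tip, chin,
# eye corners, mouth corners); illustrative values, not from this description.
MODEL_POINTS = np.array([
    [0.0, 0.0, 0.0],           # nose tip
    [0.0, -330.0, -65.0],      # chin
    [-225.0, 170.0, -135.0],   # left eye outer corner
    [225.0, 170.0, -135.0],    # right eye outer corner
    [-150.0, -150.0, -125.0],  # left mouth corner
    [150.0, -150.0, -125.0],   # right mouth corner
], dtype=np.float64)

def estimate_pose(image_points, image_size):
    """Returns (pitch, yaw, roll) in degrees from six 2D key points."""
    h, w = image_size
    focal = w  # rough focal-length guess; lens distortion is ignored
    camera_matrix = np.array([[focal, 0, w / 2.0],
                              [0, focal, h / 2.0],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS,
                                  np.asarray(image_points, dtype=np.float64),
                                  camera_matrix, dist_coeffs)
    rot_mat, _ = cv2.Rodrigues(rvec)
    # One common Euler-angle extraction convention (x-y-z):
    pitch = np.degrees(np.arctan2(rot_mat[2, 1], rot_mat[2, 2]))
    yaw = np.degrees(np.arctan2(-rot_mat[2, 0],
                                np.hypot(rot_mat[2, 1], rot_mat[2, 2])))
    roll = np.degrees(np.arctan2(rot_mat[1, 0], rot_mat[0, 0]))
    return pitch, yaw, roll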
Optionally, before the above steps are performed (i.e., before the current video frame of the video stream is acquired), the following steps may also be performed:
(41) The mobile terminal starts a face picture input process to acquire a video stream acquired by a front camera in real time;
(42) The mobile terminal extracts face pictures of all angles of a user from a video stream acquired in real time;
For example, the user can turn his or her head through 360 degrees following the interface prompt of the mobile terminal, and the mobile terminal can acquire face pictures of the user at all angles using a face detection algorithm.
(43) The mobile terminal detects face key points of face pictures of all angles of the user to obtain the face key points of the face pictures of all angles of the user;
the mobile terminal can detect the face key points of the face pictures of the angles of the user through the trained face key point detection model, and face key point coordinates of the face pictures of the angles of the user are obtained.
(44) The mobile terminal analyzes the face pose information of the face pictures of each angle of the user through the face key points of the face pictures of each angle of the user;
specifically, the method of analyzing the face pose information of the face pictures of each angle of the user in step (44) may refer to steps S3 to S6 and is not repeated here.
(45) The mobile terminal corrects the face pictures of each angle of the user according to the face pose information of the face pictures of each angle of the user to obtain face correction pictures of each angle of the user;
(46) The mobile terminal generates a face template based on face correction pictures of all angles of the user.
In the embodiments of the present application, the mobile terminal can correct the face pictures of each angle of the user to obtain face correction pictures of each angle of the user; specifically, the pitch angle, the yaw angle and the roll angle of the face picture of each angle of the user can be corrected to lie within a preset interval. Then, the coordinates of each corresponding face key point across the face correction pictures of all angles of the user are averaged to obtain an average face, and the average face is used as the face template. The face template is compared with the existing face templates to determine whether it is a newly added face template; if so, the average face is fed into the face comparison model for analysis to obtain the face code of the newly added face template.
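A minimal sketch of the averaging step described above, assuming the corrected key point coordinates for each angle are available as arrays of equal size:

import numpy as np

def build_face_template(corrected_keypoints_per_angle):
    """Averages the corresponding key point coordinates across the face
    correction pictures of all angles to obtain the 'average face' used
    as the face template.
    corrected_keypoints_per_angle: list of (num_keypoints, 2) arrays."""
    stacked = np.stack([np.asarray(k, dtype=np.float64)
                        for k in corrected_keypoints_per_angle], axis=0)
    return stacked.mean(axis=0)  # (num_keypoints, 2) average-face key points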
Each face template in the memory of the mobile terminal corresponds to one face code.
In the embodiments of the present application, steps (41) to (46) constitute the face template entry process.
The specific implementation of the face tracking flow in steps 406 to 408 in the embodiment of the present application may refer to steps 101 to 103 shown in fig. 1, and will not be described herein.
When the pose distribution of the pictures used for training in the training data set is uneven, augmenting the training data set enables the trained face key point detection model to be more robust to pose changes and improves the accuracy with which the model calculates the key point position coordinates.
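As an illustrative sketch (the yaw intervals and the 10% share are assumptions, not taken from this description), the pose distribution of a training set can be checked by counting pictures per yaw interval and flagging under-represented intervals for augmentation:

import numpy as np

def pose_buckets(yaw_angles, edges=(-90, -30, -10, 10, 30, 90)):
    """Counts training pictures per yaw interval."""
    counts, _ = np.histogram(np.asarray(yaw_angles, dtype=np.float64), bins=edges)
    return dict(zip(zip(edges[:-1], edges[1:]), counts.tolist()))

def underrepresented_buckets(counts, min_fraction=0.1):
    """Returns the yaw intervals whose share of the data set is below
    min_fraction; pictures in these intervals would be augmented (for
    example by mirroring or synthetic rotation) until the distribution
    satisfies the preset rule."""
    total = sum(counts.values())
    return {interval: n for interval, n in counts.items()
            if total and n / total < min_fraction}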
The foregoing description of the embodiments of the present application has been presented primarily in terms of a method-side implementation. It will be appreciated that, in order to achieve the above-described functions, the mobile terminal may include corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application may divide the functional units of the mobile terminal according to the above method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated in one processing unit. The integrated units may be implemented in hardware or in software functional units. It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice.
In accordance with the foregoing, referring to fig. 6, fig. 6 is a schematic structural diagram of a face detection apparatus provided in an embodiment of the present application, and the face detection apparatus 600 may include an obtaining unit 601, a recognition unit 602, a detection unit 603, a calculating unit 604, and a processing unit 605, where:
an obtaining unit 601, configured to obtain a current video frame of a video stream;
the identifying unit 602 is configured to identify whether the current video frame includes M target face frames, where the target face frames are the same face frame detected by both the current video frame and the previous video frame, and M is a positive integer;
A detection unit 603, configured to detect, when the identification unit 602 identifies that the current video frame includes M target face frames, first target face key point information corresponding to a first target face frame through a trained face key point detection model, where the first target face frame is any one of the M target face frames;
a calculating unit 604, configured to calculate a matching degree between the first target face key point information and the first target tracked face key point information;
a processing unit 605, configured to determine whether to update the face detection result pool according to the matching degree between the first target face key point information and the first target tracked face key point information; the first target tracked face key point information is tracked face key point information corresponding to a first target tracked face frame with the highest overlapping degree IOU of the first target face frame in a face detection result pool of the video stream.
Optionally, the processing unit 605 determines whether to update the face detection result pool according to the matching degree between the first target face key point information and the first target tracked face key point information, specifically: if the matching degree of the first target face key point information and the first target tracked face key point information is smaller than a preset matching degree threshold value, updating the first target tracked face key point information in the face detection result pool to the first target face key point information, and updating the first target tracked face frame in the face detection result pool to the first target face frame; and if the matching degree of the first target face key point information and the first target tracked face key point information is greater than or equal to the preset matching degree threshold value, not updating the face detection result pool.
Optionally, the identifying unit 602 identifies whether the current video frame includes M target face frames, specifically: identifying whether the current video frame contains P face frames, wherein P is a positive integer greater than or equal to M; if the current video frame contains P face frames, respectively calculating the IOU of each pairwise face frame pair between the P face frames and the N tracked face frames in the face detection result pool, wherein N is a natural number; and determining whether there exist M pairwise face frame pairs whose IOU is greater than a first preset threshold, wherein the M pairwise face frame pairs comprise M target face frames and M tracked face frames in one-to-one correspondence with the M target face frames, the M target face frames belong to the P face frames, and the M tracked face frames belong to the N tracked face frames.
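For illustration, the pairwise IOU matching performed by the identifying unit can be sketched as follows; face frames are assumed to be (x1, y1, x2, y2) tuples and the first preset threshold value of 0.5 is an assumption.

def iou(box_a, box_b):
    """Intersection over union of two face frames given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_target_frames(detected_boxes, tracked_boxes, first_threshold=0.5):
    """Pairs each detected face frame with the tracked face frame of highest
    IOU; pairs whose IOU exceeds the first preset threshold are treated as
    the same face, i.e. as target face frames."""
    pairs = []
    for i, d in enumerate(detected_boxes):
        best_j, best_iou = -1, 0.0
        for j, t in enumerate(tracked_boxes):
            v = iou(d, t)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_iou > first_threshold:
            pairs.append((i, best_j, best_iou))
    return pairs  # M pairs of (detected index, tracked index, IOU)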
Optionally, the identifying unit 602 is further configured to, after determining whether M pairwise face frame pairs with IOU greater than the first preset threshold exist, determine, if such M pairwise face frame pairs exist, whether M1 pairwise face frame pairs with IOU greater than a second preset threshold exist among the M pairwise face frame pairs, wherein the M1 pairwise face frame pairs comprise M1 target face frames and M1 tracked face frames in one-to-one correspondence with the M1 target face frames, the M1 target face frames belong to the M target face frames, the M1 tracked face frames belong to the M tracked face frames, the second preset threshold is greater than the first preset threshold, and M1 is a positive integer;
The processing unit 605 is further configured to not update the M1 tracked face frames in the face detection result pool when M1 pairwise face frame pairs greater than a second preset threshold exist in the IOU of the M pairwise face frame pairs;
the processing unit 605 is further configured to update the M tracked face frames to the M target face frames corresponding to each other in a one-to-one manner if M1 pairwise face frame pairs greater than a second preset threshold do not exist in the IOUs of the M pairwise face frame pairs.
Optionally, the identifying unit 602 is further configured to identify, after the acquiring unit 601 acquires a current video frame of a video stream, whether the current video frame includes Q newly added face frames, where the newly added face frames are face frames detected by the current video frame and not detected by a previous video frame, or the newly added face frames are face frames detected by a first frame of the video stream;
the processing unit 605 is further configured to, when the identifying unit 602 identifies that the current video frame includes Q newly added face frames, newly add the Q newly added face frames to the face detection result pool.
Optionally, the face detection apparatus 600 may further include a deletion unit (not shown in the figure).
The identifying unit 602 is further configured to identify whether a face frame with tracking failure is included in the face detection result pool;
the deleting unit is configured to delete, when the identifying unit 602 identifies that the face detection result pool includes a face frame that fails to be tracked, the face frame that fails to be tracked from the face detection result pool.
The face frame with the tracking failure is a face frame which exists in the face detection result pool but does not exist in the current video frame.
Optionally, the deleting unit is further configured to delete the face frame that fails to track from the face detection result pool, and delete face key point information corresponding to the face frame that fails to track in the face detection result pool.
Optionally, the detecting unit 603 is further configured to detect, after the identifying unit 602 identifies whether the current video frame includes Q new face frames, first new face key point information corresponding to a first new face frame through a trained face key point detection model when the current video frame includes the Q new face frames, where the first new face frame is any one of the Q new face frames;
The processing unit 605 is further configured to add the first newly added face key point information to the face detection result pool.
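The addition of newly detected face frames and the deletion of face frames that fail tracking can be sketched together as one pool-maintenance pass (reusing the iou() helper sketched above); the pool structure and the 0.5 threshold are illustrative assumptions.

def maintain_pool(pool, current_boxes, iou_threshold=0.5):
    """Face frames present in the pool but absent from the current video
    frame (tracking failure) are removed together with their key point
    information; detected face frames that match no pooled frame are added
    as new entries, whose key points are filled in later by the trained
    face key point detection model. `pool` maps face id -> (box, keypoints)."""
    matched_ids = set()
    new_boxes = []
    for box in current_boxes:
        best_id, best_iou = None, 0.0
        for fid, (tracked_box, _) in pool.items():
            v = iou(box, tracked_box)
            if v > best_iou:
                best_id, best_iou = fid, v
        if best_iou > iou_threshold:
            matched_ids.add(best_id)
        else:
            new_boxes.append(box)
    for fid in [fid for fid in pool if fid not in matched_ids]:
        del pool[fid]  # delete the face frame and its key point information
    next_id = max(pool) + 1 if pool else 0
    for box in new_boxes:
        pool[next_id] = (box, None)
        next_id += 1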
Optionally, the face detection apparatus 600 may further include an analysis unit 606, a determination unit 607, a data augmentation unit 608, and a training unit 609.
The acquiring unit 601 is further configured to acquire an initial training data set for training before the detecting unit 603 detects first target face key point information corresponding to a first target face frame through a trained face key point detection model, where the initial training data set includes a first preset number of initial face pictures and face key point information corresponding to the first preset number of initial face pictures;
an analysis unit 606, configured to analyze pose distribution information of the first preset number of initial face pictures based on face key point information corresponding to the first preset number of initial face pictures;
a determining unit 607, configured to determine whether the pose distribution information of the first preset number of initial face pictures meets a preset pose distribution rule;
the data augmentation unit 608 is configured to, when the pose distribution information of the first preset number of initial face pictures does not satisfy the preset pose distribution rule, augment the number of face pictures in the initial training data set based on the pose distribution information of the first preset number of initial face pictures to obtain an augmented training data set satisfying the preset pose distribution rule, wherein the augmented training data set comprises the first preset number of initial face pictures and the face key point information corresponding to the first preset number of initial face pictures, as well as a second preset number of augmented face pictures and the face key point information corresponding to the second preset number of augmented face pictures;
And the training unit 609 is configured to train the face key point detection model based on the augmented training data set to obtain the trained face key point detection model.
Optionally, the face detection apparatus 600 may further include a correction unit 610 and a comparison unit 611.
The detection unit 603 detects first target face key point information corresponding to a first target face frame through a trained face key point detection model, specifically: detecting first target face key point information of a first target face picture corresponding to a first target face frame through a trained face key point detection model;
the analyzing unit 606 is further configured to analyze pose information of the first target face picture based on the first target face key point information after the processing unit 605 determines whether to update the face detection result pool according to the matching degree of the first target face key point information and the first target tracked face key point information;
the correcting unit 610 is configured to correct the first target face picture according to the pose information of the first target face picture, to obtain a first target face corrected picture;
The comparing unit 611 is configured to compare the first target face correction picture with a pre-stored face template, and determine whether the first target face correction picture matches the pre-stored face template according to the comparison result.
Optionally, the face detection apparatus 600 may further include a template generation unit 612.
The acquiring unit 601 is further configured to start a face picture input procedure before acquiring a current video frame of a video stream, acquire a video stream acquired by a front camera in real time, and extract face pictures of all angles of a user from the video stream acquired in real time;
the detecting unit 603 is further configured to detect face key points of the face pictures of the angles of the user, so as to obtain the face key points of the face pictures of the angles of the user;
the analysis unit 606 is further configured to analyze face pose information of the face pictures of each angle of the user through face key points of the face pictures of each angle of the user;
the correction unit 610 is further configured to correct the face picture of each angle of the user according to face pose information of the face picture of each angle of the user, so as to obtain a corrected face picture of each angle of the user;
the template generating unit 612 is configured to generate a face template based on the face correction picture of each angle of the user.
The acquiring unit 601 may correspond to a camera module of the mobile terminal, and the identifying unit 602, the detecting unit 603, the calculating unit 604, the processing unit 605, the analyzing unit 606, the determining unit 607, the data amplifying unit 608, the training unit 609, the correcting unit 610, the comparing unit 611, the template generating unit 612, and the deleting unit may correspond to a processor of the mobile terminal.
In the embodiment of the application, the face key point information of the same face frame detected by two continuous frames can be updated, the tracked face key point information in the face detection result pool of the video stream can be ensured to be the latest face key point information with the minimum jitter, and therefore the face feature point detection result error in the video stream can be reduced.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a mobile terminal according to an embodiment of the present application. As shown in fig. 7, the mobile terminal 700 includes a processor 701 and a memory 702, and the processor 701 and the memory 702 may be connected to each other through a communication bus 703. The communication bus 703 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 703 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean that there is only one bus or only one type of bus. The memory 702 is used for storing a computer program comprising program instructions, and the processor 701 is configured to invoke the program instructions to perform the methods shown in fig. 1 to 4.
The processor 701 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the above schemes.
The memory 702 may be, but is not limited to, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be stand-alone and coupled to the processor via the bus, or may be integrated with the processor.
The mobile terminal 700 may further include a camera, a display, a communication interface, an antenna, and other common components, which are not described in detail herein.
In the embodiments of the present application, the face key point information of a face frame detected in two consecutive frames can be updated, so that the tracked face key point information in the face detection result pool of the video stream is kept as the latest face key point information with minimal jitter, thereby reducing the error of the face feature point detection results in the video stream.
The present application also provides a computer storage medium storing a computer program for electronic data exchange, the computer program causing a computer to execute part or all of the steps of any one of the face detection methods described in the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program that causes a computer to perform some or all of the steps of any one of the face detection methods described in the method embodiments above.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, such as the division of the units, merely a logical function division, and there may be additional manners of dividing the actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, or may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units described above may be implemented either in hardware or in the form of software program modules.
The integrated units, if implemented in the form of software program modules, may be stored in a computer-readable memory for sale or use as a stand-alone product. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. And the aforementioned memory includes: a U-disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program that instructs associated hardware, and the program may be stored in a computer readable memory, which may include: flash disk, read-only memory, random access memory, magnetic or optical disk, etc.
The embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is only intended to help understand the method of the present application and its core ideas. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (9)

1. A face detection method, comprising:
acquiring a current video frame of a video stream, and identifying whether the current video frame contains M target face frames, wherein the target face frames are the same face frames detected by the current video frame and the previous video frame, and M is a positive integer;
if M target face frames are included, detecting first target face key point information corresponding to a first target face frame through a trained face key point detection model, wherein the first target face frame is any one of the M target face frames;
Calculating the matching degree of the first target face key point information and the first target tracked face key point information, and determining whether to update a face detection result pool according to the matching degree of the first target face key point information and the first target tracked face key point information; the first target tracked face key point information is tracked face key point information corresponding to a first target tracked face frame with the highest overlapping degree IOU of the first target face frame in a face detection result pool of the video stream; the face detection result pool comprises the tracked face frames in all video frames before the current video frame of the video stream and face key point information corresponding to the tracked face frames; the face detection result pool is used for face recognition of face processing application;
identifying whether the face detection result pool comprises a face frame with failed tracking, if so, deleting the face frame with failed tracking from the face detection result pool, and deleting face key point information corresponding to the face frame with failed tracking in the face detection result pool; the face frames which fail tracking are face frames which exist in the face detection result pool and do not exist in the current video frame;
The determining whether to update the face detection result pool according to the matching degree of the first target face key point information and the first target tracked face key point information includes:
if the matching degree of the first target face key point information and the first target tracked face key point information is smaller than a preset matching degree threshold value, updating the first target tracked face key point information in the face detection result pool to the first target face key point information, and updating the first target tracked face frame in the face detection result pool to the first target face frame;
if the matching degree of the first target face key point information and the first target tracked face key point information is greater than or equal to the preset matching degree threshold value, not updating the face detection result pool;
the identifying whether the current video frame contains M target face frames comprises:
identifying whether the current video frame contains P human face frames, wherein P is a positive integer greater than or equal to M;
if the current video frame contains P face frames, respectively calculating the IOU of each pairwise face frame pair between the P face frames and the N tracked face frames in the face detection result pool, wherein N is a natural number;
determining whether there exist M pairwise face frame pairs whose IOU is greater than a first preset threshold, wherein the M pairwise face frame pairs comprise M target face frames and M tracked face frames which are in one-to-one correspondence with the M target face frames, the M target face frames belong to the P face frames, and the M tracked face frames belong to the N tracked face frames;
if M pairwise face frame pairs whose IOU is greater than the first preset threshold exist, determining whether M1 pairwise face frame pairs whose IOU is greater than a second preset threshold exist among the M pairwise face frame pairs, wherein the M1 pairwise face frame pairs comprise M1 target face frames and M1 tracked face frames corresponding to the M1 target face frames one by one, the M1 target face frames belong to the M target face frames, the M1 tracked face frames belong to the M tracked face frames, the second preset threshold is greater than the first preset threshold, and M1 is a positive integer;
if M1 two-by-two face frame pairs larger than a second preset threshold exist in the IOU of the M two-by-two face frame pairs, not updating the M1 tracked face frames in the face detection result pool;
If M1 two-by-two face frame pairs larger than a second preset threshold value do not exist in the IOU of the M two-by-two face frame pairs, the M tracked face frames are respectively updated into the M target face frames in a one-to-one correspondence.
2. The method of claim 1, wherein after the obtaining the current video frame of the video stream, the method further comprises:
identifying whether the current video frame contains Q newly added face frames, wherein the newly added face frames are detected by the current video frame and are not detected by the previous video frame, or the newly added face frames are detected by the first frame of the video stream;
and if so, newly adding the Q newly added face frames to the face detection result pool.
3. The method of claim 2, wherein after said identifying whether the current video frame contains Q newly added face frames, the method further comprises:
if the current video frame comprises the Q newly added face frames, detecting first newly added face key point information corresponding to a first newly added face frame through a trained face key point detection model, wherein the first newly added face frame is any one of the Q newly added face frames;
And newly adding the first newly added face key point information to the face detection result pool.
4. A method according to any one of claims 1 to 3, wherein before the step of detecting the first target face key point information corresponding to the first target face frame by using the trained face key point detection model, the method further includes:
acquiring an initial training data set for training, wherein the initial training data set comprises a first preset number of initial face pictures and face key point information corresponding to the first preset number of initial face pictures;
analyzing pose distribution information of the first preset number of initial face pictures based on the face key point information corresponding to the first preset number of initial face pictures;
determining whether the pose distribution information of the first preset number of initial face pictures meets a preset pose distribution rule;
if the pose distribution information of the first preset number of initial face pictures does not meet the preset pose distribution rule, augmenting the number of face pictures in the initial training data set based on the pose distribution information of the first preset number of initial face pictures to obtain an augmented training data set meeting the preset pose distribution rule, wherein the augmented training data set comprises the first preset number of initial face pictures, the face key point information corresponding to the first preset number of initial face pictures, a second preset number of augmented face pictures, and the face key point information corresponding to the second preset number of augmented face pictures;
and training the face key point detection model based on the augmented training data set to obtain the trained face key point detection model.
5. The method of claim 4, wherein the detecting, by the trained face keypoint detection model, the first target face keypoint information corresponding to the first target face frame comprises:
detecting first target face key point information of a first target face picture corresponding to a first target face frame through a trained face key point detection model;
after determining whether to update the face detection result pool according to the matching degree of the first target face key point information and the first target tracked face key point information, the method further comprises:
analyzing pose information of the first target face picture based on the first target face key point information;
correcting the first target face picture according to the pose information of the first target face picture to obtain a first target face correction picture;
and comparing the first target face correction picture with a pre-stored face template, and determining whether the first target face correction picture is matched with the pre-stored face template according to a comparison result.
6. The method of claim 5, wherein prior to the obtaining the current video frame of the video stream, the method further comprises:
Starting a face picture input process to acquire a video stream acquired by a front camera in real time;
extracting face pictures of all angles of a user from the video stream acquired in real time;
detecting face key points of the face pictures of the angles of the user to obtain the face key points of the face pictures of the angles of the user;
analyzing the face pose information of the face pictures of the angles of the user through the face key points of the face pictures of the angles of the user;
correcting the face picture of each angle of the user according to the face posture information of the face picture of each angle of the user to obtain a face correction picture of each angle of the user;
and generating a face template based on the face correction pictures of the angles of the user.
7. A face detection apparatus, comprising:
an obtaining unit, configured to obtain a current video frame of a video stream;
the identification unit is used for identifying whether the current video frame contains M target face frames, wherein the target face frames are the same face frames detected by the current video frame and the previous video frame, and M is a positive integer;
the detection unit is used for detecting first target face key point information corresponding to a first target face frame through a trained face key point detection model when the identification unit identifies that the current video frame contains M target face frames, and the first target face frame is any one of the M target face frames;
The computing unit is used for computing the matching degree of the first target face key point information and the first target tracked face key point information;
the processing unit is used for determining whether to update the face detection result pool according to the matching degree of the first target face key point information and the first target tracked face key point information; the first target tracked face key point information is tracked face key point information corresponding to a first target tracked face frame with the highest overlapping degree IOU of the first target face frame in a face detection result pool of the video stream; the face detection result pool comprises the tracked face frames in all video frames before the current video frame of the video stream and face key point information corresponding to the tracked face frames; the face detection result pool is used for face recognition of face processing application;
the identification unit is also used for identifying whether the face detection result pool comprises a face frame with tracking failure or not;
a deleting unit, configured to delete, when the identifying unit identifies that the face detection result pool includes the face frame that fails to track, the face frame that fails to track from the face detection result pool, and delete face key point information corresponding to the face frame that fails to track in the face detection result pool; the face frames which fail tracking are face frames which exist in the face detection result pool and do not exist in the current video frame;
The processing unit decides whether to update the face detection result pool according to the matching degree of the first target face key point information and the first target tracked face key point information, specifically: if the matching degree of the first target face key point information and the first target tracked face key point information is smaller than a preset matching degree threshold value, updating the first target tracked face key point information in the face detection result pool to the first target face key point information, and updating the first target tracked face frame in the face detection result pool to the first target face frame; if the matching degree of the first target face key point information and the first target tracked face key point information is greater than or equal to the preset matching degree threshold value, not updating the face detection result pool;
the identifying unit identifies whether the current video frame contains M target face frames, including: identifying whether the current video frame contains P face frames, wherein P is a positive integer greater than or equal to M; if the current video frame contains P face frames, respectively calculating the IOU of each pairwise face frame pair between the P face frames and the N tracked face frames in the face detection result pool, wherein N is a natural number; determining whether there exist M pairwise face frame pairs whose IOU is greater than a first preset threshold, wherein the M pairwise face frame pairs comprise M target face frames and M tracked face frames which are in one-to-one correspondence with the M target face frames, the M target face frames belong to the P face frames, and the M tracked face frames belong to the N tracked face frames;
The identification unit is further configured to determine, when M pairwise face frame pairs with IOU greater than the first preset threshold exist, whether M1 pairwise face frame pairs with IOU greater than a second preset threshold exist among the M pairwise face frame pairs, wherein the M1 pairwise face frame pairs include M1 target face frames and M1 tracked face frames corresponding to the M1 target face frames one to one, the M1 target face frames belong to the M target face frames, the M1 tracked face frames belong to the M tracked face frames, the second preset threshold is greater than the first preset threshold, and M1 is a positive integer;
the processing unit is further configured to not update the M1 tracked face frames in the face detection result pool when M1 pairwise face frame pairs greater than a second preset threshold exist in the IOU of the M pairwise face frame pairs;
the processing unit is further configured to update the M tracked face frames to the M target face frames corresponding to each other one by one, if M1 pairwise face frame pairs greater than a second preset threshold do not exist in the IOUs of the M pairwise face frame pairs.
8. A mobile terminal comprising a processor and a memory, the memory for storing a computer program, the computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-6.
9. A computer readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-6.
CN201910893159.3A 2019-09-20 2019-09-20 Face detection method and device, mobile terminal and storage medium Active CN110688930B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910893159.3A CN110688930B (en) 2019-09-20 2019-09-20 Face detection method and device, mobile terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910893159.3A CN110688930B (en) 2019-09-20 2019-09-20 Face detection method and device, mobile terminal and storage medium

Publications (2)

Publication Number Publication Date
CN110688930A CN110688930A (en) 2020-01-14
CN110688930B true CN110688930B (en) 2023-07-18

Family

ID=69109732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910893159.3A Active CN110688930B (en) 2019-09-20 2019-09-20 Face detection method and device, mobile terminal and storage medium

Country Status (1)

Country Link
CN (1) CN110688930B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111479062B (en) * 2020-04-15 2021-09-28 上海摩象网络科技有限公司 Target object tracking frame display method and device and handheld camera
CN111539333B (en) * 2020-04-24 2021-06-29 湖北亿咖通科技有限公司 Method for identifying gazing area and detecting distraction of driver
CN111563466B (en) * 2020-05-12 2023-10-10 Oppo广东移动通信有限公司 Face detection method and related product
CN111985348B (en) * 2020-07-29 2024-05-10 深思考人工智能科技(上海)有限公司 Face recognition method and system
CN112101109B (en) * 2020-08-11 2024-04-30 深圳数联天下智能科技有限公司 Training method and device for face key point detection model, electronic equipment and medium
CN112818908A (en) * 2021-02-22 2021-05-18 Oppo广东移动通信有限公司 Key point detection method, device, terminal and storage medium
CN112699856A (en) * 2021-03-24 2021-04-23 成都新希望金融信息有限公司 Face ornament identification method and device, electronic equipment and storage medium
CN113065523B (en) * 2021-04-26 2023-06-16 上海哔哩哔哩科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN114863506B (en) * 2022-03-18 2023-05-26 珠海优特电力科技股份有限公司 Authentication method, device and system of admission permission and identity authentication terminal
CN116055867B (en) * 2022-05-30 2023-11-24 荣耀终端有限公司 Shooting method and electronic equipment
CN115050129B (en) * 2022-06-27 2023-06-13 北京睿家科技有限公司 Data processing method and system for intelligent access control

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009253848A * 2008-04-09 2009-10-29 Canon Inc Face expression recognizing device, imaging device, method, and program

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103824068B (en) * 2014-03-19 2018-06-01 上海看看智能科技有限公司 Face payment authentication system and method
CN106845385A (en) * 2017-01-17 2017-06-13 腾讯科技(上海)有限公司 The method and apparatus of video frequency object tracking
CN107644204B (en) * 2017-09-12 2020-11-10 南京凌深信息科技有限公司 Human body identification and tracking method for security system
CN109063581A * 2017-10-20 2018-12-21 奥瞳系统科技有限公司 Enhanced face detection and face tracking method and system for limited resources embedded vision system
CN107784294B (en) * 2017-11-15 2021-06-11 武汉烽火众智数字技术有限责任公司 Face detection and tracking method based on deep learning
CN108921008B (en) * 2018-05-14 2024-06-11 深圳市商汤科技有限公司 Portrait identification method and device and electronic equipment
CN109671103A (en) * 2018-12-12 2019-04-23 易视腾科技股份有限公司 Method for tracking target and device
CN109829435B (en) * 2019-01-31 2023-04-25 深圳市商汤科技有限公司 Video image processing method, device and computer readable medium
CN109948478B (en) * 2019-03-06 2021-05-11 中国科学院自动化研究所 Large-scale unbalanced data face recognition method and system based on neural network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009253848A * 2008-04-09 2009-10-29 Canon Inc Face expression recognizing device, imaging device, method, and program

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A robust real-time face key point tracking method; Xu Weiwei; Li Jun; Computer Engineering (Issue 04); full text *
Research on face tracking algorithms based on feature classifiers and template matching. China Excellent Master's Theses Electronic Journals. 2018, full text. *
Real-time video face recognition based on visual tracking; Ren Zihan; Yang Shuangyuan; Journal of Xiamen University (Natural Science Edition) (Issue 03); full text *

Also Published As

Publication number Publication date
CN110688930A (en) 2020-01-14

Similar Documents

Publication Publication Date Title
CN110688930B (en) Face detection method and device, mobile terminal and storage medium
JP7365445B2 (en) Computing apparatus and method
US10990803B2 (en) Key point positioning method, terminal, and computer storage medium
CN110135246B (en) Human body action recognition method and device
EP2842075B1 (en) Three-dimensional face recognition for mobile devices
CN108369653B (en) Eye pose recognition using eye features
US10817705B2 (en) Method, apparatus, and system for resource transfer
WO2021174819A1 (en) Face occlusion detection method and system
CN112488064B (en) Face tracking method, system, terminal and storage medium
TW201911130A (en) Method and device for remake image recognition
US11257293B2 (en) Augmented reality method and device fusing image-based target state data and sound-based target state data
US11120535B2 (en) Image processing method, apparatus, terminal, and storage medium
CN110705478A (en) Face tracking method, device, equipment and storage medium
CN110852310B (en) Three-dimensional face recognition method and device, terminal equipment and computer readable medium
CN112364827B (en) Face recognition method, device, computer equipment and storage medium
US20130236068A1 (en) Calculating facial image similarity
WO2020215283A1 (en) Facial recognition method, processing chip and electronic device
CN107423306B (en) Image retrieval method and device
JP2009157767A (en) Face image recognition apparatus, face image recognition method, face image recognition program, and recording medium recording this program
TWI763205B (en) Method and apparatus for key point detection, electronic device, and storage medium
WO2022262209A1 (en) Neural network training method and apparatus, computer device, and storage medium
CN111881740A (en) Face recognition method, face recognition device, electronic equipment and medium
CN109740511B (en) Facial expression matching method, device, equipment and storage medium
CN109376618B (en) Image processing method and device and electronic equipment
JP4659722B2 (en) Human body specific area extraction / determination device, human body specific area extraction / determination method, human body specific area extraction / determination program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant