WO2020238374A1 - Method, apparatus, and device for facial key point detection, and storage medium - Google Patents


Info

Publication number
WO2020238374A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
information
key
frame
key point
Prior art date
Application number
PCT/CN2020/081262
Other languages
French (fr)
Chinese (zh)
Inventor
项伟
张小伟
Original Assignee
广州市百果园信息技术有限公司
Application filed by 广州市百果园信息技术有限公司
Publication of WO2020238374A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G06V40/165: Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Definitions

  • This application relates to the field of computer vision technology, for example, to a facial key point detection method, apparatus, device, and storage medium.
  • In the field of computer vision, face video data plays a particularly important role because of its practical application scenarios in fields such as biometric verification, surveillance and security, and live video streaming.
  • Facial key point detection is a very important step in face image processing. Its main function is to accurately locate the positions of facial key points, such as the eyes, nose, mouth corners, and face contour points, in a picture, in preparation for subsequent operations such as face alignment and face recognition.
  • In implementation, facial key point detection is usually a step that follows face detection.
  • The face detector typically feeds the detected face position information, for example given in the form of a rectangular or square box, together with the corresponding face picture into a key point detection algorithm, and the computed result is determined as the facial key point positions.
  • Facial key point detection algorithms based on deep convolutional networks offer a large improvement in accuracy over traditional facial key point algorithms.
  • However, facial key point detection methods built on a single deep convolutional network are usually computationally intensive, and the network structure of the deep convolutional network must be carefully designed and arranged; otherwise it is difficult to achieve real-time processing on platforms with limited computing resources, for example on mobile terminals such as mobile phones.
  • The embodiments of this application provide a new facial key point detection method, system, device, and storage medium, to address the constraints of limited computing power, small storage space, and high real-time requirements that facial key point detection methods face on mobile terminals.
  • An embodiment of the present application provides a facial key point detection method, including: acquiring image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information; determining face box position information according to the key frame information; performing facial key point detection through a pre-trained first neural network based on the face box position information to obtain initial key point position information; and performing facial key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain a facial key point detection result of the video, where the facial key point detection result includes facial key point position information corresponding to the key frame information and facial key point position information corresponding to the non-key frame information.
  • An embodiment of the present application further provides a facial key point detection apparatus, including:
  • a video image frame acquisition module, configured to acquire image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information;
  • a first facial key point detection module, configured to determine face box position information according to the key frame information, and to perform facial key point detection through a pre-trained first neural network based on the face box position information to obtain initial key point position information;
  • a second facial key point detection module, configured to perform facial key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain a facial key point detection result of the video, where the facial key point detection result includes facial key point position information corresponding to the key frame information and facial key point position information corresponding to the non-key frame information.
  • An embodiment of the present application further provides a device, including a processor and a memory; the memory stores at least one instruction, and when the instruction is executed by the processor, the device performs the facial key point detection method described above.
  • An embodiment of the present application further provides a computer-readable storage medium; when the instructions in the storage medium are executed by a processor of a device, the device can perform the facial key point detection method described above.
  • FIG. 1 is a schematic flowchart of the steps of a facial key point detection method in an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of the steps of a facial key point detection method in an optional embodiment of the present application;
  • FIG. 3 is a schematic diagram of the facial key point detection and tracking flow for a video in an example of the present application;
  • FIG. 4 is a schematic diagram of the flow of correcting the facial key points of the previous frame in an example of the present application;
  • FIG. 5 is a schematic structural block diagram of a facial key point detection apparatus in an embodiment of the present application;
  • FIG. 6 is a schematic structural block diagram of a device in an example of the present application.
  • Facial key point detection algorithms are generally designed for a single static image; to detect facial key points in a video, the usual approaches are to process the video frame by frame, or to track the face with a general object tracking algorithm and then detect the facial key points. Facial key point tracking schemes can be roughly divided into two categories: the first performs face detection and facial key point detection frame by frame; the second performs face detection on the first image frame, tracks the face box in subsequent image frames with a general object tracking method, and applies the key point detection algorithm to each tracked face; if tracking fails in some image frame and no face is found, the face detector is applied again to detect faces.
  • The first category requires face detection and key point detection for every image frame and does not fully exploit the correlation between adjacent frames, so its speed is limited. In addition, because each image frame is processed independently, key point jitter is likely to occur, which affects downstream modules that depend on key point stability, such as a module that applies face sticker effects based on the detected facial key points, and degrades the user experience.
  • The second category reapplies the face detector to detect the face whenever tracking fails in an image frame and no face is found.
  • A potential problem is that faces in videos, and in mobile videos in particular, often undergo rapid changes in pose, scale, occlusion, and expression, which causes the object tracking method to fail frequently and the face detector to be invoked again and again.
  • In addition, general facial key point algorithms are sensitive to the relative position of the input face within the face box; that is, if the input face box is perturbed, the outputs of the key point detection algorithm before and after the perturbation can differ greatly, and the face box obtained by tracking is less accurate than the face box produced by the detector, which leads to errors in key point detection.
  • In short, these facial key point detection methods have problems such as high computational complexity and easily losing the tracked target.
  • Moreover, most application scenarios of facial key point detection are on mobile terminals such as mobile phones, so a facial key point detection solution is constrained by limited computing power, small storage space, and high real-time requirements.
  • For this reason, an embodiment of the present application proposes a facial key point detection method.
  • The face box position information can be determined according to the key frame information in the video information, so that the initial key point position information is determined through the first neural network according to the face box position information.
  • Facial key point detection can then be performed through the second neural network to obtain the facial key point detection result of the video; that is, a two-stage neural network is used to detect the facial key points in the video information, which makes it possible to process facial key point detection in video efficiently.
  • FIG. 1 shows a schematic flowchart of the steps of a facial key point detection method in an embodiment of the present application.
  • The facial key point detection method can be used in face-related vision applications such as face recognition, face sticker effects, and face-swapping effects, and may include the following steps:
  • Step 110: Obtain image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information.
  • In this embodiment, a video may include one or more video frames; each video frame may include an image frame used to display the video picture and/or an audio frame used to play the video sound.
  • The image frame information of the video can represent the image frames in the video; for example, it can refer to the image information in a video frame, which is used to display the video picture so that the user can watch the video.
  • In this step, the image frame information of the video to be detected may be obtained, so that facial key point detection can be performed according to the face picture information in the image frame information.
  • The face picture information characterizes the face picture(s) contained in a video frame. For example, when a video frame contains the face picture of one person, the face picture displayed in that video frame can be determined based on the face picture information in the image frame information; likewise, when a video frame contains multiple face pictures, the face pictures of the multiple people displayed in the video frame can be determined from it.
  • The embodiment of the present application may divide the obtained image frame information into key frame information and non-key frame information, so that the face box position is detected based on the key frame information, that is, step 120 is performed.
  • The key frame information characterizes the key image frames of the video (referred to as key frames), and the non-key frame information characterizes the non-key image frames of the video (referred to as non-key frames).
  • Step 120: Determine face box position information according to the key frame information, and perform facial key point detection through a pre-trained first neural network based on the face box position information to obtain initial key point position information.
  • In this step, a preset face detector, such as a Multi-Task Cascaded Convolutional Neural Network (MTCNN) used for joint face detection and alignment, can detect the key frame information and generate the face box position information.
  • The face box position information represents the position of the face box and can be used to determine where the face box is displayed in an image frame of the video.
  • After the face box position information is determined, the face box picture can be cropped from the key frame based on it; that is, a face box picture containing the face is cropped, according to the face box position, from the image frame serving as a video key frame, and the corresponding face box picture information is generated to characterize the cropped face picture. The generated face box picture information can then be input into the pre-trained first neural network for facial key point detection, to preliminarily detect the facial key point positions.
  • The output information of the first neural network can be used as the initial key point position information, which preliminarily determines the facial key point positions, such as the approximate positions of the facial key points in the current key frame.
  • Step 130: Based on the initial key point position information and the image frame information of the video, perform facial key point detection through a pre-trained second neural network to obtain the facial key point detection result of the video.
  • The facial key point detection result includes the facial key point position information corresponding to the key frame information and the facial key point position information corresponding to the non-key frame information.
  • The facial key point detection result of the video can be used to determine the facial key point positions of every image frame in the video and may include the facial key point position information corresponding to each piece of image frame information, such as the facial key point position information corresponding to the key frame information and the facial key point position information corresponding to the non-key frame information.
  • The facial key point position information corresponding to a piece of image frame information characterizes the facial key point positions of that image frame; for example, the facial key point position information corresponding to the key frame information characterizes the facial key point positions in the key frame, and the facial key point position information corresponding to the non-key frame information characterizes the facial key point positions in the non-key frame.
  • In an implementation, a picture cropping box can be generated according to the approximate facial key point positions in the current key frame and then used to crop the face picture in the current key frame; that is, the picture cropping box is applied to the image frame of the video to obtain the key frame face picture information, which characterizes the picture obtained by this cropping.
  • The obtained key frame face picture information can be input into the pre-trained second neural network for facial key point detection, and the information output by the second neural network is determined as the facial key point information of the current key frame. Facial key point detection and tracking can then be performed on the non-key frames of the video based on the facial key point information of the key frame, to obtain the facial key point information of the non-key frames, and the facial key point detection result of the video can be generated based on the facial key point information of the key frame and/or the facial key point information of the non-key frames.
  • In summary, the embodiment of the present application determines the face box position information according to the key frame information in the image frame information of the video, performs facial key point detection through the first neural network based on the face box position information to obtain the initial key point position information, and then performs facial key point detection through the second neural network based on the initial key point position information and the image frame information of the video. That is, a two-stage neural network is used for facial key point detection, which avoids the high computational complexity, heavy computation, and poor real-time performance of facial key point detection implemented with a single deep convolutional network; the facial key point positions can be detected quickly and stably, and facial key point detection and tracking in video can be handled quickly and stably.
  • In an implementation, one or more image frames of the video can be selected as key frames according to a preset rule, for example selecting one frame out of every N image frames of the video as a key frame; the remaining image frames can be used as non-key frames, that is, the (N-1) consecutive image frames adjacent to a key frame are determined as the non-key frames corresponding to that key frame.
  • The value of N can be determined according to the application scenario; that is, the value of N can change across different application scenarios.
  • The face detector can be used to detect the face box position in the key frame, so that the face box picture information is cropped from the key frame information according to the face box position, and the first neural network can then perform facial key point detection on the face box picture information to detect the approximate facial key points in the key frame.
  • Before the face box position information is determined according to the key frame information, the facial key point detection method provided in this embodiment may further include: selecting, from the image frame information of the video, the key frame information and the non-key frame information corresponding to the key frame information. Subsequently, the face box position information may be determined according to the key frame information, so that the approximate facial key point positions are determined through the first neural network based on the face box position information.
  • For example, the t-th frame picture of the video can be determined as key frame information, and the pictures from the (t+1)-th frame to the (t+N-1)-th frame of the video can be determined as non-key frame information; this non-key frame information is associated with the t-th frame picture serving as the key frame information, so that it is determined as the non-key frame information corresponding to that key frame information, where t is an integer greater than zero. The sketch below illustrates this partition rule.
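  • As an illustration of this frame partition rule, the following is a minimal sketch (assuming Python; the function and variable names are hypothetical rather than taken from the patent):

```python
def split_frames(num_frames: int, n: int):
    """Partition frame indices 0..num_frames-1 into key frames and the
    non-key frames associated with each key frame (one key frame per N)."""
    groups = []
    for t in range(0, num_frames, n):
        key = t
        non_key = list(range(t + 1, min(t + n, num_frames)))
        groups.append((key, non_key))
    return groups

# Example: a 10-frame video with N = 4
# -> [(0, [1, 2, 3]), (4, [5, 6, 7]), (8, [9])]
print(split_frames(10, 4))
```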
  • In an implementation, determining the face box position information according to the key frame information may include: inputting the key frame information into a face detector, where the face detector is used to detect the face box position; and determining the output information of the face detector as the face box position information. The face box position can then be determined based on the face box position information, so that the face box picture is cropped from the key frame according to the face box position and the corresponding face box picture information is generated. For example, the MTCNN serving as the face detector detects the face box position, and the box corresponding to each face box position is expanded into a square, with the center of the box as the center of the square and the long side of the box as the side of the square; the face box picture information cropped by this square can then be input into the first neural network for facial key point detection. A sketch of this square crop follows.
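  • A minimal sketch of this square expansion and crop, assuming Python with NumPy (the helper names are hypothetical):

```python
import numpy as np

def expand_to_square(x0, y0, w, h):
    """Expand a detector box (x0, y0, w, h) to a square centered on the
    box center, with side length equal to the box's long side."""
    side = max(w, h)
    cx, cy = x0 + w / 2.0, y0 + h / 2.0
    return cx - side / 2.0, cy - side / 2.0, side

def crop_square(image: np.ndarray, x0, y0, w, h) -> np.ndarray:
    """Crop the square-expanded face box from an HxWxC image,
    clipping to the image boundary."""
    sx, sy, side = expand_to_square(x0, y0, w, h)
    x1, y1 = int(max(sx, 0)), int(max(sy, 0))
    x2 = int(min(sx + side, image.shape[1]))
    y2 = int(min(sy + side, image.shape[0]))
    return image[y1:y2, x1:x2]
```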
  • The first neural network can serve as the first-stage facial key point detection network in the facial key point detection process; it performs key point detection on the face box picture of a video key frame and outputs the initial key point position information, so that the subsequent process can perform facial key point detection on key frames and/or non-key frames according to the initial key point position information, allowing the facial key point positions in the video to be detected quickly and stably.
  • The initial key point position information can be used to preliminarily determine the approximate positions of the facial key points.
  • In an implementation, performing facial key point detection through the pre-trained second neural network to obtain the facial key point detection result of the video includes: generating a picture cropping box according to the initial key point position information; cropping the key frame information with the picture cropping box to obtain key frame face picture information; and inputting the key frame face picture information into the second neural network for facial key point detection to obtain the facial key point position information corresponding to the key frame information.
  • After the initial key point position information is determined through the first neural network, a picture cropping box can be generated based on the initial key point position information, so that the picture cropping box is used to crop the face picture information corresponding to the current key frame according to the approximate facial key point positions; that is, the key frame information is cropped to obtain the key frame face picture information.
  • The key frame face picture information can then be used as the input of the second neural network for facial key point detection, so that the facial key point positions of the key frame are determined accurately, and the output information of the second neural network is determined as the facial key point position information corresponding to the key frame information. Subsequent facial key point detection and tracking can then be performed on the non-key frame information corresponding to the key frame information based on this facial key point position information; that is, the information between adjacent frames of the video is used to track the facial key points and generate the facial key point position information corresponding to the non-key frame information. The facial key point detection result of the video can be generated based on the facial key point position information corresponding to the key frame information and/or the facial key point information corresponding to the non-key frame information, achieving the goal of high-speed facial key point detection in video.
  • In an implementation, performing facial key point detection through the pre-trained second neural network to obtain the facial key point detection result of the video may further include: cropping the non-key frame information corresponding to the key frame information with the picture cropping box to obtain non-key frame picture information; and, when the non-key frame picture information includes face picture information, generating non-key frame face picture information according to the facial key point position information corresponding to the key frame information, and inputting the non-key frame face picture information into the second neural network for facial key point detection to obtain the facial key point position information corresponding to the non-key frame information.
  • Referring to FIG. 2, the facial key point detection method may include the following steps:
  • Step 210: Obtain image frame information of the video, where the image frame information of the video includes key frame information and non-key frame information.
  • Step 220: Select key frame information and the non-key frame information corresponding to the key frame information from the image frame information of the video.
  • Step 230: Input the key frame information into the face detector, where the face detector is used to detect the face box position.
  • Step 240: Determine the output information of the face detector as the face box position information.
  • In this embodiment, after the key frame information is selected from the video, it can be input into the face detector to detect the face box position of the key frame; the output information of the face detector is determined as the face box position information, and based on it the face box picture information is cropped from the key frame according to the face box position, so that preliminary facial key point detection can be performed, that is, step 250 is performed.
  • Step 250: Perform facial key point detection through the pre-trained first neural network based on the face box position information to obtain initial key point position information.
  • Step 260: Generate a picture cropping box according to the initial key point position information.
  • Step 270: Crop the key frame information with the picture cropping box to obtain key frame face picture information, and input the key frame face picture information into the second neural network for facial key point detection to obtain the facial key point position information corresponding to the key frame information.
  • In this embodiment, the face box picture information may be cropped from the key frame according to the face box position given by the face box position information, and the cropped face picture information is input into the first neural network for facial key point detection to obtain the initial key point position information; a picture cropping box can then be generated based on the initial key point position information and used to crop the key frame according to the approximate facial key point positions, yielding the key frame face picture information.
  • The key frame face picture information can be used to characterize the face picture in the video key frame.
  • The key frame face picture information can be used as the input of the second neural network for facial key point detection, so that the facial key point positions of the key frame are determined accurately and stably based on the information output by the second neural network; this output information is determined as the facial key point position information corresponding to the key frame information, based on which the facial key points of the non-key frames are subsequently detected and tracked.
  • As an example of the key frame processing flow shown in FIG. 3, the t-th frame and the (t+N)-th frame of the video can be determined as key frame information.
  • The MTCNN serving as the face detector can be used to detect the face box position, and each box determined by the MTCNN can be expanded into a square; the face picture is cropped according to this square, as shown by the "crop face picture I" module in FIG. 3. The cropped face picture can be scaled to 70 pixels in width and height and input into the facial key point detection network C serving as the first neural network, which outputs 106 facial key point coordinates as the initial key point position information.
  • Then a face picture can be cropped according to the smallest square box formed by the 106 facial key point coordinates, as shown by the "crop face picture II" module in FIG. 3; that is, the smallest square box is used as the picture cropping box, and cropping the key frame information with it yields the key frame face picture information. The cropped face picture can be scaled to 70 pixels in width and height and then input into the facial key point detection network F serving as the second neural network, which outputs more accurate 106 facial key point coordinates as the facial key point position information corresponding to the key frame information.
  • The facial key point detection result of the video can then be generated based on the facial key point position information corresponding to the key frame information, subsequent key point processing can be performed based on it, and the facial key points of the non-key frames can be detected and tracked according to it; that is, step 280 is performed, using the information between adjacent frames to directly track the facial key points in the video, so that facial key point detection in the video is processed efficiently. A sketch of this key frame flow is given below.
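  • The key frame flow just described can be sketched as follows, assuming Python with NumPy; face_detector, net_c, and net_f stand in for the MTCNN, network C, and network F, and crop/resize are caller-supplied image helpers, since the patent does not specify any of them at the code level:

```python
import numpy as np

def min_enclosing_square(points: np.ndarray):
    """Smallest square box (x, y, side, side) enclosing (x, y) key points."""
    x0, y0 = points.min(axis=0)
    x1, y1 = points.max(axis=0)
    side = max(x1 - x0, y1 - y0)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return (cx - side / 2.0, cy - side / 2.0, side, side)

def detect_key_frame(frame, face_detector, net_c, net_f, crop, resize):
    """Two-stage key point detection on one key frame.

    face_detector(frame) -> square-expanded face box (x, y, w, h);
    net_c / net_f map a 70x70 face crop to 106 (x, y) key points,
    assumed here to be returned in frame coordinates; crop(frame, box)
    and resize(img, (w, h)) are image helpers supplied by the caller.
    """
    box = face_detector(frame)                         # face box position
    face_i = resize(crop(frame, box), (70, 70))        # "crop face picture I"
    coarse = net_c(face_i)                             # initial key points
    crop_box = min_enclosing_square(coarse)            # picture cropping box
    face_ii = resize(crop(frame, crop_box), (70, 70))  # "crop face picture II"
    refined = net_f(face_ii)                           # refined 106 key points
    return refined, crop_box
```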
  • Both the facial key point detection network C and the facial key point detection network F in this example can extract features through multiple convolution layers and pooling layers, and can use a fully connected layer to regress the relative positions of the key points.
  • Although the two facial key point detection networks have the same network structure, fewer channels are used in each layer of network C, so network C, serving as the first neural network, is lighter than network F, serving as the second neural network.
  • In addition, the input pictures of the two facial key point detection networks are cropped differently: the input picture of network C is cropped by the face box, while the input picture of network F is cropped based on the 106 facial key points, and an input picture cropped according to the positions of the 106 facial key points fits the face more closely.
  • The two facial key point detection networks can be trained separately, and the weights of their convolution layers can differ, which reduces the key point inaccuracy caused by a face box that does not fit the face closely enough.
  • In this way, the embodiment of the present application can use a two-stage neural network to detect facial key points progressively and obtain more accurate key point positions: the facial key point detection network C regresses the rough key point positions, and the facial key point detection network F refines them into more accurate key points. A minimal sketch of such a network follows.
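  • The patent does not give layer-by-layer specifications for networks C and F; the following is a minimal sketch of the structure described above (convolution and pooling layers followed by a fully connected regression layer), assuming PyTorch, with hypothetical channel counts. Network C simply uses fewer channels per layer than network F:

```python
import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    """Conv + pooling feature extractor with a fully connected layer that
    regresses the relative positions of 106 key points from a 70x70 RGB
    face crop. `width` scales the per-layer channel counts, so the lighter
    network C and the heavier network F can share this definition."""

    def __init__(self, width: int = 16, num_points: int = 106):
        super().__init__()
        self.num_points = num_points
        self.features = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(width, 2 * width, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(2 * width, 4 * width, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 70 -> 35 -> 17 -> 8 after three 2x2 poolings
        self.fc = nn.Linear(4 * width * 8 * 8, num_points * 2)

    def forward(self, x):                          # x: (B, 3, 70, 70)
        f = self.features(x).flatten(1)
        return self.fc(f).view(-1, self.num_points, 2)

net_c = KeypointNet(width=8)     # fewer channels per layer: lighter
net_f = KeypointNet(width=16)    # same structure, more channels
```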
  • Step 280: Crop the non-key frame information corresponding to the key frame information with the picture cropping box to obtain non-key frame picture information.
  • In this embodiment, after the facial key point position information corresponding to the key frame information is obtained, the current frame can be cropped with the picture cropping box; for example, the (t+1)-th frame picture shown in FIG. 3 is cropped, and the corresponding non-key frame picture information is generated based on the cropped picture.
  • The non-key frame picture information characterizes a picture cropped from a non-key frame of the video according to the facial key point positions of the key frame.
  • The non-key frame picture information can be used as the input of a face detection and tracking network, which detects and tracks the face in the non-key frame picture information, for example by determining whether the non-key frame picture information includes face picture information.
  • The face detection and tracking network can serve as a face detector for non-key frames; for example, it can be the face detector tracking network (Tracking Net, TNet) shown in FIG. 3.
  • The face detector TNet can determine whether the non-key frame picture information contains face picture information, that is, whether the input picture is a face picture; when the input picture is judged to be a face picture, it outputs the relative position of the face box, the relative positions of the facial key points, and so on.
  • The face picture information may include various kinds of information characterizing the face picture, such as the image information corresponding to the face picture, which is not limited in this embodiment.
  • Step 290: Generate non-key frame face picture information according to the facial key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, and input the non-key frame face picture information into the second neural network for facial key point detection to obtain the facial key point position information corresponding to the non-key frame information.
  • In an implementation, in the case that the non-key frame picture information includes face picture information, before the non-key frame face picture information is generated based on the facial key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, the method further includes: inputting the non-key frame picture information into the face detection and tracking network to obtain the output information of the face detection and tracking network, where the output information includes face probability information; and determining, based on the face probability information, whether the non-key frame picture information includes face picture information.
  • The face probability information can be used to determine whether a non-key frame contains a face picture, for example as the probability that the non-key frame contains a face picture. When the value of the face probability information exceeds a preset threshold, it can be determined that the non-key frame picture information contains face picture information; correspondingly, when the value does not exceed that threshold, it can be determined that the non-key frame picture information does not contain face picture information. If the non-key frame picture information does not include face picture information, it can be determined that the current non-key frame contains no face picture; the non-key frame can then be ignored, and no facial key point detection is performed on it.
  • The non-key frame face picture information represents the face picture information of a non-key frame and may include the facial key point information of the non-key frame information, for example the coordinates of five facial key points in the non-key frame: the position coordinates of the left eye center, the right eye center, the nose tip, the left mouth corner, and the right mouth corner.
  • In an implementation, the output information of the face detection and tracking network may also include face box relative position information and key point relative position information.
  • In the case that the non-key frame picture information includes face picture information, before generating the non-key frame face picture information according to the facial key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, the method may further include: determining the facial key point information of the non-key frame information according to the face box relative position information and the key point relative position information.
  • The face box relative position information can represent the relative position of the regressed face box, for example a 4-dimensional vector output by the face detection and tracking network through its output layer; the key point relative position information can represent the relative positions of the five facial key points, for example a 10-dimensional vector output by the face detection and tracking network through its output layer.
  • In an implementation, the face detection and tracking network in the embodiment of the present application can determine, based on the non-key frame picture information, whether the picture displayed in the non-key frame is a face picture; if it is a face picture, the network regresses the position of the face box in the current frame and outputs the position coordinates of the left eye center, right eye center, nose tip, left mouth corner, and right mouth corner, that is, it outputs the facial key point information of the non-key frame information as the output information of the face detection and tracking network.
  • The five facial key points output by the face detection and tracking network can be a subset of the 106 key points output by the facial key point detection network C and the facial key point detection network F. For example, in the face detection and tracking network, a 2-dimensional vector (p0, p1) can be output through the fully connected layer FC as the face probability information, indicating the probability that the input picture is/is not a face, where p0 represents the probability of a non-face and p1 represents the probability of a face.
  • A 4-dimensional vector (x0, y0, w, h) can be output as the face box relative position information to indicate the relative position of the regressed face box, where (x0, y0) are the coordinates of the upper-left corner of the face box in the picture and (w, h) are the width and height of the face box. For example, if the box information of the non-key frame picture information input to TNet is (x0, y0, w, h) and the output 4-dimensional vector is (dx0, dy0, dx1, dy1), these numbers indicate the position of the detected box relative to the input box, and the corresponding detection box is (x0 + dx0*w, y0 + dy0*h, (dx1 - dx0)*w, (dy1 - dy0)*h). A decoding sketch follows.
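  • A small sketch of decoding TNet's outputs under this convention, assuming Python; the probability threshold, the completed width/height terms of the detection box, and the assumption that the key point offsets are relative to the input box are inferences for illustration rather than values stated verbatim in the source:

```python
def decode_tnet(input_box, probs, box_offsets, point_offsets, threshold=0.5):
    """Decode tracking network outputs for one non-key frame crop.

    input_box:     (x0, y0, w, h) of the crop fed to TNet
    probs:         (p0, p1) = (non-face, face) probabilities
    box_offsets:   (dx0, dy0, dx1, dy1), relative to the input box
    point_offsets: 10 values, (du, dv) per key point, assumed relative
                   to the input box as well
    Returns None when no face is found, otherwise (box, points).
    """
    p0, p1 = probs
    if p1 <= threshold:          # face probability too low:
        return None              # ignore this non-key frame
    x0, y0, w, h = input_box
    dx0, dy0, dx1, dy1 = box_offsets
    box = (x0 + dx0 * w, y0 + dy0 * h, (dx1 - dx0) * w, (dy1 - dy0) * h)
    points = [(x0 + du * w, y0 + dv * h)
              for du, dv in zip(point_offsets[0::2], point_offsets[1::2])]
    return box, points
```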
  • TNet can be a network with a relatively small amount of computation: because the position of the face in the picture changes little between adjacent frames of a video, the facial key point position information passed from the previous frame already gives the approximate position of the face in the current frame, so only a simple face detection and tracking network is needed to regress the face box position.
  • Since the key points passed from the previous frame may no longer fit the current frame exactly, a key point correction module is introduced to correct the coordinate positions of these facial key points.
  • For example, a linear transformation can be used to correct the positions of the 106 facial key points passed from the previous frame with the new information about the current frame learned by TNet; a face picture is then cropped with the smallest square box formed by the corrected 106 facial key point coordinates, as shown by "crop face picture III" in FIG. 3, and the cropped face picture can be scaled to 70 pixels in width and height and input, as the non-key frame face picture information, into the facial key point detection network F for facial key point detection, obtaining the 106 facial key point coordinates of the current frame as the facial key point position information corresponding to the non-key frame information.
  • In an implementation, generating the non-key frame face picture information based on the facial key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information may include: correcting the facial key point position information corresponding to the key frame information according to the facial key point information of the non-key frame information to obtain key point correction information; determining key point tracking position information according to the key point correction information and the initial key point position information; generating a face picture cropping box according to the key point tracking position information; and cropping the non-key frame information and/or the non-key frame picture information with the face picture cropping box to obtain the non-key frame face picture information.
  • In this embodiment, the facial key point coordinates of the previous frame can be used as the approximate facial key point positions of the current frame, so that the information of adjacent frames is used to achieve facial key point detection and tracking on non-key frames.
  • Considering that the face may have moved between frames, a correction step is added based on the five facial key point coordinates output by TNet: the face picture is cropped according to the corrected 106 facial key points, so that the cropped face picture fits the face of the current frame more closely, which corresponds to the role the picture cropping box plays in the key frame processing flow.
  • For example, the face detection and tracking network TNet can regress the coordinates of five facial key points, denoted {(u1', v1'), ..., (u5', v5')} (as shown in FIG. 4, TNet outputs 5 facial key point coordinates). The coordinates of the same five facial key points can be extracted from the 106 facial key point coordinates output by the facial key point detection network F for the previous frame, denoted {(u1, v1), ..., (u5, v5)}, and the remaining 101 facial key point coordinates output by network F are denoted {(u6, v6), ..., (u106, v106)}. The extracted five key point coordinates {(u1, v1), ..., (u5, v5)} are used as the facial key point positions to be aligned with TNet's output; that is, a linear transformation A = S·R together with a displacement b is sought that maps them onto {(u1', v1'), ..., (u5', v5')}, where S is a scaling factor, R is a 2x2 rotation transformation matrix, and b is a 2-dimensional displacement vector.
  • In an implementation, the linear transformation information (A*, b*) can be obtained by the following steps:
  • Step S10: According to μ = (1/5)·Σ(ui, vi) and μ' = (1/5)·Σ(ui', vi'), find the mean coordinates of the two groups of facial key points, and center the two groups of coordinates: (ui, vi) - μ serve as the centered facial key point coordinates of the previous frame, and (ui', vi') - μ' serve as the centered facial key point coordinates of the current frame.
  • Step S40: Determine A* and b* according to the optimal 2x2 rotation matrix R* and the optimal scaling factor S* obtained from the centered coordinates, where A* = S*·R* and b* = μ' - A*·μ.
  • Subsequently, the linear transformation information (A*, b*) can be used to correct, with the new information about the current frame learned by TNet, the positions of the 106 facial key points passed from the previous frame; that is, each of the 106 previous-frame key points p is mapped to A*·p + b*, so that the face picture cropped according to the corrected 106 key points fits the face in the current frame more closely. Facial key point detection is then performed on this crop by the facial key point detection network F, obtaining the 106 facial key point coordinates of the current frame. A sketch of this correction is given below.
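  • The correction described above amounts to a standard least-squares similarity (Procrustes) alignment between the two sets of five key points. Below is a sketch in Python with NumPy; the SVD-based computation of R* and S* is the textbook solution of this alignment problem, supplied here as a plausible reading of the intermediate steps that this text does not preserve:

```python
import numpy as np

def fit_similarity(prev_pts: np.ndarray, cur_pts: np.ndarray):
    """Least-squares similarity transform (A* = S*·R*, b*) mapping
    prev_pts (5x2, five key points from the previous frame) onto
    cur_pts (5x2, the five key points regressed by TNet)."""
    mu_p, mu_c = prev_pts.mean(axis=0), cur_pts.mean(axis=0)  # Step S10
    p, c = prev_pts - mu_p, cur_pts - mu_c                    # centering
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(c.T @ p)
    d = np.sign(np.linalg.det(u @ vt))          # guard against reflections
    r = u @ np.diag([1.0, d]) @ vt              # optimal 2x2 rotation R*
    scale = (s[0] + d * s[1]) / (p ** 2).sum()  # optimal scaling factor S*
    a = scale * r                               # A* = S*·R*     (Step S40)
    b = mu_c - a @ mu_p                         # b* = mu' - A*·mu
    return a, b

def correct_keypoints(points106: np.ndarray, a: np.ndarray, b: np.ndarray):
    """Apply the correction p -> A*·p + b* to all 106 previous-frame points."""
    return points106 @ a.T + b
```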
  • Referring to FIG. 5, the facial key point detection apparatus may include the following modules:
  • the video image frame acquisition module 510, configured to acquire image frame information of the video, where the image frame information of the video includes key frame information and non-key frame information;
  • the first facial key point detection module 520, configured to determine the face box position information according to the key frame information, and to perform facial key point detection through the pre-trained first neural network based on the face box position information to obtain the initial key point position information;
  • the second facial key point detection module 530, configured to perform facial key point detection through the pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the facial key point detection result of the video, where the facial key point detection result includes the facial key point position information corresponding to the key frame information and the facial key point position information corresponding to the non-key frame information.
  • The above facial key point detection apparatus can be integrated in a device. The device can consist of two or more physical entities or of one physical entity; for example, the device can be a personal computer (PC), a computer, a mobile phone, a tablet device, a personal digital assistant, a server, a messaging device, a game console, and so on.
  • An embodiment of the present application further provides a device, including a processor and a memory. At least one instruction is stored in the memory, and when the instruction is executed by the processor, the device performs the facial key point detection method described in the foregoing method embodiments.
  • Referring to FIG. 6, the device may include a processor 60, a memory 61, a display screen 62 with a touch function, an input device 63, an output device 64, and a communication device 65. These components may be connected by a bus or in other ways; in FIG. 6, connection by a bus is taken as the example.
  • The memory 61 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the facial key point detection method of any embodiment of the present application (for example, the video image frame acquisition module 510, the first facial key point detection module 520, and the second facial key point detection module 530 in the facial key point detection apparatus).
  • The processor 60 runs the software programs, instructions, and modules stored in the memory 61 to execute the functional applications and data processing of the device, that is, to implement the facial key point detection method described above.
  • When the processor 60 executes the one or more programs stored in the memory 61, the following operations are implemented: acquiring image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information; determining face box position information according to the key frame information, and performing facial key point detection through a pre-trained first neural network based on the face box position information to obtain initial key point position information; and performing facial key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the facial key point detection result of the video, where the facial key point detection result includes the facial key point position information corresponding to the key frame information and the facial key point position information corresponding to the non-key frame information.
  • An embodiment of the present application further provides a computer-readable storage medium; when the instructions in the storage medium are executed by a processor of a device, the device can perform the facial key point detection method described in the above method embodiments.
  • The method includes: acquiring image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information; determining face box position information according to the key frame information, and performing facial key point detection through a pre-trained first neural network based on the face box position information to obtain initial key point position information; and performing facial key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the facial key point detection result of the video, where the facial key point detection result includes the facial key point position information corresponding to the key frame information and the facial key point position information corresponding to the non-key frame information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Geometry (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A method, apparatus and device for facial key point detection, and a storage medium, the method comprising: acquiring image frame information of a video, the image frame information of the video comprising key frame information and non-key frame information; determining face box position information according to the key frame information; detecting facial key points by means of a pre-trained first neural network on the basis of the face box position information so as to obtain initial key point position information; and detecting facial key points by means of a pre-trained second neural network on the basis of the initial key point position information and the image frame information of the video so as to obtain a facial key point detection result of the video, the facial key point detection result comprising facial key point position information corresponding to the key frame information and facial key point position information corresponding to the non-key frame information.

Description

Method, apparatus, device, and storage medium for facial key point detection
This application claims priority to the Chinese patent application No. 201910473174.2, filed with the Chinese Patent Office on May 31, 2019, the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of computer vision technology, for example, to a facial key point detection method, apparatus, device, and storage medium.
Background
In the field of computer vision, algorithm development based on video data has long received extensive attention from academia and industry. Among video data, face video data plays a particularly important role because of its practical application scenarios in fields such as biometric verification, surveillance and security, and live video streaming. Facial key point detection is a very important step in face image processing; its main function is to accurately locate the positions of facial key points, such as the eyes, nose, mouth corners, and face contour points, in a picture, in preparation for subsequent operations such as face alignment and face recognition.
In implementation, facial key point detection is usually a step that follows face detection. The face detector typically feeds the detected face position information, for example given in the form of a rectangular or square box, together with the corresponding face picture into a key point detection algorithm, and the computed result is determined as the facial key point positions. Facial key point detection algorithms based on deep convolutional networks offer a large improvement in accuracy over traditional facial key point algorithms. However, facial key point detection methods built on a single deep convolutional network are usually computationally intensive, and the network structure of the deep convolutional network must be carefully designed and arranged; otherwise it is difficult to achieve real-time processing on platforms with limited computing resources, for example on mobile terminals such as mobile phones.
发明内容Summary of the invention
本申请实施例提供一种新的人脸关键点检测方法、系统、设备以及存储介质,以解决人脸关键点检测方法在移动端中受计算能力有限、存储空间较小及实时性要求高等限制的问题。The embodiments of this application provide a new face key point detection method, system, device and storage medium to solve the limitation of limited computing power, small storage space and high real-time requirements in the mobile terminal of the face key point detection method. The problem.
本申请实施例提供了一种人脸关键点检测方法,包括:获取视频的图像帧信息,其中,所述视频的图像帧信息包含关键帧信息和非关键帧信息;根据所述关键帧信息确定人脸框位置信息;基于所述人脸框位置信息,通过预先训练 的第一神经网络进行人脸关键点检测,得到初始关键点位置信息;基于所述初始关键点位置信息和所述视频的图像帧信息,通过预先训练的第二神经网络进行人脸关键点检测,得到所述视频的人脸关键点检测结果,其中,所述人脸关键点检测结果包含所述关键帧信息对应的人脸关键点位置信息和所述非关键帧信息对应的人脸关键点位置信息。An embodiment of the present application provides a method for detecting key points of a face, including: acquiring image frame information of a video, wherein the image frame information of the video includes key frame information and non-key frame information; determining according to the key frame information Face frame position information; based on the face frame position information, face key point detection is performed through the pre-trained first neural network to obtain initial key point position information; based on the initial key point position information and the video Image frame information, face key point detection is performed through a pre-trained second neural network to obtain the face key point detection result of the video, wherein the face key point detection result includes the person corresponding to the key frame information Face key point position information and face key point position information corresponding to the non-key frame information.
An embodiment of the present application further provides a face key point detection apparatus, including:
a video image frame acquisition module, configured to acquire image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information;
a first face key point detection module, configured to determine face frame position information according to the key frame information, and to perform face key point detection through a pre-trained first neural network based on the face frame position information to obtain initial key point position information; and
a second face key point detection module, configured to perform face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain a face key point detection result of the video, where the face key point detection result includes face key point position information corresponding to the key frame information and face key point position information corresponding to the non-key frame information.
An embodiment of the present application further provides a device, including a processor and a memory; the memory stores at least one instruction, and the instruction is executed by the processor so that the device performs the above face key point detection method.
An embodiment of the present application further provides a computer-readable storage medium; when the instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the above face key point detection method.
Description of the drawings
FIG. 1 is a schematic flowchart of the steps of a face key point detection method in an embodiment of the present application;
FIG. 2 is a schematic flowchart of the steps of a face key point detection method in an optional embodiment of the present application;
FIG. 3 is a schematic diagram of a face key point detection and tracking process for a video in an example of the present application;
FIG. 4 is a schematic flowchart of correcting the face key points of a previous frame in an example of the present application;
FIG. 5 is a schematic structural block diagram of an embodiment of a face key point detection apparatus in an embodiment of the present application;
FIG. 6 is a schematic structural block diagram of a device in an example of the present application.
Detailed description
The present application is described below with reference to the drawings and embodiments. The specific embodiments described here are only used to explain the application, not to limit it. For ease of description, the drawings show only the parts related to the present application rather than all of its structures or components.
Most face key point detection algorithms are designed for single static images; for face key points in a video, the usual approaches are to process the video frame by frame, or to track the face with a general object tracking algorithm and then detect the face key points. Face key point tracking schemes can be roughly divided into two categories. The first category performs face detection and face key point detection frame by frame. The second category performs face detection on the first image frame, then tracks the face frame in subsequent image frames with a general object tracking method using the detected face as the target, runs a key point detection algorithm on every tracked face, and re-applies the face detector whenever tracking fails on an image frame and no face is found. The first category requires face detection and key point detection on every image frame and does not exploit the correlation between adjacent frames, so its speed is limited; in addition, because every image frame is processed independently, key point jitter easily occurs, which affects subsequent modules that depend on key point stability, such as a module that applies face sticker special effects based on the detected face key points, and degrades the user experience. The second category re-applies the face detector when tracking fails on an image frame and no face is found; although it is more stable in terms of key points than the first category, general object tracking methods are usually time-consuming, and two further problems remain. One problem is that faces in videos, such as mobile phone videos, often undergo rapid changes in pose, scale, occlusion, and expression, which causes the object tracking method to fail and the face detector to be re-applied. The other problem is that general face key point algorithms are sensitive to the relative position of the input face within the face frame: if the input face frame is perturbed, the key point detection algorithm outputs very different results before and after the perturbation, and a face frame obtained by tracking fits the face less tightly than one obtained by the detector, which leads to errors in key point detection. It can be seen that these face key point detection methods suffer from high computational complexity and easily lose the tracked target. In addition, most application scenarios of face key point detection are on mobile terminals such as mobile phones, where face key point detection solutions face constraints such as limited computing power, small storage space, and high real-time requirements.
To achieve fast and stable face key point detection and tracking, an embodiment of the present application proposes a face key point detection method. After video information is acquired, face frame position information can be determined according to the key frame information in the video information, initial key point position information can be determined from the face frame position information through a first neural network, and face key point detection can then be performed through a second neural network based on the initial key point position information to obtain the face key point detection result of the video. That is, a two-stage neural network is used to detect face key points in the video information, so that face key point detection in videos can be processed efficiently.
Referring to FIG. 1, a schematic flowchart of the steps of a face key point detection method in an embodiment of the present application is shown. The face key point detection method can be used in face vision applications such as face recognition, special-effect stickers on faces, and face-swap effects, and may include the following steps:
Step 110: acquire image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information.
A video may contain one or more video frames; each video frame may contain an image frame used to display the video picture and/or an audio frame used to play the video sound. The image frame information of the video in this embodiment can represent the image frames in the video; for example, it can refer to the image information in a video frame, which can be used to display the video picture so that the user can watch the video.
When detecting face key points in a video, the embodiment of the present application can acquire the image frame information of the video to be processed, so as to perform face key point detection according to the face picture information in the image frame information. The face picture information can represent the face pictures contained in a video frame. For example, when a video frame contains the face picture of one person, the face picture displayed by that person in the video frame can be determined based on the face picture information in the image frame information; when a video frame contains multiple face pictures, the face pictures of the multiple people displayed in the video frame can likewise be determined based on the face picture information in the image frame information.
After acquiring the image frame information of the video, the embodiment of the present application can divide the acquired image frame information into key frame information and non-key frame information, so as to detect the face frame position based on the key frame information, that is, to perform step 120. The key frame information represents key image frames (key frames for short) in the video, and the non-key frame information represents non-key image frames (non-key frames for short) in the video.
Step 120: determine face frame position information according to the key frame information, and perform face key point detection through a pre-trained first neural network based on the face frame position information to obtain initial key point position information.
In this embodiment, after the key frame information representing a key frame is acquired, a preset face detector, such as the Multi-Task Cascaded Convolutional Network (MTCNN) for joint face detection and alignment, can be used to detect the key frame information and produce face frame position information. The face frame position information represents the position of the face frame and determines where the face frame is displayed in the image frame of the video. Subsequently, a face frame picture can be cropped from the key frame based on the face frame position information; that is, a face frame picture containing the face is cropped from the image frame serving as the video key frame according to the face frame position, and the corresponding face frame picture information is generated to represent the cropped face picture. The generated face frame picture information can then be input into the pre-trained first neural network for face key point detection, so as to preliminarily detect the positions of the face key points. For example, the output information of the first neural network can be taken as the initial key point position information, which can subsequently be used to preliminarily determine the face key point positions, such as the approximate positions of the face key points in the current key frame.
Step 130: perform face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video, to obtain the face key point detection result of the video.
The face key point detection result includes the face key point position information corresponding to the key frame information and the face key point position information corresponding to the non-key frame information.
The face key point detection result of the video in this embodiment can be used to determine the face key point positions of every image frame in the video, and may include the face key point position information corresponding to the information of each image frame, such as the face key point position information corresponding to the key frame information and the face key point position information corresponding to the non-key frame information. The face key point position information corresponding to image frame information represents the face key point positions of that image frame: for example, the face key point position information corresponding to the key frame information represents the face key point positions in a key frame, and the face key point position information corresponding to the non-key frame information represents the face key point positions in a non-key frame.
In implementation, after the initial key point position information is determined in the embodiment of the present application, a picture cropping frame can be generated based on the initial key point position information according to the approximate positions of the face key points in the current key frame, and the face picture in the current key frame can then be cropped with this picture cropping frame; that is, the picture cropping frame can be used to crop the image frame of the video to obtain key frame face picture information, which represents the picture obtained by this cropping. Subsequently, the key frame face picture information can be input into the pre-trained second neural network for face key point detection, and the information output by the second neural network can be determined as the face key point information of the current key frame. Face key point detection and tracking can then be performed on the non-key frames in the video based on the face key point information of the key frame to obtain the face key point information of the non-key frames, so that the face key point detection result of the video can be generated based on the face key point information of the key frame and/or the face key point information of the non-key frames.
In summary, after acquiring the image frame information of a video, the embodiment of the present application can determine face frame position information according to the key frame information in the image frame information of the video, perform face key point detection through the first neural network based on the face frame position information to obtain initial key point position information, and then perform face key point detection through the second neural network based on the initial key point position information and the image frame information of the video. That is, a two-stage neural network is used for face key point detection, which solves the problems of high computational complexity, large amount of calculation, and poor real-time processing found in related-art schemes that implement face key point detection with a single deep convolutional network, and enables the positions of face key points to be detected quickly and stably, achieving fast and stable detection and tracking of face key points in videos.
In actual processing, after the image frames of the video are acquired in this embodiment, one or more of the image frames can be selected as key frames according to a preset rule. For example, one frame out of every N image frames of the video is selected as a key frame, and the remaining image frames are taken as non-key frames; that is, the (N-1) consecutive image frames adjacent to a key frame are determined as the non-key frames corresponding to that key frame, where the value of N can be determined according to, and may vary with, different application scenarios. Subsequently, the face detector can be used to detect the face frame position in the key frame, the face frame image information can be cropped from the key frame information according to the face frame position, and face key point detection can then be performed on the face frame image information through the first neural network to detect the approximate face key points in the key frame.
On the basis of the above embodiment, optionally, before determining the face frame position information according to the key frame information, the face key point detection method provided in this embodiment may further include: selecting, from the image frame information of the video, key frame information and the non-key frame information corresponding to the key frame information. Subsequently, the face frame position information can be determined according to the key frame information, so that the approximate positions of the face key points can be determined through the first neural network based on the face frame position information. For example, the t-th frame picture in the video can be determined as key frame information, and the pictures from the (t+1)-th frame to the (t+N-1)-th frame in the video can be determined as non-key frame information associated with the t-th frame picture serving as the key frame information, that is, as the non-key frame information corresponding to the above key frame information, where t can be an integer greater than 0.
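As a minimal illustration of the frame-splitting rule above, the following Python sketch partitions frame indices into key frames and their associated non-key frames; the function name and the example value of N are illustrative only, since N is scenario-dependent:

    def split_frames(num_frames, n):
        """Partition frame indices: one key frame out of every n frames;
        the (n - 1) frames that follow a key frame are the non-key frames
        associated with it."""
        keys, assoc = [], {}
        for t in range(0, num_frames, n):
            keys.append(t)
            assoc[t] = list(range(t + 1, min(t + n, num_frames)))
        return keys, assoc

    keys, assoc = split_frames(num_frames=10, n=5)
    # keys == [0, 5]; assoc == {0: [1, 2, 3, 4], 5: [6, 7, 8, 9]}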
Optionally, determining the face frame position information according to the key frame information may include: inputting the key frame information into a face detector, where the face detector is used to detect the face frame position; and determining the output information of the face detector as the face frame position information. The face frame position can thus be determined based on the face frame position information, so that a face frame picture is cropped from the key frame according to the face frame position and the corresponding face frame picture information is generated. For example, MTCNN serving as the face detector detects the face frame positions, and the box corresponding to each face frame position is expanded into a square, with the center of the box as the center of the square and the long side of the box as the side of the square; the face frame picture information cropped by this square can then be input into the first neural network for face key point detection. The first neural network can serve as the first-stage face key point detection network in the face key point detection process; it performs key point detection on the face frame pictures of the video key frames and outputs initial key point position information, so that the subsequent flow can also perform face key point detection on key frames and/or non-key frames based on this initial key point position information, enabling the face key point positions in the video to be detected quickly and stably. The initial key point position information can be used to preliminarily determine the approximate positions of the face key points.
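The square expansion described above can be sketched as follows, assuming detector boxes are given as (top-left x, top-left y, width w, height h); clipping the square to the image bounds is left to the caller:

    def expand_to_square(x, y, w, h):
        """Expand a detector box into a square whose center is the box
        center and whose side equals the box's longer side."""
        side = max(w, h)
        cx, cy = x + w / 2.0, y + h / 2.0
        return cx - side / 2.0, cy - side / 2.0, side, side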
In an optional embodiment of the present application, performing face key point detection through the pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the face key point detection result of the video includes: generating a picture cropping frame according to the initial key point position information; cropping the key frame information with the picture cropping frame to obtain key frame face picture information; and inputting the key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the key frame information.
After the initial key point position information is determined in the embodiment of the present application, a picture cropping frame can be generated based on the initial key point position information, and the face picture information corresponding to the current key frame can be cropped out with this picture cropping frame according to the approximate positions of the face key points; that is, the key frame information is cropped to obtain key frame face picture information. Subsequently, the key frame face picture information can be input into the second neural network for face key point detection, so as to accurately determine the face key point positions of the key frame, and the output information of the second neural network can be determined as the face key point position information corresponding to the key frame information. Face key point detection and tracking can then be performed on the non-key frame information corresponding to the key frame information based on this face key point position information; that is, the information between adjacent frames of the video is used to track face key points and generate the face key point position information corresponding to the non-key frame information, so that the face key point detection result of the video can be generated based on the face key point position information corresponding to the key frame information and/or the non-key frame information, achieving high-speed processing of face key point detection in videos.
On the basis of the above embodiment, optionally, performing face key point detection through the pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the face key point detection result of the video may further include: cropping the non-key frame information corresponding to the key frame information with the picture cropping frame to obtain non-key frame picture information; and when the non-key frame picture information contains face picture information, generating non-key frame face picture information according to the face key point position information corresponding to the key frame information, and inputting the non-key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the non-key frame information.
Referring to FIG. 2, a schematic flowchart of the steps of a face key point detection method in an optional embodiment of the present application is shown. The face key point detection method may include the following steps:
Step 210: acquire image frame information of a video.
The image frame information of the video includes key frame information and non-key frame information.
Step 220: select, from the image frame information of the video, key frame information and the non-key frame information corresponding to the key frame information.
Step 230: input the key frame information into a face detector.
The face detector is used to detect the face frame position.
Step 240: determine the output information of the face detector as the face frame position information.
After the key frame information is selected from the video in the embodiment of the present application, it can be input into the face detector so that the face frame position of the key frame is detected by the face detector; the output information of the face detector can then be determined as the face frame position information, and based on this information the face frame picture information is cropped from the key frame according to the face frame position for preliminary face key point detection, that is, step 250 is performed.
Step 250: perform face key point detection through the pre-trained first neural network based on the face frame position information to obtain initial key point position information.
Step 260: generate a picture cropping frame according to the initial key point position information.
Step 270: crop the key frame information with the picture cropping frame to obtain key frame face picture information, and input the key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the key frame information.
After the face frame position information is determined in the embodiment of the present application, the face frame picture information can be cropped from the key frame according to the face frame position, and the cropped face picture information is input into the first neural network for face key point detection to obtain initial key point position information. A picture cropping frame can then be generated based on the initial key point position information, and the key frame is cropped with this picture cropping frame according to the approximate positions of the face key points to obtain key frame face picture information, which represents the face picture in the video key frame. Subsequently, the key frame face picture information can be input into the second neural network for face key point detection, so that the face key point positions in the key frame are determined accurately and stably based on the information output by the second neural network. For example, the information output by the second neural network can be determined as the face key point position information corresponding to the key frame information, so that the face key points of the non-key frames can subsequently be detected and tracked based on it.
As an example of the present application, in the case where one frame out of every N frames of the video is selected as a key frame and the remaining frames are taken as non-key frames, as shown in FIG. 3, the t-th frame and the (t+N)-th frame in the video can be determined as key frame information. In each key frame, MTCNN serving as the face detector can be used to detect the face frame positions, each box determined by MTCNN can be expanded into a square, and a face picture is cropped according to this square (the "crop face picture I" module shown in FIG. 3). The cropped face picture can be scaled to 70 pixels in both width and height and then input into the face key point detection network C serving as the first neural network, obtaining 106 face key point coordinates as the initial key point position information. Subsequently, a face picture can be cropped according to the minimal square frame formed by the 106 face key point coordinates (the "crop face picture II" module shown in FIG. 3); that is, this minimal square frame serves as the picture cropping frame with which the key frame information is cropped to obtain the key frame face picture information. For example, the cropped face picture can be scaled to 70 pixels in both width and height and input into the face key point detection network F serving as the second neural network, obtaining more accurate 106 face key point coordinates as the face key point position information corresponding to the key frame information. The face key point detection result of the corresponding video can thus be generated based on the face key point position information corresponding to the key frame information, so that subsequent key point processing can be performed on that basis, and the face key points of the non-key frames can be detected and tracked according to the obtained face key point position information corresponding to the key frame information, that is, step 280 is performed, using the information between adjacent frames to directly track the face key points in the video and thereby process face key point detection in the video efficiently.
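The cropping used around the two networks can be sketched as follows; cv2 is an assumed dependency for resizing, and the helpers named in the trailing comment (detect_face_square, net_c, net_f) are hypothetical stand-ins for MTCNN plus square expansion and for the two keypoint networks:

    import numpy as np
    import cv2  # assumed available for resizing

    def min_square(points):
        """Smallest centered square covering all keypoints: the bounding
        box of the points, expanded so its side equals the longer side."""
        x0, y0 = points.min(axis=0)
        x1, y1 = points.max(axis=0)
        side = max(x1 - x0, y1 - y0)
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        return cx - side / 2.0, cy - side / 2.0, side

    def crop_resize(image, x, y, side, size=70):
        """Crop a square patch and scale it to size x size pixels."""
        x, y, side = int(round(x)), int(round(y)), int(round(side))
        patch = image[max(y, 0):y + side, max(x, 0):x + side]
        return cv2.resize(patch, (size, size))

    # Key frame flow, with hypothetical helpers:
    #   sq = detect_face_square(frame)          # MTCNN box -> (x, y, side)
    #   rough = net_c(crop_resize(frame, *sq))  # 106 rough keypoints
    #   fine = net_f(crop_resize(frame, *min_square(rough)))  # refined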
Both the face key point detection network C and the face key point detection network F in this example can extract features through multiple convolution layers and pooling layers, and regress the relative positions of the key points through a fully connected layer. Although the two face key point detection networks have the same network structure, fewer channels are used in every layer of the face key point detection network C, so network C serving as the first neural network is lighter than network F serving as the second neural network. In addition, the input pictures of the two networks are cropped differently: the input picture of network C can be cropped using the face frame, while the input picture of network F can be cropped according to the 106 face key points, and an input picture cropped according to the positions of these 106 face key points fits the face more tightly. Furthermore, the two networks can be trained independently, and the weights of every convolution layer can differ, reducing the impact of inaccurate key points caused by a face frame that does not fit the face tightly enough. Thus, the embodiment of the present application obtains more accurate key point positions through a progressive, two-stage neural network approach to face key point detection: network C regresses the rough positions of the key points, and network F refines them into more accurate key points.
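The patent does not disclose the layer counts or channel widths of networks C and F, so the following PyTorch sketch only illustrates the stated structure: stacked convolution and pooling layers extracting features from a 70x70 input, a fully connected layer regressing the key point positions, and a channel-width multiplier that makes network C lighter than network F while keeping the same structure. All sizes here are assumptions:

    import torch
    import torch.nn as nn

    class KeypointNet(nn.Module):
        """Conv + pooling feature extractor with a fully connected
        regressor of keypoint positions; widths are illustrative."""
        def __init__(self, width_mult=1.0, num_points=106):
            super().__init__()
            self.num_points = num_points
            c1, c2, c3 = (max(1, int(c * width_mult)) for c in (16, 32, 64))
            self.features = nn.Sequential(
                nn.Conv2d(3, c1, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),   # 70x70 -> 35x35
                nn.Conv2d(c1, c2, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),   # 35x35 -> 17x17
                nn.Conv2d(c2, c3, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),   # 17x17 -> 8x8
            )
            self.fc = nn.Linear(c3 * 8 * 8, num_points * 2)

        def forward(self, x):  # x: (batch, 3, 70, 70)
            feats = self.features(x).flatten(1)
            return self.fc(feats).view(-1, self.num_points, 2)

    # Same structure, different widths; the two are trained independently.
    net_c = KeypointNet(width_mult=0.5)  # lighter first-stage network C
    net_f = KeypointNet(width_mult=1.0)  # second-stage network F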
Step 280: crop the non-key frame information corresponding to the key frame information with the picture cropping frame to obtain non-key frame picture information.
When processing a non-key frame, this embodiment can crop the current frame with the picture cropping frame, for example cropping the (t+1)-th frame picture shown in FIG. 3, and generate the corresponding non-key frame picture information based on the cropped picture. The non-key frame picture information represents a picture cropped from a non-key frame of the video according to the face key point positions of the key frame. Subsequently, the non-key frame picture information can be used as the input of a face detection and tracking network, so that the face in the non-key frame picture information is detected and tracked through this network, for example by determining whether the non-key frame picture information contains face picture information. The face detection and tracking network can serve as the face detector for non-key frames; it can be, for example, the face detector tracking network (Tracking Net, TNet) shown in FIG. 3, which can judge whether the non-key frame picture information contains face picture information, that is, whether the input picture is a face picture, and, when it is, output the relative position of the face frame, the relative positions of the face key points, and so on. The face picture information may include various kinds of information representing the face picture, such as the image information corresponding to the face picture, which is not limited in this embodiment.
Step 290: generate non-key frame face picture information according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, and input the non-key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the non-key frame information.
Optionally, in the embodiment of the present application, after the non-key frame picture information is obtained and before the non-key frame face picture information is generated (in the case where the non-key frame picture information contains face picture information) according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, the method may further include: inputting the non-key frame picture information into the face detection and tracking network to obtain the output information of the face detection and tracking network, where the output information contains face probability information; and determining, based on the face probability information, whether the non-key frame picture information contains face picture information. The face probability information can be used to determine whether a non-key frame contains a face picture, for example by representing the probability that the non-key frame contains a face picture: when the value of the face probability information exceeds a certain threshold, it can be determined that the non-key frame picture information contains face picture information; correspondingly, when the value does not exceed the threshold, it can be determined that it does not. If the non-key frame picture information does not contain face picture information, it can be determined that the current non-key frame contains no face picture; the non-key frame can then be ignored, and no face key point detection is performed on it. If the non-key frame picture information contains face picture information, it can be determined that the current non-key frame contains a face picture, and face detection can be performed on the non-key frame through the face detection and tracking network using the face key point positions of the key frame, so as to detect the approximate positions of the face key points in the non-key frame and generate the corresponding non-key frame face picture information, that is, step 290 is performed. The non-key frame face picture information can represent the face picture information of the non-key frame, and may include the face key point information of the non-key frame information, such as the coordinates of 5 face key points in the non-key frame, namely the position coordinates of the left eye center, the right eye center, the nose tip, the left corner of the mouth, and the right corner of the mouth.
In an optional embodiment of the present application, the output information of the face detection and tracking network may further contain face frame relative position information and key point relative position information. After the non-key frame picture information is obtained and before the non-key frame face picture information is generated (in the case where the non-key frame picture information contains face picture information) according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, the method may further include: determining the face key point information of the non-key frame information according to the face frame relative position information and the key point relative position information. The face frame relative position information can represent the relative position of the regressed face frame, for example a 4-dimensional vector output by the face detection and tracking network through its output layer; the key point relative position information can represent the relative positions of the 5 face key points, for example a 10-dimensional vector output through the output layer.
In actual processing, after receiving the input non-key frame picture information, the face detection and tracking network in the embodiment of the present application can judge, based on this information, whether the picture displayed in the non-key frame is a face picture. If it is a face picture, the network regresses the position of the face frame of that face picture in the current frame, and can output the position coordinates of 5 of the face key points, namely the left eye center, the right eye center, the nose tip, the left corner of the mouth, and the right corner of the mouth; that is, it outputs the face key point information of the non-key frame information as the output information of the face detection and tracking network.
Following the above example, the 5 face key points output by the face detection and tracking network can be a subset of the 106 key points output by the face key point detection networks C and F. When the output layer of the face detection and tracking network is a fully connected layer FC, the network can output through this layer a 2-dimensional vector (p0, p1) as the face probability information, representing the probabilities that the input picture is not / is a face: with p0 the probability of a non-face and p1 the probability of a face, p0 + p1 = 1, a face is judged to be detected when p1 exceeds a preset threshold, and the current input picture is judged to be a non-face picture when p1 does not exceed the threshold. The network can also output a 4-dimensional vector as the face frame relative position information, representing the relative position of the regressed face frame. In the box format (x0, y0, w, h), (x0, y0) can be the coordinates of the top-left corner of the box in the picture and (w, h) its width and height; if the box information input into TNet as the non-key frame picture information is (x0, y0, w, h) and the output 4-dimensional vector is (dx0, dy0, dx1, dy1), these four numbers express the position of the detected box relative to the input box, and the corresponding detected box is (x0+dx0*w, y0+dy0*h, (dx1-dx0)*w, (dy1-dy0)*h), so that this detected box can subsequently be used as the face frame of the non-key frame to crop the non-key frame picture. The network can further output a 10-dimensional vector (dx0, dy0, ..., dx4, dy4) as the key point relative position information, representing the relative positions of the 5 face key points, so that the coordinates of the 5 face key points of the non-key frame are determined as (x0+dx0*w, y0+dy0*h, ..., x0+dx4*w, y0+dy4*h) and serve as the face key point information of the non-key frame information. Based on these 5 face key point coordinates, the face key point position information of the previous frame can then be corrected to obtain the non-key frame face picture information input into the second neural network ("crop face picture III" shown in FIG. 3), and face key point detection can be performed on it through the second neural network to produce the face key point position information of the non-key frame. Compared with the face detector MTCNN used on key frames, TNet can be a network with a much smaller amount of computation: because the position of the face in the picture changes little between adjacent frames of a video, and the face key point position information passed from the previous frame already gives the approximate position of the face in the current frame, a simple face detection and tracking network suffices to regress the position of the face frame.
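Decoding TNet's three outputs follows directly from the formulas above; in this sketch the function name and the 0.5 threshold are illustrative assumptions (the text only requires that p1 exceed a preset threshold):

    def decode_tnet(p, box_reg, kp_reg, input_box, threshold=0.5):
        """Decode TNet outputs for one non-key frame crop.

        p         -- (p0, p1): probabilities of non-face / face, p0 + p1 == 1
        box_reg   -- (dx0, dy0, dx1, dy1): face box relative to the input box
        kp_reg    -- (dx0, dy0, ..., dx4, dy4): 5 keypoints relative to it
        input_box -- (x0, y0, w, h): the crop that was fed to TNet
        """
        p0, p1 = p
        if p1 <= threshold:
            return None  # no face found: fall back to the key frame detector flow
        x0, y0, w, h = input_box
        dx0, dy0, dx1, dy1 = box_reg
        face_box = (x0 + dx0 * w, y0 + dy0 * h,
                    (dx1 - dx0) * w, (dy1 - dy0) * h)
        keypoints = [(x0 + kp_reg[2 * i] * w, y0 + kp_reg[2 * i + 1] * h)
                     for i in range(5)]
        return face_box, keypoints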
Since the information of the previous frame is used as key frame information on the current frame, deviations may occur under fast face motion, so a key point correction module is introduced to correct the coordinate positions of these face key points. In an optional implementation, a linear transformation can be used: the new information of the current frame learned by TNet corrects the positions of the 106 face key points passed from the previous frame, so that the corrected coordinates of the 106 face key points form a minimal square frame with which a face picture is cropped ("crop face picture III" in FIG. 3). The cropped face picture can be scaled to 70 pixels in both width and height and then, as the non-key frame face picture information, input into the face key point detection network F for face key point detection, obtaining the 106 face key point coordinates of the current frame as the face key point position information corresponding to the non-key frame information.
Optionally, in this embodiment, generating the non-key frame face picture information according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information may include: correcting the face key point position information corresponding to the key frame information based on the face key point information of the non-key frame information to obtain key point correction information; determining key point tracking position information according to the key point correction information and the initial key point position information; generating a face picture cropping frame according to the key point tracking position information; and cropping the non-key frame information and/or the non-key frame picture information with the face picture cropping frame to obtain the non-key frame face picture information.
In the face key point tracking process of this embodiment, for non-key frames, the face key point coordinates of the previous frame can be used as the approximate positions of the corresponding face key points in the current frame, so that the information of adjacent frames is used to detect and track face key points on non-key frames. To cope with changes between frames, a correction step is added: taking the 5 face key point coordinates output by TNet as the reference, linear transformation information (A*, b*) is computed from the difference between the 5 face key point coordinates output by TNet and the corresponding 5 face key point coordinates of the previous frame, and this linear transformation information (A*, b*) is then applied to all 106 face key points of the previous frame to obtain corrected 106-point face key point information, according to which the face picture is cropped. The cropped face picture thus fits the face of the current frame more closely, and in effect this plays the role that the face key point detection network C plays in the key frame processing flow.
For example, following the above example, on a non-key frame, a face detection and tracking network such as TNet can regress the coordinates of 5 face key points, denoted {(u_1', v_1'), ..., (u_5', v_5')}; as shown in FIG. 4, TNet outputs these 5 face key point coordinates. The coordinates of these 5 face key points can be extracted from the 106 face key point coordinates output by the face key point detection network F on the previous frame, denoted {(u_1, v_1), ..., (u_5, v_5)}, and the coordinates of the remaining 101 face key points output by network F are denoted {(u_6, v_6), ..., (u_106, v_106)}. The extracted coordinates {(u_1, v_1), ..., (u_5, v_5)} of the 5 face key points then serve as the face key point position information corresponding to the key frame information, which is corrected to obtain the key point correction information.
作为本申请的一个可选实施方式,可以通过计算公式
Figure PCTCN2020081262-appb-000001
确定出作为关键点修正信息的线性变换信息(A*,b*)。其中,A可以通过计算公式
Figure PCTCN2020081262-appb-000002
来确定,b可以根据公式b=(b x,b y)来确定。S可以是表征缩放系数,R可以是2x2的旋转变换矩阵,b可以是2维的位移向量。
As an optional implementation of this application, you can use the calculation formula
Figure PCTCN2020081262-appb-000001
Determine the linear transformation information (A*, b*) as the key point correction information. Among them, A can be calculated by the formula
Figure PCTCN2020081262-appb-000002
To determine, b can be determined according to the formula b=(b x , b y ). S can be a characterizing scaling factor, R can be a 2x2 rotation transformation matrix, and b can be a 2-dimensional displacement vector.
一实施例中,线性变换信息(A*,b*)可以由以下步骤得到:In an embodiment, the linear transformation information (A*, b*) can be obtained by the following steps:
Step S10: compute the average coordinates of the two sets of face key points, m = (1/5) Σ_{i=1}^{5} p_i for the previous frame and m' = (1/5) Σ_{i=1}^{5} p_i' for the current frame, and center both sets of coordinates, for example as x_i = p_i − m for the previous-frame key points and y_i = p_i' − m' for the current-frame key points.
Step S20: compute the 2×2 matrix C, a cross-covariance of the two centered point sets, and perform singular value decomposition of C according to the formula C = UΣV^T to obtain the optimal 2×2 rotation matrix R*, with R* = V^T U.
Step S30: compute the value S* from the optimal 2×2 rotation matrix R*, for example according to the calculation formula S* = e/d, where, in the standard least-squares form, e = Σ_{i=1}^{5} y_i^T R* x_i and d = Σ_{i=1}^{5} ||x_i||².
Step S40: determine A* and b* according to the optimal 2×2 rotation matrix R* and the value S*, with A* = S*·R* and, in the standard form, b* = m' − A*·m.
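Putting steps S10 to S40 together, the following Python sketch (using numpy) shows one way the estimation could be implemented. It is a minimal sketch in the standard similarity-Procrustes form, assuming the convention C = Σ y_i x_i^T, under which the optimal rotation comes out as U V^T up to a reflection correction; this may differ superficially from the notation above, whose exact formula images are not reproduced here.

import numpy as np

def estimate_similarity(prev_5, curr_5):
    # prev_5, curr_5: (5, 2) arrays of corresponding face key points,
    # previous frame and TNet output respectively.
    # S10: mean coordinates and centering.
    m_prev = prev_5.mean(axis=0)
    m_curr = curr_5.mean(axis=0)
    x = prev_5 - m_prev                       # centered previous-frame points
    y = curr_5 - m_curr                       # centered current-frame points
    # S20: 2x2 cross-covariance and its SVD.
    C = y.T @ x                               # assumed convention for C
    U, sigma, Vt = np.linalg.svd(C)
    D = np.diag([1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt                            # optimal 2x2 rotation R*
    # S30: scale S* = e / d.
    d = (x ** 2).sum()                        # d = sum_i ||x_i||^2
    e = (y * (x @ R.T)).sum()                 # e = sum_i y_i . (R* x_i)
    s = e / d
    # S40: A* = S* R*, and b* aligns the two centroids.
    A = s * R
    b = m_curr - A @ m_prev
    return A, b

Because only five 2-D correspondences and a 2×2 SVD are involved, this correction is cheap enough to run on every non-key frame.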
Subsequently, the linear transformation information (A*, b*) can be used, together with the new current-frame information learned by TNet, to correct the positions of the 106 face key points passed on from the previous frame, for example according to the correction formula

    p_j ← A*·p_j + b*, j = 1, ..., 106,

which applies the linear transformation (A*, b*) to the coordinate positions of all 106 face key points of the previous frame, so that the face image cropped according to the corrected 106 key points fits the face in the current frame more closely. The minimal square box formed from the corrected coordinates of the 106 key points can be used to crop a face image (cropped face image III in Figure 3), and the cropped face image can be scaled to 70 pixels in both width and height and input into the face key point detection network F to obtain the 106 face key point coordinates of the current frame. That is, a face picture cropping frame is generated according to the key point tracking position information; the non-key frame information and/or the non-key frame picture information is cropped through the face picture cropping frame to obtain non-key frame face picture information; and the non-key frame face picture information is input into the face key point detection network F, serving as the second neural network, for face key point detection, obtaining the face key point position information corresponding to the non-key frame information. The face key point detection result of the video can subsequently be generated based on this information, achieving the purpose of face key point detection and tracking in the video.
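As a hedged illustration of this correct-then-crop step, the following sketch applies (A*, b*) to the 106 points, takes the minimal enclosing square, and rescales to 70×70; cv2.resize from OpenCV is assumed for the scaling, and boundary handling is simplified:

import cv2
import numpy as np

def correct_and_crop(frame, prev_106, A, b, out_size=70):
    # Apply the correction p_j <- A* p_j + b* to all 106 key points.
    corrected = prev_106 @ A.T + b            # (106, 2)
    # Minimal square box enclosing the corrected points.
    u_min, v_min = corrected.min(axis=0)
    u_max, v_max = corrected.max(axis=0)
    side = int(round(max(u_max - u_min, v_max - v_min)))
    x0 = max(int(round((u_min + u_max) / 2 - side / 2)), 0)
    y0 = max(int(round((v_min + v_max) / 2 - side / 2)), 0)
    crop = frame[y0:y0 + side, x0:x0 + side]
    # Scale to 70x70 pixels before feeding network F.
    face = cv2.resize(crop, (out_size, out_size))
    return face, corrected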
For simplicity of description, the method embodiments are all expressed as a series of action combinations; however, the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application some steps may be performed in other orders or simultaneously.
Referring to Figure 5, a structural block diagram of an embodiment of a face key point detection apparatus in an embodiment of the present application is shown. The face key point detection apparatus may include the following modules:
The video image frame acquisition module 510 is configured to acquire image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information. The first face key point detection module 520 is configured to determine face frame position information according to the key frame information and, based on the face frame position information, perform face key point detection through a pre-trained first neural network to obtain initial key point position information. The second face key point detection module 530 is configured to perform face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video, to obtain a face key point detection result of the video, where the face key point detection result includes the face key point position information corresponding to the key frame information and the face key point position information corresponding to the non-key frame information.
In implementation, the above face key point detection apparatus can be integrated in a device. The device may be composed of two or more physical entities, or of a single physical entity; for example, the device may be a personal computer (PC), a computer, a mobile phone, a tablet device, a personal digital assistant, a server, a messaging device, a game console, and the like.
An embodiment of the present application further provides a device, including a processor and a memory. The memory stores at least one instruction, and the instruction is executed by the processor so that the device performs the face key point detection method described in the above method embodiments.
Referring to Figure 6, a structural schematic diagram of a device in an example of the present application is shown. As shown in Figure 6, the device may include a processor 60, a memory 61, a display screen 62 with a touch function, an input device 63, an output device 64, and a communication device 65. The processor 60, memory 61, display screen 62, input device 63, output device 64, and communication device 65 of the device may be connected by a bus or in other ways; in Figure 6, connection by a bus is taken as an example.
The memory 61, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the face key point detection method described in any embodiment of the present application (for example, the video image frame acquisition module 510, the first face key point detection module 520, and the second face key point detection module 530 in the face key point detection apparatus).
The processor 60 runs the software programs, instructions, and modules stored in the memory 61 to execute the various functional applications and data processing of the device, that is, to implement the above face key point detection method.
In an embodiment, when the processor 60 executes one or more programs stored in the memory 61, the following operations are implemented: acquiring image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information; determining face frame position information according to the key frame information, and performing face key point detection through a pre-trained first neural network based on the face frame position information to obtain initial key point position information; and performing face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video, to obtain a face key point detection result of the video, where the face key point detection result includes the face key point position information corresponding to the key frame information and the face key point position information corresponding to the non-key frame information.
An embodiment of the present application further provides a computer-readable storage medium. When instructions in the storage medium are executed by a processor of a device, the device can perform the face key point detection method described in the above method embodiments. Exemplarily, the face key point detection method includes: acquiring image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information; determining face frame position information according to the key frame information, and performing face key point detection through a pre-trained first neural network based on the face frame position information to obtain initial key point position information; and performing face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video, to obtain a face key point detection result of the video, where the face key point detection result includes the face key point position information corresponding to the key frame information and the face key point position information corresponding to the non-key frame information.
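Summarizing this flow as a hedged, high-level sketch (every name below, is_key_frame, detect_face_box, first_network, and second_network, is a hypothetical placeholder rather than an API from this disclosure):

def detect_video_key_points(frames, is_key_frame, detect_face_box,
                            first_network, second_network):
    # Key frames pass through the face detector and the first neural
    # network; every frame then passes through the second neural network,
    # with the latest key point positions carried forward to the next frame.
    results = []
    key_points = None
    for idx, frame in enumerate(frames):
        if is_key_frame(idx):
            face_box = detect_face_box(frame)            # face frame position
            key_points = first_network(frame, face_box)  # initial key points
        key_points = second_network(frame, key_points)
        results.append(key_points)
    return results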

Claims (11)

  1. A face key point detection method, comprising:
    acquiring image frame information of a video, wherein the image frame information of the video comprises key frame information and non-key frame information;
    determining face frame position information according to the key frame information, and performing face key point detection through a pre-trained first neural network based on the face frame position information to obtain initial key point position information; and
    performing face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain a face key point detection result of the video, wherein the face key point detection result comprises face key point position information corresponding to the key frame information and face key point position information corresponding to the non-key frame information.
  2. The method according to claim 1, wherein performing face key point detection through the pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the face key point detection result of the video comprises:
    generating a picture cropping frame according to the initial key point position information; and
    cropping the key frame information through the picture cropping frame to obtain key frame face picture information, and inputting the key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the key frame information.
  3. The method according to claim 2, before determining the face frame position information according to the key frame information, further comprising:
    selecting, from the image frame information of the video, key frame information and non-key frame information corresponding to the key frame information;
    wherein performing face key point detection through the pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the face key point detection result of the video further comprises:
    cropping the non-key frame information corresponding to the key frame information through the picture cropping frame to obtain non-key frame picture information; and
    in a case where the non-key frame picture information comprises face picture information, generating non-key frame face picture information according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, and inputting the non-key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the non-key frame information.
  4. The method according to claim 3, after obtaining the non-key frame picture information and before, in the case where the non-key frame picture information comprises face picture information, generating the non-key frame face picture information according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, further comprising:
    inputting the non-key frame picture information into a face detection and tracking network to obtain output information of the face detection and tracking network, wherein the output information comprises face probability information; and
    determining, based on the face probability information, whether the non-key frame picture information comprises face picture information.
  5. The method according to claim 4, wherein the output information further comprises face frame relative position information and key point relative position information;
    after obtaining the non-key frame picture information and before, in the case where the non-key frame picture information comprises face picture information, generating the non-key frame face picture information according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, the method further comprises:
    determining face key point information of the non-key frame information according to the face frame relative position information and the key point relative position information.
  6. The method according to claim 5, wherein generating the non-key frame face picture information according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information comprises:
    correcting, based on the face key point information of the non-key frame information, the face key point position information corresponding to the key frame information to obtain key point correction information;
    determining key point tracking position information according to the key point correction information and the initial key point position information;
    generating a face picture cropping frame according to the key point tracking position information; and
    cropping at least one of the non-key frame information and the non-key frame picture information through the face picture cropping frame to obtain the non-key frame face picture information.
  7. The method according to any one of claims 1 to 6, wherein determining the face frame position information according to the key frame information comprises:
    inputting the key frame information into a face detector, wherein the face detector is used for detecting a face frame position; and
    determining output information of the face detector as the face frame position information.
  8. A face key point detection apparatus, comprising:
    a video image frame acquisition module, configured to acquire image frame information of a video, wherein the image frame information of the video comprises key frame information and non-key frame information;
    a first face key point detection module, configured to determine face frame position information according to the key frame information and, based on the face frame position information, perform face key point detection through a pre-trained first neural network to obtain initial key point position information; and
    a second face key point detection module, configured to perform face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain a face key point detection result of the video, wherein the face key point detection result comprises face key point position information corresponding to the key frame information and face key point position information corresponding to the non-key frame information.
  9. The apparatus according to claim 8, wherein the second face key point detection module comprises:
    a picture cropping frame generation submodule, configured to generate a picture cropping frame according to the initial key point position information;
    a key frame cropping processing submodule, configured to crop the key frame information through the picture cropping frame to obtain key frame face picture information; and
    a key frame face key point detection submodule, configured to input the key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the key frame information.
  10. A device, comprising a processor and a memory;
    wherein the memory stores at least one instruction, and the at least one instruction is executed by the processor so that the device performs the face key point detection method according to any one of claims 1 to 7.
  11. A computer-readable storage medium, wherein, when instructions in the storage medium are executed by a processor of a device, the device performs the face key point detection method according to any one of claims 1 to 7.
PCT/CN2020/081262 2019-05-31 2020-03-26 Method, apparatus, and device for facial key point detection, and storage medium WO2020238374A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910473174.2A CN112016371B (en) 2019-05-31 2019-05-31 Face key point detection method, device, equipment and storage medium
CN201910473174.2 2019-05-31

Publications (1)

Publication Number Publication Date
WO2020238374A1 WO2020238374A1 (en)

Family

ID=73506983

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/081262 WO2020238374A1 (en) 2019-05-31 2020-03-26 Method, apparatus, and device for facial key point detection, and storage medium

Country Status (2)

Country Link
CN (1) CN112016371B (en)
WO (1) WO2020238374A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561840B (en) * 2020-12-02 2024-05-28 北京有竹居网络技术有限公司 Video clipping method and device, storage medium and electronic equipment
CN112488064B (en) * 2020-12-18 2023-12-22 平安科技(深圳)有限公司 Face tracking method, system, terminal and storage medium
CN112597973A (en) * 2021-01-29 2021-04-02 秒影工场(北京)科技有限公司 High-definition video face alignment method based on convolutional neural network
TWI831582B (en) * 2023-01-18 2024-02-01 瑞昱半導體股份有限公司 Detection system and detection method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2672424A1 (en) * 2012-06-08 2013-12-11 Realeyes OÜ Method and apparatus using adaptive face registration method with constrained local models and dynamic model switching
CN103824049A (en) * 2014-02-17 2014-05-28 北京旷视科技有限公司 Cascaded neural network-based face key point detection method
CN108875480A (en) * 2017-08-15 2018-11-23 北京旷视科技有限公司 A kind of method for tracing of face characteristic information, apparatus and system
CN109598234B (en) * 2018-12-04 2021-03-23 深圳美图创新科技有限公司 Key point detection method and device
CN109800635A (en) * 2018-12-11 2019-05-24 天津大学 A kind of limited local facial critical point detection and tracking based on optical flow method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063679A (en) * 2018-08-24 2018-12-21 广州多益网络股份有限公司 A kind of human face expression detection method, device, equipment, system and medium
CN109376684A (en) * 2018-11-13 2019-02-22 广州市百果园信息技术有限公司 A kind of face critical point detection method, apparatus, computer equipment and storage medium
CN109657583A (en) * 2018-12-10 2019-04-19 腾讯科技(深圳)有限公司 Face's critical point detection method, apparatus, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633084A (en) * 2020-12-07 2021-04-09 深圳云天励飞技术股份有限公司 Face frame determination method and device, terminal equipment and storage medium
CN112597842A (en) * 2020-12-15 2021-04-02 周美跃 Movement detection facial paralysis degree evaluation system based on artificial intelligence
CN112597842B (en) * 2020-12-15 2023-10-20 芜湖明瞳数字健康科技有限公司 Motion detection facial paralysis degree evaluation system based on artificial intelligence
CN113177526A (en) * 2021-05-27 2021-07-27 中国平安人寿保险股份有限公司 Image processing method, device and equipment based on face recognition and storage medium
CN113177526B (en) * 2021-05-27 2023-10-03 中国平安人寿保险股份有限公司 Image processing method, device, equipment and storage medium based on face recognition

Also Published As

Publication number Publication date
CN112016371A (en) 2020-12-01
CN112016371B (en) 2022-01-14

Similar Documents

Publication Publication Date Title
WO2020238374A1 (en) Method, apparatus, and device for facial key point detection, and storage medium
US11295413B2 (en) Neural networks for cropping images based on body key points
EP3576017A1 (en) Method, apparatus, and device for determining pose of object in image, and storage medium
CN112488064B (en) Face tracking method, system, terminal and storage medium
Zhao et al. Pwstablenet: Learning pixel-wise warping maps for video stabilization
US20210089753A1 (en) Age Recognition Method, Computer Storage Medium and Electronic Device
US10395094B2 (en) Method and apparatus for detecting glasses in a face image
WO2020056903A1 (en) Information generating method and device
US20240015340A1 (en) Live streaming picture processing method and apparatus based on video chat live streaming, and electronic device
JP2019117577A (en) Program, learning processing method, learning model, data structure, learning device and object recognition device
CN111353336B (en) Image processing method, device and equipment
JP5087037B2 (en) Image processing apparatus, method, and program
CN111667504B (en) Face tracking method, device and equipment
CN109767453A (en) Information processing unit, background image update method and non-transient computer readable storage medium
CN111723707A (en) Method and device for estimating fixation point based on visual saliency
CN111563490B (en) Face key point tracking method and device and electronic equipment
CN112954450A (en) Video processing method and device, electronic equipment and storage medium
JP2006146413A (en) Object tracking device
Chen et al. Sound to visual: Hierarchical cross-modal talking face video generation
CN112651322B (en) Cheek shielding detection method and device and electronic equipment
KR102188991B1 (en) Apparatus and method for converting of face image
JP2023512359A (en) Associated object detection method and apparatus
Oshiba et al. Face image generation of anime characters using an advanced first order motion model with facial landmarks
CN111260692A (en) Face tracking method, device, equipment and storage medium
WO2022153481A1 (en) Posture estimation apparatus, learning model generation apparatus, method, and computer-readable recordingmedium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20815484

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20815484

Country of ref document: EP

Kind code of ref document: A1