WO2020238374A1 - Method, apparatus, and device for facial key point detection, and storage medium - Google Patents


Info

Publication number
WO2020238374A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
information
key
frame
key point
Prior art date
Application number
PCT/CN2020/081262
Other languages
French (fr)
Chinese (zh)
Inventor
项伟
张小伟
Original Assignee
广州市百果园信息技术有限公司
Application filed by 广州市百果园信息技术有限公司
Publication of WO2020238374A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G06V40/165: Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Definitions

  • This application relates to the field of computer vision technology, for example, to a facial key point detection method, apparatus, device, and storage medium.
  • In the field of computer vision, face video data plays a particularly important role because of its practical application scenarios in fields such as biometric verification, surveillance and security, and live video streaming.
  • Facial key point detection is a very important step in face image processing. Its main function is to accurately locate the positions of facial key points, such as the eyes, nose, mouth corners, and face contour points, in a picture, in preparation for subsequent operations such as face alignment and face recognition.
  • In implementation, facial key point detection is usually a step that follows face detection.
  • The face detector typically feeds the detected face position information, for example given in the form of a rectangular or square box, together with the corresponding face picture into a key point detection algorithm, and the computed result is determined as the facial key point positions.
  • Facial key point detection algorithms based on deep convolutional networks offer a large improvement in accuracy over traditional facial key point algorithms.
  • However, facial key point detection methods built on a single deep convolutional network are usually computationally intensive, and the network structure of the deep convolutional network must be carefully designed and arranged; otherwise it is difficult to achieve real-time processing on platforms with limited computing resources, for example on mobile terminals such as mobile phones.
  • The embodiments of this application provide a new facial key point detection method, system, device, and storage medium, to address the constraints of limited computing power, small storage space, and high real-time requirements that facial key point detection methods face on mobile terminals.
  • An embodiment of the present application provides a facial key point detection method, including: acquiring image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information; determining face box position information according to the key frame information; performing facial key point detection through a pre-trained first neural network based on the face box position information to obtain initial key point position information; and performing facial key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain a facial key point detection result of the video, where the facial key point detection result includes facial key point position information corresponding to the key frame information and facial key point position information corresponding to the non-key frame information.
  • An embodiment of the present application further provides a facial key point detection apparatus, including:
  • a video image frame acquisition module, configured to acquire image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information;
  • a first facial key point detection module, configured to determine face box position information according to the key frame information, and to perform facial key point detection through a pre-trained first neural network based on the face box position information to obtain initial key point position information;
  • a second facial key point detection module, configured to perform facial key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain a facial key point detection result of the video, where the facial key point detection result includes facial key point position information corresponding to the key frame information and facial key point position information corresponding to the non-key frame information.
  • An embodiment of the present application further provides a device, including a processor and a memory; the memory stores at least one instruction, and when the instruction is executed by the processor, the device performs the facial key point detection method described above.
  • An embodiment of the present application further provides a computer-readable storage medium; when the instructions in the storage medium are executed by a processor of a device, the device can perform the facial key point detection method described above.
  • FIG. 1 is a schematic flowchart of the steps of a facial key point detection method in an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of the steps of a facial key point detection method in an optional embodiment of the present application;
  • FIG. 3 is a schematic diagram of the facial key point detection and tracking flow for a video in an example of the present application;
  • FIG. 4 is a schematic diagram of the flow of correcting the facial key points of the previous frame in an example of the present application;
  • FIG. 5 is a schematic structural block diagram of a facial key point detection apparatus in an embodiment of the present application;
  • FIG. 6 is a schematic structural block diagram of a device in an example of the present application.
  • Facial key point detection algorithms are generally designed for a single static image; to detect facial key points in a video, the usual approaches are to process the video frame by frame, or to track the face with a general object tracking algorithm and then detect the facial key points. Facial key point tracking schemes can be roughly divided into two categories: the first performs face detection and facial key point detection frame by frame; the second performs face detection on the first image frame, tracks the face box in subsequent image frames with a general object tracking method, and applies the key point detection algorithm to each tracked face; if tracking fails in some image frame and no face is found, the face detector is applied again to detect faces.
  • The first category requires face detection and key point detection for every image frame and does not fully exploit the correlation between adjacent frames, so its speed is limited. In addition, because each image frame is processed independently, key point jitter is likely to occur, which affects downstream modules that depend on key point stability, such as a module that applies face sticker effects based on the detected facial key points, and degrades the user experience.
  • The second category reapplies the face detector to detect the face whenever tracking fails in an image frame and no face is found.
  • A potential problem is that faces in videos, and in mobile videos in particular, often undergo rapid changes in pose, scale, occlusion, and expression, which causes the object tracking method to fail frequently and the face detector to be invoked again and again.
  • In addition, general facial key point algorithms are sensitive to the relative position of the input face within the face box; that is, if the input face box is perturbed, the outputs of the key point detection algorithm before and after the perturbation can differ greatly, and the face box obtained by tracking is less accurate than the face box produced by the detector, which leads to errors in key point detection.
  • In short, these facial key point detection methods have problems such as high computational complexity and easily losing the tracked target.
  • Moreover, most application scenarios of facial key point detection are on mobile terminals such as mobile phones, so a facial key point detection solution is constrained by limited computing power, small storage space, and high real-time requirements.
  • For this reason, an embodiment of the present application proposes a facial key point detection method.
  • The face box position information can be determined according to the key frame information in the video information, so that the initial key point position information is determined through the first neural network according to the face box position information.
  • Facial key point detection can then be performed through the second neural network to obtain the facial key point detection result of the video; that is, a two-stage neural network is used to detect the facial key points in the video information, which makes it possible to process facial key point detection in video efficiently.
  • FIG. 1 shows a schematic flowchart of the steps of a facial key point detection method in an embodiment of the present application.
  • The facial key point detection method can be used in face-related vision applications such as face recognition, face sticker effects, and face-swapping effects, and may include the following steps:
  • Step 110: Obtain image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information.
  • In this embodiment, a video may include one or more video frames; each video frame may include an image frame used to display the video picture and/or an audio frame used to play the video sound.
  • The image frame information of the video can represent the image frames in the video; for example, it can refer to the image information in a video frame, which is used to display the video picture so that the user can watch the video.
  • In this step, the image frame information of the video to be detected may be obtained, so that facial key point detection can be performed according to the face picture information in the image frame information.
  • The face picture information characterizes the face picture(s) contained in a video frame. For example, when a video frame contains the face picture of one person, the face picture displayed in that video frame can be determined based on the face picture information in the image frame information; likewise, when a video frame contains multiple face pictures, the face pictures of the multiple people displayed in the video frame can be determined from it.
  • The embodiment of the present application may divide the obtained image frame information into key frame information and non-key frame information, so that the face box position is detected based on the key frame information, that is, step 120 is performed.
  • The key frame information characterizes the key image frames of the video (referred to as key frames), and the non-key frame information characterizes the non-key image frames of the video (referred to as non-key frames).
  • Step 120: Determine face box position information according to the key frame information, and perform facial key point detection through a pre-trained first neural network based on the face box position information to obtain initial key point position information.
  • In this step, a preset face detector, such as a Multi-Task Cascaded Convolutional Neural Network (MTCNN) used for joint face detection and alignment, can detect the key frame information and generate the face box position information.
  • The face box position information represents the position of the face box and can be used to determine where the face box is displayed in an image frame of the video.
  • After the face box position information is determined, the face box picture can be cropped from the key frame based on it; that is, a face box picture containing the face is cropped, according to the face box position, from the image frame serving as a video key frame, and the corresponding face box picture information is generated to characterize the cropped face picture. The generated face box picture information can then be input into the pre-trained first neural network for facial key point detection, to preliminarily detect the facial key point positions.
  • The output information of the first neural network can be used as the initial key point position information, which preliminarily determines the facial key point positions, such as the approximate positions of the facial key points in the current key frame.
  • Step 130: Based on the initial key point position information and the image frame information of the video, perform facial key point detection through a pre-trained second neural network to obtain the facial key point detection result of the video.
  • The facial key point detection result includes the facial key point position information corresponding to the key frame information and the facial key point position information corresponding to the non-key frame information.
  • The facial key point detection result of the video can be used to determine the facial key point positions of every image frame in the video and may include the facial key point position information corresponding to each piece of image frame information, such as the facial key point position information corresponding to the key frame information and the facial key point position information corresponding to the non-key frame information.
  • The facial key point position information corresponding to a piece of image frame information characterizes the facial key point positions of that image frame; for example, the facial key point position information corresponding to the key frame information characterizes the facial key point positions in the key frame, and the facial key point position information corresponding to the non-key frame information characterizes the facial key point positions in the non-key frame.
  • In an implementation, a picture cropping box can be generated according to the approximate facial key point positions in the current key frame and then used to crop the face picture in the current key frame; that is, the picture cropping box is applied to the image frame of the video to obtain the key frame face picture information, which characterizes the picture obtained by this cropping.
  • The obtained key frame face picture information can be input into the pre-trained second neural network for facial key point detection, and the information output by the second neural network is determined as the facial key point information of the current key frame. Facial key point detection and tracking can then be performed on the non-key frames of the video based on the facial key point information of the key frame, to obtain the facial key point information of the non-key frames, and the facial key point detection result of the video can be generated based on the facial key point information of the key frame and/or the facial key point information of the non-key frames.
  • In summary, the embodiment of the present application determines the face box position information according to the key frame information in the image frame information of the video, performs facial key point detection through the first neural network based on the face box position information to obtain the initial key point position information, and then performs facial key point detection through the second neural network based on the initial key point position information and the image frame information of the video. That is, a two-stage neural network is used for facial key point detection, which avoids the high computational complexity, heavy computation, and poor real-time performance of facial key point detection implemented with a single deep convolutional network; the facial key point positions can be detected quickly and stably, and facial key point detection and tracking in video can be handled quickly and stably.
  • In an implementation, one or more image frames of the video can be selected as key frames according to a preset rule, for example selecting one frame out of every N image frames of the video as a key frame; the remaining image frames can be used as non-key frames, that is, the (N-1) consecutive image frames adjacent to a key frame are determined as the non-key frames corresponding to that key frame.
  • The value of N can be determined according to the application scenario; that is, the value of N can change across different application scenarios.
  • The face detector can be used to detect the face box position in the key frame, so that the face box picture information is cropped from the key frame information according to the face box position, and the first neural network can then perform facial key point detection on the face box picture information to detect the approximate facial key points in the key frame.
  • Before the face box position information is determined according to the key frame information, the facial key point detection method provided in this embodiment may further include: selecting, from the image frame information of the video, the key frame information and the non-key frame information corresponding to the key frame information. Subsequently, the face box position information may be determined according to the key frame information, so that the approximate facial key point positions are determined through the first neural network based on the face box position information.
  • For example, the t-th frame picture of the video can be determined as key frame information, and the pictures from the (t+1)-th frame to the (t+N-1)-th frame of the video can be determined as non-key frame information; this non-key frame information is associated with the t-th frame picture serving as the key frame information, so that it is determined as the non-key frame information corresponding to that key frame information, where t is an integer greater than zero. The sketch below illustrates this partition rule.
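  • As an illustration of this frame partition rule, the following is a minimal sketch (assuming Python; the function and variable names are hypothetical rather than taken from the patent):

```python
def split_frames(num_frames: int, n: int):
    """Partition frame indices 0..num_frames-1 into key frames and the
    non-key frames associated with each key frame (one key frame per N)."""
    groups = []
    for t in range(0, num_frames, n):
        key = t
        non_key = list(range(t + 1, min(t + n, num_frames)))
        groups.append((key, non_key))
    return groups

# Example: a 10-frame video with N = 4
# -> [(0, [1, 2, 3]), (4, [5, 6, 7]), (8, [9])]
print(split_frames(10, 4))
```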
  • In an implementation, determining the face box position information according to the key frame information may include: inputting the key frame information into a face detector, where the face detector is used to detect the face box position; and determining the output information of the face detector as the face box position information. The face box position can then be determined based on the face box position information, so that the face box picture is cropped from the key frame according to the face box position and the corresponding face box picture information is generated. For example, the MTCNN serving as the face detector detects the face box position, and the box corresponding to each face box position is expanded into a square, with the center of the box as the center of the square and the long side of the box as the side of the square; the face box picture information cropped by this square can then be input into the first neural network for facial key point detection. A sketch of this square crop follows.
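  • A minimal sketch of this square expansion and crop, assuming Python with NumPy (the helper names are hypothetical):

```python
import numpy as np

def expand_to_square(x0, y0, w, h):
    """Expand a detector box (x0, y0, w, h) to a square centered on the
    box center, with side length equal to the box's long side."""
    side = max(w, h)
    cx, cy = x0 + w / 2.0, y0 + h / 2.0
    return cx - side / 2.0, cy - side / 2.0, side

def crop_square(image: np.ndarray, x0, y0, w, h) -> np.ndarray:
    """Crop the square-expanded face box from an HxWxC image,
    clipping to the image boundary."""
    sx, sy, side = expand_to_square(x0, y0, w, h)
    x1, y1 = int(max(sx, 0)), int(max(sy, 0))
    x2 = int(min(sx + side, image.shape[1]))
    y2 = int(min(sy + side, image.shape[0]))
    return image[y1:y2, x1:x2]
```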
  • The first neural network can serve as the first-stage facial key point detection network in the facial key point detection process; it performs key point detection on the face box picture of a video key frame and outputs the initial key point position information, so that the subsequent process can perform facial key point detection on key frames and/or non-key frames according to the initial key point position information, allowing the facial key point positions in the video to be detected quickly and stably.
  • The initial key point position information can be used to preliminarily determine the approximate positions of the facial key points.
  • In an implementation, performing facial key point detection through the pre-trained second neural network to obtain the facial key point detection result of the video includes: generating a picture cropping box according to the initial key point position information; cropping the key frame information with the picture cropping box to obtain key frame face picture information; and inputting the key frame face picture information into the second neural network for facial key point detection to obtain the facial key point position information corresponding to the key frame information.
  • After the initial key point position information is determined through the first neural network, a picture cropping box can be generated based on the initial key point position information, so that the picture cropping box is used to crop the face picture information corresponding to the current key frame according to the approximate facial key point positions; that is, the key frame information is cropped to obtain the key frame face picture information.
  • The key frame face picture information can then be used as the input of the second neural network for facial key point detection, so that the facial key point positions of the key frame are determined accurately, and the output information of the second neural network is determined as the facial key point position information corresponding to the key frame information. Subsequent facial key point detection and tracking can then be performed on the non-key frame information corresponding to the key frame information based on this facial key point position information; that is, the information between adjacent frames of the video is used to track the facial key points and generate the facial key point position information corresponding to the non-key frame information. The facial key point detection result of the video can be generated based on the facial key point position information corresponding to the key frame information and/or the facial key point information corresponding to the non-key frame information, achieving the goal of high-speed facial key point detection in video.
  • In an implementation, performing facial key point detection through the pre-trained second neural network to obtain the facial key point detection result of the video may further include: cropping the non-key frame information corresponding to the key frame information with the picture cropping box to obtain non-key frame picture information; and, when the non-key frame picture information includes face picture information, generating non-key frame face picture information according to the facial key point position information corresponding to the key frame information, and inputting the non-key frame face picture information into the second neural network for facial key point detection to obtain the facial key point position information corresponding to the non-key frame information.
  • Referring to FIG. 2, the facial key point detection method may include the following steps:
  • Step 210: Obtain image frame information of the video, where the image frame information of the video includes key frame information and non-key frame information.
  • Step 220: Select key frame information and the non-key frame information corresponding to the key frame information from the image frame information of the video.
  • Step 230: Input the key frame information into the face detector, where the face detector is used to detect the face box position.
  • Step 240: Determine the output information of the face detector as the face box position information.
  • In this embodiment, after the key frame information is selected from the video, it can be input into the face detector to detect the face box position of the key frame; the output information of the face detector is determined as the face box position information, and based on it the face box picture information is cropped from the key frame according to the face box position, so that preliminary facial key point detection can be performed, that is, step 250 is performed.
  • Step 250: Perform facial key point detection through the pre-trained first neural network based on the face box position information to obtain initial key point position information.
  • Step 260: Generate a picture cropping box according to the initial key point position information.
  • Step 270: Crop the key frame information with the picture cropping box to obtain key frame face picture information, and input the key frame face picture information into the second neural network for facial key point detection to obtain the facial key point position information corresponding to the key frame information.
  • In this embodiment, the face box picture information may be cropped from the key frame according to the face box position given by the face box position information, and the cropped face picture information is input into the first neural network for facial key point detection to obtain the initial key point position information; a picture cropping box can then be generated based on the initial key point position information and used to crop the key frame according to the approximate facial key point positions, yielding the key frame face picture information.
  • The key frame face picture information can be used to characterize the face picture in the video key frame.
  • The key frame face picture information can be used as the input of the second neural network for facial key point detection, so that the facial key point positions of the key frame are determined accurately and stably based on the information output by the second neural network; this output information is determined as the facial key point position information corresponding to the key frame information, based on which the facial key points of the non-key frames are subsequently detected and tracked.
  • As an example of the key frame processing flow shown in FIG. 3, the t-th frame and the (t+N)-th frame of the video can be determined as key frame information.
  • The MTCNN serving as the face detector can be used to detect the face box position, and each box determined by the MTCNN can be expanded into a square; the face picture is cropped according to this square, as shown by the "crop face picture I" module in FIG. 3. The cropped face picture can be scaled to 70 pixels in width and height and input into the facial key point detection network C serving as the first neural network, which outputs 106 facial key point coordinates as the initial key point position information.
  • Then a face picture can be cropped according to the smallest square box formed by the 106 facial key point coordinates, as shown by the "crop face picture II" module in FIG. 3; that is, the smallest square box is used as the picture cropping box, and cropping the key frame information with it yields the key frame face picture information. The cropped face picture can be scaled to 70 pixels in width and height and then input into the facial key point detection network F serving as the second neural network, which outputs more accurate 106 facial key point coordinates as the facial key point position information corresponding to the key frame information.
  • The facial key point detection result of the video can then be generated based on the facial key point position information corresponding to the key frame information, subsequent key point processing can be performed based on it, and the facial key points of the non-key frames can be detected and tracked according to it; that is, step 280 is performed, using the information between adjacent frames to directly track the facial key points in the video, so that facial key point detection in the video is processed efficiently. A sketch of this key frame flow is given below.
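  • The key frame flow just described can be sketched as follows, assuming Python with NumPy; face_detector, net_c, and net_f stand in for the MTCNN, network C, and network F, and crop/resize are caller-supplied image helpers, since the patent does not specify any of them at the code level:

```python
import numpy as np

def min_enclosing_square(points: np.ndarray):
    """Smallest square box (x, y, side, side) enclosing (x, y) key points."""
    x0, y0 = points.min(axis=0)
    x1, y1 = points.max(axis=0)
    side = max(x1 - x0, y1 - y0)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return (cx - side / 2.0, cy - side / 2.0, side, side)

def detect_key_frame(frame, face_detector, net_c, net_f, crop, resize):
    """Two-stage key point detection on one key frame.

    face_detector(frame) -> square-expanded face box (x, y, w, h);
    net_c / net_f map a 70x70 face crop to 106 (x, y) key points,
    assumed here to be returned in frame coordinates; crop(frame, box)
    and resize(img, (w, h)) are image helpers supplied by the caller.
    """
    box = face_detector(frame)                         # face box position
    face_i = resize(crop(frame, box), (70, 70))        # "crop face picture I"
    coarse = net_c(face_i)                             # initial key points
    crop_box = min_enclosing_square(coarse)            # picture cropping box
    face_ii = resize(crop(frame, crop_box), (70, 70))  # "crop face picture II"
    refined = net_f(face_ii)                           # refined 106 key points
    return refined, crop_box
```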
  • Both the facial key point detection network C and the facial key point detection network F in this example can extract features through multiple convolution layers and pooling layers, and can use a fully connected layer to regress the relative positions of the key points.
  • Although the two facial key point detection networks have the same network structure, fewer channels are used in each layer of network C, so network C, serving as the first neural network, is lighter than network F, serving as the second neural network.
  • In addition, the input pictures of the two facial key point detection networks are cropped differently: the input picture of network C is cropped by the face box, while the input picture of network F is cropped based on the 106 facial key points, and an input picture cropped according to the positions of the 106 facial key points fits the face more closely.
  • The two facial key point detection networks can be trained separately, and the weights of their convolution layers can differ, which reduces the key point inaccuracy caused by a face box that does not fit the face closely enough.
  • In this way, the embodiment of the present application can use a two-stage neural network to detect facial key points progressively and obtain more accurate key point positions: the facial key point detection network C regresses the rough key point positions, and the facial key point detection network F refines them into more accurate key points. A minimal sketch of such a network follows.
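  • The patent does not give layer-by-layer specifications for networks C and F; the following is a minimal sketch of the structure described above (convolution and pooling layers followed by a fully connected regression layer), assuming PyTorch, with hypothetical channel counts. Network C simply uses fewer channels per layer than network F:

```python
import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    """Conv + pooling feature extractor with a fully connected layer that
    regresses the relative positions of 106 key points from a 70x70 RGB
    face crop. `width` scales the per-layer channel counts, so the lighter
    network C and the heavier network F can share this definition."""

    def __init__(self, width: int = 16, num_points: int = 106):
        super().__init__()
        self.num_points = num_points
        self.features = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(width, 2 * width, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(2 * width, 4 * width, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 70 -> 35 -> 17 -> 8 after three 2x2 poolings
        self.fc = nn.Linear(4 * width * 8 * 8, num_points * 2)

    def forward(self, x):                          # x: (B, 3, 70, 70)
        f = self.features(x).flatten(1)
        return self.fc(f).view(-1, self.num_points, 2)

net_c = KeypointNet(width=8)     # fewer channels per layer: lighter
net_f = KeypointNet(width=16)    # same structure, more channels
```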
  • Step 280: Crop the non-key frame information corresponding to the key frame information with the picture cropping box to obtain non-key frame picture information.
  • In this embodiment, after the facial key point position information corresponding to the key frame information is obtained, the current frame can be cropped with the picture cropping box; for example, the (t+1)-th frame picture shown in FIG. 3 is cropped, and the corresponding non-key frame picture information is generated based on the cropped picture.
  • The non-key frame picture information characterizes a picture cropped from a non-key frame of the video according to the facial key point positions of the key frame.
  • The non-key frame picture information can be used as the input of a face detection and tracking network, which detects and tracks the face in the non-key frame picture information, for example by determining whether the non-key frame picture information includes face picture information.
  • The face detection and tracking network can serve as a face detector for non-key frames; for example, it can be the face detector tracking network (Tracking Net, TNet) shown in FIG. 3.
  • The face detector TNet can determine whether the non-key frame picture information contains face picture information, that is, whether the input picture is a face picture; when the input picture is judged to be a face picture, it outputs the relative position of the face box, the relative positions of the facial key points, and so on.
  • The face picture information may include various kinds of information characterizing the face picture, such as the image information corresponding to the face picture, which is not limited in this embodiment.
  • Step 290: Generate non-key frame face picture information according to the facial key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, and input the non-key frame face picture information into the second neural network for facial key point detection to obtain the facial key point position information corresponding to the non-key frame information.
  • In an implementation, in the case that the non-key frame picture information includes face picture information, before the non-key frame face picture information is generated based on the facial key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, the method further includes: inputting the non-key frame picture information into the face detection and tracking network to obtain the output information of the face detection and tracking network, where the output information includes face probability information; and determining, based on the face probability information, whether the non-key frame picture information includes face picture information.
  • The face probability information can be used to determine whether a non-key frame contains a face picture, for example as the probability that the non-key frame contains a face picture. When the value of the face probability information exceeds a preset threshold, it can be determined that the non-key frame picture information contains face picture information; correspondingly, when the value does not exceed that threshold, it can be determined that the non-key frame picture information does not contain face picture information. If the non-key frame picture information does not include face picture information, it can be determined that the current non-key frame contains no face picture; the non-key frame can then be ignored, and no facial key point detection is performed on it.
  • The non-key frame face picture information represents the face picture information of a non-key frame and may include the facial key point information of the non-key frame information, for example the coordinates of five facial key points in the non-key frame: the position coordinates of the left eye center, the right eye center, the nose tip, the left mouth corner, and the right mouth corner.
  • In an implementation, the output information of the face detection and tracking network may also include face box relative position information and key point relative position information.
  • In the case that the non-key frame picture information includes face picture information, before generating the non-key frame face picture information according to the facial key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, the method may further include: determining the facial key point information of the non-key frame information according to the face box relative position information and the key point relative position information.
  • The face box relative position information can represent the relative position of the regressed face box, for example a 4-dimensional vector output by the face detection and tracking network through its output layer; the key point relative position information can represent the relative positions of the five facial key points, for example a 10-dimensional vector output by the face detection and tracking network through its output layer.
  • In an implementation, the face detection and tracking network in the embodiment of the present application can determine, based on the non-key frame picture information, whether the picture displayed in the non-key frame is a face picture; if it is a face picture, the network regresses the position of the face box in the current frame and outputs the position coordinates of the left eye center, right eye center, nose tip, left mouth corner, and right mouth corner, that is, it outputs the facial key point information of the non-key frame information as the output information of the face detection and tracking network.
  • The five facial key points output by the face detection and tracking network can be a subset of the 106 key points output by the facial key point detection network C and the facial key point detection network F. For example, in the face detection and tracking network, a 2-dimensional vector (p0, p1) can be output through the fully connected layer FC as the face probability information, indicating the probability that the input picture is/is not a face, where p0 represents the probability of a non-face and p1 represents the probability of a face.
  • A 4-dimensional vector (x0, y0, w, h) can be output as the face box relative position information to indicate the relative position of the regressed face box, where (x0, y0) are the coordinates of the upper-left corner of the face box in the picture and (w, h) are the width and height of the face box. For example, if the box information of the non-key frame picture information input to TNet is (x0, y0, w, h) and the output 4-dimensional vector is (dx0, dy0, dx1, dy1), these numbers indicate the position of the detected box relative to the input box, and the corresponding detection box is (x0 + dx0*w, y0 + dy0*h, (dx1 - dx0)*w, (dy1 - dy0)*h). A decoding sketch follows.
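  • A small sketch of decoding TNet's outputs under this convention, assuming Python; the probability threshold, the completed width/height terms of the detection box, and the assumption that the key point offsets are relative to the input box are inferences for illustration rather than values stated verbatim in the source:

```python
def decode_tnet(input_box, probs, box_offsets, point_offsets, threshold=0.5):
    """Decode tracking network outputs for one non-key frame crop.

    input_box:     (x0, y0, w, h) of the crop fed to TNet
    probs:         (p0, p1) = (non-face, face) probabilities
    box_offsets:   (dx0, dy0, dx1, dy1), relative to the input box
    point_offsets: 10 values, (du, dv) per key point, assumed relative
                   to the input box as well
    Returns None when no face is found, otherwise (box, points).
    """
    p0, p1 = probs
    if p1 <= threshold:          # face probability too low:
        return None              # ignore this non-key frame
    x0, y0, w, h = input_box
    dx0, dy0, dx1, dy1 = box_offsets
    box = (x0 + dx0 * w, y0 + dy0 * h, (dx1 - dx0) * w, (dy1 - dy0) * h)
    points = [(x0 + du * w, y0 + dv * h)
              for du, dv in zip(point_offsets[0::2], point_offsets[1::2])]
    return box, points
```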
  • TNet can be a network with a relatively small amount of computation: because the position of the face in the picture changes little between adjacent frames of a video, the facial key point position information passed from the previous frame already gives the approximate position of the face in the current frame, so only a simple face detection and tracking network is needed to regress the face box position.
  • Since the key points passed from the previous frame may no longer fit the current frame exactly, a key point correction module is introduced to correct the coordinate positions of these facial key points.
  • For example, a linear transformation can be used to correct the positions of the 106 facial key points passed from the previous frame with the new information about the current frame learned by TNet; a face picture is then cropped with the smallest square box formed by the corrected 106 facial key point coordinates, as shown by "crop face picture III" in FIG. 3, and the cropped face picture can be scaled to 70 pixels in width and height and input, as the non-key frame face picture information, into the facial key point detection network F for facial key point detection, obtaining the 106 facial key point coordinates of the current frame as the facial key point position information corresponding to the non-key frame information.
  • In an implementation, generating the non-key frame face picture information based on the facial key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information may include: correcting the facial key point position information corresponding to the key frame information according to the facial key point information of the non-key frame information to obtain key point correction information; determining key point tracking position information according to the key point correction information and the initial key point position information; generating a face picture cropping box according to the key point tracking position information; and cropping the non-key frame information and/or the non-key frame picture information with the face picture cropping box to obtain the non-key frame face picture information.
  • In this embodiment, the facial key point coordinates of the previous frame can be used as the approximate facial key point positions of the current frame, so that the information of adjacent frames is used to achieve facial key point detection and tracking on non-key frames.
  • Considering that the face may have moved between frames, a correction step is added based on the five facial key point coordinates output by TNet: the face picture is cropped according to the corrected 106 facial key points, so that the cropped face picture fits the face of the current frame more closely, which corresponds to the role the picture cropping box plays in the key frame processing flow.
  • For example, the face detection and tracking network TNet can regress the coordinates of five facial key points, denoted {(u1', v1'), ..., (u5', v5')} (as shown in FIG. 4, TNet outputs 5 facial key point coordinates). The coordinates of the same five facial key points can be extracted from the 106 facial key point coordinates output by the facial key point detection network F for the previous frame, denoted {(u1, v1), ..., (u5, v5)}, and the remaining 101 facial key point coordinates output by network F are denoted {(u6, v6), ..., (u106, v106)}. The extracted five key point coordinates {(u1, v1), ..., (u5, v5)} are used as the facial key point positions to be aligned with TNet's output; that is, a linear transformation A = S·R together with a displacement b is sought that maps them onto {(u1', v1'), ..., (u5', v5')}, where S is a scaling factor, R is a 2x2 rotation transformation matrix, and b is a 2-dimensional displacement vector.
  • In an implementation, the linear transformation information (A*, b*) can be obtained by the following steps:
  • Step S10: According to μ = (1/5)·Σ(ui, vi) and μ' = (1/5)·Σ(ui', vi'), find the mean coordinates of the two groups of facial key points, and center the two groups of coordinates: (ui, vi) - μ serve as the centered facial key point coordinates of the previous frame, and (ui', vi') - μ' serve as the centered facial key point coordinates of the current frame.
  • Step S40: Determine A* and b* according to the optimal 2x2 rotation matrix R* and the optimal scaling factor S* obtained from the centered coordinates, where A* = S*·R* and b* = μ' - A*·μ.
  • Subsequently, the linear transformation information (A*, b*) can be used to correct, with the new information about the current frame learned by TNet, the positions of the 106 facial key points passed from the previous frame; that is, each of the 106 previous-frame key points p is mapped to A*·p + b*, so that the face picture cropped according to the corrected 106 key points fits the face in the current frame more closely. Facial key point detection is then performed on this crop by the facial key point detection network F, obtaining the 106 facial key point coordinates of the current frame. A sketch of this correction is given below.
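  • The correction described above amounts to a standard least-squares similarity (Procrustes) alignment between the two sets of five key points. Below is a sketch in Python with NumPy; the SVD-based computation of R* and S* is the textbook solution of this alignment problem, supplied here as a plausible reading of the intermediate steps that this text does not preserve:

```python
import numpy as np

def fit_similarity(prev_pts: np.ndarray, cur_pts: np.ndarray):
    """Least-squares similarity transform (A* = S*·R*, b*) mapping
    prev_pts (5x2, five key points from the previous frame) onto
    cur_pts (5x2, the five key points regressed by TNet)."""
    mu_p, mu_c = prev_pts.mean(axis=0), cur_pts.mean(axis=0)  # Step S10
    p, c = prev_pts - mu_p, cur_pts - mu_c                    # centering
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(c.T @ p)
    d = np.sign(np.linalg.det(u @ vt))          # guard against reflections
    r = u @ np.diag([1.0, d]) @ vt              # optimal 2x2 rotation R*
    scale = (s[0] + d * s[1]) / (p ** 2).sum()  # optimal scaling factor S*
    a = scale * r                               # A* = S*·R*     (Step S40)
    b = mu_c - a @ mu_p                         # b* = mu' - A*·mu
    return a, b

def correct_keypoints(points106: np.ndarray, a: np.ndarray, b: np.ndarray):
    """Apply the correction p -> A*·p + b* to all 106 previous-frame points."""
    return points106 @ a.T + b
```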
  • Referring to FIG. 5, the facial key point detection apparatus may include the following modules:
  • the video image frame acquisition module 510, configured to acquire image frame information of the video, where the image frame information of the video includes key frame information and non-key frame information;
  • the first facial key point detection module 520, configured to determine the face box position information according to the key frame information, and to perform facial key point detection through the pre-trained first neural network based on the face box position information to obtain the initial key point position information;
  • the second facial key point detection module 530, configured to perform facial key point detection through the pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the facial key point detection result of the video, where the facial key point detection result includes the facial key point position information corresponding to the key frame information and the facial key point position information corresponding to the non-key frame information.
  • The above facial key point detection apparatus can be integrated in a device. The device can consist of two or more physical entities or of one physical entity; for example, the device can be a personal computer (PC), a computer, a mobile phone, a tablet device, a personal digital assistant, a server, a messaging device, a game console, and so on.
  • An embodiment of the present application further provides a device, including a processor and a memory. At least one instruction is stored in the memory, and when the instruction is executed by the processor, the device performs the facial key point detection method described in the foregoing method embodiments.
  • Referring to FIG. 6, the device may include a processor 60, a memory 61, a display screen 62 with a touch function, an input device 63, an output device 64, and a communication device 65. These components may be connected by a bus or in other ways; in FIG. 6, connection by a bus is taken as the example.
  • The memory 61 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the facial key point detection method of any embodiment of the present application (for example, the video image frame acquisition module 510, the first facial key point detection module 520, and the second facial key point detection module 530 in the facial key point detection apparatus).
  • The processor 60 runs the software programs, instructions, and modules stored in the memory 61 to execute the functional applications and data processing of the device, that is, to implement the facial key point detection method described above.
  • When the processor 60 executes the one or more programs stored in the memory 61, the following operations are implemented: acquiring image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information; determining face box position information according to the key frame information, and performing facial key point detection through a pre-trained first neural network based on the face box position information to obtain initial key point position information; and performing facial key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the facial key point detection result of the video, where the facial key point detection result includes the facial key point position information corresponding to the key frame information and the facial key point position information corresponding to the non-key frame information.
  • An embodiment of the present application further provides a computer-readable storage medium; when the instructions in the storage medium are executed by a processor of a device, the device can perform the facial key point detection method described in the above method embodiments.
  • The method includes: acquiring image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information; determining face box position information according to the key frame information, and performing facial key point detection through a pre-trained first neural network based on the face box position information to obtain initial key point position information; and performing facial key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the facial key point detection result of the video, where the facial key point detection result includes the facial key point position information corresponding to the key frame information and the facial key point position information corresponding to the non-key frame information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Geometry (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A method, apparatus and device for facial key point detection, and a storage medium, the method comprising: acquiring image frame information of a video, the image frame information of the video comprising key frame information and non-key frame information; determining face box position information according to the key frame information; detecting facial key points by means of a pre-trained first neural network on the basis of the face box position information so as to obtain initial key point position information; and detecting facial key points by means of a pre-trained second neural network on the basis of the initial key point position information and the image frame information of the video so as to obtain a facial key point detection result of the video, the facial key point detection result comprising facial key point position information corresponding to the key frame information and facial key point position information corresponding to the non-key frame information.

Description

Method, apparatus, device, and storage medium for facial key point detection
This application claims priority to the Chinese patent application No. 201910473174.2, filed with the Chinese Patent Office on May 31, 2019, the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of computer vision technology, for example, to a facial key point detection method, apparatus, device, and storage medium.
Background
In the field of computer vision, algorithm development based on video data has long received extensive attention from academia and industry. Among video data, face video data plays a particularly important role because of its practical application scenarios in fields such as biometric verification, surveillance and security, and live video streaming. Facial key point detection is a very important step in face image processing; its main function is to accurately locate the positions of facial key points, such as the eyes, nose, mouth corners, and face contour points, in a picture, in preparation for subsequent operations such as face alignment and face recognition.
In implementation, facial key point detection is usually a step that follows face detection. The face detector typically feeds the detected face position information, for example given in the form of a rectangular or square box, together with the corresponding face picture into a key point detection algorithm, and the computed result is determined as the facial key point positions. Facial key point detection algorithms based on deep convolutional networks offer a large improvement in accuracy over traditional facial key point algorithms. However, facial key point detection methods built on a single deep convolutional network are usually computationally intensive, and the network structure of the deep convolutional network must be carefully designed and arranged; otherwise it is difficult to achieve real-time processing on platforms with limited computing resources, for example on mobile terminals such as mobile phones.
发明内容Summary of the invention
本申请实施例提供一种新的人脸关键点检测方法、系统、设备以及存储介质,以解决人脸关键点检测方法在移动端中受计算能力有限、存储空间较小及实时性要求高等限制的问题。The embodiments of this application provide a new face key point detection method, system, device and storage medium to solve the limitation of limited computing power, small storage space and high real-time requirements in the mobile terminal of the face key point detection method. The problem.
本申请实施例提供了一种人脸关键点检测方法,包括:获取视频的图像帧信息,其中,所述视频的图像帧信息包含关键帧信息和非关键帧信息;根据所述关键帧信息确定人脸框位置信息;基于所述人脸框位置信息,通过预先训练 的第一神经网络进行人脸关键点检测,得到初始关键点位置信息;基于所述初始关键点位置信息和所述视频的图像帧信息,通过预先训练的第二神经网络进行人脸关键点检测,得到所述视频的人脸关键点检测结果,其中,所述人脸关键点检测结果包含所述关键帧信息对应的人脸关键点位置信息和所述非关键帧信息对应的人脸关键点位置信息。An embodiment of the present application provides a method for detecting key points of a face, including: acquiring image frame information of a video, wherein the image frame information of the video includes key frame information and non-key frame information; determining according to the key frame information Face frame position information; based on the face frame position information, face key point detection is performed through the pre-trained first neural network to obtain initial key point position information; based on the initial key point position information and the video Image frame information, face key point detection is performed through a pre-trained second neural network to obtain the face key point detection result of the video, wherein the face key point detection result includes the person corresponding to the key frame information Face key point position information and face key point position information corresponding to the non-key frame information.
An embodiment of the present application further provides a face key point detection apparatus, including:
a video image frame acquisition module, configured to acquire image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information;
a first face key point detection module, configured to determine face frame position information according to the key frame information, and to perform face key point detection through a pre-trained first neural network based on the face frame position information to obtain initial key point position information; and
a second face key point detection module, configured to perform face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain a face key point detection result of the video, where the face key point detection result includes face key point position information corresponding to the key frame information and face key point position information corresponding to the non-key frame information.
An embodiment of the present application further provides a device, including a processor and a memory; the memory stores at least one instruction, and the instruction is executed by the processor so that the device performs the above face key point detection method.
An embodiment of the present application further provides a computer-readable storage medium; when the instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the above face key point detection method.
Description of the drawings
FIG. 1 is a schematic flowchart of the steps of a face key point detection method in an embodiment of the present application;
FIG. 2 is a schematic flowchart of the steps of a face key point detection method in an optional embodiment of the present application;
FIG. 3 is a schematic diagram of a face key point detection and tracking process for a video in an example of the present application;
FIG. 4 is a schematic flowchart of correcting the face key points of a previous frame in an example of the present application;
FIG. 5 is a schematic structural block diagram of an embodiment of a face key point detection apparatus in an embodiment of the present application;
FIG. 6 is a schematic structural block diagram of a device in an example of the present application.
Detailed description
The present application is described below with reference to the drawings and embodiments. The specific embodiments described here are only used to explain the application, not to limit it. For ease of description, the drawings show only the parts related to the present application rather than all of its structures or components.
Most face key point detection algorithms are designed for single static images; for face key points in a video, the usual approaches are to process the video frame by frame, or to track the face with a general object tracking algorithm and then detect the face key points. Face key point tracking schemes can be roughly divided into two categories. The first category performs face detection and face key point detection frame by frame. The second category performs face detection on the first image frame, then tracks the face frame in subsequent image frames with a general object tracking method using the detected face as the target, runs a key point detection algorithm on every tracked face, and re-applies the face detector whenever tracking fails on an image frame and no face is found. The first category requires face detection and key point detection on every image frame and does not exploit the correlation between adjacent frames, so its speed is limited; in addition, because every image frame is processed independently, key point jitter easily occurs, which affects subsequent modules that depend on key point stability, such as a module that applies face sticker special effects based on the detected face key points, and degrades the user experience. The second category re-applies the face detector when tracking fails on an image frame and no face is found; although it is more stable in terms of key points than the first category, general object tracking methods are usually time-consuming, and two further problems remain. One problem is that faces in videos, such as mobile phone videos, often undergo rapid changes in pose, scale, occlusion, and expression, which causes the object tracking method to fail and the face detector to be re-applied. The other problem is that general face key point algorithms are sensitive to the relative position of the input face within the face frame: if the input face frame is perturbed, the key point detection algorithm outputs very different results before and after the perturbation, and a face frame obtained by tracking fits the face less tightly than one obtained by the detector, which leads to errors in key point detection. It can be seen that these face key point detection methods suffer from high computational complexity and easily lose the tracked target. In addition, most application scenarios of face key point detection are on mobile terminals such as mobile phones, where face key point detection solutions face constraints such as limited computing power, small storage space, and high real-time requirements.
To achieve fast and stable face key point detection and tracking, an embodiment of the present application proposes a face key point detection method. After video information is acquired, face frame position information can be determined according to the key frame information in the video information, initial key point position information can be determined from the face frame position information through a first neural network, and face key point detection can then be performed through a second neural network based on the initial key point position information to obtain the face key point detection result of the video. That is, a two-stage neural network is used to detect face key points in the video information, so that face key point detection in videos can be processed efficiently.
Referring to FIG. 1, a schematic flowchart of the steps of a face key point detection method in an embodiment of the present application is shown. The face key point detection method can be used in face vision applications such as face recognition, special-effect stickers on faces, and face-swap effects, and may include the following steps:
Step 110: acquire image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information.
A video may contain one or more video frames; each video frame may contain an image frame used to display the video picture and/or an audio frame used to play the video sound. The image frame information of the video in this embodiment can represent the image frames in the video; for example, it can refer to the image information in a video frame, which can be used to display the video picture so that the user can watch the video.
When detecting face key points in a video, the embodiment of the present application can acquire the image frame information of the video to be processed, so as to perform face key point detection according to the face picture information in the image frame information. The face picture information can represent the face pictures contained in a video frame. For example, when a video frame contains the face picture of one person, the face picture displayed by that person in the video frame can be determined based on the face picture information in the image frame information; when a video frame contains multiple face pictures, the face pictures of the multiple people displayed in the video frame can likewise be determined based on the face picture information in the image frame information.
After acquiring the image frame information of the video, the embodiment of the present application can divide the acquired image frame information into key frame information and non-key frame information, so as to detect the face frame position based on the key frame information, that is, to perform step 120. The key frame information represents key image frames (key frames for short) in the video, and the non-key frame information represents non-key image frames (non-key frames for short) in the video.
Step 120: determine face frame position information according to the key frame information, and perform face key point detection through a pre-trained first neural network based on the face frame position information to obtain initial key point position information.
In this embodiment, after the key frame information representing a key frame is acquired, a preset face detector, such as the Multi-Task Cascaded Convolutional Network (MTCNN) for joint face detection and alignment, can be used to detect the key frame information and produce face frame position information. The face frame position information represents the position of the face frame and determines where the face frame is displayed in the image frame of the video. Subsequently, a face frame picture can be cropped from the key frame based on the face frame position information; that is, a face frame picture containing the face is cropped from the image frame serving as the video key frame according to the face frame position, and the corresponding face frame picture information is generated to represent the cropped face picture. The generated face frame picture information can then be input into the pre-trained first neural network for face key point detection, so as to preliminarily detect the positions of the face key points. For example, the output information of the first neural network can be taken as the initial key point position information, which can subsequently be used to preliminarily determine the face key point positions, such as the approximate positions of the face key points in the current key frame.
Step 130: perform face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video, to obtain the face key point detection result of the video.
The face key point detection result includes the face key point position information corresponding to the key frame information and the face key point position information corresponding to the non-key frame information.
The face key point detection result of the video in this embodiment can be used to determine the face key point positions of every image frame in the video, and may include the face key point position information corresponding to the information of each image frame, such as the face key point position information corresponding to the key frame information and the face key point position information corresponding to the non-key frame information. The face key point position information corresponding to image frame information represents the face key point positions of that image frame: for example, the face key point position information corresponding to the key frame information represents the face key point positions in a key frame, and the face key point position information corresponding to the non-key frame information represents the face key point positions in a non-key frame.
In implementation, after the initial key point position information is determined in the embodiment of the present application, a picture cropping frame can be generated based on the initial key point position information according to the approximate positions of the face key points in the current key frame, and the face picture in the current key frame can then be cropped with this picture cropping frame; that is, the picture cropping frame can be used to crop the image frame of the video to obtain key frame face picture information, which represents the picture obtained by this cropping. Subsequently, the key frame face picture information can be input into the pre-trained second neural network for face key point detection, and the information output by the second neural network can be determined as the face key point information of the current key frame. Face key point detection and tracking can then be performed on the non-key frames in the video based on the face key point information of the key frame to obtain the face key point information of the non-key frames, so that the face key point detection result of the video can be generated based on the face key point information of the key frame and/or the face key point information of the non-key frames.
In summary, after acquiring the image frame information of a video, the embodiment of the present application can determine face frame position information according to the key frame information in the image frame information of the video, perform face key point detection through the first neural network based on the face frame position information to obtain initial key point position information, and then perform face key point detection through the second neural network based on the initial key point position information and the image frame information of the video. That is, a two-stage neural network is used for face key point detection, which solves the problems of high computational complexity, large amount of calculation, and poor real-time processing found in related-art schemes that implement face key point detection with a single deep convolutional network, and enables the positions of face key points to be detected quickly and stably, achieving fast and stable detection and tracking of face key points in videos.
In actual processing, after the image frames of the video are acquired in this embodiment, one or more of the image frames can be selected as key frames according to a preset rule. For example, one frame out of every N image frames of the video is selected as a key frame, and the remaining image frames are taken as non-key frames; that is, the (N-1) consecutive image frames adjacent to a key frame are determined as the non-key frames corresponding to that key frame, where the value of N can be determined according to, and may vary with, different application scenarios. Subsequently, the face detector can be used to detect the face frame position in the key frame, the face frame image information can be cropped from the key frame information according to the face frame position, and face key point detection can then be performed on the face frame image information through the first neural network to detect the approximate face key points in the key frame.
On the basis of the above embodiment, optionally, before determining the face frame position information according to the key frame information, the face key point detection method provided in this embodiment may further include: selecting, from the image frame information of the video, key frame information and the non-key frame information corresponding to the key frame information. Subsequently, the face frame position information can be determined according to the key frame information, so that the approximate positions of the face key points can be determined through the first neural network based on the face frame position information. For example, the t-th frame picture in the video can be determined as key frame information, and the pictures from the (t+1)-th frame to the (t+N-1)-th frame in the video can be determined as non-key frame information associated with the t-th frame picture serving as the key frame information, that is, as the non-key frame information corresponding to the above key frame information, where t can be an integer greater than 0.
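As a minimal illustration of the frame-splitting rule above, the following Python sketch partitions frame indices into key frames and their associated non-key frames; the function name and the example value of N are illustrative only, since N is scenario-dependent:

    def split_frames(num_frames, n):
        """Partition frame indices: one key frame out of every n frames;
        the (n - 1) frames that follow a key frame are the non-key frames
        associated with it."""
        keys, assoc = [], {}
        for t in range(0, num_frames, n):
            keys.append(t)
            assoc[t] = list(range(t + 1, min(t + n, num_frames)))
        return keys, assoc

    keys, assoc = split_frames(num_frames=10, n=5)
    # keys == [0, 5]; assoc == {0: [1, 2, 3, 4], 5: [6, 7, 8, 9]}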
Optionally, determining the face frame position information according to the key frame information may include: inputting the key frame information into a face detector, where the face detector is used to detect the face frame position; and determining the output information of the face detector as the face frame position information. The face frame position can thus be determined based on the face frame position information, so that a face frame picture is cropped from the key frame according to the face frame position and the corresponding face frame picture information is generated. For example, MTCNN serving as the face detector detects the face frame positions, and the box corresponding to each face frame position is expanded into a square, with the center of the box as the center of the square and the long side of the box as the side of the square; the face frame picture information cropped by this square can then be input into the first neural network for face key point detection. The first neural network can serve as the first-stage face key point detection network in the face key point detection process; it performs key point detection on the face frame pictures of the video key frames and outputs initial key point position information, so that the subsequent flow can also perform face key point detection on key frames and/or non-key frames based on this initial key point position information, enabling the face key point positions in the video to be detected quickly and stably. The initial key point position information can be used to preliminarily determine the approximate positions of the face key points.
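The square expansion described above can be sketched as follows, assuming detector boxes are given as (top-left x, top-left y, width w, height h); clipping the square to the image bounds is left to the caller:

    def expand_to_square(x, y, w, h):
        """Expand a detector box into a square whose center is the box
        center and whose side equals the box's longer side."""
        side = max(w, h)
        cx, cy = x + w / 2.0, y + h / 2.0
        return cx - side / 2.0, cy - side / 2.0, side, side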
In an optional embodiment of the present application, performing face key point detection through the pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the face key point detection result of the video includes: generating a picture cropping frame according to the initial key point position information; cropping the key frame information with the picture cropping frame to obtain key frame face picture information; and inputting the key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the key frame information.
After the initial key point position information is determined in the embodiment of the present application, a picture cropping frame can be generated based on the initial key point position information, and the face picture information corresponding to the current key frame can be cropped out with this picture cropping frame according to the approximate positions of the face key points; that is, the key frame information is cropped to obtain key frame face picture information. Subsequently, the key frame face picture information can be input into the second neural network for face key point detection, so as to accurately determine the face key point positions of the key frame, and the output information of the second neural network can be determined as the face key point position information corresponding to the key frame information. Face key point detection and tracking can then be performed on the non-key frame information corresponding to the key frame information based on this face key point position information; that is, the information between adjacent frames of the video is used to track face key points and generate the face key point position information corresponding to the non-key frame information, so that the face key point detection result of the video can be generated based on the face key point position information corresponding to the key frame information and/or the non-key frame information, achieving high-speed processing of face key point detection in videos.
On the basis of the above embodiment, optionally, performing face key point detection through the pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the face key point detection result of the video may further include: cropping the non-key frame information corresponding to the key frame information with the picture cropping frame to obtain non-key frame picture information; and when the non-key frame picture information contains face picture information, generating non-key frame face picture information according to the face key point position information corresponding to the key frame information, and inputting the non-key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the non-key frame information.
Referring to FIG. 2, a schematic flowchart of the steps of a face key point detection method in an optional embodiment of the present application is shown. The face key point detection method may include the following steps:
Step 210: acquire image frame information of a video.
The image frame information of the video includes key frame information and non-key frame information.
Step 220: select, from the image frame information of the video, key frame information and the non-key frame information corresponding to the key frame information.
Step 230: input the key frame information into a face detector.
The face detector is used to detect the face frame position.
Step 240: determine the output information of the face detector as the face frame position information.
After the key frame information is selected from the video in the embodiment of the present application, it can be input into the face detector so that the face frame position of the key frame is detected by the face detector; the output information of the face detector can then be determined as the face frame position information, and based on this information the face frame picture information is cropped from the key frame according to the face frame position for preliminary face key point detection, that is, step 250 is performed.
Step 250: perform face key point detection through the pre-trained first neural network based on the face frame position information to obtain initial key point position information.
Step 260: generate a picture cropping frame according to the initial key point position information.
Step 270: crop the key frame information with the picture cropping frame to obtain key frame face picture information, and input the key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the key frame information.
After the face frame position information is determined in the embodiment of the present application, the face frame picture information can be cropped from the key frame according to the face frame position, and the cropped face picture information is input into the first neural network for face key point detection to obtain initial key point position information. A picture cropping frame can then be generated based on the initial key point position information, and the key frame is cropped with this picture cropping frame according to the approximate positions of the face key points to obtain key frame face picture information, which represents the face picture in the video key frame. Subsequently, the key frame face picture information can be input into the second neural network for face key point detection, so that the face key point positions in the key frame are determined accurately and stably based on the information output by the second neural network. For example, the information output by the second neural network can be determined as the face key point position information corresponding to the key frame information, so that the face key points of the non-key frames can subsequently be detected and tracked based on it.
As an example of the present application, in the case where one frame out of every N frames of the video is selected as a key frame and the remaining frames are taken as non-key frames, as shown in FIG. 3, the t-th frame and the (t+N)-th frame in the video can be determined as key frame information. In each key frame, MTCNN serving as the face detector can be used to detect the face frame positions, each box determined by MTCNN can be expanded into a square, and a face picture is cropped according to this square (the "crop face picture I" module shown in FIG. 3). The cropped face picture can be scaled to 70 pixels in both width and height and then input into the face key point detection network C serving as the first neural network, obtaining 106 face key point coordinates as the initial key point position information. Subsequently, a face picture can be cropped according to the minimal square frame formed by the 106 face key point coordinates (the "crop face picture II" module shown in FIG. 3); that is, this minimal square frame serves as the picture cropping frame with which the key frame information is cropped to obtain the key frame face picture information. For example, the cropped face picture can be scaled to 70 pixels in both width and height and input into the face key point detection network F serving as the second neural network, obtaining more accurate 106 face key point coordinates as the face key point position information corresponding to the key frame information. The face key point detection result of the corresponding video can thus be generated based on the face key point position information corresponding to the key frame information, so that subsequent key point processing can be performed on that basis, and the face key points of the non-key frames can be detected and tracked according to the obtained face key point position information corresponding to the key frame information, that is, step 280 is performed, using the information between adjacent frames to directly track the face key points in the video and thereby process face key point detection in the video efficiently.
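The cropping used around the two networks can be sketched as follows; cv2 is an assumed dependency for resizing, and the helpers named in the trailing comment (detect_face_square, net_c, net_f) are hypothetical stand-ins for MTCNN plus square expansion and for the two keypoint networks:

    import numpy as np
    import cv2  # assumed available for resizing

    def min_square(points):
        """Smallest centered square covering all keypoints: the bounding
        box of the points, expanded so its side equals the longer side."""
        x0, y0 = points.min(axis=0)
        x1, y1 = points.max(axis=0)
        side = max(x1 - x0, y1 - y0)
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        return cx - side / 2.0, cy - side / 2.0, side

    def crop_resize(image, x, y, side, size=70):
        """Crop a square patch and scale it to size x size pixels."""
        x, y, side = int(round(x)), int(round(y)), int(round(side))
        patch = image[max(y, 0):y + side, max(x, 0):x + side]
        return cv2.resize(patch, (size, size))

    # Key frame flow, with hypothetical helpers:
    #   sq = detect_face_square(frame)          # MTCNN box -> (x, y, side)
    #   rough = net_c(crop_resize(frame, *sq))  # 106 rough keypoints
    #   fine = net_f(crop_resize(frame, *min_square(rough)))  # refined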
Both the face key point detection network C and the face key point detection network F in this example can extract features through multiple convolution layers and pooling layers, and regress the relative positions of the key points through a fully connected layer. Although the two face key point detection networks have the same network structure, fewer channels are used in every layer of the face key point detection network C, so network C serving as the first neural network is lighter than network F serving as the second neural network. In addition, the input pictures of the two networks are cropped differently: the input picture of network C can be cropped using the face frame, while the input picture of network F can be cropped according to the 106 face key points, and an input picture cropped according to the positions of these 106 face key points fits the face more tightly. Furthermore, the two networks can be trained independently, and the weights of every convolution layer can differ, reducing the impact of inaccurate key points caused by a face frame that does not fit the face tightly enough. Thus, the embodiment of the present application obtains more accurate key point positions through a progressive, two-stage neural network approach to face key point detection: network C regresses the rough positions of the key points, and network F refines them into more accurate key points.
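The patent does not disclose the layer counts or channel widths of networks C and F, so the following PyTorch sketch only illustrates the stated structure: stacked convolution and pooling layers extracting features from a 70x70 input, a fully connected layer regressing the key point positions, and a channel-width multiplier that makes network C lighter than network F while keeping the same structure. All sizes here are assumptions:

    import torch
    import torch.nn as nn

    class KeypointNet(nn.Module):
        """Conv + pooling feature extractor with a fully connected
        regressor of keypoint positions; widths are illustrative."""
        def __init__(self, width_mult=1.0, num_points=106):
            super().__init__()
            self.num_points = num_points
            c1, c2, c3 = (max(1, int(c * width_mult)) for c in (16, 32, 64))
            self.features = nn.Sequential(
                nn.Conv2d(3, c1, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),   # 70x70 -> 35x35
                nn.Conv2d(c1, c2, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),   # 35x35 -> 17x17
                nn.Conv2d(c2, c3, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),   # 17x17 -> 8x8
            )
            self.fc = nn.Linear(c3 * 8 * 8, num_points * 2)

        def forward(self, x):  # x: (batch, 3, 70, 70)
            feats = self.features(x).flatten(1)
            return self.fc(feats).view(-1, self.num_points, 2)

    # Same structure, different widths; the two are trained independently.
    net_c = KeypointNet(width_mult=0.5)  # lighter first-stage network C
    net_f = KeypointNet(width_mult=1.0)  # second-stage network F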
Step 280: crop the non-key frame information corresponding to the key frame information with the picture cropping frame to obtain non-key frame picture information.
When processing a non-key frame, this embodiment can crop the current frame with the picture cropping frame, for example cropping the (t+1)-th frame picture shown in FIG. 3, and generate the corresponding non-key frame picture information based on the cropped picture. The non-key frame picture information represents a picture cropped from a non-key frame of the video according to the face key point positions of the key frame. Subsequently, the non-key frame picture information can be used as the input of a face detection and tracking network, so that the face in the non-key frame picture information is detected and tracked through this network, for example by determining whether the non-key frame picture information contains face picture information. The face detection and tracking network can serve as the face detector for non-key frames; it can be, for example, the face detector tracking network (Tracking Net, TNet) shown in FIG. 3, which can judge whether the non-key frame picture information contains face picture information, that is, whether the input picture is a face picture, and, when it is, output the relative position of the face frame, the relative positions of the face key points, and so on. The face picture information may include various kinds of information representing the face picture, such as the image information corresponding to the face picture, which is not limited in this embodiment.
Step 290: generate non-key frame face picture information according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, and input the non-key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the non-key frame information.
Optionally, in the embodiment of the present application, after the non-key frame picture information is obtained and before the non-key frame face picture information is generated (in the case where the non-key frame picture information contains face picture information) according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, the method may further include: inputting the non-key frame picture information into the face detection and tracking network to obtain the output information of the face detection and tracking network, where the output information contains face probability information; and determining, based on the face probability information, whether the non-key frame picture information contains face picture information. The face probability information can be used to determine whether a non-key frame contains a face picture, for example by representing the probability that the non-key frame contains a face picture: when the value of the face probability information exceeds a certain threshold, it can be determined that the non-key frame picture information contains face picture information; correspondingly, when the value does not exceed the threshold, it can be determined that it does not. If the non-key frame picture information does not contain face picture information, it can be determined that the current non-key frame contains no face picture; the non-key frame can then be ignored, and no face key point detection is performed on it. If the non-key frame picture information contains face picture information, it can be determined that the current non-key frame contains a face picture, and face detection can be performed on the non-key frame through the face detection and tracking network using the face key point positions of the key frame, so as to detect the approximate positions of the face key points in the non-key frame and generate the corresponding non-key frame face picture information, that is, step 290 is performed. The non-key frame face picture information can represent the face picture information of the non-key frame, and may include the face key point information of the non-key frame information, such as the coordinates of 5 face key points in the non-key frame, namely the position coordinates of the left eye center, the right eye center, the nose tip, the left corner of the mouth, and the right corner of the mouth.
In an optional embodiment of the present application, the output information of the face detection and tracking network may further contain face frame relative position information and key point relative position information. After the non-key frame picture information is obtained and before the non-key frame face picture information is generated (in the case where the non-key frame picture information contains face picture information) according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, the method may further include: determining the face key point information of the non-key frame information according to the face frame relative position information and the key point relative position information. The face frame relative position information can represent the relative position of the regressed face frame, for example a 4-dimensional vector output by the face detection and tracking network through its output layer; the key point relative position information can represent the relative positions of the 5 face key points, for example a 10-dimensional vector output through the output layer.
In actual processing, after receiving the input non-key frame picture information, the face detection and tracking network in the embodiment of the present application can judge, based on this information, whether the picture displayed in the non-key frame is a face picture. If it is a face picture, the network regresses the position of the face frame of that face picture in the current frame, and can output the position coordinates of 5 of the face key points, namely the left eye center, the right eye center, the nose tip, the left corner of the mouth, and the right corner of the mouth; that is, it outputs the face key point information of the non-key frame information as the output information of the face detection and tracking network.
Following the above example, the 5 face key points output by the face detection and tracking network can be a subset of the 106 key points output by the face key point detection networks C and F. When the output layer of the face detection and tracking network is a fully connected layer FC, the network can output through this layer a 2-dimensional vector (p0, p1) as the face probability information, representing the probabilities that the input picture is not / is a face: with p0 the probability of a non-face and p1 the probability of a face, p0 + p1 = 1, a face is judged to be detected when p1 exceeds a preset threshold, and the current input picture is judged to be a non-face picture when p1 does not exceed the threshold. The network can also output a 4-dimensional vector as the face frame relative position information, representing the relative position of the regressed face frame. In the box format (x0, y0, w, h), (x0, y0) can be the coordinates of the top-left corner of the box in the picture and (w, h) its width and height; if the box information input into TNet as the non-key frame picture information is (x0, y0, w, h) and the output 4-dimensional vector is (dx0, dy0, dx1, dy1), these four numbers express the position of the detected box relative to the input box, and the corresponding detected box is (x0+dx0*w, y0+dy0*h, (dx1-dx0)*w, (dy1-dy0)*h), so that this detected box can subsequently be used as the face frame of the non-key frame to crop the non-key frame picture. The network can further output a 10-dimensional vector (dx0, dy0, ..., dx4, dy4) as the key point relative position information, representing the relative positions of the 5 face key points, so that the coordinates of the 5 face key points of the non-key frame are determined as (x0+dx0*w, y0+dy0*h, ..., x0+dx4*w, y0+dy4*h) and serve as the face key point information of the non-key frame information. Based on these 5 face key point coordinates, the face key point position information of the previous frame can then be corrected to obtain the non-key frame face picture information input into the second neural network ("crop face picture III" shown in FIG. 3), and face key point detection can be performed on it through the second neural network to produce the face key point position information of the non-key frame. Compared with the face detector MTCNN used on key frames, TNet can be a network with a much smaller amount of computation: because the position of the face in the picture changes little between adjacent frames of a video, and the face key point position information passed from the previous frame already gives the approximate position of the face in the current frame, a simple face detection and tracking network suffices to regress the position of the face frame.
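Decoding TNet's three outputs follows directly from the formulas above; in this sketch the function name and the 0.5 threshold are illustrative assumptions (the text only requires that p1 exceed a preset threshold):

    def decode_tnet(p, box_reg, kp_reg, input_box, threshold=0.5):
        """Decode TNet outputs for one non-key frame crop.

        p         -- (p0, p1): probabilities of non-face / face, p0 + p1 == 1
        box_reg   -- (dx0, dy0, dx1, dy1): face box relative to the input box
        kp_reg    -- (dx0, dy0, ..., dx4, dy4): 5 keypoints relative to it
        input_box -- (x0, y0, w, h): the crop that was fed to TNet
        """
        p0, p1 = p
        if p1 <= threshold:
            return None  # no face found: fall back to the key frame detector flow
        x0, y0, w, h = input_box
        dx0, dy0, dx1, dy1 = box_reg
        face_box = (x0 + dx0 * w, y0 + dy0 * h,
                    (dx1 - dx0) * w, (dy1 - dy0) * h)
        keypoints = [(x0 + kp_reg[2 * i] * w, y0 + kp_reg[2 * i + 1] * h)
                     for i in range(5)]
        return face_box, keypoints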
Since the information of the previous frame is used as key frame information on the current frame, deviations may occur under fast face motion, so a key point correction module is introduced to correct the coordinate positions of these face key points. In an optional implementation, a linear transformation can be used: the new information of the current frame learned by TNet corrects the positions of the 106 face key points passed from the previous frame, so that the corrected coordinates of the 106 face key points form a minimal square frame with which a face picture is cropped ("crop face picture III" in FIG. 3). The cropped face picture can be scaled to 70 pixels in both width and height and then, as the non-key frame face picture information, input into the face key point detection network F for face key point detection, obtaining the 106 face key point coordinates of the current frame as the face key point position information corresponding to the non-key frame information.
Optionally, in this embodiment, generating the non-key frame face picture information according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information may include: correcting the face key point position information corresponding to the key frame information based on the face key point information of the non-key frame information to obtain key point correction information; determining key point tracking position information according to the key point correction information and the initial key point position information; generating a face picture cropping frame according to the key point tracking position information; and cropping the non-key frame information and/or the non-key frame picture information with the face picture cropping frame to obtain the non-key frame face picture information.
In the face key point tracking process of this embodiment, for non-key frames, the face key point coordinates of the previous frame can be used as the approximate positions of the corresponding face key points in the current frame, so that the information of adjacent frames is used to detect and track face key points on non-key frames. To cope with changes between frames, a correction step is added: taking the 5 face key point coordinates output by TNet as the reference, linear transformation information (A*, b*) is computed from the difference between the 5 face key point coordinates output by TNet and the corresponding 5 face key point coordinates of the previous frame, and this linear transformation information (A*, b*) is then applied to all 106 face key points of the previous frame to obtain corrected 106-point face key point information, according to which the face picture is cropped. The cropped face picture thus fits the face of the current frame more closely, and in effect this plays the role that the face key point detection network C plays in the key frame processing flow.
For example, following the above example, on a non-key frame, a face detection and tracking network such as TNet can regress the coordinates of 5 face key points, denoted {(u_1', v_1'), ..., (u_5', v_5')}; as shown in FIG. 4, TNet outputs these 5 face key point coordinates. The coordinates of these 5 face key points can be extracted from the 106 face key point coordinates output by the face key point detection network F on the previous frame, denoted {(u_1, v_1), ..., (u_5, v_5)}, and the coordinates of the remaining 101 face key points output by network F are denoted {(u_6, v_6), ..., (u_106, v_106)}. The extracted coordinates {(u_1, v_1), ..., (u_5, v_5)} of the 5 face key points then serve as the face key point position information corresponding to the key frame information, which is corrected to obtain the key point correction information.
作为本申请的一个可选实施方式,可以通过计算公式
Figure PCTCN2020081262-appb-000001
确定出作为关键点修正信息的线性变换信息(A*,b*)。其中,A可以通过计算公式
Figure PCTCN2020081262-appb-000002
来确定,b可以根据公式b=(b x,b y)来确定。S可以是表征缩放系数,R可以是2x2的旋转变换矩阵,b可以是2维的位移向量。
As an optional implementation of this application, you can use the calculation formula
Figure PCTCN2020081262-appb-000001
Determine the linear transformation information (A*, b*) as the key point correction information. Among them, A can be calculated by the formula
Figure PCTCN2020081262-appb-000002
To determine, b can be determined according to the formula b=(b x , b y ). S can be a characterizing scaling factor, R can be a 2x2 rotation transformation matrix, and b can be a 2-dimensional displacement vector.
一实施例中,线性变换信息(A*,b*)可以由以下步骤得到:In an embodiment, the linear transformation information (A*, b*) can be obtained by the following steps:
Step S10: compute the average coordinates of the two sets of face key points, m = (1/5) Σ_{i=1}^{5} p_i for the previous frame and m' = (1/5) Σ_{i=1}^{5} p_i' for the current frame, and center both sets of coordinates, for example as x_i = p_i − m for the previous-frame key points and y_i = p_i' − m' for the current-frame key points.
Step S20: compute the 2×2 matrix C, a cross-covariance of the two centered point sets, and perform singular value decomposition of C according to the formula C = UΣV^T to obtain the optimal 2×2 rotation matrix R*, with R* = V^T U.
Step S30: compute the value S* from the optimal 2×2 rotation matrix R*, for example according to the calculation formula S* = e/d, where, in the standard least-squares form, e = Σ_{i=1}^{5} y_i^T R* x_i and d = Σ_{i=1}^{5} ||x_i||².
Step S40: determine A* and b* according to the optimal 2×2 rotation matrix R* and the value S*, with A* = S*·R* and, in the standard form, b* = m' − A*·m.
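Putting steps S10 to S40 together, the following Python sketch (using numpy) shows one way the estimation could be implemented. It is a minimal sketch in the standard similarity-Procrustes form, assuming the convention C = Σ y_i x_i^T, under which the optimal rotation comes out as U V^T up to a reflection correction; this may differ superficially from the notation above, whose exact formula images are not reproduced here.

import numpy as np

def estimate_similarity(prev_5, curr_5):
    # prev_5, curr_5: (5, 2) arrays of corresponding face key points,
    # previous frame and TNet output respectively.
    # S10: mean coordinates and centering.
    m_prev = prev_5.mean(axis=0)
    m_curr = curr_5.mean(axis=0)
    x = prev_5 - m_prev                       # centered previous-frame points
    y = curr_5 - m_curr                       # centered current-frame points
    # S20: 2x2 cross-covariance and its SVD.
    C = y.T @ x                               # assumed convention for C
    U, sigma, Vt = np.linalg.svd(C)
    D = np.diag([1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt                            # optimal 2x2 rotation R*
    # S30: scale S* = e / d.
    d = (x ** 2).sum()                        # d = sum_i ||x_i||^2
    e = (y * (x @ R.T)).sum()                 # e = sum_i y_i . (R* x_i)
    s = e / d
    # S40: A* = S* R*, and b* aligns the two centroids.
    A = s * R
    b = m_curr - A @ m_prev
    return A, b

Because only five 2-D correspondences and a 2×2 SVD are involved, this correction is cheap enough to run on every non-key frame.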
Subsequently, the linear transformation information (A*, b*) can be used, together with the new current-frame information learned by TNet, to correct the positions of the 106 face key points passed on from the previous frame, for example according to the correction formula

    p_j ← A*·p_j + b*, j = 1, ..., 106,

which applies the linear transformation (A*, b*) to the coordinate positions of all 106 face key points of the previous frame, so that the face image cropped according to the corrected 106 key points fits the face in the current frame more closely. The minimal square box formed from the corrected coordinates of the 106 key points can be used to crop a face image (cropped face image III in Figure 3), and the cropped face image can be scaled to 70 pixels in both width and height and input into the face key point detection network F to obtain the 106 face key point coordinates of the current frame. That is, a face picture cropping frame is generated according to the key point tracking position information; the non-key frame information and/or the non-key frame picture information is cropped through the face picture cropping frame to obtain non-key frame face picture information; and the non-key frame face picture information is input into the face key point detection network F, serving as the second neural network, for face key point detection, obtaining the face key point position information corresponding to the non-key frame information. The face key point detection result of the video can subsequently be generated based on this information, achieving the purpose of face key point detection and tracking in the video.
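As a hedged illustration of this correct-then-crop step, the following sketch applies (A*, b*) to the 106 points, takes the minimal enclosing square, and rescales to 70×70; cv2.resize from OpenCV is assumed for the scaling, and boundary handling is simplified:

import cv2
import numpy as np

def correct_and_crop(frame, prev_106, A, b, out_size=70):
    # Apply the correction p_j <- A* p_j + b* to all 106 key points.
    corrected = prev_106 @ A.T + b            # (106, 2)
    # Minimal square box enclosing the corrected points.
    u_min, v_min = corrected.min(axis=0)
    u_max, v_max = corrected.max(axis=0)
    side = int(round(max(u_max - u_min, v_max - v_min)))
    x0 = max(int(round((u_min + u_max) / 2 - side / 2)), 0)
    y0 = max(int(round((v_min + v_max) / 2 - side / 2)), 0)
    crop = frame[y0:y0 + side, x0:x0 + side]
    # Scale to 70x70 pixels before feeding network F.
    face = cv2.resize(crop, (out_size, out_size))
    return face, corrected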
For simplicity of description, the method embodiments are all expressed as a series of action combinations; however, the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application some steps may be performed in other orders or simultaneously.
Referring to Figure 5, a structural block diagram of an embodiment of a face key point detection apparatus in an embodiment of the present application is shown. The face key point detection apparatus may include the following modules:
The video image frame acquisition module 510 is configured to acquire image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information. The first face key point detection module 520 is configured to determine face frame position information according to the key frame information and, based on the face frame position information, perform face key point detection through a pre-trained first neural network to obtain initial key point position information. The second face key point detection module 530 is configured to perform face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video, to obtain a face key point detection result of the video, where the face key point detection result includes the face key point position information corresponding to the key frame information and the face key point position information corresponding to the non-key frame information.
In implementation, the above face key point detection apparatus can be integrated in a device. The device may be composed of two or more physical entities, or of a single physical entity; for example, the device may be a personal computer (PC), a computer, a mobile phone, a tablet device, a personal digital assistant, a server, a messaging device, a game console, and the like.
An embodiment of the present application further provides a device, including a processor and a memory. The memory stores at least one instruction, and the instruction is executed by the processor so that the device performs the face key point detection method described in the above method embodiments.
Referring to Figure 6, a structural schematic diagram of a device in an example of the present application is shown. As shown in Figure 6, the device may include a processor 60, a memory 61, a display screen 62 with a touch function, an input device 63, an output device 64, and a communication device 65. The processor 60, memory 61, display screen 62, input device 63, output device 64, and communication device 65 of the device may be connected by a bus or in other ways; in Figure 6, connection by a bus is taken as an example.
The memory 61, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the face key point detection method described in any embodiment of the present application (for example, the video image frame acquisition module 510, the first face key point detection module 520, and the second face key point detection module 530 in the face key point detection apparatus).
The processor 60 runs the software programs, instructions, and modules stored in the memory 61 to execute the various functional applications and data processing of the device, that is, to implement the above face key point detection method.
In an embodiment, when the processor 60 executes one or more programs stored in the memory 61, the following operations are implemented: acquiring image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information; determining face frame position information according to the key frame information, and performing face key point detection through a pre-trained first neural network based on the face frame position information to obtain initial key point position information; and performing face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video, to obtain a face key point detection result of the video, where the face key point detection result includes the face key point position information corresponding to the key frame information and the face key point position information corresponding to the non-key frame information.
An embodiment of the present application further provides a computer-readable storage medium. When instructions in the storage medium are executed by a processor of a device, the device can perform the face key point detection method described in the above method embodiments. Exemplarily, the face key point detection method includes: acquiring image frame information of a video, where the image frame information of the video includes key frame information and non-key frame information; determining face frame position information according to the key frame information, and performing face key point detection through a pre-trained first neural network based on the face frame position information to obtain initial key point position information; and performing face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video, to obtain a face key point detection result of the video, where the face key point detection result includes the face key point position information corresponding to the key frame information and the face key point position information corresponding to the non-key frame information.
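Summarizing this flow as a hedged, high-level sketch (every name below, is_key_frame, detect_face_box, first_network, and second_network, is a hypothetical placeholder rather than an API from this disclosure):

def detect_video_key_points(frames, is_key_frame, detect_face_box,
                            first_network, second_network):
    # Key frames pass through the face detector and the first neural
    # network; every frame then passes through the second neural network,
    # with the latest key point positions carried forward to the next frame.
    results = []
    key_points = None
    for idx, frame in enumerate(frames):
        if is_key_frame(idx):
            face_box = detect_face_box(frame)            # face frame position
            key_points = first_network(frame, face_box)  # initial key points
        key_points = second_network(frame, key_points)
        results.append(key_points)
    return results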

Claims (11)

  1. A face key point detection method, comprising:
    acquiring image frame information of a video, wherein the image frame information of the video comprises key frame information and non-key frame information;
    determining face frame position information according to the key frame information, and performing face key point detection through a pre-trained first neural network based on the face frame position information to obtain initial key point position information; and
    performing face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain a face key point detection result of the video, wherein the face key point detection result comprises face key point position information corresponding to the key frame information and face key point position information corresponding to the non-key frame information.
  2. The method according to claim 1, wherein performing face key point detection through the pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the face key point detection result of the video comprises:
    generating a picture cropping frame according to the initial key point position information; and
    cropping the key frame information through the picture cropping frame to obtain key frame face picture information, and inputting the key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the key frame information.
  3. The method according to claim 2, before determining the face frame position information according to the key frame information, further comprising:
    selecting, from the image frame information of the video, key frame information and non-key frame information corresponding to the key frame information;
    wherein performing face key point detection through the pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain the face key point detection result of the video further comprises:
    cropping the non-key frame information corresponding to the key frame information through the picture cropping frame to obtain non-key frame picture information; and
    in a case where the non-key frame picture information comprises face picture information, generating non-key frame face picture information according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, and inputting the non-key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the non-key frame information.
  4. The method according to claim 3, after obtaining the non-key frame picture information and before, in the case where the non-key frame picture information comprises face picture information, generating the non-key frame face picture information according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, further comprising:
    inputting the non-key frame picture information into a face detection and tracking network to obtain output information of the face detection and tracking network, wherein the output information comprises face probability information; and
    determining, based on the face probability information, whether the non-key frame picture information comprises face picture information.
  5. The method according to claim 4, wherein the output information further comprises face frame relative position information and key point relative position information;
    after obtaining the non-key frame picture information and before, in the case where the non-key frame picture information comprises face picture information, generating the non-key frame face picture information according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information, the method further comprises:
    determining face key point information of the non-key frame information according to the face frame relative position information and the key point relative position information.
  6. The method according to claim 5, wherein generating the non-key frame face picture information according to the face key point position information corresponding to the key frame information and the non-key frame information corresponding to the key frame information comprises:
    correcting, based on the face key point information of the non-key frame information, the face key point position information corresponding to the key frame information to obtain key point correction information;
    determining key point tracking position information according to the key point correction information and the initial key point position information;
    generating a face picture cropping frame according to the key point tracking position information; and
    cropping at least one of the non-key frame information and the non-key frame picture information through the face picture cropping frame to obtain the non-key frame face picture information.
  7. The method according to any one of claims 1 to 6, wherein determining the face frame position information according to the key frame information comprises:
    inputting the key frame information into a face detector, wherein the face detector is used for detecting a face frame position; and
    determining output information of the face detector as the face frame position information.
  8. A face key point detection apparatus, comprising:
    a video image frame acquisition module, configured to acquire image frame information of a video, wherein the image frame information of the video comprises key frame information and non-key frame information;
    a first face key point detection module, configured to determine face frame position information according to the key frame information and, based on the face frame position information, perform face key point detection through a pre-trained first neural network to obtain initial key point position information; and
    a second face key point detection module, configured to perform face key point detection through a pre-trained second neural network based on the initial key point position information and the image frame information of the video to obtain a face key point detection result of the video, wherein the face key point detection result comprises face key point position information corresponding to the key frame information and face key point position information corresponding to the non-key frame information.
  9. The apparatus according to claim 8, wherein the second face key point detection module comprises:
    a picture cropping frame generation submodule, configured to generate a picture cropping frame according to the initial key point position information;
    a key frame cropping processing submodule, configured to crop the key frame information through the picture cropping frame to obtain key frame face picture information; and
    a key frame face key point detection submodule, configured to input the key frame face picture information into the second neural network for face key point detection to obtain the face key point position information corresponding to the key frame information.
  10. A device, comprising a processor and a memory;
    wherein the memory stores at least one instruction, and the at least one instruction is executed by the processor so that the device performs the face key point detection method according to any one of claims 1 to 7.
  11. A computer-readable storage medium, wherein, when instructions in the storage medium are executed by a processor of a device, the device performs the face key point detection method according to any one of claims 1 to 7.
PCT/CN2020/081262 2019-05-31 2020-03-26 Method, apparatus, and device for facial key point detection, and storage medium WO2020238374A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910473174.2A CN112016371B (en) 2019-05-31 2019-05-31 Face key point detection method, device, equipment and storage medium
CN201910473174.2 2019-05-31

Publications (1)

Publication Number Publication Date
WO2020238374A1 WO2020238374A1 (en)

Family

ID=73506983

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/081262 WO2020238374A1 (en) 2019-05-31 2020-03-26 Method, apparatus, and device for facial key point detection, and storage medium

Country Status (2)

Country Link
CN (1) CN112016371B (en)
WO (1) WO2020238374A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561840B (en) * 2020-12-02 2024-05-28 北京有竹居网络技术有限公司 Video clipping method and device, storage medium and electronic equipment
CN112488064B (en) * 2020-12-18 2023-12-22 平安科技(深圳)有限公司 Face tracking method, system, terminal and storage medium
CN112597973A (en) * 2021-01-29 2021-04-02 秒影工场(北京)科技有限公司 High-definition video face alignment method based on convolutional neural network
TWI831582B (en) * 2023-01-18 2024-02-01 瑞昱半導體股份有限公司 Detection system and detection method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2672424A1 (en) * 2012-06-08 2013-12-11 Realeyes OÜ Method and apparatus using adaptive face registration method with constrained local models and dynamic model switching
CN103824049A (en) * 2014-02-17 2014-05-28 北京旷视科技有限公司 Cascaded neural network-based face key point detection method
CN108875480A (en) * 2017-08-15 2018-11-23 北京旷视科技有限公司 A kind of method for tracing of face characteristic information, apparatus and system
CN109598234B (en) * 2018-12-04 2021-03-23 深圳美图创新科技有限公司 Key point detection method and device
CN109800635A (en) * 2018-12-11 2019-05-24 天津大学 A kind of limited local facial critical point detection and tracking based on optical flow method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063679A (en) * 2018-08-24 2018-12-21 广州多益网络股份有限公司 A kind of human face expression detection method, device, equipment, system and medium
CN109376684A (en) * 2018-11-13 2019-02-22 广州市百果园信息技术有限公司 A kind of face critical point detection method, apparatus, computer equipment and storage medium
CN109657583A (en) * 2018-12-10 2019-04-19 腾讯科技(深圳)有限公司 Face's critical point detection method, apparatus, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633084A (en) * 2020-12-07 2021-04-09 深圳云天励飞技术股份有限公司 Face frame determination method and device, terminal equipment and storage medium
CN112597842A (en) * 2020-12-15 2021-04-02 周美跃 Movement detection facial paralysis degree evaluation system based on artificial intelligence
CN112597842B (en) * 2020-12-15 2023-10-20 芜湖明瞳数字健康科技有限公司 Motion detection facial paralysis degree evaluation system based on artificial intelligence
CN113177526A (en) * 2021-05-27 2021-07-27 中国平安人寿保险股份有限公司 Image processing method, device and equipment based on face recognition and storage medium
CN113177526B (en) * 2021-05-27 2023-10-03 中国平安人寿保险股份有限公司 Image processing method, device, equipment and storage medium based on face recognition

Also Published As

Publication number Publication date
CN112016371A (en) 2020-12-01
CN112016371B (en) 2022-01-14

Similar Documents

Publication Publication Date Title
WO2020238374A1 (en) Method, apparatus, and device for facial key point detection, and storage medium
US11295413B2 (en) Neural networks for cropping images based on body key points
EP3576017A1 (en) Method, apparatus, and device for determining pose of object in image, and storage medium
CN112488064B (en) Face tracking method, system, terminal and storage medium
Zhao et al. Pwstablenet: Learning pixel-wise warping maps for video stabilization
US20210089753A1 (en) Age Recognition Method, Computer Storage Medium and Electronic Device
US10395094B2 (en) Method and apparatus for detecting glasses in a face image
WO2020056903A1 (en) Information generating method and device
US20240015340A1 (en) Live streaming picture processing method and apparatus based on video chat live streaming, and electronic device
JP2019117577A (en) Program, learning processing method, learning model, data structure, learning device and object recognition device
CN111353336B (en) Image processing method, device and equipment
JP5087037B2 (en) Image processing apparatus, method, and program
CN111667504B (en) Face tracking method, device and equipment
CN109767453A (en) Information processing unit, background image update method and non-transient computer readable storage medium
CN111723707A (en) Method and device for estimating fixation point based on visual saliency
CN111563490B (en) Face key point tracking method and device and electronic equipment
CN112954450A (en) Video processing method and device, electronic equipment and storage medium
JP2006146413A (en) Object tracking device
Chen et al. Sound to visual: Hierarchical cross-modal talking face video generation
CN112651322B (en) Cheek shielding detection method and device and electronic equipment
KR102188991B1 (en) Apparatus and method for converting of face image
JP2023512359A (en) Associated object detection method and apparatus
Oshiba et al. Face image generation of anime characters using an advanced first order motion model with facial landmarks
CN111260692A (en) Face tracking method, device, equipment and storage medium
WO2022153481A1 (en) Posture estimation apparatus, learning model generation apparatus, method, and computer-readable recordingmedium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20815484

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20815484

Country of ref document: EP

Kind code of ref document: A1