WO2022089360A1 - Face detection neural network and training method, face detection method, and storage medium - Google Patents

Face detection neural network and training method, face detection method, and storage medium

Info

Publication number
WO2022089360A1
WO2022089360A1 (PCT/CN2021/126065)
Authority
WO
WIPO (PCT)
Prior art keywords
network
face
key points
sample
dimensional
Prior art date
Application number
PCT/CN2021/126065
Other languages
French (fr)
Chinese (zh)
Inventor
芦爱余
Original Assignee
广州虎牙科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州虎牙科技有限公司 filed Critical 广州虎牙科技有限公司
Publication of WO2022089360A1 publication Critical patent/WO2022089360A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Definitions

  • the present disclosure relates to the field of computer technologies, and in particular, to a face detection neural network and a training method thereof, a face detection method, a storage medium and a terminal device.
  • In recent years, with the rapid development of deep learning technology, face detection technology has also made great progress.
  • However, face detection based on deep learning requires a large amount of training data containing the three-dimensional coordinates of key points, and the key points marked in each piece of training data also need to be dense enough.
  • The collection of 3D face data and the annotation of key points are far more difficult than conventional image collection, and consume considerable manpower, material and financial resources.
  • To overcome these problems, the present disclosure provides a face detection neural network and a training method thereof, a face detection method, a storage medium and a terminal device.
  • According to a first aspect of the embodiments of the present disclosure, a neural network for face detection is provided, comprising: an encoding network for acquiring a sample including a face image and passing the sample on as its output result, wherein the face image in the sample is marked with the three-dimensional or two-dimensional coordinates of at least one first key point; a first decoding network for extracting, according to the output result of the encoding network, the predicted coordinate value of each first key point on the face image in the sample; a second decoding network for obtaining, according to the output result of the encoding network, the face pose change parameter corresponding to the face image in the sample; and a prediction network that is trained with the output result of the encoding network, the predicted coordinate values obtained by the first decoding network and the face pose change parameter obtained by the second decoding network as inputs, and that outputs a face model containing the three-dimensional or two-dimensional coordinate values of at least two second key points on the face image in the sample, wherein the number of second key points in the output face model is greater than the number of first key points marked in the sample.
  • According to a second aspect, a method for training a neural network for face detection is provided, comprising: acquiring a sample including a face image, wherein the face image in the sample is marked with the three-dimensional or two-dimensional coordinates of at least one first key point; feeding the sample to the neural network to obtain the predicted coordinate value of each first key point and the face pose change parameter corresponding to the face image; and training the neural network based on the obtained predicted coordinate values and the face pose change parameter, so that the trained neural network outputs a face model containing the three-dimensional or two-dimensional coordinates of a first number of second key points, wherein the first number is greater than the number of first key points marked in the sample.
  • According to a third aspect, a method for face detection is provided, implemented by the encoding network and the trained prediction network of the first aspect: the encoding network performs key point annotation on an acquired unlabeled face image and outputs the face image, marked with at least one first key point, to the prediction network as its processing result; the prediction network outputs a face model containing the three-dimensional or two-dimensional coordinates of at least two second key points on the face image, wherein the number of second key points is greater than the number of first key points.
  • According to a fourth aspect, a computer storage medium is provided, in which computer program code is stored; when the computer program code runs on a processor, the processor is caused to execute the steps of the method of the second or third aspect of the embodiments of the present disclosure.
  • According to a fifth aspect, a terminal device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method of the second or third aspect.
  • In the embodiments of the present disclosure, face image samples marked with the three-dimensional coordinates of key points are used as the input of a neural network for face detection, and the network is trained by obtaining the predicted coordinate values of the samples' key points and the face pose change parameters.
  • With this neural network, a face image sample marked with only a small number of key-point three-dimensional coordinates can be input, and the three-dimensional coordinates of a large number of the sample's key points can be obtained, which alleviates the difficulty and the heavy manpower, material and financial cost of collecting 3D face data and annotating key points.
  • FIG. 1 is a system architecture diagram to which a face detection neural network, a training method thereof, and a face detection method can be applied according to an exemplary embodiment of the present disclosure.
  • FIG. 2 is a structural diagram of a neural network for face detection according to an exemplary embodiment of the present disclosure.
  • FIG. 3 is a flowchart of another method for obtaining predicted coordinate values of key points according to an exemplary embodiment of the present disclosure.
  • FIG. 4 is a flowchart of a method for obtaining face pose change parameters of key points according to an exemplary embodiment of the present disclosure.
  • FIG. 5 is a flowchart of a method of training a prediction network according to an exemplary embodiment of the present disclosure.
  • FIG. 6 is a flowchart of a training method of a neural network for face detection according to an exemplary embodiment of the present disclosure.
  • FIG. 7 is a flowchart of a method of applying a prediction network according to an exemplary embodiment of the present disclosure.
  • Although the terms first, second, third, etc. may be used in the present disclosure to describe various pieces of information, such information should not be limited by these terms; the terms are only used to distinguish information of the same type from each other.
  • For example, the first information may also be referred to as the second information, and similarly the second information may also be referred to as the first information, without departing from the scope of the present disclosure.
  • Depending on the context, the word "if" as used herein can be interpreted as "at the time of", "when" or "in response to determining".
  • Face detection refers to detecting the location of key points of the face given a face image.
  • the key points typically include points at the eyebrows, eyes, nose, mouth, face contours, and the like.
  • Face detection technology is the basic technology for application scenarios such as face dressing, beauty makeup, face special effects, and face AR (Augmented Reality).
  • Taking face special effects as an example: in some live-streaming video applications there is demand for 3D animation special effects, such as adding 3D rabbit ears or a pig mask to the faces in the video.
  • Such 3D animation special effects depend on accurate positioning of the face key points; that is, the accurate three-dimensional coordinates of the face key points must be obtainable.
  • In one approach, a given face image is input to a convolutional neural network, and the coordinates of the face key points are regressed directly by the convolutional neural network.
  • In another approach, a given face image is input to a convolutional neural network so as to regress a feature map corresponding to the face key points, and the positions representing the face key points are determined from the feature map.
  • As noted above, face detection technology based on deep learning needs a large amount of training data marked with the three-dimensional coordinates of key points to be fed into the network to be trained, and the key points marked in each piece of training data also need to be dense enough (that is, dense three-dimensional data is required).
  • However, 3D face data with key point annotations is scarce and costly, because the collection of 3D face data and the annotation of key points are far more difficult than conventional image collection and consume considerable manpower, material and financial resources.
  • In view of this, the present disclosure proposes a face detection neural network, a training method thereof, a face detection method, a storage medium and a terminal device.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture of a face detection neural network, a training method thereof, and a face detection method to which embodiments of the present disclosure can be applied.
  • the system architecture 1000 may include one or more of terminal devices 1001 , 1002 , and 1003 , as well as a network 1004 and a server 1005 .
  • the network 1004 is the medium used to provide the communication link between the terminal devices 1001 , 1002 , 1003 and the server 1005 .
  • the network 1004 may include various connection types, such as wired connections, wireless communication links, or fiber optic cables, among others.
  • the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • the server 1005 may be a server cluster composed of multiple servers, or the like.
  • the user can use the terminal devices 1001, 1002, 1003 to interact with the server 1005 through the network 1004 to receive or send messages and the like.
  • the terminal devices 1001, 1002, 1003 may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, portable computers, desktop computers, and the like.
  • the server 1005 may be a server that provides various services.
  • For example, the server 1005 can provide the training of the neural network for face detection of the present disclosure: the server 1005 obtains face images from the terminal devices 1001, 1002 and 1003, and these can be face images marked with the three-dimensional coordinates of a small number of key points (i.e., sparse 3D data).
  • the server trains the neural network for face detection based on the acquired face images. Based on the trained neural network, the server can output the three-dimensional coordinates of a large number of key points in the face image (ie, dense three-dimensional data).
  • The server 1005 can also execute the face detection method of the present disclosure: the server 1005 is pre-configured with all or part of the trained neural network for face detection, obtains face images without key point annotations from the terminal devices 1001, 1002 and 1003, and obtains the three-dimensional coordinates of the key points in the face images through all or part of the trained neural network for face detection.
  • the terminal devices 1001, 1002, and 1003 may be terminal devices that provide various services.
  • the terminal device may be a terminal device with an image acquisition unit, and therefore, the terminal device may acquire face images for use by the terminal device itself or to send to other devices.
  • In this way, the terminal device can obtain, from an input face image without key point annotations, the 3D coordinates of the key points in the face image.
  • the training method of the neural network for face detection disclosed in the embodiments of the present disclosure can also be executed by the terminal devices 1001 , 1002 , and 1003 .
  • All or part of the neural network trained by the terminal devices 1001 , 1002 and 1003 can directly obtain the three-dimensional coordinates of the key points of the face image from the input face image without labels.
  • Alternatively, all or part of the neural networks trained by the terminal devices 1001, 1002 and 1003 can be deployed on the server 1005 in advance, and the server 1005 can then obtain the three-dimensional coordinates of the face key points from input face images without labels.
  • FIG. 2 is a structural diagram of a neural network for face detection according to an exemplary embodiment of the present disclosure.
  • the neural network for face detection includes an encoding network 210 , a first decoding network 220 , a second decoding network 230 and a prediction network 240 .
  • The encoding network 210 is used to obtain face image samples and send the samples, as its output result, to the first decoding network 220, the second decoding network 230 and the prediction network 240, the samples including face images marked with the three-dimensional coordinates of at least one first key point. The first decoding network 220 extracts the predicted coordinate values of the first key points in the sample according to the output result of the encoding network 210. The second decoding network 230 obtains, according to the output result of the encoding network 210, the face pose change parameter corresponding to the predicted coordinate values of the first key points in the sample. The prediction network 240 is trained with the output result of the encoding network 210, the predicted coordinate values obtained by the first decoding network 220 and the face pose change parameter obtained by the second decoding network 230 as inputs, so that the trained prediction network 240 outputs a face model containing the three-dimensional coordinates of at least two second key points in the sample, the number of second key points output by the prediction network 240 being greater than the number of first key points marked in the sample.
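As an illustration of this four-module layout, the sketch below wires an encoder, two decoders and a prediction network together in PyTorch. This is a minimal sketch under stated assumptions, not the patent's implementation: the class name, layer sizes, key-point counts (68 sparse, 1220 dense) and the way the predictor consumes its three inputs are all hypothetical.

```python
import torch
import torch.nn as nn

class FaceDetectionNet(nn.Module):
    """Hypothetical sketch of the encoder / two-decoder / predictor layout."""

    def __init__(self, feat_dim=256, n_sparse=68, n_dense=1220, n_pose=80):
        super().__init__()
        self.n_sparse, self.n_dense = n_sparse, n_dense
        # Encoding network 210: shared feature from the input face image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # First decoding network 220: predicted (x, y, z) per first key point.
        self.decoder_kpts = nn.Linear(feat_dim, n_sparse * 3)
        # Second decoding network 230: face pose change parameters (shape + expression).
        self.decoder_pose = nn.Linear(feat_dim, n_pose)
        # Prediction network 240: dense second-key-point coordinates.
        self.predictor = nn.Linear(feat_dim + n_sparse * 3 + n_pose, n_dense * 3)

    def forward(self, image):
        feat = self.encoder(image)
        sparse = self.decoder_kpts(feat)    # predicted coordinate values
        pose = self.decoder_pose(feat)      # face pose change parameters
        dense = self.predictor(torch.cat([feat, sparse, pose], dim=1))
        return sparse.view(-1, self.n_sparse, 3), pose, dense.view(-1, self.n_dense, 3)
```

Note that n_dense exceeds n_sparse, matching the requirement that the second key points outnumber the first.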
  • the training data of the neural network for face detection described in the present disclosure may be a face image sample marked with three-dimensional coordinates of key points.
  • the face image samples marked with the three-dimensional coordinates of key points may come from a public data set, or may be face images collected by a user and marked with three-dimensional coordinates of key points.
  • the present disclosure does not limit the source of training data.
  • the marked key points in the face image sample may be points at the eyes, nose, mouth, face contour, or points at other positions on the face, which are not limited in this disclosure.
  • In some embodiments, the activation values (the mask in the loss functions below) used in the neural network for face detection for the three-dimensional coordinates of key points are positive numbers; that is, the activation values corresponding to the X-axis, Y-axis and Z-axis are all positive.
  • For example, if the three-dimensional coordinates of a key point are (x, y, z), the activation values for the key point may be (1, 1, 1), indicating that all three coordinates of the key point are to be learned by the neural network.
  • The activation values corresponding to the X-axis, Y-axis and Z-axis may also differ and may be other positive numbers; for example, for a key point with coordinates (x, y, z), the activation values may be (2, 2, 10).
  • Activation values greater than 0 on all three axes indicate that the neural network needs to learn the three-dimensional coordinates of the key point; an activation value on the Z-axis larger than those on the X-axis and Y-axis indicates that learning the Z-axis coordinate occupies a relatively large weight in the training of the neural network for face detection.
  • In some embodiments, the training data of the neural network for face detection may also be mixed data, i.e., face image samples marked with the three-dimensional coordinates of key points together with face image samples marked with only the two-dimensional coordinates of key points.
  • For samples marked only with two-dimensional coordinates, the Z-axis coordinate of a key point can be set to a negative number, for example -1, indicating that the Z-axis coordinate of the key point does not exist.
  • In this case, the activation values corresponding to the X-axis, Y-axis and Z-axis of the key points can be the same as described above; alternatively, the activation values for the X-axis and Y-axis can be as described above while the activation value for the Z-axis is set to 0, indicating that the Z-axis coordinate of the key point is not something the neural network for face detection needs to learn.
  • Although face image samples marked with only the two-dimensional coordinates of key points have no Z-axis coordinate values, their X-axis and Y-axis coordinate values still contain useful information. Therefore, when samples marked with three-dimensional coordinates are insufficient, training on mixed data lets the network learn more useful information about the face key points and still achieve a good training effect, as sketched in the example below.
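A minimal sketch of how such mixed labels and per-axis activation values could be encoded. The helper name and the z_weight default are assumptions of this sketch; only the sign conventions come from the text above (positive weights for coordinates to learn, Z = -1 and weight 0 for samples without Z annotations).

```python
import numpy as np

def make_label_and_mask(kpts, has_z, z_weight=1.0):
    """Build (x, y, z) labels and per-axis activation values for one sample.

    kpts: (N, 3) array for 3D-annotated samples, or (N, 2) for 2D-only ones.
    has_z: whether the sample carries Z-axis annotations.
    """
    n = kpts.shape[0]
    label = np.zeros((n, 3), dtype=np.float32)
    mask = np.ones((n, 3), dtype=np.float32)   # positive: coordinate is learned
    label[:, :2] = kpts[:, :2]
    if has_z:
        label[:, 2] = kpts[:, 2]
        mask[:, 2] = z_weight                  # e.g. >1 to weight Z more heavily
    else:
        label[:, 2] = -1.0                     # negative Z marks "no Z annotation"
        mask[:, 2] = 0.0                       # Z is excluded from learning
    return label, mask
```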
  • the encoding network 210 of the neural network for face detection is used to perform preliminary processing after obtaining face image samples to obtain samples marked with key points, and send the samples to the Other parts of the neural network for face detection except the encoding network 210.
  • the encoding network 210 in the neural network for face detection may be a lightweight neural network.
  • the encoding network 210 may be a network of the Mobilenet series designed by Google for deep learning applications on mobile terminals and embedded terminals, including MobilenetV1, MobilenetV2 and MobilenetV3.
  • the encoding network 210 can also be other lightweight neural networks, such as SqueezeNet, ShuffleNet, Xception and other networks.
  • Using a lightweight neural network as the encoding network 210 reduces the size of the neural network for face detection, improves its running speed, and makes it easier to port part or all of the face detection neural network to mobile terminals, effectively expanding its application scope and application scenarios.
  • extracting the predicted coordinate value of the first key point in the sample according to the output result of the encoding network 210 can be implemented by the first decoding network 220 in the neural network for face detection.
  • FIG. 3 is a schematic diagram of an embodiment of realizing extraction of predicted coordinate values of key points of a face image through the first decoding network 220 in a neural network for face detection.
  • As shown in FIG. 3, the encoding network 210 obtains samples with key point annotations after processing the face image samples.
  • the output of the encoding network 210 is sent to the first decoding network 220 .
  • the first decoding network 220 can extract the predicted coordinate value of the key point according to the preset mapping relationship between the UV coordinate map and the three-dimensional coordinate of the key point.
  • UV here refers to U,V texture-map coordinates: U is the horizontal coordinate and V is the vertical coordinate.
  • the UV coordinate map is composed of pixel values of three RGB channels, and each pixel channel represents the coordinates of the X-axis, Y-axis and Z-axis of the three-dimensional face key points respectively.
  • Each point on the UV coordinate map corresponds to a point on the surface of the three-dimensional face model; that is, each point on the three-dimensional face model has a unique counterpart on the UV coordinate map.
  • the coordinates of the key points marked in the face image sample can be restored from the UV coordinate map.
  • For samples marked with three-dimensional coordinates, the predicted coordinate values of the key points are valid three-dimensional coordinates.
  • When the face image samples passed from the encoding network 210 to the first decoding network 220 include both images marked with three-dimensional key-point coordinates and images marked with two-dimensional key-point coordinates, the predicted coordinate values corresponding to the samples marked with three-dimensional coordinates are valid three-dimensional coordinates, while those corresponding to the samples marked with two-dimensional coordinates are pseudo three-dimensional coordinates lacking a Z-axis coordinate value. The lookup sketched below shows how coordinates can be read back from such a map.
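Since each key point owns a fixed, unique position on the UV coordinate map, recovering its predicted coordinates amounts to an indexed lookup. The sketch below assumes the map is decoded as an H x W x 3 array whose channels hold X, Y and Z; the uv_index table mapping key points to map pixels, and the example pixel positions, are hypothetical.

```python
import numpy as np

def keypoints_from_uv_map(uv_map, uv_index):
    """Recover 3D key-point coordinates from a UV position map.

    uv_map:   (H, W, 3) array; channels hold the X, Y, Z coordinates.
    uv_index: (N, 2) integer array of (row, col) map positions, one per key point.
    """
    rows, cols = uv_index[:, 0], uv_index[:, 1]
    return uv_map[rows, cols, :]               # (N, 3) predicted coordinate values

# Usage: with a 256x256 map and a fixed index table, the lookup is O(N).
uv_map = np.random.rand(256, 256, 3).astype(np.float32)    # stand-in for a decoded map
uv_index = np.array([[120, 100], [120, 156], [150, 128]])  # hypothetical eye/nose pixels
print(keypoints_from_uv_map(uv_map, uv_index).shape)        # (3, 3)
```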
  • the encoding network 210 and the first decoding network 220 in the above embodiments may be pre-trained.
  • The first decoding network 220 may be a standard neural network (SNN), a convolutional neural network (CNN), a recurrent neural network (RNN) or another type of network.
  • the pre-training of the encoding network 210 and the first decoding network 220 can be implemented by any technique well known to those skilled in the art, so that the first decoding network 220 can obtain the predicted coordinates of the key points according to the face image samples marked with the coordinates of the key points. value, which is not repeated here.
  • the parameters of the encoding network 210 and the first decoding network 220 may be determined through training.
  • the encoding network 210 and the first decoding network 220 as described above, may be networks in various forms, which will not be repeated here.
  • an exemplary embodiment in which the parameters of the encoding network 210 and the first decoding network 220 are determined through training is given.
  • the encoding network 210 obtains face image samples marked with key points and uses the samples as output results, and the first decoding network 220 extracts predicted coordinate values of the marked key points in the samples according to the output results of the encoding network 210 .
  • The parameters of the encoding network 210 and the first decoding network 220 are jointly trained based on the first loss function until the first loss function satisfies the preset training conditions; the network parameters at that point are the trained parameters of the encoding network 210 and the first decoding network 220.
  • the preset training condition may be that the value of the first loss function is lower than the threshold, or the first loss function converges, or other training conditions, which are not limited in the present disclosure.
  • the first loss function may be a loss function based on the minimum absolute value deviation, may also be a loss function based on the minimum mean square value deviation, or may be other forms of loss functions, which are not limited in the present disclosure.
  • the first loss function may be the deviation between the predicted coordinate value of the key point and the labeled coordinate value of the key point, or may be other measures representing the accuracy of the predicted coordinate value of the key point.
  • An exemplary first loss function may be:

    Loss_1 = | predict − label |

  • where Loss_1 is the first loss function, predict is the predicted coordinate value of the key point output by the first decoding network 220, and label is the labeled coordinate value of the key point in the face image sample input to the encoding network 210.
  • The above first loss function is only an exemplary embodiment; those skilled in the art can also design other first loss functions to train the encoding network 210 and the first decoding network 220, so that the predicted coordinate values of the key points output by the first decoding network 220 are more accurate.
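As a sketch of the minimum-absolute-deviation form reconstructed above, the first loss can be written as a masked L1 term. Carrying the per-axis activation values through (so that mixed 2D/3D batches drop the missing Z axis) is an assumption of this sketch, since the text defines the mask explicitly only for the second loss function.

```python
import torch

def loss1(predict, label, mask=None):
    """L1 deviation between predicted and labeled key-point coordinates.

    predict, label: (B, N, 3) tensors; mask, if given, holds per-axis
    activation values so that axes with weight 0 drop out of the loss.
    """
    diff = predict - label
    if mask is not None:
        diff = diff * mask
    return diff.abs().mean()
```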
  • obtaining the face pose change parameters corresponding to the key points according to the output result of the encoding network 210 can be realized by the second decoding network 230 in the neural network for face detection.
  • the face pose change parameter is used to characterize the change of the face pose relative to the standard face model.
  • The face pose change parameters may include shape parameters and expression parameters, and may also include texture parameters and the like. Since human faces have many commonalities (for example, a fixed number of eyes, mouths, noses and ears, with the relative positions of the parts unchanged and each part in a definite topological relationship), a parametrically represented standard 3D face model can be established. The face pose change parameters are then obtained from the change of the predicted coordinate values of the key points relative to the standard three-dimensional face model.
  • the standard three-dimensional face model of the parametric representation may be a three-dimensional face model of parametric representation obtained based on a 3DMM (3D Morphable Models, three-dimensional deformable face model) method.
  • S_model = S_mean + Σ_i α_i · S_i   (2);  T_model = T_mean + Σ_i β_i · T_i   (3)
  • where S_mean and T_mean represent the average shape and average texture of the face, and the discriminative characteristics of each individual face are reflected in the linear combinations of the sets of orthogonal bases S_i and T_i on the right-hand side; the bases are the eigenvectors of the covariance matrix, sorted in descending order of eigenvalue.
  • By collecting the head scans of a plurality of 3D human faces, the principal components representing the shape and texture information of the face, i.e., the eigenvectors S_i and T_i, can be obtained using the Principal Components Analysis (PCA) method; different coefficients α_i and β_i then characterize 3D faces with different shapes and textures.
  • M = BFM_mean + shape · shape_std + exp · exp_std   (4)
  • where BFM_mean is the average face obtained from the BFM database, shape_std and exp_std are the eigenvector bases of face shape and facial expression, and shape and exp are the shape parameters and expression parameters, respectively.
  • the BFM database is a technology well known to those skilled in the art, and will not be repeated here.
  • In formula (4), by determining the shape parameter shape and the expression parameter exp, the standard three-dimensional face model can be transformed to obtain a three-dimensional face model with pose (referring to changes in shape and expression).
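Formula (4) is a linear morphable-model evaluation, sketched below under the assumption that the BFM mean and the shape/expression bases are available as flat arrays. The helper name and the array sizes are illustrative only; the real BFM bases have their own dimensions.

```python
import numpy as np

def posed_face_model(bfm_mean, shape_basis, exp_basis, shape, exp):
    """Evaluate formula (4): M = BFM_mean + shape_basis @ shape + exp_basis @ exp.

    bfm_mean:    (3V,) mean face vertices, flattened (x, y, z per vertex).
    shape_basis: (3V, Ks) shape eigenvectors (shape_std).
    exp_basis:   (3V, Ke) expression eigenvectors (exp_std).
    shape, exp:  (Ks,), (Ke,) coefficients from the second decoding network.
    """
    return bfm_mean + shape_basis @ shape + exp_basis @ exp

# Illustrative sizes only; the actual BFM database uses its own dimensions.
V, Ks, Ke = 1000, 40, 10
M = posed_face_model(np.zeros(3 * V),
                     np.random.rand(3 * V, Ks) * 0.01,
                     np.random.rand(3 * V, Ke) * 0.01,
                     np.random.rand(Ks), np.random.rand(Ke))
print(M.reshape(V, 3).shape)   # (1000, 3) posed vertices
```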
  • the parametrically represented standard three-dimensional face model may also be a parametrically represented three-dimensional face model established by other methods.
  • For example, a parametrically represented 3D face model may be established, using statistical methods, from actually collected 3D face data or from 3D face data obtained by other means.
  • the face pose change parameters can be obtained through a pre-trained second decoding network 230.
  • The encoding network 210 sends the samples with key point annotations, as its output result, to the second decoding network 230, and the second decoding network 230 obtains the face pose change parameter based on the output result of the encoding network 210.
  • The second decoding network 230 may be a standard neural network (SNN), a convolutional neural network (CNN), a recurrent neural network (RNN) or the like.
  • the pre-training of the second decoding network 230 can be implemented by using techniques well known to those skilled in the art, and details are not described here.
  • the second decoding network 230 may be determined through training.
  • The training process of the second decoding network 230 includes: training the second decoding network 230 based on the parameters of the encoding network 210 and the second loss function, until the second loss function satisfies the preset training conditions, thereby determining the parameters of the second decoding network 230.
  • the parameters of the encoding network 210 may be determined by the training method described above, or may be trained by other methods, which are not limited in the present disclosure.
  • The preset training condition may be that the value of the second loss function is lower than a threshold, or that the second loss function converges, or other training conditions, which are not limited in the present disclosure.
  • the form of the second loss function may be a loss function based on the minimum absolute value deviation, a loss function based on the minimum mean square value deviation, or a loss function in other forms, which is not limited in the present disclosure .
  • the second loss function may be the deviation between the predicted coordinate value of the key point and the first coordinate value of the key point, or may be another measure representing the accuracy of the face change parameter output by the second decoding network.
  • An exemplary second loss function can be:

    Loss_2 = | mask · ( predict − { (BFM_mean + shape · shape_std + exp · exp_std) · affine } ) |   (5)

  • where Loss_2 is the second loss function, predict is the predicted coordinate value of the key point output by the first decoding network 220, mask is the activation value corresponding to the three-dimensional coordinates of the key point, and affine represents the transformation matrix of the standard three-dimensional face model relative to the predicted coordinate values.
  • FIG. 4 is a schematic diagram of realizing the acquisition of face pose change parameters by training the second decoding network 230.
  • FIG. 4 schematically shows the influence of the second loss function on training, which is explained in combination with the expression of the second loss function in formula (5).
  • The bracketed { } expression in formula (5) represents the first coordinates of the key points of the face image sample, which can be obtained as follows: from the predicted coordinate values of the key points obtained by the first decoding network 220 and the standard three-dimensional face model, the transformation matrix of the standard three-dimensional face model relative to the predicted coordinate values is calculated, the transformation matrix representing the rotation, translation, scaling and the like required to transform the standard three-dimensional face model into the three-dimensional face model corresponding to the predicted coordinate values; the standard three-dimensional face model is then transformed according to the transformation matrix to obtain the first coordinates of the key points.
  • During training, the parameters of the second decoding network 230 are adjusted based on the transformation matrix, so that the face pose change parameters output by the second decoding network 230 change, causing the standard three-dimensional face model to change in pose and thereby match the predicted coordinate values of the key points output by the first decoding network 220 as closely as possible.
  • Taking the parametrically represented standard three-dimensional face model in formula (4) as an example: as the parameters of the second decoding network 230 change, the shape parameter shape and the expression parameter exp in formula (4) change, and the pose of the parametrically represented standard 3D face model changes accordingly, with the parametric expression:

    M_2 = { (BFM_mean + shape · shape_std + exp · exp_std) · affine }
  • The second decoding network 230 is trained based on the second loss function; that is, the parameters of the second decoding network 230 are continuously adjusted so that the face pose change parameters it outputs transform the standard three-dimensional face model, until the face model transformed from the standard three-dimensional face model is aligned with the predicted coordinate values of the key points output by the first decoding network 220 (that is, the deviation is small enough), yielding a three-dimensional face model whose pose is the same as or similar to that of the predicted key-point coordinates; the second decoding network 230 then completes its training. The result output by the trained second decoding network 230 is thus the face pose change parameter that characterizes the expression and/or shape change, relative to the standard face model, of the face image corresponding to the predicted coordinate values of the key points.
  • Through the trained second decoding network 230, a three-dimensional face model corresponding to the predicted coordinate values of the key points of the input sample is in essence obtained. Based on that three-dimensional face model, the three-dimensional coordinates of any point on the model can be obtained, and hence the three-dimensional coordinates of a large number of the sample's key points.
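The transformation matrix (affine) used above is a similarity transform (rotation, translation, scale) aligning the standard model's key points with the predicted coordinates. One standard way to estimate such a transform is the Umeyama/Procrustes solution sketched below; the patent does not prescribe the estimation method, so this is an illustrative choice.

```python
import numpy as np

def similarity_transform(src, dst):
    """Estimate scale s, rotation R, translation t with dst ≈ s * src @ R.T + t.

    src: (N, 3) standard-model key points; dst: (N, 3) predicted coordinates.
    Classic Umeyama solution via SVD of the cross-covariance matrix.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))       # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / src_c.var(0).sum()
    t = mu_d - s * (R @ mu_s)
    return s, R, t
```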
  • The parameters of the prediction network 240 are determined through training based on the determined parameters of the encoding network 210 and the third loss function: the parameters of the prediction network 240 are trained against the third loss function until the third loss function satisfies the preset training conditions, at which point the network parameters are the trained parameters of the prediction network 240, i.e., the prediction network 240 completes its training.
  • the training of the prediction network 240 is performed based on the third loss function, and the preset training condition may be that the value of the third loss function is lower than the threshold, or the third loss function converges, or other training conditions, which are not limited in the present disclosure.
  • the form of the third loss function may be a loss function based on minimum absolute deviation, a loss function based on minimum mean square deviation, or a loss function in other forms, which is not limited in the present disclosure .
  • The content of the third loss function may be the deviation between the three-dimensional coordinates of the key points (i.e., the second key points) output by the prediction network 240 and the first coordinates, or may be another measure representing the accuracy of the three-dimensional coordinates of the key points output by the prediction network.
  • An exemplary third loss function can be:

    Loss_3 = | mask · ( predict_pts − { (BFM_mean + shape · shape_std + exp · exp_std) · affine } ) |

  • where Loss_3 is the third loss function, predict_pts is the three-dimensional coordinates of the key points output by the prediction network, and the meanings of the remaining parameters are as described above; the bracketed { } expression, as in formula (5), represents the first coordinates of the key points and is obtained in the same way as described above.
  • After training, the following can be achieved: an unmarked face image is input to the encoding network 210, the encoding network 210 obtains a face image marked with a small number of key points, and the face image with the key point annotations is sent to the prediction network 240 as the processing result, so that the prediction network 240 can output the three-dimensional coordinates of a large number of key points in the face image.
  • Regularization terms can also be added to the first, second and third loss functions to improve the overall performance of the trained model.
  • The neural network for face detection includes an encoding network, a first decoding network, a second decoding network and a prediction network; some of these networks may be trained networks while others may be untrained networks.
  • the encoding network, the first decoding network, the second decoding network and the prediction network included in the neural network for face detection may all be untrained networks.
  • In this case, the following method can be used: fix the initial parameters of the second decoding network and the prediction network, and jointly train the parameters of the encoding network and the first decoding network based on the first loss function; fix the trained parameters of the encoding network, and train the parameters of the second decoding network based on the second loss function; fix the trained parameters of the encoding network, and train the parameters of the prediction network based on the third loss function, as sketched in the example below.
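The staged schedule above (freeze some modules, train the rest against each loss in turn) can be expressed by toggling requires_grad. A minimal sketch, reusing the hypothetical FaceDetectionNet from the earlier sketch; the stage driver, optimizer choice and learning rate are assumptions, and loss_fn stands in for the first, second or third loss of the corresponding stage.

```python
import itertools
import torch

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def run_stage(net, trainable, loss_fn, loader, lr=1e-4):
    """One stage: freeze all four modules, unfreeze `trainable`, optimize `loss_fn`."""
    all_modules = (net.encoder, net.decoder_kpts, net.decoder_pose, net.predictor)
    for m in all_modules:
        set_trainable(m, m in trainable)
    params = itertools.chain(*(m.parameters() for m in trainable))
    opt = torch.optim.Adam(params, lr=lr)
    for images, targets in loader:
        opt.zero_grad()
        loss_fn(net, images, targets).backward()
        opt.step()

# Stage 1: encoder + first decoder vs. Loss_1; stage 2: second decoder vs. Loss_2;
# stage 3: prediction network vs. Loss_3 (loss functions as reconstructed above).
```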
  • the face image samples marked with the three-dimensional coordinates of the key points are used as the input of the neural network for face detection, and the neural network is trained by obtaining the predicted coordinate values of the key points of the samples and the change parameters of the face posture.
  • Using the neural network for face detection, a three-dimensional face model corresponding to the predicted coordinate values of the key points of the input sample can be obtained; it then becomes possible to input a face image sample marked with only a small number of key-point three-dimensional coordinates and obtain the three-dimensional coordinates of a large number of the sample's key points.
  • In this way, the following can be achieved: an unlabeled face image is input to the encoding network, the encoding network obtains a face image marked with a small number of key points, and the face image marked with the small number of key points is sent to the prediction network as the processing result, so that the prediction network can output the three-dimensional coordinates of a large number of key points in the face image. The detection of the three-dimensional coordinates of the key points of unlabeled face images is thus effectively realized.
  • FIG. 6 is a flowchart of a training method of a neural network for face detection according to an exemplary embodiment of the present disclosure. As shown in FIG. 6 , the training method includes the following steps:
  • In step 602, a sample including a face image is obtained, wherein the face image in the sample is marked with the three-dimensional coordinates of at least one first key point.
  • In step 604, the sample is used as the input of the neural network, and the predicted coordinate values of the first key points and the face pose change parameters are obtained.
  • In step 606, the neural network is trained based on the obtained predicted coordinate values and the face pose change parameters, so that the output result of the trained neural network is a face model containing the three-dimensional coordinates of a first number of second key points of the sample, where the first number is greater than the number of first key points marked in the sample.
  • the above training method can be used in various neural networks that need to be trained based on the predicted coordinate values of the key points and the change parameters of the face pose to obtain the three-dimensional coordinates of the key points of the face image.
  • the present disclosure does not limit the structure of the network trained using the training method.
  • the method can also be used in the training of the aforementioned neural network for face detection.
  • In some embodiments, the encoding network, the first decoding network and the second decoding network may be pre-trained networks: the encoding network outputs the processing result of the input face image sample, i.e., the face image sample marked with key points, and sends it to the first decoding network, the second decoding network and the prediction network; the first decoding network is used to obtain the predicted coordinate values of the key points marked in the sample, and the second decoding network is used to obtain the face pose change parameter corresponding to the sample.
  • the encoding network, the first decoding network, and the second decoding network may be untrained networks, and the training method described above may be used to perform the encoding network, the first decoding network, the second decoding network, and the prediction network. to train.
  • the specific training method of the network has been described in detail above, and will not be repeated here.
  • Based on the training described above, an encoding network and a prediction network with determined parameters can be obtained.
  • With the trained networks, a face detection method can be realized: the encoding network processes acquired unlabeled face images and outputs face image samples marked with a small number of key points; for the face image samples input from the encoding network, the prediction network outputs the three-dimensional coordinates of a large number of face key points. Based on the acquired three-dimensional coordinates of the face key points, three-dimensional face reconstruction can be performed, as shown in FIG. 7.
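A sketch of the deployed pipeline, reusing the hypothetical FaceDetectionNet from above. In that toy layout the decoder outputs feed the predictor internally, so the full forward pass is run and only the dense output is kept; the function name is illustrative.

```python
import torch

@torch.no_grad()
def detect_dense_keypoints(net, image):
    """Unlabeled face image in, dense 3D key-point coordinates out.

    image: (3, H, W) float tensor. Returns an (n_dense, 3) tensor of
    three-dimensional coordinates, usable for 3D face reconstruction.
    """
    net.eval()
    _, _, dense = net(image.unsqueeze(0))   # encoder -> decoders -> predictor
    return dense.squeeze(0)
```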
  • The above face detection method can be deployed on a server to realize the detection of the key points of face images, or the trained encoding network and first decoding network can be installed on a mobile terminal, so as to realize face detection on the mobile terminal.
  • Using the trained model, an unlabeled face image can be input to the encoding network and the prediction network to obtain the three-dimensional coordinates of the face key points; based on these three-dimensional coordinates, 3D face reconstruction can further be realized.
  • the present disclosure also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the steps of the method described in any of the foregoing embodiments.
  • the present disclosure also provides a terminal device, including a memory, a processor, and a computer program stored in the memory and running on the processor, and when the program is executed by the processor, the method described in any of the foregoing embodiments is implemented. step.
  • the present disclosure may take the form of a computer program product embodied on one or more storage media having program code embodied therein, including but not limited to disk storage, CD-ROM, optical storage, and the like.
  • Computer-usable storage media includes permanent and non-permanent, removable and non-removable media, and storage of information can be accomplished by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a face detection neural network and a training method, a face detection method, and a storage medium. The neural network comprises: an encoding network (210), being configured to obtain a sample comprising a face image and take the sample as an output result of the encoding network; a first decoding network (220), being configured to extract prediction coordinate values of first key points on the face image in the sample according to the output result of the encoding network (210); a second decoding network (230), being configured to obtain, according to the output result of the encoding network (210), a human face posture change parameter corresponding to the face image in the sample; and a prediction network (240), being configured to train by taking the output result of the encoding network (210), the prediction coordinate values obtained by the first decoding network (220), and the face posture change parameter obtained by the second decoding network (230) as inputs, and output a face model containing three-dimensional or two-dimensional coordinate values of at least two second key points on the face image in the sample.

Description

人脸检测神经网络及训练方法、人脸检测方法、存储介质Face detection neural network and training method, face detection method, storage medium
相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS
本公开要求于2020年10月28日提交的、申请号为2020111738293、发明名称为“人脸检测神经网络及训练方法、人脸检测方法、存储介质”的中国专利申请的优先权,该申请以引用的方式并入本文中。This disclosure claims the priority of the Chinese patent application filed on October 28, 2020, with the application number of 2020111738293 and the invention titled "Face Detection Neural Network and Training Method, Face Detection Method, and Storage Medium", which is entitled to Incorporated herein by reference.
技术领域technical field
本公开涉及计算机技术领域,尤其涉及一种人脸检测神经网络及其训练方法、人脸检测方法、存储介质及终端设备。The present disclosure relates to the field of computer technologies, and in particular, to a face detection neural network and a training method thereof, a face detection method, a storage medium and a terminal device.
背景技术Background technique
近年来,随着深度学习技术的迅猛发展,人脸检测技术也取得了巨大的进步。但是基于深度学习的人脸检测技术需要大量的包含关键点三维坐标的训练数据,且同一训练数据中所标注的关键点也需要足够密集。但是,三维人脸数据的采集和关键点的标注相比于传统图像的采集来讲具有较大难度,且需要花费大量的人力、物力和财力。In recent years, with the rapid development of deep learning technology, face detection technology has also made great progress. However, face detection technology based on deep learning requires a large amount of training data containing the three-dimensional coordinates of key points, and the key points marked in the same training data also need to be dense enough. However, the collection of 3D face data and the labeling of key points are more difficult than traditional image collection, and require a lot of manpower, material and financial resources.
发明内容SUMMARY OF THE INVENTION
为克服相关技术存在的三维人脸数据的采集和关键点的标注相比于传统图像的采集来讲具有较大难度,且需要花费大量的人力、物力和财力的问题,本公开提供了一种人脸检测神经网络及其训练方法、人脸检测方法、存储介质及终端设备。In order to overcome the problems existing in the related art that the collection of 3D face data and the labeling of key points are more difficult than traditional image collection and require a lot of manpower, material resources and financial resources, the present disclosure provides a A face detection neural network and a training method thereof, a face detection method, a storage medium and a terminal device.
根据本公开实施例的第一方面,提供一种人脸检测的神经网络,所述神经网络包括:编码网络,用于获取包括人脸图像的样本,并将所述样本作为其输出结果,其中,所述样本中的所述人脸图像标注有至少一个第一关键点的三维或二维坐标;第一解码网络,用于根据所述编码网络的输出结果,提取出所述样本中所述人脸图像上的各所述第一关键点的预测坐标值;第二解码网络,用于根据所述编码网络的输出结果,获取所述样本中所述人脸图像对应的人脸姿态变化参数;以及预测网络,用于将所述编码网络的输出结果以及所述第一解码网络获取的预测坐标值和所述第二解码网络获得的人脸姿态变化参数作为输入进行训练,输出包含所述样本中所述人脸图像上的至少两个第二关键点的三维或二维坐标值的人脸模型,其中,所输出的所述人脸模型中所述第二关键点 的数量大于所述样本中标注的所述第一关键点的数量。According to a first aspect of the embodiments of the present disclosure, there is provided a neural network for face detection, the neural network comprising: an encoding network for acquiring a sample including a face image, and using the sample as an output result thereof, wherein , the face image in the sample is marked with three-dimensional or two-dimensional coordinates of at least one first key point; the first decoding network is used to extract the The predicted coordinate value of each of the first key points on the face image; the second decoding network is used to obtain the face pose change parameter corresponding to the face image in the sample according to the output result of the encoding network And prediction network, be used for the output result of described coding network and the predicted coordinate value that described first decoding network obtains and the facial posture change parameter that described second decoding network obtains as input and carry out training, output comprises described A face model with three-dimensional or two-dimensional coordinate values of at least two second key points on the face image in the sample, wherein the number of the second key points in the output face model is greater than the The number of the first keypoints annotated in the sample.
根据本公开实施例的第二方面,提供一种人脸检测的神经网络的训练方法,所述训练方法包括:获取包括人脸图像的样本,其中,所述样本中的所述人脸图像标注有至少一个第一关键点的三维或二维坐标;将所述样本作为所述神经网络的输入,获取各所述第一关键点的预测坐标值以及与所述人脸图像对应的人脸姿态变化参数;基于所获取的预测坐标值和所述人脸姿态变化参数训练所述神经网络,以使训练后的所述神经网络的输出结果为包含第一数量的第二关键点的三维或二维坐标的人脸模型,其中,所述第一数量大于样本中标注的所述第一关键点的数量。According to a second aspect of the embodiments of the present disclosure, there is provided a method for training a neural network for face detection, the training method comprising: acquiring a sample including a face image, wherein the face image in the sample is labeled There are three-dimensional or two-dimensional coordinates of at least one first key point; the sample is used as the input of the neural network to obtain the predicted coordinate value of each of the first key points and the face pose corresponding to the face image Change parameters; train the neural network based on the obtained predicted coordinate values and the face pose change parameters, so that the output result of the neural network after training is a three-dimensional or two-dimensional image containing a first number of second key points. A face model of dimensional coordinates, wherein the first number is greater than the number of the first key points marked in the sample.
根据本公开实施例的第三方面,提供一种人脸检测的方法,所述方法由本公开实施例第一方面的编码网络和训练获得的预测网络实现,所述方法包括:所述编码网络对所获取的未标注的人脸图像进行关键点标注的处理,并将标注有至少一个第一关键点的所述人脸图像作为处理结果输出给所述预测网络;所述预测网络输出包括所述人脸图像上至少两个第二关键点的三维或二维坐标的人脸模型,其中,所述第二关键点的数量大于所述第一关键点的数量。According to a third aspect of the embodiments of the present disclosure, there is provided a method for face detection, the method is implemented by the encoding network and the prediction network obtained by training in the first aspect of the embodiments of the present disclosure, and the method includes: the encoding network pairing The acquired unlabeled face image is processed for key point labeling, and the face image marked with at least one first key point is output to the prediction network as a processing result; the prediction network output includes the A face model with three-dimensional or two-dimensional coordinates of at least two second key points on a face image, wherein the number of the second key points is greater than the number of the first key points.
根据本公开实施例的第四方面,提供一种计算机存储介质,所述计算机存储介质中存储有计算机程序代码,当所述计算机程序代码在处理器上运行时,使得所述处理器执行本公开实施例的第二方面或本公开实施例的第三方面所述方法的步骤。According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer storage medium, where computer program codes are stored in the computer storage medium, and when the computer program codes are executed on a processor, the processor causes the processor to execute the present disclosure The steps of the method described in the second aspect of the embodiment or the third aspect of the embodiment of the present disclosure.
根据本公开实施例的第五方面,提供一种终端设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中,所述处理器执行所述程序时实现本公开实施例的第二方面或第三方面所述方法的步骤。According to a fifth aspect of the embodiments of the present disclosure, a terminal device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the present invention when executing the program. The steps of the method of the second or third aspect of the disclosed embodiments.
在本公开实施例中,将标注有关键点三维坐标的人脸图像样本作为人脸检测的神经网络的输入,通过获取样本的关键点预测坐标值和人脸姿态变化参数,训练所述神经网络。利用所述人脸检测的神经网络,能够实现向所述神经网络输入仅标注有少量关键点三维坐标的人脸图像样本,就可以获得该样本大量关键点的三维坐标。能够解决三维人脸数据的采集和关键点的标注难度大,且需要花费大量的人力、物力和财力的问题。In the embodiment of the present disclosure, a face image sample marked with three-dimensional coordinates of key points is used as the input of a neural network for face detection, and the neural network is trained by obtaining key points of the sample to predict coordinate values and face pose change parameters. . Using the neural network for face detection, it is possible to input a face image sample marked with only a small number of three-dimensional coordinates of key points into the neural network, and obtain three-dimensional coordinates of a large number of key points of the sample. It can solve the problem that the collection of 3D face data and the labeling of key points are difficult and require a lot of manpower, material resources and financial resources.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.
Description of the Drawings

The accompanying drawings herein illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.

FIG. 1 is a system architecture diagram to which a face detection neural network, its training method, and a face detection method can be applied, according to an exemplary embodiment of the present disclosure.

FIG. 2 is a structural diagram of a neural network for face detection according to an exemplary embodiment of the present disclosure.

FIG. 3 is a flowchart of another method for obtaining predicted coordinate values of key points according to an exemplary embodiment of the present disclosure.

FIG. 4 is a flowchart of a method for obtaining face pose change parameters of key points according to an exemplary embodiment of the present disclosure.

FIG. 5 is a flowchart of a method for training a prediction network according to an exemplary embodiment of the present disclosure.

FIG. 6 is a flowchart of a method for training a neural network for face detection according to an exemplary embodiment of the present disclosure.

FIG. 7 is a flowchart of a method for applying a prediction network according to an exemplary embodiment of the present disclosure.

Detailed Description
Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in this specification and the appended claims, the singular forms "a", "said" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when" or "in response to determining".
Face detection refers to detecting the positions of key points of a face in a given face image. The key points typically include points at the eyebrows, eyes, nose, mouth, facial contour, and the like. Face detection technology is the foundation of application scenarios such as face dress-up, beauty makeup, face special effects, and face AR (Augmented Reality). Taking face special effects as an example, some live video streaming applications require three-dimensional animation effects, such as adding three-dimensional rabbit ears or a pig mask to a face in the video. These three-dimensional animation effects must be built on the accurate positioning of face key points; that is, the accurate three-dimensional coordinates of the face key points need to be obtained.
In recent years, with the development of deep learning technology, a number of face detection techniques based on deep learning have emerged. For example, a given face image is input into a convolutional neural network, which regresses the coordinates of the face key points. As another example, a given face image is input into a convolutional neural network, which regresses feature maps corresponding to the face key points, and the positions representing the face key points are determined from the feature maps. However, deep-learning-based face detection techniques all require feeding the network to be trained with a large amount of training data annotated with the three-dimensional coordinates of key points, and the key points annotated in a single training sample also need to be sufficiently dense (that is, dense three-dimensional data is required). Yet three-dimensional face data with key-point annotations is scarce and precious, because compared with conventional image collection, collecting three-dimensional face data and annotating key points is considerably more difficult and consumes a large amount of manpower, material resources and financial resources.
To address the problem that collecting three-dimensional face data and annotating key points is considerably more difficult than conventional image collection and consumes a large amount of manpower, material resources and financial resources, the present disclosure proposes a face detection neural network and a training method thereof, a face detection method, a storage medium, and a terminal device.
The embodiments of the present disclosure are described in detail below.
FIG. 1 is a schematic diagram of an exemplary system architecture to which the face detection neural network, its training method, and the face detection method of the embodiments of the present disclosure can be applied.
As shown in FIG. 1, the system architecture 1000 may include one or more of terminal devices 1001, 1002 and 1003, as well as a network 1004 and a server 1005. The network 1004 is the medium that provides communication links between the terminal devices 1001, 1002, 1003 and the server 1005. The network 1004 may include various connection types, such as wired connections, wireless communication links, or fiber-optic cables.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs. For example, the server 1005 may be a server cluster composed of multiple servers.
Users may use the terminal devices 1001, 1002 and 1003 to interact with the server 1005 through the network 1004 to receive or send messages and the like. The terminal devices 1001, 1002 and 1003 may be various electronic devices with a display screen, including but not limited to smartphones, tablet computers, portable computers, desktop computers, and the like.
The server 1005 may be a server that provides various services. For example, the server 1005 may carry out the training of the neural network for face detection of the present disclosure: the server 1005 acquires face images from the terminal devices 1001, 1002 and 1003, where the face images may be face images annotated with the three-dimensional coordinates of a small number of key points (i.e., sparse three-dimensional data). The server trains the neural network for face detection based on the acquired face images. With the trained neural network, the server can output the three-dimensional coordinates of a large number of key points in a face image (i.e., dense three-dimensional data). As another example, the server 1005 may execute the face detection method of the present disclosure: all or part of the trained neural network for face detection is pre-installed on the server 1005; the server 1005 acquires a face image without key-point annotations from the terminal devices 1001, 1002 and 1003, and all or part of the trained neural network obtains the three-dimensional coordinates of the key points in the face image.
The terminal devices 1001, 1002 and 1003 may be terminal devices that provide various services. For example, a terminal device may be equipped with an image acquisition unit, so that it can capture face images for its own use or for sending to other devices. As another example, by pre-installing on the terminal devices 1001, 1002 and 1003 all or part of the neural network for face detection trained by the server 1005, a terminal device can obtain the three-dimensional coordinates of the key points of an input face image without key-point annotations.
However, it should be understood that for terminal devices 1001, 1002 and 1003 whose computing capability meets the training requirements, the method for training the neural network for face detection disclosed in the embodiments of the present disclosure may also be executed by the terminal devices 1001, 1002 and 1003. All or part of the neural network trained by the terminal devices 1001, 1002 and 1003 can directly obtain the three-dimensional coordinates of the key points of an input unannotated face image. Of course, all or part of the neural network trained by the terminal devices 1001, 1002 and 1003 may also be deployed on the server 1005 in advance, so that the server 1005 obtains the three-dimensional coordinates of the key points of an input unannotated face image.
FIG. 2 is a structural diagram of a neural network for face detection according to an exemplary embodiment of the present disclosure. As shown in FIG. 2, the neural network for face detection includes an encoding network 210, a first decoding network 220, a second decoding network 230 and a prediction network 240. The encoding network 210 is configured to acquire a face image sample and send the sample as its output result to the first decoding network 220, the second decoding network 230 and the prediction network 240, where the sample includes a face image annotated with the three-dimensional coordinates of at least one first key point. The first decoding network 220 extracts the predicted coordinate values of the first key points in the sample according to the output result of the encoding network 210. The second decoding network 230 is configured to obtain, according to the output result of the encoding network 210, the face pose change parameters for the predicted coordinate values of the first key points in the sample. The output result of the encoding network 210, together with the predicted coordinate values obtained by the first decoding network 220 and the face pose change parameters obtained by the second decoding network 230, is used as input to train the prediction network 240, so that the trained prediction network 240 outputs a face model containing the three-dimensional coordinates of at least two second key points in the sample, where the number of second key points output by the prediction network 240 is greater than the number of first key points whose three-dimensional coordinates are annotated in the sample.
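To make the data flow between the four sub-networks concrete, the following is a minimal PyTorch sketch of this layout. All module sizes and names (FaceKeypointNet, feat_dim, num_dense, and so on) are illustrative assumptions, not the configuration of the disclosed network.

    import torch
    import torch.nn as nn

    class FaceKeypointNet(nn.Module):
        # Sketch of the encoder / two-decoder / predictor layout of FIG. 2.
        def __init__(self, feat_dim=256, num_pose=62, uv_size=64, num_dense=1000):
            super().__init__()
            # Encoding network 210: a small convolutional backbone
            # (a stand-in for a lightweight network such as MobileNet).
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim),
            )
            # First decoding network 220: regresses a UV position map holding
            # the predicted X/Y/Z values of the annotated key points.
            self.decoder_coords = nn.Linear(feat_dim, 3 * uv_size * uv_size)
            # Second decoding network 230: regresses face pose change
            # parameters (e.g., shape and expression coefficients).
            self.decoder_pose = nn.Linear(feat_dim, num_pose)
            # Prediction network 240: outputs a dense set of key points.
            self.predictor = nn.Linear(feat_dim + num_pose, 3 * num_dense)

        def forward(self, image):
            feat = self.encoder(image)
            uv_map = self.decoder_coords(feat)   # sparse key-point coordinates
            pose = self.decoder_pose(feat)       # face pose change parameters
            dense = self.predictor(torch.cat([feat, pose], dim=1))
            return uv_map, pose, dense.view(-1, dense.shape[1] // 3, 3)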
The training data of the neural network for face detection described in the present disclosure may be face image samples annotated with the three-dimensional coordinates of key points. It should be understood that such samples may come from public datasets, or may be face images collected by users and annotated with the three-dimensional coordinates of key points; the present disclosure places no limitation on the source of the training data. The annotated key points in a face image sample may be points at the eyes, nose, mouth or facial contour, or points at other positions on the face, which is likewise not limited by the present disclosure.
In one embodiment, when the training data used are face image samples annotated with the three-dimensional coordinates of key points, the activation functions adopted in the neural network for face detection are positive numbers corresponding to the three-dimensional coordinates of the key points; that is, the activation functions corresponding to the X axis, Y axis and Z axis are all positive. For example, if the three-dimensional coordinates of a key point are x, y, z, the activation function corresponding to that key point may be 1, 1, 1, indicating that all three coordinate dimensions of the key point are to be learned by the neural network. Of course, those skilled in the art should understand that the activation functions corresponding to the X, Y and Z axes may differ and may be other positive numbers. For example, for a key point with three-dimensional coordinates x, y, z, the corresponding activation function may be 2, 2, 10. Here, the activation functions for all three axes being greater than 0 indicates that the network needs to learn all three coordinate dimensions of the key point, while the Z-axis activation function being larger than those of the X and Y axes indicates that learning the Z coordinate carries a larger weight during training; in other words, the training of the neural network for face detection places more emphasis on learning the Z-axis coordinates of the key points.
In one embodiment, to ensure a better training effect, so that the trained neural network for face detection generalizes better and predicts more accurately across different input face images, the training data of the neural network may also be mixed data, which includes face image samples annotated with the three-dimensional coordinates of key points and face image samples annotated with the two-dimensional coordinates of key points. For a face image annotated with the two-dimensional coordinates of key points, the Z-axis coordinate value of each key point can be set to a negative number, for example -1, indicating that the Z-axis coordinate of that key point does not exist. Correspondingly, when mixed data is used, for samples annotated with three-dimensional coordinates, the activation functions corresponding to the X, Y and Z axes of the key points may be the same as described above, which is not repeated here. For face images annotated with two-dimensional coordinates, the activation functions for the X and Y axes may likewise be as described above, while the activation function for the Z axis may be set to 0, indicating that the Z-axis coordinate of that key point is not to be learned by the neural network for face detection.
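A short sketch of how such per-axis activation values might be assembled for a mixed batch; the concrete weights and the -1 depth sentinel follow the examples above, while the helper name axis_mask is purely illustrative:

    import numpy as np

    def axis_mask(has_z, z_weight=1.0):
        # Per-axis learning weights ("activation functions") for one key point.
        # For 2-D-annotated samples the Z weight is 0, so the missing depth
        # does not contribute to training.
        if has_z:
            return np.array([1.0, 1.0, z_weight])   # learn X, Y and Z
        return np.array([1.0, 1.0, 0.0])            # Z is not learned

    # One 3-D-annotated point and one 2-D-annotated point (Z stored as -1).
    label_3d = np.array([101.0, 87.0, 12.0])
    label_2d = np.array([98.0, 90.0, -1.0])
    masks = np.stack([axis_mask(True, z_weight=10.0), axis_mask(False)])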
With mixed data used as the training data of the neural network for face detection, the face image samples annotated with the two-dimensional coordinates of key points lack Z-axis values but still carry X-axis and Y-axis values, and these two-dimensional coordinates contain a certain amount of useful information. Therefore, training with mixed data makes it possible, when face image samples annotated with the three-dimensional coordinates of key points are insufficient, to learn more useful information about face key points, thereby achieving a better training effect for the neural network for face detection.
The encoding network 210 of the neural network for face detection is configured to acquire face image samples and perform preliminary processing on them to obtain samples annotated with key points, and to send those samples, as the result of the preliminary processing, to the parts of the neural network for face detection other than the encoding network 210. In one embodiment, the encoding network 210 may be a lightweight neural network. For example, the encoding network 210 may be a network of the MobileNet series designed by Google for mobile and embedded deep learning applications, including MobileNetV1, MobileNetV2 and MobileNetV3. The encoding network 210 may also be another lightweight neural network, such as SqueezeNet, ShuffleNet or Xception.
Adopting a lightweight neural network as the encoding network 210 reduces the size of the neural network for face detection and increases its running speed, which facilitates porting part or all of the network to mobile terminals, thereby effectively expanding the application scope and application scenarios of the neural network for face detection.
In one embodiment, extracting the predicted coordinate values of the first key points in the sample according to the output result of the encoding network 210 may be implemented by the first decoding network 220 of the neural network for face detection.
FIG. 3 is a schematic diagram of an embodiment in which the first decoding network 220 of the neural network for face detection extracts the predicted coordinate values of face image key points. In FIG. 3, face image samples without key-point annotations are input into the encoding network 210; after processing them, the encoding network 210 obtains samples with key-point annotations and sends these samples to the first decoding network 220 as its output result. The first decoding network 220 can then extract the predicted coordinate values of the key points according to a preset mapping between the UV coordinate map and the three-dimensional coordinates of the key points. Here, UV is short for the U and V axes of a texture map, where U denotes the horizontal coordinate and V the vertical coordinate. The UV coordinate map is composed of the pixel values of the three RGB channels, each pixel channel representing the X-axis, Y-axis or Z-axis coordinate of a three-dimensional face key point. Through the two-dimensional UV coordinate map, every point on the map can be associated with a point on the surface of the three-dimensional face model; that is, every point on the three-dimensional face model has a unique corresponding point on the UV coordinate map. With a preset mapping between the UV coordinate map and the three-dimensional coordinates, the coordinates of the key points annotated in a face image sample can be recovered from the UV coordinate map.
When the face image samples passed from the encoding network 210 to the first decoding network 220 include only face images annotated with the three-dimensional coordinates of key points, the predicted coordinate values of the key points are valid three-dimensional coordinates. When the samples include both face images annotated with the three-dimensional coordinates of key points and face images annotated with the two-dimensional coordinates of key points, the predicted coordinate values corresponding to the three-dimensional-annotated samples are valid three-dimensional coordinates, while those corresponding to the two-dimensional-annotated samples are pseudo three-dimensional coordinates, which have no Z-axis values.
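A minimal sketch of reading key-point coordinates back out of such a UV position map, assuming the map stores X/Y/Z per pixel and that each key point has a fixed, preset (u, v) location (both assumptions for illustration):

    import numpy as np

    def keypoints_from_uv_map(uv_position_map, uv_coords):
        # uv_position_map: (H, W, 3) array; the three channels hold the X/Y/Z
        # coordinates of the face surface point mapped to each UV pixel.
        # uv_coords: (N, 2) integer (u, v) locations of the N key points,
        # i.e., the preset UV-to-3-D mapping mentioned above.
        u, v = uv_coords[:, 0], uv_coords[:, 1]
        return uv_position_map[v, u, :]   # (N, 3) predicted coordinate values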
The encoding network 210 and the first decoding network 220 in the above embodiments may be pre-trained. The first decoding network 220 may be a standard neural network (SNN), a convolutional neural network (CNN), a recursive neural network (RNN), or the like. The pre-training of the encoding network 210 and the first decoding network 220 may be implemented with any technique well known to those skilled in the art, so that the first decoding network 220 can obtain the predicted coordinate values of the key points from face image samples annotated with key-point coordinates; details are omitted here.
In one embodiment, the parameters of the encoding network 210 and the first decoding network 220 may be determined through training. As described above, the encoding network 210 and the first decoding network 220 may take various forms, which are not repeated here. An exemplary embodiment in which their parameters are determined through training is given below.
The encoding network 210 acquires face image samples annotated with key points and takes the samples as its output result, and the first decoding network 220 extracts the predicted coordinate values of the annotated key points in the samples according to that output. The parameters of the encoding network 210 and the first decoding network 220 are jointly trained based on a first loss function until the first loss function satisfies a preset training condition; the network parameters at that point are the parameters of the trained encoding network 210 and first decoding network 220. The preset training condition may be that the value of the first loss function falls below a threshold, that the first loss function converges, or another training condition, which is not limited by the present disclosure.
In one embodiment, the first loss function may be a loss function based on least absolute deviation, a loss function based on least mean-square deviation, or a loss function of another form, which is not limited by the present disclosure. The first loss function may be the deviation between the predicted coordinate values of the key points and their annotated coordinate values, or another measure characterizing the accuracy of the predicted coordinate values. An exemplary first loss function may be:
Loss_1 = ∑|predict − label|  (1);
where Loss_1 is the first loss function, predict is the predicted coordinate value of a key point output by the first decoding network 220, label is the annotated coordinate value of the key point in the face image sample input to the encoding network 210, and the summation runs over the annotated key-point coordinates. It should be understood that the above first loss function is only an exemplary embodiment; those skilled in the art may also design other first loss functions to train the encoding network 210 and the first decoding network 220, so that the predicted coordinate values output by the first decoding network 220 become more accurate.
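As a concrete reading of formula (1), the following sketch implements the joint update of the encoder and first decoder. The mean reduction, learning rate, dummy batch, and key-point count are assumptions, and FaceKeypointNet is the sketch given earlier:

    import torch

    def loss1(predict, label):
        # Formula (1) in least-absolute-deviation form; averaging rather than
        # summing is an implementation choice, not fixed by the text.
        return (predict - label).abs().mean()

    model = FaceKeypointNet()                    # sketch defined earlier
    images = torch.randn(2, 3, 128, 128)         # dummy annotated batch
    labels = torch.randn(2, 68, 3)               # 68 annotated key points
    opt = torch.optim.Adam(
        list(model.encoder.parameters()) +
        list(model.decoder_coords.parameters()), lr=1e-4)

    uv_map, _, _ = model(images)
    sparse = uv_map.view(images.size(0), -1, 3)[:, :labels.size(1)]
    loss = loss1(sparse, labels)
    opt.zero_grad(); loss.backward(); opt.step()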
In one embodiment, obtaining the face pose change parameters corresponding to the key points according to the output result of the encoding network 210 may be implemented by the second decoding network 230 of the neural network for face detection.
The face pose change parameters characterize the change of the face pose relative to a standard face model. They may include shape parameters and expression parameters, and may further include texture parameters and the like. Since human faces share many commonalities, for example, a fixed number of eyes, mouths, noses and ears whose relative positions do not change and whose parts stand in a certain topological relationship, a parameterized standard three-dimensional face model can be established. In this way, the face pose change parameters are obtained from the change of the predicted coordinate values of the key points relative to the standard three-dimensional face model.
In one embodiment, the parameterized standard three-dimensional face model may be one obtained with the 3DMM (3D Morphable Models) method. This method assumes that the three-dimensional faces have been densely aligned; that is, all three-dimensional faces can be represented with the same point cloud or mesh data, and points with the same index carry the same semantics. Under dense alignment, every textured three-dimensional face can be represented as:
S_model = S̄ + ∑ α_i S_i  (2);

T_model = T̄ + ∑ β_i T_i  (3);
where S̄ and T̄ denote the average face shape and the average texture, and the discriminative characteristics of each face are captured by the linear combination of the set of orthogonal bases S_i or T_i to the right of the plus sign, S_i and T_i being the eigenvectors of the covariance matrix arranged in descending order of eigenvalue. By collecting the heads of multiple three-dimensional faces and applying Principal Components Analysis (PCA), the principal components representing face shape and texture information, i.e., the eigenvectors S_i and T_i, can be obtained. Different coefficients α_i and β_i characterize three-dimensional faces of different shapes and textures. By optimizing the above parameterized face model with the data in the BFM database, another parameterized standard three-dimensional face model can be obtained:
M = BFM_mean + shape * shape_std + exp * exp_std  (4);
where BFM_mean is the average face obtained from the BFM database, shape_std and exp_std are the eigenvector bases of face shape and facial expression, and shape and exp are the shape parameters and expression parameters, respectively. The BFM database is a technique well known to those skilled in the art and is not described further here. In formula (4), by determining the shape parameter shape and the expression parameter exp, the standard three-dimensional face model can be transformed into a three-dimensional face model with a pose (i.e., with shape and expression changes).
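Read as a linear model over PCA bases, formula (4) can be sketched as follows; the vertex count and basis dimensions are placeholders:

    import numpy as np

    def morphable_face(bfm_mean, shape_std, exp_std, shape, exp):
        # Formula (4): M = BFM_mean + shape * shape_std + exp * exp_std.
        # bfm_mean: (3N,) stacked XYZ of the mean face's N vertices.
        # shape_std: (3N, Ks) identity basis; exp_std: (3N, Ke) expression basis.
        # shape: (Ks,) and exp: (Ke,) coefficient vectors.
        return bfm_mean + shape_std @ shape + exp_std @ exp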
Of course, those skilled in the art should understand that the parameterized standard three-dimensional face model may also be a parameterized three-dimensional face model established with other methods, for example, one built with statistical methods from actually collected three-dimensional face data or from three-dimensional face data obtained in other ways.
In one embodiment, the face pose change parameters may be obtained through a pre-trained second decoding network 230. Specifically, the encoding network 210 sends the samples annotated with key points to the second decoding network 230 as its output result, and the second decoding network 230 obtains the face pose change parameters based on that output. The second decoding network 230 may be a standard neural network (SNN), a convolutional neural network (CNN), a recursive neural network (RNN), or the like. Its pre-training may be implemented with techniques well known to those skilled in the art and is not detailed here.
In one embodiment, the second decoding network 230 may be determined through training. The training process includes: training the second decoding network 230 based on the parameters of the encoding network 210 and a second loss function, until the second loss function satisfies a preset training condition, thereby determining the parameters of the second decoding network 230. The parameters of the encoding network 210 may be determined with the training method described above or trained in another way, which is not limited by the present disclosure. The preset training condition may be that the value of the second loss function falls below a threshold, that the second loss function converges, or another training condition, which is likewise not limited by the present disclosure.
In one embodiment, the second loss function may take the form of a loss function based on least absolute deviation, a loss function based on least mean-square deviation, or another form, which is not limited by the present disclosure. The second loss function may be the deviation between the predicted coordinate values of the key points and the first coordinate values of the key points, or another measure characterizing the accuracy of the face pose change parameters output by the second decoding network. An exemplary second loss function may be:
Loss_2 = ∑|(predict − {(BFM_mean + shape * shape_std + exp * exp_std) * affine}) * mask|  (5);
where Loss_2 is the second loss function, predict is the predicted coordinate value of a key point output by the first decoding network 220, mask is the activation function corresponding to the three-dimensional coordinates of the key point, affine denotes the transformation matrix of the standard three-dimensional face model relative to the predicted coordinate values, and the remaining parameters have the meanings given above.
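A sketch of formula (5) under the reconstruction above, reusing the morphable_face sketch; the homogeneous 4x4 form of affine, the key-point index list kpt_idx, and the mean reduction are assumptions:

    import numpy as np

    def loss2(predict, bfm_mean, shape_std, exp_std, shape, exp, affine,
              mask, kpt_idx):
        # {...} in formula (5): the standard model, posed by shape/exp and
        # mapped by the affine transform, evaluated at the key-point vertices.
        verts = morphable_face(bfm_mean, shape_std, exp_std, shape, exp)
        pts = verts.reshape(-1, 3)[kpt_idx]              # (N, 3)
        homog = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
        first_coords = (homog @ affine.T)[:, :3]
        return np.mean(np.abs((predict - first_coords) * mask))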
FIG. 4 is a schematic diagram of obtaining the face pose change parameters by training the second decoding network 230; it schematically shows the influence of the second loss function on training, explained here in conjunction with the expression of the second loss function, formula (5). The expression within the braces {} in formula (5) represents the first coordinates of the key points of the face image sample, which can be obtained as follows: from the predicted coordinate values of the key points obtained by the first decoding network 220 and the standard three-dimensional face model, the transformation matrix of the standard model relative to the predicted coordinate values is computed; this transformation matrix characterizes the rotation, translation, scaling and the like required to transform the standard three-dimensional face model into the three-dimensional face model corresponding to the predicted coordinate values. The standard three-dimensional face model is then transformed according to this matrix to obtain the first coordinates of the key points. Transforming the standard model according to the transformation matrix to obtain the first coordinates is achieved by adjusting the parameters of the second decoding network 230 based on the transformation matrix, so that the face pose change parameters output by the second decoding network 230 change, causing the standard three-dimensional face model to change its pose and thereby match, as closely as possible, the predicted coordinate values of the key points output by the first decoding network 220. Taking the parameterized standard three-dimensional face model of formula (4) as an example, the parameter changes of the second decoding network 230 change the shape parameter shape and the expression parameter exp in formula (4), which in turn changes the pose of the parameterized standard three-dimensional face model; the parameterized expression is:
M_2 = {(BFM_mean + shape * shape_std + exp * exp_std) * affine},

which is exactly the content within the braces in formula (5).
Training the second decoding network 230 based on the second loss function means continually adjusting its parameters to obtain the face pose change parameters it outputs, and transforming the standard three-dimensional face model accordingly, until the face model obtained from the standard model is aligned with the predicted coordinate values of the key points output by the first decoding network 220 (i.e., the deviation is sufficiently small), so that a three-dimensional face model whose pose is identical or similar to that of the predicted coordinate values is obtained; the second decoding network 230 then completes training. In this way, the result output by the trained second decoding network 230 is a set of face pose change parameters that characterize the expression and/or shape changes, relative to the standard face model, of the face image corresponding to the predicted coordinate values of the key points.
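The text leaves the solver for the rotation/translation/scale unspecified; a common least-squares choice, shown here purely as a stand-in, is Umeyama-style similarity alignment between corresponding point sets:

    import numpy as np

    def similarity_transform(src, dst):
        # Least-squares scale s, rotation R, translation t with
        # dst ≈ s * R @ src + t, for corresponding (N, 3) point sets
        # (Umeyama's method).
        mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
        src_c, dst_c = src - mu_s, dst - mu_d
        cov = dst_c.T @ src_c / len(src)
        U, S, Vt = np.linalg.svd(cov)
        D = np.eye(3)
        if np.linalg.det(U @ Vt) < 0:      # guard against a reflection
            D[2, 2] = -1.0
        R = U @ D @ Vt
        s = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
        t = mu_d - s * R @ mu_s
        return s, R, t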
Based on the face pose change parameters output by the trained second decoding network 230, what is obtained is, in essence, a three-dimensional face model corresponding to the predicted coordinate values of the key points of the input sample. From this three-dimensional face model, the three-dimensional coordinates of any point on the model can be derived, and therefore the three-dimensional coordinates of a large number of key points of the sample can be obtained.
In one embodiment, the parameters of the prediction network 240 are determined through training based on the parameters determined for the encoding network 210 and a third loss function.
As shown in FIG. 5, based on the face image samples annotated with key points output by the encoding network 210, combined with the obtained predicted coordinate values and face pose change parameters, the parameters of the prediction network 240 are trained with the third loss function until the third loss function satisfies a preset training condition; the network parameters at that point are the parameters of the trained prediction network 240, i.e., the prediction network 240 completes training. The training of the prediction network 240 is performed based on the third loss function, and the preset training condition may be that the value of the third loss function falls below a threshold, that the third loss function converges, or another training condition, which is not limited by the present disclosure.
In one embodiment, the third loss function may take the form of a loss function based on least absolute deviation, a loss function based on least mean-square deviation, or another form, which is not limited by the present disclosure. The third loss function may be the deviation between the three-dimensional coordinates of the key points (i.e., the second key points) output by the prediction network 240 and the first coordinates, or another measure characterizing the accuracy of the three-dimensional coordinates of the key points output by the prediction network. An exemplary third loss function may be:
Loss_3 = ∑|(predict_pts − {(BFM_mean + shape * shape_std + exp * exp_std) * affine}) * mask|  (6);
where Loss_3 is the third loss function, predict_pts is the three-dimensional coordinates of the key points output by the prediction network, and the remaining parameters have the meanings given above. As noted above, the expression within the braces {} (the same as in formula (5)) represents the first coordinates of the key points; it is obtained as described above and is not repeated here.
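Under the same reconstruction, formula (6) differs from formula (5) only in comparing the dense output of the prediction network against the first coordinates; the dense mask handling is again an assumption:

    import numpy as np

    def loss3(predict_pts, first_coords, mask):
        # Formula (6): deviation between the dense key-point coordinates from
        # the prediction network and the first coordinates {...} of formula (5).
        return np.mean(np.abs((predict_pts - first_coords) * mask))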
Based on the trained prediction network 240, the following can be achieved: an unannotated face image is input to the encoding network 210, the encoding network 210 obtains a face image annotated with a small number of key points and sends it as the processing result to the prediction network 240, and the prediction network 240 can then output the three-dimensional coordinates of a large number of key points in the face image.
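In terms of the earlier FaceKeypointNet sketch, inference then reduces to a single forward pass; the input size and the dense output dimension are illustrative:

    import torch

    model.eval()
    image = torch.randn(3, 128, 128)   # stand-in for an unannotated face image
    with torch.no_grad():
        _, _, dense_pts = model(image.unsqueeze(0))   # (1, num_dense, 3)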
Those skilled in the art should understand that, to give the trained model better generalization ability and more accurate predictions while also preventing overfitting, regularization terms may be added to the first, second and third loss functions to improve the overall performance of the trained model.
In one embodiment, among the encoding network, first decoding network, second decoding network and prediction network included in the neural network for face detection, some networks may already be trained while others are not. For the specific training methods and the designed loss functions of the untrained networks, reference may be made to the foregoing description, which is not repeated here.
In one embodiment, the encoding network, first decoding network, second decoding network and prediction network included in the neural network for face detection may all be untrained. The networks may then be trained as follows: fix the initial parameters of the second decoding network and the prediction network, and jointly train the parameters of the encoding network and the first decoding network based on the first loss function; fix the trained parameters of the encoding network, and train the parameters of the second decoding network based on the second loss function; fix the trained parameters of the encoding network, and train the parameters of the prediction network based on the third loss function.
For the specific training method and the designed loss function of each network, reference may be made to the foregoing description, which is likewise not repeated here.
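One way to express this staged schedule, assuming the FaceKeypointNet sketch above, is to toggle which sub-modules receive gradients:

    def set_trainable(module, flag):
        for p in module.parameters():
            p.requires_grad = flag

    # Stage 1: joint training of encoder 210 + first decoder 220 (Loss_1).
    set_trainable(model.decoder_pose, False)
    set_trainable(model.predictor, False)
    set_trainable(model.encoder, True)
    set_trainable(model.decoder_coords, True)

    # Stage 2: encoder fixed, second decoder 230 trained (Loss_2).
    set_trainable(model.encoder, False)
    set_trainable(model.decoder_pose, True)

    # Stage 3: encoder fixed, prediction network 240 trained (Loss_3).
    set_trainable(model.decoder_pose, False)
    set_trainable(model.predictor, True)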
In the above embodiments, face image samples annotated with the three-dimensional coordinates of key points are used as the input of the neural network for face detection, and the network is trained by obtaining the predicted coordinate values of the key points of the samples and the face pose change parameters. With this neural network, a three-dimensional face model corresponding to the predicted coordinate values of the key points of the input sample can be obtained, so that inputting a face image sample annotated with the three-dimensional coordinates of only a small number of key points yields the three-dimensional coordinates of a large number of key points of that sample. Furthermore, based on the trained encoding network and prediction network, the following can be achieved: an unannotated face image is input to the encoding network, the encoding network obtains a face image annotated with a small number of key points and sends it as the processing result to the prediction network, and the prediction network can then output the three-dimensional coordinates of a large number of key points in the face image. In this way, detection of the three-dimensional coordinates of the key points of unannotated face images is effectively realized.
FIG. 6 is a flowchart of a method for training a neural network for face detection according to an exemplary embodiment of the present disclosure. As shown in FIG. 6, the training method includes the following steps:
In step 602, a sample including a face image is acquired, wherein the sample includes a face image annotated with the three-dimensional coordinates of at least one first key point.

In step 604, the sample is taken as the input of the neural network, and the predicted coordinate values of the first key points and the face pose change parameters are obtained.

In step 606, the neural network is trained based on the obtained predicted coordinate values and face pose change parameters, so that the output result of the trained neural network is the three-dimensional coordinates of a first number of second key points contained in the sample, wherein the first number is greater than the number of first key points annotated in the sample.
The above training method can be used with any neural network that needs to be trained on the predicted coordinate values of key points and the face pose change parameters in order to obtain the three-dimensional coordinates of the key points of a face image; the present disclosure places no limitation on the structure of the network trained with this method. Of course, those skilled in the art should understand that the method can also be used to train the neural network for face detection described above.
In one embodiment, the encoding network, the first decoding network and the second decoding network may be pre-trained networks. The encoding network outputs the processing result for the input face image samples, i.e., face image samples annotated with key points, and sends them to the first decoding network, the second decoding network and the prediction network; the first decoding network obtains the predicted coordinate values of the key points annotated in the samples, and the second decoding network obtains the face pose change parameters corresponding to the samples. In this case, only the prediction network needs to be trained; its specific training method has been detailed above and is not repeated here.
In one embodiment, the encoding network, the first decoding network and the second decoding network may be untrained networks, and the training methods described above may be used to train the encoding network, the first decoding network, the second decoding network and the prediction network. The specific training methods have been detailed above and are likewise not repeated here.
In addition, those skilled in the art should understand that the above training method can likewise obtain the predicted coordinate values of the key points and the face pose change parameters; how they are obtained has been detailed above and is not repeated here.
When the above neural network for face detection completes training, an encoding network and a prediction network with determined parameters are obtained. Based on them, a face detection method can be realized: the encoding network processes an acquired unannotated face image and outputs a face image sample annotated with a small number of key points, and the prediction network, taking the face image sample input from the encoding network, outputs the three-dimensional coordinates of a large number of key points of the face image. Based on the obtained three-dimensional coordinates of the face image key points, three-dimensional face reconstruction can be performed, as shown in FIG. 7.
The above face detection method can be deployed on a server to detect the key points of face images, or the trained encoding network and prediction network can be installed on a mobile terminal to perform face detection on the mobile terminal.
In this embodiment, with the trained face model, an unannotated face image can be input to the encoding network and the prediction network to obtain the three-dimensional coordinates of the key points of that face image, and based on those three-dimensional coordinates, reconstruction of the three-dimensional face can further be realized.
Correspondingly, the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the method of any of the above embodiments.
Correspondingly, the present disclosure further provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when executed by the processor, the program implements the steps of the method of any of the above embodiments.
The present disclosure may take the form of a computer program product implemented on one or more storage media containing program code (including but not limited to disk storage, CD-ROM, optical storage, and the like). Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to: phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Specific embodiments of the present disclosure have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that of the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
本领域技术人员在考虑说明书及实践这里申请的发明后,将容易想到本公开的其它实施方案。本公开旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未申请的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由下面的权利要求指出。Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention claimed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common general knowledge or techniques in the technical field to which this disclosure is not claimed . The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
应当理解的是,本公开并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求来限制。It is to be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
以上所述仅为本公开的较佳实施例而已,并不用以限制本公开,凡在本公开的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本公开保护的范围之内。The above descriptions are only preferred embodiments of the present disclosure, and are not intended to limit the present disclosure. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present disclosure shall be included in the present disclosure. within the scope of protection.

Claims (22)

  1. A neural network for face detection, the neural network comprising:
    an encoding network, configured to acquire a sample including a face image and take the sample as its output, wherein the face image in the sample is annotated with three-dimensional or two-dimensional coordinates of at least one first key point;
    a first decoding network, configured to extract, according to the output of the encoding network, a predicted coordinate value of each first key point on the face image in the sample;
    a second decoding network, configured to acquire, according to the output of the encoding network, a face pose change parameter corresponding to the face image in the sample; and
    a prediction network, configured to be trained by taking as input the output of the encoding network, the predicted coordinate values acquired by the first decoding network, and the face pose change parameter acquired by the second decoding network, and to output a face model containing three-dimensional or two-dimensional coordinate values of at least two second key points on the face image in the sample, wherein the number of the second key points in the output face model is greater than the number of the first key points annotated in the sample.
  2. The neural network according to claim 1, wherein the sample further includes a face image annotated with two-dimensional coordinates of key points.
  3. The neural network according to claim 1, wherein the encoding network is a lightweight neural network.
  4. The neural network according to claim 1, wherein the first decoding network being configured to extract, according to the output of the encoding network, the predicted coordinate value of each first key point on the face image in the sample comprises:
    the first decoding network acquiring, according to the output of the encoding network, a UV coordinate map corresponding to the face image in the sample; and
    extracting, based on a mapping relationship between the UV coordinate map and three-dimensional or two-dimensional coordinates, the predicted coordinate value of each first key point on the face image in the sample.
  5. The neural network according to claim 1, wherein the parameters of the encoding network and the first decoding network are obtained by:
    jointly training the encoding network and the first decoding network based on a first loss function.
  6. The neural network according to claim 5, wherein the parameters of the second decoding network are obtained by:
    training the second decoding network based on the determined parameters of the encoding network and a second loss function.
  7. The neural network according to claim 6, wherein the parameters of the prediction network are determined by training based on the determined parameters of the encoding network and a third loss function.
  8. The neural network according to claim 6, wherein the first loss function is the deviation between the predicted coordinate value of each first key point on the face image in the sample and the annotated coordinate value of each first key point on the face image in the sample.
  9. The neural network according to claim 7, wherein training the second decoding network based on the determined parameters of the encoding network and the second loss function comprises:
    acquiring, according to the predicted coordinate values of each first key point acquired by the first decoding network and a standard three-dimensional face model, a transformation matrix of the standard three-dimensional face model relative to the predicted coordinate values of each first key point;
    transforming the standard three-dimensional face model according to the transformation matrix to obtain a first coordinate value of each first key point; and
    training the parameters of the second decoding network based on the second loss function according to the predicted coordinate value and the first coordinate value of each first key point.
  10. The neural network according to claim 9, wherein the standard three-dimensional face model is a parametrically represented three-dimensional face model obtained based on the 3DMM method.
  11. The neural network according to claim 9, wherein the second loss function is the deviation between the predicted coordinate value of each first key point and the first coordinate value.
  12. The neural network according to claim 9, wherein the third loss function is the deviation between the three-dimensional or two-dimensional coordinate value of each second key point output by the prediction network and the first coordinate value.
  13. A training method for a neural network for face detection, the training method comprising:
    acquiring a sample including a face image, wherein the face image in the sample is annotated with three-dimensional or two-dimensional coordinates of at least one first key point;
    taking the sample as input to the neural network, and acquiring a predicted coordinate value of each first key point and a face pose change parameter corresponding to the face image; and
    training the neural network based on the acquired predicted coordinate values and the face pose change parameter, so that the output of the trained neural network is a face model containing three-dimensional or two-dimensional coordinates of a first number of second key points, wherein the first number is greater than the number of the first key points annotated in the sample.
  14. The training method according to claim 13, wherein the neural network for face detection comprises:
    an encoding network, configured to acquire a sample including a face image and take the sample as its output, wherein the face image in the sample is annotated with three-dimensional or two-dimensional coordinates of at least one first key point; and
    a first decoding network, configured to take the sample as input, acquire, according to the output of the encoding network, a UV coordinate map corresponding to the face image in the sample, and extract, based on a mapping relationship between the UV coordinate map and three-dimensional coordinates, the predicted coordinate value of each first key point on the face image in the sample.
  15. The training method according to claim 14, further comprising:
    jointly training the encoding network and the first decoding network based on a first loss function, and determining the parameters of the encoding network and the parameters of the first decoding network.
  16. The training method according to claim 15, wherein the neural network for face detection further comprises a second decoding network configured to acquire, based on the output of the encoding network, the face pose change parameter corresponding to the face image in the sample, the training method further comprising:
    training the second decoding network based on the determined parameters of the encoding network and a second loss function, and determining the parameters of the second decoding network.
  17. The training method according to claim 16, wherein the neural network for face detection further comprises a prediction network configured to output, based on the predicted coordinate value of each first key point and the face pose change parameter, a face model containing three-dimensional or two-dimensional coordinates of a first number of second key points, the training method further comprising:
    training the prediction network based on the determined parameters of the encoding network and a third loss function, and determining the parameters of the prediction network.
  18. The training method according to claim 16, wherein the second decoding network acquiring the face pose change parameter corresponding to the face image in the sample comprises:
    acquiring, according to the predicted coordinate values of each first key point acquired by the first decoding network and a standard three-dimensional face model, a transformation matrix of the standard three-dimensional face model relative to the predicted coordinate values;
    transforming the standard three-dimensional face model according to the transformation matrix to obtain a first coordinate value of each first key point;
    training the parameters of the second decoding network based on the second loss function according to the predicted coordinate value and the first coordinate value of each first key point, and determining the parameters of the second decoding network; and
    outputting, by the trained second decoding network and based on the sample, the face pose change parameter corresponding to the face image in the sample.
  19. A method of face detection, wherein the method is implemented by the encoding network according to any one of claims 1 to 12 and a prediction network obtained by training, the method comprising:
    performing, by the encoding network, key point annotation processing on an acquired unlabeled face image, and outputting the face image annotated with at least one first key point to the prediction network as a processing result; and
    outputting, by the prediction network, a face model including three-dimensional or two-dimensional coordinates of at least two second key points on the face image, wherein the number of the second key points is greater than the number of the first key points.
  20. The method of face detection according to claim 19, wherein the encoding network and the prediction network obtained by training are installed on a mobile terminal.
  21. A computer storage medium storing computer program code which, when run on a processor, causes the processor to perform the steps of the method according to any one of claims 13 to 20.
  22. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 13 to 20.
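For readers who want to see the structure recited in claims 1, 3, and 4 as running code, below is a minimal, non-authoritative sketch in PyTorch of the four recited modules. All layer shapes, the 68-point/1220-point sparse-dense split, the 7-dimensional pose vector, and the fixed UV-index lookup table are assumptions for illustration; the claims do not bind the network to any of these choices.

```python
import torch
import torch.nn as nn

class EncodingNetwork(nn.Module):
    """Lightweight encoder (claim 3); depths and widths are illustrative."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, image):
        return self.backbone(image)

class FirstDecodingNetwork(nn.Module):
    """Regresses a UV coordinate map from the encoder output (claim 4)."""
    def __init__(self, feat_dim=128, uv_size=64):
        super().__init__()
        self.uv_size = uv_size
        self.fc = nn.Linear(feat_dim, 3 * uv_size * uv_size)

    def forward(self, feat):
        # One 3D coordinate per UV-map cell.
        return self.fc(feat).view(-1, 3, self.uv_size, self.uv_size)

def extract_first_keypoints(uv_map, uv_indices):
    """Look up the predicted coordinates of the annotated first key points
    via the fixed UV-map-to-keypoint mapping (claim 4)."""
    b, _, h, w = uv_map.shape
    flat = uv_map.reshape(b, 3, h * w)
    return flat[:, :, uv_indices].permute(0, 2, 1)  # (B, K, 3)

class SecondDecodingNetwork(nn.Module):
    """Regresses the face pose change parameters; a 7-vector
    (rotation quaternion + translation) is assumed here."""
    def __init__(self, feat_dim=128, pose_dim=7):
        super().__init__()
        self.fc = nn.Linear(feat_dim, pose_dim)

    def forward(self, feat):
        return self.fc(feat)

class PredictionNetwork(nn.Module):
    """Outputs a face model with more second key points than the
    annotated first key points (claim 1)."""
    def __init__(self, feat_dim=128, num_sparse=68, pose_dim=7, num_dense=1220):
        super().__init__()
        self.num_dense = num_dense
        self.fc = nn.Linear(feat_dim + num_sparse * 3 + pose_dim, num_dense * 3)

    def forward(self, feat, sparse_kpts, pose):
        x = torch.cat([feat, sparse_kpts.flatten(1), pose], dim=1)
        return self.fc(x).view(-1, self.num_dense, 3)

# Wiring the modules together on a dummy batch:
enc, dec1 = EncodingNetwork(), FirstDecodingNetwork()
dec2, pred = SecondDecodingNetwork(), PredictionNetwork()
image = torch.randn(2, 3, 128, 128)
feat = enc(image)
uv_indices = torch.randint(0, 64 * 64, (68,))  # stand-in for a real UV table
sparse = extract_first_keypoints(dec1(feat), uv_indices)
dense = pred(feat, sparse, dec2(feat))          # (2, 1220, 3)
```

The only structural constraint the claims actually impose on this part is that the number of second key points (1220 here) be larger than the number of annotated first key points (68 here).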
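Claims 8 to 12 and 18 define the three losses in terms of a standard 3D face model aligned to the predicted key points. The claims do not say how the transformation matrix is computed; the Umeyama similarity alignment and the mean-absolute deviations below are one plausible reading, sketched in NumPy, with all function names invented for illustration.

```python
import numpy as np

def make_3dmm_face(mean_shape, shape_basis, coeffs):
    """Parametric 3DMM face (claim 10): vertices = mean + basis @ coeffs.
    mean_shape: (3N,), shape_basis: (3N, M), coeffs: (M,)."""
    return (mean_shape + shape_basis @ coeffs).reshape(-1, 3)

def estimate_transform(standard_kpts, predicted_kpts):
    """Estimate scale s, rotation R, translation t mapping the standard
    model's key points onto the predicted key points -- one way to obtain
    the 'transformation matrix' of claim 9. Umeyama (1991) closed form;
    both inputs are (K, 3) arrays."""
    mu_s = standard_kpts.mean(axis=0)
    mu_p = predicted_kpts.mean(axis=0)
    src = standard_kpts - mu_s
    dst = predicted_kpts - mu_p
    cov = dst.T @ src / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    D = np.diag([1.0, 1.0, d])          # guard against reflections
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / src.var(axis=0).sum()
    t = mu_p - scale * R @ mu_s
    return scale, R, t

def first_coordinates(standard_kpts, predicted_kpts):
    """'First coordinate values' of claim 9: the standard model's key
    points after applying the estimated transform."""
    s, R, t = estimate_transform(standard_kpts, predicted_kpts)
    return s * (R @ standard_kpts.T).T + t

# Claim 8: first loss = deviation between predicted and annotated coords.
def loss_first(predicted_kpts, annotated_kpts):
    return np.abs(predicted_kpts - annotated_kpts).mean()

# Claim 11: second loss = deviation between the predicted coordinates
# and the first coordinate values.
def loss_second(predicted_kpts, standard_kpts):
    aligned = first_coordinates(standard_kpts, predicted_kpts)
    return np.abs(predicted_kpts - aligned).mean()

# Claim 12: third loss = deviation between the prediction network's
# dense output and the corresponding first coordinate values.
def loss_third(dense_pred, dense_first_coords):
    return np.abs(dense_pred - dense_first_coords).mean()
```

Mean absolute error is used purely as a stand-in for "deviation"; an L2 or Wing loss would satisfy the claims equally well. The staged schedule of claims 5 to 7 (train the encoding and first decoding networks jointly, then freeze the encoder's determined parameters to train the second decoding network, then the prediction network) follows by toggling which parameters receive gradient updates at each stage.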
PCT/CN2021/126065 2020-10-28 2021-10-25 Face detection neural network and training method, face detection method, and storage medium WO2022089360A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011173829.3A CN112287820A (en) 2020-10-28 2020-10-28 Face detection neural network, face detection neural network training method, face detection method and storage medium
CN202011173829.3 2020-10-28

Publications (1)

Publication Number Publication Date
WO2022089360A1

Family

ID=74373633

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/126065 WO2022089360A1 (en) 2020-10-28 2021-10-25 Face detection neural network and training method, face detection method, and storage medium

Country Status (2)

Country Link
CN (1) CN112287820A (en)
WO (1) WO2022089360A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287820A (en) * 2020-10-28 2021-01-29 广州虎牙科技有限公司 Face detection neural network, face detection neural network training method, face detection method and storage medium
CN112465278A (en) * 2021-02-01 2021-03-09 聚时科技(江苏)有限公司 Video prediction method, device, computer equipment and readable storage medium
CN113096001A (en) * 2021-04-01 2021-07-09 咪咕文化科技有限公司 Image processing method, electronic device and readable storage medium
CN113129362B (en) * 2021-04-23 2024-05-10 北京地平线机器人技术研发有限公司 Method and device for acquiring three-dimensional coordinate data
WO2023097479A1 (en) * 2021-11-30 2023-06-08 华为技术有限公司 Model training method and apparatus, and method and apparatus for constructing three-dimensional structure of auricle
CN115345931B (en) * 2021-12-15 2023-05-26 禾多科技(北京)有限公司 Object attitude key point information generation method and device, electronic equipment and medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200285837A1 (en) * 2015-06-24 2020-09-10 Samsung Electronics Co., Ltd. Face recognition method and apparatus
CN107122705A (en) * 2017-03-17 2017-09-01 中国科学院自动化研究所 Face critical point detection method based on three-dimensional face model
CN109960986A (en) * 2017-12-25 2019-07-02 北京市商汤科技开发有限公司 Human face posture analysis method, device, equipment, storage medium and program
US20190295302A1 (en) * 2018-03-22 2019-09-26 Northeastern University Segmentation Guided Image Generation With Adversarial Networks
CN108805977A (en) * 2018-06-06 2018-11-13 浙江大学 A kind of face three-dimensional rebuilding method based on end-to-end convolutional neural networks
CN109034131A (en) * 2018-09-03 2018-12-18 福州海景科技开发有限公司 A kind of semi-automatic face key point mask method and storage medium
CN110287846A (en) * 2019-06-19 2019-09-27 南京云智控产业技术研究院有限公司 A kind of face critical point detection method based on attention mechanism
CN110516642A (en) * 2019-08-30 2019-11-29 电子科技大学 A kind of lightweight face 3D critical point detection method and system
CN110705355A (en) * 2019-08-30 2020-01-17 中国科学院自动化研究所南京人工智能芯片创新研究院 Face pose estimation method based on key point constraint
CN111652105A (en) * 2020-05-28 2020-09-11 南京审计大学 Face feature point positioning method based on depth measurement learning
CN112287820A (en) * 2020-10-28 2021-01-29 广州虎牙科技有限公司 Face detection neural network, face detection neural network training method, face detection method and storage medium

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114845067A (en) * 2022-07-04 2022-08-02 中科计算技术创新研究院 Hidden space decoupling-based depth video propagation method for face editing
CN115393487A (en) * 2022-10-27 2022-11-25 科大讯飞股份有限公司 Virtual character model processing method and device, electronic equipment and storage medium
CN115426505A (en) * 2022-11-03 2022-12-02 北京蔚领时代科技有限公司 Preset expression special effect triggering method based on face capture and related equipment
CN115426505B (en) * 2022-11-03 2023-03-24 北京蔚领时代科技有限公司 Preset expression special effect triggering method based on face capture and related equipment
CN115578392A (en) * 2022-12-09 2023-01-06 深圳智能思创科技有限公司 Line detection method, device and storage medium
CN115578392B (en) * 2022-12-09 2023-03-03 深圳智能思创科技有限公司 Line detection method, device and storage medium
CN116055211B (en) * 2023-02-14 2023-11-17 成都理工大学工程技术学院 Method and system for identifying identity and automatically logging in application based on neural network
CN116055211A (en) * 2023-02-14 2023-05-02 成都理工大学工程技术学院 Method and system for identifying identity and automatically logging in application based on neural network
CN116229008A (en) * 2023-03-06 2023-06-06 北京百度网讯科技有限公司 Image processing method and device
CN116229008B (en) * 2023-03-06 2023-12-12 北京百度网讯科技有限公司 Image processing method and device
CN116309591A (en) * 2023-05-19 2023-06-23 杭州健培科技有限公司 Medical image 3D key point detection method, model training method and device
CN116309591B (en) * 2023-05-19 2023-08-25 杭州健培科技有限公司 Medical image 3D key point detection method, model training method and device
CN116469175B (en) * 2023-06-20 2023-08-29 青岛黄海学院 Visual interaction method and system for infant education
CN116469175A (en) * 2023-06-20 2023-07-21 青岛黄海学院 Visual interaction method and system for infant education
CN116665284A (en) * 2023-08-02 2023-08-29 深圳宇石科技有限公司 Face modeling and mask model partition matching method, device, terminal and medium
CN116665284B (en) * 2023-08-02 2023-11-28 深圳宇石科技有限公司 Face modeling and mask model partition matching method, device, terminal and medium

Also Published As

Publication number Publication date
CN112287820A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
WO2022089360A1 (en) Face detection neural network and training method, face detection method, and storage medium
CN109214343B (en) Method and device for generating face key point detection model
CN108509915B (en) Method and device for generating face recognition model
KR102663519B1 (en) Cross-domain image transformation techniques
US11880927B2 (en) Three-dimensional object reconstruction from a video
US11238272B2 (en) Method and apparatus for detecting face image
CN110148085B (en) Face image super-resolution reconstruction method and computer readable storage medium
CN111553267B (en) Image processing method, image processing model training method and device
CN111401216B (en) Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
US11620521B2 (en) Smoothing regularization for a generative neural network
CN111898696A (en) Method, device, medium and equipment for generating pseudo label and label prediction model
CN109712234B (en) Three-dimensional human body model generation method, device, equipment and storage medium
WO2021208601A1 (en) Artificial-intelligence-based image processing method and apparatus, and device and storage medium
CN111754541A (en) Target tracking method, device, equipment and readable storage medium
US11960570B2 (en) Learning contrastive representation for semantic correspondence
CN109754464B (en) Method and apparatus for generating information
CN111524216B (en) Method and device for generating three-dimensional face data
CN111275784A (en) Method and device for generating image
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
US20220222832A1 (en) Machine learning framework applied in a semi-supervised setting to perform instance tracking in a sequence of image frames
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
WO2021127916A1 (en) Facial emotion recognition method, smart device and computer-readabel storage medium
CN110866469A (en) Human face facial features recognition method, device, equipment and medium
CN114298997B (en) Fake picture detection method, fake picture detection device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21885087

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21885087

Country of ref document: EP

Kind code of ref document: A1