WO2024032464A1 - Three-dimensional face reconstruction method, apparatus, and device, medium, and product - Google Patents

Three-dimensional face reconstruction method, apparatus, and device, medium, and product

Info

Publication number: WO2024032464A1
Authority: WIPO (PCT)
Prior art keywords: face, image, dimensional, reconstruction, dimensional face
Application number: PCT/CN2023/111005
Other languages: French (fr), Chinese (zh)
Inventor: 靳凯
Original assignee: 广州市百果园信息技术有限公司 (Guangzhou Baiguoyuan Information Technology Co., Ltd.)
Application filed by 广州市百果园信息技术有限公司
Publication of WO2024032464A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
        • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
            • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
        • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
            • G06V10/00 Arrangements for image or video recognition or understanding
                • G06V10/40 Extraction of image or video features
                    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
                        • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
                • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
                    • G06V10/82 Arrangements using neural networks
            • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
                • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
                    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
                        • G06V40/161 Detection; Localisation; Normalisation
                        • G06V40/174 Facial expression recognition

Definitions

  • This application relates to the field of image processing technology, for example, to a three-dimensional face reconstruction method and its device, equipment, media, and products.
  • A neural-network-based 3DMM (3D Morphable Model) requires rich and accurate training data to achieve good reconstruction results, which means the training cost is high.
  • This application provides a three-dimensional face reconstruction method and its corresponding devices, equipment, non-volatile readable storage media, and computer program products.
  • a three-dimensional face reconstruction method including the following steps:
  • the parameter mapping layer of the three-dimensional face reconstruction network is used to map the face area image to the corresponding parameter coefficients in the parameterized three-dimensional face model; the parameter coefficients include an identity coefficient corresponding to the facial identity and an expression coefficient corresponding to the facial expression.
  • a three-dimensional face reconstruction device including:
  • the image acquisition module is configured to acquire face image data and extract the face image therein;
  • a face detection module configured to perform key point detection on the face image and obtain a face region image of the area where the face key points are located;
  • the face modeling module is configured to use the bilinear modeling layer of the three-dimensional face reconstruction network pre-trained to a converged state to perform bilinear modeling of facial identity and facial expression on the face area image and obtain a parameterized three-dimensional face model;
  • a parameter mapping module configured to use the parameter mapping layer of the three-dimensional face reconstruction network to map the face area image into the corresponding parameter coefficients in the parameterized three-dimensional face model, where the parameter coefficients include the identity coefficient corresponding to the facial identity and the expression coefficient corresponding to the facial expression.
  • a three-dimensional face reconstruction device including a central processor and a memory.
  • the central processor is configured to call and run a computer program stored in the memory to execute the steps of the three-dimensional face reconstruction method described in the present application.
  • a non-volatile readable storage medium, which stores, in the form of computer-readable instructions, a computer program implementing the three-dimensional face reconstruction method; when called and run by a computer, the computer program executes the steps involved in the method.
  • a computer program product including a computer program/instructions that, when executed by a processor, implement the steps of the method described in any embodiment of the present application.
  • Figure 1 is a schematic flow chart of an embodiment of the three-dimensional face reconstruction method of the present application.
  • Figure 2 is a schematic flowchart of an exemplary scenario application of the three-dimensional face reconstruction method of the present application;
  • Figure 3 is a schematic diagram of the expression migration results of the three-dimensional face model in the embodiment of the present application.
  • Figure 4 is a schematic flowchart of obtaining a face area image in an embodiment of the present application.
  • Figure 5 is a schematic diagram of the results of obtaining a three-dimensional face model in an embodiment of the present application.
  • Figure 6 is a schematic flowchart of parameter mapping for facial feature maps in an embodiment of the present application.
  • Figure 7 is a schematic flowchart of training a three-dimensional face reconstruction network in an embodiment of the present application.
  • Figure 8 is a schematic diagram of the training framework used in the three-dimensional face reconstruction network method in the embodiment of the present application.
  • Figure 9 is a schematic flow chart of the calculation of the reconstruction loss function in the embodiment of the present application.
  • Figure 10 is a functional block diagram of the three-dimensional face reconstruction device of the present application.
  • Figure 11 is a schematic structural diagram of a three-dimensional face reconstruction device used in this application.
  • the models cited or that may be cited in this application include traditional machine learning models and deep learning models. Unless expressly specified, they can be deployed on a remote server and called remotely from the client, or deployed and called directly on a client with sufficient device capability; in some embodiments, when a model runs on the client, its corresponding machine intelligence can be obtained through transfer learning, so as to reduce the requirements on client hardware resources and avoid excessive occupation of them.
  • Step S1100 Obtain facial image data and extract facial images therein;
  • Face image data refers to image data with human face parts. This type of image data can be obtained through authorized live broadcast, on-demand and other legal channels. It can be video stream data or image data.
  • the image data of the real person needs to be collected in real time through a camera and sent to a backend server for further processing, which generates a digital person image and uses it to replace the real person in the image data; finally, the image data carrying the digital person image is output to the audience-facing display terminal device.
  • the collected image data of real people can be used as the face image data.
  • video data that has already been shot can be stored on the server; relevant technical personnel can capture the image data containing the target person, replace the target person with the correspondingly generated digital human image, and finally generate the corresponding image file.
  • the image data with the target person can be used as the face image data.
  • some advertising posters need to use digital human images to attract the masses.
  • an image with a real person is first captured by a camera and then handed over to relevant technical personnel, who generate a digital human image of the corresponding style to replace the real person in the image.
  • the image with a real person is the face image data.
  • the face image data may be a kind of video stream data or a kind of image data.
  • it is necessary to further extract the face images from the face image data; that is, when the face image data is video stream data, each frame is extracted as a face image, and when the face image data is image data, the face image data itself is the face image.
  • the extracted face images need to be in a unified format, which can be YUV420, RGB24, YUV444, or another similar encoding format.
  • unifying the image data format keeps the interfaces of subsequent operations consistent, facilitating unified processing and rapid completion.
  • Step S1200 Perform key point detection on the face image to obtain a face region image of the area where the face key points are located;
  • face detection and face key point detection are performed to detect and obtain the face area image and face key points in the face image.
  • a face detection model pre-trained to a converged state is used to perform face detection to obtain face target frame information.
  • the face target frame information includes the coordinate information of the upper-left point and the lower-right point of the face part.
  • the image at the corresponding area position is cropped from the face image to give the face area image, which eliminates the interference of redundant image information from non-face areas and focuses more on the face information.
  • a face key point detection model pre-trained to a converged state is used to perform face key point detection to obtain face key point information.
  • the face key points are key points pointing to the face part in the face area image, which can represent the location of the key areas of the face, such as eyebrows, eyes, nose, mouth, facial contour, etc.
  • after obtaining the face area image and face key points, a standard alignment operation also needs to be performed.
  • a preset standard three-dimensional face model can be projected onto a two-dimensional plane, and the standard face key point information on the two-dimensional plane is obtained accordingly; the face key points are then aligned and matched with the standard face key points to obtain standard transformation parameters, and the face area image is transformed into a face area image with standard size and angle according to these parameters.
  • Step S1300 Use the bilinear modeling layer of the three-dimensional face reconstruction network pre-trained to the convergence state to perform bilinear modeling of facial identity and facial expression on the face area image to obtain a parameterized three-dimensional face model;
  • the 3D face reconstruction network includes a two-layer structure.
  • the first layer is a bilinear modeling layer, which, based on a parameterized three-dimensional face model, performs decoupled modeling of facial identity and facial expression for the face region image; the corresponding identity coefficient and expression coefficient still need to be determined;
  • the second layer is the parameter mapping layer, which is used to map the face area image to the corresponding parameter coefficients in the parameterized three-dimensional face model; the parameter coefficients include an identity coefficient corresponding to the facial identity and an expression coefficient corresponding to the facial expression.
  • a parameterized face model is first determined as the three-dimensional face model to be optimized; in one embodiment, the parameterized face model can be the BFM (Basel Face Model) model.
  • the BFM model is based on the 3DMM (3D Morphable Models) statistical model; according to the 3DMM principle, each face is a superposition of shape vectors and texture vectors.
  • in the bilinear formulation, the face mesh is generated by contracting a core tensor with the two coefficient vectors: vertex = core_tensor × identity × expression, where vertex represents the face mesh vertices, identity represents the identity coefficient, expression represents the expression coefficient, and core_tensor represents the tensor representation of the three-dimensional face model mesh vertices; the two multiplications contract the identity mode and the expression mode of the core tensor respectively.
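  • As an illustration of this bilinear contraction, the following minimal NumPy sketch reconstructs mesh vertices from an identity coefficient and an expression coefficient. The coefficient dimensions follow the 79-identity, 46-expression database described below; the random core tensor and the vertex count are illustrative stand-ins for a real pre-built face model database.

```python
import numpy as np

num_id, num_exp, num_vert = 79, 46, 1000  # illustrative sizes

# Stand-in core tensor: (3 * num_vert) mesh coordinates x identity mode x expression mode.
core_tensor = np.random.randn(3 * num_vert, num_id, num_exp)

def reconstruct_vertices(identity: np.ndarray, expression: np.ndarray) -> np.ndarray:
    """vertex = core_tensor contracted with the identity and expression coefficients."""
    flat = np.einsum('vie,i,e->v', core_tensor, identity, expression)
    return flat.reshape(num_vert, 3)  # one (x, y, z) row per mesh vertex

identity = np.random.randn(num_id)     # 79-dimensional identity coefficient
expression = np.random.randn(num_exp)  # 46-dimensional expression coefficient
vertices = reconstruct_vertices(identity, expression)  # shape (1000, 3)
```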
  • the 3DMM based on the bilinear model uses coefficient multiplication to decouple the identity information and expression information of the face for modeling, and allows the identity coefficient and the expression coefficient to be applied separately, enabling applications such as expression migration. In one embodiment, people with different identities and the same expression can be represented by different identity coefficients combined with the same expression coefficient; in another embodiment, people with the same identity and different expressions can be represented by the same identity coefficient combined with different expression coefficients.
  • the three-dimensional face model database can be set by relevant technical personnel according to actual application scenarios and actual business needs.
  • this application pre-constructs a three-dimensional face model database with 79 identities and 46 expression types; that is, the vector dimension of the identity coefficient in the face model is 79 and the vector dimension of the expression coefficient is 46.
  • the size of the three-dimensional face model database, the number of expression types, and the vector dimensions of the identity coefficient and expression coefficient can all be adjusted according to the actual application scenario without affecting the applicability of the method.
  • Step S1400 Use the parameter mapping layer of the three-dimensional face reconstruction network to map the face area image into the corresponding parameter coefficients in the parameterized three-dimensional face model; the parameter coefficients include the identity coefficient corresponding to the facial identity and the expression coefficient corresponding to the facial expression.
  • the parameter mapping layer is the second-layer structure of the three-dimensional face reconstruction network and is used to map the face region image to the corresponding parameter coefficients in the parameterized three-dimensional face model.
  • the face area image contains all the information of the target face, such as the identity information representing the face's identity and the expression information representing the facial expression; it is therefore feasible to construct a mapping relationship between the image and the identity coefficient and expression coefficient of the three-dimensional face model.
  • texture parameters, lighting parameters, posture parameters, and transformation parameters are likewise all expressed in the face area image, so it is also feasible to construct corresponding mapping relationships for these parameters.
  • a mapping relationship can thus be constructed between the face area image and the identity parameters, expression parameters, texture parameters, lighting parameters, posture parameters, and transformation parameters, yielding the identity coefficient, expression coefficient, texture coefficient, lighting coefficient, posture coefficient, transformation coefficient, and so on.
  • the encoder in the three-dimensional face reconstruction network is first used to perform feature extraction on the face area image to obtain its depth features, called the face feature map; the face feature map is then spatially mapped to obtain all the parameter coefficients, including the identity coefficient, expression coefficient, texture coefficient, lighting coefficient, posture coefficient, and transformation coefficient, where the identity coefficient and expression coefficient are the parameter coefficients corresponding to the identity parameters and expression parameters in the bilinear modeling layer.
  • the parameter coefficients corresponding to each face image can be stored independently for later use and combined arbitrarily to construct different three-dimensional face models, thereby obtaining face images with different effects. For example, one identity coefficient can be combined with multiple expression coefficients to generate face images of the same person with different expressions, or one expression coefficient can be combined with multiple identity coefficients to generate face images of different characters with the same expression.
  • after using the parameter mapping layer of the three-dimensional face reconstruction network to map the face region image to the corresponding parameter coefficients in the parameterized three-dimensional face model, the method includes:
  • Three-dimensional reconstruction is performed according to the parameter coefficients to obtain a three-dimensional face model of the face area image.
  • the identity coefficient and the expression coefficient among the parameter coefficients are used to construct the corresponding three-dimensional face model. Therefore, by carrying out the above process on a face area image to obtain the parameterized three-dimensional face model together with the identity coefficient and expression coefficient, a three-dimensional face model that effectively reflects the identity information and expression information of the face area image can be obtained.
  • the reconstructed mesh can accordingly be written as V = core_tensor × α_id(F_g(x)) × α_exp(F_g(x)), where α_exp(F_g(x)) represents the expression coefficient output by the parameter mapping layer and α_id(F_g(x)) represents the identity coefficient output by the parameter mapping layer in the three-dimensional face reconstruction network.
  • after obtaining the face area image of the area where the key points of the face are located in the face image, this application uses the bilinear modeling layer of the three-dimensional face reconstruction network pre-trained to the convergence state to perform bilinear modeling of identity information and expression information on the face area image, obtaining a parameterized three-dimensional face model; the parameter mapping layer of the three-dimensional face reconstruction network is then used to map the face region image to the corresponding parameter coefficients in the parameterized three-dimensional face model, completing the reconstruction of the three-dimensional face model.
  • the three-dimensional face reconstruction method uses a bilinear modeling layer to decouple the identity information and expression information in the face, thereby effectively separating the expression parameters and realizing expression migration, which can greatly promote live broadcast, film and television, animation, and other related industries;
  • the three-dimensional face reconstruction network is suited to training with a weakly supervised learning method based on single images, which can greatly reduce the acquisition and labeling costs of training data and contributes to large-scale application.
  • after the parameter mapping layer of the three-dimensional face reconstruction network is used to map the face area image to the corresponding parameter coefficients in the parameterized three-dimensional face model, the method further includes:
  • Step S1500 Obtain the target parameter coefficients required to constitute the parameterized three-dimensional face model, where the target parameter coefficients include pre-specified identity coefficients and pre-specified expression coefficients;
  • the parameterized three-dimensional face model is constructed in the bilinear modeling layer of the three-dimensional face reconstruction network, and its undetermined parameter coefficients are the identity coefficient (vector dimension 79) and the expression coefficient (vector dimension 46).
  • Step S1600 Migrate the target parameter coefficients to the three-dimensional face model of the corresponding digital person to obtain the three-dimensional face model of the digital person;
  • the previous step completed the reconstruction of the three-dimensional face model of the face area image, but actual application scenarios lean more towards applying its digital image.
  • a digital person is used to replace the face part in the face area image, in order to substitute a "digital person" for the "real person" in activities such as live broadcasting or communication and interaction.
  • the real-time emotional simulation of "digital people” has become an urgent problem to be solved.
  • One solution is to migrate the real expressions of the "real person” to the "digital person” so that it can simultaneously express the emotions of the "real person".
  • the bilinear modeling layer constructed by the present application can realize the decoupling of expression information; by migrating the expression coefficients from the three-dimensional face model of the "real person" into the three-dimensional face model of the "digital person", the expression migration from "real person" to "digital person" can be completed.
  • the number of identities and the number of expressions, that is, the vector dimensions of the identity coefficient and the expression coefficient, should be consistent between the two models.
  • the expression coefficient corresponding to the "real person" can directly replace the expression coefficient in the "digital person" three-dimensional face model while the other parameters remain unchanged, yielding the three-dimensional face model of the digital human after expression transfer.
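  • A minimal sketch of this coefficient swap, assuming the coefficients of each model are kept as NumPy arrays in a plain dictionary (the key names are illustrative):

```python
import numpy as np

def transfer_expression(real_coeffs: dict, digital_coeffs: dict) -> dict:
    """Replace the digital human's expression coefficient with the real person's,
    leaving identity, texture and every other coefficient unchanged."""
    # both models must use the same expression dimensionality (46 here)
    assert real_coeffs['expression'].shape == digital_coeffs['expression'].shape
    out = dict(digital_coeffs)
    out['expression'] = real_coeffs['expression'].copy()
    return out
```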
  • Step S1700 Render and project the three-dimensional face model of the digital human into a two-dimensional image space to obtain a digital human image.
  • after obtaining the three-dimensional face model of the "digital human", three-dimensional rendering and projection are performed based on the illumination coefficient, posture coefficient, and transformation coefficient obtained in step S1400, together with the texture coefficient of the "digital human" itself.
  • the image of the "digital human” is obtained, that is, the expression migration from the face area image to the "digital human” image is completed.
  • the face area image in each single-frame face image is obtained and replaced with the "digital human" image, so that the "digital human" can be broadcast simultaneously.
  • This type of application is one of the scenarios where the expression migration function of the method is applied, and it can also be used in other scenarios.
  • the method is aimed at the decoupled modeling of identity information and expression information, which can bring huge benefits to industries such as live broadcast, film and television, and digital imaging; it has great application value, and its expression migration application does not affect other face information.
  • please refer to Figure 4 for how key point detection is implemented on the face image.
  • Step S1210 Perform face key point detection on the face image to obtain the face area image and face key point information
  • the face detection model pre-trained to the convergence state is used to perform face detection on the face image, and the face rectangular frame information in the face image is obtained.
  • the face rectangular frame calibrates the position and size of the face part in the face image, and the calibration result can be represented by a set with four coordinate elements, denoted S_roi.
  • the corresponding area image is selected from the face image according to the set, that is, the face area image is obtained.
  • the face area image completely contains the face part, and redundant parts of other non-face areas in the face image are removed.
  • S_roi = {x₁, y₁, x₂, y₂}, where x₁ and y₁ represent the pixel coordinates of the upper-left corner of the detected face part, and x₂ and y₂ represent the pixel coordinates of the lower-right corner of the face part.
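  • A minimal sketch of cutting the face area image out of the full face image according to S_roi, with the box clamped to the image bounds (the helper name is illustrative):

```python
import numpy as np

def crop_face_region(face_image: np.ndarray, s_roi: tuple) -> np.ndarray:
    """Extract the detected face rectangle S_roi = (x1, y1, x2, y2) from an
    H x W x 3 face image, discarding the redundant non-face areas."""
    x1, y1, x2, y2 = s_roi
    h, w = face_image.shape[:2]
    x1, y1 = max(0, int(x1)), max(0, int(y1))
    x2, y2 = min(w, int(x2)), min(h, int(y2))
    return face_image[y1:y2, x1:x2]
```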
  • the face detection model and face key point detection model are implemented by neural network models. In practical applications, relatively excellent face detection models and face key point detection models in related technologies can be used.
  • Step S1220 Align the face key points with standard face key points to obtain standard alignment parameters.
  • the standard face key points are corresponding face key points obtained by two-dimensional projection of a standard three-dimensional face model;
  • the face contours in the face area image have different angles and sizes, which can easily interfere with subsequent three-dimensional face parameter calibration work. Therefore, it is necessary to perform standard alignment on the face area images.
  • the face key points are also detected from the standard face image projected from the standard three-dimensional face model to the two-dimensional plane, thereby obtaining the standard face key points.
  • the standard three-dimensional face model can be preset by relevant technical personnel.
  • the face key points detected from the face area image are aligned with the standard face key points to obtain the corresponding standard transformation parameters.
  • the method used for the alignment operation can be any minimization method, such as PnP or the least squares method; in one embodiment of the present application, the PnP method is used.
  • the standard transformation parameters include translation transformation parameters and scale transformation parameters.
  • Step S1230 Align the face area image according to the standard alignment parameter.
  • a standard transformation is performed on the face area image S_roi and the face key points L_n according to the standard transformation parameters, and the image size is adjusted to a preset size, which is 224×224×3 in one embodiment of the present application.
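  • The sketch below shows one way to estimate and apply such parameters, under the assumption stated above that the standard transformation reduces to a single scale plus a translation (a closed-form least-squares fit; the application itself mentions PnP as another option):

```python
import numpy as np

def estimate_scale_translation(pts: np.ndarray, std_pts: np.ndarray):
    """Least-squares scale s and translation t such that std_pts ~ s * pts + t,
    fitted over matched face key points (K x 2 arrays)."""
    mu, std_mu = pts.mean(axis=0), std_pts.mean(axis=0)
    pc, sc = pts - mu, std_pts - std_mu
    s = float((pc * sc).sum() / (pc * pc).sum())  # closed-form 1-D least squares
    t = std_mu - s * mu
    return s, t

def apply_alignment(points: np.ndarray, s: float, t: np.ndarray) -> np.ndarray:
    """Map detected key points (or box corners) into the standard frame."""
    return s * points + t
```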
  • the Hough transform can be used to obtain the posture information of the three-dimensional face model corresponding to the face region image.
  • the posture information of the three-dimensional face model includes pitch angle, roll angle and rotation angle.
  • the parameter mapping layer of the three-dimensional face reconstruction network is used to map the face area image to the corresponding parameter coefficients in the parameterized three-dimensional face model, including:
  • Step S1410 Use the encoder in the three-dimensional face reconstruction network to perform feature extraction on the face area image to obtain a face feature map
  • the encoder pre-trained to convergence is used to perform feature extraction on the face area image obtained in step S1200 to obtain the face feature map.
  • the face feature map reduces the interference of redundant information from non-face areas of the face image, allowing the semantic information of the face part to be extracted more effectively.
  • the encoder is implemented by a neural network model.
  • the neural network model can use any of a variety of well-established feature extraction models from related technologies, including the VGG16, VGG19, InceptionV3, Xception, MobileNet, AlexNet, LeNet, ZF_Net, ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152 models, all of which are mature feature extraction models.
  • the feature extraction model is a neural network model that has been trained to convergence. In one embodiment, it is trained to convergence on the ImageNet large-scale data set.
  • the output of the encoder is set to a feature map.
  • the encoder directly outputs the feature map of the last convolutional layer, which is called a face feature map.
  • the input size of the encoder is defined as N×C×H×W and the output size as N×C′×H′×W′, where N represents the number of samples, C the number of channels, H×W the preset image size, C′ the number of features, and H′×W′ the feature map size.
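  • A minimal PyTorch sketch of such an encoder, using ResNet-50 (one of the candidate models listed above) truncated before global pooling so that it emits the last convolutional feature map:

```python
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(weights=None)  # ImageNet weights can be loaded instead
# drop the global-average-pool and fully-connected layers to keep the feature map
encoder = nn.Sequential(*list(backbone.children())[:-2])

x = torch.randn(1, 3, 224, 224)  # N x C x H x W, the preset 224 x 224 x 3 input
feat = encoder(x)                # N x C' x H' x W'
print(feat.shape)                # torch.Size([1, 2048, 7, 7])
```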
  • Step S1420 Perform spatial mapping on the facial feature map to obtain parameter coefficients in the bilinear modeling layer
  • the above face feature map is spatially mapped to obtain the parameter coefficients of the three-dimensional face model and the related parameter coefficients for three-dimensional rendering and two-dimensional projection.
  • the space mapping includes semantic space mapping and parameter space mapping.
  • the semantic space mapping maps the face feature map into a face feature vector.
  • the face feature vector contains all the depth semantic information in the face image; it is a comprehensive representation of the face identity semantic information, expression semantic information, texture semantic information, illumination semantic information, posture semantic information, and transformation semantic information.
  • the parameter space mapping maps the face feature vector to the corresponding parameter subspace, thereby obtaining the coefficients of its corresponding parameters.
  • the parameter space includes a face identity parameter space, an expression parameter space, a texture parameter space, an illumination parameter space, a posture parameter space, and a transformation parameter space.
  • the facial feature map is processed through the above-mentioned semantic space mapping and parameter space mapping to obtain identity coefficients, expression coefficients, texture coefficients, illumination coefficients, posture coefficients, and transformation coefficients.
  • identity coefficient and expression coefficient are used to reconstruct the three-dimensional face model of the face area image;
  • texture coefficient, lighting coefficient, posture coefficient, and transformation coefficient are used for three-dimensional rendering and two-dimensional projection.
  • the parameter mapping layer of the three-dimensional face reconstruction network first extracts the face feature map from the face area image, maps it into the semantic space to extract its semantic feature vector, and then maps that vector into the different parameter spaces to obtain the coefficients of each parameter space; this makes full use of the identity information, expression information, texture information, lighting information, posture information, and transformation information in the face area image, achieving integrated modeling of three-dimensional face reconstruction and rendering projection without introducing additional information.
  • the spatial mapping is performed on the facial feature map to obtain the parameter coefficients in the bilinear modeling layer, including:
  • Step S1421 Perform semantic space mapping on the facial feature map to obtain a facial feature vector
  • the face feature map has size N×C′×H′×W′, where N represents the number of samples, C′ the number of features, and H′×W′ the feature map size.
  • Semantic space mapping is performed on the facial feature map x.
  • F_g(x) contains rich information describing the characteristics of the face, including identity information, shape information, texture information, lighting information, posture information, and transformation information.
  • after the semantic space mapping, F_g(x) is a feature vector, namely the face feature vector, represented as x′[N, C′].
  • Step S1422 Perform parameter space mapping on the facial feature vector to obtain parameter coefficients in the bilinear modeling layer.
  • a corresponding number of parameter space mapping layers are designed to map the facial feature vectors into corresponding parameter subspaces for optimization, and obtain coefficients of corresponding parameters.
  • F_all(x) = {α_id(F_g(x)), α_exp(F_g(x)), α_texture(F_g(x)), α_light(F_g(x)), α_pose(F_g(x)), α_transition(F_g(x))}
  • α_id represents the learning of identity coefficients.
  • the same person should have similar coefficient representations, and different people should have different coefficient representations.
  • its parameter size can be described as [C′, 79]; α_exp represents the learning of expression coefficients.
  • people with the same expression, such as closed eyes, an open mouth, or curled lips, should have similar coefficients, and people with different expressions should have different coefficients; for example, closed eyes and open eyes should differ in the corresponding shape.
  • its parameter size can be described as [C′, 46]; α_texture represents the learning of texture coefficients, which are used to model real textures, with parameters described as [C′, 79].
  • α_light is used to estimate the current facial illumination; its parameters are described as [C′, 27], representing the coefficients of 27 spherical harmonic basis functions.
  • α_pose is used to estimate the pose of the human face and contains three sub-parameters: yaw, pitch, and roll.
  • α_transition is used to estimate the transformation of the three-dimensional face in space, so it contains transformation coefficients for the three axes x, y, and z.
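  • The following sketch shows one plausible realization of these mappings: the face feature map is pooled into a C′-dimensional face feature vector (a simple form of the semantic space mapping) and then projected into each parameter subspace by one linear layer per coefficient group. The class name and pooling choice are assumptions, but the output dimensions follow the sizes given above:

```python
import torch
import torch.nn as nn

class ParameterMapping(nn.Module):
    """Semantic space mapping (pooling) followed by per-subspace linear heads."""
    def __init__(self, c_feat: int = 2048):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # N x C' x H' x W' -> N x C' x 1 x 1
        self.heads = nn.ModuleDict({
            'identity':   nn.Linear(c_feat, 79),  # alpha_id
            'expression': nn.Linear(c_feat, 46),  # alpha_exp
            'texture':    nn.Linear(c_feat, 79),  # alpha_texture
            'light':      nn.Linear(c_feat, 27),  # alpha_light, 27 SH coefficients
            'pose':       nn.Linear(c_feat, 3),   # alpha_pose: yaw, pitch, roll
            'transition': nn.Linear(c_feat, 3),   # alpha_transition: x, y, z
        })

    def forward(self, feat_map: torch.Tensor) -> dict:
        vec = self.pool(feat_map).flatten(1)  # face feature vector x'[N, C']
        return {name: head(vec) for name, head in self.heads.items()}
```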
  • the decoupled modeling based on the bilinear modeling layer in the three-dimensional face reconstruction network can model identity information and expression information separately, which supports the scenario application of expression migration and drives the development of expression generation applications in related industries.
  • the spatial mapping in the parameter mapping layer is used to relate the face area image to the three-dimensional face model parameters and rendering projection parameters, making full use of the feature information of the input face area image and providing a more convenient and effective way to acquire the parameter coefficients.
  • the input of the three-dimensional face reconstruction network of this application is a face region image, and its output is a three-dimensional face model.
  • a framework corresponding to the weakly supervised learning mechanism is constructed for the three-dimensional face reconstruction network, and the training of the three-dimensional face reconstruction network is completed.
  • Figure 8 shows a schematic diagram of the principle of the framework corresponding to the weakly supervised learning mechanism used to train the three-dimensional face reconstruction network of the present application.
  • the three-dimensional face reconstruction network is trained according to this framework; therefore, based on any of the above embodiments, please refer to Figure 7.
  • the training process of the three-dimensional face reconstruction network includes:
  • Step S2100 Obtain a single sample of the preprocessed face image data
  • the face image data refers to image data containing human face parts. This type of image data can be obtained through authorized live broadcast, on-demand, and other legal channels. In one embodiment, it can be video stream data, whose storage formats can vary, including MP4, AVI, RMVB, x264, and so on; in another embodiment, it may also be image data.
  • the image data content may include indoor, outdoor, news media, sports and entertainment and other scenes, including natural scenes.
  • the data storage format of the image data is inconsistent due to various data sources, including RGB24, YUV444, YUV420 and other formats.
  • the data storage formats are unified.
  • image data from different sources can be converted into a unified YUV420 format.
  • image data from different sources can also be converted into a unified RGB24 format, or YUV444 format, or others.
  • the above preprocessing is applied to both the training and the application of the relevant technical methods in this application, unifying the various data formats into one to improve the efficiency of the technical application without affecting its performance.
  • one face image with a face part is extracted as a single sample for subsequent processing.
  • Step S2200 Obtain the face area image, face key points and three-dimensional face model posture coefficients in the single sample;
  • the face area image, face key points and three-dimensional face model posture coefficients are extracted from the single sample in the same manner as in step S1200 above.
  • the face detection model pre-trained to the convergence state is used to detect the single sample, obtaining the face rectangular frame information and, further, the face area image; the face key point detection model pre-trained to the convergence state is then applied to the face area image to obtain the face key point information; the face area image S_roi and the face key point information L_n are aligned according to the standard alignment parameters; finally, the Hough transform is applied to the face key points to obtain the three-dimensional face pose information Y_pose.
  • the face region image is used as the input of the three-dimensional face reconstruction network, and the face key points and three-dimensional face pose information are used to calculate the loss value.
  • Step S2300 Use the three-dimensional face reconstruction network to reconstruct and obtain a three-dimensional face model of the face area image, and obtain a face reconstruction image through rendering and projection into two dimensions;
  • the bilinear modeling layer of the three-dimensional face reconstruction network is used to perform decoupled modeling of identity information and expression information
  • the parameter mapping layer of the three-dimensional face reconstruction network is used to obtain the identity coefficient, expression coefficient, texture coefficient, lighting coefficient, posture coefficient, and transformation coefficient.
  • the identity coefficient and expression coefficient are used to reconstruct the three-dimensional face model of the face area image.
  • the three-dimensional rendering and two-dimensional projection of the three-dimensional face model include the following operations: the surface texture of the face is estimated, the face is assumed in advance to be a Lambertian surface, and spherical harmonics are used to approximate the scene lighting; combining the face surface normals with the skin texture α_texture(F_g(x)), the radiance of a vertex can be computed as C(n_i, t_i) = t_i · Σ_b γ_b Φ_b(n_i), where Φ_b represents the spherical harmonic basis functions, n_i the normal of vertex i, t_i its skin texture, and γ_b the illumination coefficients.
  • on this basis, the three-dimensional rendering of the three-dimensional face model can be completed; the camera system transformation of the face is then performed using the posture parameter α_pose(F_g(x)) and the transformation parameter α_transition(F_g(x)), combined with the camera perspective model, to translate and rotate the three-dimensional face so that it can be projected onto a two-dimensional plane, obtaining all projection points L_x of the face vertices, which can be expressed as [N_v, 2], where 2 represents the x, y plane coordinate information.
  • the face projection has thus completed the transformation from the world coordinate system to the pixel coordinate system and matches the positions of the standard face key points. At this point, the projection of the three-dimensional face model onto the two-dimensional plane is completed and the reconstructed face image is obtained.
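  • A NumPy sketch of the shading step, using the standard first nine real spherical harmonic basis functions; arranging the 27 illumination coefficients as 9 per RGB channel is an assumption consistent with the [C′, 27] description above:

```python
import numpy as np

def sh_basis(normals: np.ndarray) -> np.ndarray:
    """First 9 real spherical harmonic basis functions at unit normals (N_v x 3)."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2),
    ], axis=1)  # N_v x 9

def vertex_radiance(texture: np.ndarray, normals: np.ndarray,
                    light: np.ndarray) -> np.ndarray:
    """Lambertian radiance per vertex: albedo modulated by SH-approximated lighting.
    texture: N_v x 3 skin albedo; light: 27 SH coefficients, 9 per color channel."""
    shading = sh_basis(normals) @ light.reshape(3, 9).T  # N_v x 3
    return texture * shading
```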
  • Step S2400 Calculate a reconstruction loss value based on the face area image and the face reconstruction image, and update the parameters of the three-dimensional face reconstruction network based on the reconstruction loss value;
  • the three-dimensional reconstruction loss function is a weighted sum of four sub-loss functions: the first sub-loss function is a perceptual loss function, used to minimize the error between the face area image and the face reconstruction image;
  • the second sub-loss function is the photometric loss function, which is used to enhance the shape and pixel-level alignment between the face area image and the face reconstruction image;
  • the third sub-loss function is the posture loss function, used to ensure higher accuracy of the posture;
  • the fourth sub-loss function is the reprojection loss function, used to optimize the accuracy of the projection point.
  • the weighted sum of the above sub-loss values is the reconstruction loss value of the three-dimensional face reconstruction network under the current iteration number, that is, the error L(x).
  • the relevant weights can be updated according to the back propagation mechanism of the neural network.
  • the updated weights are mainly those of the spatial mapping in the parameter mapping layer of the three-dimensional face reconstruction network, that is, the semantic space mapping component and the parameter space mapping component.
  • the direction of the weight update is a direction that makes the error L(x) smaller.
  • Step S2500 Repeat the above operations until the preset termination condition is triggered to end the training, and obtain the three-dimensional face reconstruction network.
  • training can be terminated once the preset termination condition is reached, indicating that training has converged.
  • the preset termination condition can be set by relevant technical personnel according to actual application scenario requirements. In one embodiment, it can be a constraint on the number of iterations, that is, training terminates when the number of training iterations reaches a preset number; in another embodiment, it can be a loss value constraint, that is, training terminates when the reconstruction loss value reaches the preset minimum during iterative training.
  • the weakly supervised learning mechanism based on a single face image can construct training data in large quantities at low cost, thereby effectively reducing the acquisition cost and labeling cost of training samples, which is beneficial to the rapid research and development of related technologies.
  • this method can decouple and obtain facial expression models for expression migration applications, such as film and television, animation, digital humans and other related fields, which has great practical application value and commercial value.
  • Calculating the reconstruction loss value based on the aligned face area image and the reconstructed face image includes:
  • Step S2410 Calculate a first loss value, which is used to minimize the error between the face area image and the face reconstruction image;
  • the first loss value is calculated based on depth perception of the face area image and the face reconstruction image; that is, a neural network with mature perceptual capability is used to extract the semantic features of the two images in advance, and the loss value is then calculated from these semantic features.
  • self-supervised modeling is first performed on the reconstructed face image.
  • a face recognition network pre-trained to a converged state is introduced to extract the top-level depth features of the reconstructed face image and the face region image.
  • the face recognition network can use mature neural network models in related technologies, and face recognition models such as VGGNet, FaceNet, and ArcFaceNet can be used for self-supervised training.
  • in one embodiment, the ArcFaceNet network can be used, which gives better results.
  • the perceptual loss function can be expressed as the cosine distance L_percep = 1 - cos(f, f′), where f and f′ are the top-level face recognition features of the face area image and of the face reconstruction image respectively.
  • the above similarity loss function is used to constrain the network model so that the reconstructed face is close to the real face, optimizing the surface texture features and lighting parameters.
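  • A minimal PyTorch sketch of this loss, using the cosine-distance form given above on the top-level features produced by the face recognition network:

```python
import torch
import torch.nn.functional as F

def perceptual_loss(feat_real: torch.Tensor, feat_recon: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between face recognition features of the face
    area image and the face reconstruction image, averaged over the batch."""
    cos = F.cosine_similarity(feat_real, feat_recon, dim=1)  # shape (N,)
    return (1.0 - cos).mean()
```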
  • Step S2420 Calculate a second loss value, the second loss value is used to enhance the shape and pixel level alignment between the face area image and the face reconstruction image;
  • the first loss value implicitly constrains the approximate relationship of the face feature layer.
  • a second loss value is added to strengthen the shape and pixel-level alignment between the face region image and the face reconstruction image, which can be expressed as the photometric error L_photo = ‖I - I′‖₂ averaged over the pixels of the face region, where I and I′ denote the face area image and the face reconstruction image respectively.
  • this is a strong pixel-level constraint; therefore, in one embodiment, a smaller weight w_photo is assigned to it to prevent the network from falling into a local solution.
  • Step S2430 Calculate a third loss value.
  • the third loss value is used to ensure that the posture has higher accuracy
  • the first loss value only implicitly constrains and optimizes the pose.
  • to obtain an explicit constraint, the third loss value is calculated as the error between the predicted and the extracted pose coefficients, L_pose = ‖α_pose(F_g(x)) - Y_pose‖₂.
  • α_pose(F_g(x)) ∈ R³ is the posture coefficient obtained in the forward inference of the three-dimensional face reconstruction network, including the roll angle, pitch angle, and rotation angle; Y_pose ∈ R³ is the posture coefficient of the three-dimensional face model obtained in step S2200, also including the roll angle, pitch angle, and rotation angle.
  • Step S2440 Calculate a fourth loss value, which is used to optimize the accuracy of projection points in two-dimensional projection;
  • the fourth loss value can also be used for model constraints.
  • the reprojection error constraint is constructed from the face key point data extracted from the sample and the reprojected points obtained after three-dimensional face reconstruction, rendering, and two-dimensional projection.
  • the number of mesh vertices used for reprojection is consistent with the number of detected two-dimensional face key points.
  • Step S2450 Calculate a reconstruction loss value, which is a weighted fusion of the first loss value, the second loss value, the third loss value, and the fourth loss value.
  • weighted fusion is performed over the four sub-loss functions constructed in the above steps: L(x) = w_percep · L_percep + w_photo · L_photo + w_pose · L_pose + w_proj · L_proj.
  • w_percep, w_pose, and w_proj represent the weights of the first, third, and fourth loss values respectively, and w_photo the weight of the second.
  • the weighted fusion of the first, second, third, and fourth loss values into the reconstruction loss can more comprehensively constrain the three-dimensional face reconstruction network so that all the parameters it produces approach the real label values; at the same time, loss calculation and parameter updating based on single samples can accelerate convergence and save training costs.
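  • A minimal sketch of the weighted fusion; the weight values are illustrative, with only w_photo deliberately kept small as noted above:

```python
def reconstruction_loss(l_percep, l_photo, l_pose, l_proj,
                        w_percep=1.0, w_photo=0.2, w_pose=1.0, w_proj=1.0):
    """Weighted fusion of the four sub-losses into the reconstruction loss L(x)."""
    return (w_percep * l_percep + w_photo * l_photo
            + w_pose * l_pose + w_proj * l_proj)
```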
  • a three-dimensional face reconstruction device provided according to one aspect of the present application includes, in one embodiment, an image acquisition module 1100, a face detection module 1200, a face modeling module 1300, and a parameter mapping module 1400.
  • the image acquisition module 1100 is configured to acquire face image data and extract face images therein;
  • the face detection module 1200 is configured to perform key point detection on the face image to obtain a face region image of the area where the key points of the face are located;
  • the face modeling module 1300 is configured to use the bilinear modeling layer of a three-dimensional face reconstruction network pre-trained to a converged state to perform bilinear modeling of facial identity and facial expression on the face region image, obtaining a parameterized three-dimensional face model; the parameter mapping module 1400 is configured to use the parameter mapping layer of the three-dimensional face reconstruction network to map the face region image to the corresponding parameter coefficients in the parameterized three-dimensional face model, where the parameter coefficients include the identity coefficient corresponding to the facial identity and the expression coefficient corresponding to the facial expression.
  • the parameter mapping module 1400 includes: a coefficient acquisition unit configured to obtain the target parameter coefficients required to constitute the parameterized three-dimensional face model, where the target parameter coefficients include a pre-specified identity coefficient and a pre-specified expression coefficient; an expression migration unit configured to migrate the target parameter coefficients into the three-dimensional face model of the corresponding digital person to obtain the three-dimensional face model of the digital person; and a rendering projection unit configured to render and project the three-dimensional face model of the digital person into the two-dimensional image space to obtain the digital person image.
  • the face detection module 1200 includes: a face detection unit configured to detect face key points in the face image to obtain the face area image and face key point information; a standard alignment unit configured to align the face key points with standard face key points to obtain standard alignment parameters, where the standard face key points are the corresponding face key points obtained by two-dimensional projection of a standard three-dimensional face model; and a face alignment unit configured to align the face area image according to the standard alignment parameters.
  • the parameter mapping module 1400 further includes: a feature encoding unit configured to use the encoder in the three-dimensional face reconstruction network to perform feature extraction on the face area image to obtain a face feature map; and a spatial mapping unit configured to perform spatial mapping on the face feature map to obtain the parameter coefficients in the bilinear modeling layer.
  • the spatial mapping unit includes: a semantic space mapping subunit configured to perform semantic space mapping on the face feature map to obtain a face feature vector; and a parameter space mapping subunit configured to perform parameter space mapping on the face feature vector to obtain the parameter coefficients in the bilinear modeling layer.
  • the network training module includes: a sample acquisition unit configured to acquire a single sample of the preprocessed face image data; a data acquisition unit configured to acquire the face region image, face key points, and three-dimensional face model posture coefficients in the single sample; a reconstruction image unit configured to use the three-dimensional face reconstruction network to reconstruct the three-dimensional face model of the face area image and render and project it into two dimensions to obtain the face reconstruction image; a loss optimization unit configured to calculate a reconstruction loss value from the face area image and the face reconstruction image and update the parameters of the three-dimensional face reconstruction network according to the reconstruction loss value; and a training repetition unit configured to repeat the above operations until the preset termination condition is triggered and training ends, obtaining the three-dimensional face reconstruction network.
  • the loss optimization unit includes: a first loss subunit configured to calculate a first loss value, used to minimize the error between the face region image and the face reconstruction image; a second loss subunit configured to calculate a second loss value, used to enhance the shape and pixel-level alignment between the face region image and the face reconstruction image; a third loss subunit configured to calculate a third loss value, used to ensure that the posture has higher accuracy; a fourth loss subunit configured to calculate a fourth loss value, used to optimize the accuracy of the projection points in the two-dimensional projection; and a loss fusion subunit configured to calculate the reconstruction loss value as the weighted fusion of the first, second, third, and fourth loss values.
  • FIG. 11 shows a schematic diagram of the internal structure of the three-dimensional face reconstruction device.
  • the three-dimensional face reconstruction device includes a processor, a computer-readable storage medium, a memory and a network interface connected through a system bus.
  • the non-volatile readable storage medium of the three-dimensional face reconstruction device stores an operating system, a database, and computer-readable instructions; the database can store information sequences, and when the computer-readable instructions are executed by the processor, the processor is enabled to implement a three-dimensional face reconstruction method.
  • the processor of the three-dimensional face reconstruction device is used to provide computing and control capabilities to support the operation of the entire three-dimensional face reconstruction device.
  • Computer-readable instructions may be stored in the memory of the three-dimensional face reconstruction device. When executed by the processor, the computer-readable instructions may cause the processor to execute the three-dimensional face reconstruction method of the present application.
  • the network interface of the three-dimensional face reconstruction device is used to connect and communicate with the terminal.
  • FIG. 11 is only a block diagram of part of the structure related to the solution of the present application; a specific three-dimensional face reconstruction device may include more or fewer components than shown in the figure, may combine certain components, or may have a different arrangement of components.
  • the processor is used to execute the specific functions of each module in Figure 10, and the memory stores program codes and various types of data required to execute the above modules or sub-modules.
  • the network interface is used to realize data transmission between user terminals or servers.
  • the non-volatile readable storage medium in this embodiment stores the program codes and data required to execute all the modules in the three-dimensional face reconstruction device of the present application, and the server can call these program codes and data to execute the functions of all the modules.
  • This application also provides a non-volatile readable storage medium storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, they cause the one or more processors to execute the steps of the three-dimensional face reconstruction method of any embodiment of the present application.
  • the present application also provides a computer program product, which includes a computer program/instruction that implements the steps of the method described in any embodiment of the present application when executed by one or more processors.
  • the computer program can be stored in a non-volatile readable storage medium; when the program is executed, it may include the processes of the above-mentioned method embodiments.
  • the aforementioned storage media can be computer-readable storage media such as magnetic disks, optical disks, read-only memory (ROM), or random access memory (RAM).
  • In summary, this application achieves three-dimensional face reconstruction;
  • the three-dimensional face reconstruction method uses a bilinear modeling layer to decouple the identity information and the expression information in the face, thereby effectively separating the expression parameters;
  • realizing expression transfer in this way can greatly promote the application and development of related industries such as live streaming, film, and television;
  • the method is trained by weakly supervised learning on single images, which can greatly reduce the acquisition and labeling costs of training data and is conducive to large-scale application.

Abstract

A three-dimensional face reconstruction method, apparatus, and device, a medium, and a product. The method comprises: acquiring face image data, and extracting a face image in the face image data; then performing key point detection on the face image to obtain a face area image of an area where face key points are located; performing bilinear modeling of a face identity and a face expression on the face area image by using a bilinear modeling layer of a three-dimensional face reconstruction network pre-trained to a convergence state, so as to obtain a parameterized three-dimensional face model; and finally, mapping the face area image into corresponding parameter coefficients in the parameterized three-dimensional face model by using a parameter mapping layer of the three-dimensional face reconstruction network.

Description

Three-dimensional face reconstruction method, apparatus, device, medium, and product
This application claims priority to the Chinese patent application No. 202210969989.1, filed with the China Patent Office on August 12, 2022, the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the field of image processing technology, and for example, to a three-dimensional face reconstruction method and a corresponding apparatus, device, medium, and product.
Background
The evolution of basic network technology has driven the development of digital humans, virtual characters, and 3D avatars. Their application in fields such as film and television, games, and education has sharply increased the demand for three-dimensional virtual character generation technology, in which three-dimensional face reconstruction is particularly important.
Traditional three-dimensional face reconstruction methods are based on the 3DMM (3D Morphable Models) prior and rely on visual signals; deviations in the visual signals easily lead to weak generalization, so more samples are needed for training. In addition, expression transfer that relies on key points is prone to unnatural expressions and looks strongly unrealistic.
In summary, for the neural-network-based 3DMM of the related art to achieve good reconstruction results, rich and accurate training data is required, which means a high training cost; moreover, it is difficult to transfer expressions effectively, that is, a three-dimensional face image that accurately expresses the expression cannot be obtained.
Summary
This application provides a three-dimensional face reconstruction method and a corresponding apparatus, device, non-volatile readable storage medium, and computer program product.
According to one aspect of this application, a three-dimensional face reconstruction method is provided, including the following steps:
acquiring face image data and extracting a face image therein;
performing key point detection on the face image to obtain a face region image of the region where the face key points are located;
performing bilinear modeling of face identity and face expression on the face region image by using a bilinear modeling layer of a three-dimensional face reconstruction network pre-trained to a convergence state, to obtain a parameterized three-dimensional face model;
mapping the face region image into corresponding parameter coefficients of the parameterized three-dimensional face model by using a parameter mapping layer of the three-dimensional face reconstruction network, where the parameter coefficients include an identity coefficient corresponding to the face identity and an expression coefficient corresponding to the face expression.
According to another aspect of this application, a three-dimensional face reconstruction apparatus is provided, including:
an image acquisition module configured to acquire face image data and extract a face image therein;
a face detection module configured to perform key point detection on the face image and obtain a face region image of the region where the face key points are located;
a face modeling module configured to perform bilinear modeling of face identity and face expression on the face region image by using a bilinear modeling layer of a three-dimensional face reconstruction network pre-trained to a convergence state, to obtain a parameterized three-dimensional face model;
a parameter mapping module configured to map the face region image into corresponding parameter coefficients of the parameterized three-dimensional face model by using a parameter mapping layer of the three-dimensional face reconstruction network, where the parameter coefficients include an identity coefficient corresponding to the face identity and an expression coefficient corresponding to the face expression.
According to another aspect of this application, a three-dimensional face reconstruction device is provided, including a central processor and a memory, where the central processor is configured to call and run a computer program stored in the memory to execute the steps of the three-dimensional face reconstruction method described in this application.
According to another aspect of this application, a non-volatile readable storage medium is provided, which stores, in the form of computer-readable instructions, a computer program implemented according to the three-dimensional face reconstruction method; when the computer program is called and run by a computer, the steps included in the method are executed.
According to another aspect of this application, a computer program product is provided, including a computer program/instructions that, when executed by a processor, implement the steps of the method described in any embodiment of this application.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.
FIG. 1 is a schematic flowchart of an embodiment of the three-dimensional face reconstruction method of this application;
FIG. 2 is a schematic flowchart of an embodiment of an exemplary scenario application of the three-dimensional face reconstruction method of this application;
FIG. 3 is a schematic diagram of the result of expression transfer of a three-dimensional face model in an embodiment of this application;
FIG. 4 is a schematic flowchart of obtaining a face region image in an embodiment of this application;
FIG. 5 is a schematic diagram of the result of obtaining a three-dimensional face model in an embodiment of this application;
FIG. 6 is a schematic flowchart of parameter mapping for a face feature map in an embodiment of this application;
FIG. 7 is a schematic flowchart of training the three-dimensional face reconstruction network in an embodiment of this application;
FIG. 8 is a schematic diagram of the training framework used by the three-dimensional face reconstruction method in an embodiment of this application;
FIG. 9 is a schematic flowchart of the calculation of the reconstruction loss function in an embodiment of this application;
FIG. 10 is a functional block diagram of the three-dimensional face reconstruction apparatus of this application;
FIG. 11 is a schematic structural diagram of a three-dimensional face reconstruction device used in this application.
Detailed Description
The models cited or potentially cited in this application, including traditional machine learning models and deep learning models, unless expressly specified otherwise, can be deployed on a remote server and called remotely by a client, or deployed and called directly on a client whose device capability is sufficient. In some embodiments, when such a model runs on the client, its corresponding machine intelligence can be obtained through transfer learning, so as to reduce the requirements on, and avoid excessive occupation of, the client's hardware running resources.
Referring to FIG. 1, in one embodiment, the three-dimensional face reconstruction method provided by this application includes the following steps:
Step S1100: acquiring face image data and extracting a face image therein.
Face image data refers to image data containing a human face. Such image data can be obtained through authorized legal channels such as live streaming and video-on-demand, and can be either video stream data or image data.
In one embodiment, when a real person performs a live broadcast as a digital human, the image data of the real person needs to be collected in real time through a camera and then sent to a back-end server for further processing, which generates the digital human image, uses it to replace the real person in the image data, and finally outputs the image data carrying the digital human image to the display terminal devices facing the audience. In this embodiment, the collected image data of the real person can serve as the face image data.
In another embodiment, some film and television works need to replace real people with digital human images to produce works with a corresponding style. In this embodiment, video data that has already been shot can be stored on a server, and technicians capture the image data containing the target person, replace the person with the corresponding digital human image, and finally generate the corresponding image file. The image data containing the target person can serve as the face image data.
In yet another embodiment, some advertising posters need digital human images to attract the public. For this purpose, an image of a real person is first captured with a camera and then handed to technicians who generate a digital human image of the corresponding style to replace the real person in the image. In this embodiment, the image containing the real person is the face image data.
The above embodiments illustrate some exemplary application scenarios of the face image data. Accordingly, the face image data may be video stream data or image data. To meet the needs of this application, the face images in the face image data need to be further extracted: when the face image data is video stream data, each frame is extracted as a face image; when the face image data is image data, the face image data itself is the face image.
It is worth noting that the extracted face images need to be in a unified format, which can be YUV420, RGB24, YUV444, or another similar encoding format. A unified image data format keeps the interfaces of subsequent operations consistent, facilitating unified and fast processing. A minimal sketch of this extraction and unification step follows.
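As a minimal illustration of this step, the following sketch extracts every frame of a video stream and converts it to one unified format; it assumes OpenCV is available, and choosing RGB24 as the target format is an assumption made only for the example:

    import cv2  # assumed dependency, used for decoding and color conversion

    def extract_face_images(path):
        """Yield every frame of a video file as an RGB24 (8-bit RGB) array."""
        cap = cv2.VideoCapture(path)
        while True:
            ok, frame = cap.read()  # OpenCV decodes frames as BGR
            if not ok:
                break
            yield cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # unify to RGB24
        cap.release()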
Step S1200: performing key point detection on the face image to obtain a face region image of the region where the face key points are located.
After the face image is obtained, face detection and face key point detection are performed to obtain the face region image and the face key points in the face image. Optionally, a face detection model pre-trained to a convergence state is applied to the face image to obtain face target frame information, which includes the coordinate information of the upper-left and lower-right points of the face part. The image at the corresponding region position is cropped from the face image according to the face target frame information; this is the face region image, which eliminates the interference of redundant image information in non-face regions and focuses more closely on the face information. In one embodiment, a face key point detection model pre-trained to a convergence state is applied to the face region image to obtain face key point information. The face key points point to the face part in the face region image and can represent the positions of the key facial regions, such as the eyebrows, eyes, nose, mouth, and facial contour.
After the face region image and the face key points are obtained, a standard alignment operation is also required. In one embodiment, a preset standard three-dimensional face model can be projected onto a two-dimensional plane to obtain standard face key point information on the two-dimensional plane; the face key points are aligned and matched with the standard face key points to obtain standard transformation parameters, and the face region image is transformed according to the standard transformation parameters into a face region image with a standard size and angle.
Step S1300: performing bilinear modeling of face identity and face expression on the face region image by using the bilinear modeling layer of a three-dimensional face reconstruction network pre-trained to a convergence state, to obtain a parameterized three-dimensional face model.
The three-dimensional face reconstruction network has a two-layer structure. The first layer is the bilinear modeling layer, which, based on a parameterized three-dimensional face model, performs decoupled modeling of face identity and face expression for the face region image; the corresponding identity coefficient and expression coefficient remain to be determined. The second layer is the parameter mapping layer, which maps the face region image into the corresponding parameter coefficients of the parameterized three-dimensional face model, the parameter coefficients including an identity coefficient corresponding to the face identity and an expression coefficient corresponding to the face expression.
In the bilinear modeling layer, a parameterized face model is first determined as the three-dimensional face model to be optimized. In one embodiment, the parameterized face model can be the BFM (Basel Face Model), which is based on the 3DMM (3D Morphable Models) statistical model; according to the principle of 3DMM, each face is a superposition of a shape vector and a texture vector. In another embodiment, which is an exemplary application example of this application, a 3DMM based on a bilinear model is used as the parameterized face model, and its parameterized representation can be:
core_tensor = vertex * identity * expression
where vertex denotes the face mesh vertices, identity denotes the identity coefficient, expression denotes the expression coefficient, and core_tensor denotes the tensor representation of the mesh vertices of the three-dimensional face model.
Compared with the traditional 3DMM, the 3DMM based on the bilinear model decouples the identity information and the expression information of the face by multiplying their coefficients, which allows the identity coefficient and the expression coefficient to be applied separately, for example for expression transfer. In one embodiment, people of different identities with the same expression can be represented by different sets of identity coefficients and the same expression coefficient. In another embodiment, the same person with different expressions can be represented by the same set of identity coefficients and different expression coefficients.
To describe the modeling itself more specifically, the 3DMM based on the bilinear model defines the representation of the face as the core_tensor above, which is a weighted combination of all the three-dimensional face models in a preset three-dimensional face model library and can be expressed uniformly as:
U(α) = B0 + Bα
B0 = U0, B = [U1 - U0, U2 - U0, …, Um - U0]
where Ui, Bi ∈ R^(n×(l+1)), α ∈ R^(m×1), n is the number of basis vertices, l is the number of expressions, and m is the number of identities.
The corresponding mesh vertices mapped into three-dimensional space can then be expressed as:
f0 + fα
在当前实施例中,所述三维人脸模型数据库可由相关技术人员根据实际应 用场景和实际业务需求而设定,在示例性应用中,本申请预先构建一个数量为79的三维人脸模型数据库,有46类表情,亦即人脸模型中身份系数的向量维度为79,表情系数的向量维度为46。在其他的应用场景中,所述三维人脸模型数据库的数量、表情类型的数量、身份系数的向量维度和表情系数的向量维度都可以依据实际应用场景而调整,不影响所述方法的实际应用。In the current embodiment, the three-dimensional face model database can be used by relevant technical personnel according to actual applications. Set according to scenarios and actual business needs. In the exemplary application, this application pre-constructs a three-dimensional face model database with a number of 79, with 46 types of expressions, that is, the vector dimension of the identity coefficient in the face model is 79. The vector dimension of the expression coefficient is 46. In other application scenarios, the number of the three-dimensional face model database, the number of expression types, the vector dimensions of the identity coefficient and the vector dimension of the expression coefficient can be adjusted according to the actual application scenario, without affecting the actual application of the method. .
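Under the exemplary dimensions above (79 identities, 46 expressions), the bilinear combination core_tensor = vertex * identity * expression amounts to contracting a three-mode tensor with the two coefficient vectors. A minimal NumPy sketch, in which a randomly initialized core tensor and vertex count stand in for the real model library, is:

    import numpy as np

    n_vertices, n_id, n_exp = 1000, 79, 46              # exemplary sizes from the text
    core = np.random.rand(3 * n_vertices, n_id, n_exp)  # stand-in for the model library

    def bilinear_face(identity, expression):
        """Contract the core tensor with identity (79,) and expression (46,)
        coefficients to obtain the face mesh vertices of shape (n_vertices, 3)."""
        verts = np.einsum('vie,i,e->v', core, identity, expression)
        return verts.reshape(n_vertices, 3)

    mesh = bilinear_face(np.random.rand(n_id), np.random.rand(n_exp))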
Step S1400: mapping the face region image into corresponding parameter coefficients of the parameterized three-dimensional face model by using the parameter mapping layer of the three-dimensional face reconstruction network, where the parameter coefficients include an identity coefficient corresponding to the face identity and an expression coefficient corresponding to the face expression.
The parameter mapping layer is the second layer of the three-dimensional face reconstruction network and is used to map the face region image into the corresponding parameter coefficients of the parameterized three-dimensional face model.
The face region image contains all the information of the target face, such as the identity information that characterizes the face identity and the expression information that characterizes the face expression, so it is feasible to construct a mapping between the image and the identity coefficient and expression coefficient of the three-dimensional face model. In addition, texture parameters, illumination parameters, pose parameters, and transformation parameters can all be expressed in the face region image, so constructing corresponding mappings for these parameters is also feasible.
Therefore, a mapping relationship can be constructed between the face region image and the identity, expression, texture, illumination, pose, and transformation parameters, so that the identity coefficient, expression coefficient, texture coefficient, illumination coefficient, pose coefficient, and transformation coefficient can all be obtained from the face region image.
In one embodiment, the encoder of the three-dimensional face reconstruction network is first used to perform feature extraction on the face region image, obtaining the deep features of the face region image, referred to as the face feature map; then, spatial mapping is performed on the face feature map to obtain all the parameter coefficients, including the identity coefficient, expression coefficient, texture coefficient, illumination coefficient, pose coefficient, and transformation coefficient, where the identity coefficient and the expression coefficient are the parameter coefficients corresponding to the identity and expression parameters of the bilinear modeling layer.
All these coefficients can be called as needed for three-dimensional face reconstruction, obtaining the three-dimensional face model corresponding to the face region image, which can be output as the result of the reconstruction. In one embodiment, the parameter coefficients corresponding to each face image, including its identity coefficient and expression coefficient, can be stored independently for later use and freely combined to construct different three-dimensional face models, so as to obtain face images with different effects. For example, one identity coefficient can be combined with multiple expression coefficients to produce face images of the same person with different expressions, or one expression coefficient can be combined with multiple identity coefficients to produce face images of different people with the same expression. In another embodiment, after the parameter mapping layer of the three-dimensional face reconstruction network maps the face region image into the corresponding parameter coefficients of the parameterized three-dimensional face model, the method includes:
performing three-dimensional reconstruction according to the parameter coefficients to obtain the three-dimensional face model of the face region image.
In one embodiment, the identity coefficient and the expression coefficient among the parameter coefficients are used to construct the corresponding three-dimensional face model. Thus, after a face region image is processed by the above procedure to obtain the parameterized three-dimensional face model together with the identity coefficient and the expression coefficient, a three-dimensional face model that effectively reflects the identity information and the expression information of the face region image is obtained.
In one embodiment, after the three-dimensional face model is reconstructed, its mesh representation needs to be further determined to complete the reconstruction of the corresponding face in three-dimensional space. To model the three-dimensional face mesh, first define T as [V, I79, E46], where V is the vertex mesh, I is the identity coefficient, and E is the expression coefficient; the three-dimensional face mesh can then be expressed as:
Vx = T × σexp(Fg(x)) × σid(Fg(x))
where Vx can be expressed as [Nv, 3], Nv is the number of vertices of the three-dimensional mesh, 3 corresponds to the x, y, z spatial coordinates, σexp(Fg(x)) is the expression coefficient output by the parameter mapping layer of the three-dimensional face reconstruction network, and σid(Fg(x)) is the identity coefficient output by the parameter mapping layer of the three-dimensional face reconstruction network.
It is worth noting that different faces usually have the same number of three-dimensional mesh vertices.
Compared with the related art, after obtaining the face region image of the region where the face key points are located, this application uses the bilinear modeling layer of a three-dimensional face reconstruction network pre-trained to a convergence state to perform bilinear modeling of identity information and expression information on the face region image, obtaining a parameterized three-dimensional face model; the parameter mapping layer of the three-dimensional face reconstruction network then maps the face region image into the corresponding parameter coefficients of the parameterized three-dimensional face model, completing the reconstruction of the three-dimensional face model. The method uses the bilinear modeling layer to decouple the identity information and the expression information of the face, thereby effectively separating the expression parameters and realizing expression transfer, which can greatly promote the application and development of related industries such as live streaming, film and television, and animation. Moreover, the three-dimensional face reconstruction network is suitable for training by weakly supervised learning on single images, which can greatly reduce the acquisition and labeling costs of training data and facilitates large-scale application.
On the basis of any of the above embodiments, referring to FIG. 2, after the parameter mapping layer of the three-dimensional face reconstruction network maps the face region image into the corresponding parameter coefficients of the parameterized three-dimensional face model, the method further includes:
Step S1500: obtaining the target parameter coefficients required to constitute the parameterized three-dimensional face model, where the target parameter coefficients include a pre-specified identity coefficient and a pre-specified expression coefficient.
The parameterized three-dimensional face model is constructed in the bilinear modeling layer of the three-dimensional face reconstruction network, and its undetermined parameter coefficients are the identity coefficient and the expression coefficient. In an exemplary application of this application, the vector dimension of the identity coefficient is 79 and the vector dimension of the expression coefficient is 46. Once the pre-specified identity coefficient and the pre-specified expression coefficient are determined, the parameter coefficients of the parameterized three-dimensional face model are determined; that is, the reconstruction of the three-dimensional face model corresponding to the face region image is completed.
Step S1600: transferring the target parameter coefficients into the three-dimensional face model of the corresponding digital human to obtain the three-dimensional face model of the digital human.
The previous step completes the reconstruction of the three-dimensional face model of the face region image, but practical application scenarios tend to apply its digital image. In one embodiment, a digital human replaces the face part in the face region image, so that a "real person" is replaced with a "digital human" for live streaming, communication, and other activities. In this scenario, the real-time emotion simulation of the "digital human" becomes an urgent problem. One solution is to transfer the real expression of the "real person" to the "digital human" so that it can express the emotion of the "real person" synchronously. Therefore, in one embodiment, the bilinear modeling layer constructed in this application decouples the expression information, so that the expression coefficient of the three-dimensional face model of the "real person" can be transferred into the three-dimensional face model of the "digital human", completing the expression transfer from the "real person" to the "digital human".
In practical application scenarios, to realize expression transfer from the "real person" to the "digital human", the number of identities and the number of expressions, that is, the vector dimensions of the identity coefficient and the expression coefficient, should be kept consistent. As shown in FIG. 3, on this basis, the expression coefficient corresponding to the "real person" can directly replace the expression coefficient of the three-dimensional face model of the "digital human" while the other parameters remain unchanged, yielding the three-dimensional face model of the digital human after expression transfer. A minimal sketch of this coefficient swap follows.
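Because the bilinear model keeps the two coefficient sets separate, the transfer itself reduces to a coefficient swap. A minimal sketch, treating the coefficient vectors as plain arrays and reusing a bilinear evaluation function such as the earlier bilinear_face sketch (both function names are illustrative, not names fixed by this application):

    # Expression transfer sketch: the digital human keeps its own identity
    # coefficients; only the real person's expression coefficients move over.
    def transfer_expression(face_model, digital_identity, real_expression):
        """Rebuild the digital human's mesh with the real person's expression."""
        return face_model(digital_identity, real_expression)

    # Hypothetical usage:
    # transferred_mesh = transfer_expression(bilinear_face, digital_id, real_exp)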
Step S1700: rendering and projecting the three-dimensional face model of the digital human into the two-dimensional image space to obtain a digital human image.
After the three-dimensional face model of the "digital human" is obtained in the previous step, three-dimensional rendering and projection into the two-dimensional image space are performed according to the illumination coefficient, pose coefficient, and transformation coefficient obtained in step S1400 together with the texture coefficient of the "digital human" itself, obtaining the image of the "digital human", that is, completing the expression transfer from the face region image to the "digital human" image. In one embodiment, in the video stream of a live streaming platform, the face region image of each single-frame face image is obtained and replaced with the "digital human" image, so that a synchronous live broadcast of the "digital human" can be performed. This is only one scenario of the expression-transfer functionality of the method; it can also be used in other scenarios.
According to the above embodiments, from the reconstruction of the three-dimensional face model of a real person to the expression transfer of a digital human, the decoupled modeling of identity information and expression information by the method can bring great application value to industries such as live streaming, film and television, and digital avatars, and its expression transfer does not affect the other face information.
On the basis of any of the above embodiments, referring to FIG. 4, performing key point detection on the face image to obtain a face region image of the region where the face key points are located includes:
Step S1210: performing face key point detection on the face image to obtain the face region image and the face key point information.
A face detection model pre-trained to a convergence state performs face detection on the face image to obtain the face rectangular frame information in the face image. The face rectangular frame calibrates the position and size of the face part in the face image, and the calibration result can be represented by a set of four coordinate elements, such as Sroi. Thereafter, the corresponding region image is cropped from the face image according to the set, obtaining the face region image. The face region image completely contains the face part and removes the redundant non-face regions of the face image:
Sroi = {x1, y1, x2, y2}
where x1 and y1 are the pixel coordinates of the upper-left corner of the detected face part, and x2 and y2 are the pixel coordinates of its lower-right corner.
A face key point detection model pre-trained to a convergence state then detects the face region image obtained above to acquire the face key point information. The face key points can represent the positions of the key facial regions, such as the eyebrows, eyes, nose, mouth, and facial contour. All the face key points can be represented as a set of points Ln, where n is the number of face key points; n can be set by technicians according to actual needs and can be 5, 30, 68, 106, 240, and so on:
Ln = {(x1, y1), (x2, y2), …, (xn, yn)}
The face detection model and the face key point detection model are implemented as neural network models; in practical applications, relatively excellent face detection models and face key point detection models from the related art can be used. A sketch of how their outputs are typically consumed follows.
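As an illustration of how these detector outputs are consumed downstream (face_detector and landmark_detector are hypothetical stand-ins; this application does not fix their implementations), the box Sroi crops the face region while the key points Ln are retained for the later alignment:

    def crop_face_region(image, s_roi):
        """Crop the face region from an H x W x C image array,
        given s_roi = (x1, y1, x2, y2) as in the formula above."""
        x1, y1, x2, y2 = s_roi
        return image[y1:y2, x1:x2]

    # Hypothetical usage with stand-in detectors:
    # s_roi = face_detector(face_image)          # -> (x1, y1, x2, y2)
    # face_region = crop_face_region(face_image, s_roi)
    # l_n = landmark_detector(face_region)       # -> n points (xi, yi)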
Step S1220: aligning the face key points with standard face key points to obtain standard alignment parameters, where the standard face key points are the corresponding face key points obtained by two-dimensional projection of a standard three-dimensional face model.
Because of the diversity of real scenes, the face contours in the face region images vary in angle and size, which easily interferes with the subsequent calibration of the three-dimensional face parameters. Therefore, the face region image needs to be standard-aligned.
After the face key points of the face region image are obtained, face key points are likewise detected in the standard face image obtained by projecting the standard three-dimensional face model onto the two-dimensional plane, yielding the standard face key points. The standard three-dimensional face model can be preset by technicians. Taking the relative positions, scales, and angles of the standard face key points as the reference, the face key points detected in the face region image are aligned to obtain the corresponding standard transformation parameters. The alignment can use any minimization method such as PnP or least squares; one embodiment of this application uses the PnP method. The standard transformation parameters include translation transformation parameters and scale transformation parameters. A least-squares sketch of such an alignment follows.
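The following least-squares sketch recovers only the scale and translation named in the text (a simplifying assumption; the PnP method used in one embodiment additionally recovers rotation). points and std_points are assumed to be NumPy arrays of matching shape (n, 2):

    import numpy as np

    def align_to_standard(points, std_points):
        """Solve scale s and translation t minimizing ||s*points + t - std_points||^2."""
        mu, mu_std = points.mean(axis=0), std_points.mean(axis=0)
        p, q = points - mu, std_points - mu_std
        s = (p * q).sum() / (p * p).sum()   # closed-form least-squares scale
        t = mu_std - s * mu                 # translation follows from the means
        return s, t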
Step S1230: aligning the face region image according to the standard alignment parameters.
According to the standard transformation parameters, a standard transformation is applied to the face region image Sroi and the face key points Ln. After the transformation, the face region image is resized to a preset size, which in one embodiment of this application is 224×224×3. Through the above operations, the aligned face region image is obtained.
It is worth noting that, in one embodiment, after the face key points undergo the standard transformation, the pose information of the three-dimensional face model corresponding to the face region image can be solved through the Hough transform. The pose information of the three-dimensional face model includes the pitch angle, the roll angle, and the rotation angle.
According to the above embodiments, performing face detection and face key point detection on the object to be processed and then applying the standard transformation can eliminate the interference caused by position offsets and scale deviations, as well as the subsequent interference of redundant information from the non-face regions.
On the basis of any of the above embodiments, referring to FIG. 5, mapping the face region image into the corresponding parameter coefficients of the parameterized three-dimensional face model by using the parameter mapping layer of the three-dimensional face reconstruction network includes:
Step S1410: performing feature extraction on the face region image by using the encoder of the three-dimensional face reconstruction network to obtain a face feature map.
After the bilinear modeling layer of the three-dimensional face reconstruction network determines the parameterized three-dimensional face model, an encoder pre-trained to convergence performs feature extraction on the face region image obtained in step S1200 to obtain the face feature map. The face feature map reduces the interference of the redundant information of the non-face regions of the face image, so that the semantic information of the face part is better extracted.
The encoder is implemented as a neural network model, which can be chosen from a variety of relatively excellent feature extraction models of the related art, including the VGG16, VGG19, InceptionV3, Xception, MobileNet, AlexNet, LeNet, ZF_Net, ResNet18, ResNet34, ResNet_50, ResNet_101, and ResNet_152 models, all of which are mature feature extraction models. The feature extraction model is a neural network model that has been trained to convergence; in one embodiment, it is trained to convergence on the large-scale ImageNet dataset.
The output of the encoder is set to a feature map. In one embodiment of this application, the encoder directly outputs the feature map of the last convolutional layer, referred to as the face feature map. The input size of the encoder is defined as N×C×H×W and the output size as N×C'×H'×W', where N is the number of samples, C is the number of channels, H×W is the preset image size, C' is the number of features, and H'×W' is the feature map size. A sketch of such an encoder follows.
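A sketch of such an encoder, assuming PyTorch and a torchvision ResNet-50 backbone (one of the candidate models listed above) truncated before its pooling and classification head; the specific backbone choice is an assumption for the example:

    import torch
    from torchvision import models

    # ImageNet-pretrained ResNet-50, truncated to the last convolutional stage,
    # so an N x 3 x 224 x 224 input yields an N x 2048 x 7 x 7 feature map.
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    encoder = torch.nn.Sequential(*list(backbone.children())[:-2])

    x = torch.randn(1, 3, 224, 224)   # an aligned face region image batch
    feature_map = encoder(x)          # torch.Size([1, 2048, 7, 7])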
Step S1420: performing spatial mapping on the face feature map to obtain the parameter coefficients of the bilinear modeling layer.
This step spatially maps the above face feature map to obtain the parameter coefficients of the three-dimensional face model as well as the related parameter coefficients used for three-dimensional rendering and two-dimensional projection.
It should be noted that the spatial mapping includes semantic space mapping and parameter space mapping. The semantic space mapping maps the face feature map into a face feature vector, which contains all the deep semantic information of the face image and is a comprehensive representation of the face identity semantics, expression semantics, texture semantics, illumination semantics, pose semantics, and transformation semantics. The parameter space mapping maps the face feature vector into the corresponding parameter subspaces to obtain the coefficients of the corresponding parameters; the parameter spaces include the face identity parameter space, expression parameter space, texture parameter space, illumination parameter space, pose parameter space, and transformation parameter space.
Passing the face feature map through the above semantic space mapping and parameter space mapping yields the identity coefficient, expression coefficient, texture coefficient, illumination coefficient, pose coefficient, and transformation coefficient. The identity coefficient and the expression coefficient are used to reconstruct the three-dimensional face model of the face region image; the texture coefficient, illumination coefficient, pose coefficient, and transformation coefficient are used for three-dimensional rendering and two-dimensional projection.
From the above embodiments it is easy to see that the parameter mapping layer of the three-dimensional face reconstruction network first extracts the face feature map from the face region image, then maps it into the semantic space to extract the semantic feature vector, and finally maps the vector into the different parameter spaces to obtain the coefficients of the corresponding parameter spaces. This makes full use of the identity, expression, texture, illumination, pose, and transformation information in the face region image without introducing any additional information, achieving integrated modeling of three-dimensional face reconstruction and rendering projection.
On the basis of any of the above embodiments, referring to FIG. 6, performing spatial mapping on the face feature map to obtain the parameter coefficients of the bilinear modeling layer includes:
Step S1421: performing semantic space mapping on the face feature map to obtain a face feature vector.
The face feature map has size N×C'×H'×W', where N is the number of samples, C' is the number of features, and H'×W' is the feature map size.
Semantic space mapping is performed on the face feature map x; in one embodiment, global pooling is used:
Fg(x) = global_pooling(x) = x'[N, C']
This Fg(x) contains rich information describing the characteristics of the face, including identity information, shape information, texture information, illumination information, pose information, and transformation information.
The Fg(x) obtained by the semantic space mapping is a feature vector, namely the face feature vector, denoted x'[N, C'].
Step S1422: performing parameter space mapping on the face feature vector to obtain the parameter coefficients of the bilinear modeling layer.
In one embodiment, a corresponding number of parameter space mapping layers are designed to map the face feature vector into the corresponding parameter subspaces for optimization, obtaining the coefficients of the corresponding parameters.
This can be expressed as:
Fall(x) = {σid(Fg(x)), σexp(Fg(x)), σtexture(Fg(x)), σlight(Fg(x)), σpose(Fg(x)), σtransition(Fg(x))}
where σ(x) denotes a learnable mapping function σ(x) = Wx + b, in which W is a learnable weight and b is a learnable bias, both differing across the parameter subspaces according to their mapping relationships. Here, σid denotes the learning of the identity coefficient: the same person should have similar coefficient representations and different people should have different ones, with parameter size [C', 79]. σexp denotes the learning of the expression coefficient: people with the same expression, such as closed eyes, an open mouth, or pursed lips, should have similar coefficients, while different expressions should have different coefficients (for example, closed eyes and open eyes should differ in the corresponding shapes), with parameter size [C', 46]. σtexture denotes the learning of the texture coefficient, which models the real texture, with parameters [C', 79]. σlight is used to estimate the current facial illumination, with parameters [C', 27], representing the basis coefficients of 27 spherical harmonics. σpose is used to estimate the pose of the face and contains three sub-parameters, yaw, pitch, and roll. σtransition is used to estimate the transformation of the face in three-dimensional space and therefore contains the transformation coefficients of the x, y, and z axes. A sketch of these mapping heads follows this paragraph.
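A sketch of this mapping stage, assuming PyTorch and the C' = 2048 feature size of the encoder sketch above; only the head output dimensions (79, 46, 79, 27, 3, 3) are taken from the text, everything else is illustrative:

    import torch
    import torch.nn as nn

    class ParameterMappingLayer(nn.Module):
        """Global pooling followed by one learnable linear head per subspace."""
        def __init__(self, c_feat=2048):
            super().__init__()
            dims = {'id': 79, 'exp': 46, 'texture': 79,
                    'light': 27, 'pose': 3, 'transition': 3}
            self.heads = nn.ModuleDict(
                {name: nn.Linear(c_feat, d) for name, d in dims.items()})

        def forward(self, feature_map):           # N x C' x H' x W'
            v = feature_map.mean(dim=(2, 3))      # global pooling -> N x C'
            return {name: head(v) for name, head in self.heads.items()}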
As understood from the above embodiments, the decoupled modeling based on the bilinear modeling layer of the three-dimensional face reconstruction network can model the identity information and the expression information separately, which is helpful for expression-transfer scenarios and drives the development of expression generation applications in related industries. Meanwhile, the spatial mapping of the parameter mapping layer maps the face region image to the three-dimensional face model parameters and the rendering projection parameters, making full use of the feature information of the input face region image and providing a more convenient and effective way to obtain the parameter coefficients.
The input of the three-dimensional face reconstruction network of this application is a face region image, and its output is a three-dimensional face model. In this application, a framework corresponding to a weakly supervised learning mechanism is constructed for the three-dimensional face reconstruction network to complete its training. FIG. 8 shows a schematic diagram of the framework corresponding to the weakly supervised learning mechanism used to train the three-dimensional face reconstruction network of this application, and the network is trained according to this framework. Accordingly, on the basis of any of the above embodiments, referring to FIG. 7, the training process of the three-dimensional face reconstruction network includes:
Step S2100: obtaining a single sample of the preprocessed face image data.
The face image data refers to image data containing a human face; such image data can be obtained through authorized legal channels such as live streaming and video-on-demand. In one embodiment, the face image data can be video stream data,
所述影像数据的数据存储格式由于数据来源的多样而不一致,其包括RGB24、YUV444、YUV420等格式。为实现本申请中相关技术的自动化应用,将所述数据存储格式统一,一种实施例中,可将不同来源的影像数据转换为统一的YUV420格式。另一种实施例中,也可将不同来源的影像数据转换为统一的RGB24格式、或YUV444格式、或其他。上述预处理方式应用于本申请中相关技术方法的训练和应用,将多样的数据格式统一为一种以提升技术应用的效率同时又不会影响其性能方面。The data storage format of the image data is inconsistent due to various data sources, including RGB24, YUV444, YUV420 and other formats. In order to realize the automated application of the related technologies in this application, the data storage formats are unified. In one embodiment, image data from different sources can be converted into a unified YUV420 format. In another embodiment, image data from different sources can also be converted into a unified RGB24 format, or YUV444 format, or others. The above-mentioned preprocessing method is applied to the training and application of relevant technical methods in this application, unifying various data formats into one to improve the efficiency of technical applications without affecting its performance.
所述预处理后的人脸影像数据中,无论是视频流数据还是图像数据,抽取其中一张张带有人脸部分的人脸图像作为单个样本供后续处理。From the preprocessed face image data, whether it is video stream data or image data, one face image with a face part is extracted as a single sample for subsequent processing.
步骤S2200、获取所述单个样本中的人脸区域图像、人脸关键点和三维人脸模型姿态系数;Step S2200: Obtain the face area image, face key points and three-dimensional face model posture coefficients in the single sample;
The face region image, face key points, and three-dimensional face model pose coefficients are extracted from the single sample in the same manner as in step S1200 above. Specifically, a face detection model pretrained to convergence detects the single sample to obtain face rectangle information and, from it, the face region image; a face key point detection model pretrained to convergence then detects the face region image to obtain the face key point information; the face region image S_roi and the face key point information L_n are aligned according to the standard alignment parameters; finally, a Hough transform computation on the face key points yields the three-dimensional face pose information Y_pose.
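A minimal sketch of this extraction step follows. `detect_face` and `detect_landmarks` stand in for the pretrained, converged detection and landmark models; OpenCV's `solvePnP` is used here only as a stand-in for the Hough-transform pose computation named in the text, and the 224x224 alignment size is an assumption of the sketch.

```python
import cv2
import numpy as np

def preprocess_sample(image, detect_face, detect_landmarks,
                      std_landmarks, model_points_3d, out_size=224):
    """Sketch of the extraction in step S2200.

    `std_landmarks` are the standard 2D key points; `model_points_3d`
    are their 3D counterparts on the standard face, used only for the
    pose stand-in below.
    """
    x0, y0, x1, y1 = detect_face(image)                       # face rectangle
    s_roi = image[y0:y1, x0:x1]                               # face region image S_roi
    l_n = detect_landmarks(s_roi).astype(np.float32)          # key points L_n, [N, 2]

    # Standard alignment: similarity transform from L_n to the standard key points.
    M, _ = cv2.estimateAffinePartial2D(l_n, std_landmarks)
    s_roi = cv2.warpAffine(s_roi, M, (out_size, out_size))
    l_n = cv2.transform(l_n.reshape(-1, 1, 2), M).reshape(-1, 2)

    # Pose from the key points; solvePnP is only a stand-in for the
    # Hough-transform computation named in the text.
    cam = np.array([[out_size, 0, out_size / 2],
                    [0, out_size, out_size / 2],
                    [0, 0, 1]], dtype=np.float64)
    _, rvec, _ = cv2.solvePnP(model_points_3d, l_n.astype(np.float64), cam, None)
    y_pose = cv2.Rodrigues(rvec)[0]   # rotation matrix; convert to Euler angles as needed
    return s_roi, l_n, y_pose
```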
It should be noted that the face region image serves as the input of the three-dimensional face reconstruction network, while the face key points and the three-dimensional face pose information are used to compute loss values.
Step S2300: Use the three-dimensional face reconstruction network to reconstruct the three-dimensional face model of the face region image, and render and project it into two dimensions to obtain the face reconstruction image;
The bilinear modeling layer of the three-dimensional face reconstruction network performs decoupled modeling of identity information and expression information, and the parameter mapping layer of the network obtains the identity, expression, texture, illumination, pose, and transformation coefficients. The identity coefficient and the expression coefficient are used to reconstruct the three-dimensional face model of the face region image.
The three-dimensional rendering and two-dimensional projection of the three-dimensional face model include the following operations. First, the face surface texture is estimated: the face is assumed in advance to be a Lambertian surface, and the scene illumination is approximated with spherical harmonics, so that the radiance of each vertex can be computed from the face surface normals and the skin texture σ_texture(F_g(x)), for example as r_i = σ_texture(F_g(x))_i · Σ_k γ_k Φ_k(n_i), where n_i is the normal of vertex i, γ_k are the illumination coefficients, and Φ denotes the spherical harmonic basis functions.
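Under the stated Lambertian and spherical-harmonics assumptions, the sketch below evaluates a three-band (9-term) real SH basis and modulates the per-vertex skin texture by the SH-approximated lighting; the choice of three bands and of PyTorch is an assumption of this sketch.

```python
import torch

def sh_basis(n: torch.Tensor) -> torch.Tensor:
    """First three bands (9 terms) of the real spherical harmonic basis,
    evaluated at unit normals n of shape [V, 3]."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    ones = torch.ones_like(x)
    return torch.stack([
        0.282095 * ones,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3 * z ** 2 - 1),
        1.092548 * x * z,
        0.546274 * (x ** 2 - y ** 2),
    ], dim=1)                                       # [V, 9]

def vertex_radiance(albedo: torch.Tensor, normals: torch.Tensor,
                    gamma: torch.Tensor) -> torch.Tensor:
    """Lambertian radiance under SH illumination: per-vertex skin texture
    modulated by the SH-approximated scene lighting.

    albedo: [V, 3]; normals: [V, 3]; gamma: [9, 3] SH lighting coefficients.
    """
    shading = sh_basis(normals) @ gamma             # [V, 3] per-vertex shading
    return albedo * shading
```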
This completes the three-dimensional rendering of the three-dimensional face model. Next, the face is transformed into the camera coordinate system: using the pose parameters σ_pose(F_g(x)) and the transformation parameters σ_transition(F_g(x)), combined with the camera perspective model, the three-dimensional face is translated and rotated and can then be projected onto a two-dimensional plane, yielding all projection points L_x of the face vertices, which can be expressed as [N_v, 2], where 2 denotes the x, y plane coordinates. It should be noted that this face projection already incorporates the transformation from the world coordinate system to the pixel coordinate system and matches the corresponding positions of the standard face key points. At this point, the projection of the three-dimensional face model onto the two-dimensional plane is complete and the face reconstruction image is obtained.
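A minimal sketch of the rigid transform and pinhole projection described above, with assumed intrinsics (`focal`, `center`):

```python
import torch

def project_vertices(verts: torch.Tensor, R: torch.Tensor, t: torch.Tensor,
                     focal: float, center: float) -> torch.Tensor:
    """Rigidly transform the 3D face with rotation R and translation t,
    then apply a pinhole perspective model to obtain the [N_v, 2]
    projection points L_x in pixel-plane coordinates.

    verts: [N_v, 3] in world coordinates; R: [3, 3]; t: [3].
    """
    cam = verts @ R.T + t                  # world -> camera coordinates
    x = focal * cam[:, 0] / cam[:, 2] + center
    y = focal * cam[:, 1] / cam[:, 2] + center
    return torch.stack([x, y], dim=1)      # [N_v, 2] projection points
```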
Assuming the input face region image is x, the face reconstruction image obtained after three-dimensional face reconstruction, rendering, and projection can be expressed as:
R(x) = Render(F_id, F_exp, F_ill, F_albedo, F_pose, F_transition)
Step S2400: Compute a reconstruction loss value from the face region image and the face reconstruction image, and update the parameters of the three-dimensional face reconstruction network according to this reconstruction loss value;
A reconstruction loss function is constructed to compute the error between the face region image and the face reconstruction image. In one embodiment, the three-dimensional reconstruction loss function is a weighted sum of four sub-loss functions: the first is a perceptual loss function that minimizes the perceptual error between the face region image and the face reconstruction image; the second is a photometric loss function that strengthens the shape- and pixel-level alignment between them; the third is a pose loss function that ensures high pose accuracy; the fourth is a reprojection loss function that optimizes the accuracy of the projection points. The weighted sum of these sub-loss values is the reconstruction loss value of the three-dimensional face reconstruction network at the current iteration, that is, the error L(x).
After the error L(x) is computed, the relevant weights can be updated according to the back-propagation mechanism of the neural network.
The weights updated are mainly those of the spatial mappings in the parameter mapping layer of the three-dimensional face reconstruction network, that is, the semantic space mapping component and the parameter space mapping component.
The weights are updated in the direction that reduces the error L(x).
Step S2500: Repeat the above operations until a preset termination condition is triggered and training ends, obtaining the three-dimensional face reconstruction network.
The above steps are repeated: obtain a sample --> obtain the face reconstruction image --> compute the error --> update the parameters. Training terminates once the preset termination condition is reached, indicating that training has converged. The preset termination condition can be set by those skilled in the art according to the requirements of the actual application scenario. In one embodiment, it may be an iteration-count constraint, that is, training terminates when the number of training iterations reaches a preset number; in another embodiment, it may be a loss-value constraint, that is, training terminates when the reconstruction loss value reaches a preset minimum during iterative training.
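A compact sketch of this loop, with both termination conditions, might look as follows; `net`, `renderer`, `loss_fn`, the sample iterator, the Adam optimizer, and all hyperparameter values are assumptions of the sketch.

```python
import torch

def train(net, renderer, loss_fn, samples, max_iters=200_000, min_loss=1e-4):
    """Weakly supervised loop of steps S2100-S2500: sample -> reconstruct ->
    loss -> update, stopping on either preset termination condition."""
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    for it, (s_roi, l_n, y_pose) in enumerate(samples):
        coeffs = net(s_roi)                # identity, expression, texture, ...
        recon = renderer(coeffs)           # face reconstruction image R(x)
        loss = loss_fn(s_roi, recon, coeffs, l_n, y_pose)
        opt.zero_grad()
        loss.backward()                    # back-propagation updates mainly the
        opt.step()                         # semantic/parameter space mappings
        if it + 1 >= max_iters or loss.item() <= min_loss:
            break                          # preset termination condition
    return net
```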
From the above embodiments it is easy to see that the weakly supervised learning mechanism based on single face images can construct training data in large quantities at low cost, effectively reducing the acquisition and labeling costs of training samples and providing strong momentum for the rapid development of the related techniques. In addition, the method can obtain a decoupled facial expression model for expression migration applications in film and television, animation, digital humans, and other related fields, which has great practical and commercial value.
On the basis of any of the above embodiments, and referring to Figure 9, computing the reconstruction loss value from the aligned face region image and the reconstructed face image includes:
Step S2410: Compute a first loss value, which is used to minimize the error between the face region image and the face reconstruction image;
The first loss value is computed after deep perception of the face region image and the face reconstruction image. That is, a neural network with mature perceptual capability first extracts the semantic features of the face region image and the face reconstruction image, and the corresponding loss value is then computed from those semantic features.
Optionally, self-supervised modeling is first performed on the face reconstruction image. In one embodiment, a face recognition network pretrained to convergence is introduced to extract the top-level deep features of the face reconstruction image and the face region image. It should be noted that the face recognition network can adopt any mature neural network model from the related art; face recognition models such as VGGNet, FaceNet, or ArcFaceNet can be selected for the self-supervised training. In the embodiments of this application, an ArcFaceNet network can be used, which gives better results.
Define the face region image as x, the reconstructed face image as R(x), and the face recognition model as E(x). The perceptual loss function can then be expressed, for example, as the cosine distance between the two embeddings:
L_percep(x, R(x)) = 1 − (E(x)·E(R(x))) / (‖E(x)‖·‖E(R(x))‖)
This similarity loss function constrains the network model so that the reconstructed face approaches the real face, thereby optimizing the surface texture features, the illumination parameters, and so on.
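A sketch of this constraint as a cosine-distance loss over frozen recognizer embeddings (the recognizer `face_net` is assumed, e.g. an ArcFace-style model):

```python
import torch
import torch.nn.functional as F

def perceptual_loss(x, r_x, face_net):
    """Cosine distance between identity embeddings of the face region
    image x and the reconstruction R(x); `face_net` is a frozen,
    pretrained face recognition network."""
    e_x = F.normalize(face_net(x), dim=-1)
    e_r = F.normalize(face_net(r_x), dim=-1)
    return (1.0 - (e_x * e_r).sum(dim=-1)).mean()
```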
Step S2420: Compute a second loss value, which is used to strengthen the shape- and pixel-level alignment between the face region image and the face reconstruction image;
The first loss value implicitly constrains the approximation at the face feature level. To further strengthen the shape- and pixel-level alignment, a second loss value is added, which can be expressed, for example, as a pixel-wise photometric error over the face region:
L_photo(x, R(x)) = ‖x − R(x)‖_2
This constraint acts strongly at the pixel level; therefore, in one embodiment, it is given a small weight w_photo to prevent the network from falling into a local solution.
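A sketch of the photometric term, with an optional mask restricting the error to rendered face pixels (the masking is an assumption of the sketch, not stated in the text):

```python
import torch

def photometric_loss(x, r_x, mask=None):
    """Pixel-level photometric term between x and R(x), both [B, 3, H, W];
    `mask` ([B, H, W], optional) restricts the error to face pixels."""
    diff = torch.norm(x - r_x, dim=1)      # per-pixel L2 over channels
    if mask is not None:
        return (diff * mask).sum() / mask.sum().clamp(min=1.0)
    return diff.mean()
```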
Step S2430: Compute a third loss value, which is used to ensure that the pose has high accuracy;
The first loss value constrains and optimizes the pose implicitly. To further ensure high pose accuracy, a third loss value is computed. In one embodiment, the three-dimensional face model pose coefficients from step S2200 are used as label data, and an L1-norm loss is used for numerical constraint and minimization:
L_pose = ‖σ_pose(F_g(x)) − Y_pose‖_1
Here, σ_pose(F_g(x)) ∈ R^3 denotes the pose coefficients obtained in the forward inference of the three-dimensional face reconstruction network, comprising the roll angle, pitch angle, and rotation angle, and Y_pose ∈ R^3 denotes the three-dimensional face model pose coefficients obtained in step S2200, likewise comprising the roll angle, pitch angle, and rotation angle.
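This L1 constraint translates directly into code; `pred_pose` and `y_pose` are assumed to be batched (roll, pitch, rotation) triples:

```python
import torch

def pose_loss(pred_pose, y_pose):
    """L1 constraint between the predicted pose coefficients and the pose
    coefficients extracted from the sample in step S2200; both [B, 3]."""
    return torch.abs(pred_pose - y_pose).sum(dim=-1).mean()
```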
Step S2440: Compute a fourth loss value, which is used to optimize the accuracy of the projection points in the two-dimensional projection;
To further optimize the accuracy of the face vertex mesh modeling, a fourth loss value can also be used to constrain the model. Optionally, a reprojection error constraint is constructed between the face key point data extracted from the sample and the reprojected points obtained after three-dimensional face reconstruction, three-dimensional rendering, and two-dimensional projection, for example:
L_proj = (1/N_l) Σ_i ‖l_i − l̂_i‖_2
where l_i are the detected two-dimensional key points and l̂_i are the corresponding reprojected vertex points. The number of vertices used is consistent with the number of detected two-dimensional face key points.
This constrains the accuracy of the projection points.
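A sketch of the reprojection term over the vertex projections that correspond to the detected key points:

```python
import torch

def reprojection_loss(proj_points, landmarks):
    """Mean distance between the reprojected face vertices that correspond
    to key points and the detected 2D landmarks L_n; both [N, 2]."""
    return torch.norm(proj_points - landmarks, dim=-1).mean()
```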
Step S2450: Compute the reconstruction loss value as a weighted fusion of the first, second, third, and fourth loss values.
The four sub-loss functions constructed in the above steps are fused by weighting. In one embodiment of this application, the complete network training loss function can be expressed as:
L(x) = w_percep·L_percep(x, R(x)) + L_photo(x, R(x)) + w_pose·L_pose(x) + w_proj·L_proj(x)
where w_percep, w_pose, and w_proj denote the weights of the first, third, and fourth loss values, respectively.
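Composing the sub-loss sketches defined above, the weighted fusion might be written as follows; the weight values shown are placeholders, not those used in the application:

```python
# perceptual_loss, photometric_loss, pose_loss and reprojection_loss
# are the sketches defined in the preceding steps.

def total_loss(x, r_x, pred_pose, y_pose, proj_points, landmarks, face_net,
               w_percep=0.2, w_pose=1.0, w_proj=1.0):
    """Weighted fusion L(x) of the four sub-losses, mirroring the formula
    above (the photometric term carries an implicit weight of 1)."""
    return (w_percep * perceptual_loss(x, r_x, face_net)
            + photometric_loss(x, r_x)
            + w_pose * pose_loss(pred_pose, y_pose)
            + w_proj * reprojection_loss(proj_points, landmarks))
```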
From the above embodiments it is easy to see that computing the reconstruction loss value as a weighted fusion of the first, second, third, and fourth loss values constrains all parameters obtained by the three-dimensional face reconstruction network to approach the true label values more comprehensively, while single-sample loss computation and parameter updating accelerate convergence and save training cost.
Referring to Figure 10, a three-dimensional face reconstruction apparatus provided according to one aspect of this application includes, in one embodiment, an image acquisition module 1100, a face detection module 1200, a face modeling module 1300, and a parameter mapping module 1400. The image acquisition module 1100 is configured to acquire face image data and extract face images from it; the face detection module 1200 is configured to perform key point detection on the face image and obtain the face region image of the area where the face key points are located; the face modeling module 1300 is configured to use the bilinear modeling layer of a three-dimensional face reconstruction network pretrained to convergence to perform bilinear modeling of face identity and facial expression on the face region image and obtain a parameterized three-dimensional face model; the parameter mapping module 1400 is configured to use the parameter mapping layer of the three-dimensional face reconstruction network to map the face region image to the corresponding parameter coefficients of the parameterized three-dimensional face model, the parameter coefficients including an identity coefficient corresponding to the face identity and an expression coefficient corresponding to the facial expression.
In one embodiment, the parameter mapping module 1400 includes: a coefficient acquisition unit configured to obtain the target parameter coefficients required to constitute the parameterized three-dimensional face model, the target parameter coefficients including a prespecified identity coefficient and a prespecified expression coefficient; an expression migration unit configured to migrate the target parameter coefficients into the three-dimensional face model of a corresponding digital human to obtain the digital human's three-dimensional face model; and a rendering projection unit configured to render and project the digital human's three-dimensional face model into a two-dimensional image space to obtain a digital human image.
In one embodiment, the face detection module 1200 includes: a face detection unit configured to perform face key point detection on the face image to obtain the face region image and the face key point information; a standard alignment unit configured to align the face key points with standard face key points to obtain standard alignment parameters, the standard face key points being the corresponding face key points obtained by two-dimensional projection of a standard three-dimensional face model; and a face alignment unit configured to align the face region image according to the standard alignment parameters.
In one embodiment, the parameter mapping module 1400 includes: a feature encoding unit configured to use the encoder of the three-dimensional face reconstruction network to perform feature extraction on the face region image and obtain a face feature map; and a spatial mapping unit configured to perform spatial mapping on the face feature map to obtain the parameter coefficients of the bilinear modeling layer.
In one embodiment, the spatial mapping unit includes: a semantic space mapping subunit configured to perform semantic space mapping on the face feature map to obtain a face feature vector; and a parameter space mapping subunit configured to perform parameter space mapping on the face feature vector to obtain the parameter coefficients of the bilinear modeling layer.
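A sketch of this two-stage mapping head is given below; all layer types and dimensions (including the coefficient dimensionality) are illustrative assumptions rather than the application's actual architecture:

```python
import torch
import torch.nn as nn

class ParameterMappingHead(nn.Module):
    """Sketch of the two-stage mapping: the encoder's face feature map is
    mapped to a semantic feature vector, which is then mapped to the
    concatenated parameter coefficients of the bilinear modeling layer."""

    def __init__(self, feat_channels=512, semantic_dim=256, coeff_dim=239):
        super().__init__()
        self.semantic_mapping = nn.Sequential(      # semantic space mapping
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_channels, semantic_dim), nn.ReLU(inplace=True))
        self.parameter_mapping = nn.Linear(semantic_dim, coeff_dim)  # parameter space mapping

    def forward(self, feature_map):                 # [B, C, H, W] face feature map
        v = self.semantic_mapping(feature_map)      # face feature vector
        return self.parameter_mapping(v)            # identity/expression/texture/... coefficients
```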
In one embodiment, the network training module includes: a sample acquisition unit configured to obtain a single sample of preprocessed face image data; a data acquisition unit configured to obtain the face region image, face key points, and three-dimensional face model pose coefficients from the single sample; a reconstruction image unit configured to use the three-dimensional face reconstruction network to reconstruct the three-dimensional face model of the face region image and render and project it into two dimensions to obtain the face reconstruction image; a loss optimization unit configured to compute a reconstruction loss value from the face region image and the face reconstruction image and update the parameters of the three-dimensional face reconstruction network according to this reconstruction loss value; and a training repetition unit configured to repeat the above operations until a preset termination condition is triggered and training ends, obtaining the three-dimensional face reconstruction network.
In one embodiment, the loss optimization unit includes: a first loss subunit configured to compute a first loss value used to minimize the error between the face region image and the face reconstruction image; a second loss subunit configured to compute a second loss value used to strengthen the shape- and pixel-level alignment between the face region image and the face reconstruction image; a third loss subunit configured to compute a third loss value used to ensure high pose accuracy; a fourth loss subunit configured to compute a fourth loss value used to optimize the accuracy of the projection points in the two-dimensional projection; and a loss fusion subunit configured to compute the reconstruction loss value as a weighted fusion of the first, second, third, and fourth loss values.
Another embodiment of this application further provides a three-dimensional face reconstruction device. Figure 11 shows a schematic diagram of the internal structure of the device. The device includes a processor, a computer-readable storage medium, a memory, and a network interface connected through a system bus. The computer-readable non-volatile storage medium of the device stores an operating system, a database, and computer-readable instructions; the database can store sequences of information, and the computer-readable instructions, when executed by the processor, cause the processor to implement a three-dimensional face reconstruction method.
The processor of the device provides computing and control capabilities and supports the operation of the whole device. Computer-readable instructions may be stored in the memory of the device; when executed by the processor, they cause the processor to perform the three-dimensional face reconstruction method of this application. The network interface of the device is used to connect to and communicate with terminals.
Those skilled in the art will understand that the structure shown in Figure 11 is only a block diagram of the part of the structure relevant to the solution of this application; a specific three-dimensional face reconstruction device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In this embodiment, the processor executes the specific functions of the modules in Figure 10, and the memory stores the program code and the various data required to execute those modules or submodules. The network interface implements data transmission between user terminals or servers. The non-volatile readable storage medium in this embodiment stores the program code and data required to execute all modules of the three-dimensional face reconstruction apparatus of this application, and the server can invoke that program code and data to execute the functions of all modules.
This application further provides a non-volatile readable storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the three-dimensional face reconstruction method of any embodiment of this application.
This application further provides a computer program product including a computer program/instructions which, when executed by one or more processors, implement the steps of the method of any embodiment of this application.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program, which can be stored in a non-volatile readable storage medium; when executed, the program may include the processes of the above method embodiments. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or a random access memory (RAM), among others.
In summary, this application achieves three-dimensional face reconstruction. The method uses a bilinear modeling layer to model the identity information and the expression information in a face in a decoupled way, effectively separating out the expression parameters and enabling expression migration, which can greatly promote application and development in live streaming, film and television, and related industries. Furthermore, the training of the method is based on weakly supervised learning from single images, which can greatly reduce the acquisition and labeling costs of training data and facilitates large-scale application.

Claims (11)

  1. A three-dimensional face reconstruction method, comprising:
    acquiring face image data and extracting a face image therefrom;
    performing key point detection on the face image to obtain a face region image of the area where the face key points are located;
    using a bilinear modeling layer of a three-dimensional face reconstruction network pretrained to convergence to perform bilinear modeling of face identity and facial expression on the face region image, obtaining a parameterized three-dimensional face model;
    using a parameter mapping layer of the three-dimensional face reconstruction network to map the face region image to corresponding parameter coefficients of the parameterized three-dimensional face model, the parameter coefficients comprising an identity coefficient corresponding to the face identity and an expression coefficient corresponding to the facial expression.
  2. The three-dimensional face reconstruction method according to claim 1, wherein, after using the parameter mapping layer of the three-dimensional face reconstruction network to map the face region image to the corresponding parameter coefficients of the parameterized three-dimensional face model, the method comprises:
    obtaining target parameter coefficients required to constitute the parameterized three-dimensional face model, wherein the target parameter coefficients comprise a prespecified identity coefficient and a prespecified expression coefficient;
    migrating the target parameter coefficients into a three-dimensional face model of a corresponding digital human to obtain the digital human's three-dimensional face model;
    rendering and projecting the digital human's three-dimensional face model into a two-dimensional image space to obtain a digital human image.
  3. The three-dimensional face reconstruction method according to claim 1, wherein performing key point detection on the face image to obtain the face region image of the area where the face key points are located comprises:
    performing face key point detection on the face image to obtain the face region image and face key point information;
    aligning the face key points with standard face key points to obtain standard alignment parameters, the standard face key points being corresponding face key points obtained by two-dimensional projection of a standard three-dimensional face model;
    aligning the face region image according to the standard alignment parameters.
  4. The three-dimensional face reconstruction method according to claim 1, wherein using the parameter mapping layer of the three-dimensional face reconstruction network to map the face region image to the corresponding parameter coefficients of the parameterized three-dimensional face model comprises:
    using an encoder of the three-dimensional face reconstruction network to perform feature extraction on the face region image to obtain a face feature map;
    performing spatial mapping on the face feature map to obtain the parameter coefficients of the bilinear modeling layer.
  5. The three-dimensional face reconstruction method according to claim 4, wherein performing spatial mapping on the face feature map to obtain the parameter coefficients of the bilinear modeling layer comprises:
    performing semantic space mapping on the face feature map to obtain a face feature vector;
    performing parameter space mapping on the face feature vector to obtain the parameter coefficients of the bilinear modeling layer.
  6. The three-dimensional face reconstruction method according to any one of claims 1 to 5, wherein the training process of the three-dimensional face reconstruction network comprises:
    obtaining a single sample of preprocessed face image data;
    obtaining the face region image, face key points, and three-dimensional face model pose coefficients from the single sample;
    using the three-dimensional face reconstruction network to reconstruct the three-dimensional face model of the face region image, and rendering and projecting it into two dimensions to obtain a face reconstruction image;
    computing a reconstruction loss value from the face region image and the face reconstruction image, and updating the parameters of the three-dimensional face reconstruction network according to the reconstruction loss value;
    repeating the above operations until a preset termination condition is triggered and training ends, obtaining the three-dimensional face reconstruction network.
  7. The three-dimensional face reconstruction method according to claim 6, wherein computing the reconstruction loss value from the aligned face region image and the reconstructed face image comprises:
    computing a first loss value used to minimize the error between the face region image and the face reconstruction image;
    computing a second loss value used to strengthen the shape- and pixel-level alignment between the face region image and the face reconstruction image;
    computing a third loss value used to ensure that the pose has high accuracy;
    computing a fourth loss value used to optimize the accuracy of projection points in the two-dimensional projection;
    computing the reconstruction loss value as a weighted fusion of the first, second, third, and fourth loss values.
  8. A three-dimensional face reconstruction apparatus, comprising:
    an image acquisition module configured to acquire face image data and extract a face image therefrom;
    a face detection module configured to perform key point detection on the face image and obtain a face region image of the area where the face key points are located;
    a face modeling module configured to use a bilinear modeling layer of a three-dimensional face reconstruction network pretrained to convergence to perform bilinear modeling of face identity and facial expression on the face region image and obtain a parameterized three-dimensional face model;
    a parameter mapping module configured to use a parameter mapping layer of the three-dimensional face reconstruction network to map the face region image to corresponding parameter coefficients of the parameterized three-dimensional face model, the parameter coefficients comprising an identity coefficient corresponding to the face identity and an expression coefficient corresponding to the facial expression.
  9. A three-dimensional face reconstruction device, comprising a central processing unit and a memory, the central processing unit being configured to invoke and run a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
  10. A non-volatile readable storage medium storing, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 7, the computer program, when invoked and run by a computer, performing the steps included in the corresponding method.
  11. A computer program product, comprising a computer program/instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 7.
PCT/CN2023/111005 2022-08-12 2023-08-03 Three-dimensional face reconstruction method, apparatus, and device, medium, and product WO2024032464A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210969989.1A CN115330947A (en) 2022-08-12 2022-08-12 Three-dimensional face reconstruction method and device, equipment, medium and product thereof
CN202210969989.1 2022-08-12

Publications (1)

Publication Number Publication Date
WO2024032464A1 true WO2024032464A1 (en) 2024-02-15

Family

ID=83923644

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/111005 WO2024032464A1 (en) 2022-08-12 2023-08-03 Three-dimensional face reconstruction method, apparatus, and device, medium, and product

Country Status (2)

Country Link
CN (1) CN115330947A (en)
WO (1) WO2024032464A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115330947A (en) * 2022-08-12 2022-11-11 百果园技术(新加坡)有限公司 Three-dimensional face reconstruction method and device, equipment, medium and product thereof
CN115690327A (en) * 2022-11-16 2023-02-03 广州大学 Space-frequency decoupling weak supervision three-dimensional face reconstruction method
CN116228763B (en) * 2023-05-08 2023-07-21 成都睿瞳科技有限责任公司 Image processing method and system for eyeglass printing
CN116993948B (en) * 2023-09-26 2024-03-26 粤港澳大湾区数字经济研究院(福田) Face three-dimensional reconstruction method, system and intelligent terminal
CN117237547B (en) * 2023-11-15 2024-03-01 腾讯科技(深圳)有限公司 Image reconstruction method, reconstruction model processing method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060001673A1 (en) * 2004-06-30 2006-01-05 Mitsubishi Electric Research Laboratories, Inc. Variable multilinear models for facial synthesis
CN103093490A (en) * 2013-02-02 2013-05-08 浙江大学 Real-time facial animation method based on single video camera
CN114241102A (en) * 2021-11-11 2022-03-25 清华大学 Method and device for reconstructing and editing human face details based on parameterized model
CN114742954A (en) * 2022-04-27 2022-07-12 南京大学 Method for constructing large-scale diversified human face image and model data pairs
CN115330947A (en) * 2022-08-12 2022-11-11 百果园技术(新加坡)有限公司 Three-dimensional face reconstruction method and device, equipment, medium and product thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CAO, CHEN ET AL.: "FaceWarehouse: A 3D Facial Expression Database for Visual Computing", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, vol. 20, no. 3, 31 March 2014 (2014-03-31), XP011543570, ISSN: 1077-2626, DOI: 10.1109/TVCG.2013.249 *

Also Published As

Publication number Publication date
CN115330947A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
WO2024032464A1 (en) Three-dimensional face reconstruction method, apparatus, and device, medium, and product
CN110458939B (en) Indoor scene modeling method based on visual angle generation
US11538216B2 (en) Dynamically estimating light-source-specific parameters for digital images using a neural network
CN109285215B (en) Human body three-dimensional model reconstruction method and device and storage medium
US11748934B2 (en) Three-dimensional expression base generation method and apparatus, speech interaction method and apparatus, and medium
Park et al. Transformation-grounded image generation network for novel 3d view synthesis
WO2022002032A1 (en) Image-driven model training and image generation
US9792725B2 (en) Method for image and video virtual hairstyle modeling
JP2022524891A (en) Image processing methods and equipment, electronic devices and computer programs
WO2022001236A1 (en) Three-dimensional model generation method and apparatus, and computer device and storage medium
CN112085835B (en) Three-dimensional cartoon face generation method and device, electronic equipment and storage medium
CN110458924B (en) Three-dimensional face model establishing method and device and electronic equipment
WO2024007478A1 (en) Three-dimensional human body modeling data collection and reconstruction method and system based on single mobile phone
WO2023066120A1 (en) Image processing method and apparatus, electronic device, and storage medium
WO2021063271A1 (en) Human body model reconstruction method and reconstruction system, and storage medium
US20200118333A1 (en) Automated costume augmentation using shape estimation
CN111402403B (en) High-precision three-dimensional face reconstruction method
CN115496862A (en) Real-time three-dimensional reconstruction method and system based on SPIN model
EP3855386B1 (en) Method, apparatus, device and storage medium for transforming hairstyle and computer program product
CN111754622B (en) Face three-dimensional image generation method and related equipment
CN115775300A (en) Reconstruction method of human body model, training method and device of human body reconstruction model
Patterson et al. Landmark-based re-topology of stereo-pair acquired face meshes
Peng et al. Geometrical consistency modeling on b-spline parameter domain for 3d face reconstruction from limited number of wild images
CN117557699B (en) Animation data generation method, device, computer equipment and storage medium
CN116704097B (en) Digitized human figure design method based on human body posture consistency and texture mapping

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23851685

Country of ref document: EP

Kind code of ref document: A1