CN115830239A - Method and device for generating animation by using human face photo based on three-dimensional reconstruction

Info

Publication number
CN115830239A
CN115830239A
Authority
CN
China
Prior art keywords
model
photo
frame
face
animation
Prior art date
Legal status
Pending
Application number
CN202211609505.9A
Other languages
Chinese (zh)
Inventor
潘仁
Current Assignee
Hangzhou Xiaoying Innovation Technology Co ltd
Original Assignee
Hangzhou Xiaoying Innovation Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Xiaoying Innovation Technology Co ltd
Priority to CN202211609505.9A
Publication of CN115830239A
Legal status: Pending

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application relates to the technical field of face image processing, and discloses a method and a device for generating an animation from a human face photo based on three-dimensional reconstruction. The method comprises the following steps: acquiring image data and preprocessing it; building a deep learning model and training it on the preprocessed data to obtain a pre-training model; using the pre-training model to run inference on the face photo data to be edited; decoding the camera parameters obtained by inference and the adjusted FLAME parameters to obtain a 3D human head model for each frame; rendering each frame based on the UV map; and, after rendering the background, displaying the frames in sequence to generate the animation.

Description

Method and device for generating animation by using human face photo based on three-dimensional reconstruction
Technical Field
The application relates to the technical field of human face image processing, in particular to a method and a device for generating animation by using a human face photo based on three-dimensional reconstruction.
Background
With the development of the video entertainment industry, more and more people are taking up video creation. Cool and interesting special-effect animations usually have to be produced by creators with professional tools and skills, which poses a certain barrier to ordinary users and beginners. Automating such special effects to lower this threshold, so that ordinary users can experience their striking results, is therefore important work.
Currently, face-editing animation effects are a direction of strong user interest: a user inputs a photo, and through certain techniques the person in the photo blinks, opens their mouth, or smiles, making the photo 'come alive'. These techniques belong to the category of face editing. Face-editing animation mainly involves two aspects: one is control of facial expressions, such as opening the mouth, closing the mouth, opening the eyes, closing the eyes, smiling, or anger, which can convey a person's mood; the other is control of facial pose, such as lowering the head, raising the head, looking left, looking right, or swinging the head.
At present, face editing, including face-editing animation, is mainly based on GAN technology. For example, NVIDIA's StyleGAN extracts and decouples features of a face such as the eyes, mouth, and face shape from a picture, then parameterizes and controls them to realize face-editing animation.
Disclosure of Invention
The application aims to overcome the defects of the prior art and provides a method and a device for generating an animation from a human face photo based on three-dimensional reconstruction.
In a first aspect, a method for generating an animation based on a three-dimensional reconstructed face photo is provided, which includes:
acquiring image data suitable for a special-effect application scene, wherein the data comprises photos and video frames, each of which contains a person's head;
preprocessing the image data;
building a deep learning model, and inputting the preprocessed image data into the deep learning model for model training to obtain a pre-training model;
inputting the preprocessed face photo data into the pre-training model for inference to obtain the FLAME parameters and camera parameters of the face photo, and adjusting the FLAME parameters required for each frame of the animation;
decoding the camera parameters and the adjusted FLAME parameters by using a FLAME model to obtain a 3D human head model of each frame;
carrying out UV mapping on the face photo to obtain UVmap information of the face photo;
and rendering the 3D human head model of each frame by using the UVmap information, then rendering the background and displaying the frames in sequence to generate the animation.
Further, the preprocessing of the data comprises:
cropping the photos and the video frames to obtain image data of the same size;
acquiring landmarks of the face by using the open-source face detection algorithm FAN;
and acquiring the mask of the face by using a face segmentation algorithm.
Further, the deep learning model is based on Resnet50; an image containing a human head is input to the model, and the FLAME model parameters and camera parameters corresponding to the photo are output, wherein the FLAME model parameters comprise shape parameters, pose parameters, and expression parameters.
Further, the preprocessed face photo data are data obtained after the face photo is subjected to cutting, face key point detection and face region segmentation.
Further, decoding the camera parameters and the adjusted FLAME parameters by using a FLAME model includes:
inputting the camera parameters and the adjusted FLAME parameters required for each frame of the animation into the FLAME model for decoding in sequence;
and outputting the vertex position information of the 3D human head model corresponding to each frame, thereby obtaining the 3D human head model of that frame.
Further, performing UV mapping on the face photo, including:
projecting the vertex of the 3D head to a 2D plane with the same size as the original face photo by using camera parameters to obtain UV coordinates;
dividing the UV coordinates by the width and height of the corresponding original face photo to obtain a UV mapping relation between the vertex of the 3D head and the original face photo;
and outputting the color information of each vertex corresponding to the original face photo, namely the UVmap information of the 3D human head model corresponding to the original face photo.
Furthermore, rendering the 3D human head model of each frame, frame by frame, by using the UVmap information, then rendering the background and displaying the frames in sequence to generate the animation includes:
inputting the vertex position information of the 3D human head model into a vertex shader;
calculating the camera parameters into a projection matrix, and inputting the projection matrix into a vertex shader;
taking the original picture as sampling data and the UV mapping relation as a UV coordinate, and rendering the image of each frame in sequence;
rendering the original face photo data to obtain a background rendering image;
taking the background rendering image as the background of each frame of image;
and sequentially displaying the images of each frame after the background is added, so that the animation can be generated.
In a second aspect, an apparatus for generating an animation based on a three-dimensional reconstructed face picture is provided, which includes:
the acquisition module is used for acquiring image data suitable for a special-effect application scene, wherein the data comprises photos and video frames, each containing a person's head;
the preprocessing module is used for preprocessing the image data;
the model building and training module is used for building a deep learning model and inputting the preprocessed image data into the deep learning model for model training to obtain a pre-training model;
the inference module is used for inputting the preprocessed face photo data into the pre-training model for inference to obtain the FLAME parameters and camera parameters of the face photo, and for adjusting the FLAME parameters required for each frame of the animation;
the decoding module is used for decoding the camera parameters and the adjusted FLAME parameters by using a FLAME model to obtain a 3D human head model of each frame;
the UV mapping module is used for carrying out UV mapping on the face photo to obtain UVmap information of the face photo;
and the rendering module is used for rendering the 3D human head model of each frame by using the UVmap information, then rendering the background and displaying the frames in sequence to generate the animation.
In a third aspect, a computer-readable storage medium is provided, which stores program code for execution by a device, the program code comprising instructions for performing the steps of the method as in any one of the implementations in the first aspect.
In a fourth aspect, an electronic device is provided, which comprises a processor, a memory, and a program or instructions stored on the memory and executable on the processor, which when executed by the processor, implement the steps of the method in any one of the implementations of the first aspect.
The application has the following beneficial effects: the application provides a new approach to face-editing animation of images. By directly adopting three-dimensional reconstruction, the reconstructed face model becomes much more controllable, and more face-editing methods can subsequently be derived on this basis. Moreover, by combining deep learning with image rendering, the face-editing animation pipeline can be fully automated; the generated animation has a strong sense of realism, so the displayed animation looks more natural and true to life. Interaction with the user is also possible, which greatly lowers the threshold for users to create animations.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application and serve for illustration and description.
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for generating an animation based on a three-dimensional reconstructed face picture according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present application.
Example one
The embodiment of the application relates to a method for generating an animation from a face photo based on three-dimensional reconstruction, which comprises: acquiring image data suitable for a special-effect application scene, wherein the data comprises photos and video frames, each containing a person's head; preprocessing the image data; building a deep learning model, and inputting the preprocessed image data into the deep learning model for training to obtain a pre-training model; inputting the preprocessed face photo data into the pre-training model for inference to obtain the FLAME parameters and camera parameters of the face photo, and adjusting the FLAME parameters required for each frame of the animation; decoding the camera parameters and the adjusted FLAME parameters with a FLAME model to obtain a 3D human head model for each frame; performing UV mapping on the face photo to obtain its UVmap information; and rendering the 3D human head model of each frame with the UVmap information, then rendering the background and displaying the frames in sequence to generate the animation. The embodiment of the application provides a new approach to face-editing animation of images: by directly adopting three-dimensional reconstruction, the reconstructed face model becomes much more controllable, and more face-editing methods can subsequently be derived on this basis. In addition, by combining deep learning with image rendering, the face-editing animation pipeline can be fully automated; the generated animation has a strong sense of realism, the display effect is more natural and true to life, interaction with the user is possible, and the threshold for users to create animations is greatly lowered.
Specifically, fig. 1 shows a flowchart of the method for generating an animation from a face photo based on three-dimensional reconstruction in embodiment one, which includes:
S101, acquiring image data suitable for a special-effect application scene, wherein the data comprises photos and video frames, each containing a person's head;
It should be noted that image data suitable for a special-effect application scene refers to images containing a person's head, derived from public head datasets or video datasets; for video data, frames containing a person's head need to be selected.
S102, preprocessing the image data;
Specifically, the preprocessing of the data includes:
S201, cropping the photos and video frames to obtain image data of the same size;
S202, acquiring landmarks of the face by using the open-source face detection algorithm FAN;
S203, acquiring the mask of the face by using a face segmentation algorithm.
For example, since the images in the acquired image data are not of uniform size while the images input to the deep learning network must be, the image data needs to be cropped or upsampled (i.e., enlarged) so that all images share the same size, for example 224 pixels × 224 pixels. Then 68 landmarks (feature points) of the face are obtained using the open-source face detection algorithm FAN, and the mask (face region mask) of the face is obtained using a face segmentation algorithm, which completes the preprocessing of the original image data.
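A minimal sketch of this preprocessing follows, assuming the open-source `face_alignment` package as the FAN implementation; the segmentation step is a placeholder stub, since the text does not name a specific face-parsing algorithm:

```python
import cv2
import numpy as np
import face_alignment

TARGET_SIZE = 224   # network input size used in this scheme

def run_face_segmentation(image: np.ndarray) -> np.ndarray:
    """Placeholder stub standing in for a real face-parsing network."""
    return np.ones(image.shape[:2], dtype=np.uint8)

def preprocess(image_bgr: np.ndarray):
    # 1. Resize/crop so every image fed to the network has the same size.
    image = cv2.resize(image_bgr, (TARGET_SIZE, TARGET_SIZE))

    # 2. 68 facial landmarks via FAN. The enum name varies across package
    #    versions: LandmarksType.TWO_D in recent releases, _2D in older ones.
    fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D,
                                      device="cpu")
    detections = fa.get_landmarks(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    landmarks = detections[0] if detections else None   # (68, 2) array

    # 3. Binary mask of the face region from a segmentation network.
    mask = run_face_segmentation(image)

    return image, landmarks, mask
```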
S103, building a deep learning model, and inputting the preprocessed image data into the deep learning model for model training to obtain a pre-training model;
Illustratively, the deep learning model is based on Resnet50. An image containing a human head is input to the model; note that the input is preprocessed image data, which has been uniformly resized to 224 pixels × 224 pixels. The model outputs the FLAME model parameters and camera parameters corresponding to the picture, wherein the FLAME model parameters comprise shape parameters, pose parameters, and expression parameters.
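As a hedged illustration, such an encoder might look like the following PyTorch sketch: a Resnet50 backbone regressing FLAME and camera parameters from a 224×224 head image. The parameter dimensionalities (100 shape, 50 expression, 6 pose, 3 camera) follow common FLAME fitting setups and are assumptions, not values stated in this application.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FlameParamRegressor(nn.Module):
    """ResNet50 encoder: head image -> FLAME + camera parameters."""
    def __init__(self, n_shape=100, n_exp=50, n_pose=6, n_cam=3):
        super().__init__()
        backbone = resnet50(weights=None)   # or pretrained weights
        backbone.fc = nn.Identity()         # expose the 2048-d feature
        self.backbone = backbone
        self.dims = [n_shape, n_exp, n_pose, n_cam]
        self.head = nn.Linear(2048, sum(self.dims))

    def forward(self, x):                   # x: (B, 3, 224, 224)
        feat = self.backbone(x)
        shape, exp, pose, cam = torch.split(self.head(feat), self.dims, dim=1)
        return {"shape": shape, "exp": exp, "pose": pose, "cam": cam}
```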
S104, inputting the preprocessed face photo data into the pre-training model for inference to obtain the FLAME parameters and camera parameters of the face photo, and adjusting the FLAME parameters required for each frame of the animation;
It should be noted that the preprocessed face photo data is obtained by performing cropping, face key point detection, and face region segmentation on the face photo.
The purpose of this step is to obtain the FLAME parameters and the camera parameters c of the face photo (i.e., the original face photo from which the animation effect is to be generated). The FLAME model parameters comprise shape parameters, pose parameters, and expression parameters. Since the FLAME model is controlled by these parameters, manually modifying them yields a series of modified shape, pose, and expression parameters together with the camera parameters c (the camera parameters c are kept unchanged and need no adjustment).
S105, decoding the camera parameters and the adjusted FLAME parameters by using a FLAME model to obtain a 3D human head model of each frame;
Specifically, decoding the camera parameters and the adjusted FLAME parameters by using a FLAME model includes:
S501, inputting the camera parameters and the adjusted FLAME parameters required for each frame of the animation into the FLAME model for decoding in sequence;
S502, outputting the vertex position information of the 3D human head model corresponding to each frame, thereby obtaining the 3D human head model of that frame.
Exemplarily, the FLAME parameters and camera parameters c required for each frame of the animation, obtained in step S104, are input into the FLAME head model for decoding (decoding means converting the one-dimensional shape, pose, and expression parameters into 3D data; the algorithm is open source), and the 5023×3 vertex coordinates of the FLAME model for that frame are output, i.e., the 3D human head model of the frame is obtained.
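A decoding sketch under these assumptions follows; the call signature mimics the open-source FLAME PyTorch layer, but interfaces differ between implementations, so treat it as illustrative only.

```python
import torch

def decode_frames(flame_layer, shape, pose, per_frame_expressions):
    """shape, pose: (1, n) tensors fixed for the whole clip;
    per_frame_expressions: one (1, n_exp) tensor per animation frame."""
    meshes = []
    with torch.no_grad():
        for exp in per_frame_expressions:
            # Most open-source FLAME layers return the 5023x3 vertex
            # positions of the head mesh (plus 3D landmarks).
            vertices, _ = flame_layer(shape, exp, pose)
            meshes.append(vertices.squeeze(0))   # (5023, 3) per frame
    return meshes
```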
S106, carrying out UV mapping on the face photo to obtain UVmap information of the face photo;
Specifically, the UV mapping of the face photo includes:
S601, projecting the vertices of the 3D head onto a 2D plane of the same size as the original face photo by using the camera parameters to obtain UV coordinates;
S602, dividing the UV coordinates by the width and height of the corresponding original face photo to obtain the UV mapping relation between the vertices of the 3D head and the original face photo;
S603, outputting the color information of each vertex sampled from the original face photo, namely the UVmap information of the 3D human head model corresponding to the original face photo.
Exemplarily, UV mapping is performed on the original face photo, linking the 5023 vertices of the 3D human head model to the original face photo and yielding the color information of each vertex in the original face photo, that is, the UVmap information of the 3D human head model corresponding to the original face photo.
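The projection and normalization can be sketched as follows; a weak-perspective camera with layout cam = [scale, tx, ty] is assumed here, since the text only states that the vertices are projected with the estimated camera parameters and the coordinates are then divided by the photo's width and height.

```python
import numpy as np

def compute_uv(vertices, cam, width, height):
    """vertices: (5023, 3) FLAME mesh; cam: [scale, tx, ty] (assumed layout)."""
    s, tx, ty = cam
    x = s * (vertices[:, 0] + tx)              # project to normalized 2D coords
    y = s * (vertices[:, 1] + ty)
    px = (x * 0.5 + 0.5) * width               # onto a plane the size of the photo
    py = (1.0 - (y * 0.5 + 0.5)) * height      # flip y: image origin is top-left
    return np.stack([px / width, py / height], axis=1)   # divide by W/H -> UV

def sample_vertex_colors(photo, uv):
    """Per-vertex color from the original photo: the 'UVmap' information."""
    h, w = photo.shape[:2]
    cols = np.clip((uv[:, 0] * (w - 1)).astype(int), 0, w - 1)
    rows = np.clip((uv[:, 1] * (h - 1)).astype(int), 0, h - 1)
    return photo[rows, cols]
```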
And S107, rendering the 3D human head model of each frame by using the UVmap information, then rendering the background and displaying the frames in sequence to generate the animation.
Specifically, rendering the 3D human head model of each frame, frame by frame, using the UVmap information, then rendering the background and displaying the frames in sequence to generate the animation (a rendering sketch in code follows this list) includes:
S701, inputting the vertex position information of the 3D human head model into a vertex shader;
S702, converting the camera parameters into a projection matrix and inputting the projection matrix into the vertex shader;
S703, taking the original photo as sampling data and the UV mapping relation as UV coordinates, and rendering the image of each frame in sequence;
S704, rendering the original face photo data to obtain a background rendering image;
S705, taking the background rendering image as the background of each frame of image;
S706, displaying the images of each frame in sequence after the background is added, thereby generating the animation.
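The sketch referenced above follows. It uses moderngl (a thin Python wrapper over OpenGL) rather than raw OpenGL, and a dummy triangle in place of the FLAME mesh; a real run would bind the 5023 vertices, their UVs, the face index buffer, and a projection matrix computed from the camera parameters.

```python
import numpy as np
import moderngl

VERTEX_SHADER = """
#version 330
uniform mat4 mvp;              // projection matrix computed from camera params
in vec3 in_pos;                // 3D vertex position (step S701)
in vec2 in_uv;                 // per-vertex UV from the mapping relation
out vec2 v_uv;
void main() { gl_Position = mvp * vec4(in_pos, 1.0); v_uv = in_uv; }
"""
FRAGMENT_SHADER = """
#version 330
uniform sampler2D photo;       // original face photo as sampling data (S703)
in vec2 v_uv;
out vec4 frag;
void main() { frag = texture(photo, v_uv); }
"""

ctx = moderngl.create_standalone_context()
prog = ctx.program(vertex_shader=VERTEX_SHADER, fragment_shader=FRAGMENT_SHADER)
prog["mvp"].write(np.eye(4, dtype="f4").tobytes())    # identity as stand-in (S702)
prog["photo"].value = 0

# (x, y, z, u, v) per vertex -- a dummy triangle instead of the FLAME mesh.
verts = np.array([[-0.5, -0.5, 0.0, 0.0, 0.0],
                  [ 0.5, -0.5, 0.0, 1.0, 0.0],
                  [ 0.0,  0.5, 0.0, 0.5, 1.0]], dtype="f4")
vbo = ctx.buffer(verts.tobytes())
vao = ctx.vertex_array(prog, [(vbo, "3f 2f", "in_pos", "in_uv")])

photo = (np.random.rand(224, 224, 3) * 255).astype("u1")   # stand-in photo
ctx.texture((224, 224), 3, photo.tobytes()).use(0)

fbo = ctx.simple_framebuffer((224, 224))
fbo.use()
fbo.clear(0.0, 0.0, 0.0, 1.0)                               # black background
vao.render(moderngl.TRIANGLES)
frame = np.frombuffer(fbo.read(), dtype="u1").reshape(224, 224, 3)
```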
It is worth noting that realizing face-photo editing animation based on three-dimensional reconstruction depends on a 3D head-and-face model. Current head-and-face models mainly follow the 3DMM approach, the chief examples being the BFM model and the FLAME model. Because the BFM dataset comes from European faces, it can introduce a certain bias when estimating Asian faces, whereas the dataset of the FLAME model consists of 33,000 3D head scans covering every age group and many types of people, giving it strong generalization ability; the technical scheme of this application therefore selects the FLAME model as the head-and-face model. Combined with a deep learning method, the FLAME model parameters are learned and output from a single picture, and the three-dimensional head model is then generated from these parameters (i.e., the head vertex positions are obtained).
After the 3D model of the human head is obtained, head animation can be realized by changing the vertex positions of the model. For example, animation can be achieved with BlendShape, but the BlendShape approach requires generating new expression bases. The FLAME head model, in contrast, is controlled by its expression parameters, so the vertex changes can be driven by manually adjusting those parameters, making the 3D head model move and realizing actions such as blinking, shaking the head, nodding, and opening the mouth.
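For instance, a blink could be driven by scheduling one expression coefficient over the frames of the clip, as in the sketch below; which coefficient index actually controls the eyelids depends on the expression basis, so the index and target value used here are labelled assumptions.

```python
import numpy as np

BLINK_IDX = 8        # hypothetical index of an eyelid-closing component
BLINK_TARGET = 1.5   # hypothetical coefficient value for fully closed eyes
N_FRAMES = 25

def blink_schedule(base_exp: np.ndarray):
    """base_exp: the (n_exp,) expression vector inferred from the photo.
    Returns one expression vector per frame: eyes open -> closed -> open."""
    frames = []
    for t in np.linspace(0.0, 1.0, N_FRAMES):
        w = np.sin(np.pi * t)                 # 0 -> 1 -> 0 over the clip
        exp = base_exp.copy()
        exp[BLINK_IDX] = (1.0 - w) * base_exp[BLINK_IDX] + w * BLINK_TARGET
        frames.append(exp)
    return frames
```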
After the model animation is generated, the remaining problem to solve is texture: the model must be given skin material so that it looks more realistic. A common current practice is to learn a person's albedo map, normal map, illumination parameters, and so on by a deep learning method, and then obtain the model's color by rendering with illumination. However, this method still falls short of users' expectations in terms of realism. Since the goal is to restore the person's colors from the picture, and those colors are already correct in the picture, this scheme maps the face colors of the original picture directly via UV; the rendered result is then essentially identical to the original face photo in realism, and combined with the 3D-model-based animation, a realistic face-photo editing animation can be achieved.
The method realizes editing animation from a three-dimensionally reconstructed face photo. It makes full use of the combination of the strong "learning" capability of deep learning with an existing head-and-face model, and obtains the model's texture representation from the original face photo, thereby solving the problem that the FLAME model has no texture representation and that textures generated by other schemes look unreal; the generated face animation has a strong sense of realism and high playability. The method is further described below with an implementation of a face-photo blinking animation; note that the following operations assume that the pre-training model has already been built and trained:
1. the user inputs photo data containing a human face;
2. the user's original picture is preprocessed by cropping, face key point detection, and face region segmentation to obtain preprocessed data;
3. the preprocessed data is fed into the deep learning pre-training model for inference, yielding the shape, pose, and expression parameters of the FLAME model, the camera parameters c, and other related data;
4. in the FLAME model, the coefficients within the expression parameters that control the blink action are adjusted: the blink-related coefficients in the inferred expression parameters are replaced with a series of adjusted coefficients;
5. the series of FLAME parameters obtained after inference and adjustment is fed into the FLAME model for decoding in sequence, yielding the vertex position information of the 3D head model corresponding to each frame;
6. the FLAME parameters obtained in step 3 are fed into the FLAME model for decoding to obtain the 3D head vertices corresponding to the original picture; these vertices are projected onto a 2D plane of the same size as the original face photo using the camera parameters c from step 3, and the resulting UV coordinates are divided by the photo's width and height to obtain the UV mapping relation associated with the original face photo;
7. rendering: the cross-platform OpenGL is chosen for this step; to render one frame, the 3D vertex data from step 5 is input into the vertex shader, and c from step 3 is converted into a projection matrix that is also input into the vertex shader; meanwhile, the original face picture serves as sampling data and the UV mapping relation from step 6 as UV coordinates, so that one frame is rendered; all frames are rendered by the same method;
8. because step 7 renders only the reconstructed face data and contains no other information, the original face photo data must be supplemented; to this end, the original face photo is rendered and used as the background for step 7 (see the compositing sketch after this list);
9. each frame generated in steps 7 and 8 is displayed in sequence, producing the blinking animation of the original picture.
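A compositing sketch for steps 8 and 9 follows; the assumption that uncovered pixels in the face render are black (cleared to zero) is ours, not something stated in this application.

```python
import numpy as np

def composite(face_render: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Overlay the rendered face on the background image (same H x W x 3)."""
    covered = face_render.any(axis=2, keepdims=True)   # non-black = face pixel
    return np.where(covered, face_render, background)

def make_animation(face_frames, background):
    """Step 9: composite every frame, ready to be displayed in sequence."""
    return [composite(f, background) for f in face_frames]
```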
Example two
The second embodiment of the present application relates to a device for generating an animation based on a three-dimensional reconstructed face photograph, which includes:
the acquisition module is used for acquiring image data suitable for a special-effect application scene, wherein the data comprises photos and video frames, each containing a person's head;
the preprocessing module is used for preprocessing the image data;
the model building and training module is used for building a deep learning model and inputting the preprocessed image data into the deep learning model for model training to obtain a pre-training model;
the inference module is used for inputting the preprocessed face photo data into the pre-training model for inference to obtain the FLAME parameters and camera parameters of the face photo, and for adjusting the FLAME parameters required for each frame of the animation;
the decoding module is used for decoding the camera parameters and the adjusted FLAME parameters by using a FLAME model to obtain a 3D human head model of each frame;
the UV mapping module is used for carrying out UV mapping on the face photo to obtain UVmap information of the face photo;
and the rendering module is used for rendering the 3D human head model of each frame by using the UVmap information, then rendering the background and displaying the frames in sequence to generate the animation.
EXAMPLE III
A computer-readable storage medium according to the third embodiment of the present application stores program code for execution by a device, the program code comprising instructions for performing the steps of the method in any one of the implementations of the first embodiment;
The computer-readable storage medium may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM); the computer-readable storage medium may store program code, and when the stored program is executed by a processor, the processor is configured to perform the steps of the method in any one of the implementations of the first embodiment.
Example four
An electronic device according to the fourth embodiment of the present application comprises a processor, a memory, and a program or instructions stored on the memory and executable on the processor; when executed by the processor, the program or instructions implement the method in any one of the implementations of the first embodiment;
The processor may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute a related program so as to implement the method in any one of the implementations of the first embodiment of the present application.
The processor may also be an integrated circuit electronic device having signal processing capabilities. In the implementation process, each step of the method in any one implementation manner of the embodiment of the present application may be implemented by an integrated logic circuit of hardware in a processor or an instruction in the form of software.
The processor may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be carried out directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or a register. The storage medium is located in the memory; the processor reads the information in the memory and, in combination with its hardware, performs the functions required of the units included in the data processing apparatus of the embodiments of the present application, or performs the method in any one of the implementations of the first embodiment.
The above are merely preferred embodiments of the present application; the scope of protection of the present application is not limited thereto. Any equivalent modification or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application.

Claims (10)

1. A method for generating animation based on a human face photo of three-dimensional reconstruction is characterized by comprising the following steps:
acquiring image data suitable for a special-effect application scene, wherein the data comprises photos and video frames, each of which contains a person's head;
preprocessing the image data;
building a deep learning model, and inputting the preprocessed image data into the deep learning model for model training to obtain a pre-training model;
inputting the preprocessed face photo data into the pre-training model for inference to obtain the FLAME parameters and camera parameters of the face photo, and adjusting the FLAME parameters required for each frame of the animation;
decoding the camera parameters and the adjusted FLAME parameters by using a FLAME model to obtain a 3D human head model of each frame;
carrying out UV mapping on the face photo to obtain UVmap information of the face photo;
and rendering the 3D human head model of each frame by using the UVmap information, then rendering the background and displaying the frames in sequence to generate the animation.
2. The method of claim 1, wherein the preprocessing of the data comprises:
cropping the photos and the video frames to obtain image data of the same size;
acquiring landmarks of the face by using the open-source face detection algorithm FAN;
and acquiring the mask of the face by using a face segmentation algorithm.
3. The method for generating an animation based on a three-dimensional reconstructed human face photo as claimed in claim 1, wherein the deep learning model is based on Resnet50; an image containing a human head is input to the model, and the FLAME model parameters and camera parameters corresponding to the photo are output, wherein the FLAME model parameters comprise shape parameters, pose parameters, and expression parameters.
4. The method for generating animation based on three-dimensional reconstruction of human face photo as claimed in claim 1, wherein the preprocessed human face photo data is obtained by clipping human face photo, detecting human face key points and segmenting human face area.
5. The method for generating an animation based on a three-dimensional reconstructed human face photo as claimed in claim 1, wherein decoding the camera parameters and the adjusted FLAME parameters by using a FLAME model comprises:
inputting the camera parameters and the adjusted FLAME parameters required for each frame of the animation into the FLAME model for decoding in sequence;
and outputting the vertex position information of the 3D human head model corresponding to each frame, namely obtaining the 3D human head model of the frame.
6. The method for generating animation based on three-dimensional reconstruction facial photo according to claim 5, wherein the UV mapping is performed on the facial photo and comprises the following steps:
projecting the vertex of the 3D head to a 2D plane with the same size as the original face photo by using camera parameters to obtain UV coordinates;
dividing the UV coordinates by the width and height of the corresponding original face photo to obtain a UV mapping relation between the vertex of the 3D head and the original face photo;
and outputting the color information of each vertex corresponding to the original face photo, namely the UVmap information of the 3D human head model corresponding to the original face photo.
7. The method for generating an animation based on a three-dimensional reconstructed face photo as claimed in claim 6, wherein rendering the 3D human head model of each frame, frame by frame, by using the UVmap information, then rendering the background and displaying the frames in sequence to generate the animation comprises:
inputting the vertex position information of the 3D human head model into a vertex shader;
calculating the camera parameters into a projection matrix, and inputting the projection matrix into a vertex shader;
taking the original picture as sampling data and the UV mapping relation as a UV coordinate, and rendering the image of each frame in sequence;
rendering the original face photo data to obtain a background rendering image;
taking the background rendering image as the background of each frame of image;
and sequentially displaying the images of each frame after the background is added, so that the animation can be generated.
8. An apparatus for generating animation based on human face photo of three-dimensional reconstruction, comprising:
the acquisition module is used for acquiring image data suitable for a special-effect application scene, wherein the data comprises photos and video frames, each containing a person's head;
the preprocessing module is used for preprocessing the image data;
the model building and training module is used for building a deep learning model and inputting the preprocessed image data into the deep learning model for model training to obtain a pre-training model;
the inference module is used for inputting the preprocessed face photo data into the pre-training model for inference to obtain the FLAME parameters and camera parameters of the face photo, and for adjusting the FLAME parameters required for each frame of the animation;
the decoding module is used for decoding the camera parameters and the adjusted FLAME parameters by using a FLAME model to obtain a 3D human head model of each frame;
the UV mapping module is used for carrying out UV mapping on the face photo to obtain UVmap information of the face photo;
and the rendering module is used for rendering the 3D human head model of each frame by using the UVmap information, then rendering the background and displaying the frames in sequence to generate the animation.
9. A computer-readable storage medium, characterized in that the computer-readable medium stores program code for execution by a device, the program code comprising steps for performing the method according to any one of claims 1-7.
10. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, which when executed by the processor, implement the steps of the method of any one of claims 1-7.
CN202211609505.9A 2022-12-14 2022-12-14 Method and device for generating animation by using human face photo based on three-dimensional reconstruction Pending CN115830239A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211609505.9A CN115830239A (en) 2022-12-14 2022-12-14 Method and device for generating animation by using human face photo based on three-dimensional reconstruction


Publications (1)

Publication Number Publication Date
CN115830239A (en) 2023-03-21

Family

ID=85545715



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination