CN113395476A - Virtual character video call method and system based on three-dimensional face reconstruction - Google Patents

Virtual character video call method and system based on three-dimensional face reconstruction

Info

Publication number
CN113395476A
CN113395476A (application CN202110632937.0A)
Authority
CN
China
Prior art keywords
dimensional face
video
communication terminal
model parameters
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110632937.0A
Other languages
Chinese (zh)
Inventor
杨志景
温瑞冕
徐永宗
李为杰
李凯
凌永权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN202110632937.0A
Publication of CN113395476A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/275Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/08Indexing scheme for image data processing or generation, in general involving all processing steps from image acquisition to 3D model generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a virtual character video call method and system based on three-dimensional face reconstruction, aiming at overcoming the defects of low video call fluency and low flexibility. The method comprises the following steps: acquiring a video stream and an audio stream of a first communication terminal, or acquiring only the audio stream of the first communication terminal; inputting video stream image frames into a three-dimensional face reconstruction network to obtain predicted three-dimensional face model parameters, or inputting the audio stream into an audio prediction network to obtain predicted three-dimensional face model parameters; merging the predicted three-dimensional face model parameters with preset initial three-dimensional face model parameters and saving the updated parameters as a parameter file; transmitting the parameter file and the audio stream of the first communication terminal to a second communication terminal, where the second communication terminal restores the corresponding three-dimensional face model from the parameter file using a three-dimensional face reconstruction technique and maps it to a two-dimensional image plane to obtain a restored sequence of video image frames; and rendering the video image frame sequence and then synthesizing it with the audio stream into a virtual character video.

Description

Virtual character video call method and system based on three-dimensional face reconstruction
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a virtual character video call method and a virtual character video call system based on three-dimensional face reconstruction.
Background
With the popularization of smart phones and the rapid development of internet technologies, the ways people communicate have changed greatly, and video calls have become a popular mode of communication; they nevertheless have certain limitations in practice. First, in current video calls, when it is inconvenient for one communication end to turn on its camera, a normal video call with the other end cannot be carried out, so video calls lack a certain flexibility. Second, when a video call is made in an area with low network transmission capacity, the video stutters, which greatly degrades the user experience.
A video call method that constructs a virtual character has already been proposed; it can reduce the amount of transmitted data to improve call fluency, and can also make video calls more engaging by replacing the character identity of the opposite communication terminal. For example, the virtual instant messaging method disclosed in publication No. CN110213521A (published 2019-09-06) proposes using a virtual 2D/3D avatar model with the same expression and posture as the two parties to replace their real appearance during virtual instant messaging. However, that method requires a terminal camera to capture the user's face at all times, does not acquire head posture information, and thus still depends heavily on the camera; the problems of low flexibility and low video call fluency remain.
Disclosure of Invention
In order to overcome the defects of low video call fluency and low flexibility in the prior art, the invention provides a virtual character video call method based on three-dimensional face reconstruction and a virtual character video call system based on three-dimensional face reconstruction.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a virtual character video call method based on three-dimensional face reconstruction comprises the following steps:
s1: selecting virtual character video call modes, including a video-to-video call mode and an audio-to-video call mode:
when the video-to-video call mode is selected: acquiring a video stream and an audio stream of a first communication terminal, decomposing the video stream into image frames, and then performing model parameter prediction on a parameterized three-dimensional face by using a three-dimensional face reconstruction network to obtain predicted three-dimensional face model parameters and storing the predicted three-dimensional face model parameters as a parameter file;
when the audio-to-video call mode is selected: acquiring an audio stream of a first communication terminal, inputting the audio stream into an audio prediction network to perform model parameter prediction of a parameterized three-dimensional face, and obtaining predicted three-dimensional face model parameters; merging and updating preset initial three-dimensional face model parameters according to the predicted three-dimensional face model parameters, and then saving the parameters as a parameter file;
s2: transmitting the parameter file and the audio stream of the first communication terminal to a second communication terminal, restoring a corresponding three-dimensional face model by the second communication terminal according to the parameter file by using a three-dimensional face reconstruction technology, and mapping the three-dimensional face model to a two-dimensional image plane to obtain a restored video image frame sequence;
s3: and rendering the video image frame sequence and then synthesizing the video image frame sequence and the audio stream into a virtual character video.
As a preferred scheme, the three-dimensional face model parameters obtained by the three-dimensional face reconstruction network prediction comprise an identity model parameter S, an expression model parameter E, a texture model parameter T, a posture model parameter P and an illumination model parameter L; the three-dimensional face model parameters obtained through the audio prediction network prediction comprise expression model parameters E and posture model parameters P.
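As an illustration of the parameter groups listed above, the following sketch shows one possible in-memory layout of a parameter file; the coefficient dimensions (80 identity, 64 expression, 80 texture, 6 posture, 27 illumination) follow common 3DMM conventions and are assumptions, not values stated in this disclosure.

```python
import numpy as np

def make_face_params(rng):
    """Illustrative parameter file contents; dimensions follow common 3DMM
    conventions (assumptions, not values stated in this disclosure)."""
    return {
        "S": rng.standard_normal(80),  # identity model parameters
        "E": rng.standard_normal(64),  # expression model parameters
        "T": rng.standard_normal(80),  # texture model parameters
        "P": rng.standard_normal(6),   # posture: 3 rotation + 3 translation
        "L": rng.standard_normal(27),  # illumination (9 SH coefficients x RGB)
    }

params = make_face_params(np.random.default_rng(0))
# The audio prediction network only produces E and P; the remaining groups
# come from the preset initial parameter file stored on the terminal.
audio_predicted = {k: params[k] for k in ("E", "P")}
```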
Preferably, the method further comprises the following steps: performing optimization training on the three-dimensional face reconstruction network according to the video stream image frames, the predicted three-dimensional face model parameters and the restored three-dimensional face model, the training objective being expressed as:

θ* = argmin_θ ‖ Î − ω(f_θ(Î)) ‖²

where θ denotes the network parameters of the three-dimensional face reconstruction network, Î denotes an original video stream image frame, f_θ(·) denotes the prediction function learned by the three-dimensional face reconstruction network, and ω(·) denotes the function that maps the restored three-dimensional face model to the two-dimensional image plane.
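The training objective above can be pictured with a toy numerical sketch; `f_theta` and `omega` below are linear stand-ins for the reconstruction network and the renderer, not the actual R-Net.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8                                   # toy image size
I = rng.random((H, W, 3))                   # stand-in for an original frame I_hat

def f_theta(img, theta):
    """Stand-in for f_theta: image -> predicted 3D face model parameters."""
    return theta @ img.reshape(-1)

def omega(params):
    """Stand-in for omega: restore the 3D face and map it back to the image plane."""
    return np.clip(params.reshape(H, W, 3), 0.0, 1.0)

theta = 0.01 * rng.random((H * W * 3, H * W * 3))
# Photometric training objective: || I - omega(f_theta(I)) ||^2
loss = np.sum((I - omega(f_theta(I, theta))) ** 2)
```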
Preferably, the three-dimensional face reconstruction network comprises an R-Net network.
Preferably, the method further comprises the following steps: performing optimization training on the audio prediction network according to the audio stream, the predicted three-dimensional face model parameters and the preset initial three-dimensional face model parameters, the training objectives being expressed as:

θ1* = argmin_{θ1} ‖ Ê − h_E(Â; θ1) ‖²
θ2* = argmin_{θ2} ‖ P̂ − h_P(Â; θ2) ‖²

where θ1 and θ2 are the network parameters of the audio prediction network, Ê denotes the preset initial expression model parameters, P̂ denotes the preset initial posture model parameters, and Â denotes the audio stream; h_E(·) denotes the expression feature prediction function learned by the audio prediction network, and h_P(·) denotes the posture feature prediction function learned by the audio prediction network.
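A minimal sketch of the audio prediction step, assuming MFCC-style audio features and a single hand-rolled LSTM cell with two linear heads standing in for h_E and h_P; all dimensions and the feature choice are illustrative assumptions, not details from this disclosure.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step; gate order: input, forget, cell, output."""
    z = W @ x + U @ h + b
    n = h.size
    i = 1.0 / (1.0 + np.exp(-z[:n]))         # input gate
    f = 1.0 / (1.0 + np.exp(-z[n:2 * n]))    # forget gate
    g = np.tanh(z[2 * n:3 * n])              # candidate cell state
    o = 1.0 / (1.0 + np.exp(-z[3 * n:]))     # output gate
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

rng = np.random.default_rng(0)
d_in, d_h = 13, 32                           # e.g. 13 MFCC features per audio frame
W = 0.1 * rng.standard_normal((4 * d_h, d_in))
U = 0.1 * rng.standard_normal((4 * d_h, d_h))
b = np.zeros(4 * d_h)
W_E = 0.1 * rng.standard_normal((64, d_h))   # head h_E: hidden state -> expression E
W_P = 0.1 * rng.standard_normal((6, d_h))    # head h_P: hidden state -> posture P

h = np.zeros(d_h)
c = np.zeros(d_h)
for _ in range(25):                          # a short window of audio frames
    x = rng.standard_normal(d_in)            # stand-in for one audio feature frame
    h, c = lstm_step(x, h, c, W, U, b)
E_pred, P_pred = W_E @ h, W_P @ h
```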
Preferably, the audio prediction network comprises an LSTM network.
Preferably, in the step S2, the specific step of restoring the corresponding three-dimensional face model by using the three-dimensional face reconstruction technique includes:
initializing a set of vertices of a three-dimensional face
Figure BDA0003104354100000034
And RGB set corresponding to three-dimensional face vertex set
Figure BDA0003104354100000035
Changing the position of the three-dimensional face vertex set according to the expression model parameter E and the posture model parameter P in the parameter file, and changing the color value of the RGB set corresponding to the three-dimensional face vertex set according to the texture model parameter T and the illumination model parameter L in the parameter file, wherein the expression formula is as follows:
Figure BDA0003104354100000036
Figure BDA0003104354100000037
in the formula (I), the compound is shown in the specification,
Figure BDA0003104354100000038
an identity base representing a three-dimensional face,
Figure BDA0003104354100000039
an expression base representing a three-dimensional face,
Figure BDA00031043541000000310
a texture base representing a three-dimensional face; x (lambda; P) represents a function for changing the position of the three-dimensional face vertex set according to the posture model parameter P, and lambda represents the vertex set of the position to be changed; c (epsilon; L) represents a function for changing the RGB set corresponding to the three-dimensional face vertex set according to the illumination model parameter L, wherein epsilon is the RGB set of the color value to be changed; n is a radical of1、N2The total numbers of the identity bases and the expression bases are respectively, and the subscripts i and j are respectively the ordinal numbers of the identity bases and the expression bases;
according to the vertex set S of the changed three-dimensional face*And RGB set T*And constructing a recovered three-dimensional face model, mapping each vertex in the three-dimensional face model to a two-dimensional image plane by affine transformation, and mapping the RGB color value of each vertex to the two-dimensional image plane correspondingly to serve as a pixel point of the mapping point to obtain a recovered video image frame.
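The restoration and mapping steps above can be sketched numerically; the basis counts, the z-axis rotation standing in for X(·; P), and the weak-perspective projection standing in for the affine mapping are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                    # toy vertex count
S_bar = rng.random((N, 3))                 # initialized vertex set S_bar
B_id = rng.standard_normal((N, 3, 4))      # identity bases (toy N1 = 4)
B_exp = rng.standard_normal((N, 3, 5))     # expression bases (toy N2 = 5)
S_coef = 0.01 * rng.standard_normal(4)     # identity parameters S
E_coef = 0.01 * rng.standard_normal(5)     # expression parameters E

# S* = X(S_bar + sum_i S_i B_i^id + sum_j E_j B_j^exp ; P)
verts = S_bar + B_id @ S_coef + B_exp @ E_coef

# X(.; P): a rigid z-axis rotation plus translation stands in for posture P
angle, t = 0.1, np.array([0.0, 0.0, 2.0])
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0,            0.0,           1.0]])
verts = verts @ Rz.T + t

# Map each vertex to the 2D image plane (weak-perspective stand-in)
focal = 50.0
uv = focal * verts[:, :2] / verts[:, 2:3]
```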
Preferably, the step S2 further includes the following steps: after the parameter file is compressed, the parameter file is transmitted to a second communication terminal by adopting cloud service; and transmitting the audio stream of the first communication terminal to a second communication terminal by adopting a network protocol.
Preferably, the method further comprises the following steps: the second communication terminal is preset with identity model parameters S of other characters, and when the second communication terminal receives the parameter file and the audio stream transmitted by the first communication terminal, the preset identity model parameters S replace the identity model parameters in the parameter file; the corresponding three-dimensional face model is then restored using a three-dimensional face reconstruction technique.
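The identity replacement described above amounts to overwriting one entry of the received parameter file before restoration; a minimal sketch, with all names and dimensions hypothetical:

```python
import numpy as np

def replace_identity(param_file, preset_identity):
    """Return a copy of the received parameter file with its identity
    parameters S replaced by a preset identity (names are illustrative)."""
    out = dict(param_file)
    out["S"] = np.asarray(preset_identity)
    return out

received = {"S": np.zeros(80), "E": np.ones(64), "P": np.zeros(6)}
preset_S = np.full(80, 0.5)                # hypothetical preset character identity
swapped = replace_identity(received, preset_S)
```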
The invention also provides a virtual character video call system based on three-dimensional face reconstruction, which is applied to the virtual character video call method provided by any technical scheme, and the virtual character video call system comprises a first communication terminal and a second communication terminal, wherein the first communication terminal and the second communication terminal respectively comprise a video acquisition module, an audio acquisition module, a display module, a communication module and a main control module; wherein:
the video acquisition module is used for acquiring video streams and sending the video streams to the main control module;
the audio acquisition module is used for acquiring audio streams and sending the audio streams to the main control module;
the main control module decomposes the video stream into image frames according to the currently selected virtual character video call mode, and then carries out model parameter prediction on a parameterized three-dimensional face by adopting a three-dimensional face reconstruction network to obtain predicted three-dimensional face model parameters and store the predicted three-dimensional face model parameters as a parameter file;
or inputting the audio stream into an audio prediction network to carry out model parameter prediction of a parameterized three-dimensional face to obtain predicted three-dimensional face model parameters, merging and updating preset initial three-dimensional face model parameters according to the predicted three-dimensional face model parameters, and then saving the parameters as a parameter file;
the master control module transmits the generated parameter file to another communication terminal through the communication module;
the main control module is also used for restoring a corresponding three-dimensional face model by using a three-dimensional face reconstruction technology according to the parameter file received by the communication module and then mapping the three-dimensional face model to a two-dimensional image plane to obtain a restored video image frame sequence; and rendering the video image frame sequence, synthesizing the video image frame sequence with the audio stream to form a virtual character video, and transmitting the virtual character video to the display module for displaying.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
according to the invention, a three-dimensional face reconstruction technology is utilized, image frames captured by a camera are subjected to three-dimensional face reconstruction to obtain parameterized three-dimensional face model parameters for transmission, or the three-dimensional face model parameters of a communication terminal user are predicted from a recorded audio stream and then are transmitted, so that the data transmission quantity of a communication terminal can be reduced, and the smoothness of video call is effectively improved;
the invention can predict the three-dimensional face model parameters of the communication terminal user only from the recorded audio stream, and can recover the complete three-dimensional face by combining the preset initial three-dimensional face model parameters, thereby realizing video call under the condition of closing the camera.
Drawings
Fig. 1 is a flowchart of a virtual character video call method based on three-dimensional face reconstruction in embodiment 1.
Fig. 2 is a schematic diagram of a virtual character video call method according to embodiment 1.
Fig. 3 is a schematic diagram of a video-to-video one-way avatar video call in embodiment 1.
Fig. 4 is a schematic diagram of an audio-to-video one-way avatar video call in embodiment 1.
Fig. 5 is a schematic view of a virtual character video call for replacing the virtual character identity according to embodiment 2.
Fig. 6 is a schematic view of a virtual character video call for replacing the virtual character identity according to embodiment 2.
Fig. 7 is a schematic structural diagram of a virtual character video call system based on three-dimensional face reconstruction in embodiment 3.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The first communication terminal proposed in this embodiment refers to a data transmitting end, and the second communication terminal refers to a data receiving end.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment provides a virtual character video call method based on three-dimensional face reconstruction; its flow is shown in Fig. 1 and Fig. 2.
The virtual character video call method based on three-dimensional face reconstruction provided by the embodiment comprises the following steps:
step 1: the selected video-to-video communication mode or audio-to-video communication mode is as follows:
when the video-to-video call mode is selected: acquiring a video stream and an audio stream of a first communication terminal, decomposing the video stream into image frames, and then performing model parameter prediction on a parameterized three-dimensional face by using a three-dimensional face reconstruction network to obtain predicted three-dimensional face model parameters and storing the predicted three-dimensional face model parameters as a parameter file;
when the audio-to-video call mode is selected: the method comprises the steps of obtaining an audio stream of a first communication terminal, inputting the audio stream into an audio prediction network to carry out model parameter prediction on a parameterized three-dimensional face to obtain predicted three-dimensional face model parameters, merging and updating preset initial three-dimensional face model parameters according to the predicted three-dimensional face model parameters, and storing the parameters as parameter files.
The three-dimensional face model parameters obtained through the three-dimensional face reconstruction network prediction comprise an identity model parameter S, an expression model parameter E, a texture model parameter T, a posture model parameter P and an illumination model parameter L; the three-dimensional face model parameters obtained through the audio prediction network prediction comprise expression model parameters E and posture model parameters P. In addition, the preset initial three-dimensional face model parameters in this embodiment include an identity model parameter S, an expression model parameter E, a texture model parameter T, a pose model parameter P, and an illumination model parameter L, and the initial three-dimensional face model parameters are obtained by performing three-dimensional face reconstruction through a face image of the front face of a user character shot in advance before a virtual character video call is performed.
The three-dimensional face reconstruction network in this step adopts an R-Net network, and the network is optimally trained according to the video stream image frames, the predicted three-dimensional face model parameters and the restored three-dimensional face model, the training objective being expressed as:

θ* = argmin_θ ‖ Î − ω(f_θ(Î)) ‖²

where θ denotes the network parameters of the three-dimensional face reconstruction network, Î denotes an original video stream image frame, f_θ(·) denotes the prediction function learned by the three-dimensional face reconstruction network, and ω(·) denotes the function that maps the restored three-dimensional face model to the two-dimensional image plane.
The audio prediction network in this step adopts an LSTM network, and the network is optimally trained according to the audio stream, the predicted three-dimensional face model parameters and the preset initial three-dimensional face model parameters, the training objectives being expressed as:

θ1* = argmin_{θ1} ‖ Ê − h_E(Â; θ1) ‖²
θ2* = argmin_{θ2} ‖ P̂ − h_P(Â; θ2) ‖²

where θ1 and θ2 are the network parameters of the audio prediction network, Ê denotes the preset initial expression model parameters, P̂ denotes the preset initial posture model parameters, and Â denotes the audio stream; h_E(·) denotes the expression feature prediction function learned by the audio prediction network, and h_P(·) denotes the posture feature prediction function learned by the audio prediction network.
Step 2: and transmitting the parameter file and the audio stream of the first communication terminal to a second communication terminal, restoring the corresponding three-dimensional face model by the second communication terminal according to the parameter file by using a three-dimensional face reconstruction technology, and mapping the three-dimensional face model to a two-dimensional image plane to obtain a restored video image frame sequence.
In this step, the parameter file is compressed, then the cloud service is adopted to transmit the parameter file to the second communication terminal, and a network protocol is adopted to transmit the audio stream of the first communication terminal to the second communication terminal.
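To illustrate why transmitting a compressed parameter file needs far less bandwidth than raw video, the sketch below serializes an audio-mode parameter set (E and P only) and compresses it with zlib as a stand-in for the zip compression described; formats and sizes are illustrative assumptions.

```python
import io
import zlib

import numpy as np

rng = np.random.default_rng(0)
# Audio-mode parameter file: only expression E and posture P per frame
params = {"E": rng.standard_normal(64).astype(np.float32),
          "P": rng.standard_normal(6).astype(np.float32)}

buf = io.BytesIO()
np.savez(buf, **params)                    # stand-in for a .mat / .yml parameter file
packed = zlib.compress(buf.getvalue())     # stand-in for zip compression

frame_bytes = 640 * 480 * 3                # one uncompressed VGA frame, for comparison
print(f"parameter file: {len(packed)} bytes; raw frame: {frame_bytes} bytes")
```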
Further, the specific steps of restoring the corresponding three-dimensional face model by using the three-dimensional face reconstruction technology in the step include:
initializing a vertex set S̄ of the three-dimensional face and the RGB set T̄ corresponding to the vertex set; changing the positions of the three-dimensional face vertex set according to the expression model parameter E and the posture model parameter P in the parameter file, and changing the color values of the corresponding RGB set according to the texture model parameter T and the illumination model parameter L in the parameter file, expressed as:

S* = X( S̄ + Σ_{i=1}^{N1} S_i·B_i^{id} + Σ_{j=1}^{N2} E_j·B_j^{exp} ; P )
T* = C( T̄ + Σ_k T_k·B_k^{tex} ; L )

where B^{id} denotes the identity bases of the three-dimensional face, B^{exp} denotes the expression bases of the three-dimensional face, and B^{tex} denotes the texture bases of the three-dimensional face; X(Λ; P) denotes the function that changes the positions of the vertex set according to the posture model parameter P, with Λ the vertex set whose positions are to be changed; C(ε; L) denotes the function that changes the RGB set corresponding to the vertex set according to the illumination model parameter L, with ε the RGB set whose color values are to be changed; N1 and N2 are the total numbers of identity bases and expression bases respectively, and the subscripts i and j are the ordinals of the identity bases and expression bases;

constructing the restored three-dimensional face model according to the changed vertex set S* and RGB set T* of the three-dimensional face, mapping each vertex of the three-dimensional face model to the two-dimensional image plane by affine transformation, and mapping the RGB color value of each vertex to the corresponding mapped point as its pixel value, thereby obtaining a restored video image frame.
And step 3: and rendering the video image frame sequence and then synthesizing the video image frame sequence and the audio stream into a virtual character video.
In this step, a face renderer based on a generative adversarial network is used to render the restored video image frame sequence, improving its realism; the rendered frame sequence is then synthesized with the received audio stream using multimedia processing technology to obtain the virtual character video, which is displayed.
In a specific implementation process, the method for virtual character video call provided in this embodiment is applied to a video-to-video call mode, and a flow diagram of the method is shown in fig. 3.
When the virtual character video call is carried out, the first communication terminal continuously captures video stream and audio stream, the video stream is decomposed into image frame sequences by adopting a multimedia processing technology, then model parameter prediction of a parameterized three-dimensional face is carried out by adopting a three-dimensional face reconstruction network, and predicted three-dimensional face model parameters are obtained and stored as parameter files in a mat format or a yml format. Compressing the parameter file in the mat format or yml format into a file in the zip format, then transmitting the file by using cloud service, and compressing the audio stream into an mp3 file, then transmitting the file by using a network protocol. And the second communication terminal recovers the corresponding three-dimensional face model by using a three-dimensional face reconstruction technology according to the received parameter file, then maps the three-dimensional face model to a two-dimensional image plane to obtain a recovered video image frame sequence, and then renders the video image frame sequence to synthesize a virtual character video with the audio stream.
This embodiment addresses the problem of poor video call fluency in areas with poor networks. Using a three-dimensional face reconstruction technique, the image frames captured by the camera are reconstructed into parameterized three-dimensional face model parameters, which contain almost all the information of the portrait in the image. The communication terminal therefore only needs to transmit the complete three-dimensional face model parameters and the audio stream to the opposite terminal to complete the data transmission of the video call, reducing the amount of data to be transmitted and improving the fluency of the video call. When the user selects the audio-to-video call mode, the data to be transmitted is likewise only the complete three-dimensional face model parameters and the audio stream.
In another specific implementation process, the method for virtual character video call provided in this embodiment is applied to a call mode from audio to video, and a flow diagram of the method is shown in fig. 4.
Before the virtual character video call is carried out, a front face image of a single portrait needs to be shot in advance to carry out three-dimensional face reconstruction, so that initial three-dimensional face model parameters are obtained and stored on a corresponding communication terminal.
During the virtual character video call, the first communication terminal collects only audio stream data. The audio stream is fed into an audio prediction network built on an LSTM network to predict the model parameters of the parameterized three-dimensional face, yielding predicted expression model parameters and pose model parameters. These are merged with, and used to update, the initial three-dimensional face model parameters stored on the current communication terminal, and the result is saved as a parameter file in mat or yml format.
As before, the parameter file is compressed and transmitted via a cloud service, and the audio stream is transmitted over a network protocol. From the received parameter file, the second communication terminal restores the corresponding three-dimensional face model using a three-dimensional face reconstruction technique, maps the model to the two-dimensional image plane to obtain the restored sequence of video image frames, and then renders that sequence and synthesizes it with the audio stream into a virtual character video.
This embodiment addresses scenarios in which a conventional video call is restricted. Using an audio prediction method that combines three-dimensional face reconstruction with deep learning, the expression and head-pose model parameters of the terminal user's parameterized three-dimensional face are predicted from the recorded audio stream while the camera remains off, and are then merged with the three-dimensional face model parameters preset on the communication terminal. The communication terminal can restore the complete three-dimensional face from the complete set of model parameters and map it to the two-dimensional image plane to obtain the corresponding image frames, thereby realizing the audio-to-video call mode.
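The merge-and-update step on this audio-to-video path can be sketched as follows. Here `h_E` and `h_P` are hypothetical stand-ins for the LSTM audio prediction network, and all parameter sizes are illustrative assumptions:

```python
# Minimal sketch of the merge-and-update step on the audio-to-video path.
# h_E and h_P are hypothetical stand-ins for the LSTM audio prediction
# network; only expression (E) and pose (P) parameters are predicted from
# audio, while identity (S), texture (T) and illumination (L) come from
# the initial frontal-face reconstruction stored on the terminal.
def h_E(audio_window):
    return [0.1] * 64                       # predicted expression parameters (stub)

def h_P(audio_window):
    return [0.0, 0.0, 0.1, 0.0, 0.0, 0.0]   # predicted pose parameters (stub)

def merge_with_initial(initial_params, audio_window):
    """Overwrite E and P with audio-predicted values, keeping S, T, L."""
    merged = dict(initial_params)           # copy, leaving the stored set intact
    merged["E"] = h_E(audio_window)
    merged["P"] = h_P(audio_window)
    return merged

# Initial parameters from the pre-captured frontal face image (illustrative).
initial = {"S": [1.0] * 80, "E": [0.0] * 64, "T": [0.5] * 80,
           "P": [0.0] * 6, "L": [0.2] * 27}
frame_params = merge_with_initial(initial, audio_window=None)
```

The merged dictionary is what would be serialized to the per-frame parameter file and transmitted; the stored initial parameters themselves are never modified.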
Example 2
The present embodiment provides a virtual character video call method in which the virtual character's identity can be replaced, building on the virtual character video call method based on three-dimensional face reconstruction provided in embodiment 1. Fig. 5 to 6 are schematic diagrams of a virtual character video call with a replaced virtual character identity according to this embodiment.
In this embodiment, the method further includes the following step: the second communication terminal is preset with identity model parameters S of other figures; when it receives the parameter file and audio stream transmitted by the first communication terminal, it substitutes its preset identity model parameters S for the identity model parameters in the parameter file and then restores the corresponding three-dimensional face model using a three-dimensional face reconstruction technique.
In a specific implementation, the communication terminal can freely select among other pre-stored virtual character identities; that is, the terminal stores the identity model parameters S of the corresponding character identities in advance, and the virtual character identity can be changed at any time during the video call.
When the communication terminal selects another virtual character identity, it takes the corresponding identity model parameters S from those stored in advance and substitutes them for the identity model parameters in the currently received parameter file, while the expression model parameters E, texture model parameters T, pose model parameters P and illumination model parameters L remain unchanged; the corresponding three-dimensional face model is then restored using a three-dimensional face reconstruction technique.
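This replacement can be sketched minimally, assuming the parameter file is held as a dictionary keyed by parameter name; the sizes and values below are illustrative assumptions:

```python
# Minimal sketch of the identity-replacement step: only the identity model
# parameters S in the received parameter file are overwritten with a preset
# character's S, while E, T, P and L from the caller are preserved.
# Parameter sizes are illustrative assumptions, not values from the patent.
def swap_identity(received_params, preset_identity_S):
    out = dict(received_params)          # copy the received parameter file
    out["S"] = list(preset_identity_S)   # replace identity parameters only
    return out

received = {"S": [0.0] * 80, "E": [0.3] * 64, "T": [0.5] * 80,
            "P": [0.1] * 6, "L": [0.2] * 27}
cartoon_S = [0.9] * 80                   # preset cartoon-character identity
replaced = swap_identity(received, cartoon_S)
```

Restoring the three-dimensional face model from `replaced` then yields the cartoon character's identity animated with the caller's original expression, pose, texture and illumination.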
This embodiment exploits the flexibility with which the parameterized three-dimensional face model parameters can be modified and replaced. When one end of the video call receives the complete three-dimensional face model parameters, it need only replace the identity model parameters S with the preset identity parameters of a cartoon character or celebrity to take on another virtual character's identity, without changing the expression or head pose of the person at the other end of the call, which makes the video call more engaging and flexible.
Example 3
The embodiment provides a virtual character video call system based on three-dimensional face reconstruction, which is applied to the virtual character video call method based on three-dimensional face reconstruction provided in embodiment 1 or embodiment 2. Fig. 7 is a schematic structural diagram of a virtual character video call system based on three-dimensional face reconstruction according to this embodiment.
The virtual character video call system based on three-dimensional face reconstruction provided by the embodiment comprises a first communication terminal and a second communication terminal which have the same structure, wherein the first communication terminal and the second communication terminal respectively comprise a video acquisition module 1, an audio acquisition module 2, a display module 5, a communication module 3 and a main control module 4; wherein:
the video acquisition module 1 is used for acquiring video streams and sending the video streams to the main control module 4;
the audio acquisition module 2 is used for acquiring audio streams and sending the audio streams to the main control module 4;
the main control module 4 processes the collected video stream or audio stream according to the currently selected virtual character video call mode, specifically:
when a video-to-video call mode is selected, decomposing the video stream into image frames, and then performing model parameter prediction on a parameterized three-dimensional face by using a three-dimensional face reconstruction network to obtain predicted three-dimensional face model parameters and storing the predicted three-dimensional face model parameters as a parameter file;
when an audio-video call mode is selected, inputting the audio stream into an audio prediction network to perform model parameter prediction of a parameterized three-dimensional face to obtain predicted three-dimensional face model parameters, merging and updating preset initial three-dimensional face model parameters according to the predicted three-dimensional face model parameters, and storing the parameters as a parameter file;
the main control module 4 transmits the generated parameter file to another communication terminal through the communication module 3;
the main control module 4 is further configured to recover a corresponding three-dimensional face model by using a three-dimensional face reconstruction technique according to the parameter file received by the communication module 3, and then map the three-dimensional face model to a two-dimensional image plane to obtain a recovered video image frame sequence; and rendering the video image frame sequence, synthesizing the video image frame sequence with the audio stream to form a virtual character video, and transmitting the virtual character video to the display module 5 for display.
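The "map to the two-dimensional image plane" step performed by the main control module 4 can be sketched as a weak-perspective (scaled affine) projection; the yaw, scale and translation inputs below stand in for the pose model parameters P and are illustrative assumptions:

```python
import numpy as np

# Illustrative weak-perspective mapping of reconstructed 3-D face vertices
# to the 2-D image plane (the affine-transformation step): rotate the
# vertex set, drop depth, then scale and translate into pixel coordinates.
def project_to_image_plane(vertices, yaw, scale, tx, ty):
    """vertices: (N, 3) array; returns (N, 2) pixel coordinates."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])          # rotation about the vertical axis
    rotated = vertices @ R.T              # apply head pose
    return scale * rotated[:, :2] + np.array([tx, ty])

pts = project_to_image_plane(np.zeros((4, 3)),
                             yaw=0.3, scale=100.0, tx=128.0, ty=128.0)
```

In the full system, each projected vertex would then receive its RGB color value from the restored texture set as the pixel value of its mapped point.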
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A virtual character video call method based on three-dimensional face reconstruction is characterized by comprising the following steps:
s1: selecting virtual character video call modes, including a video-to-video call mode and an audio-to-video call mode:
when the video-to-video call mode is selected: acquiring a video stream and an audio stream of a first communication terminal, decomposing the video stream into image frames, and then performing model parameter prediction on a parameterized three-dimensional face by using a three-dimensional face reconstruction network to obtain predicted three-dimensional face model parameters and storing the predicted three-dimensional face model parameters as a parameter file;
when the audio-to-video call mode is selected: acquiring an audio stream of a first communication terminal, inputting the audio stream into an audio prediction network to perform model parameter prediction of a parameterized three-dimensional face, and obtaining predicted three-dimensional face model parameters; merging and updating preset initial three-dimensional face model parameters according to the predicted three-dimensional face model parameters, and then saving the parameters as a parameter file;
s2: transmitting the parameter file and the audio stream of the first communication terminal to a second communication terminal, restoring a corresponding three-dimensional face model by the second communication terminal according to the parameter file by using a three-dimensional face reconstruction technology, and mapping the three-dimensional face model to a two-dimensional image plane to obtain a restored video image frame sequence;
s3: and rendering the video image frame sequence and then synthesizing the video image frame sequence and the audio stream into a virtual character video.
2. The virtual character video call method according to claim 1, wherein the three-dimensional face model parameters predicted by the three-dimensional face reconstruction network include an identity model parameter S, an expression model parameter E, a texture model parameter T, a pose model parameter P, and an illumination model parameter L; the three-dimensional face model parameters obtained through the audio prediction network prediction comprise expression model parameters E and posture model parameters P.
3. The virtual character video call method as claimed in claim 2, further comprising the steps of: carrying out optimization training on the three-dimensional face reconstruction network according to the video stream image frames, the predicted three-dimensional face model parameters and the restored three-dimensional face model, wherein the expression formula is as follows:
$$\theta^{*}=\arg\min_{\theta}\left\|I-\omega\!\left(f_{\theta}(I)\right)\right\|^{2}$$

In the formula, $\theta$ represents the network parameters of the three-dimensional face reconstruction network, $I$ represents the original video stream image frame, $f_{\theta}(\cdot)$ represents the prediction function learned by the three-dimensional face reconstruction network, and $\omega(\cdot)$ represents the function that maps the restored three-dimensional face model to the two-dimensional image plane.
4. The virtual character video call method as claimed in claim 3, wherein the three-dimensional face reconstruction network comprises an R-Net network.
5. The virtual character video call method as claimed in claim 2, further comprising the steps of: performing optimization training on the audio prediction network according to the audio stream, the predicted three-dimensional face model parameters and the preset initial three-dimensional face model parameters, wherein an expression formula is as follows:
$$\theta_{1}^{*}=\arg\min_{\theta_{1}}\left\|\hat{E}-h_{E}(\hat{A};\theta_{1})\right\|^{2}$$

$$\theta_{2}^{*}=\arg\min_{\theta_{2}}\left\|\hat{P}-h_{P}(\hat{A};\theta_{2})\right\|^{2}$$

In the formulas, $\theta_{1}$ and $\theta_{2}$ are the network parameters of the audio prediction network, $\hat{E}$ represents the preset initial expression model parameters, $\hat{P}$ represents the preset initial pose model parameters, and $\hat{A}$ represents the audio stream; $h_{E}(\cdot)$ represents the expression feature prediction function learned by the audio prediction network and $h_{P}(\cdot)$ represents the pose feature prediction function learned by the audio prediction network.
6. The avatar video call method of claim 5, wherein said audio prediction network comprises an LSTM network.
7. The virtual character video call method as claimed in claim 2, wherein in the step S2, the specific step of restoring the corresponding three-dimensional face model by using the three-dimensional face reconstruction technique includes:
initializing a vertex set $\bar{S}$ of the three-dimensional face and the RGB set $\bar{T}$ corresponding to the three-dimensional face vertex set; changing the positions of the three-dimensional face vertex set according to the identity model parameter S, the expression model parameter E and the pose model parameter P in the parameter file, and changing the color values of the RGB set corresponding to the three-dimensional face vertex set according to the texture model parameter T and the illumination model parameter L in the parameter file, with the expression formulas:

$$S^{*}=X\Big(\bar{S}+\sum_{i=1}^{N_{1}}S_{i}B_{i}^{\mathrm{id}}+\sum_{j=1}^{N_{2}}E_{j}B_{j}^{\mathrm{exp}};\,P\Big)$$

$$T^{*}=C\Big(\bar{T}+\sum_{i}T_{i}B_{i}^{\mathrm{tex}};\,L\Big)$$

In the formulas, $B_{i}^{\mathrm{id}}$ represents an identity base of the three-dimensional face, $B_{j}^{\mathrm{exp}}$ represents an expression base of the three-dimensional face, and $B_{i}^{\mathrm{tex}}$ represents a texture base of the three-dimensional face; $X(\Lambda;P)$ represents the function that changes the positions of the three-dimensional face vertex set according to the pose model parameter P, where $\Lambda$ is the vertex set whose positions are to be changed; $C(\varepsilon;L)$ represents the function that changes the RGB set corresponding to the three-dimensional face vertex set according to the illumination model parameter L, where $\varepsilon$ is the RGB set whose color values are to be changed; $N_{1}$ and $N_{2}$ are the total numbers of identity bases and expression bases respectively, and the subscripts i and j are the ordinals of the identity bases and expression bases;
constructing the restored three-dimensional face model from the changed vertex set S* of the three-dimensional face and the RGB set T*, mapping each vertex of the three-dimensional face model to the two-dimensional image plane by affine transformation, and mapping the RGB color value of each vertex to the two-dimensional image plane as the pixel value of its mapped point, so as to obtain a restored video image frame.
8. The virtual character video call method as claimed in claim 2, wherein said step of S2 further comprises the steps of: after the parameter file is compressed, the parameter file is transmitted to a second communication terminal by adopting cloud service; and transmitting the audio stream of the first communication terminal to a second communication terminal by adopting a network protocol.
9. The virtual character video call method according to any one of claims 2 to 8, further comprising the steps of: and the second communication terminal is preset with identity model parameters S of other figures, and when the second communication terminal receives the parameter file and the audio stream transmitted by the first communication terminal, the identity model parameters S and the identity model parameters in the parameter file are replaced, and then the corresponding three-dimensional face model is recovered by using a three-dimensional face reconstruction technology.
10. A virtual character video call system based on three-dimensional face reconstruction is applied to the virtual character video call method of any one of claims 1 to 9, and is characterized by comprising a first communication terminal and a second communication terminal, wherein the first communication terminal and the second communication terminal respectively comprise a video acquisition module, an audio acquisition module, a display module, a communication module and a main control module; wherein:
the video acquisition module is used for acquiring video streams and sending the video streams to the main control module;
the audio acquisition module is used for acquiring audio streams and sending the audio streams to the main control module;
the main control module decomposes the video stream into image frames according to the currently selected virtual character video call mode, and then carries out model parameter prediction on a parameterized three-dimensional face by adopting a three-dimensional face reconstruction network to obtain predicted three-dimensional face model parameters and store the predicted three-dimensional face model parameters as a parameter file;
or inputting the audio stream into an audio prediction network to carry out model parameter prediction of a parameterized three-dimensional face to obtain predicted three-dimensional face model parameters, merging and updating preset initial three-dimensional face model parameters according to the predicted three-dimensional face model parameters, and then saving the parameters as a parameter file;
the master control module transmits the generated parameter file to another communication terminal through the communication module;
the main control module is also used for restoring a corresponding three-dimensional face model by using a three-dimensional face reconstruction technology according to the parameter file received by the communication module and then mapping the three-dimensional face model to a two-dimensional image plane to obtain a restored video image frame sequence; and rendering the video image frame sequence, synthesizing the video image frame sequence with the audio stream to form a virtual character video, and transmitting the virtual character video to the display module for displaying.
CN202110632937.0A 2021-06-07 2021-06-07 Virtual character video call method and system based on three-dimensional face reconstruction Pending CN113395476A (en)


Publications (1)

Publication Number Publication Date
CN113395476A true CN113395476A (en) 2021-09-14

Family

ID=77618475




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication