CN114332315B - 3D video generation method, model training method and device - Google Patents


Info

Publication number
CN114332315B
CN114332315B
Authority
CN
China
Prior art keywords
expression
pca
avatar
coefficient
coefficients
Prior art date
Legal status
Active
Application number
CN202111494651.7A
Other languages
Chinese (zh)
Other versions
CN114332315A (en)
Inventor
彭哲
刘玉强
耿凡禺
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111494651.7A
Publication of CN114332315A
Application granted
Publication of CN114332315B


Abstract

The disclosure provides a 3D video generation method and device, a neural network model training method and device, electronic equipment, a storage medium and a computer program, and relates to the field of image processing, in particular to the fields of voice, virtual/augmented reality and deep learning. The specific implementation scheme is as follows: performing Principal Component Analysis (PCA) processing on a plurality of 3D avatar sequences of a plurality of expressions to obtain PCA parameters and a plurality of PCA coefficients of the plurality of expressions, wherein the 3D avatar sequences comprise a plurality of 3D avatar models arranged in time sequence; performing PCA processing on the plurality of PCA coefficients to obtain a mean value sequence and a change matrix of the expression; and generating expression coefficients for the expression based on the mean sequence and the variation matrix of the expression, and generating the first 3D video based on the expression coefficients and the PCA parameters.

Description

3D video generation method, model training method and device
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to the field of technologies for speech, virtual/augmented reality, and deep learning, and in particular, to a 3D video generation method and apparatus, a neural network model training method and apparatus, an electronic device, a storage medium, and a computer program.
Background
In the fields of video production, electronic games, self-service customer service and the like, 3D video generation is required. The expression of the 3D avatar is an important part in the 3D video generation process.
The generation of 3D video is usually constrained by blend-shape animation (Blendshape), which is mostly produced by hand, so the expressions of the generated 3D avatar are often lacking in expressiveness and detail.
Disclosure of Invention
The present disclosure provides a 3D video generation method and apparatus, a training method and apparatus of a neural network model, an electronic device, a storage medium, and a computer program.
According to an aspect of the present disclosure, there is provided a 3D video generation method including:
performing Principal Component Analysis (PCA) processing on a plurality of 3D avatar sequences with a plurality of expressions to obtain PCA parameters and a plurality of PCA coefficients of the plurality of expressions, wherein the 3D avatar sequences comprise a plurality of 3D avatar models arranged according to a time sequence;
performing PCA processing on the plurality of PCA coefficients to obtain a mean value sequence and a change matrix of the expression; and
generating an expression coefficient for the expression based on the mean sequence and the variation matrix of the expression, and generating a first 3D video based on the expression coefficient and the PCA parameters.
According to another aspect of the present disclosure, there is provided a training method of a neural network model, including:
generating vertex change coefficients of the 3D avatar based on speech features and PCA coefficients for silent expressions using the neural network model;
calculating a loss function based on the generated vertex change coefficient and the target vertex change coefficient;
adjusting parameters of the neural network model according to the loss function.
According to another aspect of the present disclosure, there is provided a 3D video generating apparatus including:
the first processing module is used for performing Principal Component Analysis (PCA) processing on a plurality of 3D avatar sequences with a plurality of expressions to obtain PCA parameters and a plurality of PCA coefficients of the plurality of expressions, wherein the 3D avatar sequences comprise a plurality of 3D avatar models arranged in time sequence;
the second processing module is used for carrying out PCA processing on the plurality of PCA coefficients to obtain a mean sequence and a change matrix of the expression; and
a first generation module, configured to generate an expression coefficient for the expression based on the mean sequence and the variation matrix of the expression, and generate a first 3D video based on the expression coefficient and the PCA parameter.
According to another aspect of the present disclosure, there is provided a training apparatus of a neural network model, including:
a coefficient generation module for generating a vertex change coefficient of the 3D avatar based on the speech feature and the PCA coefficient for the silent expression using the neural network model;
a loss calculation module for calculating a loss function based on the generated vertex change coefficient and the target vertex change coefficient;
an adjusting module for adjusting parameters of the neural network model according to the loss function.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein:
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method described above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flow chart of a 3D video generation method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of obtaining PCA parameters and PCA coefficients according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a method of obtaining PCA parameters and PCA coefficients according to an embodiment of the disclosure;
fig. 4 is a flowchart of a 3D video generation method according to another embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a method of obtaining a mean sequence and a change matrix from PCA coefficients according to an embodiment of the disclosure;
FIG. 6 is a flow chart of a method of generating expression coefficients according to an embodiment of the present disclosure;
fig. 7A and 7B are schematic diagrams of a method of generating an expression coefficient according to an embodiment of the present disclosure;
fig. 8 is a schematic diagram of a 3D video generation method according to an embodiment of the present disclosure;
fig. 9 is a flowchart of a 3D video generation method according to another embodiment of the present disclosure;
fig. 10 is a schematic diagram of a 3D video generation method according to another embodiment of the present disclosure;
fig. 11 is a flowchart of a 3D video generation method according to yet another embodiment of the present disclosure;
fig. 12 is a schematic diagram of a 3D video generation method according to yet another embodiment of the present disclosure;
FIG. 13 is a flow chart diagram of a method of training a neural network model in accordance with an embodiment of the present disclosure;
FIG. 14 is a schematic diagram of a method of training a neural network model, according to an embodiment of the present disclosure;
FIG. 15 is a flow chart of a method of training a neural network model according to another embodiment of the present disclosure;
FIG. 16 is a schematic diagram of a method of generating training data according to an embodiment of the present disclosure;
fig. 17 is a block diagram of a 3D video generation apparatus according to an embodiment of the present disclosure;
FIG. 18 is a block diagram of a training apparatus for a neural network model in accordance with an embodiment of the present disclosure;
fig. 19 is a block diagram of an electronic device for implementing a 3D video generation method and a training method of a neural network model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a flowchart of a 3D video generation method according to an embodiment of the present disclosure.
As shown in fig. 1, the 3D video generating method 100 includes operations S110 to S130.
In operation S110, Principal Component Analysis (PCA) processing is performed on a plurality of 3D avatar sequences of a plurality of expressions to obtain PCA parameters and a plurality of PCA coefficients of the plurality of expressions, where each 3D avatar sequence includes a plurality of 3D avatar models arranged in time series.
The PCA technique referred to herein converts high-dimensional data into low-dimensional data through dimensionality reduction in order to extract the main features of the data. During PCA processing, high-dimensional features are projected onto the PCA parameters to obtain low-dimensional features; the feature vectors obtained by this dimensionality reduction are also called PCA coefficients. The PCA parameters are computed during the PCA calculation and may be vectors. For example, for a matrix of size T x (V x K), the PCA parameters may be a matrix of size M x (V x K), and projecting the matrix onto the PCA parameters yields PCA coefficients of size T x M, where T, V, K and M are integers greater than 1. The projection may include computing the mean vector of the T x (V x K) matrix (a vector of length V x K), subtracting the mean vector from each row of the T x (V x K) matrix, and multiplying the result by the transpose of the M x (V x K) matrix to obtain the T x M coefficient matrix.
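For illustration only, the projection described above can be sketched in a few lines of Python/NumPy. The sizes T, V, K and M and all function names are hypothetical; the patent does not prescribe a particular implementation.

```python
import numpy as np

def fit_pca(X, M):
    """X: T x (V*K) matrix, each row a flattened 3D avatar model.
    Returns the mean vector (length V*K) and the top-M principal components
    (M x (V*K)), i.e. the PCA parameters."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:M]

def project(X, mean, components):
    """Subtract the mean and multiply by the transpose of the M x (V*K) matrix,
    giving the T x M PCA coefficients."""
    return (X - mean) @ components.T

# hypothetical sizes: T frames, V vertices, K = 3 coordinates, M components
T, V, K, M = 200, 5000, 3, 50
X = np.random.rand(T, V * K)               # stand-in for a flattened 3D avatar sequence
pca_mean, pca_components = fit_pca(X, M)   # PCA parameters
pca_coeffs = project(X, pca_mean, pca_components)   # shape (T, M)
```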
The plurality of expressions referred to herein may be, for example, various expressions in a silent state (hereinafter referred to as silent expressions), such as a smiling expression, an angry expression, or a sad expression in a silent state, without particular limitation. A silent state means a state without lip movement, i.e., a non-speaking state.
The plurality of 3D avatar sequences of the plurality of expressions referred to herein may be, for example, a plurality of 3D avatar sequences for a plurality of silent expressions, wherein each silent expression corresponds to one 3D avatar sequence, and each 3D avatar sequence may, for example, include a plurality of 3D avatar models arranged in a time sequence. In the present embodiment, each 3D avatar model may include a plurality of 3D vertices, and each 3D vertex may be represented by three-dimensional coordinates, such as x, y, and z coordinates. The method for constructing the 3D avatar model may adopt any suitable method as required, for example, collecting position changes of a plurality of points of the face of a person or other object when making an expression, and constructing the 3D avatar model based on the collected information, which is not described herein again. It should be noted that the 3D avatar sequence in this embodiment is not a head model for a specific user, and cannot reflect personal information of a specific user. The 3D avatar sequence in this embodiment is from a public data set.
In this operation, PCA processing is performed on the plurality of 3D avatar sequences respectively corresponding to the plurality of expressions to obtain the PCA parameters and a plurality of PCA coefficients respectively corresponding to the plurality of expressions. For example, PCA processing is performed on the 3D avatar sequences of all silent expressions (e.g., the 3D avatar sequences of the smiling, sad, angry and other silent expressions) to obtain the PCA parameters, which may be a matrix. The 3D avatar sequence of each silent expression (e.g., the 3D avatar sequence of the smiling expression in the silent state) is then projected onto the PCA parameters to obtain the PCA coefficient corresponding to that silent expression (e.g., the smiling expression in the silent state); the PCA coefficient of each silent expression may be a sequence having a certain sequence length and amplitude.
In operation S120, PCA processing is performed on the plurality of PCA coefficients to obtain a mean sequence and a variation matrix of the expression.
Based on operation S110, for each expression the PCA coefficient corresponding to the expression may be obtained from the 3D avatar sequence of that expression. PCA processing is then performed on the PCA coefficient corresponding to each expression to obtain a mean sequence and a change matrix for that expression.
For example, for a smile expression in a silence state, PCA processing is performed on PCA coefficients corresponding to the smile expression to obtain a mean sequence and a variation matrix for the smile expression.
The mean sequence may include the mean of the individual PCA coefficients, and the change matrix may be a matrix obtained by subtracting the mean from the individual PCA coefficients. Based on the above operations, the mean sequences and change matrices of other expressions in the silent state, including but not limited to angry and sad silent expressions, can be obtained in the same way and are not described here again.
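The description above can be read as the following minimal sketch (an assumption, since the patent does not spell out the exact computation): the mean sequence is the element-wise mean of the aligned PCA-coefficient sequences of an expression, and the change matrix collects the mean-removed sequences.

```python
import numpy as np

def mean_and_change_matrix(coeff_seqs):
    """coeff_seqs: array of shape (N, L), N aligned and flattened PCA-coefficient
    sequences for one expression. Returns the mean sequence and the change matrix
    as described in the text."""
    mean_seq = coeff_seqs.mean(axis=0)        # mean sequence
    change_matrix = coeff_seqs - mean_seq     # individual coefficients minus the mean
    return mean_seq, change_matrix
```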
In operation S130, an expression coefficient for an expression is generated based on the mean sequence and the variation matrix of the expression, and a first 3D video is generated based on the expression coefficient and the PCA parameter.
After obtaining the mean sequence and the change matrix for each expression according to the above-described method, for each expression, an expression coefficient for the expression is generated based on the mean sequence and the change matrix for the expression. The expression coefficient can be a sequence, has a certain sequence length and amplitude, represents the characteristics of each expression, and can be properly adjusted by a user according to actual needs, so that the adjusted expression coefficient can more closely embody the characteristics of a 3D virtual image, and a more vivid 3D video can be obtained.
After the expression coefficient for each expression is obtained, corresponding expression information is generated based on the expression coefficient and the PCA parameters obtained according to the method, and the expression information is applied to a pre-constructed 3D basic virtual image model to obtain a first 3D video.
The pre-constructed 3D base avatar model is specifically a 3D base avatar model with the mouth closed, and any suitable method may be used to construct the 3D base avatar model as required, which is not described here again.
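As a rough sketch of operation S130 (not taken from the patent, which does not specify how the expression information is combined with the base model), the expression coefficients can be back-projected through the PCA parameters and applied to the base avatar as per-vertex offsets; treating the back-projected data as offsets around its temporal mean is purely an illustrative choice.

```python
import numpy as np

def apply_expression(expr_coeffs, pca_mean, pca_components, base_vertices):
    """expr_coeffs: (T, M) expression coefficients; pca_mean: (V*3,) mean from the
    first PCA stage; pca_components: (M, V*3) PCA parameters; base_vertices: (V, 3)
    pre-built 3D base avatar model with the mouth closed.
    Returns (T, V, 3) per-frame vertex positions of the first 3D video."""
    T = expr_coeffs.shape[0]
    V = base_vertices.shape[0]
    expr_info = (expr_coeffs @ pca_components + pca_mean).reshape(T, V, 3)
    offsets = expr_info - expr_info.mean(axis=0, keepdims=True)
    return base_vertices[None, :, :] + offsets
```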
In this embodiment, the 3D video can be generally applied to the fields of electronic games, video production, self-service customer service, and the like, for example, the 3D avatar is presented in the scenes of games, animations, or intelligent self-service, and the like, so as to interact with the user.
For example, for a smile expression in a silent state, an expression coefficient for the smile expression is generated based on the mean sequence and the variation matrix of the smile expression, and the expression coefficient embodies the characteristics of the smile expression in the silent state. And generating a first 3D video with a smile expression based on the expression coefficients and the PCA parameters, wherein the first 3D video with the smile expression embodies the change of the smile expression under the continuous video frames.
Similarly, based on the above operations, first 3D videos including other expressions in the silent state can be obtained, for example, but not limited to, first 3D videos of angry or sad silent expressions, which are not described here again.
According to the technical scheme of the embodiment of the disclosure, two times of PCA processing are performed on a plurality of 3D virtual image sequences with a plurality of expressions, and the 3D video is constructed according to the processing result, so that the constructed video can reflect expression changes under continuous video frames, more facial details are provided, and the expressive force of the 3D virtual image is improved.
In the 3D video generation method, the process of generating the 3D video is not limited by Blendshape, and Blendshapes matched with 3D avatar characters do not need to be hand-made one by one. The 3D base avatar model can therefore be constructed flexibly for different 3D application scenes, which not only improves the efficiency of generating 3D videos with expressions and reduces labor cost, but also avoids the problem that the choice of 3D avatar characters is limited by the use of Blendshape.
Fig. 2 is a flowchart of a method of acquiring PCA parameters and PCA coefficients according to an embodiment of the present disclosure, and fig. 3 is a schematic diagram of a method of acquiring PCA parameters and PCA coefficients according to an embodiment of the present disclosure. An exemplary implementation of the above operation S110 will be described below with reference to fig. 2 and 3.
As shown in fig. 2, the method of acquiring PCA parameters and PCA coefficients includes operations S211 to S212.
In operation S211, PCA processing is performed on the plurality of 3D avatar sequences to obtain PCA parameters.
The plurality of 3D avatar sequences referred to herein may be generated based on 4D data, for example, based on which a plurality of 3D avatar sequences respectively corresponding to a plurality of silent expressions may be extracted and obtained.
The 4D data may be obtained by recording, or may be extracted from a data source (e.g., a network resource), which is not limited specifically. For example, the 4D data may comprise a plurality of frames, each frame being a 3D avatar model. The 3D avatar model includes a plurality of vertices, each of which may be represented by three dimensional coordinates (e.g., x, y, z). That is, the 4D data may include a time series of 3D avatar models. The 4D data may further include audio data corresponding to a time series of the 3D avatar model. Position change information of a plurality of points of the face of an object (also called a character), such as a person, during facial activities (e.g., speaking, emoting, etc.), and audio data generated due to the speaking can be collected by way of recording, for example. A time series of the 3D avatar model may be generated based on the position change information of the plurality of points of the face, and the generated time series of the 3D avatar model is combined with the recorded audio data to obtain 4D data.
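A hypothetical container for 4D data of this kind might look as follows; the class and field names are illustrative and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class AvatarFrame:
    vertices: np.ndarray          # (V, 3) x, y, z coordinates of one 3D avatar model

@dataclass
class FourDData:
    frames: List[AvatarFrame]     # time series of 3D avatar models
    audio: Optional[np.ndarray]   # audio samples aligned with the frame sequence, if any
    sample_rate: Optional[int]    # audio sampling rate
```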
In this operation, the PCA processing performed on the plurality of 3D avatar sequences to obtain the PCA parameters is the same as or similar to the process described above and is not repeated here.
In operation S212, a projection of the 3D avatar sequence of the expression on the PCA parameters is calculated, resulting in PCA coefficients for the expression.
In this operation, after the PCA parameters are obtained, for each expression, the projection of the 3D avatar sequence with the expression on the PCA parameters is calculated, and the process of obtaining the PCA coefficient for the expression is the same as the above-described manner, and is not described again here.
In the embodiment of the present disclosure, the plurality of 3D avatar sequences for the plurality of expressions, respectively, are acquired based on the 4D data, and PCA processing is performed on the plurality of 3D avatar sequences for the plurality of expressions, respectively, to generate PCA parameters and a plurality of PCA coefficients, and the data acquisition and processing process is simpler and more efficient.
Fig. 3 is a schematic diagram of a method of obtaining PCA parameters and PCA coefficients in accordance with an embodiment of the disclosure.
As shown in fig. 3, here, the so-called 4D data 301 may include, for example, 4D avatar data of various silent expressions, the so-called plurality of 3D avatar sequences 302 may be, for example, a plurality of 3D avatar sequences for a plurality of silent expressions, and the above-mentioned definitions regarding the 4D data 301 and the plurality of 3D avatar sequences 302 are the same as or similar to the above description and are not repeated herein. It should be noted that the 4D data in this embodiment is not 4D data for a specific user, and does not reflect personal information of a specific user. The 4D data in this embodiment is from a public data set.
For example, after the 4D data 301 is acquired according to the above method, a plurality of 3D avatar sequences 302 for a plurality of silent expressions are extracted based on the 4D data 301, and PCA processing is performed on the 3D avatar sequences of all expressions to obtain the PCA parameters 303. For each silent expression (e.g., the smiling expression in the silent state), the projection of the 3D avatar sequence 302 of that expression onto the PCA parameters 303 is calculated, resulting in the PCA coefficients 304 for that expression.
In the embodiment of the present disclosure, the plurality of 3D avatar sequences 302 for the plurality of expressions, respectively, are acquired based on the 4D data 301, and PCA processing is performed on the plurality of 3D avatar sequences 302 for the plurality of expressions, respectively, to generate PCA parameters 303 and a plurality of PCA coefficients 304, so that the data acquisition and processing process is simpler and more efficient.
Fig. 4 is a flowchart of a 3D video generation method according to another embodiment of the present disclosure.
As shown in fig. 4, in the present embodiment, the 3D video generating method 400 includes operations S410 to S440. Operations S410 and operations S430 to S440 may be implemented in the same or similar manner as operations S110 and operations S120 to S130, respectively, and repeated details are not repeated.
In operation S410, PCA processing is performed on a plurality of 3D avatar sequences of a plurality of expressions to obtain PCA parameters and a plurality of PCA coefficients of the plurality of expressions, where the 3D avatar sequence includes a plurality of 3D avatar models arranged in time sequence.
In operation S420, an alignment process is performed on the sequence length and the magnitude of each PCA coefficient.
For example, as can be seen from the above description, the PCA-coefficients of each silent expression are a sequence having a certain sequence length and amplitude. Because the PCA coefficients corresponding to the silent expressions are obtained according to different 3D avatar sequences, the sequence lengths of the PCA coefficients of each expression may be inconsistent and the amplitude of each sequence differs greatly, which is inconvenient for the second PCA processing.
In this embodiment, in order to make the PCA coefficients of various silent expressions have uniform sequence length and amplitude to facilitate the second PCA processing, the sequence length and amplitude of the PCA coefficients of various silent expressions are aligned. The alignment operation here generally refers to unifying the sequence lengths and amplitudes of different PCA coefficients through a Dynamic Time Warping (DTW) algorithm and amplitude adjustment.
The term "sequence length unification" generally refers to making the sequence lengths of the PCA coefficients of different silent expressions consistent, for example, the time lengths of the PCA coefficients of different silent expressions can be unified to the same length by using the DTW algorithm. By amplitude unity is generally meant dividing the amplitude of the sequence elements of the PCA coefficients of different silent expressions by the average of the amplitudes.
It should be noted that, in this embodiment, the sequence length and the amplitude of the PCA coefficients of different silent expressions are unified, and other suitable methods may be used besides the above-described method, which is not limited specifically.
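A simplified alignment routine is sketched below. The patent uses a DTW algorithm for length unification; plain linear resampling is substituted here only to keep the example short, and dividing by the mean absolute value is used as one possible form of amplitude unification.

```python
import numpy as np

def align_coefficients(coeff_seqs, target_len):
    """coeff_seqs: list of (L_i, M) PCA-coefficient sequences with differing lengths
    and amplitudes. Returns an array of shape (N, target_len, M).
    Linear resampling stands in for DTW purely to keep the sketch short."""
    aligned = []
    for seq in coeff_seqs:
        L, M = seq.shape
        src = np.linspace(0.0, 1.0, L)
        dst = np.linspace(0.0, 1.0, target_len)
        resampled = np.stack([np.interp(dst, src, seq[:, m]) for m in range(M)], axis=1)
        # amplitude unification: divide by the mean absolute amplitude
        resampled = resampled / (np.abs(resampled).mean() + 1e-8)
        aligned.append(resampled)
    return np.stack(aligned)
```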
In the embodiment of the disclosure, the sequence length and the amplitude of the PCA coefficients of various silent expressions are aligned, so that the PCA coefficients of various silent expressions have uniform sequence length and amplitude, which is convenient for the second PCA processing, thereby improving the efficiency and the accuracy of the second PCA processing.
In operation S430, PCA processing is performed on the aligned plurality of PCA coefficients to obtain a mean sequence and a variation matrix of the expression.
For example, after aligning PCA coefficients of different silent expressions, PCA processing is performed on the aligned PCA coefficients to obtain a mean sequence and a variation matrix for each expression.
In this operation, the process of obtaining the mean sequence and the change matrix for each expression according to the aligned PCA coefficients is the same as or similar to the manner described above, and is not described herein again.
In operation S440, an expression coefficient for an expression is generated based on the mean sequence and the variation matrix of the expression, and a first 3D video is generated based on the expression coefficient and the PCA parameter.
Fig. 5 is a schematic diagram of a method for obtaining a mean sequence and a variation matrix according to PCA coefficients according to an embodiment of the disclosure, and an example implementation of the operations S420 to S430 described above will be described below with reference to fig. 5.
For example, as shown in fig. 5, after the PCA coefficients Pa of each silent expression (e.g., a smiling, sad or angry silent expression) are obtained in the above manner, alignment processing 501 is performed on the sequence length and amplitude of each PCA coefficient Pa to obtain a plurality of aligned PCA coefficients Pa'. PCA processing 502 is performed on the aligned plurality of PCA coefficients Pa', resulting in a mean sequence Pjx and a change matrix Tbj for each silent expression.
In the embodiment of the disclosure, the sequence length and the amplitude of the PCA coefficients of various silent expressions are aligned, so that the PCA coefficients of various silent expressions have uniform sequence length and amplitude, which is convenient for the second PCA processing, thereby improving the efficiency and the accuracy of the second PCA processing.
Fig. 6 is a flowchart of a method of generating expression coefficients according to an embodiment of the present disclosure. Fig. 7A and 7B are schematic diagrams of a method of generating an expression coefficient according to an embodiment of the present disclosure, and an example implementation of operation S130 described above will be described below with reference to fig. 6, 7A, and 7B.
As shown in fig. 6, the method of generating the expression coefficients includes operations S631 to S633.
In operation S631, a random fluctuation is applied to the change matrix of the expression, resulting in a random fluctuation change coefficient.
For each silent expression, a random fluctuation curve is applied to the change matrix obtained by the method described above to obtain a random fluctuation change coefficient.
For example, taking an expression coefficient for generating a smile expression in a silent state as an example, after obtaining a change matrix and a mean sequence of the smile expression based on the above-described method, a random fluctuation curve is applied to the change matrix of the smile expression to obtain a random fluctuation change coefficient.
In operation S632, an expression coefficient for an expression is generated based on the mean sequence of the expression and the random fluctuation coefficient.
For example, based on the mean sequence of the smile expression and the random fluctuation coefficient calculated as described above, an expression coefficient for the smile expression may be generated. The expression coefficient reflects the characteristic of smile expression in a silent state, and can be subsequently used for generating a 3D video with smile expression.
Similarly, the expression coefficients of other silent expressions can be obtained based on the above operations, and the specific obtaining manner is the same as or similar to the process described above, and is not described herein again.
In operation S633, the expression coefficient is adjusted according to the user input.
The user can properly adjust the expression coefficient according to actual needs, for example, the sequence length and amplitude of the expression coefficient can be flexibly adjusted according to user input, so that the adjusted expression coefficient can more closely embody the characteristics of a 3D virtual image expected by the user, and a more vivid 3D video can be obtained.
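Operations S631 to S633 might be realized along the following lines. The exact form of the random fluctuation curve and of the user adjustment is not given in the text, so Gaussian weights and a single amplitude scale are used here purely as placeholders.

```python
import numpy as np

def generate_expression_coefficient(mean_seq, change_matrix, user_scale=1.0, rng=None):
    """mean_seq: (L,) mean sequence; change_matrix: (N, L) change matrix.
    A random fluctuation curve weights the rows of the change matrix (S631),
    the result is added to the mean sequence (S632), and user_scale stands in
    for the user adjustment of amplitude (S633)."""
    rng = np.random.default_rng() if rng is None else rng
    fluctuation = rng.normal(0.0, 1.0, size=change_matrix.shape[0])  # random fluctuation curve
    rf = fluctuation @ change_matrix                                 # random fluctuation change coefficient
    px = mean_seq + rf                                               # expression coefficient
    return user_scale * px                                           # adjusted expression coefficient
```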
It should be noted that although the steps of the method are described in a specific order, the embodiments of the present disclosure are not limited thereto, and the steps may be performed in other orders as needed. For example, in some embodiments, step S633 may not be included. In this case, the expression coefficient obtained in operation S632 is used for the generation of the 3D video without adjusting the expression coefficient obtained in operation S632, so that the 3D video is generated in a simple and fast manner, which is not limited by the present disclosure.
Fig. 7A is a schematic diagram of a method of generating expression coefficients according to an embodiment of the present disclosure.
As shown in fig. 7A, after obtaining the change matrix Tbj and the mean value sequence Pjx according to the method described above, the random fluctuation curve Rfc is applied 701 to the change matrix Tbj to obtain a random fluctuation coefficient Rf. An expression coefficient Px for the expression is generated 702 based on the random fluctuation coefficient Rf and the mean sequence Pjx, which is to be subsequently used for generating the 3D video.
Fig. 7B is a schematic diagram of a method of generating an expression coefficient according to another embodiment of the present disclosure.
As shown in fig. 7B, after obtaining the variation matrix Tbj and the mean value sequence Pjx according to the above-described method, the random fluctuation curve Rfc is applied 701 to the variation matrix Tbj to obtain a random fluctuation variation coefficient Rf. An expression coefficient Px for the expression is generated 702 based on the random fluctuation coefficient Rf and the mean sequence Pjx. At this time, the expression coefficient Px may be flexibly adjusted 703 according to the parameter uc input by the user, so as to obtain an adjusted expression coefficient Px', which will be subsequently used for generating the 3D video.
In this embodiment, the expression coefficients are flexibly adjusted through user input, so that the adjusted expression coefficients are more appropriate to the characteristics of the 3D avatar desired by the user, and thus a more vivid 3D video is obtained.
Fig. 8 is a schematic diagram of a 3D video generation method according to an embodiment of the present disclosure, and a specific implementation of the method of generating a 3D video according to the embodiment of the present disclosure will be described in detail with reference to fig. 8. It should be understood that the illustration in fig. 8 is only for facilitating the understanding of the technical solutions of the present disclosure by those skilled in the art, and is not intended to limit the protection scope of the present disclosure.
As shown in fig. 8, after obtaining a plurality of 3D avatar sequences 3D-X for a plurality of expressions respectively in the manner described above, PCA processing 801 is performed on the plurality of 3D avatar sequences 3D-X for the plurality of expressions respectively to obtain PCA parameters Pcs and a plurality of PCA coefficients Pcx for the plurality of expressions. The plurality of PCA coefficients Pcx are subjected to PCA processing 802 to obtain a mean sequence Pjx and a variation matrix Tbj for each expression.
After obtaining the change matrix Tbj and the mean value sequence Pjx for each expression, the random fluctuation curve Rfc is applied 803 to the change matrix Tbj to obtain a random fluctuation coefficient Rf. An expression coefficient Px for the expression is generated 804 based on the random fluctuation coefficient Rf and the mean sequence Pjx. At this time, the expression coefficient Px may be flexibly adjusted 805 based on the parameter uc input by the user, so as to obtain an adjusted expression coefficient Px'.
After obtaining the expression coefficients Px 'for each expression, a first 3D video Anml is generated 806 based on the expression coefficients Px' and the PCA parameters Pcs obtained according to the above method.
In the embodiment of the disclosure, PCA processing is performed twice on a plurality of 3D avatar sequences respectively targeting a plurality of expressions, and a 3D video is constructed according to the processing result, so that the constructed video can reflect expression changes under continuous video frames, thereby providing more facial details and improving the expressive power of the 3D avatar.
According to the 3D video generation method, the process of generating the 3D video is not limited by Blendshape, and Blendshapes matched with 3D avatar characters do not need to be hand-made one by one. The 3D base avatar model can therefore be constructed flexibly for different 3D application scenes, which not only improves the efficiency of generating 3D videos with expressions and reduces labor cost, but also avoids the problem that the choice of 3D avatar characters is limited by the use of Blendshape.
In order to make the 3D video generated above more vivid, for example, a lip movement effect may be given to the 3D video, so as to obtain a 3D video in which the lip movement matches the expression (hereinafter, referred to as a 3D video of a non-silent expression). A manner of generating a 3D video of a non-silent expression will be described in detail with reference to fig. 9 and 10.
Fig. 9 is a flowchart of a 3D video generation method according to another embodiment of the present disclosure.
As shown in fig. 9, in the present embodiment, the 3D video generating method 900 includes operations S910 to S950. Operations S910 to S920 may be implemented in the same manner as or similar to operations S110 to S120, and repeated details are not repeated.
In operation S910, PCA processing is performed on a plurality of 3D avatar sequences of a plurality of expressions to obtain PCA parameters and a plurality of PCA coefficients of the plurality of expressions.
In operation S920, PCA processing is performed on the plurality of PCA coefficients to obtain a mean sequence and a variation matrix of the expression.
In operation S930, an expression coefficient for an expression is generated based on the mean sequence and the variation matrix of the expression.
In this operation, for each expression, generating an expression coefficient for the expression based on the mean sequence and the change matrix for the expression is the same as or similar to the process described above, and details are not repeated.
In operation S940, a vertex change coefficient of the 3D avatar is generated based on the voice feature and the expression coefficient using a neural network model.
The neural network model referred to herein may be, for example, a Recurrent Neural Network (RNN). The neural network model may also be implemented by other suitable types of neural networks, chosen according to actual needs.
The speech feature may be, for example, a speech feature extracted from audio data, a speech feature obtained by performing Text-To-Speech (TTS) conversion on a text, or a speech feature acquired in another way. The method for acquiring the speech feature may be selected according to the actual situation and is not limited here.
In the speaking state of lip movement (hereinafter referred to as "non-silent state"), since the lip shape is changed, the change of the expression in the non-silent state is usually different from the change of the expression in the silent state, and thus the change of the expression in the non-silent state needs to be properly adjusted.
In order to make the generated 3D video more conform to the expression effect in the non-silent state, in the embodiment of the present disclosure, for each expression, the vertex change coefficient of the 3D avatar is generated using the neural network model based on the expression coefficient and the voice feature acquired in the above-described manner. The vertex change coefficient of the 3D avatar represents the change of the expression in the non-silent state, and the vertex change coefficient of the 3D avatar can be subsequently utilized to generate the 3D video with the non-silent expression.
For example, after an expression coefficient (e.g., the expression coefficient of a smiling expression) and a voice feature are obtained in the above-described manner, a vertex change coefficient of the 3D avatar is generated based on the voice feature and the expression coefficient using a neural network model such as an RNN network.
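One hypothetical shape such a network could take is sketched below in PyTorch; the layer sizes, the use of a GRU as the recurrent unit, and the name VertexChangeRNN are assumptions rather than details from the patent.

```python
import torch
import torch.nn as nn

class VertexChangeRNN(nn.Module):
    """Maps per-frame speech features concatenated with expression coefficients
    to vertex change coefficients of the 3D avatar."""
    def __init__(self, speech_dim, expr_dim, hidden_dim, out_dim):
        super().__init__()
        self.rnn = nn.GRU(speech_dim + expr_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, speech_feats, expr_coeffs):
        # speech_feats: (B, T, speech_dim), expr_coeffs: (B, T, expr_dim)
        x = torch.cat([speech_feats, expr_coeffs], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h)      # (B, T, out_dim) vertex change coefficients
```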
In operation S950, a second 3D video is generated based on the expression coefficients, the PCA parameters, and the 3D vertex change coefficients.
For example, a silent expression coefficient is generated based on the expression coefficients (e.g., expression coefficients of smile expressions) and PCA parameters obtained in the above manner, expression information in a non-silent state is generated based on the silent expression coefficient and the 3D vertex change coefficient, and the expression information in the non-silent state is applied to a pre-constructed 3D avatar model to obtain a second 3D video (i.e., a 3D video of a non-silent expression).
It should be noted that, in some embodiments, the expression coefficients of the 3D video for generating the non-silent expression may also be adjusted, and specifically, the adjustment process of the expression coefficients may be implemented by referring to the above-described manner, for example, the sequence length and amplitude of the expression coefficients may be flexibly adjusted according to the user input, so that the adjusted expression coefficients more closely reflect the characteristics of the 3D avatar desired by the user, thereby obtaining a more vivid 3D video.
In the embodiment of the disclosure, PCA processing is performed twice on a plurality of 3D avatar sequences respectively aiming at a plurality of expressions, and the processing result is combined with a neural network model to construct a 3D video of the expression in a non-silent state, so that the constructed video can reflect the non-silent expression change in continuous video frames, that is, a speaking effect of a certain expression can be presented, thereby providing more facial details and improving the expressive force of the 3D avatar.
According to the 3D video generation method, the process of generating the 3D video is not limited by Blendshape, and Blendshapes matched with 3D avatar characters do not need to be hand-made one by one. The 3D base avatar model can therefore be constructed flexibly for different 3D application scenes, which not only improves the efficiency of generating 3D videos with expressions and reduces labor cost, but also avoids the problem that the choice of 3D avatar characters is limited by the use of Blendshape.
Fig. 10 is a schematic diagram of a 3D video generation method according to another embodiment of the present disclosure, and a specific implementation of the 3D video generation method according to the embodiment of the present disclosure will be described in detail with reference to fig. 10.
In the present embodiment, a smile expression will be taken as an example to explain a manner of generating a 3D video with a non-silent expression. It should be noted that the smiley expression is only an example to help those skilled in the art understand the solution of the present disclosure, and is not intended to limit the protection scope of the present disclosure.
As shown in fig. 10, after the voice feature Va and the expression coefficients Px and PCA parameters Pcs of the smile expression are obtained in the above manner, a neural network model such as an RNN network is used to generate vertex variation coefficients Pd of the 3D avatar based on the voice feature Va and the expression coefficients Px of the smile expression in operation 1010, and a second 3D video Anm2 (i.e., a non-silent-expression 3D video) is generated based on the expression coefficients Px, PCA parameters Pcs and the vertex variation coefficients Pd of the 3D avatar of the smile expression in operation 1020.
It should be noted that, in some embodiments, the expression coefficient Px used for generating the 3D video with the non-silent expression may also be, for example, an adjusted expression coefficient, and specifically, the adjustment process of the expression coefficient may be implemented by referring to the above-described manner, for example, the sequence length and amplitude of the expression coefficient may be flexibly adjusted according to the user input, so that the adjusted expression coefficient is more appropriate for the characteristics of the 3D avatar desired by the user, and thus, a more vivid 3D video may be obtained.
In the embodiment of the disclosure, PCA processing is performed twice on a plurality of 3D avatar sequences respectively corresponding to a plurality of expressions, and the processing result is combined with a neural network model to construct a 3D video of an expression in the non-silent state, so that the constructed video can reflect the expression changes over consecutive video frames, thereby providing more facial details and improving the expressive force of the 3D avatar. In the 3D video generation method according to the embodiment of the present disclosure, since the process of generating the 3D video is not limited by Blendshape, Blendshapes matched with 3D avatar characters do not need to be hand-made one by one; the 3D base avatar model can therefore be constructed flexibly for different 3D application scenes, which not only improves the efficiency of generating 3D videos with expressions and reduces labor cost, but also avoids the problem that the choice of 3D avatar characters is limited by the use of Blendshape.
In some embodiments, in addition to generating the 3D video for a single expression change as described above, the 3D video for switching between different expressions may also be generated. The following describes a method for generating 3D video with different expression conversions in detail with reference to fig. 11 and 12.
Fig. 11 is a flowchart of a 3D video generation method according to still another embodiment of the present disclosure.
As shown in fig. 11, in the embodiment of the present disclosure, the 3D video generation method 1100 includes operations S1110 to S1140.
In operation S1110, an expression coefficient for a first expression is generated.
In operation S1120, an expression coefficient for a second expression is generated.
The first expression and the second expression referred to herein may include, for example, but are not limited to, a smiling expression, an angry expression, or a sad expression, where the first expression and the second expression are different and may be selected according to actual needs. The expression coefficient for the first expression and the expression coefficient for the second expression, for example, the expression coefficient of a smiling expression and the expression coefficient of a sad expression, may be generated in the manner described in any of the embodiments above.
In operation S1130, interpolation processing is performed between the expression coefficient for the first expression and the expression coefficient for the second expression, resulting in an expression conversion coefficient.
The interpolation processing referred to herein includes linear interpolation between the expression coefficient for the first expression and the expression coefficient for the second expression, thereby obtaining an expression conversion coefficient that embodies the conversion from the first expression to the second expression.
For example, the process of switching from a smiling expression to a sad expression is described by taking the first expression as the smiling expression and the second expression as the sad expression.
After the expression coefficient for the first expression (e.g., the smiling expression) and the expression coefficient for the second expression (e.g., the sad expression) are obtained in the manner described above, interpolation processing is performed between the expression coefficient for the first expression and the expression coefficient for the second expression, resulting in an expression conversion coefficient that embodies the conversion from smiling to sad.
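One simple reading of the interpolation step, assuming the two expression coefficients have been aligned to the same shape, is an element-wise blend whose weight ramps from 0 to 1 over the sequence so that the result starts at the first expression and ends at the second:

```python
import numpy as np

def expression_conversion_coefficient(px1, px2):
    """px1, px2: expression coefficients of the first and second expression,
    aligned to the same shape (L,) or (L, M). Returns the expression conversion
    coefficient produced by linear interpolation between them."""
    L = px1.shape[0]
    w = np.linspace(0.0, 1.0, L).reshape(L, *([1] * (px1.ndim - 1)))
    return (1.0 - w) * px1 + w * px2
```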
In operation S1140, a third 3D video is generated based on the expression conversion coefficients and the PCA parameters.
Expression conversion information from the first expression to the second expression is generated based on the expression conversion coefficient and the acquired PCA parameters, and the expression conversion information is applied to a pre-constructed 3D base avatar model to obtain the third 3D video.
For example, following the above example of the conversion from the smiling expression to the sad expression, an expression conversion coefficient from smiling to sad is obtained based on interpolation processing between the expression coefficient of the smiling expression and the expression coefficient of the sad expression, expression conversion information from smiling to sad is generated according to the expression conversion coefficient and the PCA parameters, and the expression conversion information is applied to a pre-constructed 3D base avatar model to obtain a 3D video with the effect of converting from smiling to sad.
It should be noted that, in the embodiment of the present disclosure, the 3D video generating process with the expression conversion effect is not limited to the conversion between two expressions, and besides the manner shown in the above embodiment, the process may also be applied to the conversion between multiple other expressions, or other conversion manners applicable to multiple expressions may be adopted. For example, a transition from a first expression to a second expression to a third expression or even an nth expression, or from a first expression to a second expression to a first expression, etc. may be made. The expression conversion mode is specifically selected according to practical choices, and is not limited herein.
In addition, in some embodiments, the expression coefficients for generating the 3D video with non-silent expression may also be adjusted, for example, the adjustment process of the expression coefficients may be implemented by referring to the above-described manner, for example, the sequence length and amplitude of the expression coefficients may be flexibly adjusted according to the user input, so that the adjusted expression coefficients more closely reflect the characteristics of the 3D avatar desired by the user, thereby obtaining a more vivid 3D video.
In the embodiment of the disclosure, the expression conversion coefficient is generated based on the expression coefficients of at least two expressions to construct a 3D video with expression conversion effect, thereby further improving the expressive force and detail presentation of the 3D avatar.
Fig. 12 is a schematic diagram of a 3D video generation method according to still another embodiment of the present disclosure, and an example implementation of the 3D video generation method according to the embodiment of the present disclosure will be described below with reference to fig. 12.
As shown in fig. 12, after the expression coefficient Px1 for the first expression (e.g., a smiling expression), the expression coefficient Px2 for the second expression (e.g., a sad expression), and the PCA parameters Pcs are obtained in the manner described above, interpolation processing 1201 is performed between the expression coefficient Px1 for the first expression and the expression coefficient Px2 for the second expression to obtain an expression conversion coefficient Pz representing the conversion from the first expression (e.g., smiling) to the second expression (e.g., sad). A third 3D video Anm3 is generated 1202 based on the expression conversion coefficient Pz and the PCA parameters Pcs.
In the embodiment of the disclosure, the expression conversion coefficient is generated based on the expression coefficients of at least two expressions to construct a 3D video with expression conversion effect, thereby further improving the expressive force and detail presentation of the 3D avatar.
In some embodiments, in order to make the generated 3D video more vivid, a lip motion effect may also be imparted to the 3D video, for example, resulting in a 3D video of a non-silent expression. The manner of generating a non-silent-expression 3D video with an expression conversion effect will be described below, following the expression conversion example of fig. 12. The expression conversion coefficient for converting from smiling to sad may be generated based on the smiling expression coefficient and the sad expression coefficient in the above-described manner. 3D vertex change information is generated based on the voice feature using the RNN in the above manner, and a 3D video that converts from a non-silent smiling expression to a non-silent sad expression is generated based on the expression conversion coefficient, the 3D vertex change information, and the PCA parameters.
FIG. 13 is a flow chart of a method of training a neural network model in accordance with an embodiment of the present disclosure. The method is suitable for training the neural network model used in the 3D video generation method.
As shown in fig. 13, the training method 1300 of the neural network model includes operations S1310 to S1330. In the embodiments of the present disclosure, the neural network model is the same as or similar to the definitions described above, and is not described herein again. After the neural network model is trained, the neural network model can be used as the neural network model used in the 3D video generation method.
In operation S1310, vertex change coefficients of the 3D avatar are generated based on the speech features and PCA coefficients for silent expressions using the neural network model.
The speech features and PCA coefficients for silent expressions, as they are called here, can be derived, for example, on the basis of 4D data, as will be explained in detail below.
In the present operation, a neural network model such as an RNN network is used to generate vertex change coefficients of the 3D avatar, which are used to generate 3D video of non-silent expressions, based on the speech features and PCA coefficients for silent expressions.
In operation S1320, a loss function is calculated based on the generated vertex change coefficient and the target vertex change coefficient.
A loss function is calculated based on the comparison between the obtained vertex change coefficient and the target vertex change coefficient. The loss function represents the difference between the vertex change coefficient predicted by the neural network model and the preset target vertex change coefficient, and the parameters of the neural network model can be corrected based on this difference so that the result output by the neural network model is more accurate.
In operation S1330, parameters of the neural network model are adjusted according to the loss function.
For example, parameters of a neural network model (e.g., RNN network) may be adjusted according to the loss function calculated in the above manner, so as to improve reliability of prediction of the neural network model, so that a result output by the neural network model is more accurate.
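A minimal training step consistent with operations S1310 to S1330 could look as follows, reusing the hypothetical VertexChangeRNN sketched earlier and feeding the silent-expression PCA coefficients as its second input; the mean-squared-error loss, the optimizer and all dimensions are assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

# assumes the hypothetical VertexChangeRNN defined in the earlier sketch
model = VertexChangeRNN(speech_dim=80, expr_dim=50, hidden_dim=256, out_dim=50)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()   # one common choice of loss function

def train_step(speech_feats, silent_pca_coeffs, target_vertex_change):
    """speech_feats: (B, T, 80); silent_pca_coeffs: (B, T, 50);
    target_vertex_change: (B, T, 50) target vertex change coefficients."""
    pred = model(speech_feats, silent_pca_coeffs)      # generated vertex change coefficients (S1310)
    loss = criterion(pred, target_vertex_change)       # compare with the target (S1320)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # adjust model parameters (S1330)
    return loss.item()
```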
The technical scheme of the embodiment of the disclosure generates the vertex change coefficient of the 3D avatar based on the voice feature and the PCA coefficient for the silent expression, trains the neural network model by using the vertex change coefficient of the 3D avatar, and adjusts the parameters of the neural network model by comparing the difference between the predicted vertex change coefficient of the 3D avatar and the target vertex change coefficient in the training process. The trained neural network model can be used in the method for generating the 3D video, and the vertex change coefficients output by the neural network model can be used to generate the 3D video with the non-silent expression.
Fig. 14 is a schematic diagram of a training method of a neural network model according to an embodiment of the present disclosure, and an example implementation of operations S1310 to S1330 described above will be described below with reference to fig. 14.
For example, the neural network model shown in fig. 14 may adopt an RNN network, and the neural network model may be used as the neural network model used in the 3D video generation method after being trained.
As shown in fig. 14, after the voice feature Va and the PCA coefficient Vb for the silent expression are obtained, the neural network model 1401 is used to generate, based on the voice feature Va and the PCA coefficient Vb, the vertex change coefficient Pd of the 3D avatar, which is used to generate the 3D video with a non-silent expression.
A loss function 1402 is calculated based on the generated vertex change coefficient Pd and the target vertex change coefficient Dcx, and the parameters of the neural network model 1401 are adjusted using the loss function.
The technical scheme of the embodiment of the disclosure generates a vertex change coefficient Pd of the 3D avatar based on the voice feature Va and a PCA coefficient Vb aiming at the silent expression, trains the neural network model by using the vertex change coefficient Pd of the 3D avatar, and adjusts the parameters of the neural network model by comparing the difference between the predicted vertex change coefficient Pd of the 3D avatar and the target vertex change coefficient Dcx in the training process. The trained neural network model can be used in the method for generating the 3D video, and the vertex change coefficients output by the neural network model can be used to generate the 3D video with the non-silent expression.
Fig. 15 is a flowchart of a method of training a neural network model according to another embodiment of the present disclosure.
As shown in fig. 15, in the present embodiment, the training method 1500 of the neural network model includes operations S1510 to S1540. Operations S1520 to S1540 may be implemented in the same manner as operations S1310 to S1330, respectively, and repeated details are not repeated.
In operation S1510, training data is generated based on the 4D avatar data.
The training data referred to herein may include, for example, speech features, PCA coefficients for silent expressions, and target vertex change coefficients.
In this operation, the training data may be generated based on single-person 4D data and then used in the subsequent training process of the model; the specific training manner is the same as or similar to the process described above and is not repeated here.
In operation S1520, vertex change coefficients of the 3D avatar are generated based on the speech features and the PCA coefficients for the silent expression using the neural network model.
In operation S1530, a loss function is calculated based on the generated vertex change coefficient and the target vertex change coefficient.
In operation S1540, parameters of the neural network model are adjusted according to the loss function.
Fig. 16 is a schematic diagram of a method of generating training data according to an embodiment of the present disclosure, and an example implementation of generating training data in operation S1510 described above will be described below with reference to fig. 16.
In the present embodiment, the training data may include, for example, speech features extracted from 4D data, PCA coefficients for silent expressions, and target vertex change coefficients. The manner of acquiring the 4D data is the same as or similar to that described above, and is not described herein again.
The manner of acquiring the above data will be described in detail with reference to fig. 16.
As shown in fig. 16, after the 4D data 1600 is acquired, audio data 1610, a 3D avatar sequence 1620 of silent expression, and a 3D avatar sequence 1630 of non-silent expression are extracted from the 4D data 1600. It should be noted that the 4D data in this embodiment is not 4D data for a specific user, and cannot reflect personal information of a specific user. The 4D data in this embodiment is from a public data set.
Phoneme-level speech features 1611 are extracted from the audio data 1610, and a 3D avatar sequence 16111 with lip movements is constructed from the speech features 1611; the 3D avatar sequence 16111 with lip movements may be, for example, a 3D avatar sequence with expressionless lip movements.
In the present embodiment, constructing the 3D avatar sequence 16111 with lip movements from the speech features 1611 may be accomplished by various suitable methods. For example, a pre-trained neural network model (e.g., a Long Short-Term Memory (LSTM) based neural network model) may be used to generate the 3D avatar sequence 16111 with lip movements based on the speech features 1611 and a pre-constructed 3D base avatar model.
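A sketch of such a lip-movement generator is shown below. The offset-based formulation, the layer sizes, and all names are assumptions; the text only states that a pre-trained LSTM-based model and a pre-constructed 3D base avatar model are used.

```python
import torch
from torch import nn

class LipMotionLSTM(nn.Module):
    """Sketch: turn per-frame speech features into vertex offsets that are
    added to a fixed 3D base avatar model, yielding a lip-moving sequence."""

    def __init__(self, speech_dim, hidden_dim, num_vertices):
        super().__init__()
        self.lstm = nn.LSTM(speech_dim, hidden_dim, batch_first=True)
        self.to_offsets = nn.Linear(hidden_dim, num_vertices * 3)

    def forward(self, speech_feats, base_vertices):
        # speech_feats: (B, T, speech_dim); base_vertices: (num_vertices, 3)
        h, _ = self.lstm(speech_feats)
        offsets = self.to_offsets(h).view(h.shape[0], h.shape[1], -1, 3)
        return base_vertices + offsets          # (B, T, num_vertices, 3)
```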
PCA processing is performed on the 3D avatar sequence 1620 with the silent expression to obtain the PCA parameters 1621.
The 3D avatar sequence 1630 with the non-silent expression is projected onto the PCA parameters 1621 to obtain a PCA coefficient 1631. The 3D avatar sequence is then reconstructed based on the PCA parameters 1621 and the PCA coefficient 1631, resulting in a reconstructed 3D avatar sequence 16211.
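The PCA fitting, projection, and reconstruction steps in fig. 16 can be sketched as follows, assuming each 3D avatar model is flattened into a vector of vertex coordinates so that a sequence becomes a (T, 3·V) array; the use of scikit-learn and the number of components are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_on_silent_sequence(silent_sequence, n_components=32):
    """silent_sequence: (T, 3 * V) flattened silent-expression avatar frames."""
    pca = PCA(n_components=n_components)
    pca.fit(silent_sequence)                    # PCA parameters: mean + basis
    return pca

def project_and_reconstruct(pca, non_silent_sequence):
    """Project the non-silent sequence onto the PCA parameters, then rebuild it."""
    coefs = pca.transform(non_silent_sequence)          # PCA coefficients
    reconstructed = pca.inverse_transform(coefs)        # reconstructed sequence
    return coefs, reconstructed
```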
The 3D avatar sequence 1630 with the non-silent expression, the 3D avatar sequence 16111 with lip movements, and the reconstructed 3D avatar sequence 16211 are then used together to obtain a target 3D vertex change coefficient 161111, which serves as a reference for calculating the loss function during training.
In the embodiment of the disclosure, training data for training the neural network model is generated based on the information related to the silent expression and the information related to the non-silent expression in the 4D data, so that the output result of the trained neural network model can accurately represent lip movement information when the avatar speaks.
Fig. 17 is a block diagram of a 3D video generation apparatus according to an embodiment of the present disclosure.
As shown in fig. 17, the 3D video generating apparatus 1700 includes: a first processing module 1710, a second processing module 1720, and a first generation module 1730.
The first processing module 1710 is configured to perform Principal Component Analysis (PCA) processing on a plurality of 3D avatar sequences of a plurality of expressions to obtain PCA parameters and a plurality of PCA coefficients of the plurality of expressions, where the 3D avatar sequences include a plurality of 3D avatar models arranged in time sequence.
The second processing module 1720 is configured to perform PCA processing on the multiple PCA coefficients to obtain a mean sequence and a change matrix of the expression; and
the first generation module 1730 is configured to generate an expression coefficient for an expression based on the mean sequence and the variation matrix of the expression, and generate a first 3D video based on the expression coefficient and the PCA parameters.
In some embodiments of the present disclosure, the apparatus 1700 further includes: a second generation module for generating a vertex change coefficient of the 3D avatar based on the voice feature and the expression coefficient using the neural network model for each expression; and generating a second 3D video based on the expression coefficients, the PCA parameters, and the 3D vertex change coefficients.
In some embodiments, the first generation module 1730 includes a random fluctuation submodule, an expression generation submodule, and an expression adjustment submodule.
The random fluctuation submodule is used for applying random fluctuation to the change matrix of the expression to obtain a random fluctuation change coefficient.
The expression generation submodule is used for generating the expression coefficient for the expression based on the mean sequence of the expression and the random fluctuation change coefficient; and
the expression adjustment submodule is used for adjusting the expression coefficient according to the user input.
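As an illustration of how the random fluctuation submodule might operate, the sketch below weights the rows of the expression's change matrix with random values and adds the result to the mean sequence; the Gaussian fluctuation, its scale, and the flattened shape of the change matrix are assumptions not fixed by the text.

```python
import numpy as np

def generate_expression_coefficients(mean_sequence, change_matrix, scale=0.1, rng=None):
    """mean_sequence: (T, K) mean PCA-coefficient sequence of one expression;
    change_matrix: (M, T * K) principal directions of variation of that expression."""
    if rng is None:
        rng = np.random.default_rng()
    T, K = mean_sequence.shape
    weights = rng.normal(0.0, scale, size=change_matrix.shape[0])  # random fluctuation
    fluctuation = weights @ change_matrix                          # (T * K,) offset
    return mean_sequence + fluctuation.reshape(T, K)               # expression coefficients
```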
In some embodiments, the first processing module 1710 includes: a parameter calculation submodule and a coefficient calculation submodule.
The parameter calculation submodule is used for performing PCA processing on the plurality of 3D avatar sequences to obtain the PCA parameters.
The coefficient calculation submodule is used for calculating, for each expression, the projection of the 3D avatar sequence of the expression on the PCA parameters to obtain the PCA coefficient for the expression.
In some embodiments of the disclosure, the plurality of expressions includes a first expression and a second expression, and the apparatus 1700 further includes: the third generation module is used for carrying out interpolation processing between the expression coefficient aiming at the first expression and the expression coefficient aiming at the second expression to obtain an expression conversion coefficient; and generating a third 3D video based on the expression conversion coefficients and the PCA parameters.
In some embodiments, the apparatus 1700 further comprises an alignment module, which is used for aligning the sequence length and the amplitude of each PCA coefficient before the PCA processing is carried out on the plurality of PCA coefficients.
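One plausible way to perform this alignment, assuming linear resampling to a common length and peak-amplitude normalisation (the exact alignment is not specified here), is sketched below.

```python
import numpy as np

def align_pca_coefficients(coef_sequences, target_length):
    """coef_sequences: list of (T_i, K) PCA coefficient sequences of differing
    lengths; returns sequences resampled to target_length with unit peak amplitude."""
    aligned = []
    for seq in coef_sequences:
        T, K = seq.shape
        old_t = np.linspace(0.0, 1.0, T)
        new_t = np.linspace(0.0, 1.0, target_length)
        resampled = np.stack(
            [np.interp(new_t, old_t, seq[:, k]) for k in range(K)], axis=1)
        peak = np.abs(resampled).max()
        if peak > 0:                            # amplitude normalisation
            resampled = resampled / peak
        aligned.append(resampled)
    return aligned
```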
Fig. 18 is a block diagram of a training apparatus of a neural network model according to an embodiment of the present disclosure.
As shown in fig. 18, the training apparatus 1800 for the neural network model includes: a coefficient generation module 1810, a loss calculation module 1820, and an adjustment module 1830. In the embodiments of the present disclosure, the neural network model is the same as or similar to the definitions described above, and is not described herein again.
The coefficient generation module 1810 is configured to generate vertex change coefficients of the 3D avatar based on the speech features and PCA coefficients for silent expressions using a neural network model.
The loss calculation module 1820 is configured to calculate a loss function based on the generated vertex change coefficients and the target vertex change coefficients.
The adjustment module 1830 is used to adjust parameters of the neural network model according to a loss function.
In some embodiments, the apparatus 1800 further comprises: a data processing module for generating training data based on the 4D avatar data, the training data including speech features, PCA coefficients for silent expressions, and target vertex change coefficients.
In some embodiments, the data processing module includes a data extraction submodule, a first calculation submodule, a second calculation submodule, a first reconstruction submodule, a second reconstruction submodule, and a third calculation submodule.
The data extraction submodule is used for extracting the 3D avatar sequence of the silent expression, the 3D avatar sequence of the non-silent expression and the audio data from the 4D data.
The first calculation submodule is used for determining PCA parameters by performing PCA processing on the 3D avatar sequence of the silent expression.
And the second calculation submodule is used for obtaining the PCA coefficient by projecting the 3D avatar sequence with the non-silent expression to the PCA parameters.
The first reconstruction sub-module is for reconstructing the 3D avatar sequence based on the PCA parameters and the PCA coefficients.
The second reconstruction submodule is used for extracting voice features from the audio data and constructing a 3D avatar sequence with lip movements based on the voice features.
The third calculation submodule is used for determining a target 3D vertex change coefficient according to the 3D avatar sequence with the non-silent expression, the 3D avatar sequence with the lip movement and the reconstructed 3D avatar sequence.
It should be noted that the implementation, solved technical problems, implemented functions, and achieved technical effects of each module/unit/subunit and the like in the apparatus part embodiment are respectively the same as or similar to the implementation, solved technical problems, implemented functions, and achieved technical effects of each corresponding step in the method part embodiment, and are not described herein again.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 19 shows a schematic block diagram of an example electronic device 1900 with which embodiments of the present disclosure may be implemented. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 19, the device 1900 includes a computing unit 1901, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1902 or a computer program loaded from a storage unit 1908 into a Random Access Memory (RAM) 1903. In the RAM 1903, various programs and data necessary for the operation of the device 1900 can be stored. The calculation unit 1901, ROM 1902, and RAM 1903 are connected to each other via a bus 1904. An input/output (I/O) interface 1905 is also connected to bus 1904.
A number of components in device 1900 connect to I/O interface 1905, including: an input unit 1906 such as a keyboard, a mouse, and the like; an output unit 1907 such as various types of displays, speakers, and the like; a storage unit 1908 such as a magnetic disk, an optical disk, or the like; and a communication unit 1909 such as a network card, modem, wireless communication transceiver, and the like. The communication unit 1909 allows the device 1900 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1901 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computation unit 1901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computation chips, various computation units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 1901 performs various methods and processes described above, such as a 3D video generation method and a training method of a neural network model. For example, in some embodiments, the 3D video generation method and the training method of the neural network model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1908. In some embodiments, part or all of a computer program may be loaded and/or installed onto the device 1900 via the ROM 1902 and/or the communication unit 1909. When the computer program is loaded into the RAM 1903 and executed by the computing unit 1901, one or more steps of the 3D video generation method and the training method of the neural network model described above may be performed. Alternatively, in other embodiments, the computing unit 1901 may be configured by any other suitable means (e.g., by means of firmware) to perform the 3D video generation method and the training method of the neural network model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (16)

1. A 3D video generation method, comprising:
performing Principal Component Analysis (PCA) processing on a plurality of 3D avatar sequences with a plurality of expressions to obtain PCA parameters and a plurality of PCA coefficients of the plurality of expressions, wherein the 3D avatar sequences comprise a plurality of 3D avatar models arranged according to a time sequence;
performing PCA processing on the plurality of PCA coefficients to obtain a mean value sequence and a change matrix of the expression; and
generating an expression coefficient for the expression based on the mean sequence and the variation matrix of the expression, and generating a first 3D video based on the expression coefficient and the PCA parameters;
wherein the performing Principal Component Analysis (PCA) processing on the plurality of 3D avatar sequences with the plurality of expressions to obtain the PCA parameters and the plurality of PCA coefficients of the plurality of expressions comprises:
carrying out PCA processing on the plurality of 3D avatar sequences to obtain the PCA parameters;
and calculating the projection of the 3D avatar sequence of the expression on the PCA parameters to obtain a PCA coefficient for the expression.
2. The method of claim 1, further comprising: for the expression,
generating a vertex change coefficient of the 3D avatar based on the voice feature and the expression coefficient using a neural network model; and
generating a second 3D video based on the expression coefficients, the PCA parameters, and vertex change coefficients of the 3D avatar.
3. The method of claim 1 or 2, wherein the generating an expression coefficient for the expression based on the mean sequence and the variation matrix of the expression comprises:
applying random fluctuation to the change matrix of the expression to obtain a random fluctuation change coefficient;
generating the expression coefficient for the expression based on the mean sequence of the expression and the random fluctuation change coefficient; and
adjusting the expression coefficient according to user input.
4. The method of claim 1, wherein the plurality of expressions comprises a first expression and a second expression, the method further comprising:
performing interpolation processing between the expression coefficient for the first expression and the expression coefficient for the second expression to obtain an expression conversion coefficient; and
generating a third 3D video based on the expression conversion coefficients and the PCA parameters.
5. The method of claim 1, further comprising, prior to performing the PCA processing on the plurality of PCA coefficients, aligning the sequence lengths and amplitudes of the individual PCA coefficients.
6. A method of training a neural network model, comprising:
generating vertex change coefficients of a 3D avatar based on speech features and PCA coefficients for silent expressions using the neural network model;
calculating a loss function based on the generated vertex change coefficient and the target vertex change coefficient;
adjusting parameters of the neural network model according to the loss function;
wherein the method further comprises: generating training data based on 4D data, the training data comprising the speech features, the PCA coefficients for silent expressions, and the target vertex change coefficients;
wherein the generating training data comprises:
extracting a 3D avatar sequence of a silent expression, a 3D avatar sequence of a non-silent expression, and audio data from the 4D data;
determining PCA parameters by performing PCA processing on the 3D avatar sequence of the silent expression;
obtaining PCA coefficients by projecting a 3D avatar sequence of a non-silent expression to the PCA parameters;
reconstructing a 3D avatar sequence based on the PCA parameters and the PCA coefficients;
extracting voice features from the audio data and constructing a 3D avatar sequence with lip movements based on the voice features;
and determining the target 3D vertex change coefficient according to the 3D avatar sequence with the non-silent expression, the 3D avatar sequence with the lip movement and the reconstructed 3D avatar sequence.
7. The method of claim 6, wherein the neural network model comprises a Recurrent Neural Network (RNN).
8. A 3D video generation apparatus comprising:
the first processing module is used for performing Principal Component Analysis (PCA) processing on a plurality of 3D avatar sequences with a plurality of expressions to obtain PCA parameters and a plurality of PCA coefficients of the plurality of expressions, wherein the 3D avatar sequences comprise a plurality of 3D avatar models arranged in time sequence;
the second processing module is used for carrying out PCA processing on the plurality of PCA coefficients to obtain a mean sequence and a change matrix of the expression; and
a first generation module, configured to generate an expression coefficient for the expression based on the mean sequence and the variation matrix of the expression, and generate a first 3D video based on the expression coefficient and the PCA parameter;
wherein the first processing module comprises:
the parameter calculation submodule is used for performing PCA processing on the plurality of 3D avatar sequences to obtain the PCA parameters;
and the coefficient calculation submodule is used for calculating the projection of the 3D avatar sequence of the expression on the PCA parameters to obtain a PCA coefficient for the expression.
9. The apparatus of claim 8, further comprising: a second generation module, configured to generate, for the expression, a vertex change coefficient of the 3D avatar based on the voice feature and the expression coefficient using a neural network model; and generating a second 3D video based on the expression coefficients, the PCA parameters and the vertex change coefficients of the 3D avatar.
10. The apparatus of claim 8 or 9, wherein the first generation module comprises:
the random fluctuation submodule is used for applying random fluctuation to the change matrix of the expression to obtain a random fluctuation change coefficient;
the expression generation submodule is used for generating the expression coefficient for the expression based on the mean sequence of the expression and the random fluctuation change coefficient; and
the expression adjusting submodule is used for adjusting the expression coefficient according to user input.
11. The apparatus of claim 8, wherein the plurality of expressions comprises a first expression and a second expression, the apparatus further comprising: the third generation module is used for carrying out interpolation processing between the expression coefficient aiming at the first expression and the expression coefficient aiming at the second expression to obtain an expression conversion coefficient; and generating a third 3D video based on the expression conversion coefficients and the PCA parameters.
12. The apparatus of claim 8, further comprising: and the alignment module is used for aligning the sequence length and the amplitude of each PCA coefficient before PCA processing is carried out on the plurality of PCA coefficients.
13. An apparatus for training a neural network model, comprising:
a coefficient generation module for generating a vertex change coefficient of the 3D avatar based on speech features and PCA coefficients for silent expressions using the neural network model;
a loss calculation module for calculating a loss function based on the generated vertex change coefficient and the target vertex change coefficient;
an adjustment module for adjusting parameters of the neural network model according to the loss function;
wherein the apparatus further comprises: a data processing module for generating training data based on 4D data, the training data including the speech features, the PCA coefficients for silent expressions, and the target vertex change coefficients;
wherein the data processing module comprises:
the data extraction submodule is used for extracting a 3D avatar sequence with a silent expression, a 3D avatar sequence with a non-silent expression and audio data from the 4D data;
a first calculation sub-module for determining PCA parameters by performing PCA processing on the 3D avatar sequence of the silent expression;
the second calculation submodule is used for obtaining a PCA coefficient by projecting the 3D avatar sequence with the non-silent expression to the PCA parameters;
a first reconstruction submodule for reconstructing a 3D avatar sequence based on the PCA parameters and the PCA coefficients;
the second reconstruction submodule is used for extracting voice features from the audio data and constructing a 3D avatar sequence with lip movement based on the voice features;
and the third calculation submodule is used for determining the target 3D vertex change coefficient according to the 3D avatar sequence with the non-silent expression, the 3D avatar sequence with the lip movement and the reconstructed 3D avatar sequence.
14. The apparatus of claim 13, wherein the neural network model comprises a Recurrent Neural Network (RNN).
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202111494651.7A 2021-12-07 2021-12-07 3D video generation method, model training method and device Active CN114332315B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111494651.7A CN114332315B (en) 2021-12-07 2021-12-07 3D video generation method, model training method and device

Publications (2)

Publication Number Publication Date
CN114332315A CN114332315A (en) 2022-04-12
CN114332315B true CN114332315B (en) 2022-11-08

Family

ID=81050983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111494651.7A Active CN114332315B (en) 2021-12-07 2021-12-07 3D video generation method, model training method and device

Country Status (1)

Country Link
CN (1) CN114332315B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100618276B1 (en) * 2004-10-08 2006-08-31 삼성전자주식회사 The mehtod of erasing the power calibration area for deciding the optimum power
US20130215113A1 (en) * 2012-02-21 2013-08-22 Mixamo, Inc. Systems and methods for animating the faces of 3d characters using images of human faces
US11114086B2 (en) * 2019-01-18 2021-09-07 Snap Inc. Text and audio-based real-time face reenactment
KR102181901B1 (en) * 2019-07-25 2020-11-23 넷마블 주식회사 Method to create animation
CN111489424A (en) * 2020-04-10 2020-08-04 网易(杭州)网络有限公司 Virtual character expression generation method, control method, device and terminal equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110310349A (en) * 2013-06-07 2019-10-08 费斯史福特有限公司 The line modeling of real-time face animation
CN104156708A (en) * 2014-08-20 2014-11-19 合肥工业大学 Feature representation method based on dynamic facial expression sequence and K-order emotional intensity model
CN111028354A (en) * 2018-10-10 2020-04-17 成都理工大学 Image sequence-based model deformation human face three-dimensional reconstruction scheme
CN111724458A (en) * 2020-05-09 2020-09-29 天津大学 Voice-driven three-dimensional human face animation generation method and network structure
CN113408452A (en) * 2021-06-29 2021-09-17 广州虎牙科技有限公司 Expression redirection training method and device, electronic equipment and readable storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A Data Driven Approach for Facial Expression Retargeting in Video; Kai Li et al.; IEEE Transactions on Multimedia; Feb. 28, 2014; vol. 16, no. 2; pp. 299-310. *
Face Recognition Algorithm Based on 2DDPCA; Liu Mingzhu et al.; Video Engineering; Jan. 17, 2016; vol. 40, no. 1; pp. 122-126. *
Facial Expression Recognition Based on Gabor Wavelets; Yin Yong et al.; Opto-Electronic Engineering; May 15, 2009; vol. 36, no. 5; pp. 111-116. *
Improved 2DPCA Face Recognition Algorithm; Wu Xingsu et al.; Computer Systems & Applications; Jun. 15, 2011; vol. 20, no. 6; pp. 212-215. *

Also Published As

Publication number Publication date
CN114332315A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN114140603A (en) Training method of virtual image generation model and virtual image generation method
CN112420014A (en) Virtual face construction method and device, computer equipment and computer readable medium
CN114895817B (en) Interactive information processing method, network model training method and device
CN114612290A (en) Training method of image editing model and image editing method
CN113450759A (en) Voice generation method, device, electronic equipment and storage medium
CN114581980A (en) Method and device for generating speaker image video and training face rendering model
CN115050354B (en) Digital human driving method and device
CN108376234B (en) Emotion recognition system and method for video image
CN112330781A (en) Method, device, equipment and storage medium for generating model and generating human face animation
CN113380269B (en) Video image generation method, apparatus, device, medium, and computer program product
CN113706669B (en) Animation synthesis method and device, electronic equipment and storage medium
CN114023342B (en) Voice conversion method, device, storage medium and electronic equipment
CN110874869B (en) Method and device for generating virtual animation expression
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
US20230177756A1 (en) Method of generating 3d video, method of training model, electronic device, and storage medium
CN114332315B (en) 3D video generation method, model training method and device
CN113052962A (en) Model training method, information output method, device, equipment and storage medium
CN112562045A (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN115359166B (en) Image generation method and device, electronic equipment and medium
CN115906987A (en) Deep learning model training method, virtual image driving method and device
CN114429767A (en) Video generation method and device, electronic equipment and storage medium
CN114882151A (en) Method and device for generating virtual image video, equipment, medium and product
CN114255737A (en) Voice generation method and device and electronic equipment
CN114419182A (en) Image processing method and device
CN115083371A (en) Method and device for driving virtual digital image singing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant