CN115810071A - Animation parameter processing method and device, computer equipment and readable storage medium

Info

Publication number
CN115810071A
Authority
CN
China
Prior art keywords
expression
mapping
audio
feature information
mapping coefficient
Prior art date
Legal status
Pending
Application number
CN202211514658.5A
Other languages
Chinese (zh)
Inventor
马一丰
王苏振
丁彧
吕唐杰
范长杰
胡志鹏
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202211514658.5A
Publication of CN115810071A
Legal status: Pending

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The embodiment of the application provides an animation parameter processing method, an animation parameter processing device, computer equipment and a readable storage medium, wherein the method comprises the following steps: acquiring audio characteristic information and expression characteristic information; performing correlation processing on the audio feature information and the expression feature information to determine audio expression correlation feature information; inputting the expression feature information into a mapping coefficient generation network obtained by pre-training, so as to obtain a target mapping coefficient according to the expression feature information and each mapping coefficient in the mapping coefficient generation network, wherein each mapping coefficient in the mapping coefficient generation network is respectively used for representing a mapping coefficient between one expression and audio; and determining mapping characteristic information according to the audio expression associated characteristic information and the target mapping coefficient, and generating facial animation parameters according to the mapping characteristic information. The mapping relation between the target audio and the facial motion under the target expression information can be accurately represented, thereby improving the reality effect of the facial animation.

Description

Animation parameter processing method and device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an animation parameter processing method, an animation parameter processing apparatus, a computer device, and a readable storage medium.
Background
The speaking face animation generation technology can automatically synthesize a speaking animation of a given reference speaker based on any segment of input speech, and therefore plays an important role in fields such as virtual human generation, news broadcasting, short video creation and teleconferencing. In the process of generating the speaking face animation, generating a suitable expression style is particularly important for the reality of the animation. Therefore, how to fuse the expression style information and the audio information so that the generated talking face animation has higher reality is a problem to be solved.
In the prior art, a concatenation operator is generally used to concatenate the expression style information vector and the audio information vector into a fused vector, which is then input into a subsequent animation generation network to generate the speaking face animation.
However, the method in the prior art cannot represent the association between expressions and audio or the mapping between audio and different expressions, so the reality effect of the generated talking face animation is poor.
Disclosure of Invention
An object of the present application is to provide an animation parameter processing method, an animation parameter processing apparatus, a computer device and a readable storage medium, in view of the deficiencies in the prior art, so as to solve the problem that the reality effect of the generated speaking face animation is poor because the prior art cannot represent the association between expressions and audio or the mapping between audio and different expressions.
In order to achieve the above purpose, the technical solutions adopted in the embodiments of the present application are as follows:
in a first aspect, the present application provides an animation parameter processing method, including:
acquiring audio characteristic information corresponding to target audio and expression characteristic information corresponding to target expression information;
performing correlation processing on the audio feature information and the expression feature information to determine audio expression correlation feature information;
inputting the expression feature information into a mapping coefficient generation network obtained by pre-training, so as to obtain a target mapping coefficient according to the expression feature information and each mapping coefficient in the mapping coefficient generation network, wherein each mapping coefficient in the mapping coefficient generation network is respectively used for representing a mapping coefficient between one expression and audio;
and determining mapping feature information according to the audio expression associated feature information and the target mapping coefficient, and generating facial animation parameters according to the mapping feature information, wherein the animation parameters are used for generating facial animation matched with the target audio and the target expression information.
As a possible implementation manner, the obtaining a target mapping coefficient according to the expression feature information and each mapping coefficient in the mapping coefficient generation network includes:
determining the weight of each mapping coefficient in the mapping coefficient generation network according to the expression characteristic information;
and determining a target mapping coefficient according to each mapping coefficient and the weight.
As a possible implementation manner, the determining, according to the expression feature information, the weight of each mapping coefficient in the mapping coefficient generation network includes:
and inputting the expression characteristic information into a weight calculation sub-network of the mapping coefficient generation network to obtain the weight of each mapping coefficient.
As a possible implementation manner, the weight calculation sub-network includes a first fully-connected layer, an activation function layer, a second fully-connected layer, and an output layer, which are connected in sequence.
As a possible implementation manner, the determining a target mapping coefficient according to each mapping coefficient and the weight includes:
respectively determining the product of each mapping coefficient and the corresponding weight to obtain an intermediate coefficient corresponding to each mapping coefficient;
and adding the intermediate coefficients corresponding to the mapping coefficients to obtain the target mapping coefficient.
As a possible implementation manner, the associating the audio feature information and the expression feature information to determine audio expression associated feature information includes:
and determining the audio expression associated feature information by using the multi-head attention network obtained by pre-training by using the audio feature information as a key and a value and the expression feature information as a query parameter.
As a possible implementation manner, the determining audio expression associated feature information by using the audio feature information as a key and a value and the expression feature information as a query parameter and using a multi-head attention network obtained through pre-training includes:
inputting the key, the value and the query parameter into the multi-head attention network to respectively map the key, the value and the query parameter into each head, respectively performing dot product processing on the query parameter and the key in each head to obtain the attention weight of each head, and obtaining the audio expression associated feature information according to the attention weight of each head and the value of each head.
As a possible implementation manner, the determining mapping feature information according to the audio expression associated feature information and the target mapping coefficient includes:
determining initial characteristic information according to the audio expression associated characteristic information and the target mapping coefficient;
and taking the initial characteristic information as new expression characteristic information, re-determining new audio expression associated characteristic information and new initial characteristic information, executing this in a circulating mode until a preset condition is met, and taking the new initial characteristic information obtained when the circulation stops as the mapping characteristic information.
As a possible implementation manner, the generating facial animation parameters according to the mapping feature information includes:
and inputting the mapping characteristic information into a linear network to obtain the facial animation parameters.
In a second aspect, the present application provides an animation parameter processing apparatus, comprising:
the acquisition module is used for acquiring audio characteristic information corresponding to the target audio and expression characteristic information corresponding to the target expression information;
the correlation module is used for performing correlation processing on the audio feature information and the expression feature information to determine audio expression correlation feature information;
the mapping module is used for inputting the expression characteristic information into a mapping coefficient generation network obtained by pre-training, so as to obtain a target mapping coefficient according to the expression characteristic information and each mapping coefficient in the mapping coefficient generation network, wherein each mapping coefficient in the mapping coefficient generation network is respectively used for representing a mapping coefficient between one kind of expression and audio;
and the generating module is used for determining mapping characteristic information according to the audio expression associated characteristic information and the target mapping coefficient, and generating facial animation parameters according to the mapping characteristic information, wherein the animation parameters are used for generating facial animation matched with the target audio and the target expression information.
As a possible implementation manner, the mapping module is specifically configured to:
determining the weight of each mapping coefficient in the mapping coefficient generation network according to the expression characteristic information;
and determining a target mapping coefficient according to each mapping coefficient and the weight.
As a possible implementation manner, the mapping module is specifically configured to:
and inputting the expression characteristic information into a weight calculation sub-network of the mapping coefficient generation network to obtain the weight of each mapping coefficient.
As a possible implementation manner, the weight calculation sub-network includes a first fully-connected layer, an activation function layer, a second fully-connected layer, and an output layer, which are connected in sequence.
As a possible implementation manner, the mapping module is specifically configured to:
respectively determining the product of each mapping coefficient and the corresponding weight to obtain an intermediate coefficient corresponding to each mapping coefficient;
and adding the intermediate coefficients corresponding to the mapping coefficients to obtain the target mapping coefficient.
As a possible implementation manner, the association module is specifically configured to:
and determining the audio expression associated feature information by using the multi-head attention network obtained by pre-training by using the audio feature information as a key and a value and the expression feature information as a query parameter.
As a possible implementation manner, the association module is specifically configured to:
inputting the key, the value and the query parameter into the multi-head attention network to respectively map the key, the value and the query parameter into each head, respectively performing dot product processing on the query parameter and the key in each head to obtain the attention weight of each head, and obtaining the audio expression associated feature information according to the attention weight of each head and the value of each head.
As a possible implementation manner, the generating module is specifically configured to:
determining initial characteristic information according to the audio expression associated characteristic information and the target mapping coefficient;
and taking the initial characteristic information as new expression characteristic information, re-determining new audio expression associated characteristic information and new initial characteristic information, executing this in a circulating mode until a preset condition is met, and taking the new initial characteristic information obtained when the circulation stops as the mapping characteristic information.
As a possible implementation manner, the generating module is specifically configured to:
and inputting the mapping characteristic information into a linear network to obtain the facial animation parameters.
In a third aspect, the present application provides a computer device comprising: a processor and a memory, the memory storing machine readable instructions executable by the processor, the processor executing the machine readable instructions when the computer device is running, so as to perform the steps of the animation parameter processing method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the animation parameter processing method as described in the first aspect above.
According to the animation parameter processing method, the animation parameter processing device, the computer equipment and the readable storage medium, the audio expression associated feature information representing the audio and expression associated relationship can be obtained by performing associated processing on the audio feature information and the expression feature information, and the target mapping coefficient which is controlled by the expression feature information and matched with the expression feature information can be obtained on the basis of the mapping coefficient network comprising a plurality of sets of mapping coefficients. On the basis, mapping characteristic information can be determined according to the audio expression associated characteristic information and the target mapping coefficient, the mapping characteristic information can represent the association relationship between the target audio and the target expression information, meanwhile, the target mapping coefficient is controlled by the target expression information and is matched with the target expression information, so that the difference of the mapping of the target audio and the target expression information compared with the mapping of other expression information can be embodied, and the mapping characteristic information can accurately represent the mapping relationship between the target audio and the facial movement of the target expression information. After the facial animation parameters are determined based on the mapping characteristic information and the facial animation is generated based on the facial animation parameters, the reality effect of the facial animation can be greatly improved.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a schematic diagram of an exemplary scenario of the present application;
FIG. 2 is a schematic flow chart of an animation parameter processing method provided by the present application;
FIG. 3 is another schematic flow chart of an animation parameter processing method provided by the present application;
FIG. 4 is a schematic diagram of a network generating a target mapping coefficient based on a mapping coefficient and corresponding mapping feature information;
FIG. 5 is a schematic flow chart of an animation parameter processing method provided by the present application;
FIG. 6 is a complete schematic diagram of the generation of output facial animation parameters in the present application;
FIG. 7 is a block diagram of an animation parameter processing apparatus according to the present application;
fig. 8 is a schematic structural diagram of a computer device 80 provided in the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Additionally, it should be understood that the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and that steps without logical context may be reversed in order or performed concurrently. In addition, one skilled in the art, under the guidance of the present disclosure, may add one or more other operations to the flowchart, or may remove one or more operations from the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
At present, when the expression style information and the audio information are fused, the expression style information vector and the audio information vector are simply concatenated into a fused vector by a concatenation operator. First, this simple concatenation cannot show the association between the expressions and the audio; second, the facial movements of different expressions often differ greatly, and simple concatenation cannot reflect the mapping between the audio and the facial movements of different expressions. Therefore, the reality effect of the generated talking face animation is not good.
Based on the above problems, the present application provides an animation parameter processing method, which considers the association between the expressions and the audio and the mapping between the audio and the facial movements of different expressions when fusing the audio and the expressions into facial animation parameters, and generates facial animation parameters according to the association and mapping information, so that the reality effect of the facial animation generated according to the facial animation parameters is greatly improved.
Fig. 1 is a schematic diagram of an exemplary scene of the present application, as shown in fig. 1, the present application may be applied to a scene of generating an animation of a speaking face. In the scene, the expression information and the audio information which should be possessed by the animation face can be specified by the user, and then the speaking face animation parameters are generated based on the expression information and the audio information by using the method. And then, the parameters of the speaking facial animation are input into a facial animation generating network so as to generate the speaking facial animation with high reality which is matched with the expression information and the audio information specified by the user.
It should be noted that the scenario shown in fig. 1 is only an exemplary scenario of the present application. The generated facial animation parameters can also realize other effects in other scenes related to facial animation.
Fig. 2 is a schematic flowchart of an animation parameter processing method provided in the present application, where an execution subject of the method may be a computer device with computing processing capability, such as a terminal device or a server.
As shown in fig. 2, the method includes:
s201, acquiring audio feature information corresponding to the target audio and expression feature information corresponding to the target expression information.
Alternatively, the target audio may be, for example, audio specified by the user. For example, if the user wants an animated character to broadcast a news segment, the audio of the news segment may be designated as the target audio. The target expression information may be, for example, expression information specified by the user. For example, if the user wants the animated character to be smiling when broadcasting news, smiling expression information may be designated as the target expression information.
Alternatively, the target audio may be audio in any format, such as .mp3, .wav, or .wma audio. The target expression information may be expression information contained in a video or animation specified by the user. Illustratively, if the user specifies a piece of video and expression information of a smile is extracted from that video, this indicates that the target expression information specified by the user is smiling expression information.
Optionally, the audio feature information may be obtained by performing feature extraction on the target audio. The expression feature information can be obtained by extracting features of the target expression information. Taking the execution subject of the embodiment as the aforementioned server as an example, the user selects an audio and a video with a smiling expression through the terminal device. And the terminal equipment uploads the audio and the video containing the smiling expression to a server. The server takes the audio as target audio and takes expression information contained in the video containing the smiling expression as target expression information. The server can input the audio and the video containing the smiling expression into a feature extraction network obtained by pre-training respectively or simultaneously, and the feature extraction network performs feature extraction on the audio to obtain the audio feature information. And carrying out feature extraction on the video containing the smiling expression by the feature extraction network to obtain the expression feature information. It is worth to be noted that after feature extraction is performed on a video containing a smiling expression, the obtained expression feature information can represent the expression of smiling.
Optionally, different feature extraction networks may be trained for the audio and the video containing the expression information, or one feature extraction network that supports both audio feature extraction and expression feature extraction may be trained. This is not a particular limitation of the present application.
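As a minimal sketch of the feature extraction described above (the module structure, dimensions, and the use of PyTorch are illustrative assumptions, not part of the disclosure), separate encoder networks could be instantiated for the audio and for the expression information, or a single shared one could be trained:

```python
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Hypothetical feature extraction network; all dimensions are assumed.
    One instance can be trained for audio and another for expression
    information, or a single network supporting both can be trained."""
    def __init__(self, in_dim, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

# audio_encoder = FeatureEncoder(in_dim=80)   # e.g. per-frame mel-spectrogram input (assumed)
# expr_encoder  = FeatureEncoder(in_dim=64)   # e.g. expression coefficients from the video (assumed)
```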
S202, performing correlation processing on the audio feature information and the expression feature information, and determining audio expression correlation feature information.
Optionally, the audio feature information and the expression feature information are subjected to correlation processing, for example, the correlation processing may be performed through a neural network model obtained through pre-training, or correlation calculation may also be performed through a preset algorithm formula.
And after the audio feature information and the expression feature information are subjected to correlation processing, the audio expression correlation feature information can be obtained. The audio expression associated feature information is used for representing the association relationship between the audio feature information and the expression feature information.
And S203, inputting the expression feature information into a mapping coefficient generation network obtained by pre-training, so as to obtain a target mapping coefficient according to the expression feature information and each mapping coefficient in the mapping coefficient generation network, wherein each mapping coefficient in the mapping coefficient generation network is respectively used for representing a mapping coefficient between one expression and audio.
Optionally, the mapping coefficient generation network may be obtained by training sample expression feature information labeled with the target mapping coefficient in advance.
The trained mapping coefficient generation network comprises a plurality of sets of mapping coefficients, and each set of mapping coefficients is used for representing a mapping coefficient of one expression and one audio. As mentioned above, the facial movements of different expressions tend to be different, and therefore, the mapping of different expressions to audio should be different. In this embodiment, a plurality of sets of mapping coefficients are set in the mapping coefficient generation network, and each set of mapping coefficients represents one expression and audio mapping coefficient. And combining the input target expression information to obtain a target mapping coefficient which is controlled by the target expression information and is matched with the target expression information. And then the mapping characteristic information which accurately represents the mapping relation between the target audio and the target expression information can be obtained by using the target mapping coefficient in the following steps.
It should be noted that steps S202 and S203 may be executed in parallel, and their execution order may also be interchanged.
S204, determining mapping feature information according to the audio expression associated feature information and the target mapping coefficient, and generating facial animation parameters according to the mapping feature information, wherein the facial animation parameters are used for generating facial animation matched with the target audio and the target expression information.
Optionally, the audio expression associated feature information represents the association relationship between the target audio and the target expression information, and the target mapping coefficient represents a mapping coefficient matched with the target expression information. On this basis, the audio expression associated feature information and the target mapping coefficient can be input into a dynamic feed-forward network to determine the mapping feature information. The mapping feature information can represent the association relationship between the target audio and the target expression information; meanwhile, since the target mapping coefficient is controlled by and matched with the target expression information, the difference between the mapping of the target audio to the target expression information and its mapping to other expression information can be embodied, so that the mapping feature information can accurately represent the mapping relationship between the target audio and the target expression information. Illustratively, for the two expressions "smile" and "laugh", the target mapping coefficients obtained by using this embodiment are obviously different: the target mapping coefficient corresponding to the smile can represent the mapping coefficient between the smile and the audio feature, and the target mapping coefficient corresponding to the laugh can represent the mapping coefficient between the laugh and the audio feature. Therefore, two types of mapping feature information with obvious differences can be obtained based on the two target mapping coefficients, where the mapping feature information corresponding to the smile can accurately represent the mapping relationship between the target audio and the smile expression, and the mapping feature information corresponding to the laugh can accurately represent the mapping relationship between the target audio and the laugh expression. It should be understood that the mapping between audio and expression described herein refers to the mapping between the audio and the facial movements of the expression.
It should be noted that the dynamic feedforward network is a network including the mapping coefficient generation network. When network training is performed in advance, the mapping coefficient generation network may be used as a part of the dynamic feedforward network, and the dynamic feedforward network may be trained together as a whole.
After the mapping feature information is obtained, facial animation parameters can be generated based on the mapping feature information. The face animation parameters are parameters to be used when generating the face animation. It is worth noting that in generating a facial animation, a large number of parameters for constructing the facial animation may be involved. By utilizing the facial animation parameters of the embodiment, the finally generated facial animation can be matched with the target audio and the target expression information, so that the authenticity effect of the facial animation is greatly improved. Illustratively, the target audio specified by the user is an audio of a piece of news, and the specified target expression information is smile expression information, based on this embodiment, in the subsequently generated facial animation, when the facial animation is broadcasting news in a smile expression, natural facial motion can be realized along with the rhythm of the audio.
It should be understood that when the present application is applied to a speaking face animation generation scene, the facial animation obtained on the basis of the present step is the speaking face animation.
In this embodiment, the audio feature information and the expression feature information are associated, so that audio expression associated feature information representing an association relationship between audio and expression can be obtained, and based on a mapping coefficient network including a plurality of sets of mapping coefficients, a target mapping coefficient controlled by the expression feature information and matched with the expression feature information can be obtained. On the basis, mapping characteristic information can be determined according to the audio expression associated characteristic information and the target mapping coefficient, the mapping characteristic information can represent the association relationship between the target audio and the target expression information, meanwhile, the target mapping coefficient is controlled by the target expression information and is matched with the target expression information, so that the difference of the mapping of the target audio and the target expression information compared with the mapping of other expression information can be embodied, and the mapping characteristic information can accurately represent the mapping relationship between the target audio and the facial movement of the target expression information. After the facial animation parameters are determined based on the mapping feature information and the facial animation is generated based on the facial animation parameters, the reality effect of the facial animation can be greatly improved.
Hereinafter, a process of obtaining the target mapping coefficient based on the mapping coefficient generation network in step S203 will be described in detail.
Fig. 3 is another schematic flow chart of the animation parameter processing method provided in the present application, and as shown in fig. 3, an alternative manner of the step S203 includes:
s301, determining the weight of each mapping coefficient in the mapping coefficient generation network according to the expression characteristic information.
Optionally, each set of mapping coefficients in the mapping coefficient generation network is used to represent a mapping coefficient between one expression and audio; each set of mapping coefficients can also be regarded as corresponding to one expression respectively. For example, the mapping coefficient generation network obtained by training in the present application includes 8 sets of mapping coefficients, respectively {W_1, b_1}, …, {W_K, b_K}, where K takes a value of 8, and each set of mapping coefficients corresponds to one expression respectively. After the expression characteristic information is input into the mapping coefficient generation network, the mapping coefficient generation network can determine the degree of similarity between the expression characteristic information and the expression corresponding to each set of mapping coefficients by analyzing the expression characteristic information, and accordingly give each set of mapping coefficients a corresponding weight. The greater the weight of a certain set of mapping coefficients, the more similar the expression corresponding to that set of mapping coefficients is to the expression represented by the expression characteristic information. Illustratively, the 8 sets of mapping coefficients of the mapping coefficient generation network include a set of mapping coefficients corresponding to a smile expression and a set of mapping coefficients corresponding to a laugh expression; if the mapping coefficient generation network determines, by analyzing the expression feature information, that the expression represented by the expression feature information is an expression between a smile and a laugh, then when weights are given to the sets of mapping coefficients, the same, larger weights can be given to the set of mapping coefficients corresponding to the smile expression and the set of mapping coefficients corresponding to the laugh expression, and smaller weights can be given to the remaining sets of mapping coefficients.
S302, determining a target mapping coefficient according to each mapping coefficient and the weight.
By giving a weight to each set of mapping coefficients, the greater the weight is, the higher the similarity between the expression represented by the expression feature information and the expression corresponding to the mapping coefficients is, and the target mapping coefficients determined according to each mapping coefficient and the corresponding weight can be more matched with the expression feature information.
Optionally, the product of each mapping coefficient and the corresponding weight may be determined respectively to obtain an intermediate coefficient corresponding to each mapping coefficient, and the intermediate coefficients corresponding to each mapping coefficient are added to obtain the target mapping coefficient.
Illustratively, assume that the mapping coefficients are {W_1, b_1}, …, {W_K, b_K}, where K takes a value of, for example, 8, and that the weights of the mapping coefficients determined by the foregoing steps are π_1, …, π_K respectively. The target mapping coefficient {W̃, b̃} can then be calculated by the following formula (1):

{W̃, b̃} = Σ_{k=1..K} π_k · {W_k, b_k}    (1)

After the target mapping coefficient {W̃, b̃} is obtained, and assuming that the audio expression associated feature information obtained in the foregoing step S202 is x, the audio expression associated feature information x and the target mapping coefficient {W̃, b̃} can be input into the dynamic feed-forward network, and the dynamic feed-forward network calculates the mapping feature information x′ by the following formula (2):

x′ = W̃·x + b̃    (2)
The larger the weight of a certain set of mapping coefficients is, the larger the role played by the set of mapping coefficients in the target mapping coefficients obtained after the multiplication and addition is, so that the target mapping coefficients are correspondingly controlled by and matched with the expression characteristic information.
FIG. 4 is a schematic diagram of obtaining the target mapping coefficient, and correspondingly the mapping feature information, based on the mapping coefficient generation network. As shown in FIG. 4, the mapping coefficient generation network is used as a part of the dynamic feed-forward network. After the expression feature information is input into the mapping coefficient generation network, the weights π_1, π_2, …, π_K of the mapping coefficients can be obtained through the weight calculation sub-network in the mapping coefficient generation network, and the mapping coefficient generation network then calculates and outputs the target mapping coefficient {W̃, b̃} based on the above formula (1). On this basis, the dynamic feed-forward network takes the target mapping coefficient and the audio expression associated feature information as input, and calculates and outputs the mapping feature information x′ based on formula (2).
In this embodiment, the weight of each mapping coefficient is determined according to the expression feature information, and the target mapping coefficient is determined based on the weight of each mapping coefficient, and since the weight can represent the similarity between the expression corresponding to the mapping coefficient and the expression represented by the expression feature information, the target mapping coefficient determined based on the weight can be more matched with the expression feature information.
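A minimal sketch of formulas (1) and (2) follows, assuming the coefficient sets are stored as learnable tensors, that the expression-dependent weights π come from the weight calculation sub-network described below, and that the feed-forward transform is the linear map x′ = W̃·x + b̃; names and dimensions are illustrative only:

```python
import torch
import torch.nn as nn

class DynamicFeedForward(nn.Module):
    """Holds K sets of mapping coefficients {W_k, b_k}, combines them into
    the target mapping coefficient with weights pi (formula 1), and applies
    the result to the associated feature x (formula 2). Dimensions assumed."""
    def __init__(self, feat_dim=256, num_coeff_sets=8):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_coeff_sets, feat_dim, feat_dim) * 0.02)
        self.b = nn.Parameter(torch.zeros(num_coeff_sets, feat_dim))

    def forward(self, x, pi):
        # formula (1): weighted sum of the K coefficient sets
        W_target = torch.einsum("bk,kij->bij", pi, self.W)   # (batch, feat, feat)
        b_target = torch.einsum("bk,kj->bj", pi, self.b)     # (batch, feat)
        # formula (2): apply the target mapping coefficient to x
        return torch.einsum("bij,bj->bi", W_target, x) + b_target
```

A coefficient set whose weight is large dominates the weighted sums, which matches the behavior described above.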
As an alternative, the weight of each mapping coefficient may be determined by a weight calculation sub-network in the mapping coefficient generation network. The details will be described below.
Optionally, step S301 may include:
and inputting the expression characteristic information into a weight calculation sub-network of the mapping coefficient generation network to obtain the weight of each mapping coefficient.
Continuing with the example that the mapping coefficient generation network includes 8 sets of mapping coefficients, after the expression feature information is input into the weight calculation sub-network, the weight calculation sub-network may sequentially output the weights of the mapping coefficients according to the order of the mapping coefficients, that is, output 8 weights. Further, the target mapping coefficient is obtained by multiplying each weight by the corresponding mapping coefficient and adding the multiplication results, based on the method of step S302.
As an example, the weight calculation sub-network may include a first fully-connected layer, an activation function layer, a second fully-connected layer, and an output layer, which are connected in sequence.
Referring to FIG. 4, FC1 is the first fully-connected layer, ReLU is the activation function layer, FC2 is the second fully-connected layer, and softmax is the output layer. The first fully-connected layer and the second fully-connected layer are fully-connected layers (FC for short), which can map the learned distributed feature representation to the sample label space. The activation function layer may implement the input-to-output mapping through an activation function such as ReLU. The output layer may be, for example, a softmax output layer.
It is noted that the weight calculation sub-network may use other network configurations in addition to the above-described network configuration. For example, a full connection layer is added before the softmax layer.
In this embodiment, the weights of the mapping coefficients are determined by the weight calculation sub-network, and the characteristics of the expressions corresponding to the mapping coefficients can be learned by the weight calculation sub-network in the pre-training process, so that the determined weights can be accurate.
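A minimal sketch of the weight calculation sub-network (FC1, ReLU, FC2, softmax, connected in sequence) is given below; the hidden size and the number of coefficient sets are assumptions:

```python
import torch.nn as nn

class WeightCalcSubNetwork(nn.Module):
    """Maps the expression feature information to K weights pi_1..pi_K,
    one per set of mapping coefficients (dimensions assumed)."""
    def __init__(self, feat_dim=256, hidden_dim=128, num_coeff_sets=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),        # first fully-connected layer (FC1)
            nn.ReLU(),                              # activation function layer
            nn.Linear(hidden_dim, num_coeff_sets),  # second fully-connected layer (FC2)
            nn.Softmax(dim=-1),                     # softmax output layer
        )

    def forward(self, s):        # s: (batch, feat_dim) expression feature information
        return self.net(s)       # (batch, num_coeff_sets) weights, summing to 1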
As described previously, when the audio expression associated feature information is determined in step S202, the association process may be performed based on a neural network model. The following describes a method of such association processing in detail.
As an alternative implementation, the present application may utilize a multi-head attention network for association processing. Accordingly, the step S202 may include:
and determining the audio expression associated feature information by using the audio feature information as a key sum value and the expression feature information as a query parameter and using a multi-head attention network obtained by pre-training.
Optionally, the multi-head attention network is a network trained based on a multi-head attention mechanism. In the multi-head attention mechanism, key, value, and query parameter query are involved. An output result can be mapped through the input key, value and query. In the application, the audio feature information is input into the multi-head attention network to serve as keys and values, meanwhile, the expression feature information is input into the multi-head attention network to serve as query parameters, and the keys, the values and the query parameters are learned and combined through the multi-head attention network based on a multi-head attention mechanism, so that the association relation between the expression feature information and the audio feature information is better learned.
In this embodiment, the multi-head attention network uses a multi-head attention mechanism, uses the audio feature information as a key and a value, and uses the expression feature information as a query parameter, so as to better learn the association relationship between the expression feature information and the audio feature information.
As an optional implementation manner, the process of determining the audio expression associated feature information based on the multi-head attention network may include:
and inputting the key, the value and the query parameter into the multi-head attention network, mapping the key, the value and the query parameter into each head, performing dot product processing on the query parameter and the key in each head to obtain the attention weight of each head, and obtaining the audio expression associated feature information according to the attention weight of each head and the value of each head.
Optionally, after the expression feature information and the audio feature information are input into the multi-head attention network, the multi-head attention network first decomposes the expression feature information and the audio feature information into keys, values and query parameters, and maps the keys, the values and the query parameters to different subspaces to form a multi-head. Such that each header has keys, values and query parameters in it. On the basis, in each head, the dot product operation is carried out on the query parameters and the keys in the head to obtain the attention weight of the head. And after the attention weight of each head is obtained, multiplying the value of each head by the attention weight, and adding the multiplication results of all heads to obtain the audio expression associated characteristic information.
Alternatively, the process of determining the audio expression associated feature information by the multi-head attention network may be expressed as the following formula (3).
x = F_m(s, f_a)    (3)
Wherein x is the above-mentioned audio expression associated feature information, s is the expression feature information, f_a is the audio feature information, and F_m is the multi-head attention mechanism operation.
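A minimal sketch of formula (3), using PyTorch's standard multi-head attention module as a stand-in for the pre-trained multi-head attention network, with the expression feature as the query and the audio feature as both key and value (dimensions, head count, and sequence lengths are assumptions):

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 4                           # assumed hyperparameters
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

def associate(s, f_a):
    """x = F_m(s, f_a): query = expression feature s, key = value = audio feature f_a."""
    x, _ = attention(query=s, key=f_a, value=f_a)       # x: (batch, len_s, embed_dim)
    return x

# s   : (batch, 1, embed_dim) expression feature information (assumed shape)
# f_a : (batch, T, embed_dim) audio feature information over T frames (assumed shape)
```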
A process of determining the mapping feature information based on the audio expression-related feature information and the target mapping coefficient in step S204 will be described below.
As an alternative implementation, after obtaining the audio expression associated feature information and the target mapping coefficient, a feature information may be obtained by the dynamic feed-forward network based on the foregoing formula (2), and the feature information is directly used as the mapping feature information to perform the subsequent processing to obtain the facial animation parameter.
As another alternative implementation, a feature information may be obtained by calculation based on the foregoing formula (2), and the feature information is used as a new expression feature information to perform a loop process, so as to obtain a better mapping feature information. This mode will be described in detail below.
Fig. 5 is a schematic flowchart of another flow of the animation parameter processing method provided in the present application, and as shown in fig. 5, the process of determining the mapping feature information according to the audio expression associated feature information and the target mapping coefficient in step S204 may include:
and S501, determining initial characteristic information according to the audio expression associated characteristic information and the target mapping coefficient.
Optionally, the initial characteristic information may be obtained by calculating based on the formula (2) in the dynamic feedforward network, and the specific processing process is not described again.
And S502, taking the initial feature information as new expression feature information, re-determining new audio expression associated feature information and new initial feature information, performing this in a circulating manner until a preset condition is met, and taking the new initial feature information obtained when the circulation stops as the mapping feature information.
The step S501 may be regarded as a first loop of the process of determining the mapping feature information, and after the first loop, an initial feature information is obtained. This initial feature information is not used in the subsequent process of determining facial animation parameters, but rather as new expressive feature information. The new expression characteristic information and the audio characteristic information are input into the multi-head attention network again to obtain new audio expression associated characteristic information, and then the dynamic feed-forward network calculates new initial characteristic information based on the target mapping coefficient and the new audio expression associated characteristic information. And the new initial characteristic information is used as new expression characteristic information of the next cycle. And when the cycle times reach the preset times, determining that the preset condition is met, stopping the cycle, and taking the new expression characteristic information when the cycle is stopped as the mapping characteristic information.
Assuming that the number of cycles is represented by N, N may take a value of 3, as an example. It should be understood that the foregoing step S501 belongs to the first loop.
In this embodiment, the initial feature information obtained in the previous cycle is used as new expression feature information to re-determine new audio expression associated feature information, and the new initial feature information is re-determined, so that the initial feature information obtained in the next cycle can express the association relationship between the audio features and the expressions more accurately, and the effect of the facial animation parameters is further improved.
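A minimal sketch of the loop described above, treating the association step and the dynamic feed-forward step as the components sketched earlier (the value N = 3 follows the example in the text; the flow and shapes are assumptions):

```python
def compute_mapping_feature(s, f_a, pi, associate_fn, dynamic_ffn, num_cycles=3):
    """Iterative refinement: each cycle re-derives the audio expression
    associated feature from the current expression feature, then derives
    new initial feature information; the last result is taken as the
    mapping feature information x' (shape handling between components
    is omitted for brevity)."""
    expr = s                                  # expression feature information
    for _ in range(num_cycles):
        x = associate_fn(expr, f_a)           # new audio expression associated feature
        expr = dynamic_ffn(x, pi)             # new initial feature information
    return expr                               # mapping feature information x'
```

Consistent with the description above, the weights pi (and hence the target mapping coefficient) are computed once from the original expression feature information and reused in every cycle.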
As an alternative implementation, the process of generating the facial animation parameters according to the mapping feature information in step S204 may include:
and inputting the mapping characteristic information into a linear network to obtain the facial animation parameters.
Optionally, the linear network includes a linear layer, and the facial animation parameters may be obtained by performing linear operation on the mapping feature information through the linear layer.
Alternatively, the process of obtaining the above-mentioned facial animation parameters through the linear layer may be expressed as the following formula (4).
δ = F_c(x′)    (4)
Where δ is the facial animation parameter, F_c is the linear layer, and x′ is the mapping feature information.
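A minimal sketch of formula (4); the number of facial animation parameters is an assumed value:

```python
import torch.nn as nn

num_animation_params = 64                          # assumed size of the parameter vector delta
linear_head = nn.Linear(256, num_animation_params)

def to_animation_params(x_prime):
    """delta = F_c(x'): a linear operation on the mapping feature information."""
    return linear_head(x_prime)
```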
FIG. 6 is a complete schematic diagram of the generation of the output facial animation parameters in the present application. As shown in FIG. 6, the expression feature information is input into the mapping coefficient generation network to obtain the target mapping coefficient {W̃, b̃}. Meanwhile, the expression feature information and the audio feature information are input into the multi-head attention network to obtain the audio expression associated feature information. The dynamic feed-forward network then calculates initial feature information based on the audio expression associated feature information and the target mapping coefficient, and loop processing is performed with the initial feature information serving as new expression feature information. When the number of loops reaches N, the current initial feature information is input into the linear network for linear processing, so as to output the final facial animation parameters. The specific implementation process of each part shown in FIG. 6 may refer to the foregoing embodiments, and is not described herein again.
Based on the same inventive concept, an animation parameter processing device corresponding to the animation parameter processing method is also provided in the embodiments of the present application, and as the principle of solving the problem of the device in the embodiments of the present application is similar to the animation parameter processing method in the embodiments of the present application, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Fig. 7 is a block diagram of an animation parameter processing apparatus according to the present application, and as shown in fig. 7, the apparatus includes:
the obtaining module 701 is configured to obtain audio feature information corresponding to the target audio and expression feature information corresponding to the target expression information.
And the association module 702 is configured to perform association processing on the audio feature information and the expression feature information to determine audio expression associated feature information.
The mapping module 703 is configured to input the expression feature information into a mapping coefficient generation network obtained through pre-training, so as to obtain a target mapping coefficient according to the expression feature information and each mapping coefficient in the mapping coefficient generation network, where each mapping coefficient in the mapping coefficient generation network is used to represent a mapping coefficient between one expression and audio, respectively.
A generating module 704, configured to determine mapping feature information according to the audio expression-associated feature information and the target mapping coefficient, and generate facial animation parameters according to the mapping feature information, where the animation parameters are used to generate facial animation matched with the target audio and the target expression information.
As an optional implementation manner, the mapping module 703 is specifically configured to:
and determining the weight of each mapping coefficient in the mapping coefficient generation network according to the expression characteristic information.
And determining a target mapping coefficient according to each mapping coefficient and the weight.
As an optional implementation manner, the mapping module 703 is specifically configured to:
and inputting the expression characteristic information into a weight calculation sub-network of the mapping coefficient generation network to obtain the weight of each mapping coefficient.
As an alternative embodiment, the weight calculation sub-network includes a first fully-connected layer, an activation function layer, a second fully-connected layer, and an output layer, which are connected in sequence.
As an optional implementation manner, the mapping module 703 is specifically configured to:
and respectively determining the product of each mapping coefficient and the corresponding weight to obtain the intermediate coefficient corresponding to each mapping coefficient.
And adding the intermediate coefficients corresponding to the mapping coefficients to obtain the target mapping coefficient.
As an optional implementation manner, the association module 702 is specifically configured to:
and determining the audio expression associated feature information by taking the audio feature information as a key and a value and the expression feature information as a query parameter and using a multi-head attention network obtained by pre-training.
As an optional implementation manner, the association module 702 is specifically configured to:
inputting the key, the value and the query parameter into the multi-head attention network to respectively map the key, the value and the query parameter into each head, respectively performing dot product processing on the query parameter and the key in each head to obtain the attention weight of each head, and obtaining the audio expression associated feature information according to the attention weight of each head and the value of each head.
As an optional implementation manner, the generating module 704 is specifically configured to:
determining initial feature information according to the audio expression associated feature information and the target mapping coefficient;
and taking the initial feature information as new expression feature information, re-determining new audio expression associated feature information, and re-determining new initial feature information; repeating this process until a preset condition is met, and taking the new initial feature information obtained when the process stops as the mapping feature information.
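One plausible reading of this loop is sketched below: the target mapping coefficient is applied to the associated features as a linear map (an assumption; the text only says the initial feature information is determined from the two), and a fixed iteration count stands in for the unspecified preset condition. The callables cross_attend and make_target_coefficient are hypothetical stand-ins for the attention and mapping-coefficient steps above.

```python
import torch

def compute_mapping_features(audio_feats: torch.Tensor,
                             expr_feats: torch.Tensor,
                             cross_attend,              # e.g. a CrossAttention module
                             make_target_coefficient,   # expression feats -> (batch, d, d)
                             num_iterations: int = 3) -> torch.Tensor:
    features = expr_feats
    for _ in range(num_iterations):                     # preset condition assumed to be a count
        associated = cross_attend(features, audio_feats)   # new associated feature information
        target = make_target_coefficient(features)         # new target mapping coefficient
        # new initial feature information: apply the target mapping to the associated features
        features = torch.einsum('bod,btd->bto', target, associated)
    return features   # mapping feature information when the loop stops
```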
As an optional implementation manner, the generating module 704 is specifically configured to:
inputting the mapping feature information into a linear network to obtain the facial animation parameters.
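For completeness, a tiny sketch of this last step; the output size (here 52, as if the parameters were blendshape weights) and the use of a single nn.Linear layer are assumptions.

```python
import torch
import torch.nn as nn

d_feature, num_params = 256, 52                    # illustrative sizes
linear_network = nn.Linear(d_feature, num_params)  # the linear network (assumed single layer)

mapping_features = torch.randn(1, 50, d_feature)            # (batch, frames, d_feature)
facial_animation_params = linear_network(mapping_features)  # (batch, frames, num_params)
```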
An embodiment of the present application further provides a computer device 80. As shown in fig. 8, which is a schematic structural diagram of the computer device 80 provided by the present application, the computer device 80 includes: a processor 801, a memory 802, and a bus 803. The memory 802 stores machine-readable instructions executable by the processor 801. When the computer device runs the animation parameter processing method of the embodiments, the processor 801 communicates with the memory 802 via the bus 803 and executes the machine-readable instructions to perform the following steps:
acquiring audio feature information corresponding to target audio and expression feature information corresponding to target expression information;
performing association processing on the audio feature information and the expression feature information to determine audio expression associated feature information;
inputting the expression feature information into a mapping coefficient generation network obtained by pre-training, so as to obtain a target mapping coefficient according to the expression feature information and each mapping coefficient in the mapping coefficient generation network, wherein each mapping coefficient in the mapping coefficient generation network is used for representing a mapping coefficient between one expression and audio;
and determining mapping feature information according to the audio expression associated feature information and the target mapping coefficient, and generating facial animation parameters according to the mapping feature information, wherein the animation parameters are used for generating facial animation matched with the target audio and the target expression information.
In a possible embodiment, when obtaining a target mapping coefficient according to the expression feature information and each mapping coefficient in the mapping coefficient generation network, the processor 801 is specifically configured to:
determining the weight of each mapping coefficient in the mapping coefficient generation network according to the expression feature information;
and determining a target mapping coefficient according to each mapping coefficient and the weight.
In a possible embodiment, when determining the weight of each mapping coefficient in the mapping coefficient generation network according to the expression feature information, the processor 801 is specifically configured to:
inputting the expression feature information into a weight calculation sub-network of the mapping coefficient generation network to obtain the weight of each mapping coefficient.
In one possible embodiment, the weight calculation sub-network includes a first fully-connected layer, an activation function layer, a second fully-connected layer, and an output layer, which are connected in sequence.
In a possible embodiment, the processor 801, when determining the target mapping coefficient according to each mapping coefficient and the weight, is specifically configured to:
respectively determining the product of each mapping coefficient and the corresponding weight to obtain an intermediate coefficient corresponding to each mapping coefficient;
and adding the intermediate coefficients corresponding to the mapping coefficients to obtain the target mapping coefficient.
In a possible embodiment, when the processor 801 performs association processing on the audio feature information and the expression feature information to determine audio expression associated feature information, the processor is specifically configured to:
determining the audio expression associated feature information by using a multi-head attention network obtained by pre-training, with the audio feature information as the key and the value and the expression feature information as the query parameter.
In a possible embodiment, when the processor 801 determines the audio expression associated feature information by using the multi-head attention network obtained through pre-training and using the audio feature information as a key and a value and the expression feature information as a query parameter, the processor is specifically configured to:
inputting the key, the value, and the query parameter into the multi-head attention network so that each is mapped into every attention head; performing, in each head, a dot product between the query parameter and the key to obtain that head's attention weights; and obtaining the audio expression associated feature information according to the attention weights and the values of the heads.
In a possible embodiment, when determining the mapping feature information according to the audio expression associated feature information and the target mapping coefficient, the processor 801 is specifically configured to:
determining initial feature information according to the audio expression associated feature information and the target mapping coefficient;
and taking the initial feature information as new expression feature information, re-determining new audio expression associated feature information, and re-determining new initial feature information; repeating this process until a preset condition is met, and taking the new initial feature information obtained when the process stops as the mapping feature information.
In one possible embodiment, the processor 801, when generating the facial animation parameters according to the mapping feature information, is specifically configured to:
inputting the mapping feature information into a linear network to obtain the facial animation parameters.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the processor executes the following steps:
acquiring audio feature information corresponding to target audio and expression feature information corresponding to target expression information;
performing association processing on the audio feature information and the expression feature information to determine audio expression associated feature information;
inputting the expression feature information into a mapping coefficient generation network obtained by pre-training, so as to obtain a target mapping coefficient according to the expression feature information and each mapping coefficient in the mapping coefficient generation network, wherein each mapping coefficient in the mapping coefficient generation network is used for representing a mapping coefficient between one expression and audio;
and determining mapping feature information according to the audio expression associated feature information and the target mapping coefficient, and generating facial animation parameters according to the mapping feature information, wherein the animation parameters are used for generating facial animation matched with the target audio and the target expression information.
In a possible embodiment, when obtaining a target mapping coefficient according to the expression feature information and each mapping coefficient in the mapping coefficient generation network, the processor is specifically configured to:
determining the weight of each mapping coefficient in the mapping coefficient generation network according to the expression feature information;
and determining a target mapping coefficient according to each mapping coefficient and the weight.
In a possible embodiment, when determining the weight of each mapping coefficient in the mapping coefficient generation network according to the expression feature information, the processor is specifically configured to:
inputting the expression feature information into a weight calculation sub-network of the mapping coefficient generation network to obtain the weight of each mapping coefficient.
In one possible embodiment, the weight calculation sub-network includes a first fully-connected layer, an activation function layer, a second fully-connected layer, and an output layer, which are connected in sequence.
In a possible embodiment, the processor, when determining the target mapping coefficient according to each mapping coefficient and the weight, is specifically configured to:
respectively determining the product of each mapping coefficient and the corresponding weight to obtain an intermediate coefficient corresponding to each mapping coefficient;
and adding the intermediate coefficients corresponding to the mapping coefficients to obtain the target mapping coefficient.
In a possible embodiment, when the processor performs association processing on the audio feature information and the expression feature information to determine audio expression associated feature information, the processor is specifically configured to:
determining the audio expression associated feature information by using a multi-head attention network obtained by pre-training, with the audio feature information as the key and the value and the expression feature information as the query parameter.
In a possible embodiment, when the processor determines the audio expression associated feature information by using the audio feature information as a key and a value and the expression feature information as a query parameter and using a multi-head attention network obtained through pre-training, the processor is specifically configured to:
inputting the key, the value, and the query parameter into the multi-head attention network so that each is mapped into every attention head; performing, in each head, a dot product between the query parameter and the key to obtain that head's attention weights; and obtaining the audio expression associated feature information according to the attention weights and the values of the heads.
In a possible embodiment, when determining mapping feature information according to the audio expression associated feature information and the target mapping coefficient, the processor is specifically configured to:
determining initial feature information according to the audio expression associated feature information and the target mapping coefficient;
and taking the initial feature information as new expression feature information, re-determining new audio expression associated feature information, and re-determining new initial feature information; repeating this process until a preset condition is met, and taking the new initial feature information obtained when the process stops as the mapping feature information.
In one possible embodiment, the processor, when generating the facial animation parameters according to the mapping feature information, is specifically configured to:
inputting the mapping feature information into a linear network to obtain the facial animation parameters.
In the embodiments of the present application, the computer program, when executed by a processor, may further execute other machine-readable instructions to perform other methods described in the embodiments; for the specific method steps and principles, reference is made to the description of the embodiments, and details are not repeated here.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a logical division, and other divisions are possible in actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through communication interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It should be noted that like reference numerals and letters refer to like items in the following figures; therefore, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Moreover, the terms "first", "second", "third", and the like are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate rather than limit its technical solutions, and the scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features within the technical scope disclosed in the present application; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the present disclosure, and are intended to be covered within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. An animation parameter processing method is characterized by comprising the following steps:
acquiring audio feature information corresponding to target audio and expression feature information corresponding to target expression information;
performing association processing on the audio feature information and the expression feature information to determine audio expression associated feature information;
inputting the expression feature information into a mapping coefficient generation network obtained by pre-training, so as to obtain a target mapping coefficient according to the expression feature information and each mapping coefficient in the mapping coefficient generation network, wherein each mapping coefficient in the mapping coefficient generation network is used for representing a mapping coefficient between one expression and audio;
and determining mapping feature information according to the audio expression associated feature information and the target mapping coefficient, and generating facial animation parameters according to the mapping feature information, wherein the animation parameters are used for generating facial animation matched with the target audio and the target expression information.
2. The method of claim 1, wherein obtaining a target mapping coefficient according to the expression feature information and each mapping coefficient in the mapping coefficient generation network comprises:
determining the weight of each mapping coefficient in the mapping coefficient generation network according to the expression feature information;
and determining a target mapping coefficient according to each mapping coefficient and the weight.
3. The method of claim 2, wherein determining the weight of each mapping coefficient in the mapping coefficient generation network according to the expression feature information comprises:
inputting the expression feature information into a weight calculation sub-network of the mapping coefficient generation network to obtain the weight of each mapping coefficient.
4. The method of claim 3, wherein the weight computation sub-network comprises a first fully-connected layer, an activation function layer, a second fully-connected layer, and an output layer connected in sequence.
5. The method of claim 2, wherein determining the target mapping coefficients based on the mapping coefficients and the weights comprises:
respectively determining the product of each mapping coefficient and the corresponding weight to obtain an intermediate coefficient corresponding to each mapping coefficient;
and adding the intermediate coefficients corresponding to the mapping coefficients to obtain the target mapping coefficient.
6. The method of claim 1, wherein the associating the audio feature information and the expression feature information to determine audio expression associated feature information comprises:
determining the audio expression associated feature information by using a multi-head attention network obtained by pre-training, with the audio feature information as the key and the value and the expression feature information as the query parameter.
7. The method of claim 6, wherein the determining the audio expression associated feature information by using the audio feature information as a key and a value and the expression feature information as a query parameter and using a multi-head attention network obtained through pre-training comprises:
inputting the key, the value, and the query parameter into the multi-head attention network so that each is mapped into every attention head; performing, in each head, a dot product between the query parameter and the key to obtain that head's attention weights; and obtaining the audio expression associated feature information according to the attention weights and the values of the heads.
8. The method according to any one of claims 1 to 7, wherein the determining mapping feature information according to the audio expression associated feature information and the target mapping coefficient includes:
determining initial feature information according to the audio expression associated feature information and the target mapping coefficient;
and taking the initial feature information as new expression feature information, re-determining new audio expression associated feature information, and re-determining new initial feature information; repeating this process until a preset condition is met, and taking the new initial feature information obtained when the process stops as the mapping feature information.
9. The method according to any one of claims 1-7, wherein generating facial animation parameters according to the mapping feature information comprises:
inputting the mapping feature information into a linear network to obtain the facial animation parameters.
10. An animation parameter processing apparatus, comprising:
the acquisition module is used for acquiring audio feature information corresponding to the target audio and expression feature information corresponding to the target expression information;
the association module is used for performing association processing on the audio feature information and the expression feature information to determine audio expression associated feature information;
the mapping module is used for inputting the expression feature information into a mapping coefficient generation network obtained by pre-training, so as to obtain a target mapping coefficient according to the expression feature information and each mapping coefficient in the mapping coefficient generation network, wherein each mapping coefficient in the mapping coefficient generation network is used for representing a mapping coefficient between one expression and audio;
and the generating module is used for determining mapping feature information according to the audio expression associated feature information and the target mapping coefficient, and generating facial animation parameters according to the mapping feature information, wherein the animation parameters are used for generating facial animation matched with the target audio and the target expression information.
11. A computer device, comprising: a processor and a memory, the memory storing machine-readable instructions executable by the processor, wherein the processor executes the machine-readable instructions when the computer device is running to perform the steps of the animation parameter processing method as claimed in any one of claims 1 to 9.
12. A computer-readable storage medium, having stored thereon a computer program for performing, when executed by a processor, the steps of the animation parameter processing method as claimed in any one of claims 1 to 9.
CN202211514658.5A 2022-11-29 2022-11-29 Animation parameter processing method and device, computer equipment and readable storage medium Pending CN115810071A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211514658.5A CN115810071A (en) 2022-11-29 2022-11-29 Animation parameter processing method and device, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211514658.5A CN115810071A (en) 2022-11-29 2022-11-29 Animation parameter processing method and device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115810071A true CN115810071A (en) 2023-03-17

Family

ID=85484587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211514658.5A Pending CN115810071A (en) 2022-11-29 2022-11-29 Animation parameter processing method and device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115810071A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117857892A (en) * 2024-02-02 2024-04-09 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment, computer program product and computer readable storage medium based on artificial intelligence
CN117857892B (en) * 2024-02-02 2024-05-14 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment, computer program product and computer readable storage medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination