CN116188634A - Face image prediction method, model, device, equipment and medium - Google Patents

Face image prediction method, model, device, equipment and medium

Info

Publication number
CN116188634A
Authority
CN
China
Prior art keywords
face
feature
prediction
attention
face image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210826331.5A
Other languages
Chinese (zh)
Inventor
杨茂
冯晟
吴海英
蒋宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd filed Critical Mashang Xiaofei Finance Co Ltd
Priority to CN202210826331.5A priority Critical patent/CN116188634A/en
Publication of CN116188634A publication Critical patent/CN116188634A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Image Processing (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The disclosure provides a face image prediction method, model, device, equipment and medium. The face image prediction method includes: processing preset audio data through an encoder to obtain speech features; and performing autoregressive prediction on the speech features and initial face image information through a decoder to obtain a face predicted image sequence matched with the audio data. The initial face image information includes at least one initial face image, the face predicted image sequence includes a plurality of face predicted images, the encoder includes at least one feature extraction module, and the decoder includes at least one attention processing layer set based on a preset attention mechanism. According to the embodiments of the disclosure, the accuracy of face prediction can be effectively improved.

Description

Face image prediction method, model, device, equipment and medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular relates to a face image prediction method, a model and a device, electronic equipment and a computer readable storage medium.
Background
Voice-driven facial animation generates a corresponding facial animation from voice information, which helps users better understand the voice content and improves the convenience and friendliness of human-computer interaction. However, because of the complex geometry of the face and the limited nature of audiovisual data, and because the related art mainly focuses on the phoneme features of short audio with limited learned context, inaccurate lip movements may be generated, resulting in inaccurate facial animation.
Disclosure of Invention
The disclosure provides a face image prediction method, a model and a device, electronic equipment and a computer readable storage medium.
In a first aspect, the present disclosure provides a face image prediction method, including: processing preset audio data through an encoder to obtain voice characteristics; performing autoregressive prediction on the voice characteristics and the initial face image information through a decoder to obtain a face predicted image sequence matched with the audio data; the initial face image information at least comprises an initial face image, the face image prediction sequence comprises a plurality of face prediction images, the encoder comprises at least one feature extraction module, and the decoder comprises at least one attention processing layer set based on a preset attention mechanism.
In a second aspect, the present disclosure provides a predictive model comprising: at least one encoder and at least one decoder; wherein the encoder adopts the encoder according to the embodiment of the disclosure;
the decoder adopts the decoder according to the embodiment of the disclosure.
In a third aspect, the present disclosure provides a face image prediction apparatus, comprising: the first processing unit is used for processing preset audio data through the encoder to obtain voice characteristics; the second processing unit carries out autoregressive prediction on the voice characteristics and the initial face image information through a decoder to obtain a face predicted image sequence matched with the audio data; the initial face image information at least comprises an initial face image, the face image prediction sequence comprises a plurality of face prediction images, the encoder comprises at least one feature extraction module, and the decoder comprises at least one attention processing layer set based on a preset attention mechanism.
In a fourth aspect, the present disclosure provides an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor, the one or more computer programs being executable by the at least one processor to enable the at least one processor to perform the facial image prediction method described above.
In a fifth aspect, the present disclosure provides a computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor/processing core implements the above-described face image prediction method.
According to the embodiments provided by the disclosure, preset audio data are processed by an encoder to obtain speech features; autoregressive prediction is performed on the speech features and initial face image information by a decoder to obtain a face predicted image sequence matched with the audio data; the initial face image information includes at least one initial face image, the face predicted image sequence includes a plurality of face predicted images, the encoder includes at least one feature extraction module, and the decoder includes at least one attention processing layer set based on a preset attention mechanism. In this method, the audio data are first processed by the encoder, yielding speech features that are lower-dimensional and convenient to process, and then autoregressive prediction is performed on the speech features and the initial face image information by the decoder, yielding the face predicted image sequence. Because an autoregressive prediction mode is adopted, that is, the face images already obtained by prediction are used as known information for subsequent predictions, the correlation between face predicted images can be fully utilized. The decoder adopts a preset attention mechanism, which can be used to assign attention weights from the face predicted images already obtained to the face images to be predicted, so that the correlation between face predicted images is used more reasonably and the prediction accuracy is further improved. In addition, the preset attention mechanism can also be used to align the speech features with the face predicted images, and this alignment can effectively improve the prediction quality of the face image sequence.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. The above and other features and advantages will become more readily apparent to those skilled in the art by describing in detail exemplary embodiments with reference to the attached drawings, in which:
fig. 1 is a flowchart of a face image prediction method provided in an embodiment of the present disclosure;
fig. 2 is a flowchart of a face image prediction method provided in an embodiment of the present disclosure;
fig. 3 is a flowchart of a face image prediction method provided in an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a predictive model provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a predictive model provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a multi-head self-attention mechanism provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a first deviation feature provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a first deviation feature provided by an embodiment of the present disclosure;
FIG. 9 is a flowchart of a predictive model training method provided by an embodiment of the present disclosure;
fig. 10 is a block diagram of a face image prediction apparatus according to an embodiment of the present disclosure;
fig. 11 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For a better understanding of the technical solutions of the present disclosure, exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding, and they should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Embodiments of the disclosure and features of embodiments may be combined with each other without conflict.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In addition to audible information, visual information is also important for understanding speech. When speech is presented, presenting a corresponding facial animation at the same time often helps improve understanding of the speech information. Voice-driven face animation synthesis is a technology for generating a corresponding face video from voice; it can be used in scenarios such as human-computer interaction to improve convenience and friendliness.
In some related art, training is performed using only short audio data sets (e.g., phoneme-based training) while the context of long audio is ignored (e.g., training with a short audio window); or a large amount of manual work is required for parameter tuning (e.g., the Dynamic Viseme Model performs face animation synthesis using a one-to-many mapping between phonemes and lip movements, simulates co-articulation effects based on a canonical-set approach, and drives three-dimensional face movements using two anatomical motions; such explicit systematic control can ensure the accuracy of lip movements, but a large amount of manual parameter tuning is required); or the resulting face movements appear only in the lower part of the face (e.g., a face animation method based on the Voice Operated Character Animation (VOCA) framework can capture various speaking styles, but the resulting face movements appear mainly in the lower part of the face); or a large amount of high-fidelity face data is required to guarantee the quality and modeling ability.
In view of this, the embodiments of the present disclosure provide a face image prediction method, a model, a device, an electronic apparatus, and a computer readable storage medium, which encode long audio and obtain a face animation matched with the long audio through autoregressive prediction, thereby improving lip synchronization and facial motion.
The face image prediction method according to the embodiments of the present disclosure may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a vehicle-mounted device, a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a wearable device, or the like, and the method may be implemented by a processor invoking computer readable program instructions stored in a memory. Alternatively, the method may be performed by a server.
In a first aspect, an embodiment of the present disclosure provides a face image prediction method.
Fig. 1 is a flowchart of a face image prediction method provided in an embodiment of the present disclosure. Referring to fig. 1, the method includes:
in step S11, the preset audio data is processed by the encoder to obtain a speech feature.
In step S12, autoregressive prediction is performed on the speech features and the initial face image information by the decoder, so as to obtain a face predicted image sequence matched with the audio data.
The initial face image information at least comprises one initial face image, the face image prediction sequence comprises a plurality of face prediction images, the encoder comprises at least one feature extraction module, and the decoder comprises at least one attention processing layer set based on a preset attention mechanism.
In some possible implementations, the audio data may be a piece of audio. For example, the audio data includes a segment of audio of a human utterance; as another example, the audio data includes audio of a piece of singing voice; as another example, the audio data includes audio of a piece of laughter.
It should be noted that the foregoing description is merely illustrative of the audio data, and the embodiments of the present disclosure do not limit the audio data.
In some possible implementations, the speech features are features extracted from the audio data that pertain to speech.
In some possible implementations, the encoder has at least a feature extraction function that can extract speech features from the audio data.
In some possible implementations, the encoder includes at least one feature extraction module operable to extract speech features from the audio data.
For example, the feature extraction module may extract features based on a convolution manner of the convolution kernel.
The feature extraction module may also extract features based on a time convolution approach, for example.
In some possible implementations, in step S11, processing, by the encoder, preset audio data to obtain a speech feature includes: the encoder extracts speech features from the audio data based on the feature extraction module.
In some possible implementations, the face initial image information includes at least one initial face image. The initial face image is a face model for face prediction, which may be a real face or a virtual face constructed based on a mesh, and the embodiment of the present disclosure does not limit this.
In some possible implementations, the face initial image information may also include individual style information that characterizes a person's speech style. By setting the individual style information, the prediction result can be more personalized and diversified, so that more prediction demands can be met.
As before, after the speech features are obtained, the face image can be predicted in step S12.
It should be emphasized that, in the embodiments of the present disclosure, an autoregressive prediction mode is adopted to predict face images. Autoregression refers to using the values of a variable in previous periods to predict its value in the current period. In other words, in the embodiments of the present disclosure, the prediction results already obtained (i.e., the face predicted images already obtained) are used as known information for the next round of prediction, thereby obtaining a new prediction result (i.e., a new face predicted image). By means of autoregressive prediction, a face predicted image sequence matched with the audio data can be obtained, where the face predicted image sequence includes a plurality of face predicted images. The face predicted image sequence matching the audio data means that, when the face predicted image sequence is displayed in order, it has a corresponding relation with the audio data, representing a face driven by the audio to perform facial movements.
In some possible implementations, in step S12, the speech features and the initial face image information are used as inputs to a decoder, through which an autoregressive prediction is performed, so as to obtain a sequence of face predicted images that match the audio data.
In some possible implementations, the face prediction method provided by the embodiments of the present disclosure can be expressed in a modeling manner, that is, the face image prediction is performed conditioned on the audio data, the individual style information, and the face predicted images already obtained by prediction, which can be expressed by the following formulas:

ŷ_t = F_θ(ŷ_1 , …, ŷ_(t-1) , s_n , X)

Ŷ_T = (ŷ_1 , …, ŷ_T)

where F denotes the autoregressive model corresponding to the decoder, θ is the model parameter, X denotes the audio data, s_n denotes the individual style information, t is the current time step, T is the total number of time steps, ŷ_1 , …, ŷ_(t-1) denote the face predicted images already obtained (i.e., the face predicted images obtained in the time steps before t, of which there may be one or more), ŷ_t is the face predicted image corresponding to the current time step, and Ŷ_T is the face predicted image sequence corresponding to the audio data.
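The autoregressive dependence captured by these formulas can be illustrated with a short Python sketch. The function and argument names below (encoder, decoder, style_id, and so on) are hypothetical placeholders chosen for illustration, not the actual interfaces of this disclosure:

```python
# Hedged sketch of the autoregressive prediction expressed by the formulas above.
# `encoder` and `decoder` stand in for the encoder/decoder of this disclosure;
# their exact call signatures are assumptions made only for illustration.

def predict_face_sequence(encoder, decoder, audio, style_id, total_steps):
    """Return the face predicted image sequence (y_1, ..., y_T)."""
    speech_features = encoder(audio)      # speech features obtained from X
    predicted = []                        # already-obtained predictions y_1 .. y_{t-1}
    for _t in range(total_steps):
        # y_t = F_theta(y_1, ..., y_{t-1}, s_n, X): each new prediction is
        # conditioned on all face images predicted so far (autoregression).
        y_t = decoder(speech_features, predicted, style_id)
        predicted.append(y_t)
    return predicted
```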
In some possible implementations, the decoder includes at least one attention handling layer set based on a preset attention mechanism.
In some possible implementations, the preset attention mechanisms include a biased causal self-attention mechanism (Biased Causal Self-Attention) and a biased cross-modal attention mechanism (Biased Cross-Modal Attention), and the attention processing layers include a first attention processing layer corresponding to the biased causal self-attention mechanism and a second attention processing layer corresponding to the biased cross-modal attention mechanism, respectively.
It should be noted that the first attention processing layer may assign a higher attention weight to a more recent time period so as to bias the attention score, which means that, when prediction is performed in a certain round, the face predicted images closest to the current round have the greatest influence on the prediction of that round. Through this operation, the correlation between face predicted images can be used more reasonably, so that the prediction accuracy can be further improved. The second attention processing layer is mainly used to align the speech features with the facial motion features, so that the facial motion better matches the audio and the prediction quality is improved.
In some possible implementations, before step S11, the method further includes: obtaining audio data corresponding to preset text data. The audio data corresponding to the text data is the preset audio data in step S11. In other words, in the embodiments of the disclosure, in addition to obtaining a face predicted image sequence by directly driving the face with audio data, the face may also be driven indirectly based on text data, so as to adapt to more application scenarios.
In some possible implementations, the preset text data is converted into corresponding audio data by Text To Speech (TTS). TTS belongs to speech synthesis technology and mainly includes a speech processing part and a speech synthesis part: in the speech processing stage, knowledge such as speech rhythm is mainly used to perform word segmentation, part-of-speech judgment, phonetic notation, digit-to-symbol conversion and other processing on text sentences; in the speech synthesis stage, matched speech is obtained and output mainly by querying corresponding speech libraries.
It should be noted that the above manner of converting text into speech is merely illustrative, and the embodiments of the present disclosure are not limited thereto.
It should also be noted that, in some possible implementations, the text data and the audio data may correspond to the same language, or may correspond to different languages, which is not limited by the embodiments of the present disclosure.
Illustratively, the text data corresponds to a first language (e.g., chinese characters), which may be converted into corresponding first-language audio data (e.g., chinese audio data).
Illustratively, the text data corresponds to a first language (e.g., chinese characters) that may be converted into second language audio data (e.g., english audio data).
When the text and the voice are converted, the text data in the first language can be converted into the text data in the second language, and then the text data in the second language can be converted into the audio data in the second language; the text data in the first language may be converted into the audio data in the first language, and then the audio data in the first language may be converted into the audio data in the second language.
For example, the text data is chinese, which may be converted into corresponding english text data, and then converted into corresponding english audio data.
For another example, the text data is chinese, which may be converted into corresponding chinese audio data first, and then the chinese audio data is converted into corresponding english audio data.
It should be appreciated that, similar to the individual style information described above, individual features (e.g., timbre features, accent features, age features, sentence pause features, etc.) may also be added during the conversion of text data to audio data to obtain more vivid, personalized audio data.
It should be further noted that, the face image prediction method provided by the embodiment of the present disclosure may be applied to application scenarios such as broadcasting, animation production, virtual interaction, etc., and the embodiment of the present disclosure does not limit the application scenarios of the face image prediction method.
The broadcasting scene comprises the steps of driving the digital face to execute various broadcasting tasks based on a face image prediction method, including driving the digital face directly based on audio data and driving the digital face based on text data.
In news broadcasting, a virtual 3D digital person is created by a digital person engine as the broadcaster, and the digital person is driven to broadcast a news manuscript based on preset audio data, which appears as the digital person broadcasting the news.
Illustratively, in news broadcasting, a virtual 3D digital person is created by a digital person engine as the broadcaster, the news manuscript (i.e., text data) is converted into audio data using TTS, and the timbre corresponding to the audio data can be selected or adjusted arbitrarily. After the audio data are obtained, the digital person is driven to broadcast the news manuscript based on the audio data.
The digital person engine is used for creating virtual digital persons and includes, but is not limited to, UltraEdit (UE code editor, abbreviated as UE), Unity 3D (a real-time 3D interactive content creation and operation platform), Character Creator (a 3D modeling software), Universal 3D (the universal 3D graphic format standard, abbreviated as U3D), ARKit (an AR development platform launched by Apple in 2017), and the like.
For example, ARKit is used to create a digital face model (including a 2D or 3D face model) with 52 blendshape coefficients. The blendshape coefficients can be regarded as a list of locators corresponding to a series of named coefficient tables. Based on the animation data, the value of each blendshape in the digital face model can be driven to change accordingly; the output blendshape coefficient data are input into the corresponding engine to drive the digital face model, so that the digital face model performs the corresponding facial actions and a face predicted image sequence is obtained. When the audio data and the face predicted image sequence are played synchronously, the digital person appears to broadcast the corresponding news.
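As a rough illustration of how a predicted frame could be handed to such an engine, the sketch below converts one frame of predicted coefficients into named blendshape weights. The helper, the clamping to [0, 1], and the example blendshape names are assumptions for illustration only; the actual coefficient format depends on the engine used:

```python
# Illustrative only: the helper, the clamping policy and the example blendshape
# names are assumptions; the disclosure does not prescribe this mapping.

def to_blendshape_dict(coefficients, names):
    """Convert one predicted frame (e.g., 52 floats) into named blendshape weights."""
    assert len(coefficients) == len(names)
    # Blendshape weights are commonly kept in [0, 1] before being handed to the
    # rendering engine that drives the digital face model.
    return {name: min(max(float(c), 0.0), 1.0) for name, c in zip(names, coefficients)}

# Example with two hypothetical blendshape names:
frame = to_blendshape_dict([0.12, 0.87], ["jawOpen", "mouthSmileLeft"])
```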
In addition to the news broadcasting scenario, digital persons can be applied to fields such as game production and animation production. For example, after a digital person is completed in the corresponding animation plug-ins and tools (e.g., C4D, 3ds Max, Maya, KeyShot, Rhinoceros, etc.), it can be imported into a game engine to obtain the animated figures in each scene.
It should be appreciated that, compared with the traditional news broadcasting by a professional host, the adoption of the digital person for news broadcasting can effectively reduce broadcasting errors caused by human errors, and meanwhile, the labor cost can be reduced.
The virtual interactions include various virtual interaction tasks, including virtual interactions in online games, virtual interactions in online social, and the like.
Illustratively, in a game interaction scenario, a user picks a certain game character to participate in a game, and typically, different game characters have different game images, which may be embodied as interactions between different game characters when interacting based on the game. For example, user a corresponds to animated character a in a game and user B corresponds to animated character B in the game, and when user a sends a first voice message to user B, it appears to user B that animated character a "speaks" the first voice message, i.e., animated character a is driven by the first voice message to read the first voice message with an animated face that matches the first voice message. User B feeds back a second voice message to user a, similarly, as animated character B "speaks" the second voice message, again animated character B being driven by the second voice message to read the second voice message with an animated face that matches the second voice message. In this way, a realistic audiovisual experience can be simulated in the game virtual world, increasing the immersion of the player.
In the embodiments of the present disclosure, the case where the user a sends the first voice message to the user B is taken as an example. The first voice message is audio data and is used for driving corresponding face actions. In some examples, the first voice message is encoded by the encoder to obtain a corresponding voice feature, then the initial face image of the voice feature and the animated character a is subjected to autoregressive prediction by the decoder to obtain a face predicted image sequence of the animated character a matched with the first voice message, and the first voice message and the face predicted image sequence are played at the same time, so that the first voice message is read out from the view of the user B as the animated character a.
In an animation scene, for example, after a certain animation image is designed, the animation image is driven by corresponding dubbing data (including a speech word and the like), and the corresponding speech word in the dubbing data is read out for the animation image, so that the animation image is more vivid and natural, and the viewing experience of a user is improved.
In this embodiment, the dubbing data is encoded by the encoder to obtain the corresponding voice feature, and then the initial face image corresponding to the voice feature and the animated image is autoregressively predicted by the decoder to obtain the sequence of the animated face predicted image matched with the dubbing data, and when the sequence of the animated face predicted image and the dubbing data are played at the same time, the corresponding speech is read out by the animated image, that is, the speech and the animated image are combined together to form an integral image.
The application scenario and the prediction method of other face image prediction methods are similar to those described above, and will not be described here.
It should be emphasized that the embodiment of the disclosure uses an autoregressive prediction mode, that is, a face predicted image obtained by prediction is used as priori information for predicting a subsequent face image, so that the association relationship between the face predicted images is fully utilized, and the prediction accuracy of the face image is improved.
In the embodiments of the present disclosure, preset audio data are processed by an encoder to obtain speech features; autoregressive prediction is performed on the speech features and initial face image information by a decoder to obtain a face predicted image sequence matched with the audio data; the initial face image information includes at least one initial face image, the face predicted image sequence includes a plurality of face predicted images, the encoder includes at least one feature extraction module, and the decoder includes at least one attention processing layer set based on a preset attention mechanism. This method makes full use of the context information of the audio data, so that the speech features obtained from the audio data contain rich context information and the prediction accuracy is improved. In addition, during the operation of the decoder, accurate cross-modal matching between the audio and the face can be achieved through the preset attention mechanism, further improving the prediction accuracy and quality.
In some possible implementations, the encoder and the decoder process data based on time steps, and adjacent face predicted images are separated by a predetermined time step interval.
The face image prediction method provided by the embodiment of the present disclosure is described in the following with reference to fig. 2.
Fig. 2 is a flowchart of a face image prediction method provided in an embodiment of the present disclosure. Referring to fig. 2, the method includes:
step S21, aiming at the t time step, processing the audio data corresponding to the t time step through an encoder to obtain the voice feature corresponding to the t time step.
where 1 < t ≤ T, t and T are integers, and T is the total number of time steps.
In some possible implementations, the audio data is divided based on time steps such that each time step corresponds to a portion of the audio data and the corresponding audio data is processed within each time step. In other words, in the t-th time step, the encoder performs feature extraction on the audio data corresponding to the t-th time step, thereby obtaining the speech feature corresponding to the t-th time step.
It will be appreciated that for the same audio data, the more predictions are made, the more accurate the prediction result is, as T is greater. Conversely, when T is smaller, the number of predictions is smaller, and the prediction speed is relatively faster. In practical application, the total number of time steps T may be set according to any one or more of experience, statistical data and practical requirements, and the setting manner of the total number of time steps in the embodiment of the present disclosure is not limited.
Step S22, aiming at the t-th time step, processing the speech features corresponding to the 1st to t-th time steps and the 1st to (t-1)-th face predicted images through a decoder to obtain the t-th face predicted image.
In some possible implementations, the decoder predicts in an autoregressive manner: it uses the speech features corresponding to the 1st to t-th time steps and the 1st to (t-1)-th face predicted images as input information (i.e., known information), and obtains the t-th face predicted image through the processing of the preset attention mechanism.
Step S23, when the T-th face predicted image is obtained, the 1 st to T-th face predicted images are arranged according to time sequence, and a face predicted image sequence is obtained.
In some possible implementations, after T predictions are completed, T face prediction images are obtained, and the 1 st to T th face prediction images are sequentially arranged in time order, that is, a face prediction image sequence matched with the audio data is obtained.
It should be noted that the prediction process of the 1st time step differs slightly from that of the subsequent time steps. For the 1st time step (i.e., t=1), the encoder processes the audio data corresponding to the 1st time step to obtain the speech feature corresponding to the 1st time step; however, since this prediction is the first one, there is no previously predicted result available for autoregressive prediction, so the prediction needs to be performed based on the initial face image, obtaining the face predicted image corresponding to the 1st time step. In the autoregressive prediction of the 2nd time step, the face predicted image of the 1st time step is used as known information; for the detailed procedure, reference is made to the related content of the embodiments of the present disclosure, which is not repeated here.
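A minimal sketch of this per-time-step flow (steps S21 to S23), assuming hypothetical encode_step/decode_step helpers, is shown below; note how the 1st time step falls back to the initial face image while later steps reuse earlier predictions:

```python
# Hedged sketch of steps S21-S23. `encode_step` and `decode_step` are
# hypothetical helpers standing in for the encoder and decoder of this disclosure.

def predict_by_time_steps(encode_step, decode_step, audio_chunks, initial_face):
    """audio_chunks[t-1] is the audio segment for time step t (t = 1..T)."""
    speech_feats = []   # speech features of time steps 1..t            (step S21)
    face_preds = []     # face predicted images of time steps 1..t-1
    for t, chunk in enumerate(audio_chunks, start=1):
        speech_feats.append(encode_step(chunk))
        if t == 1:
            # First round: no earlier prediction exists, so the initial face
            # image is used as the basis for prediction.
            y_t = decode_step(speech_feats, [initial_face])
        else:
            y_t = decode_step(speech_feats, face_preds)              #  step S22
        face_preds.append(y_t)
    return face_preds    # ordered by time, i.e. the predicted sequence (step S23)
```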
Fig. 3 is a flowchart of a face image prediction method provided in an embodiment of the present disclosure. Referring to fig. 3, the method includes:
step S31, performing time convolution operation on the audio data based on at least one time convolution layer to obtain an initial voice characteristic with a first frequency.
In some possible implementations, a temporal convolutional network (Temporal Convolutional Network, TCN) layer performs temporal convolution on the input raw audio data to convert the audio data into an initial speech feature having a first frequency f_a.
Step S32, interpolation processing is carried out on the initial voice characteristic based on the interpolation processing layer, and the intermediate voice characteristic with the second frequency is obtained.
In some possible implementations, the second frequency is determined from the first frequency and a third frequency corresponding to the sequence of face predicted images.
In some possible implementations, considering that the capture frequency f_m of the facial motion data (i.e., the third frequency corresponding to the face predicted image sequence) is generally different from f_a (e.g., f_a = 49 Hz, f_m = 25 fps), the interpolation processing layer is used to interpolate the initial speech feature to obtain an intermediate speech feature having the second frequency, such that the length of the finally output speech feature is kT (corresponding to the speech feature A_kT = (a_1, …, a_kT) described below).
It should be noted that, the interpolation processing layer may perform interpolation processing based on any interpolation algorithm such as linear interpolation and parabolic interpolation and its improved algorithm, and the embodiment of the present disclosure does not limit the interpolation mode.
The linear interpolation function is a first-order polynomial whose interpolation error at the interpolation nodes is zero, and it has the advantages of simplicity and convenience. Linear interpolation can be used to approximately replace the original function, and can also be used to calculate values that are not listed in a lookup table.
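The resampling of the initial speech feature from the first frequency to the face-motion rate can be sketched as follows, assuming PyTorch tensors and taking the 49 Hz / 25 fps figures from the example above; the actual interpolation layer of the disclosure may be implemented differently:

```python
# Sketch of the interpolation step, assuming PyTorch tensors and the example
# frequencies given above; the disclosed layer may differ in its details.
import torch
import torch.nn.functional as F

def resample_speech_features(feats, f_a=49.0, f_m=25.0, k=1):
    """
    feats: (batch, length, dim) initial speech features at the first frequency f_a.
    Returns features whose length matches k frames per face-motion frame at f_m.
    """
    batch, length, dim = feats.shape
    target_len = int(round(length * (k * f_m) / f_a))
    # F.interpolate with mode="linear" expects (batch, channels, length).
    out = F.interpolate(feats.transpose(1, 2), size=target_len,
                        mode="linear", align_corners=False)
    return out.transpose(1, 2)

x = torch.randn(1, 98, 64)          # roughly 2 s of audio features at 49 Hz
y = resample_speech_features(x)     # -> shape (1, 50, 64), aligned to 25 fps
```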
Step S33, processing the intermediate speech features from a plurality of relevance dimensions of the speech by a multi-headed self-attention processing layer to obtain a context-based speech representation.
In step S34, the context-based speech representation is output through the feed-forward layer to obtain the speech features.
The multi-head self-attention processing layer is connected with at least one feedforward layer, and the feedforward layer has a network layer connection function.
In some possible implementations, the encoder processes based on a preset time step.
For the t-th time step, the temporal convolution layer performs temporal convolution on the audio data corresponding to the t-th time step to obtain an initial speech feature with the first frequency corresponding to the t-th time step; the initial speech feature is interpolated by the interpolation processing layer to obtain an intermediate speech feature, which is input to the multi-head self-attention processing layer. The multi-head self-attention processing layer performs self-attention calculation based on the intermediate speech features of a plurality of time steps, processes the intermediate speech features from a plurality of correlation dimensions of the speech to obtain a context-based speech representation, and outputs it through the feed-forward layer to obtain the corresponding speech features A_T′ = (a_1, …, a_T′), where T′ denotes the length of the speech feature sequence obtained after interpolation.
and step S35, encoding the 1 st face predicted image to the t-1 st predicted face image through a periodic position encoding module to obtain a first intermediate motion characteristic.
In some possible implementations, the speech features are in vector form (i.e., a sequence of time steps). The periodic position coding module adopts a sinusoidal position coding method, and the sinusoidal position coding method injects position information in a preset period to determine the position of the generated code in the corresponding vector of the voice feature.
In some possible implementations, the encoded PPE generated based on the sinusoidal position coding method can be expressed using the following formula:
PPE(t, 2i) = sin((t mod p) / 10000^(2i/d))
PPE(t, 2i+1) = cos((t mod p) / 10000^(2i/d))
wherein t is the current time step, d is the model dimension, i is the dimension index, mod is the modulo operator, i is more than or equal to 1 and less than or equal to t, and p represents the preset period.
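A direct transcription of the two PPE formulas into Python is given below as an unofficial sketch; the dimension and period values used in the example call are arbitrary assumptions:

```python
import math

def periodic_position_encoding(t, d, p):
    """PPE vector for time step t; d is the model dimension, p the preset period."""
    ppe = [0.0] * d
    for i in range(d // 2):
        angle = (t % p) / (10000 ** (2 * i / d))
        ppe[2 * i] = math.sin(angle)        # PPE(t, 2i)
        ppe[2 * i + 1] = math.cos(angle)    # PPE(t, 2i+1)
    return ppe

vec = periodic_position_encoding(t=7, d=8, p=3)   # d=8, p=3 are arbitrary examples
```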
Step S36, based on the attention processing layer, performing cross-mode attention alignment processing on the first intermediate motion feature and the voice features corresponding to the 1 st time step to the t time step to obtain the t face prediction feature.
In some possible implementations, the attention processing layers include a first attention processing layer and a second attention processing layer, and the first attention processing layer is connected in series with the second attention processing layer. Accordingly, step S36 includes:
obtaining a first query feature, a first key feature and a first value feature according to the first intermediate motion feature;
calculating, by the first attention processing layer, a weighted context in a scaled dot-product attention manner according to the first query feature, the first key feature, the first value feature and a preset first deviation feature, to obtain a second intermediate motion feature; and
performing, by the second attention processing layer, frequency alignment on the second intermediate motion feature and the corresponding speech features to obtain the t-th face prediction feature, where the speech features corresponding to the second intermediate motion feature include the speech features corresponding to the 1st to t-th time steps.
Illustratively, for the first intermediate motion feature (i.e., the sequence consisting of the encodings of the face predicted images already obtained), the first attention processing layer first linearly maps it, with respect to the dimension d, into a first query feature Q, a first key feature K and a first value feature V. In addition, in order to learn the dependency relationship between the face predicted images already obtained, a first deviation feature B is preset, and a weighted context representation is calculated by means of a scaled dot-product attention mechanism to obtain the second intermediate motion feature:

Att(Q, K, V, B) = softmax(Q K^T / sqrt(d) + B) V

where B is an added bias with respect to time, used to ensure causality and to improve the ability to generalize to longer sequences, softmax is the normalized exponential function, and Att denotes the scaled dot-product self-attention operation.
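A single-head sketch of this biased scaled dot-product attention, written with PyTorch and assuming the projections Q, K, V and the first deviation feature B are already available, might look as follows (illustrative only, not the disclosed implementation):

```python
# Single-head illustration with PyTorch; multi-head splitting, masking details
# and layer normalization are omitted on purpose.
import math
import torch

def biased_causal_attention(Q, K, V, B):
    """
    Q, K, V: (t-1, d) projections of the first intermediate motion feature.
    B:       (t-1, t-1) first deviation feature (upper triangle set to -inf).
    Returns the second intermediate motion feature of shape (t-1, d).
    """
    d = Q.size(-1)
    scores = Q @ K.transpose(0, 1) / math.sqrt(d) + B   # Q K^T / sqrt(d) + B
    weights = torch.softmax(scores, dim=-1)             # attention weights
    return weights @ V                                  # weighted context
```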
In some possible implementations, frequency-aligning, by the second attention processing layer, the second intermediate motion feature and the corresponding speech feature to obtain a t-th face prediction feature, including:
obtaining a second key feature and a second value feature according to the voice feature corresponding to the second intermediate motion feature;
obtaining a second query feature according to the second intermediate motion feature;
and processing the second key feature, the second value feature, the second query feature and the preset second deviation feature through a second attention processing layer to obtain a t-th face prediction feature.
Illustratively, after A_kT and the second intermediate motion feature are input into the second attention processing layer, A_kT is converted into two separate features, a second key feature K^A and a second value feature V^A, and the second intermediate motion feature is converted into a second query feature Q'. The output t-th face prediction feature is essentially a weighted combination of V^A and can be expressed as:

Att(Q', K^A, V^A, B^A) = softmax(Q' (K^A)^T / sqrt(d) + B^A) V^A

where B^A is the preset second deviation feature used for aligning the two modalities.
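The cross-modal step can be sketched in the same spirit: queries come from the motion side, keys and values from the speech features, and the preset second deviation feature biases the alignment. The projection matrices passed in below are assumptions introduced for illustration:

```python
import math
import torch

def biased_cross_modal_attention(motion_feats, speech_feats, W_q, W_k, W_v, B_A):
    """
    motion_feats: (t-1, d) second intermediate motion feature -> queries.
    speech_feats: (L, d)   speech features A for time steps 1..t -> keys/values.
    B_A:          (t-1, L) preset second deviation (alignment) feature.
    """
    Q = motion_feats @ W_q                  # second query feature
    K = speech_feats @ W_k                  # second key feature
    V = speech_feats @ W_v                  # second value feature
    scores = Q @ K.transpose(0, 1) / math.sqrt(Q.size(-1)) + B_A
    return torch.softmax(scores, dim=-1) @ V   # weighted sum over the audio side
```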
step S37, according to the t-th face prediction feature, a t-th face prediction image is obtained.
In some possible implementations, the t-th face prediction feature is input to a Motion Decoder (Motion Decoder), and the t-th face prediction image is obtained through processing of the Motion Decoder.
In step S38, when the T-th face predicted image is obtained, the 1 st to T-th face predicted images are arranged in time order to obtain a face predicted image sequence.
It should be noted that, in the embodiments of the present disclosure, considering that face predicted images and/or speech features far from the current time step have a smaller influence on the face image prediction of the current time step, their influence may be ignored in order to reduce the amount of calculation, and the attention mechanism may be applied only to the several face predicted images close to the current time step. Illustratively, the face predicted image of the t-th time step is predicted based on the speech features of the m-th to t-th time steps and the face predicted images of the n-th to (t-1)-th time steps (ignoring the influence of the speech features of the 1st to (m-1)-th time steps and of the face predicted images of the 1st to (n-1)-th time steps), where m is an integer greater than 1 and less than t, and n is an integer greater than 1 and less than t-1; n and m may take the same value or different values, which is not limited by the embodiments of the present disclosure.
Fig. 4 is a schematic diagram of a prediction model provided in an embodiment of the disclosure. Referring to fig. 4, the prediction model includes: at least one encoder 41 and at least one decoder 42.
Wherein the encoder 41 adopts the encoder in the embodiment of the present disclosure, and the decoder 42 adopts the decoder in the embodiment of the present disclosure.
The prediction model will be described in detail with reference to fig. 5.
Fig. 5 is a schematic diagram of a prediction model provided in an embodiment of the disclosure. Referring to fig. 5, the prediction model includes: an encoder and a decoder.
In some possible implementations, the design of the encoder follows the self-supervised pre-trained speech model wav2vec 2.0 (by way of example only; other versions of the model are possible). In an embodiment of the disclosure, the encoder is composed of a first feature extraction module, a second feature extraction module, and some common network layers. The first feature extraction module includes a plurality of temporal convolution layers (TCNs) and an interpolation processing layer; the temporal convolution layers convert the input raw audio data into an initial speech feature with frequency f_a, and the interpolation processing layer then performs an interpolation operation so as to align it with the motion features of the human face. The second feature extraction module includes a multi-head self-attention (MH self-attention) processing layer and a feed-forward layer (Feed Forward), which convert the interpolated features into a contextual speech representation. The second feature extraction module is also followed by a linear projection (Linear Projection) layer for projecting the data into a preset space. The encoder may use pre-trained wav2vec 2.0 weights for initialization.
Illustratively, the raw audio data input to the encoder is X. Feature extraction is first performed based on a plurality of temporal convolution layers to obtain an initial speech feature with frequency f_a. Considering that the capture frequency f_m of the facial motion data is generally different from f_a (e.g., f_a = 49 Hz, f_m = 25 fps), the interpolation processing layer is used to interpolate the initial speech feature to obtain an intermediate speech feature with the second frequency. The intermediate speech feature is then processed from a plurality of correlation dimensions of the speech by the multi-head self-attention processing layer to obtain a context-based speech representation, which is passed through the feed-forward layer and input to the linear projection layer, finally outputting the speech feature A_kT with length kT; in other words, A_kT = (a_1, …, a_kT). In subsequent processing, the speech features and the motion features of the face can be aligned by a biased cross-modal multi-head attention mechanism.
The respective network layers of the encoder are described below.
1. Time convolution layer: which uses convolution for sequence modeling and prediction, consists of dilated, causal one-dimensional convolution layers with the same input and output lengths. The time convolution is mainly set to enable the convolution neural network to have time sequence characteristics. The temporal convolution layer may achieve similar or higher processing power in a variety of task processing scenarios as compared to a variety of recurrent neural network structures.
2. Interpolation processing layer: it processes the TCN output having frequency f_a by using a corresponding interpolation function. In some possible implementations, linear interpolation is adopted; the corresponding interpolation function is a first-order polynomial, and the interpolation error at the interpolation nodes is zero. Linear interpolation is simpler and more convenient than other interpolation methods (e.g., parabolic interpolation).
3. Multi-head self-attention processing layer: it is one kind of attention mechanism. A common single-head self-attention mechanism performs its calculation based on the query feature Q, the key feature K and the value feature V, while the multi-head self-attention mechanism divides Q, K, V into a plurality of branches according to the number of heads (the number of branches is the same as the number of heads) and performs the corresponding calculation for each branch. Compared with single-head self-attention, multi-head self-attention introduces multiple vectors to capture correlations from multiple dimensions. Through the processing of the multi-head attention processing layer, the intermediate audio features can be converted into a contextual speech representation.
Fig. 6 is a schematic structural diagram of a multi-head self-attention mechanism according to an embodiment of the present disclosure. See fig. 6, which belongs to the structure of the 2-head self-attention mechanism, with two branches, one for each "head".
Taking the first branch as an example, the input x_i is multiplied by the first-stage linear transformation matrices W_q, W_k and W_v to obtain q_i, k_i and v_i, respectively. Further, q_i is multiplied by the second-stage linear transformation matrices W_q,1 and W_q,2 to obtain q_i,1 and q_i,2; k_i is multiplied by W_k,1 and W_k,2 to obtain k_i,1 and k_i,2; and v_i is multiplied by W_v,1 and W_v,2 to obtain v_i,1 and v_i,2. The second branch is processed in the same way and is not described again here.
4. Feed-forward layer: the feed-forward neural network is the simplest neural network, the neurons are arranged in layers, and each neuron is only connected with the neurons of the previous layer. The feed-forward layer is arranged to receive the output of the previous layer and output it to the next layer.
5. Linear projection layer: it is used for projecting the output of the preceding layer into a preset dimension. In an embodiment of the present disclosure, it projects the output of the feed-forward layer into the preset dimension, thereby obtaining the speech feature, which may be represented as A_kT = (a_1, …, a_kT).
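Putting the encoder layers described above together, a highly simplified PyTorch sketch could look as follows; layer sizes, kernel sizes and the use of nn.TransformerEncoderLayer as the multi-head self-attention plus feed-forward block are assumptions, not the disclosed architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoderSketch(nn.Module):
    """Illustrative pipeline only; all sizes and layer counts are assumptions."""

    def __init__(self, dim=64, heads=2):
        super().__init__()
        # Temporal convolution layers (stand-in for the TCN feature extractor).
        self.tcn = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2), nn.GELU(),
        )
        # Multi-head self-attention + feed-forward (contextual representation).
        self.ctx = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                              batch_first=True)
        self.proj = nn.Linear(dim, dim)   # linear projection layer

    def forward(self, audio, target_len):
        x = self.tcn(audio.unsqueeze(1))                   # (B, dim, L_a) at f_a
        x = F.interpolate(x, size=target_len,              # interpolation layer:
                          mode="linear",                   # align length to kT
                          align_corners=False)
        x = x.transpose(1, 2)                              # (B, kT, dim)
        return self.proj(self.ctx(x))                      # speech feature A_kT
```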
In some possible implementations, the decoder includes a periodic position coding module, a first attention processing layer, a second attention processing layer, and a feed-forward layer.
1. And the periodic position coding module is used for:
in the related art, the adoption of the sinusoidal position coding method may result in limited generalization capability for longer sequences. Therefore, in order to improve generalization ability, a method based on linear bias (Attention with Linear Biases, ALiBi) is proposed, in which constant bias is added to the query key attention score to improve generalization ability. However, if ALiBi is directly substituted for the sinusoidal position coding, this will lead to a resting facial expression during prediction. This is because ALiBi does not add any position information to the input representation, which may affect the robustness of the chronological information. To alleviate this problem, embodiments of the present disclosure propose a periodic position coding approach to inject temporal sequential information while being ALiBi compatible. In other words, the embodiment of the present disclosure upgrades the sinusoidal position encoding method in the related art to have periodicity with respect to one super parameter p representing the period:
PPE(t, 2i) = sin((t mod p) / 10000^(2i/d))
PPE(t, 2i+1) = cos((t mod p) / 10000^(2i/d))
wherein t is the current time step, d is the model dimension, i is the dimension index, mod is the modulo operator, i is more than or equal to 1 and less than or equal to t, and p represents the preset period.
It should be noted that, in the periodic position coding method provided in the embodiment of the present disclosure, instead of assigning a unique position identifier to each input (token), position information is repeatedly injected in each period (i.e., a preset period) p.
It should also be noted that, in some possible implementations, before the encoding by the periodic position encoding module, the face predicted images corresponding to the previous time steps are first projected into a preset space (i.e., the d-dimensional space) by a motion encoder (Motion Encoder).
In some possible implementations, to model the speaking style, a Style Embedding layer may also be provided to embed the speaker identity / individual style information s_n into the d-dimensional space and add it to the projected motion features:

f_t = W_f · ŷ_{t-1} + b_f + s_n

where f_t is the intermediate operation result corresponding to the t-th time step, W_f is the weight, b_f is the bias amount, s_n is the style embedding, and ŷ_{t-1} is the face predicted image corresponding to the previous time step.
After f_t is obtained, the periodic position coding module performs periodic iterative coding:

f̂_t = f_t + PPE(t)

where f̂_t represents the coding of the face predicted image corresponding to the t-th time step.
It should be noted that, in some possible implementations, s_n adopts one-hot encoding, where one encoding corresponds to one speaker; thus, different speaking styles can be output by changing the one-hot identity vector.
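The sketch below illustrates the periodic position encoding defined above and how it might be combined with the motion-encoder output and the style embedding to form f̂_t. The function name, the loop-based implementation and the tensor shapes are assumptions made for illustration only.

```python
import math
import torch

def periodic_position_encoding(t: int, d: int, p: int) -> torch.Tensor:
    """PPE(t, 2i) = sin((t mod p)/10000^(2i/d)), PPE(t, 2i+1) = cos((t mod p)/10000^(2i/d))."""
    pe = torch.zeros(d)
    pos = t % p                      # position information repeats every p steps
    for idx in range(0, d, 2):       # idx plays the role of 2i in the formula
        div = 10000.0 ** (idx / d)
        pe[idx] = math.sin(pos / div)
        if idx + 1 < d:
            pe[idx + 1] = math.cos(pos / div)
    return pe

# Usage sketch with assumed dimensions: motion-encoder output plus style embedding,
# then the periodic position encoding, giving the decoder input f_hat_t.
d, p, t = 64, 30, 5
motion_feat = torch.randn(d)         # W_f . y_hat_{t-1} + b_f
style_emb = torch.randn(d)           # s_n from the one-hot speaker identity
f_t = motion_feat + style_emb
f_hat_t = f_t + periodic_position_encoding(t, d, p)
```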
2. First attention processing layer:
in the embodiment of the disclosure, the first attention processing layer is set based on a biased causal attention mechanism. The biased causal attention mechanism is compatible with ALiBi, which facilitates generalization to longer sequences in language modeling. For the face predicted image coding sequence (i.e., the first intermediate motion feature, denoted here as F̂), the first attention processing layer first linearizes F̂ into a first query feature Q^F̂, a first key feature K^F̂ and a first value feature V^F̂ with respect to dimension d. In addition, in order to learn the dependency relationship between the face predicted images that have already been obtained, a first bias feature B^F̂ is added, and a weighted context representation is computed by a scaled dot-product attention mechanism:

Att(Q^F̂, K^F̂, V^F̂, B^F̂) = softmax(Q^F̂ (K^F̂)^T / √d + B^F̂) V^F̂

where B^F̂ is the time bias added to ensure causality and to improve generalization to longer sequences, the softmax function is the normalized exponential function, and Att is the dot-product self-attention mechanism.
Illustratively, B^F̂ is a matrix whose upper triangular area takes the value negative infinity, which is used to reduce the influence of future face predicted images on the current prediction. Moreover, in order to increase generalization ability, a static, non-learned bias is added to the lower triangular area of B^F̂.
It should be noted that, unlike ALiBi, the embodiment of the present disclosure introduces a preset period p and injects the time bias in each period ([1:p], [p+1:2p], …). B^F̂ can be expressed as:

B^F̂(i, j) = -⌊(i - j) / p⌋ for j ≤ i, and B^F̂(i, j) = -∞ for j > i

where B^F̂(i, j) represents the element in the i-th row and j-th column of the matrix corresponding to the first bias feature, i represents the row number, j represents the column number, 1 ≤ i ≤ T, 1 ≤ j ≤ T, and p represents the preset period.
Fig. 7 is a schematic diagram of a first bias feature provided by an embodiment of the present disclosure. Referring to fig. 7, the matrix corresponding to the first bias feature is illustrated with T=13 and with p corresponding to 3 time steps; the value of each element is shown in the matrix (elements left blank in the upper triangular area take the value -∞).
As shown in fig. 7, when i=13 and j=1, substituting into the calculation formula for B^F̂(i, j) gives the value -4 for the corresponding element; when i=4 and j=1, substitution gives the value -1; and when i=2 and j=5, substitution gives the value -∞. The other elements in the matrix may be calculated in a similar manner and are not described again here.
In this way, the embodiments of the present disclosure assign higher attention weights to more recent time periods, thereby biasing the attention; in other words, the face predicted images closest in time are the most likely to affect the prediction at the current time step. Thus, the time bias proposed by the embodiments of the present disclosure can be regarded as a generalized form of ALiBi (ALiBi being the special case where p=1).
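A minimal sketch of the periodic causal bias matrix described above is given below; the function name is illustrative, and i, j are 1-indexed as in the text. The printed values reproduce the fig. 7 example, and in the attention computation this matrix would be added to Q^F̂ (K^F̂)^T / √d before the softmax.

```python
import torch

def causal_periodic_bias(T: int, p: int) -> torch.Tensor:
    """First bias feature B^F (T x T): -inf above the diagonal, and the
    non-learned periodic bias -floor((i - j) / p) on and below the diagonal."""
    bias = torch.full((T, T), float("-inf"))
    for i in range(1, T + 1):
        for j in range(1, i + 1):
            bias[i - 1, j - 1] = -((i - j) // p)
    return bias

b = causal_periodic_bias(T=13, p=3)
print(b[12, 0])   # i=13, j=1 -> -4, matching fig. 7
print(b[3, 0])    # i=4,  j=1 -> -1
print(b[1, 4])    # i=2,  j=5 -> -inf
```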
In some possible implementations, the biased causal attention mechanism described above may take a multi-head (MH) form, i.e., a biased causal multi-head self-attention mechanism. Using a multi-head attention mechanism, the complementary information of a plurality of representation subspaces can be jointly extracted, and the outputs of the plurality of heads are concatenated together and forward-projected by a parameter matrix W^O:

MH(F̂) = Concat(head_1, …, head_H) W^O

where head_h = Att(Q^F̂ W_h^Q, K^F̂ W_h^K, V^F̂ W_h^V, B_h^F̂) and Concat is the concatenation function.
It should be noted that, similar to ALiBi, the embodiment of the present disclosure provides a head-specific scalar m for the multiple heads. For each head h, the time bias is defined as m_h · B^F̂, where the scalar m_h is a head-specific slope that is not learned during training. For H heads, the slopes start from 2^(-8/H), and each element is multiplied by the same ratio to obtain the next element. For example, in the case of 4 heads, the corresponding slopes are 2^-2, 2^-4, 2^-6 and 2^-8.
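The head-specific slopes can be sketched as below; the geometric sequence starting at 2^(-8/H) is inferred from the ALiBi convention and from the 4-head example above, so treat it as an assumption rather than the disclosed parameterization.

```python
def head_slopes(num_heads: int):
    """Head-specific slopes m_h: a geometric sequence starting at 2^(-8/H)."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

print(head_slopes(4))   # [0.25, 0.0625, 0.015625, 0.00390625] = 2^-2, 2^-4, 2^-6, 2^-8
```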
3. Second attention processing layer
The second attention processing layer is set based on a biased cross-attention mechanism and is mainly used for realizing alignment between the speech features and the face motion features. In the disclosed embodiment, the second attention processing layer uses a multi-head attention mechanism and incorporates a second bias feature B_A, which can be expressed as:

B_A(i, j) = 0 for k(i-1) < j ≤ ki, and B_A(i, j) = -∞ otherwise

where B_A(i, j) represents the element in the i-th row and j-th column of the matrix corresponding to the second bias feature, i represents the row number, j represents the column number, 1 ≤ i ≤ t, 1 ≤ j ≤ kt, and k represents the frequency ratio (for example, the ratio of the speech feature frequency to the frequency of the face predicted image sequence).
fig. 8 is a schematic diagram of a second bias feature provided by an embodiment of the present disclosure. Referring to fig. 8, the second bias feature is illustrated with T=10 and with k corresponding to 2 time steps, giving a 10×20 matrix; the value of each element is shown in the matrix (elements left blank take the value -∞).
Substituting the values of i and j into the expression for B_A in turn gives the values of the elements in the matrix; the corresponding values are shown in fig. 8 and are not described again here.
It should be noted that, due to the self-attention mechanism in the encoder, each token in A^{kT} captures the context of the audio, while the output of the first attention processing layer (i.e., the second intermediate motion feature, denoted here as F̃) is such that each symbol in it encodes the history of the face predicted images. After A^{kT} and F̃ are input to the second attention processing layer, A^{kT} is converted into two separate features, a second key feature K^A and a second value feature V^A, and F̃ is converted into a second query feature Q^F̃. The output is essentially a weighted sum of V^A and can be expressed as:

Att(Q^F̃, K^A, V^A, B_A) = softmax(Q^F̃ (K^A)^T / √d + B_A) V^A
in some embodiments, the biased cross-attention mechanism may also be based on a multi-headed self-attention mechanism to obtain information of different relevance dimensions.
Similar to the first attention processing layer, the second attention processing layer may be a multi-headed attention processing layer set based on a biased cross-modal multi-headed attention mechanism, thereby jointly extracting the complementary information of the plurality of representation subspaces.
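The sketch below illustrates the alignment bias and the biased cross-attention between the motion query and the audio key/value features. The alignment rule (each face frame attending to its k corresponding audio frames) is inferred from the fig. 8 description, and the function names and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_bias(t: int, k: int) -> torch.Tensor:
    """Second bias feature B_A (t x kt): 0 where audio frame j belongs to face
    frame i (k*(i-1) < j <= k*i, 1-indexed), -inf elsewhere."""
    bias = torch.full((t, k * t), float("-inf"))
    for i in range(t):
        bias[i, k * i : k * (i + 1)] = 0.0
    return bias

def biased_cross_attention(q, k_a, v_a, bias):
    """Output is a weighted sum of the audio value features V^A."""
    d = q.size(-1)
    scores = q @ k_a.transpose(-2, -1) / d ** 0.5 + bias
    return F.softmax(scores, dim=-1) @ v_a

# Usage sketch with assumed sizes: t face time steps, kt audio frames, model dim d.
t, k, d = 10, 2, 64
q = torch.randn(t, d)          # second query feature from the motion features
k_a = torch.randn(k * t, d)    # second key feature from A^{kT}
v_a = torch.randn(k * t, d)    # second value feature from A^{kT}
out = biased_cross_attention(q, k_a, v_a, alignment_bias(t, k))   # (t, d)
```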
4. Feedforward layer
The feedforward layer in the decoder is similar in structure and function to the feedforward layer in the encoder and will not be described again here.
In some possible implementations, the feed-forward layer inputs its result to a Motion Decoder, and the Motion Decoder projects the d-dimensional hidden state into the v-dimensional three-dimensional vertex space, finally obtaining the face predicted image sequence.
In the embodiment of the disclosure, the prediction model can encode long audio and predicts the face image sequence autoregressively, and two biased attention mechanisms are involved: a biased cross-modal multi-head attention mechanism with a periodic position encoding strategy, and a biased causal multi-head self-attention mechanism. The former effectively aligns the audio and motion modalities, while the latter provides the ability to generalize to long audio sequences. On this basis, the prediction model improves the display quality in terms of lip synchronization and facial motion.
Fig. 9 is a flowchart of a predictive model training method according to an embodiment of the present disclosure. Referring to fig. 9, the method includes:
step S91, acquiring a training data set.
In some possible implementations, open-source 3D datasets are used as the training dataset, for example, the BIWI dataset (Biwi Kinect Head Pose Database, a head pose image dataset) and VOCASET (i.e., the VOCA dataset).
The BIWI dataset is a corpus of emotional speech and corresponding dense dynamic three-dimensional face geometry. 14 subjects were asked to read 40 English sentences, each of which was recorded twice, once in a neutral and once in an emotional context. The 3D face geometry was captured at 25 fps, with each face mesh containing 23370 vertices. VOCASET consists of 480 facial motion sequences from 12 subjects; each sequence was captured at 60 fps and is between 3 and 4 seconds in length, and each 3D face mesh has 5023 vertices.
In one possible implementation, the training data set includes one training set, one validation set, and two test sets. 6 subjects are selected, each speaking 32 sentences (192 sentences in total), as the training set; for the same 6 subjects, each speaks another 4 sentences (24 sentences in total), which serve as the validation set; 6 unseen subjects, each speaking 4 sentences (24 sentences in total), form the first test set; and 8 unseen subjects, each speaking 4 sentences (32 sentences in total), form the second test set.
Step S92, configuring a training environment.
Taking a Linux environment as an example, Ubuntu (operating system), Python (programming language), PyTorch (an open-source Python machine learning library) and the dependency packages need to be prepared.
For example, Ubuntu 18.04.1, Python 3.7 and PyTorch 1.9.0 in the Linux environment;
the dependency package requirements are: numpy; scipy==1.7.1; librosa==0.8.1; tqdm; pickle; torch==1.9.0; torchvision==0.10.0; torchaudio==0.9.0; transformers==4.6.1; trimesh==3.9.27; pyrender==0.1.45; opencv-python; ffmpeg.
Step S93, training is carried out by using the training data set, and a training result is obtained.
In one possible implementation, the training data set may be preprocessed first, and then the training data set is input into a prediction model to be trained for training, so as to obtain a training result.
And step S94, stopping training to obtain a trained prediction model under the condition that the training result meets the preset stopping condition.
In one possible implementation, the model training adopts an autoregressive scheme, i.e., the model parameters are iteratively adjusted by minimizing the mean square error between the output sequence Ŷ_t of the decoder and the ground-truth sequence Y_t. When the mean square error is smaller than a preset error threshold, training is stopped and the trained prediction model is obtained.
The mean square error can be expressed using the following formula:

L_MSE = Σ_{t=1}^{T} Σ_{v=1}^{V} ‖ŷ_{t,v} - y_{t,v}‖²

where L_MSE represents the mean square error, v is the vertex index of the three-dimensional face mesh, V is the total number of vertices, ŷ_{t,v} represents the predicted value corresponding to the v-th face mesh vertex at the t-th time step, and y_{t,v} represents the true value corresponding to the v-th face mesh vertex at the t-th time step.
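As a minimal sketch of this training loss in PyTorch, the snippet below assumes that the predicted and ground-truth sequences are tensors of shape (T, V, 3), i.e., T time steps, V mesh vertices and three coordinates per vertex; the shapes and the function name are assumptions.

```python
import torch

def vertex_mse_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L_MSE: sum over time steps t and vertices v of ||y_hat_{t,v} - y_{t,v}||^2."""
    return ((pred - target) ** 2).sum()

# Usage sketch: stop training once the loss falls below a preset error threshold.
pred = torch.randn(13, 23370, 3)      # e.g. BIWI meshes with 23370 vertices
target = torch.randn(13, 23370, 3)
loss = vertex_mse_loss(pred, target)
```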
In one possible implementation, the preset stop condition may also be a condition on the number of training iterations: a threshold on the number of training iterations is preset, and training is stopped when the number of training iterations exceeds the preset threshold.
It should be noted that the foregoing description is merely illustrative of the preset stop condition, and the embodiment of the present disclosure does not limit the preset stop condition.
It should be noted that the data processing of the prediction model during training is similar to the face prediction method provided in this embodiment and is not described repeatedly here. Moreover, this training mode allows self-supervised pre-trained speech representations to be integrated, thereby alleviating the problem of data sparsity.
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from the principle logic; for brevity, such combinations are not described in detail in the present disclosure. It will also be appreciated by those skilled in the art that, in the above methods of the embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure further provides a face image prediction device, an electronic device, and a computer readable storage medium, all of which may be used to implement any of the face image prediction methods provided in the present disclosure; for the corresponding technical solutions and descriptions, reference is made to the corresponding descriptions of the method parts, which are not repeated here.
Fig. 10 is a block diagram of a face image prediction apparatus according to an embodiment of the present disclosure. Referring to fig. 10, the apparatus includes:
the first processing unit 101 is configured to process preset audio data through an encoder to obtain a voice feature.
The second processing unit 102 is configured to perform autoregressive prediction on the voice features and the initial face image information through a decoder, so as to obtain a face predicted image sequence matched with the audio data.
The initial face image information at least comprises one initial face image, the face image prediction sequence comprises a plurality of face prediction images, the encoder comprises at least one feature extraction module, and the decoder comprises at least one attention processing layer set based on a preset attention mechanism.
Fig. 11 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Referring to fig. 11, an embodiment of the present disclosure provides an electronic device including: at least one processor 1101; at least one memory 1102, and one or more I/O interfaces 1103 connected between the processor 1101 and the memory 1102; the memory 1102 stores one or more computer programs executable by the at least one processor 1101, and the one or more computer programs are executed by the at least one processor 1101 to enable the at least one processor 1101 to perform the face image prediction method described above.
The disclosed embodiments also provide a computer readable storage medium having a computer program stored thereon, wherein the computer program when executed by a processor/processing core implements the above-described face image prediction method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when executed in a processor of an electronic device, performs the above-described face image prediction method.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer-readable storage media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable program instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, random Access Memory (RAM), read Only Memory (ROM), erasable Programmable Read Only Memory (EPROM), static Random Access Memory (SRAM), flash memory or other memory technology, portable compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable program instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and may include any information delivery media.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, c++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
The computer program product described herein may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purpose of limitation. In some instances, it will be apparent to one skilled in the art that features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with other embodiments unless explicitly stated otherwise. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (18)

1. A face image prediction method, comprising:
processing preset audio data through an encoder to obtain voice characteristics;
performing autoregressive prediction on the voice characteristics and the initial face image information through a decoder to obtain a face predicted image sequence matched with the audio data;
the initial face image information at least comprises an initial face image, the face image prediction sequence comprises a plurality of face prediction images, the encoder comprises at least one feature extraction module, and the decoder comprises at least one attention processing layer set based on a preset attention mechanism.
2. The face image prediction method according to claim 1, wherein a predetermined time step is provided between adjacent ones of the face predicted images, the encoder and the decoder perform processing based on the time step,
the processing, by the encoder, the preset audio data to obtain the voice feature includes:
aiming at the 1 st time step, processing audio data corresponding to the 1 st time step through the encoder to obtain voice characteristics corresponding to the 1 st time step;
aiming at the t-th time step, processing audio data corresponding to the t-th time step through the encoder to obtain voice characteristics corresponding to the t-th time step, wherein t is more than 1 and less than or equal to T, t is an integer, and T is the total number of the time steps;
the method for obtaining the face prediction image sequence matched with the audio data comprises the following steps of:
aiming at the 1 st time step, processing the voice feature corresponding to the 1 st time step and the initial face image through the decoder to obtain a 1 st face predicted image;
Aiming at the t-th time step, processing the voice features corresponding to the 1 st time step to the t-th time step and the 1 st face predicted image to the t-1 st predicted face image through the decoder to obtain a t-th face predicted image;
and under the condition that the T-th face predicted image is obtained, arranging the 1 st to T-th face predicted images according to a time sequence to obtain the face predicted image sequence.
3. The face image prediction method according to claim 1 or 2, wherein the encoder includes a first feature extraction module and a second feature extraction module;
the processing, by the encoder, the preset audio data to obtain the voice feature includes:
performing feature extraction on the audio data based on the first feature extraction module to obtain intermediate voice features;
and carrying out context association processing on the intermediate voice features based on the second feature extraction module to obtain the voice features.
4. A face image prediction method according to claim 3, wherein the first feature extraction module comprises at least one temporal convolution layer and an interpolation processing layer;
the feature extraction of the audio data based on the first feature extraction module to obtain an intermediate voice feature includes:
Performing a time convolution operation on the audio data based on at least one of the time convolution layers to obtain an initial speech feature having a first frequency;
and carrying out interpolation processing on the initial voice characteristic based on the interpolation processing layer to obtain an intermediate voice characteristic with a second frequency, wherein the second frequency is determined according to the first frequency and a third frequency corresponding to the face predicted image sequence.
5. The method of claim 4, wherein the second feature extraction module comprises a multi-headed self-attention processing layer and at least one feed-forward layer;
the performing context correlation processing on the intermediate voice feature based on the second feature extraction module to obtain the voice feature includes:
processing the intermediate speech features from a plurality of relevancy dimensions of speech by the multi-headed self-attention processing layer to obtain a context-based speech representation;
outputting the context-based voice through the feedforward layer to obtain the voice characteristics;
the multi-head self-attention processing layer is connected with at least one feedforward layer, and the feedforward layer has a network layer connection function.
6. The face image prediction method of claim 1, wherein the preset attention mechanism includes a biased causal attention mechanism and a biased cross attention mechanism, the attention processing layer includes a first attention processing layer corresponding to the biased causal attention mechanism, and a second attention processing layer corresponding to the biased cross attention mechanism.
7. The face image prediction method according to claim 2, wherein the decoder includes a periodic position coding module;
the processing, by the decoder, the speech features corresponding to the 1 st time step to the t time step and the 1 st face predicted image to the t-1 st predicted face image to obtain the t-th face predicted image includes:
encoding the 1 st human face predicted image to the t-1 st predicted human face image through the periodic position encoding module to obtain a first intermediate motion characteristic;
based on the attention processing layer, performing cross-mode attention alignment processing on the first intermediate motion feature and the voice features corresponding to the 1 st time step to the t time step to obtain a t face prediction feature;
And obtaining the t-th face predicted image according to the t-th face predicted features.
8. The face image prediction method of claim 7, wherein the speech features are in the form of vectors;
the periodic position coding module adopts a sinusoidal position coding method, and the sinusoidal position coding method injects position information in a preset period to determine the position of the generated code in the corresponding vector of the voice characteristic.
9. The face image prediction method of claim 7, wherein the attention processing layer comprises a first attention processing layer and a second attention processing layer, and the first attention processing layer is connected in series with the second attention processing layer;
based on the attention processing layer, performing cross-mode attention alignment processing on the first intermediate motion feature and the voice feature corresponding to the 1 st time step to the t time step to obtain a t face prediction feature, including:
obtaining a first query feature, a first key feature and a first value feature according to the first intermediate motion feature;
calculating a weighted context by the dot product attention mode which is scaled down according to the first query feature, the first key feature, the first value feature and the preset first deviation feature through the first attention processing layer, and obtaining a second intermediate motion feature;
And performing frequency alignment on the second intermediate motion feature and the corresponding voice feature through the second attention processing layer to obtain the t-th face prediction feature, wherein the voice feature corresponding to the second intermediate motion feature comprises voice features corresponding to the 1 st time step to the t-th time step.
10. The method according to claim 9, wherein the obtaining the t-th face prediction feature by frequency-aligning the second intermediate motion feature and the corresponding speech feature by the second attention processing layer includes:
obtaining a second key feature and a second value feature according to the voice feature corresponding to the second intermediate motion feature;
obtaining a second query feature according to the second intermediate motion feature;
and processing the second key feature, the second value feature, the second query feature and the preset second deviation feature through the second attention processing layer to obtain the t-th face prediction feature.
11. The face image prediction method of claim 1, wherein the initial face image information further includes individual style information for characterizing a speaking style of a human body.
12. The method according to claim 1, wherein before the processing the preset audio data by the encoder to obtain the voice feature, further comprises:
and obtaining audio data corresponding to the text data according to preset text data.
13. The face image prediction method according to claim 1, wherein the face image prediction method is applied to perform at least one of a broadcasting task, a virtual interaction task, and an animation task.
14. A predictive model for predicting a sequence of face predicted images matching audio data based on the audio data, the model comprising: at least one encoder and at least one decoder;
wherein the encoder employs an encoder as claimed in any one of claims 1-13;
the decoder employing a decoder as claimed in any of claims 1-13.
15. A method of model training, comprising:
acquiring a training data set, wherein the training data set comprises a plurality of training audio data and a face training image sequence corresponding to the training audio data;
training an initial prediction model by using the training data set to obtain a training result;
Under the condition that the training result meets the preset stopping condition, a trained prediction model is obtained;
the prediction model is used for predicting a face predicted image sequence matched with the audio data according to preset audio data.
16. A face image prediction apparatus, comprising:
the first processing unit is used for processing preset audio data through the encoder to obtain voice characteristics;
the second processing unit carries out autoregressive prediction on the voice characteristics and the initial face image information through a decoder to obtain a face predicted image sequence matched with the audio data;
the initial face image information at least comprises an initial face image, the face image prediction sequence comprises a plurality of face prediction images, the encoder comprises at least one feature extraction module, and the decoder comprises at least one attention processing layer set based on a preset attention mechanism.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores one or more computer programs executable by the at least one processor, the one or more computer programs being executable by the at least one processor to enable the at least one processor to perform the face image prediction method of any one of claims 1-13.
18. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a face image prediction method according to any of claims 1-13.
CN202210826331.5A 2022-07-13 2022-07-13 Face image prediction method, model, device, equipment and medium Pending CN116188634A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210826331.5A CN116188634A (en) 2022-07-13 2022-07-13 Face image prediction method, model, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210826331.5A CN116188634A (en) 2022-07-13 2022-07-13 Face image prediction method, model, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN116188634A true CN116188634A (en) 2023-05-30

Family

ID=86440930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210826331.5A Pending CN116188634A (en) 2022-07-13 2022-07-13 Face image prediction method, model, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116188634A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116839900A (en) * 2023-07-06 2023-10-03 昌辉汽车转向系统(黄山)有限公司 Fault diagnosis method of time sequence convolution network based on causal attention
CN116839900B (en) * 2023-07-06 2024-01-30 昌辉汽车转向系统(黄山)有限公司 Fault diagnosis method of time sequence convolution network based on causal attention

Similar Documents

Publication Publication Date Title
JP7210774B2 (en) AVATOR BEHAVIOR CONTROL METHOD, DEVICE AND COMPUTER PROGRAM BASED ON TEXT
TWI778477B (en) Interaction methods, apparatuses thereof, electronic devices and computer readable storage media
Ofli et al. Learn2dance: Learning statistical music-to-dance mappings for choreography synthesis
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
Lee et al. Sound-guided semantic image manipulation
CN114357135A (en) Interaction method, interaction device, electronic equipment and storage medium
GB2601162A (en) Methods and systems for video translation
CN114144790A (en) Personalized speech-to-video with three-dimensional skeletal regularization and representative body gestures
CN112184859B (en) End-to-end virtual object animation generation method and device, storage medium and terminal
Wang et al. Computer-assisted audiovisual language learning
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
CN111382257A (en) Method and system for generating dialog context
JP2021192119A (en) Method for registering attribute of voice synthesis model, device, electronic apparatus, storage medium and computer program
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN112819933A (en) Data processing method and device, electronic equipment and storage medium
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN116188634A (en) Face image prediction method, model, device, equipment and medium
CN116189279A (en) Method, device and storage medium for determining hand motion of virtual person
CN115810341A (en) Audio synthesis method, apparatus, device and medium
Ma et al. M3D-GAN: Multi-modal multi-domain translation with universal attention
CN117373455B (en) Audio and video generation method, device, equipment and storage medium
CN117857892B (en) Data processing method, device, electronic equipment, computer program product and computer readable storage medium based on artificial intelligence
LingyunYu et al. Multimodal Inputs Driven Talking Face Generation With Spatial–Temporal Dependency
Zhang et al. Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model
CN116708951B (en) Video generation method and device based on neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination