CN112785667A - Video generation method, device, medium and electronic equipment - Google Patents


Info

Publication number
CN112785667A
Authority
CN
China
Prior art keywords
user
video
information
question
response
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110098921.6A
Other languages
Chinese (zh)
Inventor
殷翔 (Yin Xiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110098921.6A priority Critical patent/CN112785667A/en
Publication of CN112785667A publication Critical patent/CN112785667A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 - Feedback of the input speech

Abstract

The present disclosure relates to a video generation method, apparatus, medium, and electronic device. The method includes: acquiring a user question video, where the user question video includes face image information of a user and question audio; determining response text information corresponding to the question audio; and generating, according to the user question video and the response text information, a response video for responding to the user question in the user question video, where the response video includes an avatar that interacts with the user and audio corresponding to the response text information. Because the user question video carries emotion-related information such as the user's expression, tone, and intonation when asking the question, the emotional expression of the avatar currently interacting with the user can be controlled according to this information, which makes human-computer interaction more engaging and achieves more natural, human-like interaction.

Description

Video generation method, device, medium and electronic equipment
Technical Field
The present disclosure relates to the field of human-computer interaction technologies, and in particular, to a video generation method, apparatus, medium, and electronic device.
Background
Human-computer interaction (HCI) technology refers to technology that enables effective interaction between people and computers through computer input and output devices. It covers both the machine providing people with relevant information and prompts through output or display devices, and people providing relevant information to the machine, answering questions, and responding to prompts through input devices.
At present, human-computer interaction is no longer limited to keyboard input and controller operation; newer modes such as touch operation, voice interaction, and video interaction can also convey information and complete a 'conversation' between a person and a machine. At this stage, when a user interacts with a machine by video, the expression of the avatar in the machine's response video is generally neutral, i.e., it reflects no emotion, which makes the interaction feel dull and uninteresting to the user.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a video generation method, including:
acquiring a user question video, wherein the user question video comprises face image information and question audio of a user;
determining response text information corresponding to the question audio;
and generating a response video for responding to the user question in the user question video according to the user question video and the response text information, wherein the response video comprises an avatar that interacts with the user and audio corresponding to the response text information.
In a second aspect, the present disclosure provides a video generating apparatus comprising:
an acquisition module, configured to acquire a user question video, wherein the user question video comprises face image information of a user and question audio;
a determining module, configured to determine response text information corresponding to the question audio acquired by the acquisition module;
and a generating module, configured to generate, according to the user question video acquired by the acquisition module and the response text information determined by the determining module, a response video for responding to the user question in the user question video, wherein the response video comprises an avatar that interacts with the user and audio corresponding to the response text information.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method provided by the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method provided by the first aspect of the present disclosure.
In the above technical solution, after a user question video including face image information of a user and question audio is acquired, response text information corresponding to the question audio is first determined; then, a response video for responding to the user question in the user question video is generated according to the user question video and the response text information. Because the user question video carries emotion-related information such as the user's expression, tone, and intonation when asking the question, the emotional expression of the avatar currently interacting with the user can be controlled according to this information, which makes human-computer interaction more engaging and achieves more natural, human-like interaction.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic diagram illustrating one implementation environment in accordance with an example embodiment.
Fig. 2 is a flow diagram illustrating a video generation method according to an example embodiment.
Fig. 3 is a flow chart illustrating a method of generating a response video from a user query video and response text information in accordance with an exemplary embodiment.
FIG. 4A is a block diagram illustrating a predictive model in accordance with an exemplary embodiment.
FIG. 4B is a block diagram illustrating a predictive model according to an exemplary embodiment.
FIG. 5 is a flow diagram illustrating a predictive model training method in accordance with an exemplary embodiment.
Fig. 6 is a flowchart illustrating a method of generating a response video according to a feature of a face region of an avatar and a first acoustic feature, according to an exemplary embodiment.
Fig. 7 is a flowchart illustrating a method of generating a response video according to a feature of a face region of an avatar and a first acoustic feature, according to another exemplary embodiment.
Fig. 8 is a flowchart illustrating a method of generating a response video from a user query video and response text information according to another exemplary embodiment.
Fig. 9 is a block diagram illustrating a video generation apparatus according to an example embodiment.
FIG. 10 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that the modifiers "a", "an", and "the" in the present disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
An environment for implementing the present disclosure is first described. FIG. 1 is a schematic diagram illustrating one implementation environment in accordance with an example embodiment. In the application scenario of fig. 1, the terminal device 20 first acquires the question video of the user 10, and then, based on that question video, the terminal device 20 generates a response video for responding to the question of the user 10 in the question video (as shown in fig. 1, the question of the user 10 is "how is the weather today").
In addition, in the present disclosure, the terminal device 20 may be, for example, a smart phone, a tablet computer, a notebook computer, a smart wearable device (e.g., a smart band), and the like. In fig. 1, the terminal device 20 is illustrated as a smartphone.
Fig. 2 is a flow chart illustrating a video generation method according to an exemplary embodiment, wherein the method may be applied to a terminal device, such as the smartphone shown in fig. 1. As shown in fig. 2, the method includes S201 to S203.
In S201, a user question video is acquired, where the user question video includes face image information of a user and a question audio.
In the present disclosure, the face image information of the user includes the sequence of images in the user question video.
In S202, the response text information corresponding to the question audio is determined.
In the present disclosure, the user question may be extracted by performing speech recognition on the question audio, and then the response text information corresponding to the user question is determined according to questions and response text information stored in advance in association with each other.
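As a minimal illustration of this step (and not part of the disclosed method itself), the sketch below uses a stubbed speech recognizer and an illustrative in-memory question/answer table; both the stub and the table contents are assumptions.

    # Sketch of S202: recognize the question audio, then look up pre-stored
    # response text associated with the recognized question.
    QA_TABLE = {
        "how is the weather today": "It is sunny today, with a high of 25 degrees.",
        "what time is it": "It is three o'clock in the afternoon.",
    }

    def speech_to_text(question_audio) -> str:
        # Placeholder for a real speech recognition engine; returns a fixed
        # transcript so the sketch runs end to end.
        return "How is the weather today?"

    def determine_response_text(question_audio) -> str:
        question = speech_to_text(question_audio).strip().lower().rstrip("?")
        # Fall back to a default reply when no pre-stored answer matches.
        return QA_TABLE.get(question, "Sorry, I do not know the answer to that yet.")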
In S203, a response video for responding to the user question in the user question video is generated according to the user question video and the response text information.
In the present disclosure, the response video includes an avatar that interacts with the user and audio corresponding to the response text information.
In the above technical solution, after a user question video including face image information of a user and question audio is acquired, response text information corresponding to the question audio is first determined; then, a response video for responding to the user question in the user question video is generated according to the user question video and the response text information. Because the user question video carries emotion-related information such as the user's expression, tone, and intonation when asking the question, the emotional expression of the avatar currently interacting with the user can be controlled according to this information, which makes human-computer interaction more engaging and achieves more natural, human-like interaction.
A detailed description will be given below of a specific embodiment of generating a response video for responding to a user question in the user question video, based on the user question video and the response text information in S203. Specifically, it can be realized by S301 and S302 shown in fig. 3.
In S301, the feature of the face region of the avatar and the first acoustic feature of the response text information are predicted from the user question video and the response text information.
In S302, a response video is generated based on the feature of the face region of the avatar and the first acoustic feature.
In the present disclosure, the feature of the face region of the avatar may include face keypoint information of the avatar. Wherein the face key point information of the avatar may include at least one of mouth key point information of the avatar, eyebrow key point information of the avatar, eye key point information of the avatar, nose key point information of the avatar, and mandible key point information of the avatar. The first acoustic feature may be, for example, mel-frequency spectral information, fundamental frequency information, linear spectral information, etc.
The following is a detailed description of a specific embodiment of predicting the feature of the face region of the avatar and the first acoustic feature of the response text information from the user question video and the response text information in S301 described above. Specifically, the above S301 may include the steps of:
(1) Extract a second acoustic feature of the question audio.
In the present disclosure, the second acoustic feature may be, for example, mel-frequency spectral information, fundamental frequency information, linear spectral information, or the like.
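For illustration only, the sketch below extracts a log-Mel spectrogram with librosa as one possible form of the second acoustic feature; the file name and frame parameters are assumptions.

    import librosa

    # Sketch: a log-Mel spectrogram as the second acoustic feature of the
    # question audio (Mel spectrum is one of the options named above).
    def extract_second_acoustic_feature(path="question.wav"):
        waveform, sr = librosa.load(path, sr=16000)
        mel = librosa.feature.melspectrogram(
            y=waveform, sr=sr, n_fft=1024, hop_length=256, n_mels=80
        )
        return librosa.power_to_db(mel)  # shape (80, num_frames)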
(2) Determine the feature of the face region of the user according to the face image information of the user.
In the present disclosure, the feature of the face region of the user includes face key point information of the user, wherein the face key point information of the user may include mouth key point information of the user, eyebrow key point information of the user, eye key point information of the user, nose key point information of the user, and chin key point information of the user.
(3) Extract the speech feature information of the response text information.
In the present disclosure, the speech feature information may include phonemes, tones, word segmentation, and prosodic boundaries. A phoneme is the smallest unit of speech divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, with one action forming one phoneme. Phonemes fall into two major categories, vowels and consonants. For Chinese, for example, the phonemes include initials (the consonant preceding the final, which forms a complete syllable together with the final) and finals (i.e., the vowels). Tone refers to the rise and fall in pitch of a sound. Illustratively, Chinese has four tones: the first tone (yin ping), the second tone (yang ping), the third tone (shang sheng), and the fourth tone (qu sheng). Prosodic boundaries are used to indicate where pauses should be made when reading text aloud. Illustratively, the prosodic boundaries are divided into four pause levels, "#1", "#2", "#3", and "#4", whose degrees of pause increase in that order.
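To make these four kinds of speech feature information concrete, the snippet below shows one possible annotation for a short made-up response sentence ("The weather is nice today"); the sentence and all values are illustrative only.

    # Illustrative annotation of a short response text.
    speech_features = {
        "text": "今天天气很好",
        "phonemes": ["j", "in", "t", "ian", "t", "ian", "q", "i", "h", "en", "h", "ao"],
        "tones": [1, 1, 1, 4, 3, 3],                       # one tone per syllable
        "segmentation": ["今天", "天气", "很", "好"],         # word segmentation
        "prosodic_boundaries": ["今天#1", "天气#1", "很好#3"],  # pause levels
    }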
(4) Predict the feature of the face region of the avatar and the first acoustic feature according to the second acoustic feature, the feature of the face region of the user, and the speech feature information.
A detailed description will be given below of a specific embodiment of determining the feature of the face region of the user based on the face image information of the user in step (2) above.
Specifically, the feature of the face region of the user may be determined in a variety of ways. In one embodiment, for each key point in the face region of the user, that is, each of the mouth key points, eyebrow key points, eye key points, nose key points, and mandible key points, the key point may be averaged over all images in the image sequence of the user question video to obtain the corresponding key point information of the user. The key point information obtained after this averaging is then determined as the feature of the face region of the user. The key points in the face region of the user in each image of the image sequence of the user question video can be obtained by a face key point detection method such as an Active Shape Model (ASM), an Active Appearance Model (AAM), Cascaded Pose Regression (CPR), or a deep-learning-based method.
Illustratively, for the mouth key points, the mouth key points in all images of the image sequence in the user question video are averaged to obtain the mouth key point information of the user; for the eyebrow key points, the eyebrow key points in all images of the image sequence are averaged to obtain the eyebrow key point information of the user; for the eye key points, the eye key points in all images of the image sequence are averaged to obtain the eye key point information of the user; for the nose key points, the nose key points in all images of the image sequence are averaged to obtain the nose key point information of the user; and for the mandible key points, the mandible key points in all images of the image sequence are averaged to obtain the mandible key point information of the user. Finally, the mouth key point information, eyebrow key point information, eye key point information, nose key point information, and mandible key point information may be used as the feature of the user's face region.
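A minimal sketch of the averaging scheme just described, assuming the per-frame key points have already been detected (for example with ASM, AAM, CPR, or a deep-learning detector):

    import numpy as np

    def average_face_keypoints(per_frame_keypoints: np.ndarray) -> np.ndarray:
        # per_frame_keypoints: shape (num_frames, num_keypoints, 2), pixel (x, y)
        # coordinates for every image in the image sequence of the question video.
        # Averaging each key point over all frames yields the feature of the
        # user's face region.
        return per_frame_keypoints.mean(axis=0)  # shape (num_keypoints, 2)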
In another embodiment, the face image information of the user (i.e. the image sequence in the user question video) may be input into a Recurrent Neural Network (RNN), and the final hidden state of the hidden layer of the RNN may be used as the feature of the face region of the user. Therefore, the features of the face region of the user can be directly acquired through the RNN, and the method is convenient and quick.
A specific embodiment of extracting the speech feature information of the response text information in step (3) will be described in detail below.
Specifically, the voice feature information may be obtained in a plurality of ways, and in one embodiment, the voice feature information of the response text information may be labeled in advance by a user and stored in a corresponding storage module, so that the voice feature information of the response text information may be obtained by accessing the storage module.
In another embodiment, the response text information may be input into an information extraction model to obtain the speech feature information of the response text information, which is convenient and fast, requires no manual involvement, and saves labor.
In the present disclosure, the information extraction model may include a text normalization (TN) model, a grapheme-to-phoneme (G2P) model, a word segmentation model, and a prosody model. Numbers, symbols, abbreviations, and the like in the response text information can be converted into written-out words through the TN model; the phonemes in the response text information are obtained through the G2P model; the response text information is segmented through the word segmentation model; and the prosodic boundaries and intonation of the response text information are obtained through the prosody model.
For example, the G2P model may employ Recurrent Neural Networks (RNNs) and Long-Short Term Memory networks (LSTM) to achieve the conversion from grapheme to phoneme.
The word segmentation model can be an n-gram model, a hidden Markov model, a naive Bayes classification model, etc.
The prosody model may be, for example, a pre-trained language model such as BERT (Bidirectional Encoder Representations from Transformers), a bidirectional LSTM-CRF (Conditional Random Field) model, or the like.
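As an illustration only, the sketch below uses the open-source jieba segmenter and pypinyin converter as stand-ins for the word segmentation and G2P models named above, with a trivial placeholder in place of a trained prosody model; none of these choices is prescribed by the disclosure.

    import jieba                              # word segmentation stand-in
    from pypinyin import lazy_pinyin, Style   # grapheme-to-phoneme stand-in

    def extract_speech_features(response_text: str) -> dict:
        words = list(jieba.cut(response_text))                     # word segmentation
        syllables = lazy_pinyin(response_text, style=Style.TONE3)  # pinyin with tone digit
        tones = [int(s[-1]) if s and s[-1].isdigit() else 5 for s in syllables]
        # Placeholder prosody: a trained model (e.g. BERT or BiLSTM-CRF) would
        # predict pause levels; here every word boundary is simply marked "#1".
        boundaries = [w + "#1" for w in words]
        return {"words": words, "phonemes": syllables, "tones": tones,
                "prosodic_boundaries": boundaries}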
In the above embodiment, by extracting the speech feature information of the response text information, such as the phonemes, tones, word segmentation, and prosodic boundaries, and generating the response video based on this speech feature information, more attention can be paid to the text content of the response text information. In this way, the audio corresponding to the response text information can be paused according to the text content and word segmentation of the response text information, which improves the accuracy and intelligibility of the audio in the response video and allows the user to understand the text content corresponding to the audio quickly and easily. In addition, pauses can be made at natural prosodic boundaries, which enhances the naturalness and fluency of the audio in the response video.
The following is a detailed description of a specific embodiment of predicting the feature of the face region of the avatar and the first acoustic feature based on the second acoustic feature, the feature of the face region of the user, and the speech feature information in step (4) above.
Specifically, the second acoustic feature, the feature of the face region of the user, and the speech feature information may be input into the prediction model, resulting in the feature of the face region of the avatar and the first acoustic feature. As shown in fig. 4A, the prediction model may include an encoding network, an attention network, a decoding network, and a post-processing network (postnet). The coding network is used for generating a representation sequence according to the second acoustic feature, the feature of the face area of the user and the voice feature information; the attention network is used for generating a semantic representation with a fixed length according to the representation sequence, and the decoding network is used for generating a first acoustic feature according to the semantic representation; the post-processing network is configured to generate features of a face region of the avatar based on the first acoustic features.
Illustratively, the coding network may be, for example, a CBHG model (Convolution Bank + Highway network + bidirectional Gated Recurrent Unit, i.e., a stack of convolutional layers, a highway network, and a bidirectional recurrent neural network); the attention network may be a location-sensitive attention network or a GMM attention network, i.e., an attention network based on a Gaussian Mixture Model (GMM). Preferably, the attention network is GMM attention, which can further improve the stability of the audio in the response video and avoid phenomena such as missing or repeated initials and finals.
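The sketch below mirrors the encoding network / attention network / decoding network / post-processing network layout of fig. 4A in PyTorch. It is a structural sketch only: a plain bidirectional GRU and standard multi-head attention stand in for the CBHG encoder and the location-sensitive/GMM attention described above, and all layer dimensions are assumptions.

    import torch
    import torch.nn as nn

    class PredictionModel(nn.Module):
        def __init__(self, in_dim=256, hidden=256, acoustic_dim=80, keypoint_dim=136):
            super().__init__()
            # Encoding network: generates the representation sequence
            # (a bidirectional GRU stands in for CBHG here).
            self.encoder = nn.GRU(in_dim, hidden // 2, batch_first=True, bidirectional=True)
            # Attention network: generates the semantic representation
            # (multi-head attention stands in for location-sensitive/GMM attention).
            self.attention = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
            # Decoding network: generates the first acoustic feature (e.g. Mel frames).
            self.decoder = nn.GRU(hidden, hidden, batch_first=True)
            self.acoustic_proj = nn.Linear(hidden, acoustic_dim)
            # Post-processing network: maps the first acoustic feature to the
            # key point information of the avatar's face region.
            self.postnet = nn.Sequential(
                nn.Linear(acoustic_dim, hidden), nn.ReLU(), nn.Linear(hidden, keypoint_dim)
            )

        def forward(self, fused_inputs):
            # fused_inputs: (batch, time, in_dim), a concatenation of the second
            # acoustic feature, the user's face-region feature and the speech
            # feature information, projected to a common dimension beforehand.
            memory, _ = self.encoder(fused_inputs)
            context, _ = self.attention(memory, memory, memory)
            decoded, _ = self.decoder(context)
            acoustic = self.acoustic_proj(decoded)    # first acoustic feature
            keypoints = self.postnet(acoustic)        # avatar face key points
            return acoustic, keypoints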
The following describes the training method of the prediction model in detail. Specifically, the features of the face region of the user include face key point information of the user, the features of the face region of the avatar include face key point information of the avatar, and the prediction model may be trained through S501 to S505 shown in fig. 5.
In S501, a question and answer video is obtained, where the question and answer video includes a reference user question video and a reference answer video of an avatar, the reference user question video includes face image information of a reference user and a reference user question audio, and the reference answer video includes the avatar, an answer audio, and reference answer text information corresponding to the answer audio.
In S502, a reference acoustic feature of the reference user question audio and a labeled acoustic feature of the response audio are extracted, respectively.
In the present disclosure, the reference acoustic feature, the labeled acoustic feature may be, for example, mel-frequency spectral information, fundamental frequency information, linear spectral information, etc.
In S503, the labeling key point information of the face region of the avatar is determined, and the face key point information of the reference user is determined according to the face image information of the reference user.
In this disclosure, the face key point information of the reference user and the labeling key point information of the face region of the avatar may be determined in the same manner as the determination of the features of the face region of the user in the step (2), which is not described herein again.
In S504, reference speech feature information that refers to the response text information is extracted.
In this disclosure, the reference speech feature information may include phonemes, tones, word segments, and prosodic boundaries, and the reference speech feature information of the reference answer text information may be extracted in the same manner as the speech feature information of the answer text information extracted in step (3) above, which is not described herein again.
In S505, the prediction model is obtained by performing model training with the reference acoustic feature, the face key point information of the reference user, and the reference speech feature information as the input of the coding network; the output of the coding network as the input of the attention network; the output of the attention network as the input of the decoding network; the labeled acoustic feature as the target output of the decoding network; the output of the decoding network as the input of the post-processing network; and the labeled key point information as the target output of the post-processing network.
Specifically, as shown in fig. 4A, the reference acoustic feature, the face key point information of the reference user, and the reference voice feature information are input into the coding network to obtain a reference representation sequence; then, inputting the reference expression sequence into an attention network to obtain a fixed-length reference semantic representation; then, inputting the reference semantic representation into a decoding network to obtain a predicted acoustic feature; inputting the predicted acoustic features into a post-processing network to obtain predicted key point information; in this way, the model parameters of the coding network, the attention network, the decoding network and the post-processing network can be updated according to the comparison result of the predicted acoustic features and the labeled acoustic features and the comparison result of the predicted key point information and the labeled key point information, so that the prediction model is obtained.
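A sketch of one S505 training step, building on the PredictionModel sketched earlier; the optimizer choice, equal loss weighting, and tensor shapes are assumptions rather than requirements of the method.

    import torch
    import torch.nn.functional as F

    model = PredictionModel()                     # from the sketch above
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def training_step(fused_inputs, labeled_acoustic, labeled_keypoints):
        # fused_inputs: reference acoustic feature + reference user's face key
        # points + reference speech feature information, fused per frame.
        pred_acoustic, pred_keypoints = model(fused_inputs)
        # Compare predictions with the labeled acoustic features (decoder target)
        # and the labeled key point information (post-processing network target).
        loss = (F.mse_loss(pred_acoustic, labeled_acoustic)
                + F.mse_loss(pred_keypoints, labeled_keypoints))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()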
The following is a detailed description of a specific embodiment of generating the response video according to the feature of the face region of the avatar and the first acoustic feature in S302.
In one embodiment, the features of the face region of the avatar include the respective key point information of the avatar's face, i.e., mouth key point information, eyebrow key point information, eye key point information, nose key point information, and mandible key point information. In this case, the response video may be generated as follows:
First, a target image sequence is generated according to the respective face key point information of the avatar.
Illustratively, the target image sequence can be generated by a Pix2Pix framework from the face key point information of the avatar.
At the same time, audio corresponding to the response text information is generated according to the first acoustic feature.
In the present disclosure, the first acoustic feature may be input into a vocoder to obtain the audio corresponding to the response text information, wherein the vocoder may be, for example, a WaveNet vocoder, a Griffin-Lim vocoder, or the like.
Finally, the target image sequence and the audio corresponding to the response text information are synthesized to obtain the response video.
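For illustration, the sketch below assembles a response video from an image sequence and predicted Mel frames: a toy frame renderer stands in for the Pix2Pix generator, librosa's Griffin-Lim-based Mel inversion stands in for the vocoder, and ffmpeg muxes the two tracks. All names, file paths, and parameters are assumptions, and imageio's ffmpeg backend is assumed to be available.

    import subprocess
    import imageio
    import librosa
    import numpy as np
    import soundfile as sf

    def render_frames(keypoint_seq, size=256):
        # Toy stand-in for the Pix2Pix generator: draw each key point as a
        # white dot on a blank frame, one frame per key-point set.
        frames = []
        for pts in keypoint_seq:                       # pts: (num_keypoints, 2)
            frame = np.zeros((size, size, 3), dtype=np.uint8)
            for x, y in pts.astype(int):
                if 0 <= x < size and 0 <= y < size:
                    frame[y, x] = 255
            frames.append(frame)
        return frames

    def assemble_response_video(keypoint_seq, mel_db, fps=25, sr=16000):
        frames = render_frames(keypoint_seq)                        # target image sequence
        waveform = librosa.feature.inverse.mel_to_audio(            # Griffin-Lim style inversion
            librosa.db_to_power(mel_db), sr=sr, n_fft=1024, hop_length=256)
        sf.write("response.wav", waveform, sr)                      # response audio
        imageio.mimwrite("frames.mp4", frames, fps=fps)             # silent video track
        subprocess.run(["ffmpeg", "-y", "-i", "frames.mp4", "-i", "response.wav",
                        "-c:v", "copy", "-c:a", "aac", "response.mp4"], check=True)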
In another embodiment, the features of the face region of the avatar include mouth keypoint information of the avatar. At this time, the response video may be generated by S601 to S603 shown in fig. 6 as follows:
in S601, a target image sequence is generated based on the mouth key point information of the avatar and other key point information of the face of the avatar, which is stored in advance.
Illustratively, the target image sequence may be generated by a Pix2Pix framework based on mouth keypoint information of the avatar and pre-stored other keypoint information of the face of the avatar.
In S602, audio corresponding to the response text information is generated according to the first acoustic feature.
In S603, the target image sequence and the audio corresponding to the response text information are synthesized to obtain a response video.
In this embodiment, only the mouth key point information of the avatar is predicted, while the other face key point information of the avatar is pre-stored. This ensures prediction efficiency, speeds up generation of the response video, guarantees real-time responsiveness, and improves the user experience.
In addition, as shown in fig. 7, before the step S601, the step S302 may further include steps S604 and S605.
In S604, a target emotion category to be expressed by the avatar is determined.
In the present disclosure, the target emotion classification may be one of happy, angry, serious, neutral (i.e., without emotion), and the like.
In S605, the key point information corresponding to the target emotion category is determined as the target key point information of the face region of the avatar according to the correspondence between the preset emotion category and the key point information.
In the present disclosure, the target key point information includes eyebrow key point information and/or eye key point information. In this case, in S601, the target image sequence may be generated according to the mouth key point information of the avatar, the target key point information, and the pieces of the other pre-stored face key point information that are not covered by the target key point information. In this way, the actions of the eyes and/or eyebrows of the avatar in the response video, such as blinking or frowning, can be controlled according to the target emotion category of the avatar, which further improves the appeal of human-computer interaction.
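A minimal sketch of S604/S605 and the modified S601 input: look up eyebrow/eye key points for the target emotion category and merge them with the predicted mouth key points and the remaining pre-stored key points. The coordinate values in the lookup table are placeholders, not real landmark data.

    # Mapping from preset emotion category to eyebrow/eye key point information
    # (placeholder coordinates for illustration only).
    EMOTION_KEYPOINTS = {
        "happy":   {"eyebrow": [(30, 40), (60, 38)], "eye": [(35, 55), (65, 55)]},
        "angry":   {"eyebrow": [(30, 48), (60, 50)], "eye": [(35, 57), (65, 57)]},
        "serious": {"eyebrow": [(30, 45), (60, 45)], "eye": [(35, 56), (65, 56)]},
        "neutral": {"eyebrow": [(30, 44), (60, 44)], "eye": [(35, 56), (65, 56)]},
    }

    def build_avatar_keypoints(predicted_mouth, prestored_other, target_emotion):
        keypoints = dict(prestored_other)                    # e.g. nose, mandible, ...
        keypoints.update(EMOTION_KEYPOINTS[target_emotion])  # target key point information
        keypoints["mouth"] = predicted_mouth                 # predicted mouth key points
        return keypoints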
The following is a detailed description of a specific embodiment of determining the target emotion category to be expressed by the avatar in S604.
In one embodiment, the emotion type corresponding to the response text information may be determined as the target emotion type according to a preset correspondence between the response text information and the emotion type.
In another embodiment, the target emotion classification can be determined according to the response text information and the questioning audio. Therefore, the emotion to be expressed by the virtual image is not only related to the response text information, but also related to the emotion expressed by the voice when the user asks questions, so that the strong supervision and control of the emotion expression of the virtual image can be realized. In addition, the automatic extraction of emotion representation can be realized, and the method is convenient and fast.
Specifically, this can be achieved by the following steps 1) and 2):
1) Determine a first probability distribution corresponding to the response text information and a second probability distribution corresponding to the question audio, respectively.
In the present disclosure, the first probability distribution includes a probability that the emotion represented by the answer text information belongs to each of the preset emotion categories, and the second probability distribution includes a probability that the emotion represented by the question audio belongs to each of the preset emotion categories. The preset emotion classifications may include happy, angry, serious, neutral (i.e., no emotion), and the like.
For example, the first probability distribution corresponding to the response text information and the second probability distribution corresponding to the question audio may be predicted by a bidirectional Long Short-Term Memory network (BiLSTM). Because the BiLSTM model is a bidirectional recurrent neural network, it can learn the forward and backward dependency relationships in the response text information and the question audio, and thus predict the corresponding probability distributions accurately.
2) Determine the target emotion category according to the first probability distribution and the second probability distribution.
Specifically, a target probability corresponding to each preset emotion category may be determined, where the target probability is equal to the sum of the probability corresponding to the preset emotion category in the first probability distribution and the probability corresponding to the preset emotion category in the second probability distribution; if the maximum value of the target probabilities corresponding to the preset emotion categories is greater than a preset probability threshold (for example, 120%), determining the preset emotion category corresponding to the maximum value as the target emotion category; and if the maximum value is less than or equal to a preset probability threshold value, determining neutral as the target emotion category.
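A minimal sketch of this decision rule with illustrative four-category distributions (happy, angry, serious, neutral) and the 120% threshold mentioned above; the probability values in the usage example are made up.

    import numpy as np

    CATEGORIES = ["happy", "angry", "serious", "neutral"]

    def pick_target_emotion(first_dist, second_dist, threshold=1.2):
        # Target probability per category = probability in the first distribution
        # + probability in the second distribution; fall back to "neutral" if the
        # maximum does not exceed the threshold.
        target = np.asarray(first_dist) + np.asarray(second_dist)
        best = int(np.argmax(target))
        return CATEGORIES[best] if target[best] > threshold else "neutral"

    # Example: both the response text and the question audio lean "happy";
    # the combined probability 1.30 > 1.20, so "happy" is selected.
    print(pick_target_emotion([0.55, 0.25, 0.10, 0.10], [0.75, 0.10, 0.05, 0.10]))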
Illustratively, suppose the preset emotion categories include happiness, anger, seriousness, and neutrality. The probabilities that the emotion represented by the response text information belongs to each preset emotion category are 15%, 38%, 16%, 22%, and 9% respectively, i.e., the first probability distribution is the row vector [15% 38% 16% 22% 9%]; the probabilities that the emotion represented by the question audio belongs to each preset emotion category are 15%, 8%, 65%, 10%, and 2% respectively, i.e., the second probability distribution is the row vector [15% 8% 65% 10% 2%]; and the preset probability threshold is 120%. The target probabilities corresponding to the preset emotion categories are then 30%, 46%, 81%, 32%, and 11% respectively. The maximum of these target probabilities is 81%, which is smaller than the preset probability threshold of 120%, so "neutral" is determined as the target emotion category.
As a further example, suppose the preset emotion categories again include happiness, anger, seriousness, and neutrality. The probabilities that the emotion represented by the response text information belongs to each preset emotion category are 55%, 18%, 6%, 12%, and 9% respectively, i.e., the first probability distribution is the row vector [55% 18% 6% 12% 9%]; the probabilities that the emotion represented by the question audio belongs to each preset emotion category are 75%, 8%, 5%, 10%, and 2% respectively, i.e., the second probability distribution is the row vector [75% 8% 5% 10% 2%]; and the preset probability threshold is 120%. The target probabilities corresponding to the preset emotion categories are then 130%, 26%, 11%, 22%, and 11% respectively. The maximum of these target probabilities, 130%, is greater than the preset probability threshold of 120%, and the preset emotion category corresponding to this maximum is "happy", so "happy" is determined as the target emotion category.
Fig. 8 is a flowchart illustrating a method of generating a response video from a user query video and response text information according to another exemplary embodiment. As shown in fig. 8, before S301, the method further includes S303.
In S303, a target emotion category to be expressed by the avatar is determined.
Thus, S301 can predict the feature of the face region of the avatar and the first acoustic feature of the response text information according to the target emotion category, the user question video and the response text information; then, a response video is generated based on the feature of the face region of the avatar and the first acoustic feature.
In this implementation, the target emotion category to be expressed by the avatar currently interacting with the user is used as a basis for generating the response video, so that the emotional expression of the avatar can be controlled under strong supervision, further improving the appeal of human-computer interaction.
The following is a detailed description of the above-described specific embodiment of predicting the feature of the face region of the avatar and the first acoustic feature of the response text information according to the target emotion category, the user question video, and the response text information. Specifically, this can be achieved by:
(1) Extract the second acoustic feature of the question audio.
(2) Determine the feature of the face region of the user according to the face image information of the user.
(3) Extract the speech feature information of the response text information.
(4) Predict the feature of the face region of the avatar and the first acoustic feature according to the target emotion category, the second acoustic feature, the feature of the face region of the user, and the speech feature information.
Specifically, the target emotion category, the second acoustic feature, the feature of the face region of the user, and the speech feature information may be input into the prediction model to obtain the feature of the face region of the avatar and the first acoustic feature. The structure of the prediction model is shown in fig. 4B, and its specific structure has been described in detail in the part relating to step (4), which is not repeated here.
In the stage of training this prediction model, as shown in fig. 4B, the input of the prediction model additionally includes, relative to the prediction model shown in fig. 4A, the reference emotion category to be expressed by the avatar; the training process is similar to the process described in S501 to S505 and is not repeated here.
Fig. 9 is a block diagram illustrating a video generation apparatus according to an example embodiment. As shown in fig. 9, the apparatus 900 includes: an obtaining module 901, configured to obtain a user question video, where the user question video includes face image information of a user and a question audio; a determining module 902, configured to determine response text information corresponding to the question audio acquired by the acquiring module 901; a generating module 903, configured to generate, according to the user question video acquired by the acquiring module 901 and the response text information determined by the determining module 902, a response video for responding to the user question in the user question video, where the response video includes an avatar interacting with the user and an audio corresponding to the response text information.
In this disclosure, the determining module 902 may extract the user question by performing voice recognition on the question audio, and then determine the response text information corresponding to the user question according to the question and the response text information stored in advance in a correlated manner.
In the above technical solution, after a user question video including face image information of a user and question audio is acquired, response text information corresponding to the question audio is first determined; then, a response video for responding to the user question in the user question video is generated according to the user question video and the response text information. Because the user question video carries emotion-related information such as the user's expression, tone, and intonation when asking the question, the emotional expression of the avatar currently interacting with the user can be controlled according to this information, which makes human-computer interaction more engaging and achieves more natural, human-like interaction.
Optionally, the generating module 903 includes: the prediction sub-module is used for predicting the characteristics of the face area of the virtual image and the first acoustic characteristics of the response text information according to the user question video and the response text information; and the video generation submodule is used for generating the response video according to the characteristics of the face area of the virtual image and the first acoustic characteristics.
Optionally, the feature of the face region of the avatar comprises mouth keypoint information of the avatar; the video generation submodule comprises: the image sequence generation submodule is used for generating a target image sequence according to the mouth key point information of the virtual image and other pre-stored key point information of the face of the virtual image; the audio generation submodule is used for generating audio corresponding to the response text information according to the first acoustic characteristic; and the synthesis submodule is used for synthesizing the target image sequence and the audio corresponding to the response text information to obtain the response video.
Optionally, the video generation sub-module further includes: the first determining submodule is used for determining a target emotion category to be expressed by the virtual image before the image sequence generating submodule generates a target image sequence according to the mouth key point information of the virtual image and other pre-stored key point information of the face of the virtual image; the second determining submodule is used for determining key point information corresponding to the target emotion category as target key point information of the face area of the virtual image according to the corresponding relation between the preset emotion category and the key point information, wherein the target key point information comprises eyebrow key point information and/or eye key point information; and the image sequence generation submodule is used for generating a target image sequence according to the mouth key point information of the virtual image, the target key point information and the key point information except the target key point information in the other key point information of the face.
Optionally, the first determining submodule is configured to determine the target emotion category according to the response text information and the question audio.
Optionally, the first determining sub-module includes: a third determining submodule, configured to determine a first probability distribution corresponding to the answer text information and a second probability distribution corresponding to the question audio, respectively, where the first probability distribution includes a probability that an emotion represented by the answer text information belongs to each preset emotion category, and the second probability distribution includes a probability that an emotion represented by the question audio belongs to each preset emotion category; and the fourth determining submodule is used for determining the target emotion category according to the first probability distribution and the second probability distribution.
Optionally, the prediction sub-module comprises: a first extraction sub-module for extracting a second acoustic feature of the questioning audio; a fifth determining submodule, configured to determine, according to the face image information of the user, a feature of the face region of the user; the second extraction submodule is used for extracting the voice characteristic information of the response text information; a feature prediction sub-module for predicting a feature of the face region of the avatar and the first acoustic feature according to the second acoustic feature, the feature of the face region of the user, and the speech feature information.
Optionally, the feature prediction sub-module is configured to input the second acoustic feature, the feature of the face region of the user, and the speech feature information into a prediction model, so as to obtain the feature of the face region of the avatar and the first acoustic feature.
Optionally, the feature of the face region of the user includes face key point information of the user, the feature of the face region of the avatar includes face key point information of the avatar, and the prediction model includes an encoding network, an attention network, a decoding network, and a post-processing network; the prediction model is obtained through training of a prediction model training device. Wherein, the prediction model training device may include: a question-answer video obtaining module, configured to obtain a question-answer video, where the question-answer video includes a reference user question video and a reference answer video of the avatar, the reference user question video includes face image information of a reference user and a reference user question audio, and the reference answer video includes the avatar, an answer audio, and reference answer text information corresponding to the answer audio; the first extraction module is used for respectively extracting reference acoustic features of the reference user question audio and labeled acoustic features of the response audio; the key point determining module is used for determining the labeling key point information of the face area of the virtual image and determining the face key point information of the reference user according to the face image information of the reference user; the second extraction module is used for extracting reference voice characteristic information of the reference response text information; and the training module is used for performing model training by taking the reference acoustic feature, the face key point information of the reference user and the reference voice feature information as the input of the coding network, taking the output of the coding network as the input of the attention network, taking the output of the attention network as the input of the decoding network, taking the labeled acoustic feature as the target output of the decoding network, taking the output of the decoding network as the input of the post-processing network and taking the labeled key point information as the target output of the post-processing network to obtain the prediction model.
Optionally, the generating module 903 further includes: the emotion category determination submodule is used for determining a target emotion category to be expressed by the virtual character before the prediction submodule predicts the feature of the face area of the virtual character and the first acoustic feature of the response text information according to the user question video and the response text information; the prediction submodule is used for predicting the characteristics of the face area of the virtual image and the first acoustic characteristics of the response text information according to the target emotion category, the user question video and the response text information.
The prediction model training apparatus may be integrated into the video generation apparatus 900, or may be independent of the video generation apparatus 900, and is not particularly limited in this disclosure. In addition, with regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated herein.
The present disclosure also provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, implements the steps of the above-mentioned video generation method provided by the present disclosure.
Referring now to fig. 10, a schematic diagram of an electronic device (e.g., a terminal device or a server) 1000 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 10, the electronic device 1000 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 1001 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)1002 or a program loaded from a storage means 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the electronic apparatus 1000 are also stored. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Generally, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 1007 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 1008 including, for example, magnetic tape, hard disk, and the like; and a communication device 1009. The communication device 1009 may allow the electronic device 1000 to communicate with other devices wirelessly or by wire to exchange data. While fig. 10 illustrates an electronic device 1000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 1009, or installed from the storage means 1008, or installed from the ROM 1002. The computer program, when executed by the processing device 1001, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the client and the server may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a user question video, wherein the user question video comprises face image information and question audio of a user; determine response text information corresponding to the question audio; and generate a response video for responding to the user question in the user question video according to the user question video and the response text information, wherein the response video comprises an avatar that interacts with the user and audio corresponding to the response text information.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or by hardware. In some cases, the name of a module does not constitute a limitation of the module itself; for example, the acquiring module may also be described as a "module for acquiring a user question video".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a video generation method according to one or more embodiments of the present disclosure, including: acquiring a user question video, wherein the user question video comprises face image information and question audio of a user; determining response text information corresponding to the question audio; and generating a response video for responding to the user question in the user question video according to the user question video and the response text information, wherein the response video comprises an avatar that interacts with the user and audio corresponding to the response text information.
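For illustration only, the following Python sketch (not part of the original disclosure) shows one way the three steps of Example 1 could be wired together. The helper functions transcribe_question, answer_question, and render_avatar_video are hypothetical stubs standing in for existing speech recognition, question-answering, and avatar-rendering components; the disclosure does not define these names.

```python
# Minimal, self-contained sketch of the Example 1 pipeline.
# The three stub functions below are hypothetical placeholders, not APIs
# defined by this disclosure.

def transcribe_question(question_audio: bytes) -> str:
    """Stub ASR: convert the question audio to text."""
    return "what is the weather like today"

def answer_question(question_text: str) -> str:
    """Stub QA: determine response text information for the question."""
    return "It looks sunny today."

def render_avatar_video(face_frames, question_audio, response_text) -> dict:
    """Stub renderer: produce avatar frames plus synthesized audio."""
    return {"frames": ["avatar_frame_0", "avatar_frame_1"],
            "audio": f"tts({response_text})"}

def generate_response_video(face_frames, question_audio):
    # Step 1: the user question video (face frames + question audio) is the input.
    # Step 2: determine response text information corresponding to the question audio.
    response_text = answer_question(transcribe_question(question_audio))
    # Step 3: generate a response video containing the avatar and the audio
    # corresponding to the response text information.
    return render_avatar_video(face_frames, question_audio, response_text)

if __name__ == "__main__":
    print(generate_response_video(face_frames=[], question_audio=b""))
```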
Example 2 provides the method of Example 1, wherein the generating a response video for responding to the user question in the user question video according to the user question video and the response text information includes: predicting features of a face region of the avatar and a first acoustic feature of the response text information according to the user question video and the response text information; and generating the response video according to the features of the face region of the avatar and the first acoustic feature.
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, wherein the features of the face region of the avatar include mouth keypoint information of the avatar; and the generating the response video according to the features of the face region of the avatar and the first acoustic feature includes: generating a target image sequence according to the mouth keypoint information of the avatar and pre-stored other face keypoint information of the avatar; generating audio corresponding to the response text information according to the first acoustic feature; and synthesizing the target image sequence and the audio corresponding to the response text information to obtain the response video.
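For illustration only, the following sketch (not part of the original disclosure) shows the synthesis flow of Example 3 under two assumptions: keypoints are (x, y) coordinate arrays, and the first acoustic feature is a mel spectrogram. The vocoder, the frame rasterizer, and the hop size of 256 are hypothetical placeholders; a real system would use a neural vocoder, an image generator, and a muxing step.

```python
# Sketch of the Example 3 synthesis step (illustrative assumptions only).
import numpy as np

def generate_target_image_sequence(mouth_keypoints_seq, stored_other_keypoints):
    """Combine per-frame predicted mouth keypoints with the avatar's
    pre-stored other face keypoints into one keypoint set per frame."""
    frames = []
    for mouth_kp in mouth_keypoints_seq:               # one array per video frame
        full_face = np.concatenate([stored_other_keypoints, mouth_kp], axis=0)
        frames.append(full_face)                       # stand-in for a rendered image
    return frames

def vocode(mel_spectrogram):
    """Stub vocoder: turn the predicted acoustic feature into a waveform."""
    return np.zeros(mel_spectrogram.shape[1] * 256)    # hypothetical hop size of 256

def synthesize_response_video(mouth_keypoints_seq, stored_other_keypoints, mel):
    images = generate_target_image_sequence(mouth_keypoints_seq, stored_other_keypoints)
    audio = vocode(mel)
    return {"images": images, "audio": audio}          # audio/video muxing omitted

# Usage with dummy shapes: 25 frames of 20 mouth keypoints, 48 other keypoints.
video = synthesize_response_video(
    mouth_keypoints_seq=[np.zeros((20, 2)) for _ in range(25)],
    stored_other_keypoints=np.zeros((48, 2)),
    mel=np.zeros((80, 100)),
)
```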
Example 4 provides the method of Example 3, wherein, before the step of generating a target image sequence according to the mouth keypoint information of the avatar and the pre-stored other face keypoint information of the avatar, the generating the response video according to the features of the face region of the avatar and the first acoustic feature further comprises: determining a target emotion category to be expressed by the avatar; and determining keypoint information corresponding to the target emotion category as target keypoint information of the face region of the avatar according to a preset correspondence between emotion categories and keypoint information, wherein the target keypoint information comprises eyebrow keypoint information and/or eye keypoint information; and the generating a target image sequence according to the mouth keypoint information of the avatar and the pre-stored other face keypoint information of the avatar comprises: generating the target image sequence according to the mouth keypoint information of the avatar, the target keypoint information, and the keypoint information, other than the target keypoint information, in the other face keypoint information.
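For illustration only, the following sketch (not part of the original disclosure) shows the preset emotion-to-keypoint lookup and the merge described in Example 4. The emotion categories, the keypoint group names ("eyebrow", "eye", "nose", "mouth"), and the coordinate values are all hypothetical.

```python
# Sketch of the Example 4 lookup and merge (illustrative values only).
import numpy as np

EMOTION_TO_KEYPOINTS = {
    "happy":   {"eyebrow": np.array([[0.0, -1.0]] * 5), "eye": np.array([[0.0, 0.5]] * 6)},
    "neutral": {"eyebrow": np.array([[0.0,  0.0]] * 5), "eye": np.array([[0.0, 0.0]] * 6)},
    "sad":     {"eyebrow": np.array([[0.0,  1.0]] * 5), "eye": np.array([[0.0, -0.3]] * 6)},
}

def target_keypoints_for_emotion(target_emotion: str) -> dict:
    # Keypoint information corresponding to the target emotion category becomes
    # the target keypoint information of the avatar's face region.
    return EMOTION_TO_KEYPOINTS[target_emotion]

def merge_face_keypoints(mouth_kp, target_kp, other_kp):
    # Target keypoints (eyebrow/eye) override the corresponding entries in the
    # pre-stored "other" keypoints; the remaining keypoints are kept as-is.
    merged = dict(other_kp)
    merged.update(target_kp)
    merged["mouth"] = mouth_kp
    return merged

merged = merge_face_keypoints(
    mouth_kp=np.zeros((20, 2)),
    target_kp=target_keypoints_for_emotion("happy"),
    other_kp={"eyebrow": np.zeros((5, 2)), "eye": np.zeros((6, 2)), "nose": np.zeros((9, 2))},
)
print(sorted(merged.keys()))
```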
Example 5 provides the method of Example 4, wherein the determining a target emotion category to be expressed by the avatar comprises: determining the target emotion category according to the response text information and the question audio.
Example 6 provides the method of Example 5, wherein the determining the target emotion category according to the response text information and the question audio comprises: respectively determining a first probability distribution corresponding to the response text information and a second probability distribution corresponding to the question audio, wherein the first probability distribution comprises the probability that the emotion represented by the response text information belongs to each preset emotion category, and the second probability distribution comprises the probability that the emotion represented by the question audio belongs to each preset emotion category; and determining the target emotion category according to the first probability distribution and the second probability distribution.
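For illustration only, the following sketch (not part of the original disclosure) shows one simple way the two probability distributions of Example 6 could be combined. The preset emotion categories and the weighted-average fusion rule are assumptions; the disclosure does not fix the exact fusion method.

```python
# Sketch of the Example 6 fusion step (illustrative fusion rule only).
import numpy as np

EMOTION_CATEGORIES = ["happy", "neutral", "sad", "angry"]  # hypothetical presets

def fuse_emotion_distributions(p_text, p_audio, text_weight=0.5):
    """p_text: first probability distribution (from the response text information).
    p_audio: second probability distribution (from the question audio)."""
    p_text, p_audio = np.asarray(p_text), np.asarray(p_audio)
    fused = text_weight * p_text + (1.0 - text_weight) * p_audio
    return EMOTION_CATEGORIES[int(np.argmax(fused))]

# e.g. the text reads neutral while the audio sounds happy.
print(fuse_emotion_distributions([0.2, 0.6, 0.1, 0.1], [0.7, 0.2, 0.05, 0.05]))
```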
Example 7 provides the method of Example 2, wherein the predicting features of a face region of the avatar and a first acoustic feature of the response text information according to the user question video and the response text information comprises: extracting a second acoustic feature of the question audio; determining features of a face region of the user according to the face image information of the user; extracting voice feature information of the response text information; and predicting the features of the face region of the avatar and the first acoustic feature according to the second acoustic feature, the features of the face region of the user, and the voice feature information.
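For illustration only, the following sketch (not part of the original disclosure) shows one possible realization of the three extraction steps of Example 7, assuming the second acoustic feature is a mel spectrogram. The use of librosa is one convenient choice, and the face-landmark detector and the phoneme/text encoder are hypothetical stubs.

```python
# Sketch of the Example 7 feature extraction (assumptions noted in comments).
import numpy as np
import librosa

def extract_second_acoustic_feature(waveform: np.ndarray, sr: int) -> np.ndarray:
    # Mel spectrogram of the question audio (one common acoustic feature).
    return librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=80)

def extract_user_face_keypoints(face_frame: np.ndarray) -> np.ndarray:
    # Placeholder: a real system would run a face landmark detector here.
    return np.zeros((68, 2))

def extract_voice_feature_info(response_text: str) -> np.ndarray:
    # Placeholder: e.g. phoneme IDs or a text embedding of the response text.
    return np.zeros((len(response_text), 256))

sr = 16000
question_waveform = np.zeros(sr)                      # 1 s of dummy audio
acoustic = extract_second_acoustic_feature(question_waveform, sr)
face_kp = extract_user_face_keypoints(np.zeros((256, 256, 3)))
voice_info = extract_voice_feature_info("It looks sunny today.")
print(acoustic.shape, face_kp.shape, voice_info.shape)
```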
Example 8 provides the method of Example 7, wherein the predicting the features of the face region of the avatar and the first acoustic feature according to the second acoustic feature, the features of the face region of the user, and the voice feature information comprises: inputting the second acoustic feature, the features of the face region of the user, and the voice feature information into a prediction model to obtain the features of the face region of the avatar and the first acoustic feature.
Example 9 provides the method of Example 8, wherein the features of the face region of the user include face keypoint information of the user, the features of the face region of the avatar include face keypoint information of the avatar, and the prediction model includes an encoding network, an attention network, a decoding network, and a post-processing network; the prediction model is obtained by training in the following way: acquiring a question-and-answer video, wherein the question-and-answer video comprises a reference user question video and a reference response video of the avatar, the reference user question video comprises face image information of a reference user and reference user question audio, and the reference response video comprises the avatar, response audio, and reference response text information corresponding to the response audio; respectively extracting reference acoustic features of the reference user question audio and labeled acoustic features of the response audio; determining labeled keypoint information of the face region of the avatar, and determining the face keypoint information of the reference user according to the face image information of the reference user; extracting reference voice feature information of the reference response text information; and performing model training by taking the reference acoustic features, the face keypoint information of the reference user, and the reference voice feature information as the input of the encoding network, taking the output of the encoding network as the input of the attention network, taking the output of the attention network as the input of the decoding network, taking the labeled acoustic features as the target output of the decoding network, taking the output of the decoding network as the input of the post-processing network, and taking the labeled keypoint information as the target output of the post-processing network, so as to obtain the prediction model.
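For illustration only, the following PyTorch sketch (not part of the original disclosure) reflects the encoder → attention → decoder → post-processing data flow and the two training targets of Example 9. The GRU/multi-head-attention choices, layer sizes, loss weighting, and dummy shapes are all assumptions; only the overall wiring follows the example.

```python
# Compact sketch of the Example 9 prediction model and one training step.
import torch
import torch.nn as nn

class PredictionModel(nn.Module):
    def __init__(self, acoustic_dim=80, keypoint_dim=68 * 2, text_dim=256, hidden=256):
        super().__init__()
        # Encoding network: fuse question acoustics, user face keypoints,
        # and voice feature information of the response text.
        self.encoder = nn.GRU(acoustic_dim + keypoint_dim + text_dim, hidden, batch_first=True)
        # Attention network over the encoded sequence.
        self.attention = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # Decoding network (with a projection treated as part of it): target is
        # the labeled acoustic features of the reference response audio.
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.acoustic_head = nn.Linear(hidden, acoustic_dim)
        # Post-processing network: target is the labeled avatar keypoint information.
        self.postnet = nn.Sequential(nn.Linear(acoustic_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, keypoint_dim))

    def forward(self, acoustic, user_keypoints, voice_info):
        x = torch.cat([acoustic, user_keypoints, voice_info], dim=-1)
        enc, _ = self.encoder(x)
        attended, _ = self.attention(enc, enc, enc)
        dec, _ = self.decoder(attended)
        pred_acoustic = self.acoustic_head(dec)        # first acoustic feature
        pred_keypoints = self.postnet(pred_acoustic)   # avatar face keypoints
        return pred_acoustic, pred_keypoints

# One illustrative training step on dummy tensors (batch=2, 50 frames).
model = PredictionModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
acoustic = torch.zeros(2, 50, 80)          # reference acoustic features (question audio)
user_kp = torch.zeros(2, 50, 68 * 2)       # reference user's face keypoints per frame
voice = torch.zeros(2, 50, 256)            # reference voice feature information
target_acoustic = torch.zeros(2, 50, 80)   # labeled acoustic features (response audio)
target_kp = torch.zeros(2, 50, 68 * 2)     # labeled avatar keypoint information

pred_acoustic, pred_kp = model(acoustic, user_kp, voice)
loss = nn.functional.mse_loss(pred_acoustic, target_acoustic) + \
       nn.functional.mse_loss(pred_kp, target_kp)
opt.zero_grad()
loss.backward()
opt.step()
```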
Example 10 provides the method of any one of Examples 2-9, wherein, before the step of predicting features of a face region of the avatar and a first acoustic feature of the response text information according to the user question video and the response text information, the generating a response video for responding to the user question in the user question video according to the user question video and the response text information further comprises: determining a target emotion category to be expressed by the avatar; and the predicting, according to the user question video and the response text information, features of a face region of the avatar and a first acoustic feature of the response text information comprises: predicting the features of the face region of the avatar and the first acoustic feature of the response text information according to the target emotion category, the user question video, and the response text information.
Example 11 provides, in accordance with one or more embodiments of the present disclosure, a video generation apparatus comprising an acquiring module, a determining module, and a generating module, wherein: the acquiring module is used for acquiring a user question video, and the user question video comprises face image information and question audio of a user; the determining module is used for determining response text information corresponding to the question audio acquired by the acquiring module; and the generating module is used for generating a response video for responding to the user question in the user question video according to the user question video acquired by the acquiring module and the response text information determined by the determining module, wherein the response video comprises an avatar that interacts with the user and audio corresponding to the response text information.
Example 12 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, performs the steps of the method of any of examples 1-10, in accordance with one or more embodiments of the present disclosure.
Example 13 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising: a storage device having a computer program stored thereon; processing means for executing the computer program in the storage means to carry out the steps of the method of any of examples 1-10.
The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (13)

1. A method of video generation, comprising:
acquiring a user question video, wherein the user question video comprises face image information and question audio of a user;
determining response text information corresponding to the question audio;
and generating a response video for responding to the user question in the user question video according to the user question video and the response text information, wherein the response video comprises an avatar that interacts with the user and audio corresponding to the response text information.
2. The method according to claim 1, wherein the generating a response video for responding to the user question in the user question video according to the user question video and the response text information comprises:
predicting features of a face region of the avatar and a first acoustic feature of the response text information according to the user question video and the response text information;
and generating the response video according to the features of the face region of the avatar and the first acoustic feature.
3. The method according to claim 2, wherein the features of the face region of the avatar include mouth keypoint information of the avatar;
the generating the response video according to the features of the face region of the avatar and the first acoustic feature includes:
generating a target image sequence according to the mouth keypoint information of the avatar and pre-stored other face keypoint information of the avatar;
generating audio corresponding to the response text information according to the first acoustic feature;
and synthesizing the target image sequence and the audio corresponding to the response text information to obtain the response video.
4. The method according to claim 3, wherein, before the generating the target image sequence according to the mouth keypoint information of the avatar and the pre-stored other face keypoint information of the avatar, the generating the response video according to the features of the face region of the avatar and the first acoustic feature further comprises:
determining a target emotion category to be expressed by the avatar;
determining keypoint information corresponding to the target emotion category as target keypoint information of the face region of the avatar according to a preset correspondence between emotion categories and keypoint information, wherein the target keypoint information comprises eyebrow keypoint information and/or eye keypoint information;
and the generating a target image sequence according to the mouth keypoint information of the avatar and the pre-stored other face keypoint information of the avatar comprises:
generating the target image sequence according to the mouth keypoint information of the avatar, the target keypoint information, and the keypoint information, other than the target keypoint information, in the other face keypoint information.
5. The method of claim 4, wherein said determining a target emotion category to be expressed by said avatar comprises:
and determining the target emotion category according to the response text information and the question audio.
6. The method of claim 5, wherein the determining the target emotion category according to the response text information and the question audio comprises:
respectively determining a first probability distribution corresponding to the response text information and a second probability distribution corresponding to the question audio, wherein the first probability distribution comprises the probability that the emotion represented by the response text information belongs to each preset emotion category, and the second probability distribution comprises the probability that the emotion represented by the question audio belongs to each preset emotion category;
and determining the target emotion category according to the first probability distribution and the second probability distribution.
7. The method of claim 2, wherein the predicting the features of the face region of the avatar and the first acoustic feature of the response text information according to the user question video and the response text information comprises:
extracting a second acoustic feature of the question audio;
determining features of a face region of the user according to the face image information of the user;
extracting voice feature information of the response text information;
and predicting the features of the face region of the avatar and the first acoustic feature according to the second acoustic feature, the features of the face region of the user, and the voice feature information.
8. The method of claim 7, wherein the predicting the features of the face region of the avatar and the first acoustic feature according to the second acoustic feature, the features of the face region of the user, and the voice feature information comprises:
and inputting the second acoustic feature, the features of the face region of the user, and the voice feature information into a prediction model to obtain the features of the face region of the avatar and the first acoustic feature.
9. The method of claim 8, wherein the features of the face region of the user include face keypoint information of the user, the features of the face region of the avatar include face keypoint information of the avatar, and the prediction model includes an encoding network, an attention network, a decoding network, and a post-processing network;
the prediction model is obtained by training in the following way:
acquiring a question-and-answer video, wherein the question-and-answer video comprises a reference user question video and a reference response video of the avatar, the reference user question video comprises face image information of a reference user and reference user question audio, and the reference response video comprises the avatar, response audio, and reference response text information corresponding to the response audio;
respectively extracting reference acoustic features of the reference user question audio and labeled acoustic features of the response audio;
determining labeled keypoint information of the face region of the avatar, and determining the face keypoint information of the reference user according to the face image information of the reference user;
extracting reference voice feature information of the reference response text information;
and performing model training by taking the reference acoustic features, the face keypoint information of the reference user, and the reference voice feature information as the input of the encoding network, taking the output of the encoding network as the input of the attention network, taking the output of the attention network as the input of the decoding network, taking the labeled acoustic features as the target output of the decoding network, taking the output of the decoding network as the input of the post-processing network, and taking the labeled keypoint information as the target output of the post-processing network, so as to obtain the prediction model.
10. The method according to any one of claims 2 to 9, wherein, before the step of predicting the features of the face region of the avatar and the first acoustic feature of the response text information according to the user question video and the response text information, the generating a response video for responding to the user question in the user question video according to the user question video and the response text information further comprises:
determining a target emotion category to be expressed by the avatar;
and the predicting, according to the user question video and the response text information, the features of the face region of the avatar and the first acoustic feature of the response text information comprises:
predicting the features of the face region of the avatar and the first acoustic feature of the response text information according to the target emotion category, the user question video, and the response text information.
11. A video generation apparatus, comprising:
an acquiring module, a determining module, and a generating module, wherein the acquiring module is used for acquiring a user question video, and the user question video comprises face image information and question audio of a user;
the determining module is used for determining the response text information corresponding to the question audio acquired by the acquiring module;
and the generating module is used for generating a response video for responding to the user question in the user question video according to the user question video acquired by the acquiring module and the response text information determined by the determining module, wherein the response video comprises an avatar that interacts with the user and audio corresponding to the response text information.
12. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1-10.
13. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 10.
CN202110098921.6A 2021-01-25 2021-01-25 Video generation method, device, medium and electronic equipment Pending CN112785667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110098921.6A CN112785667A (en) 2021-01-25 2021-01-25 Video generation method, device, medium and electronic equipment


Publications (1)

Publication Number Publication Date
CN112785667A true CN112785667A (en) 2021-05-11

Family

ID=75759076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110098921.6A Pending CN112785667A (en) 2021-01-25 2021-01-25 Video generation method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112785667A (en)


Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005352892A (en) * 2004-06-11 2005-12-22 Nippon Telegr & Teleph Corp <Ntt> Information processing device and information processing program
WO2018033143A1 (en) * 2016-08-19 2018-02-22 北京市商汤科技开发有限公司 Video image processing method, apparatus and electronic device
CN108564942A (en) * 2018-04-04 2018-09-21 南京师范大学 One kind being based on the adjustable speech-emotion recognition method of susceptibility and system
CN109522818A (en) * 2018-10-29 2019-03-26 中国科学院深圳先进技术研究院 A kind of method, apparatus of Expression Recognition, terminal device and storage medium
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium
US20200234478A1 (en) * 2019-01-22 2020-07-23 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and Apparatus for Processing Information
CN110675859A (en) * 2019-09-05 2020-01-10 华南理工大学 Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110688911A (en) * 2019-09-05 2020-01-14 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium
CN110570356A (en) * 2019-09-18 2019-12-13 北京市商汤科技开发有限公司 image processing method and device, electronic device and storage medium
CN110807388A (en) * 2019-10-25 2020-02-18 深圳追一科技有限公司 Interaction method, interaction device, terminal equipment and storage medium
CN111145786A (en) * 2019-12-17 2020-05-12 深圳追一科技有限公司 Speech emotion recognition method and device, server and computer readable storage medium
CN111263016A (en) * 2020-01-10 2020-06-09 深圳追一科技有限公司 Communication assistance method, communication assistance device, computer equipment and computer-readable storage medium
CN111047516A (en) * 2020-03-12 2020-04-21 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN111583112A (en) * 2020-04-29 2020-08-25 华南理工大学 Method, system, device and storage medium for video super-resolution
CN111862279A (en) * 2020-07-23 2020-10-30 中国工商银行股份有限公司 Interaction processing method and device
CN111970513A (en) * 2020-08-14 2020-11-20 成都数字天空科技有限公司 Image processing method and device, electronic equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963092A (en) * 2021-11-30 2022-01-21 网易(杭州)网络有限公司 Audio and video fitting correlation calculation method, device, medium and equipment
CN114401431A (en) * 2022-01-19 2022-04-26 中国平安人寿保险股份有限公司 Virtual human explanation video generation method and related device
CN114401431B (en) * 2022-01-19 2024-04-09 中国平安人寿保险股份有限公司 Virtual person explanation video generation method and related device

Similar Documents

Publication Publication Date Title
CN111292720B (en) Speech synthesis method, device, computer readable medium and electronic equipment
CN111933110B (en) Video generation method, generation model training method, device, medium and equipment
EP3469592B1 (en) Emotional text-to-speech learning system
CN111583900B (en) Song synthesis method and device, readable medium and electronic equipment
JP6802005B2 (en) Speech recognition device, speech recognition method and speech recognition system
WO2020073944A1 (en) Speech synthesis method and device
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
CN112099628A (en) VR interaction method and device based on artificial intelligence, computer equipment and medium
CN111899719A (en) Method, apparatus, device and medium for generating audio
KR20210103002A (en) Speech synthesis method and apparatus based on emotion information
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN112489621B (en) Speech synthesis method, device, readable medium and electronic equipment
CN112786011A (en) Speech synthesis method, synthesis model training method, apparatus, medium, and device
CN112786007A (en) Speech synthesis method, device, readable medium and electronic equipment
CN111292719A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN109697978B (en) Method and apparatus for generating a model
CN112309367B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112765971A (en) Text-to-speech conversion method and device, electronic equipment and storage medium
CN112785667A (en) Video generation method, device, medium and electronic equipment
CN113421550A (en) Speech synthesis method, device, readable medium and electronic equipment
CN114495902A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111681661B (en) Speech recognition method, apparatus, electronic device and computer readable medium
CN114613351A (en) Rhythm prediction method, device, readable medium and electronic equipment
CN114155829A (en) Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment
US20230017892A1 (en) Injecting Text in Self-Supervised Speech Pre-training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination