CN114401438B - Video generation method and device for virtual digital person, storage medium and terminal - Google Patents


Info

Publication number
CN114401438B
Authority
CN
China
Prior art keywords
text
driving instruction
action
animation data
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111674444.XA
Other languages
Chinese (zh)
Other versions
CN114401438A (en)
Inventor
柴金祥
谭宏冰
熊兴堂
王从艺
王斌
梁志强
戴鹭琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Movu Technology Co Ltd
Mofa Shanghai Information Technology Co Ltd
Original Assignee
Shanghai Movu Technology Co Ltd
Mofa Shanghai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Movu Technology Co Ltd, Mofa Shanghai Information Technology Co Ltd filed Critical Shanghai Movu Technology Co Ltd
Priority to CN202111674444.XA
Publication of CN114401438A
Application granted
Publication of CN114401438B
Priority to PCT/CN2022/138360 (WO2023124933A1)
Legal status: Active
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A video generation method and device for a virtual digital person, a storage medium and a terminal are provided. The method includes: acquiring input information, wherein the input information includes input text and/or input voice; determining a text driving instruction according to the input information, wherein the text driving instruction includes a text; generating an action driving instruction corresponding to the text driving instruction according to the semantics of the text in the text driving instruction; and generating a video of the virtual digital person according to audio data, facial animation data and action animation data, wherein the audio data and the facial animation data are obtained according to the text driving instruction, and the action animation data are obtained according to the action driving instruction. With the scheme of the invention, the video required by a user can be generated quickly and efficiently.

Description

Video generation method and device for virtual digital person, storage medium and terminal
Technical Field
The invention relates to the technical field of video generation, in particular to a video generation method and device for a virtual digital person, a storage medium and a terminal.
Background
With the development of internet technology and self-media, content presentation has gradually shifted from text and pictures toward video. At present, more and more users publish long and short videos on various video platforms, and the demand for video production keeps growing. Most of these videos currently feature real persons: in product publicity videos, products are generally introduced and explained by a real presenter, and in live-streaming videos, broadcasts are generally carried out by real hosts. Since the production of such videos depends on real persons, production efficiency is low and cost is high.
Therefore, there is a need for a method for generating a video of a virtual digital person that can produce the video a user requires quickly and efficiently without recording a real person.
Disclosure of Invention
The invention aims to provide a video generation method of a virtual digital person, which can quickly and efficiently generate a video required by a user.
In order to solve the above technical problem, an embodiment of the present invention provides a video generation method for a virtual digital person, where the method includes: acquiring input information, wherein the input information includes input text and/or input voice; determining a text driving instruction according to the input information, wherein the text driving instruction includes a text; generating an action driving instruction corresponding to the text driving instruction according to the semantics of the text in the text driving instruction; and generating a video of the virtual digital person according to audio data, facial animation data and action animation data, wherein the audio data and the facial animation data are obtained according to the text driving instruction, and the action animation data are obtained according to the action driving instruction.
Optionally, determining a text-driven instruction according to the input information includes: performing word segmentation processing on the input text to obtain a plurality of texts; and generating a text driving instruction corresponding to each text according to each text.
Optionally, determining a text-driven instruction according to the input information includes: performing voice recognition on the input voice to obtain text content corresponding to the input voice; performing word segmentation processing on the text content to obtain a plurality of texts; and generating a text driving instruction corresponding to each text according to each text.
Optionally, before generating the video of the virtual digital person according to the audio data, the face animation data, and the motion animation data, the method further includes: and selecting and determining corresponding action animation data from a preset action database according to the action identifier in the action driving command.
Optionally, before generating the video of the virtual digital person according to the audio data, the face animation data and the motion animation data, the method includes: acquiring feedback information, wherein the feedback information is used for indicating a text driving instruction corresponding to the audio data being output; and judging whether the next text driving instruction has a corresponding action driving instruction according to the feedback information, and if so, determining corresponding action animation data according to the action driving instruction.
Optionally, before generating the video of the virtual digital person according to the audio data, the face animation data and the motion animation data, the method includes: acquiring a first text selected by a user in the text content corresponding to the input information; acquiring an action identifier input by a user aiming at the first text; and generating an action driving instruction corresponding to the first text driving instruction according to an action identifier input by a user for the first text, wherein the first text driving instruction is a text driving instruction containing the first text.
Optionally, before generating the video of the virtual digital person according to the audio data, the face animation data, and the motion animation data, the method further includes: acquiring a second text selected by the user in the text content corresponding to the input information; acquiring display content input by a user aiming at the second text; generating a display driving instruction corresponding to the second text driving instruction according to the display content, wherein the second text driving instruction is a text driving instruction containing the second text; and generating an action driving instruction corresponding to the display driving instruction according to the display driving instruction.
Optionally, before generating the video of the virtual digital person according to the audio data, the face animation data and the motion animation data, the method includes: acquiring feedback information, wherein the feedback information is used for indicating a text driving instruction corresponding to the audio data being output; and judging whether the next text driving instruction has a corresponding display driving instruction according to the feedback information, and if so, displaying display contents corresponding to the display driving instruction.
Optionally, generating the video of the virtual digital person according to the audio data, the facial animation data and the action animation data includes: performing fusion processing on the audio data, the facial animation data and the action animation data to obtain processed animation data; and solving and rendering the processed animation data to obtain the video of the virtual digital person.
Optionally, before generating the video of the virtual digital person according to the audio data, the facial animation data, and the motion animation data, the method further includes: acquiring object information input by a user, wherein the object information is used for describing the image of the virtual digital person; and generating the virtual digital person according to the object information.
In order to solve the above technical problem, an embodiment of the present invention further provides a video generation apparatus for a virtual digital person, where the apparatus includes: an acquisition module, configured to acquire input information, the input information including input text or input voice; a text driving instruction generating module, configured to determine a text driving instruction according to the input information, the text driving instruction including a text; an action driving instruction generating module, configured to generate an action driving instruction corresponding to the text driving instruction according to the semantics of the text in the text driving instruction; and a video generation module, configured to generate the video of the virtual digital person according to audio data, facial animation data and action animation data, wherein the audio data and the facial animation data are obtained according to the text driving instruction, and the action animation data are obtained according to the action driving instruction.
An embodiment of the present invention further provides a storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the above video generation method for a virtual digital person.
The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the video generation method of the virtual digital person when running the computer program.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
according to the scheme of the embodiment of the invention, input information is acquired, a text driving instruction and an action driving instruction are determined according to the input information, and finally, a video of a virtual digital person is generated according to audio animation data, face animation data and action animation data. Since the action driving command is obtained according to the semantics of the text in the text driving command, the text driving command and the action driving command having the correspondence are semantically identical. Because the audio data and the face animation data are obtained according to the text driving instruction, and the motion animation data are obtained according to the motion driving instruction, the motion animation data, the audio data and the face animation data can also have the same meaning, so that the virtual digital person can have real and natural sound, facial expression and limb movement, video recording of the real person is not needed, and high-quality video of the virtual digital person can be efficiently generated.
Drawings
Fig. 1 is a schematic flow chart of a video generation method for a virtual digital person according to an embodiment of the present invention;
FIG. 2 is a schematic view of a scene of a video generation method for a virtual digital person according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a video generation apparatus for a virtual digital person according to an embodiment of the present invention.
Detailed Description
As described in the background, there is a need for a video generation method for a virtual digital person to generate a video required by a user quickly and efficiently.
The inventors have found that video production in the prior art usually depends on recording a real person, so video generation efficiency is low. Technologies that replace the real person with a virtual digital person have therefore emerged, but the quality of virtual digital person videos generated with the prior art is low and differs noticeably from videos recorded with a real person.
In order to solve the foregoing technical problem, an embodiment of the present invention provides a method for generating a video of a virtual digital person. In the scheme of the embodiment, input information is obtained, a text driving instruction and an action driving instruction are determined according to the input information, and finally a video of the virtual digital person is generated according to audio data, facial animation data and action animation data. Since the action driving instruction is obtained according to the semantics of the text in the text driving instruction, a text driving instruction and its corresponding action driving instruction are semantically consistent. Because the audio data and the facial animation data are obtained according to the text driving instruction, and the action animation data are obtained according to the action driving instruction, the action animation data, the audio data and the facial animation data also carry the same meaning, so that the virtual digital person can have realistic and natural voice, facial expressions and body movements. No recording of a real person is needed, and a high-quality video of the virtual digital person can be generated efficiently.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Referring to fig. 1, fig. 1 is a flowchart illustrating a video generation method for a virtual digital person according to an embodiment of the present invention, where the method may be executed by a terminal, and the terminal may be various existing terminal devices with data receiving and processing capabilities, such as, but not limited to, a mobile phone, a computer, a tablet computer, and the like. The virtual digital person may be a virtual person of various figures, for example, a virtual newsreader, a virtual teacher, a virtual anchor, and the like, but is not limited thereto. It should be noted that the virtual digital person may be three-dimensional or two-dimensional, which is not limited in this embodiment of the present invention. Through the scheme of the embodiment of the invention, a user can quickly and efficiently generate the video of the high-quality virtual digital person only by inputting a section of text or voice, and the embodiment of the invention does not limit the specific content of the video at all, for example, the video can be a news broadcast video, a product introduction video, a knowledge science popularization video and the like, but is not limited thereto. The video generation method of a virtual digital person shown in fig. 1 may include the steps of:
step S101: acquiring input information, wherein the input information comprises input text and/or input voice;
step S102: determining a text driving instruction according to the input information;
step S103: generating an action driving instruction corresponding to the text driving instruction according to the semantic meaning of the text in the text driving instruction;
step S104: and generating the video of the virtual digital person according to the audio data, the face animation data and the action animation data.
It is understood that in a specific implementation, the method may be implemented by a software program running in a processor integrated within a chip or a chip module; alternatively, the method can be implemented in hardware or a combination of hardware and software.
In the specific implementation of step S101, input information may be obtained, where the input information may be input by a user, for example, the input information may be obtained from a user terminal, or may be uploaded to a server by the user terminal in advance and then obtained from the server, or the input information may be stored in a local database in advance, but is not limited thereto. The method for acquiring the input information is not limited in the embodiments of the present invention.
The input information may include input text, input voice, or both input text and input voice. The input text is input information in text form; the embodiment of the present invention does not limit the type or content of the text, and the input text may include, for example but not limited to, Chinese characters, English characters, numbers, special characters, and other common characters. The input voice is input information in audio form, and the input voice may be pre-recorded, but is not limited thereto.
In a specific implementation of step S102, a plurality of text-driven instructions may be determined according to the input information, wherein each text-driven instruction may include text.
In a specific example, the input text may be subjected to word segmentation processing to obtain a plurality of texts, where each text may be a minimum-unit word in the input text. The minimum-unit word may be a single word, or may be a phrase, an idiom, or the like that expresses a specific meaning. The number of characters in a minimum-unit word is not limited in the embodiment of the present invention; it may be, for example but not limited to, "hello", "I", "thank you", and the like.
In another specific example, the input speech may be subjected to speech recognition to obtain text content corresponding to the input speech, that is, to convert the input speech into input information in text form. Further, word segmentation processing may be performed on text content corresponding to the input speech to obtain a plurality of texts. For more contents of processing the text content, reference may be made to the above related contents of performing word segmentation processing on the input text, and details are not described herein again.
Further, a plurality of text driving instructions may be generated from the plurality of texts. It should be noted that text driving instructions correspond to texts one-to-one. That is, for each text, a text driving instruction corresponding to the text may be generated, where the text driving instruction corresponding to each text includes that text. More specifically, a text driving instruction may include only the text.
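As a non-limiting illustration of this step, the following Python sketch segments an input text and creates one text driving instruction per minimum-unit word. The jieba segmenter and the TextDrivenInstruction structure are assumptions introduced here for illustration only; the embodiment does not prescribe a particular segmenter or data structure.

    from dataclasses import dataclass

    import jieba  # a common Chinese word-segmentation library, used purely as an example


    @dataclass
    class TextDrivenInstruction:
        index: int   # position of the text on the text axis
        text: str    # the minimum-unit word carried by this instruction


    def build_text_instructions(input_text: str) -> list:
        """Segment the input text and create one text driving instruction per token."""
        tokens = [t for t in jieba.lcut(input_text) if t.strip()]  # punctuation handling omitted
        return [TextDrivenInstruction(index=i, text=t) for i, t in enumerate(tokens)]


    # e.g. "嗨,我叫小明" yields one instruction per minimum-unit word ("嗨", "我", "叫", ...)
    instructions = build_text_instructions("嗨,我叫小明")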
In a specific implementation of step S103, an action drive command may be generated.
Specifically, for at least one text-driven instruction, an action-driven instruction corresponding to the text-driven instruction may be generated according to the semantic meaning of the text in the text-driven instruction.
The semantics of a text can be obtained by performing semantic analysis on the text. It should be noted that the method for determining the semantics of the text is not limited in the embodiment of the present invention; for example, the semantics may be obtained by processing the input text with natural language processing (NLP) technology, but is not limited thereto.
In a specific example, the action driving instruction corresponding to the text driving instruction may be generated according to semantics of a text in the text driving instruction. The action driving instruction includes an action identifier, which may be a character string, and the like, and the embodiment does not limit the expression form of the action identifier. In other words, the action identification may be determined from the semantics of the text.
More specifically, texts with the same semantics are mapped to the same action identifier, and texts with different semantics are mapped to different action identifiers. That is, action identifiers and semantics have a correspondence.
Different action identifiers correspond to actions with different meanings; in other words, if the action identifiers are the same, the actions corresponding to the action driving instructions have the same meaning. Thus, the action identifier can indicate both the semantics and the meaning of the action.
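As a hedged illustration of mapping semantics to action identifiers, the sketch below uses a hypothetical classify_semantics() helper in place of whatever NLP model an implementation would actually use; the semantic labels and identifiers are invented for the example.

    SEMANTIC_TO_ACTION_ID = {
        "greeting": "ACTION_WAVE",            # e.g. "hi"
        "self_introduction": "ACTION_POINT_SELF",
        "other": "ACTION_IDLE",
    }


    def classify_semantics(text: str) -> str:
        """Placeholder for an NLP model: return a semantic label for the text."""
        if text in ("hi", "hello", "嗨"):
            return "greeting"
        if text in ("i", "我", "我叫"):
            return "self_introduction"
        return "other"


    def action_instruction_for(text: str) -> dict:
        # Texts with the same semantics always map to the same action identifier.
        semantic = classify_semantics(text)
        return {"action_id": SEMANTIC_TO_ACTION_ID[semantic], "semantic": semantic}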
In another specific example, a text selected by the user from the text content corresponding to the input information may be obtained and recorded as the first text. When the input information is text information, the corresponding text content is the input information; and when the input information is audio information, performing voice recognition on the input voice to obtain text content corresponding to the input voice.
Further, an action identifier set by the user for the first text may be acquired, and then an action driving instruction corresponding to the first text driving instruction may be generated according to the action identifier set by the user for the first text. The first text-driven instruction is a text-driven instruction including a first text, that is, a text-driven instruction corresponding to the first text. By adopting the scheme, the user can configure the limb actions of the virtual digital person by himself, and the personalized setting of the limb actions is realized.
In a specific implementation, before generating an action driving command according to semantics in a first text driving command, it may be determined whether the first text driving command has a corresponding action driving command, and if so, the action driving command is no longer generated according to the semantics of the first text. In other words, it may be determined whether the user has configured the action-driving instruction first, and if so, the action-driving instruction is not generated according to the semantic meaning of the text.
In addition, if the corresponding action driving instruction is generated in advance according to the semantic meaning of the text in the first text driving instruction, the action driving instruction generated according to the action identifier set by the user for the first text may be adopted for updating.
In another specific example, a second text selected by the user from text contents corresponding to the input information may be further obtained, and then display contents input by the user for the second text may be obtained; and further, generating a display driving instruction corresponding to a second text driving instruction according to the display content input by the user for the second text, wherein the second text driving instruction is a text driving instruction containing the second text. The display content refers to content that needs to be displayed.
Further, an action driving instruction corresponding to the display driving instruction may be generated according to the display driving instruction; that is, an action driving instruction corresponding to the second text driving instruction may be generated according to the display driving instruction. For example, if the display content is a picture, text, a table, or the like, the action driving instruction generated according to the display driving instruction may correspond to a guiding action. In other words, the corresponding action is a guiding action.
Further, an action driving instruction corresponding to the text driving instruction can be generated according to the semantics and the display driving instruction of the text in the text driving instruction. Wherein, the display driving command is used as a reference in the process of generating the action driving command. Taking the display driving instruction as a reference in the process of generating the action driving instruction means that before the action driving instruction is generated according to the semantics in the text driving instruction, if the text driving instruction has a corresponding display driving instruction and the action driving instruction is generated according to the display driving instruction, the action driving instruction is not generated according to the semantics of the text in the text driving instruction.
Accordingly, the input information entered by the user, action identifiers set for one or more first texts, and display contents set for one or more second texts can be obtained, and then a plurality of action driving instructions can be generated. Specifically, an action driving instruction corresponding to each first text driving instruction may be generated according to the action identifier input by the user, and an action driving instruction corresponding to each second text driving instruction may be generated according to the display content input by the user; then, for the remaining text driving instructions, corresponding action driving instructions can be generated according to the semantics of the text, as sketched below.
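The priority just described can be summarized in a short sketch, assuming hypothetical names for the guiding-action identifier and the semantic fallback rule:

    GUIDE_ACTION_ID = "ACTION_GUIDE"  # assumed identifier for a guiding action


    def resolve_action(text, user_action_id=None, display_content=None):
        if user_action_id is not None:          # user-configured identifier wins
            return {"action_id": user_action_id, "source": "user"}
        if display_content is not None:         # display driving instruction -> guiding action
            return {"action_id": GUIDE_ACTION_ID, "source": "display"}
        # otherwise derive the identifier from the text semantics (placeholder rule)
        return {"action_id": "FROM_SEMANTICS_OF:" + text, "source": "semantics"}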
Referring to fig. 2, fig. 2 is a scene schematic diagram of a video generation method for a virtual digital person in an embodiment of the present invention. The scheme of the present embodiment is described below in a non-limiting manner with reference to fig. 2.
In the solution of this embodiment, after the input information is obtained, a plurality of text-driven instructions may be determined, and the plurality of text-driven instructions are sequentially arranged according to the order of the input information, so that the text axis 21 may be obtained. More specifically, different positions of the text axis correspond to different texts, and the texts are arranged semantically on the text axis. Wherein each text has a text-driven instruction in one-to-one correspondence therewith.
Further, an action driving instruction corresponding to one or more text driving instructions may be determined to obtain the action axis 22. As shown in fig. 2, the action identifier ID1 corresponds to the text driving instruction "hi", and the action identifier ID2 corresponds to the text driving instruction "i". Each action driving instruction drives an action sequence with a certain duration, and the speed of the action can be set by the user.
Further, each generated action driving instruction may be displayed as its corresponding action identifier on the action axis 22, and if an action does not meet the user's expectation, the user may adjust the action identifier on the action axis 22. Specifically, the user may set a corresponding action identifier for the text at any position to add a new action driving instruction or adjust an existing action driving instruction. A text and an action identifier with a corresponding relation may be aligned.
Further, the user may set corresponding display content for the text at an arbitrary position to obtain the display axis 23. In a specific implementation, the text selected by the user can be determined from the position clicked with the mouse on the interface and recorded as a second text, and then the display content corresponding to the second text can be set.
As shown in fig. 2, the user may set the presentation content P1 corresponding to the second text "world" and the user may set the presentation content P2 corresponding to the second text "product". After the display content input by the user is obtained, a corresponding display driving instruction can be generated. Furthermore, after the display driving command is generated, an action driving command corresponding to the display driving command can be further generated. For example, if the presentation content P2 is a picture, the corresponding action driving command with the action ID of ID3 may be determined. Wherein, the action corresponding to the action identifier ID3 is a guiding action.
Therefore, a plurality of text driving instructions, a plurality of action driving instructions and display driving instructions can be obtained according to one or more of the input information, the action identifiers set by the user and the display contents, where the text driving instructions, the action driving instructions and the display driving instructions are semantically aligned.
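The three semantically aligned tracks of fig. 2 can be represented, for illustration only, by a structure such as the following; the field names and example indices merely mirror the figure.

    from dataclasses import dataclass, field


    @dataclass
    class DrivenTimeline:
        texts: list                                   # text axis, in input order
        actions: dict = field(default_factory=dict)   # index -> action identifier
        displays: dict = field(default_factory=dict)  # index -> display content reference


    timeline = DrivenTimeline(texts=["hi", "world", "o", "i call", "product"])
    timeline.actions[0] = "ID1"     # "hi"      -> action ID1
    timeline.actions[3] = "ID2"     # "i call"  -> action ID2
    timeline.displays[1] = "P1"     # "world"   -> display content P1
    timeline.displays[4] = "P2"     # "product" -> display content P2
    timeline.actions[4] = "ID3"     # guiding action generated from display content P2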
With continued reference to FIG. 1, in an implementation of step S104, a video of a virtual digital person may be generated according to a variety of instructions.
Specifically, the audio data may be generated sequentially from the plurality of text driving instructions. More specifically, the speech segment corresponding to each text driving instruction can be generated in response to each text driving instruction in order from front to back along the text axis. Text-to-speech (TTS) technology may be used to generate the speech segment corresponding to each text driving instruction. It can be understood that the speech segments corresponding to the text driving instructions together constitute the final audio data.
In particular, text-to-speech and animation techniques may be employed to generate audio data and animation data from the text in the text driving instructions, where the text in the text driving instructions corresponds to the original input text. The specific process is described below.
Acquiring text information, wherein the text information comprises a text of virtual object animation data to be generated; analyzing emotional characteristics and prosodic boundaries of the text information; performing voice synthesis according to the emotional characteristics, the prosodic boundary and the text information to obtain audio data, wherein the audio data comprise voice with emotion obtained by conversion based on the text information; corresponding virtual object animation data is generated based on the text information and the audio data, and the virtual object animation data is synchronized in time with the audio data, and the virtual object animation data may include face animation data of the virtual object.
Further, analyzing the emotional features and prosodic boundaries of the text information includes: performing word segmentation processing on the text information; for each word obtained by word segmentation, carrying out emotion analysis on the word to obtain the emotional characteristics of the word; prosodic boundaries for each word are determined.
Further, analyzing the emotional features and prosodic boundaries of the text information may also include: analyzing the emotional features of the text information based on a preset text front-end prediction model, where the input of the preset text front-end prediction model is the text information, and the output is the emotional features, prosodic boundaries and word segmentation of the text information.
In one implementation, the predictive text-front model may include a coupled Recurrent Neural Network (RNN) and Conditional Random Fields (CRF). That is, the present embodiment employs a deep learning model of RNN + CRF to quickly predict emotion characteristics and prosodic boundary estimation of each word of text information.
It should be noted that the preset text front-end prediction model may output the emotional features, the prosodic boundary, and the word segmentation result of the text message at the same time. And in the preset text front-end prediction model, word segmentation can be performed firstly, and then the word segmentation result is processed to obtain the corresponding emotional characteristics and prosodic boundary.
Further, performing speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio data includes: inputting the text information, the emotional features and the prosodic boundaries into a preset speech synthesis model, where the preset speech synthesis model is used for converting an input text sequence into a speech sequence in time order, and the speech in the speech sequence has the emotion corresponding to the text at each time point; and acquiring the audio data output by the preset speech synthesis model.
Further, the preset speech synthesis model is obtained by training based on training data, wherein the training data comprises a text information sample and a corresponding audio data sample, and the audio data sample is obtained by prerecording the text information sample.
Specifically, the preset speech synthesis model may be a sequence-to-sequence (Seq2Seq) model.
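A minimal sketch of the speech-synthesis flow described above is given below; front_end_model and tts_model are hypothetical interfaces standing in for the preset text front-end prediction model (RNN+CRF) and the preset speech synthesis model (Seq2Seq), and no concrete library API is implied.

    def synthesize_audio(text: str, front_end_model, tts_model) -> bytes:
        # 1) front end: word segmentation, per-word emotional features, prosodic boundaries
        front = front_end_model.predict(text)          # hypothetical interface
        words = front["words"]
        emotions = front["emotions"]                   # one emotional feature per word
        boundaries = front["prosodic_boundaries"]

        # 2) synthesis: convert the text sequence into a speech sequence in time order,
        #    so each time point carries the emotion of the corresponding text
        return tts_model.synthesize(text=words,        # hypothetical interface
                                    emotion=emotions,
                                    prosody=boundaries)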
Further, generating a corresponding virtual object animation based on the text information and the audio data includes: and inputting the text information and the audio data into a preset time sequence mapping model to generate corresponding virtual object animation data.
It should be noted that, if the input information is voice information, the input information may be directly used as audio data. Further, facial animation data may be derived from the audio data, for example, using a speech-based animation synthesis technique. More specifically, the audio data may be converted into a pronunciation unit sequence, the pronunciation unit sequence may be subjected to feature analysis to obtain a corresponding linguistic feature sequence, and the linguistic feature sequence may be input to the preset time sequence mapping model to obtain the facial animation data. The pronunciation unit can be a phoneme, the linguistic feature can be used for representing the pronunciation feature of the pronunciation unit, and the preset time sequence mapping model is constructed based on deep learning technology training and used for mapping the input linguistic feature sequence to corresponding facial animation data.
Further, the preset time-series mapping model may be used to map the input sequence of linguistic features to expression parameters of the virtual object in time series based on the deep learning to generate facial animation data of the corresponding virtual object.
Specifically, converting the audio data into a sequence of pronunciation units may comprise the steps of: converting the audio data into a pronunciation unit and a corresponding time code; and carrying out time alignment operation on the pronunciation units according to the time codes to obtain a pronunciation unit sequence after time alignment. For convenience of description, the present embodiment simply refers to the time-aligned pronunciation unit sequence as the pronunciation unit sequence.
Further, the audio data may be converted into text information, and then the text information may be processed to obtain a pronunciation unit and a corresponding time code.
Specifically, the audio data may be converted into a pronunciation unit and a corresponding time code based on an Automatic Speech Recognition (ASR) technology and a preset pronunciation dictionary.
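One possible (assumed) reading of the time alignment operation is sketched below: each recognized pronunciation unit is repeated so that it covers its duration at a chosen frame rate. The recognize_phonemes callable stands in for the ASR system plus pronunciation dictionary, and the frame rate is an illustrative assumption.

    FRAME_RATE = 30  # animation frames per second (assumed)


    def align_pronunciation_units(audio: bytes, recognize_phonemes) -> list:
        # recognize_phonemes stands in for ASR + pronunciation dictionary and is assumed
        # to return (phoneme, start_time, end_time) triples, with times in seconds
        units = recognize_phonemes(audio)
        aligned = []
        for phoneme, start, end in units:
            n_frames = max(1, round((end - start) * FRAME_RATE))
            aligned.extend([phoneme] * n_frames)   # repeat the unit to cover its duration
        return aligned                             # time-aligned pronunciation unit sequence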
Further, performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence includes: performing feature analysis on each pronunciation unit in the pronunciation unit sequence to obtain the linguistic feature of each pronunciation unit; based on the linguistic features of each pronunciation unit, a corresponding sequence of linguistic features is generated.
Further, performing feature analysis on each pronunciation unit in the sequence of pronunciation units to obtain the linguistic feature of each pronunciation unit may include: for each pronunciation unit, analyzing pronunciation characteristics of the pronunciation unit to obtain independent linguistic characteristics of the pronunciation unit; linguistic features of the pronunciation unit are generated based on the independent linguistic features of the pronunciation unit.
Further, all adjacent pronunciation units of each pronunciation unit may be analyzed over a time window, with the dimensions of the analysis including, but not limited to, how many vowels or consonants are in the left window of the current pronunciation unit, how many front or back nasal sounds are in the right window of the current pronunciation unit, etc. For example, the type of the pronunciation feature and the number of the same kind of pronunciation features of the adjacent pronunciation unit are counted, and the adjacent linguistic feature is obtained according to the counting result.
Further, the quantized statistical features may be used as adjacent linguistic features of the current pronunciation unit.
Further, the adjoining ones of the sound units may include: the pronunciation units are arranged in front of and behind the pronunciation unit in time sequence and have a preset number.
Further, for each pronunciation unit, the independent linguistic features and the adjacent linguistic features of the pronunciation unit are combined to obtain the complete linguistic features of the pronunciation unit.
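For illustration, the sketch below builds a linguistic feature from independent features of the current pronunciation unit plus quantized statistics over its neighbours in a time window; the specific feature categories (vowels, nasal finals) and window size are assumptions, not requirements of the embodiment.

    VOWELS = {"a", "o", "e", "i", "u", "v"}            # assumed vowel set
    NASALS = {"an", "en", "in", "ang", "eng", "ing"}   # assumed nasal finals


    def linguistic_feature(units: list, idx: int, window: int = 3) -> list:
        unit = units[idx]
        independent = [float(unit in VOWELS), float(unit in NASALS)]  # features of the unit itself

        left = units[max(0, idx - window): idx]
        right = units[idx + 1: idx + 1 + window]
        adjacent = [
            float(sum(u in VOWELS for u in left)),     # how many vowels in the left window
            float(sum(u in NASALS for u in right)),    # how many nasal finals in the right window
        ]
        return independent + adjacent                  # complete (concatenated) linguistic feature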
Further, inputting the linguistic feature sequence into the preset time-series mapping model to generate the facial animation data of the corresponding virtual object based on the linguistic feature sequence includes: performing multi-dimensional information extraction on the linguistic feature sequence based on the preset time-series mapping model, where the multiple dimensions include a time dimension and a linguistic feature dimension; and performing feature domain mapping and feature dimension transformation on the multi-dimensional information extraction result based on the preset time-series mapping model to obtain the expression parameters of the virtual object.
The mapping of the feature domain refers to the mapping from a linguistic feature domain to a virtual object facial animation data feature domain, and the facial animation data feature domain of the virtual object at least comprises expression features of the virtual object.
Specifically, since the length of the audio data is not fixed, the variable-length sequence information obtained from the input information (i.e., the linguistic feature sequence) may be processed based on a recurrent neural network (RNN) or one of its variants (e.g., a long short-term memory (LSTM) network) to extract feature information as a whole. Feature mapping models typically involve feature domain conversion and feature dimension transformation; the conversion function may be implemented based on a fully connected network (FCN).
Further, the RNN processes input features in the time dimension. In order to process features in more dimensions and extract higher-dimensional feature information, thereby enhancing the generalization capability of the model, the input information may also be processed based on a convolutional neural network (CNN) and its variants (such as dilated convolution, causal convolution, and the like).
In one embodiment, the preset time-series mapping model may be a convolutional, long short-term memory, deep neural network (CLDNN).
Specifically, the preset time-series mapping model may include: a multilayer convolutional network for receiving the linguistic feature sequence and extracting multi-dimensional information from the linguistic feature sequence.
For example, the multilayer convolutional network may include four dilated convolutional layers that perform multi-dimensional information extraction on the linguistic feature sequence. The linguistic feature sequence may be two-dimensional data; assuming that each pronunciation unit is represented by a pronunciation feature of length 600 and there are 100 pronunciation units in total, the linguistic feature sequence input to the preset time-series mapping model is a 100 × 600 two-dimensional array, where 100 is the time dimension and 600 is the linguistic feature dimension. Accordingly, the multilayer convolutional network performs feature operations in both the time and linguistic feature dimensions.
Further, the preset time-series mapping model may further include: a long short-term memory network for performing information aggregation processing on the information extraction result in the time dimension, so that the features obtained after the convolution processing of the multilayer convolutional network can be considered continuously as a whole in the time dimension.
For example, the long short-term memory network may include two stacked bidirectional LSTM layers coupled to the multilayer convolutional network to obtain the information extraction result output by the multilayer convolutional network in the time dimension of the linguistic feature sequence. The two stacked bidirectional LSTM layers then perform high-dimensional information processing on this information extraction result in the time dimension so as to further obtain feature information in the time dimension.
Further, the preset time-series mapping model may further include: a deep neural network, coupled with the multilayer convolutional network and the long short-term memory network, for performing feature domain mapping and feature dimension transformation on the multi-dimensional information extraction results output by the multilayer convolutional network and the long short-term memory network, so as to obtain the expression parameters of the virtual object.
For example, the deep neural network may receive the information extraction result of the linguistic feature dimension output by the multilayer convolutional network, and may also receive the updated information extraction result of the time dimension output by the long short-term memory network.
The dimension transformation may refer to dimension reduction, and if the input of the preset time sequence mapping model is 600 features, the output is 100 features.
For example, the deep neural network may include: and the full connection layers are connected in series, wherein the first full connection layer is used for receiving the multi-dimensional information extraction result, and the last full connection layer outputs the expression parameters of the virtual object. The number of fully connected layers may be three.
Further, the deep neural network may further include: and the nonlinear transformation modules are respectively coupled between two adjacent full-connection layers except the last full-connection layer, and are used for carrying out nonlinear transformation processing on the output result of the coupled upper full-connection layer and inputting the result of the nonlinear transformation processing into the coupled lower full-connection layer.
The nonlinear transformation module may be a rectified linear unit (ReLU) activation function.
The nonlinear transformation module can improve the expression capability and generalization capability of the preset time sequence mapping model.
In a variation, the multilayer convolutional network, the long short-term memory network and the deep neural network may be connected in series in sequence; in this case, the long short-term memory network passes the information extraction result of the linguistic feature dimension output by the multilayer convolutional network on to the deep neural network, and also processes the information extraction result of the time dimension output by the multilayer convolutional network and transmits the processed result to the deep neural network.
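A PyTorch sketch of one structure consistent with the above description is shown below: four dilated convolution layers, two stacked bidirectional LSTM layers, and three fully connected layers with ReLU between them. The layer widths are assumptions; the 600-dimensional input features, 100 pronunciation units and 100 output expression parameters follow the example given above.

    import torch
    import torch.nn as nn


    class CLDNN(nn.Module):
        """Dilated convolutions + stacked bi-LSTM + fully connected mapping network."""

        def __init__(self, feat_dim=600, hidden=256, out_dim=100):
            super().__init__()
            layers, ch = [], feat_dim
            for d in (1, 2, 4, 8):                       # four dilated conv layers over time
                layers += [nn.Conv1d(ch, 256, kernel_size=3, dilation=d, padding=d), nn.ReLU()]
                ch = 256
            self.convs = nn.Sequential(*layers)
            # two stacked bidirectional LSTM layers aggregate temporal context
            self.lstm = nn.LSTM(256, hidden, num_layers=2, batch_first=True, bidirectional=True)
            # three fully connected layers with ReLU between them map to expression parameters
            self.fc = nn.Sequential(
                nn.Linear(2 * hidden, 256), nn.ReLU(),
                nn.Linear(256, 128), nn.ReLU(),
                nn.Linear(128, out_dim),
            )

        def forward(self, x):                            # x: (batch, T, feat_dim)
            h = self.convs(x.transpose(1, 2))            # -> (batch, 256, T)
            h, _ = self.lstm(h.transpose(1, 2))          # -> (batch, T, 2 * hidden)
            return self.fc(h)                            # -> (batch, T, out_dim)


    model = CLDNN()
    feats = torch.randn(1, 100, 600)                     # 100 pronunciation units x 600-dim features
    expression_params = model(feats)                     # (1, 100, 100) per-frame expression parameters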
In particular, the virtual object may be a virtual digital person.
The facial animation data may include lip animation data, expression animation data, eye animation data, and the like, but is not limited thereto. The specific expression form of the facial animation data may be a digitized vector sequence, for example, each vector in the sequence may include offset information of virtual digital human facial feature points (lip feature points, eye feature points, etc.), and the like. Wherein the audio data and the face animation data are synchronized in time, more specifically, the face animation data has the same time code as the audio data.
Further, when the audio data is output, feedback information may be obtained, and the feedback information may be used to indicate a text-driven instruction corresponding to the audio data being output.
Further, it may be determined whether the next text-driven instruction has a corresponding motion-driven instruction, and if so, the corresponding motion animation data may be determined according to the motion-driven instruction.
Specifically, the corresponding action animation data may be selected and determined from a preset action database according to the action driving instruction. The preset action database includes a plurality of pieces of action animation data, each piece of action animation data is provided with a label, and the label is used for indicating the meaning of the action corresponding to that action animation data. More specifically, according to the action identifier in the action driving instruction, a query may be performed in the action database to obtain the action animation data corresponding to the action driving instruction.
Referring to fig. 2, if a text driving instruction corresponding to audio data being output is "o", it may be determined that a next text driving instruction is "i call", and the text driving instruction has a corresponding action driving instruction ID2, and then corresponding action animation data may be determined according to the action driving instruction corresponding to the action identification ID 2.
It can be understood that, in the process of determining the corresponding motion animation data according to the motion driving instruction corresponding to the motion identifier ID2, the face animation data is also generated according to the text driving instruction "i call" and the corresponding audio data is output. Thus, the motion animation data is semantically aligned with the face animation data and the audio data.
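The look-ahead described above can be sketched as follows, assuming simple dictionaries for the action axis and the preset action database and a hypothetical queue_action callback:

    def on_audio_feedback(current_index: int,
                          actions: dict,            # index -> action identifier (action axis)
                          action_database: dict,    # action identifier -> labelled animation data
                          queue_action) -> None:
        next_index = current_index + 1
        action_id = actions.get(next_index)
        if action_id is None:
            return                                  # next text driving instruction has no action
        animation = action_database[action_id]      # query the preset action database by identifier
        queue_action(next_index, animation)         # schedule it alongside the corresponding speech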
Further, whether the next text driving instruction has a corresponding display driving instruction can be judged, and if yes, the display content corresponding to the display driving instruction is displayed. The display content may be uploaded by a user, and may be, for example, but not limited to, a picture, a text, a table, a video, and the like.
Further, if the next text driving instruction has a corresponding display driving instruction, whether the display driving instruction has a corresponding action driving instruction may be further determined, and if so, the corresponding action animation data may be determined according to the action driving instruction.
For example, if the text-driven instruction corresponding to the audio data being output is "money", the next text-driven instruction "product" has the corresponding display content P2, and the display-driven instruction corresponding to the display content P2 has the corresponding action-driven instruction, so that the display content P2 can be displayed, and the corresponding action animation data can be determined according to the action identifier ID 3.
Further, the action animation data, the facial animation data and the display content may be solved and rendered to obtain the video of the virtual digital person. For example, the processed animation data may be input into a real-time engine (e.g., UE4, Unity, etc.) for solving and rendering to obtain the video of the virtual digital person. More specifically, the solving and rendering can be performed according to preset video parameters to obtain the corresponding video. The video parameters may be preset by the user and may include one or more of the following: video resolution, video frame rate, video format, and the like, but are not limited thereto. The video of the virtual digital person may be two-dimensional or three-dimensional.
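A minimal sketch of the final fusion and rendering step is given below; fuse_by_timecode() is a trivial stand-in, engine.render() is a hypothetical interface to a real-time engine, and the video parameters mirror those listed above.

    VIDEO_PARAMS = {
        "resolution": (1920, 1080),  # assumed defaults; the choice is left to the user
        "frame_rate": 30,
        "format": "mp4",
    }


    def fuse_by_timecode(audio, face_anim, action_anim, displays):
        """Trivial stand-in: group everything under a shared time code (placeholder)."""
        return {"audio": audio, "face": face_anim, "action": action_anim, "display": displays}


    def generate_video(audio, face_anim, action_anim, displays, engine):
        fused = fuse_by_timecode(audio, face_anim, action_anim, displays)
        return engine.render(fused, **VIDEO_PARAMS)  # engine.render is a hypothetical API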
Before step S104 is executed, a virtual digital person may also be acquired, where the virtual digital person may be preset or obtained by using a face-pinching system technology, and a specific manner of acquiring the virtual digital person is not limited in the embodiment of the present invention.
In a specific example, various virtual digital human images can be customized offline according to requirements to form a virtual digital human image library, and a user can select the virtual digital human images meeting the requirements according to the requirements.
In another specific example, object information input by a user may be acquired, the object information describing an avatar of a virtual digital person, and then the virtual digital person may be generated based on the object information. For example, the object information may be data describing appearance characteristics of a virtual digital person, and the object information may be text, voice, or the like.
In yet another specific example, a scene type to which the input information belongs may be determined, for example, the scene type may be a news report scene, a product introduction scene (more specifically, a scene further subdivided according to the product type, for example, a makeup product introduction scene, a men's clothing product introduction scene, an automobile product introduction scene, and the like), a knowledge science popularization scene, and the like. Further, the virtual digital person may be determined according to a scene type to which the input information belongs. More specifically, the virtual digital person corresponding to the scene type may be determined from a preset virtual digital person database according to the scene type to which the input information belongs, or information such as a face, a hair style, clothes, a makeup, and a pose may be determined according to the scene type to which the input information belongs, and the determined information may be fused to generate the virtual digital person.
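As a hedged illustration of scene-based selection, the sketch below looks up a preset virtual digital person by the scene type of the input information; the scene classifier and the preset character identifiers are placeholders.

    PRESET_DIGITAL_HUMANS = {
        "news_report": "anchor_01",
        "product_introduction": "presenter_02",
        "science_popularization": "teacher_03",
    }


    def pick_digital_human(input_text: str, classify_scene) -> str:
        scene = classify_scene(input_text)           # hypothetical scene-type classifier
        return PRESET_DIGITAL_HUMANS.get(scene, "default_character")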
Referring to fig. 3, fig. 3 is a video generating apparatus for a virtual digital person according to an embodiment of the present invention, and the apparatus shown in fig. 3 may include:
an obtaining module 31, configured to obtain input information, where the input information includes input text or input voice;
a text-driven instruction generating module 32, configured to determine a text-driven instruction according to the input information, where the text-driven instruction includes a text;
the action driving instruction generating module 33 is used for generating an action driving instruction corresponding to the text driving instruction according to the semantic meaning of the text in the text driving instruction;
and a video generating module 34, configured to generate a video of the virtual digital person according to audio data, face animation data, and motion animation data, where the audio data and the face animation data are obtained according to the input information, and the motion animation data is obtained according to the motion driving instruction.
In a specific implementation, the video generation apparatus for the virtual digital person may correspond to a chip having a video generation function in a terminal, to a chip module having a video generation function in a terminal, or to the terminal itself.
For further details on the working principle, working modes, and beneficial effects of the video generation apparatus for the virtual digital person shown in fig. 3, reference may be made to the above description of the video generation method for the virtual digital person; the details are not repeated here.
An embodiment of the present invention further provides a storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the video generation method for a virtual digital person described above are performed. The storage medium may include a ROM, a RAM, a magnetic disk, an optical disk, or the like, and may further include a non-volatile memory or a non-transitory memory.
An embodiment of the present invention also provides a terminal including a memory and a processor. The memory stores a computer program capable of running on the processor, and the processor performs the steps of the video generation method for a virtual digital person described above when running the computer program. The terminal includes, but is not limited to, a mobile phone, a computer, a tablet computer, and other terminal devices.
It should be understood that, in the embodiments of the present application, the processor may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It will also be appreciated that the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or the computer program are loaded or executed on a computer, the procedures or functions described in the embodiments of the present application are carried out in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer program may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus, and system may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a division by logical function, and another division manner may be used in actual implementation; various elements or components may be combined or integrated into another system, or some features may be omitted or not implemented. The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit. For a device or product applied to or integrated in a chip, each module/unit it includes may be implemented by hardware such as a circuit, or at least some of the modules/units may be implemented by a software program running on a processor integrated within the chip, with the remaining modules/units (if any) implemented by hardware such as a circuit. For a device or product applied to or integrated in a chip module, each module/unit it includes may be implemented by hardware such as a circuit, and different modules/units may be located in the same component (e.g., a chip or a circuit module) or in different components of the chip module; alternatively, at least some of the modules/units may be implemented by a software program running on a processor integrated within the chip module, with the remaining modules/units (if any) implemented by hardware such as a circuit. For a device or product applied to or integrated in a terminal, each module/unit it includes may be implemented by hardware such as a circuit, and different modules/units may be located in the same component (e.g., a chip or a circuit module) or in different components of the terminal; alternatively, at least some of the modules/units may be implemented by a software program running on a processor integrated in the terminal, with the remaining modules/units (if any) implemented by hardware such as a circuit.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein indicates that the associated objects before and after it are in an "or" relationship.
The "plurality" appearing in the embodiments of the present application means two or more.
The descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent the order or the particular limitation of the number of the devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (12)

1. A method of video generation of a virtual digital person, the method comprising:
acquiring input information, wherein the input information comprises input text and/or input voice;
determining a text-driven instruction according to the input information, wherein the text-driven instruction comprises a text;
generating an action driving instruction corresponding to the text driving instruction according to the semantic meaning of the text in the text driving instruction;
generating a video of the virtual digital person according to audio data, facial animation data, and motion animation data;
wherein the audio data and the facial animation data are obtained according to the text driving instruction, and the motion animation data is obtained according to the action driving instruction;
wherein, before generating the video of the virtual digital person according to the audio data, the facial animation data, and the motion animation data, the method further comprises:
acquiring a second text selected by a user in the text content corresponding to the input information;
acquiring display content input by a user aiming at the second text;
generating a display driving instruction corresponding to a second text driving instruction according to the display content, wherein the second text driving instruction is a text driving instruction containing the second text;
generating an action driving instruction corresponding to the display driving instruction according to the display driving instruction;
wherein, before the action driving instruction is generated according to the semantics of the text in the text driving instruction, if the text driving instruction has a corresponding display driving instruction and an action driving instruction has been generated according to the display driving instruction, the action driving instruction is not generated according to the semantics of the text in the text driving instruction.
2. The method of claim 1, wherein determining text-driven instructions based on the input information comprises:
performing word segmentation processing on the input text to obtain a plurality of texts;
and generating a text driving instruction corresponding to each text according to each text.
3. The method of claim 1, wherein determining text-driven instructions based on the input information comprises:
performing voice recognition on the input voice to obtain text content corresponding to the input voice;
performing word segmentation processing on the text content to obtain a plurality of texts;
and generating a text driving instruction corresponding to each text according to each text.
4. The method of claim 1, wherein before generating the video of the virtual digital person based on audio data, facial animation data, and motion animation data, the method further comprises:
and selecting corresponding motion animation data from a preset action database according to the action identifier in the action driving instruction.
5. The method of claim 1, wherein prior to generating the video of the virtual digital person from the audio data, the facial animation data, and the motion animation data, the method comprises:
acquiring feedback information, wherein the feedback information indicates the text driving instruction corresponding to the audio data currently being output;
and determining, according to the feedback information, whether a next text driving instruction has a corresponding action driving instruction, and if so, determining corresponding motion animation data according to the action driving instruction.
6. The method of claim 1, wherein prior to generating the video of the virtual digital person from the audio data, the facial animation data, and the motion animation data, the method comprises:
acquiring a first text selected by a user in the text content corresponding to the input information;
acquiring an action identifier input by a user aiming at the first text;
and generating an action driving instruction corresponding to a first text driving instruction according to the action identifier input by the user for the first text, wherein the first text driving instruction is a text driving instruction containing the first text.
7. The method of claim 6, wherein prior to generating the video of the virtual digital person from the audio data, the facial animation data, and the motion animation data, the method comprises:
acquiring feedback information, wherein the feedback information indicates the text driving instruction corresponding to the audio data currently being output;
and determining, according to the feedback information, whether a next text driving instruction has a corresponding display driving instruction, and if so, displaying the display content corresponding to the display driving instruction.
8. The method of claim 1, wherein generating the video of the virtual digital person according to the audio data, the facial animation data, and the motion animation data comprises:
performing fusion processing on the audio data, the face animation data and the motion animation data to obtain processed animation data;
and solving and rendering the processed animation data to obtain the video of the virtual digital person.
9. The method of claim 1, wherein before generating the video of the virtual digital person according to the audio data, the facial animation data, and the motion animation data, the method further comprises:
acquiring object information input by a user, wherein the object information describes the image of the virtual digital person; and generating the virtual digital person according to the object information.
10. An apparatus for video generation of a virtual digital person, the apparatus comprising:
an obtaining module, configured to acquire input information, wherein the input information comprises input text or input voice; a text driving instruction generating module, configured to determine a text driving instruction according to the input information, wherein the text driving instruction comprises a text;
an action driving instruction generating module, configured to generate an action driving instruction corresponding to the text driving instruction according to the semantics of the text in the text driving instruction;
and a video generation module, configured to generate the video of the virtual digital person according to audio data, facial animation data, and motion animation data;
wherein the audio data and the facial animation data are obtained according to the text driving instruction, and the motion animation data is obtained according to the action driving instruction;
wherein the action driving instruction generation module is further configured to:
acquiring a second text selected by a user in the text content corresponding to the input information;
acquiring display content input by a user aiming at the second text;
generating a display driving instruction corresponding to a second text driving instruction according to the display content, wherein the second text driving instruction is a text driving instruction containing the second text;
generating an action driving instruction corresponding to the display driving instruction according to the display driving instruction;
wherein, before the action driving instruction is generated according to the semantics of the text in the text driving instruction, if the text driving instruction has a corresponding display driving instruction and an action driving instruction has been generated according to the display driving instruction, the action driving instruction is not generated according to the semantics of the text in the text driving instruction.
11. A storage medium having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, performs the steps of the video generation method of a virtual digital person according to any one of claims 1 to 9.
12. A terminal comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, wherein the processor, when executing the computer program, performs the steps of the video generation method of a virtual digital person according to any of claims 1 to 9.
CN202111674444.XA 2021-12-31 2021-12-31 Video generation method and device for virtual digital person, storage medium and terminal Active CN114401438B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111674444.XA CN114401438B (en) 2021-12-31 2021-12-31 Video generation method and device for virtual digital person, storage medium and terminal
PCT/CN2022/138360 WO2023124933A1 (en) 2021-12-31 2022-12-12 Virtual digital person video generation method and device, storage medium, and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111674444.XA CN114401438B (en) 2021-12-31 2021-12-31 Video generation method and device for virtual digital person, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN114401438A CN114401438A (en) 2022-04-26
CN114401438B true CN114401438B (en) 2022-12-09

Family

ID=81229409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111674444.XA Active CN114401438B (en) 2021-12-31 2021-12-31 Video generation method and device for virtual digital person, storage medium and terminal

Country Status (2)

Country Link
CN (1) CN114401438B (en)
WO (1) WO2023124933A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114401438B (en) * 2021-12-31 2022-12-09 魔珐(上海)信息科技有限公司 Video generation method and device for virtual digital person, storage medium and terminal
CN117011401A (en) * 2022-04-27 2023-11-07 华为云计算技术有限公司 Virtual human video generation method and device
CN117409118A (en) * 2022-07-08 2024-01-16 华为云计算技术有限公司 Digital man driving method, system and equipment
CN116471427B (en) * 2022-09-08 2024-03-29 华院计算技术(上海)股份有限公司 Video generation method and device, computer readable storage medium and computing device
CN116582726B (en) * 2023-07-12 2023-12-01 北京红棉小冰科技有限公司 Video generation method, device, electronic equipment and storage medium
CN117456611B (en) * 2023-12-22 2024-03-29 拓世科技集团有限公司 Virtual character training method and system based on artificial intelligence
CN117541321B (en) * 2024-01-08 2024-04-12 北京烽火万家科技有限公司 Advertisement making and publishing method and system based on virtual digital person
CN117557698B (en) * 2024-01-11 2024-04-26 广州趣丸网络科技有限公司 Digital human limb animation generation method and device, storage medium and computer equipment
CN117828320B (en) * 2024-03-05 2024-05-07 元创者(厦门)数字科技有限公司 Virtual digital person construction method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109118562A (en) * 2018-08-31 2019-01-01 百度在线网络技术(北京)有限公司 Explanation video creating method, device and the terminal of virtual image
CN112162628A (en) * 2020-09-01 2021-01-01 魔珐(上海)信息科技有限公司 Multi-mode interaction method, device and system based on virtual role, storage medium and terminal
CN112184858A (en) * 2020-09-01 2021-01-05 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal
WO2021169431A1 (en) * 2020-02-27 2021-09-02 北京市商汤科技开发有限公司 Interaction method and apparatus, and electronic device and storage medium
CN113570686A (en) * 2021-02-07 2021-10-29 腾讯科技(深圳)有限公司 Virtual video live broadcast processing method and device, storage medium and electronic equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101671900B1 (en) * 2009-05-08 2016-11-03 삼성전자주식회사 System and method for control of object in virtual world and computer-readable recording medium
US11291919B2 (en) * 2017-05-07 2022-04-05 Interlake Research, Llc Development of virtual character in a learning game
CN108933723B (en) * 2017-05-19 2020-11-06 腾讯科技(深圳)有限公司 Message display method and device and terminal
US11169668B2 (en) * 2018-05-16 2021-11-09 Google Llc Selecting an input mode for a virtual assistant
CN110880198A (en) * 2018-09-06 2020-03-13 百度在线网络技术(北京)有限公司 Animation generation method and device
CN112650831A (en) * 2020-12-11 2021-04-13 北京大米科技有限公司 Virtual image generation method and device, storage medium and electronic equipment
CN113256821B (en) * 2021-06-02 2022-02-01 北京世纪好未来教育科技有限公司 Three-dimensional virtual image lip shape generation method and device and electronic equipment
CN113538641A (en) * 2021-07-14 2021-10-22 北京沃东天骏信息技术有限公司 Animation generation method and device, storage medium and electronic equipment
CN114401438B (en) * 2021-12-31 2022-12-09 魔珐(上海)信息科技有限公司 Video generation method and device for virtual digital person, storage medium and terminal

Also Published As

Publication number Publication date
WO2023124933A1 (en) 2023-07-06
CN114401438A (en) 2022-04-26

Similar Documents

Publication Publication Date Title
CN114401438B (en) Video generation method and device for virtual digital person, storage medium and terminal
WO2022048403A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
Wang et al. Mead: A large-scale audio-visual dataset for emotional talking-face generation
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
US20240070397A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
KR102174922B1 (en) Interactive sign language-voice translation apparatus and voice-sign language translation apparatus reflecting user emotion and intention
CN110931042A (en) Simultaneous interpretation method and device, electronic equipment and storage medium
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
CN114550239A (en) Video generation method and device, storage medium and terminal
JP2023155209A (en) video translation platform
CN114842826A (en) Training method of speech synthesis model, speech synthesis method and related equipment
US11176943B2 (en) Voice recognition device, voice recognition method, and computer program product
CN116582726B (en) Video generation method, device, electronic equipment and storage medium
CN116611459B (en) Translation model training method and device, electronic equipment and storage medium
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN111222854B (en) Interview robot-based interview method, interview device, interview equipment and storage medium
Arakane et al. Conformer-based lip-reading for Japanese sentence
KR20230151155A (en) An apparatus for providing avatar speech services and a method for operating it
KR20230151162A (en) An Apparatus and method for generating lip sync avatar face based on emotion analysis in voice
KR20230151157A (en) A method of an avatar speech service providing device using TTS and STF technology based on artificial intelligence neural network learning
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN111160051B (en) Data processing method, device, electronic equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant