CN114187547A - Target video output method and device, storage medium and electronic device - Google Patents

Target video output method and device, storage medium and electronic device

Info

Publication number
CN114187547A
Authority
CN
China
Prior art keywords
audio
features
sample
target
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111474972.0A
Other languages
Chinese (zh)
Inventor
司马华鹏
王建
汪圆
孙雨泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Silicon Intelligence Technology Co Ltd
Original Assignee
Nanjing Silicon Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Silicon Intelligence Technology Co Ltd filed Critical Nanjing Silicon Intelligence Technology Co Ltd
Priority to CN202111474972.0A priority Critical patent/CN114187547A/en
Publication of CN114187547A publication Critical patent/CN114187547A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiments of the present application provide a target video output method and device, a storage medium and an electronic device. The method includes: acquiring a first audio and a first video containing a target person; extracting audio features of the first audio and face features of the target person in the first video, where the face features of the target person are local features in which the area around the mouth is covered by a mask; concatenating the audio features of the first audio with the face features of the target person and inputting the result into a trained neural network model; and outputting, through the neural network model, a target video containing a target virtual character, where the target virtual character corresponds to the target person and the mouth shape of the target virtual character corresponds to the first audio.

Description

Target video output method and device, storage medium and electronic device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for outputting a target video, a storage medium, and an electronic apparatus.
Background
Generating a two-dimensional (2D) avatar generally means producing a corresponding 2D avatar from video data of one or several persons. The 2D avatar is visually close to the real person, and for different Text-To-Speech (TTS) inputs its mouth shape can stay close to that of the real person. Such avatars can be widely applied to tasks involving 2D virtual characters.
At present, the definition achievable by 2D avatar generation schemes in the related art is generally limited and training converges slowly. Moreover, when character appearances differ greatly, for example with or without a beard, the training procedure of the model is not general and has to be repeatedly adapted to the character's features, which directly affects the mouth-shape quality of the finally generated 2D avatar.
No effective solution has been proposed in the related art for the problems of low model-training efficiency and poor definition in 2D virtual character generation schemes.
Disclosure of Invention
The embodiments of the present application provide a target video output method and device, a storage medium and an electronic device, so as to at least solve the problems of low model-training efficiency and poor definition in 2D virtual character generation schemes in the related art.
In an embodiment of the present application, a target video output method is provided, including: acquiring a first audio and a first video containing a target person, where the first audio is speech data converted from text; extracting audio features of the first audio and face features of the target person in the first video, where the face features of the target person are local features in which the area around the mouth is covered by a mask; concatenating the audio features of the first audio with the face features of the target person and inputting the result into a trained neural network model, where the neural network model is a generative adversarial network model trained with sample data, the sample data includes sample video data, the sample video data contains a plurality of person objects, and the neural network model includes a plurality of gate convolution layers and a plurality of dilated gate convolution layers; and outputting, through the neural network model, a target video containing a target virtual character, where the target virtual character corresponds to the target person and the mouth shape of the target virtual character corresponds to the first audio.
In an embodiment of the present application, a target video output device is also provided, including: an acquisition module configured to acquire a first audio and a first video containing a target person, where the first audio is speech data converted from text; an extraction module configured to extract audio features of the first audio and face features of the target person in the first video, where the face features of the target person are local features in which the area around the mouth is covered by a mask; an input module configured to concatenate the audio features of the first audio with the face features of the target person and input the result into a trained neural network model, where the neural network model is a generative adversarial network model trained with sample data, the sample data includes sample video data, the sample video data contains a plurality of person objects, and the neural network model includes a plurality of gate convolution layers and a plurality of dilated gate convolution layers; and an output module configured to output, through the neural network model, a target video containing a target virtual character, where the target virtual character corresponds to the target person and the mouth shape of the target virtual character corresponds to the first audio.
In an embodiment of the present application, a computer-readable storage medium is also proposed, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-described method embodiments when executed.
In an embodiment of the present application, there is further proposed an electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the steps of any of the above method embodiments.
According to the embodiments of the present application, a first audio and a first video containing a target person are acquired; audio features of the first audio and face features of the target person in the first video are extracted, where the face features are local features in which the area around the mouth is covered by a mask; the audio features of the first audio and the face features of the target person are concatenated and input into a trained neural network model; and a target video containing a target virtual character is output through the neural network model, where the target virtual character corresponds to the target person and its mouth shape corresponds to the first audio. This solves the problems of low model-training efficiency and poor definition in related-art 2D virtual character generation schemes. The neural network model is a generative adversarial network model trained with sample data and includes a plurality of gate convolution layers and a plurality of dilated gate convolution layers. Because a gate-convolution generator is used during training, convergence is fast and the ability to learn features is very strong; the model is robust and supports different face features, such as beards and glasses, well; it learns people from different regions well; and the generated digital-human pictures have a definition close to that of the training data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flow chart of an alternative target video output method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an alternative data preprocessing process according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative sample audio feature acquisition process according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an alternative sample face feature acquisition process according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative neural network model training process according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an alternative gate convolution layer training process according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an alternative generator configuration according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an alternative output device for target video according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
As shown in fig. 1, an embodiment of the present application provides a method for outputting a target video, including:
step S102, acquiring a first audio and a first video containing a target character, wherein the first audio is voice data converted according to a text;
step S104, extracting audio features of the first audio and face features of the target person in the first video, wherein the face features of the target person are local features in which the peripheral area of the mouth is covered by a mask;
step S106, concatenating the audio features of the first audio with the face features of the target person and inputting the result into a trained neural network model, wherein the neural network model is a generative adversarial network model trained with sample data, the sample data comprises sample video data, the sample video data contains a plurality of person objects, and the neural network model comprises a plurality of gate convolution layers and a plurality of dilated gate convolution layers;
step S108, outputting a target video containing a target virtual character through the neural network model, wherein the target virtual character corresponds to the target character, and the mouth shape of the target virtual character corresponds to the first audio.
It should be noted that the Generative Adversarial Network (GAN) referred to in the embodiments of the present application is a deep learning model that obtains samples from complex probability distributions by alternately training a Discriminator and a Generator so that they compete with each other. The neural network model includes a plurality of gate convolution layers and a plurality of dilated gate convolution layers, which may be arranged in the generator part. A generator using gate convolutions converges quickly, the dilated gate convolutions provide a larger receptive field, and the ability to learn features is very strong; the model is robust and supports different face features, such as beards and glasses, well; it learns people from different regions well; and the generated digital-human pictures have a definition close to that of the training data.
The target video generated in the embodiments of the present application may in general be a 2D digital person (equivalent to the virtual character above), that is, a corresponding 2D digital person generated from video data of one or several persons; the 2D digital person is visually close to the real person, and its mouth shape can stay close to that of the real person for different TTS inputs. In application, the first audio is therefore TTS speech data, the first video contains the trained target person, and the first video may be video data containing audio or a video picture without audio.
In one embodiment, extracting the facial features of the target person in the first video may be implemented by:
step S1, detecting a face image of the target person in the first video and cropping the face image;
step S2, setting a mask over the mouth-peripheral area of the cropped face image, where the mouth-peripheral area includes the region below the eyes and above the chin;
step S3, extracting local features from the masked face image to obtain the face features of the target person.
The mouth-peripheral region generally refers to the mouth area; it may be the combined region of the mouth and the nose, or the combined region of the mouth, nose and chin. It generally does not include the eyes and cheeks, may include the chin, and the cheeks around the mouth may also be included for training, which is not limited in the embodiments of the present application.
In an embodiment, before the audio features of the first audio and the face features of the target person are spliced and input into the trained neural network model, the method further includes training the neural network model, where the training process is as follows:
preprocessing sample video data to obtain sample audio features and sample face features;
a neural network model is trained using the sample audio features and the sample face features.
FIG. 2 is a schematic diagram of an alternative data preprocessing process according to an embodiment of the present application. As shown in FIG. 2, a data preprocessing module processes the video material to generate training data. The preprocessing mainly consists of audio feature extraction and face feature extraction, where face feature extraction includes face cropping and mask setting; the audio features and the face features are then concatenated (concat). For the sample video data, this preprocessing corresponds to extracting the sample audio features and the sample face features.
In one embodiment, pre-processing the sample data comprises:
extracting sample audio data in the sample video data;
extracting Mel acoustic features of the sample audio data;
filtering out silent data in the Mel acoustic features to obtain filtered sample audio features;
and extracting sample audio features corresponding to each frame of image of the sample video data according to the set sliding window.
FIG. 3 is a schematic diagram of an alternative sample audio feature acquisition process according to an embodiment of the present application. As shown in FIG. 3, because the digital person needs to be driven by TTS and the mouth shape must be synchronized with the TTS audio, the sample audio features are an important input of the neural network model. Sample audio feature extraction first extracts the audio data from the sample video and then extracts MFCC features using a common audio library such as librosa or soundfile. MFCC stands for Mel-Frequency Cepstral Coefficients, whose computation has two key steps: conversion to the mel frequency scale followed by cepstral analysis. To avoid interference from silent parts of the corpus, most of the silent data needs to be filtered out. The sample audio features corresponding to each video frame are then extracted with a set sliding window, ensuring that the sample audio features stay synchronized with the specific video frames. Finally, the sample audio features are normalized.
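For illustration, this audio preprocessing can be sketched with librosa as follows; the sample rate, number of MFCC coefficients, hop length, video frame rate, window width and silence threshold below are assumptions, since the embodiment does not fix them.

import numpy as np
import librosa

def extract_audio_features(audio_path, fps=25, sr=16000, n_mfcc=13, win=16):
    y, sr = librosa.load(audio_path, sr=sr)

    # Filter out most of the silent data to avoid interference from silent parts.
    intervals = librosa.effects.split(y, top_db=30)
    if len(intervals):
        y = np.concatenate([y[s:e] for s, e in intervals])

    # Mel-frequency cepstral coefficients: mel-scale conversion + cepstral analysis.
    hop = sr // fps                      # one MFCC column per video frame (assumed fps)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)

    # Sliding window: each video frame gets the MFCC columns around its timestamp,
    # keeping the audio features synchronized with specific video frames.
    feats = []
    for t in range(mfcc.shape[1]):
        lo, hi = max(0, t - win // 2), min(mfcc.shape[1], t + win // 2)
        w = np.zeros((n_mfcc, win), dtype=np.float32)
        w[:, :hi - lo] = mfcc[:, lo:hi]
        feats.append(w)
    feats = np.stack(feats)

    # Normalize the sample audio features.
    return (feats - feats.mean()) / (feats.std() + 1e-8)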
In one embodiment, pre-processing the sample data comprises:
carrying out face detection on each frame of image of the sample video data;
cropping the detected face image and then setting a mask over the mouth-peripheral area of the face image, where the mouth-peripheral area includes the region below the eyes and above the chin;
and normalizing the masked face image to obtain the sample face features.
FIG. 4 is a schematic diagram of an alternative sample face feature acquisition process according to an embodiment of the present application. As shown in FIG. 4, the effect of an audio-driven digital person lies mainly in the facial details, so besides the audio, synchronized face features are input during training. First, ffmpeg is used to split the corpus video into frames and dlib is used for face detection; the detected face image is cropped, a mask is set over the mouth-peripheral area, and the masked face image is normalized, for example to a size of 256 x 256, to obtain the sample face features. Because the neural network model must learn whether the facial details corresponding to the audio features, such as the mouth shape, are correct, all parts other than some invariant features (such as the eyes, forehead and neck), and especially the mouth and its surroundings, need to be covered by a mask in the input, indicating that these parts are to be generated by the network.
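For illustration, the face preprocessing can be sketched with dlib and OpenCV as follows; the proportions of the covering mask (roughly the band below the eyes and above the chin) and the 4-channel layout of masked face plus mask channel are assumptions, not details given in the description.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()

def extract_masked_face(frame_bgr, size=256):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    rects = detector(gray, 1)
    if not rects:
        return None
    r = rects[0]
    face = frame_bgr[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
    face = cv2.resize(face, (size, size)).astype(np.float32) / 255.0   # normalize

    # Covering mask over the mouth-peripheral area (assumed band of the crop).
    mask = np.ones((size, size, 1), dtype=np.float32)
    mask[int(size * 0.45):int(size * 0.95), int(size * 0.15):int(size * 0.85)] = 0.0
    face = face * mask          # masked pixels are the parts the network must generate

    # Masked face (3 channels) plus the mask as an extra channel: one possible layout
    # of the 4-channel input mentioned later in the description (assumption).
    return np.concatenate([face, mask], axis=-1)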
In one embodiment, training a neural network model using sample audio features and sample facial features includes:
concatenating the sample audio features with the sample face features and inputting the result into a generator of the neural network model, where the generator includes n gate convolution layers and m dilated gate convolution layers, n and m are integers greater than 1, each gate convolution layer includes a first sub-convolution layer and a second sub-convolution layer, and the feature map size of each dilated gate convolution layer remains unchanged;
outputting an estimated video frame image through the generator, where the estimated video frame contains the face image of a virtual character;
determining a mouth loss and a global loss of the estimated video frame image through a discriminator, and adjusting the training parameters of the neural network model according to the mouth loss and the global loss, where the mouth loss represents the difference between the mouth-shape image in the estimated video frame image and the ground truth, and the global loss represents the difference between the whole estimated video frame image and the ground truth.
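For illustration, one generator update using these two losses might look like the following PyTorch sketch; the loss weight, the mouth-region crop coordinates and the exact adversarial formulation are assumptions, and the discriminator is assumed to return a realness score map for the whole frame.

import torch
import torch.nn.functional as F

def generator_step(generator, discriminator, opt_g, x, real_frame,
                   mouth_box=(288, 480, 128, 384), w_l1=10.0):
    t, b, l, r = mouth_box                      # assumed mouth crop in the 512x512 output
    fake = generator(x)                         # estimated video frame image

    # Mouth loss: L1 between the generated mouth region and the ground truth.
    mouth_l1 = F.l1_loss(fake[:, :, t:b, l:r], real_frame[:, :, t:b, l:r])

    # Global loss: the discriminator scores the whole frame; the generator tries
    # to raise that realness score.
    global_adv = -discriminator(fake).mean()

    loss = global_adv + w_l1 * mouth_l1
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()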
In one embodiment, outputting, by a generator, a predicted video frame comprises:
convolving the input of each gate convolution layer with the first sub-convolution layer and the second sub-convolution layer respectively to obtain a first sub-convolution value and a second sub-convolution value;
and activating the first sub-convolution value with an activation function and multiplying it by the second sub-convolution value to obtain the output of the current gate convolution layer.
FIG. 5 is a schematic diagram of an alternative neural network model training process according to an embodiment of the present application. As shown in FIG. 5, during training the processed sample audio features and the masked face pictures (the sample face features) are fed into the GAN network together. The encoder + decoder part in FIG. 5 is the generator and the discriminator part is the discriminator network: the generator is used to generate pictures as close as possible to the video frames in the original corpus, the discriminator judges whether the generator's output is real or fake, and through continuous adversarial learning the generator's output approaches the original video frames more and more closely.
It should be noted that the generator uses gate convolutions. Because the input contains a mask, pixels inside and outside the mask play different roles: pixels in the masked area can be regarded as invalid, or as having lower weight, while a conventional convolution treats every pixel as a valid value and cannot distinguish them effectively. A gate convolution can learn the semantic difference of each pixel at its spatial position.
FIG. 6 is a schematic diagram of an alternative gate convolution layer training process according to an embodiment of the present application. As shown in FIG. 6, a gate convolution layer contains two ordinary convolution layers (corresponding to the first sub-convolution layer and the second sub-convolution layer) with the same kernel and window size but unshared weights; the output of the first sub-convolution layer is activated by a sigmoid function and then multiplied by the convolution result of the second sub-convolution layer. Since the second sub-convolution layer has no activation function, the probability that the derivative of this part is 0 or close to 0 is relatively small, which reduces the probability of vanishing gradients to some extent when the network is deep. The relationship between the output (Y) and the input (X) of each gate convolution layer can be expressed by the following formula:
Y = σ(Conv2d_1(X)) × Conv2d_2(X)
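For illustration, a gate convolution layer implementing this relationship can be written in PyTorch as follows; the class name GateConv2d is ours.

import torch
import torch.nn as nn

class GateConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel, stride=1, padding=0, dilation=1):
        super().__init__()
        # Two convolutions with identical kernel and window size but unshared weights.
        self.gate = nn.Conv2d(in_ch, out_ch, kernel, stride, padding, dilation)     # first sub-conv
        self.feature = nn.Conv2d(in_ch, out_ch, kernel, stride, padding, dilation)  # second sub-conv

    def forward(self, x):
        # Y = sigmoid(Conv2d_1(X)) * Conv2d_2(X); the second branch has no activation.
        return torch.sigmoid(self.gate(x)) * self.feature(x)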
fig. 7 is a schematic diagram of an alternative generator structure according to an embodiment of the present application, as shown in fig. 7, in an alternative example, the input is formed by splicing audio features and human face features, the shape is 256 × 4, and the output is 512 × 3. The convolution rates of the expansion gate convolutions are 2, 4, 8, 16, respectively. In other examples, the number of gate convolution layers and the number of expansion gate convolution layers, as well as the convolution rate may be set according to actual requirements, which is not limited in this embodiment of the present application. The expansion gate convolution layer is used, so that the receptive field of the model can be enlarged, and the expansion gate convolution layer is used in the high layer, so that a larger area and more characteristics can be captured under the condition of not increasing parameters.
The generator workflow shown in FIG. 7 is as follows:
1) The first layer is the input layer. The input data is formed by concatenating the audio features, the cropped face and the mask, with size 256 x 256 x 4.
2) The second layer is a gate convolution layer. Its parameters follow the format [input channels, output channels, kernel size, stride, padding, dilation rate], and the subsequent layers use the same format. The parameters of this layer are [3, 64, 4, 2, 1, 1] and the output feature map is 128 x 128; this layer mainly downsamples the picture quickly and increases the number of channels to speed up learning.
3) The third layer is a gate convolution layer with parameters [64, 256, 3, 1, 1, 1] and an output feature map of 128 x 128; it increases the number of channels and further learns picture characteristics.
4) The fourth layer is a gate convolution layer with parameters [256, 256, 4, 2, 1, 1] and an output feature map of 64 x 64; the number of channels stays unchanged while the larger kernel promotes information fusion between channels.
5) The fifth layer is a gate convolution layer with parameters [256, 256, 3, 1, 1, 1], further deepening the network.
6) The sixth layer is a dilated gate convolution layer with parameters [256, 256, 3, 1, 2, 2] and a dilation rate of 2; it enlarges the receptive field while keeping the output feature map unchanged.
7) The seventh layer is a dilated gate convolution layer with parameters [256, 256, 3, 1, 4, 4] and a dilation rate of 4; it enlarges the receptive field while keeping the output feature map unchanged.
8) The eighth layer is a dilated gate convolution layer with parameters [256, 256, 3, 1, 8, 8] and a dilation rate of 8; it enlarges the receptive field while keeping the output feature map unchanged.
9) The ninth layer is a dilated gate convolution layer with parameters [256, 256, 3, 1, 16, 16] and a dilation rate of 16; it enlarges the receptive field while keeping the output feature map unchanged.
10) The tenth layer is a gate convolution layer with parameters [256, 256, 3, 1, 1, 1]; the output size is unchanged and the information from the different receptive fields above is further fused.
11) The eleventh layer is a gate convolution layer with parameters [256, 256, 3, 1, 1, 1]; the output size is unchanged and information is fused further.
12) The twelfth layer is a transposed gate convolution layer, from which the decoder starts, with parameters [256, 128, 3, 1, 1, 1] and an output size of 128 x 128; it compresses the number of output channels and fuses more information into the feature map.
13) The thirteenth layer is a transposed gate convolution layer with parameters [128, 64, 3, 1, 1, 1] and an output size of 256 x 256, further reducing the output channels.
14) The fourteenth layer is a transposed gate convolution layer with parameters [64, 3, 7, 1, 3, 1], an output size of 512 x 512 x 3 and 3 output channels, i.e., it outputs the generated image.
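For illustration, the encoder/decoder stack described above can be sketched in PyTorch as follows, reusing the GateConv2d layer from the earlier sketch. The parameter lists [input channels, output channels, kernel, stride, padding, dilation] are taken from the description; treating each transposed gate convolution as 2x nearest-neighbor upsampling followed by a gate convolution, and the final Tanh activation, are assumptions made so that the stated output sizes (128 x 128, 256 x 256, 512 x 512 x 3) are produced.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        g = GateConv2d
        self.encoder = nn.Sequential(
            g(3, 64, 4, 2, 1, 1),        # layer 2: 256x256 -> 128x128 (input channels per the published list)
            g(64, 256, 3, 1, 1, 1),      # layer 3: 128x128
            g(256, 256, 4, 2, 1, 1),     # layer 4: 128x128 -> 64x64
            g(256, 256, 3, 1, 1, 1),     # layer 5
            g(256, 256, 3, 1, 2, 2),     # layer 6: dilated, rate 2
            g(256, 256, 3, 1, 4, 4),     # layer 7: dilated, rate 4
            g(256, 256, 3, 1, 8, 8),     # layer 8: dilated, rate 8
            g(256, 256, 3, 1, 16, 16),   # layer 9: dilated, rate 16
            g(256, 256, 3, 1, 1, 1),     # layer 10: fuse the different receptive fields
            g(256, 256, 3, 1, 1, 1),     # layer 11
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), g(256, 128, 3, 1, 1, 1),  # layer 12: -> 128x128
            nn.Upsample(scale_factor=2), g(128, 64, 3, 1, 1, 1),   # layer 13: -> 256x256
            nn.Upsample(scale_factor=2), g(64, 3, 7, 1, 3, 1),     # layer 14: -> 512x512x3
            nn.Tanh(),                                             # assumed output activation
        )

    def forward(self, x):   # x: (N, C, 256, 256) spliced audio/face/mask input
        return self.decoder(self.encoder(x))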
It should be noted that a mouth L1 loss (the mouth loss mentioned above) and a global discriminator loss (the global loss mentioned above) are newly added in the neural network model training process.
Specifically, the mouth L1 loss is added mainly so that the mouth shape expresses different audio well; in comparisons, the model trained with this L1 loss achieves good results on different TTS inputs.
The original discriminator is a patch discriminator, whose idea is to map the input to an N x N matrix in which each position corresponds to a small patch of the original image and represents the probability that this patch is a real sample, and finally to take the average. By calculation, the receptive field corresponding to each position of the matrix is 70 x 70, so this discriminator lacks an indicator of overall authenticity. The global loss is therefore introduced: it maps the input to a single real number, i.e. the probability that the input sample is a real sample.
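For illustration only, the difference between the patch discriminator and the added global branch can be sketched as follows; sharing a trunk between the two heads, and the trunk depth and channel widths, are assumptions rather than details from the description.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        def block(i, o):
            return nn.Sequential(nn.Conv2d(i, o, 4, 2, 1), nn.LeakyReLU(0.2))
        self.trunk = nn.Sequential(block(in_ch, 64), block(64, 128), block(128, 256))
        self.patch_head = nn.Conv2d(256, 1, 4, 1, 1)   # N x N matrix of per-patch realness scores
        self.global_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 1))  # single realness value

    def forward(self, x):
        h = self.trunk(x)
        patch_score = self.patch_head(h).mean(dim=(1, 2, 3))  # average the patch matrix
        global_score = self.global_head(h).squeeze(1)         # whole-image realness score
        return patch_score, global_score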
Adding the mouth L1 loss also makes training more stable. Especially when there is more data, more training rounds are needed to reach good definition and mouth shapes, and this method has a clear advantage there. Without it, after about 13 training rounds the G_GAN loss gradually increases and the training eventually diverges.
Comparison of effects (lip-sync F1-score):

                      Every 10 frames    Every 100 frames
Before optimization   85%                91%
After optimization    90%                94%

Note: the values are the average F1-score with a sliding window of every 10 frames and every 100 frames; larger values are better.
After the neural network model is trained, in actual use various TTS inputs and a silence template consistent with the face features of the trained person are input; the goal is to make the mouth shape of the generated video consistent with the TTS and correctly opened and closed, i.e. to match the person's speaking habits. This stage mainly uses the generator, and the preprocessing is kept consistent with training, including TTS audio feature extraction, face cropping of the silence template and mask setting; finally, each generated frame and the corresponding audio are combined into a video. The silence template may be video data containing audio or a video picture without audio; after it is input into the generator, any audio data in the template is simply not used and is filtered out directly.
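For illustration, combining the generated frames with the TTS audio might be done as in the following sketch, where the generator is assumed to be a callable returning a 512 x 512 x 3 frame with values in [0, 1]; the temporary file name, frame rate and ffmpeg codec options are assumptions.

import subprocess
import cv2
import numpy as np

def render_target_video(generator, spliced_inputs, tts_wav, out_mp4, fps=25, size=512):
    writer = cv2.VideoWriter("frames_tmp.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (size, size))
    for x in spliced_inputs:                  # per-frame spliced audio/face/mask input
        frame = np.asarray(generator(x))      # assumed: HxWx3 BGR frame in [0, 1]
        writer.write((frame * 255).clip(0, 255).astype(np.uint8))
    writer.release()

    # Combine each generated frame with the corresponding TTS audio into one video.
    subprocess.run(["ffmpeg", "-y", "-i", "frames_tmp.mp4", "-i", tts_wav,
                    "-c:v", "copy", "-c:a", "aac", "-shortest", out_mp4], check=True)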
With the target video output method and the neural network model training process provided by the embodiments of the present application, a 2D digital person can be trained by extracting the audio features and face data of a real person from that person's historical corpus. The resulting model not only looks vividly like the real person, but its speaking manner and mouth shapes are also close to the original, which amounts to cloning the person. Combined with a large-screen multimedia terminal or a mobile terminal device, it can be applied to various video scenarios, such as video broadcasting and venue guide explanations.
In the neural network model training method described above, a generator built from gate convolution layers is used during training, so convergence is fast and the ability to learn features is very strong; the model is robust and supports different face features, such as beards and glasses, well; it learns people from different regions well; and the effect is good, with the generated digital-human pictures having a definition close to that of the training data. After the L1 loss is added, training is more stable and the mouth shape is basically correct for different test audio.
It should be understood that although the steps in the flow charts of FIGS. 1-6 are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 1-6 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
As shown in fig. 8, according to another embodiment of the present application, there is further provided an output apparatus of a target video, configured to implement the method described in any one of the method embodiments above, which has already been described and is not repeated here, where the apparatus includes:
an obtaining module 802 configured to obtain a first audio and a first video including a target character, wherein the first audio is voice data converted according to a text;
an extraction module 804, configured to extract audio features of the first audio and face features of the target person in the first video, where the face features of the target person are local features in which the peripheral area of the mouth is covered by a mask;
an input module 806, configured to concatenate the audio features of the first audio with the face features of the target person and input the result into a trained neural network model, where the neural network model is a generative adversarial network model trained with sample data, the sample data includes sample video data, the sample video data contains a plurality of person objects, and the neural network model includes a plurality of gate convolution layers and a plurality of dilated gate convolution layers;
an output module 808, configured to output a target video including a target virtual character through the neural network model, where the target virtual character corresponds to the target person and the mouth shape of the target virtual character corresponds to the first audio.
For specific limitations of the output device of the target video, reference may be made to the above limitations of the output method of the target video, which are not described herein again. The respective modules in the above-described output apparatus of the target video may be entirely or partially implemented by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
According to still another aspect of the embodiments of the present application, there is also provided an electronic device for implementing the output method of the target video, where the electronic device may be applied to, but not limited to, a server. As shown in fig. 9, the electronic device comprises a memory 902 and a processor 904, the memory 902 having a computer program stored therein, the processor 904 being arranged to perform the steps of any of the above-described method embodiments by means of the computer program.
Optionally, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
step S1, acquiring a first audio and a first video including a target character, wherein the first audio is voice data converted from a text;
step S2, extracting audio features of the first audio and face features of the target person in the first video, wherein the face features of the target person are local features in which the peripheral area of the mouth is covered by a mask;
step S3, concatenating the audio features of the first audio with the face features of the target person and inputting the result into a trained neural network model, wherein the neural network model is a generative adversarial network model trained with sample data, the sample data comprises sample video data, the sample video data comprises a plurality of person objects, and the neural network model comprises a plurality of gate convolution layers and a plurality of dilated gate convolution layers;
step S4, outputting a target video including a target virtual character through the neural network model, wherein the target virtual character corresponds to the target character, and the mouth shape of the target virtual character corresponds to the first audio.
Alternatively, it can be understood by those skilled in the art that the structure shown in fig. 9 is only an illustration, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 9 is a diagram illustrating a structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in FIG. 9, or have a different configuration than shown in FIG. 9.
The memory 902 may be configured to store software programs and modules, such as the program instructions/modules corresponding to the target video output method and device in the embodiments of the present application; the processor 904 runs the software programs and modules stored in the memory 902 to execute various functional applications and data processing, thereby implementing the target video output method described above. The memory 902 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory or other non-volatile solid-state memory. In some examples, the memory 902 may further include memory located remotely from the processor 904, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 902 may be, but is not limited to being, used to store the program steps of the target video output method.
Optionally, the transmitting device 906 is used for receiving or sending data via a network. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 906 includes a Network adapter (NIC) that can be connected to a router via a Network cable and other Network devices to communicate with the internet or a local area Network. In one example, the transmission device 906 is a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In addition, the electronic device further includes: a display 908 for displaying an output process of the target video; and a connection bus 910 for connecting the respective module parts in the above-described electronic apparatus.
Embodiments of the present application further provide a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
step S1, acquiring a first audio and a first video including a target character, wherein the first audio is voice data converted from a text;
step S2, extracting audio features of the first audio and face features of the target person in the first video, wherein the face features of the target person are local features in which the peripheral area of the mouth is covered by a mask;
step S3, concatenating the audio features of the first audio with the face features of the target person and inputting the result into a trained neural network model, wherein the neural network model is a generative adversarial network model trained with sample data, the sample data comprises sample video data, the sample video data comprises a plurality of person objects, and the neural network model comprises a plurality of gate convolution layers and a plurality of dilated gate convolution layers;
step S4, outputting a target video including a target virtual character through the neural network model, wherein the target virtual character corresponds to the target character, and the mouth shape of the target virtual character corresponds to the first audio.
Optionally, the storage medium is further configured to store a computer program for executing the steps included in the method in the foregoing embodiment, which is not described in detail in this embodiment.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to execute all or part of the steps of the method described in the embodiments of the present application.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. An output method of a target video, comprising:
acquiring a first audio and a first video containing a target character, wherein the first audio is voice data converted according to a text;
extracting audio features of the first audio and face features of the target person in the first video, wherein the face features of the target person are local features in which the peripheral area of the mouth is covered by a mask;
concatenating the audio features of the first audio with the face features of the target person and inputting the result into a trained neural network model, wherein the neural network model is a generative adversarial network model trained with sample data, the sample data comprises sample video data, the sample video data comprises a plurality of person objects, and the neural network model comprises a plurality of gate convolution layers and a plurality of dilated gate convolution layers;
outputting a target video including a target virtual character through the neural network model, wherein the target virtual character corresponds to the target character, and the mouth shape of the target virtual character corresponds to the first audio.
2. The method of claim 1, wherein the extracting the facial features of the target person in the first video comprises:
detecting a face image of the target person in the first video, and cropping the face image;
setting a mask over the mouth-peripheral region of the cropped face image, wherein the mouth-peripheral region comprises the regions below the eyes and above the chin;
and extracting local features from the masked face image to obtain the face features of the target person.
3. The method of claim 1, wherein before concatenating the audio features of the first audio with the face features of the target person and inputting the result into the trained neural network model, the method further comprises:
preprocessing the sample video data to obtain sample audio features and sample face features;
training the neural network model using the sample audio features and the sample facial features.
4. The method of claim 3, wherein said pre-processing said sample data comprises:
extracting sample audio data in the sample video data;
extracting Mel acoustic features of the sample audio data;
filtering out silent data in the Mel acoustic features to obtain the filtered sample audio features;
and extracting the sample audio features corresponding to each frame of image of the sample video data according to the set sliding window.
5. The method of claim 3, wherein said pre-processing said sample data comprises:
performing face detection on each frame of image of the sample video data;
after the detected face image is cut, a covering mask is arranged in the mouth peripheral area of the face image, wherein the mouth peripheral area comprises areas below eyes and above the chin;
and carrying out normalization processing on the face image with the covering mask to obtain the face characteristics of the sample.
6. The method of claim 3, wherein the training the neural network model using the sample audio features and the sample facial features comprises:
concatenating the sample audio features with the sample face features and inputting the result into a generator of the neural network model, wherein the generator comprises n gate convolution layers and m dilated gate convolution layers, n and m are integers greater than 1, each gate convolution layer comprises a first sub-convolution layer and a second sub-convolution layer, and the feature map size of each dilated gate convolution layer remains unchanged;
outputting an estimated video frame image through the generator, wherein the estimated video frame comprises a face image of a virtual character;
determining a mouth loss and a global loss of the estimated video frame image through a discriminator, and adjusting training parameters of the neural network model according to the mouth loss and the global loss, wherein the mouth loss represents the difference between the mouth-shape image in the estimated video frame image and the ground truth, and the global loss represents the difference between the whole estimated video frame image and the ground truth.
7. The method of claim 6, wherein outputting, by the generator, the predicted video frame comprises:
the input of each layer of the gate convolution layer is convolved with the first sub convolution layer and the second sub convolution layer respectively to obtain a first sub convolution value and a second sub convolution value;
and after the first sub convolution value is activated through an activation function, multiplying the first sub convolution value by the second sub convolution value to obtain the output of the current gate convolution layer.
8. An output apparatus of a target video, comprising:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is configured to acquire a first audio and a first video containing a target character, wherein the first audio is voice data converted according to a text;
an extraction module configured to extract audio features of the first audio and face features of the target person in the first video, wherein the face features of the target person are local features in which the peripheral area of the mouth is covered by a mask;
an input module configured to concatenate the audio features of the first audio with the face features of the target person and input the result into a trained neural network model, wherein the neural network model is a generative adversarial network model trained with sample data, the sample data comprises sample video data, the sample video data comprises a plurality of person objects, and the neural network model comprises a plurality of gate convolution layers and a plurality of dilated gate convolution layers;
an output module configured to output a target video including a target virtual character through the neural network model, wherein the target virtual character corresponds to the target character, and a mouth shape of the target virtual character corresponds to the first audio.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 7 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 7.
CN202111474972.0A 2021-12-03 2021-12-03 Target video output method and device, storage medium and electronic device Pending CN114187547A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111474972.0A CN114187547A (en) 2021-12-03 2021-12-03 Target video output method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111474972.0A CN114187547A (en) 2021-12-03 2021-12-03 Target video output method and device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN114187547A true CN114187547A (en) 2022-03-15

Family

ID=80542368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111474972.0A Pending CN114187547A (en) 2021-12-03 2021-12-03 Target video output method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN114187547A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS
CN112712472A (en) * 2019-10-25 2021-04-27 北京三星通信技术研究有限公司 Image processing method, image processing device, electronic equipment and computer readable storage medium
CN111370020A (en) * 2020-02-04 2020-07-03 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN113194348A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human lecture video generation method, system, device and storage medium
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium
CN113378697A (en) * 2021-06-08 2021-09-10 安徽大学 Method and device for generating speaking face video based on convolutional neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KINGSLAYER_: "Attention系列一之seq2seq传统Attention小结", pages 1 - 12, Retrieved from the Internet <URL:"https://blog.csdn.net/qq_33278884/article/details/89313428"> *
PAPERWEEKLY: "基于CNN的阅读理解式问答模型:DGCNN", pages 1 - 9, Retrieved from the Internet <URL:"https://blog.csdn.net/c9Yv2cf9I06K2A9E/article/details/79972020"> *
SHYVANA1: "Dilate Gated Convolutional Neural Network", pages 1 - 4, Retrieved from the Internet <URL:"https://blog.csdn.net/shyvana1/article/details/115345826"> *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116470A (en) * 2022-06-10 2022-09-27 腾讯科技(深圳)有限公司 Audio processing method and device, computer equipment and storage medium
CN115278293A (en) * 2022-06-15 2022-11-01 平安科技(深圳)有限公司 Virtual anchor generation method and device, storage medium and computer equipment
CN115134655A (en) * 2022-06-28 2022-09-30 中国平安人寿保险股份有限公司 Video generation method and device, electronic equipment and computer readable storage medium
CN115526772A (en) * 2022-06-28 2022-12-27 北京瑞莱智慧科技有限公司 Video processing method, device, equipment and storage medium
CN115134655B (en) * 2022-06-28 2023-08-11 中国平安人寿保险股份有限公司 Video generation method and device, electronic equipment and computer readable storage medium
CN115526772B (en) * 2022-06-28 2023-09-15 北京生数科技有限公司 Video processing method, device, equipment and storage medium
CN115050083A (en) * 2022-08-15 2022-09-13 南京硅基智能科技有限公司 Mouth shape correcting model, training of model and application method of model
US11887403B1 (en) 2022-08-15 2024-01-30 Nanjing Silicon Intelligence Technology Co., Ltd. Mouth shape correction model, and model training and application method
CN117014675A (en) * 2022-09-16 2023-11-07 腾讯科技(深圳)有限公司 Video generation method, device and computer readable storage medium for virtual object
CN116013354A (en) * 2023-03-24 2023-04-25 北京百度网讯科技有限公司 Training method of deep learning model and method for controlling mouth shape change of virtual image
CN116206008A (en) * 2023-05-06 2023-06-02 南京硅基智能科技有限公司 Method and device for outputting mouth shape image and audio driving mouth shape network model

Similar Documents

Publication Publication Date Title
CN114187547A (en) Target video output method and device, storage medium and electronic device
WO2021082941A1 (en) Video figure recognition method and apparatus, and storage medium and electronic device
CN110446000B (en) Method and device for generating dialogue figure image
CN113378697A (en) Method and device for generating speaking face video based on convolutional neural network
US20220392224A1 (en) Data processing method and apparatus, device, and readable storage medium
CN113516990B (en) Voice enhancement method, neural network training method and related equipment
CN109660744A (en) The double recording methods of intelligence, equipment, storage medium and device based on big data
CN104170374A (en) Modifying an appearance of a participant during a video conference
CN103024530A (en) Intelligent television voice response system and method
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
US20220375190A1 (en) Device and method for generating speech video
EP4207195A1 (en) Speech separation method, electronic device, chip and computer-readable storage medium
CN113555032A (en) Multi-speaker scene recognition and network training method and device
WO2022252372A1 (en) Image processing method, apparatus and device, and computer-readable storage medium
CN117523051B (en) Method, device, equipment and storage medium for generating dynamic image based on audio
CN113299312A (en) Image generation method, device, equipment and storage medium
CN110619334A (en) Portrait segmentation method based on deep learning, architecture and related device
CN116977463A (en) Image processing method, device, computer equipment, storage medium and product
CN115578512A (en) Method, device and equipment for training and using generation model of voice broadcast video
US20050204286A1 (en) Speech receiving device and viseme extraction method and apparatus
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
CN116152938A (en) Method, device and equipment for training identity recognition model and transferring electronic resources
CN117689752A (en) Literary work illustration generation method, device, equipment and storage medium
CN117440114A (en) Virtual image video generation method, device, equipment and medium
CN117115312A (en) Voice-driven facial animation method, device, equipment and medium

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220315)