CN116206008A - Method and device for outputting mouth shape image and audio driving mouth shape network model

Method and device for outputting mouth shape image and audio driving mouth shape network model

Info

Publication number
CN116206008A
Authority
CN
China
Prior art keywords
audio
image
feature
sample
mouth shape
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310500594.1A
Other languages
Chinese (zh)
Inventor
司马华鹏
廖铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Silicon Intelligence Technology Co Ltd
Original Assignee
Nanjing Silicon Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Silicon Intelligence Technology Co Ltd filed Critical Nanjing Silicon Intelligence Technology Co Ltd
Priority to CN202310500594.1A priority Critical patent/CN116206008A/en
Publication of CN116206008A publication Critical patent/CN116206008A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 - 2D [Two Dimensional] image generation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiments of the present application provide a method and a device for outputting a mouth shape image, and an audio driving mouth shape network model. The method includes: acquiring a first audio feature and a first image feature; inputting the first audio feature and the first image feature into the trained audio driving mouth shape network model; splicing the first audio feature and the first image feature and then sequentially performing convolutional encoding and deep learning to obtain a second image feature; converting the second image feature into a target image feature with a higher resolution through convolution and recombination among multiple channels based on feature extraction and sub-pixel convolution; and performing deconvolution coding on the target image feature to output a mouth shape image of the target object corresponding to the first audio feature.

Description

Method and device for outputting mouth shape image and audio driving mouth shape network model
Technical Field
The application relates to the technical field of data processing, in particular to a method and a device for outputting a mouth shape image and an audio driving mouth shape network model.
Background
Audio-driven mouth shape research is an important topic in the field of natural human-computer interaction. In an audio-driven mouth shape process, recorded speech or TTS-synthesized speech is preprocessed so that a target character changes its mouth shape in accordance with the audio, and a mouth shape video corresponding to the audio is then synthesized. Generally, audio-driven mouth shapes are implemented by a pre-trained neural network model: videos containing a user's mouth shape changes are used as training samples so that, during training, the model learns the relationship between the mouth shape changes in the video and the audio, and the trained model can then drive the mouth shape from audio. The driven object may be an avatar, such as a digital human, or the user's own image.
At present, research on voice-driven mouth shapes mainly focuses on optimizing visual quality. To achieve high-precision results, most voice-driven mouth shape technologies neglect the running speed of actual products in use and must be deployed on large-scale equipment (such as servers). Today, when mobile terminal devices are widely popular, many voice-driven mouth shape solutions cannot be used on a user's personal terminal device because of these running-speed limitations.
For the technical problem in the related art that voice-driven mouth shape methods run slowly and cannot be used on most terminal equipment, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the present application provide a method and a device for outputting a mouth shape image, and an audio driving mouth shape network model, so as to at least solve the technical problem that voice-driven mouth shape methods in the related art run slowly and cannot be used on most terminal equipment.
In one embodiment of the present application, there is provided a method for outputting a mouth shape image, including: acquiring a first audio feature and a first image feature, wherein the first audio feature is a voice data feature of a first object, and the first image feature is an image sequence containing a mouth image of a target object; inputting the first audio feature and the first image feature into a trained audio driving mouth shape network model, wherein the audio driving mouth shape network model is obtained by training an initial audio driving mouth shape network model with sample video data and sample mouth shape images; splicing the first audio feature and the first image feature and then sequentially performing convolutional encoding and deep learning to obtain a second image feature; converting the second image feature with a first resolution into a target image feature with a second resolution through convolution and recombination among multiple channels based on feature extraction and sub-pixel convolution, wherein the first resolution is smaller than the second resolution; and performing deconvolution coding on the target image feature and outputting a mouth shape image of the target object corresponding to the first audio feature.
In an embodiment, the acquiring of the first audio feature includes: extracting the first audio feature from recorded first audio data, or acquiring second audio data in real time by using a sound receiving device and extracting the first audio feature from the second audio data. The acquiring of the first image feature includes: extracting the first image feature from a video or a picture uploaded by the target object, or acquiring a video or a picture from a database and extracting the first image feature containing the target object, wherein the target object includes: an end user, a real character other than the end user, a virtual character, or any cartoon character.
In an embodiment, the converting of the second image feature with the first resolution into the target image feature with the second resolution includes: inputting the second image feature into a pixel reorganization module and obtaining feature maps of n² channels through convolution, wherein the size of each feature map is consistent with the second image feature, n is an up-sampling factor, the pixel reorganization module includes a plurality of convolution layers and n² channels, and the weight of each channel is optimized during training; and arranging, through a convolution operation and periodic shuffling, the first-resolution feature maps of the n² channels onto one large map to obtain the target image feature with the second resolution.
In an embodiment, before inputting the first audio feature and the first image feature into the trained audio driving mouth shape network model, the method further includes: training the initial audio driving mouth shape network model by using the sample video data and the sample mouth shape images to obtain the audio driving mouth shape network model.
In an embodiment, the training of the initial audio driving mouth shape network model by using the sample video data and the sample mouth shape images to obtain the audio driving mouth shape network model includes: obtaining initial video data containing the target object, processing the initial video data into sample audio data without images and sample image data without audio, taking the image containing the real mouth shape that corresponds to each frame of the sample audio data as a label, extracting sample audio features from the sample audio data, and converting the sample image data into a sample image sequence arranged in time order; splicing the sample audio features and the sample image sequence to obtain sample splicing data, wherein each frame of sample image corresponds to the sample audio features under that image frame; performing convolutional encoding and then deep learning on the sample splicing data, inputting the result into a pixel reorganization module for resolution enlargement, and outputting, after deconvolution coding, a predicted mouth shape picture corresponding to the sample audio features; and adjusting parameters of the initial audio driving mouth shape network model according to the error between the predicted mouth shape picture and the label.
In an embodiment, the adjusting of the parameters of the initial audio driving mouth shape network model according to the error between the predicted mouth shape picture and the label includes: calculating the overall error between the predicted mouth shape picture and the label by using an L1 loss function, and calculating the local error between the mouth region of the predicted mouth shape picture and the mouth region of the label by using a mouth loss function, wherein the sum of the weight of the overall error and the weight of the local error is 1.
In an embodiment, the training of the initial audio driving mouth shape network model by using the sample video data and the sample mouth shape images to obtain the audio driving mouth shape network model includes: training the initial audio driving mouth shape network model by using first sample data to obtain a pre-training model, wherein the first sample data includes general sample data of non-specific users; and training the pre-training model by using second sample data to obtain the audio driving mouth shape network model, wherein the second sample data includes personalized data of a specific user.
According to another embodiment of the present application, there is also provided an audio driving mouth shape network model, including: a splicing module configured to splice a first audio feature and a first image feature to obtain spliced data, wherein the first audio feature is a voice data feature of a first object, and the first image feature is an image sequence containing a mouth image of a target object; a convolutional coding module configured to perform convolutional coding on the spliced data to obtain coded data; a residual module configured to perform deep learning on the coded data to obtain a second image feature; a pixel reorganization module configured to enlarge the second image feature, in which, based on feature extraction and sub-pixel convolution, the second image feature with a first resolution is enlarged into a target image feature with a second resolution through convolution and reorganization among multiple channels, wherein the first resolution is smaller than the second resolution; and a deconvolution coding module configured to perform deconvolution coding on the target image feature and output a mouth shape image of the target object corresponding to the first audio feature.
In an embodiment, the pixel reorganization module is further configured to: receive the second image feature and obtain feature maps of n² channels through convolution, wherein the size of each feature map is consistent with the second image feature, n is an up-sampling factor, the pixel reorganization module includes a plurality of convolution layers and n² channels, and the weight of each channel is optimized during training; and arrange, through a convolution operation and periodic shuffling, the n² first-resolution feature maps onto one large map to obtain the target image feature with the second resolution.
According to another embodiment of the present application, there is also provided an output apparatus of a mouth shape image, including: an acquisition module configured to acquire a first audio feature and a first image feature, wherein the first audio feature is a voice data feature of a first object, and the first image feature is an image sequence containing a mouth image of a target object; an input module configured to input the first audio feature and the first image feature into a trained audio driving mouth shape network model, wherein the audio driving mouth shape network model is obtained by training an initial audio driving mouth shape network model with sample video data and sample mouth shape images; a processing module configured to splice the first audio feature and the first image feature and then sequentially perform convolutional encoding and deep learning to obtain a second image feature; a conversion module configured to convert the second image feature with a first resolution into a target image feature with a second resolution, wherein the first resolution is smaller than the second resolution; and an output module configured to perform deconvolution coding on the target image feature and output a mouth shape image of the target object corresponding to the first audio feature.
In an embodiment of the present application, a computer-readable storage medium is also presented, in which a computer program is stored, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
In an embodiment of the application, there is also proposed an electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the method and the device for outputting a mouth shape image and the audio driving mouth shape network model provided by the embodiments of the present application, after the first audio feature and the first image feature are acquired, they are input into the trained audio driving mouth shape network model, spliced, and then sequentially subjected to convolutional encoding and deep learning to obtain a second image feature; the second image feature is converted into a target image feature with a higher resolution through convolution and recombination among multiple channels based on feature extraction and sub-pixel convolution; and the target image feature is deconvolution-coded to output a mouth shape image of the target object corresponding to the first audio feature. This solves the technical problem in the related art that voice-driven mouth shape methods run too slowly to be used on most terminal equipment: with a lightweight model structure, the second image feature is converted into a higher-resolution target image feature through convolution and recombination among multiple channels without increasing the number of parameters, and the target image feature is deconvolution-coded to output the mouth shape image of the target object corresponding to the first audio feature, so the method runs faster while maintaining high resolution and can be used effectively on most terminal equipment.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of an alternative method of outputting a mouth shape image according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an alternative audio-driven oral network model architecture according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative upsampling method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Fig. 1 is a flowchart of an alternative method for outputting a mouth shape image according to an embodiment of the present application, and as shown in fig. 1, the embodiment of the present application provides a method for outputting a mouth shape image, including:
Step S102, acquiring a first audio feature and a first image feature, wherein the first audio feature is a voice data feature of a first object, and the first image feature is an image sequence containing a mouth image of a target object;
Step S104, inputting a first audio feature and a first image feature into a trained audio driving mouth shape network model, wherein the audio driving mouth shape network model is a model obtained by training an initial audio driving mouth shape network model by using sample video data and sample mouth shape images;
Step S106, after the first audio feature and the first image feature are spliced, convolutional encoding and deep learning are sequentially carried out, so that a second image feature is obtained;
Step S108, converting second image features with first resolution into target image features with second resolution through convolution and recombination among multiple channels based on the feature extraction and the sub-pixel convolution, wherein the first resolution is smaller than the second resolution;
step S110, carrying out deconvolution coding on the target image features, and outputting a mouth shape image of the target object corresponding to the first audio features.
It should be noted that the first object and the target object may or may not be the same person: the first object may be any user, and the target object may be another person or a virtual person. The audio may be any person speaking, or audio generated by Text To Speech (TTS).
In an embodiment, the acquiring of the first audio feature in step S102 may be implemented as follows: extracting the first audio feature from recorded first audio data, or acquiring second audio data in real time by using a sound receiving device and extracting the first audio feature from the second audio data. The acquiring of the first image feature in step S102 may be implemented as follows: extracting the first image feature from a video or a picture uploaded by the target object, or acquiring a video or a picture from a database and extracting the first image feature containing the target object, wherein the target object includes: an end user, a real character other than the end user, a virtual character, or any cartoon character.
When the method for outputting a mouth shape image is used, the first audio feature may be extracted from audio that the user has recorded and uploaded in advance, or from the user's real-time speech.
In an embodiment, the converting of the second image feature with the first resolution into the target image feature with the second resolution includes: inputting the second image feature into a pixel reorganization module and obtaining feature maps of n² channels through convolution, wherein the size of each feature map is consistent with the second image feature, n is an up-sampling factor, the pixel reorganization module includes a plurality of convolution layers and n² channels, and the weight of each channel is optimized during training; and arranging, through a convolution operation and periodic shuffling, the first-resolution feature maps of the n² channels onto one large map to obtain the target image feature with the second resolution.
In an embodiment, before inputting the first audio feature and the first image feature into the trained audio driving mouth shape network model, the method further includes: training the initial audio driving mouth shape network model by using the sample video data and the sample mouth shape images to obtain the audio driving mouth shape network model.
In an embodiment, the training of the initial audio driving mouth shape network model by using the sample video data and the sample mouth shape images to obtain the audio driving mouth shape network model includes: obtaining initial video data containing the target object, processing the initial video data into sample audio data without images and sample image data without audio, taking the image containing the real mouth shape that corresponds to each frame of the sample audio data as a label, extracting sample audio features from the sample audio data, and converting the sample image data into a sample image sequence arranged in time order; splicing the sample audio features and the sample image sequence to obtain sample splicing data, wherein each frame of sample image corresponds to the sample audio features under that image frame; performing convolutional encoding and then deep learning on the sample splicing data, inputting the result into the pixel reorganization module for resolution enlargement, and outputting, after deconvolution coding, a predicted mouth shape picture corresponding to the sample audio features; and adjusting parameters of the initial audio driving mouth shape network model according to the error between the predicted mouth shape picture and the label.
It should be noted that, during training, the embodiments of the present application collect videos of the user speaking as the original training data. Each collected video is divided into two parts: an image part (multiple frames of consecutive images from the speaking video, without audio) and an audio part (the user audio corresponding to the speaking video). For the image part, the speaking video is beautified with a beautifying tool; after the effect is confirmed with the user, the video is preprocessed, including beautification, generation of a silent video of a preset duration, and processing of the silent video into sample image data, so as to produce the final image data and audio data used for model training and for video synthesis after training.
For the audio part, audio features need to be extracted. Considering the diversity of running speeds and sound sources, the embodiments of the present application adopt a generalized audio feature extraction approach; specifically, normalized MFCC features (i.e., MFCC features that are normalized in advance) are selected as the speech features. The advantage is that normalized MFCCs are fast to compute and do not consume excessive resources, and, after training on a large amount of data, they generalize well to different languages, different sound receiving devices, and different speakers.
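For ease of understanding only, the following is a minimal sketch of such normalized MFCC extraction, assuming the librosa library in Python; the sampling rate, the 13-coefficient setting, and the window and hop sizes are illustrative assumptions and are not specified by the present application.

```python
import librosa
import numpy as np

def extract_normalized_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Return per-coefficient normalized MFCC frames with shape (frames, n_mfcc)."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160)  # roughly 10 ms hop at 16 kHz
    mfcc = mfcc.T                                           # (frames, n_mfcc)
    # "Normalized MFCC": zero mean and unit variance per coefficient
    return (mfcc - mfcc.mean(axis=0)) / (mfcc.std(axis=0) + 1e-8)
```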
After the preprocessing of the user speaking video and the feature extraction of the audio are completed, the images and the audio features can be input into the audio driving mouth shape network model of the embodiments of the present application. The audio driving mouth shape network model involved in the embodiments of the present application is a custom lightweight network structure, intended to significantly improve the running speed when the actual product is in use. Fig. 2 is a schematic structural diagram of an alternative audio driving mouth shape network model according to an embodiment of the present application. As shown in Fig. 2, the network adopts a custom deep learning design and forms the final overall architecture based on the idea of lightweight network structures. The network inputs are audio features (corresponding to the sample audio features) and images (corresponding to the sample image sequence), and the network output is the picture of the corresponding mouth shape. The audio features and the images are spliced, convolutionally encoded, and input into a residual network module, and the output of the residual network is sent into a pixel reorganization module, which raises the resolution with a simple computation. The principle is that feature maps of n² channels (each feature map having the same size as the input low-resolution image) are obtained through convolution, and the high-resolution image is then obtained by periodic shuffling, where n is the upscaling factor, i.e., the enlargement ratio of the image. Finally, the enlarged picture is sent into deconvolution coding to output the picture of the mouth shape corresponding to the audio. Taking n=2 as an example, the pixel reorganization module takes an input picture at a resolution of 256 and outputs a picture at a resolution of 512, which greatly reduces the amount of computation and improves the running speed of the model, so that the mouth shape can be driven in real time while the picture remains in high definition. Periodic shuffling refers to arranging multiple sub-feature maps periodically onto one large map; for example, a 4×4 image is convolved to obtain n² (here 4) feature maps of size 4×4, which a reshape operation rearranges into an 8×8 high-resolution image.
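For ease of understanding only, the following is a minimal PyTorch sketch of a network with the structure just described (splicing, convolutional encoding, residual blocks, pixel reorganization, deconvolution decoding); the layer counts, channel widths, activation choices and tensor shapes are illustrative assumptions and do not limit the present application.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """3x3 convolutional residual block with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class AudioDrivenMouthNet(nn.Module):
    """Sketch: audio features and a mouth-region image are spliced channel-wise,
    encoded, refined by residual blocks, upscaled by pixel reorganization (n=2),
    then decoded by a transposed convolution."""
    def __init__(self, audio_ch=1, image_ch=3, width=64, n=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(audio_ch + image_ch, width, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.residuals = nn.Sequential(*[ResidualBlock(width) for _ in range(4)])
        # n*n times the channels are produced by convolution, then rearranged spatially.
        self.upsample = nn.Sequential(
            nn.Conv2d(width, width * n * n, 3, padding=1),
            nn.PixelShuffle(n),
        )
        self.decoder = nn.ConvTranspose2d(width, image_ch, 3, padding=1)

    def forward(self, audio_feat, image):
        # audio_feat and image are assumed to be spatially aligned, e.g. 256x256.
        x = torch.cat([audio_feat, image], dim=1)   # splice along channels
        x = self.encoder(x)
        x = self.residuals(x)
        x = self.upsample(x)                        # 256x256 -> 512x512
        return torch.sigmoid(self.decoder(x))
```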
The residual network involved in the embodiments of the present application is easy to optimize and can improve accuracy by adding considerable depth. Its internal residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
The pixel reorganization module (for example, a PixelShuffle module) in the embodiments of the present application uses an upsampling method that can effectively enlarge a reduced feature map, and can realize upscaling in place of interpolation or deconvolution. The main function of the pixel reorganization module is to obtain a high-resolution feature map from a low-resolution feature map through convolution and reorganization among multiple channels. When converting a low-resolution input into a high-resolution output, feature maps are expanded based on feature extraction and sub-pixel convolution, converting them from the low-resolution space to the high-resolution space. Fig. 3 is a schematic diagram of an alternative upsampling method according to an embodiment of the present application. As shown in Fig. 3, each original low-resolution pixel is divided into r×r smaller cells, and the cells are filled, according to a fixed rule, with the values at the corresponding positions of the r² feature maps. The reorganization is completed by filling the cells of every low-resolution pixel according to the same rule. In this process, the model can adjust the weights of the r² shuffle channels to continuously optimize the generated result. Compared with conventional network structures for increasing resolution, PixelShuffle introduces no additional parameters, has a small parameter count and a high running speed, and avoids the artifacts that interpolation and deconvolution may produce, so the output is sharper.
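As an illustration of this rearrangement, the following sketch assumes PyTorch's pixel_shuffle operator and shows how r² = 4 feature maps of size 4×4 are interleaved into a single 8×8 map for r = 2; the tensor values are arbitrary.

```python
import torch
import torch.nn.functional as F

r = 2                                    # upscaling factor
low_res = torch.arange(4 * 4 * 4, dtype=torch.float32).reshape(1, r * r, 4, 4)

high_res = F.pixel_shuffle(low_res, r)   # periodic shuffling of the r*r channels
print(high_res.shape)                    # torch.Size([1, 1, 8, 8])

# Each 2x2 output cell is filled from the same spatial position of the 4 channels:
# high_res[0, 0, 0, 0] == low_res[0, 0, 0, 0]
# high_res[0, 0, 0, 1] == low_res[0, 1, 0, 0]
# high_res[0, 0, 1, 0] == low_res[0, 2, 0, 0]
# high_res[0, 0, 1, 1] == low_res[0, 3, 0, 0]
```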
In an embodiment, the adjusting of the parameters of the initial audio driving mouth shape network model according to the error between the predicted mouth shape picture and the label includes: calculating the overall error between the predicted mouth shape picture and the label by using an L1 loss function, and calculating the local error between the mouth region of the predicted mouth shape picture and the mouth region of the label by using a mouth loss function, wherein the sum of the weight of the overall error and the weight of the local error is 1.
An L1 loss function is adopted in the training process. To reproduce the original mouth shape, the L1 error between the real picture and the predicted picture is calculated. To make the prediction more accurate and stable, a second-order error can be added, namely the L1 loss between the frame-to-frame difference of the generated pictures and the frame-to-frame difference of the real pictures. In addition, to better fit the details of the mouth shape and the mouth area, the weight of the mouth-area loss is increased, with the overall error weight and the mouth error weight set to a one-to-one ratio, so that a sharper picture is generated. The specific loss function is:
loss = w_1 × L1(whole) + w_2 × L1(mouth);
w_1 : w_2 = 1 : 1;
where w_1 is the weight of the L1 loss over the whole picture and w_2 is the weight of the L1 loss over the mouth region.
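A minimal sketch of such a loss is given below, assuming PyTorch and assuming the mouth region is supplied as a rectangular crop; the 0.5/0.5 weights (a 1:1 ratio summing to 1) and the optional second-order term follow the description above, while the function names and the crop format are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mouth_shape_loss(pred, real, mouth_box, w1=0.5, w2=0.5):
    """L1 over the whole picture plus L1 over the mouth region, weighted 1:1.

    pred, real: tensors of shape (B, C, H, W); mouth_box: (top, bottom, left, right).
    """
    t, b, l, r = mouth_box
    whole = F.l1_loss(pred, real)
    mouth = F.l1_loss(pred[:, :, t:b, l:r], real[:, :, t:b, l:r])
    return w1 * whole + w2 * mouth

def temporal_loss(pred_seq, real_seq):
    """Optional second-order term: L1 between frame-to-frame differences."""
    return F.l1_loss(pred_seq[1:] - pred_seq[:-1], real_seq[1:] - real_seq[:-1])
```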
In an embodiment, the training of the initial audio driving mouth shape network model by using the sample video data and the sample mouth shape images to obtain the audio driving mouth shape network model includes: training the initial audio driving mouth shape network model by using first sample data to obtain a pre-training model, wherein the first sample data includes general sample data of non-specific users; and training the pre-training model by using second sample data to obtain the audio driving mouth shape network model, wherein the second sample data includes personalized data of a specific user.
It should be noted that a pre-training technique may be used in the training process; for example, the model is pre-trained for 20 rounds on a large dataset of 5 million (500w) samples and is then fine-tuned on each user's data. This keeps the cost and speed of training under control while achieving good results and generalization.
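For ease of understanding, a minimal sketch of this two-stage schedule is given below, assuming PyTorch; the optimizer, learning rates and the number of fine-tuning epochs are illustrative assumptions, while the 20 pre-training rounds follow the description above.

```python
import torch

def pretrain_then_finetune(model, general_loader, user_loader, loss_fn, device="cpu"):
    model.to(device)

    # Stage 1: pre-train on general, non-specific-user data (20 rounds per the text).
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(20):
        for audio_feat, image, target in general_loader:
            opt.zero_grad()
            pred = model(audio_feat.to(device), image.to(device))
            loss_fn(pred, target.to(device)).backward()
            opt.step()

    # Stage 2: fine-tune on one user's personalized data with a smaller learning rate.
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for epoch in range(5):
        for audio_feat, image, target in user_loader:
            opt.zero_grad()
            pred = model(audio_feat.to(device), image.to(device))
            loss_fn(pred, target.to(device)).backward()
            opt.step()
    return model
```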
According to another embodiment of the present application, there is also provided an audio driving mouth shape network model, as shown in Fig. 2, including: a splicing module configured to splice a first audio feature and a first image feature to obtain spliced data, wherein the first audio feature is a voice data feature of a first object, and the first image feature is an image sequence containing a mouth image of a target object; a convolutional coding module configured to perform convolutional coding on the spliced data to obtain coded data; a residual module configured to perform deep learning on the coded data to obtain a second image feature; a pixel reorganization module configured to enlarge the second image feature, in which, based on feature extraction and sub-pixel convolution, the second image feature with a first resolution is enlarged into a target image feature with a second resolution through convolution and reorganization among multiple channels, wherein the first resolution is smaller than the second resolution; and a deconvolution coding module configured to perform deconvolution coding on the target image feature and output a mouth shape image of the target object corresponding to the first audio feature.
In an embodiment, the pixel reorganization module is further configured to: receive the second image feature and obtain feature maps of n² channels through convolution, wherein the size of each feature map is consistent with the second image feature, n is an up-sampling factor, the pixel reorganization module includes a plurality of convolution layers and n² channels, and the weight of each channel is optimized during training; and arrange, through a convolution operation and periodic shuffling, the n² first-resolution feature maps onto one large map to obtain the target image feature with the second resolution.
According to another embodiment of the present application, there is also provided an output apparatus of a mouth shape image, including: an acquisition module configured to acquire a first audio feature and a first image feature, wherein the first audio feature is a voice data feature of a first object, and the first image feature is an image sequence containing a mouth image of a target object; an input module configured to input the first audio feature and the first image feature into a trained audio driving mouth shape network model, wherein the audio driving mouth shape network model is obtained by training an initial audio driving mouth shape network model with sample video data and sample mouth shape images; a processing module configured to splice the first audio feature and the first image feature and then sequentially perform convolutional encoding and deep learning to obtain a second image feature; a conversion module configured to convert the second image feature with a first resolution into a target image feature with a second resolution, wherein the first resolution is smaller than the second resolution; and an output module configured to perform deconvolution coding on the target image feature and output a mouth shape image of the target object corresponding to the first audio feature.
In an exemplary embodiment of the present application, taking actual use on a mobile phone terminal as an example, a user shoots a short speaking video on the mobile phone, beautifies it and confirms the effect, and then uploads the video to the cloud, where the audio driving mouth shape network model is trained automatically. After training, the user can use it on the mobile phone: the mobile phone terminal program performs streaming sound pickup, the audio equipment obtains the audio, and the audio is input, together with a prefabricated image sequence (the images may be generated from the video uploaded by the user, or be images of other characters or virtual characters), into the pre-trained audio driving mouth shape network model, so that the final mouth-driven frames are obtained in real time and displayed on the mobile phone. With the method and the network model structure provided by the embodiments of the present application, high-definition speaking video can be generated in real time from audio on terminal equipment with limited computing power, which greatly lowers the threshold for using this technology while maintaining its effect. The method and the network model structure take low-definition pictures as input and output high-definition pictures, greatly reducing the amount of computation and improving the running speed of the model, so that the mouth shape can be driven in real time while the picture remains in high definition. The pre-training technique used in the method and the network model structure improves the training speed of the user model and the generalization across usage scenarios: speech in different languages, from different sound receiving devices and from different speakers can be accepted, TTS is supported, and the method is suitable for various application scenarios.
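For ease of understanding only, a minimal sketch of this streaming use is given below, assuming PyTorch; the helper audio_to_feature_map and the other names are illustrative assumptions rather than an interface defined by the present application, and the model stands in for the trained audio driving mouth shape network model.

```python
import torch

def run_streaming(model, mic_chunks, image_sequence, audio_to_feature_map):
    """Yield mouth-driven frames in real time from streamed audio chunks.

    mic_chunks: iterable of short audio chunks from the phone's sound pickup;
    image_sequence: prefabricated mouth-region image tensors, one per frame;
    audio_to_feature_map: hypothetical helper that turns one chunk of normalized
    MFCC frames into per-frame feature maps aligned with the image size.
    """
    model.eval()
    frame_idx = 0
    with torch.no_grad():
        for chunk in mic_chunks:
            for audio_map in audio_to_feature_map(chunk):     # one map per video frame
                image = image_sequence[frame_idx % len(image_sequence)]
                frame = model(audio_map.unsqueeze(0), image.unsqueeze(0))
                frame_idx += 1
                yield frame.squeeze(0)                        # display this frame
```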
According to still another aspect of the embodiments of the present application, there is further provided an electronic device for implementing the method for outputting a mouth shape image, which may be applied to, but not limited to, a server. As shown in fig. 4, the electronic device comprises a memory 402 and a processor 404, the memory 402 having stored therein a computer program, the processor 404 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.
Alternatively, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
S1, acquiring a first audio feature and a first image feature, wherein the first audio feature is a voice data feature of a first object, and the first image feature is an image sequence containing a mouth image of a target object;
S2, inputting the first audio features and the first image features into a trained audio driving mouth shape network model, wherein the audio driving mouth shape network model is a model obtained by training an initial audio driving mouth shape network model by using sample video data and sample mouth shape images;
S3, splicing the first audio features and the first image features, and then sequentially performing convolutional encoding and deep learning to obtain second image features;
S4, converting second image features with first resolution into target image features with second resolution through convolution and recombination among multiple channels based on the feature extraction and the sub-pixel convolution, wherein the first resolution is smaller than the second resolution;
S5, carrying out deconvolution coding on the target image features, and outputting a mouth shape image of the target object corresponding to the first audio features.
Alternatively, it will be understood by those skilled in the art that the structure shown in Fig. 4 is only schematic, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID) or a PAD. Fig. 4 does not limit the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces) than shown in Fig. 4, or have a configuration different from that shown in Fig. 4.
The memory 402 may be used to store software programs and modules, such as the program instructions/modules corresponding to the method and apparatus for outputting a mouth shape image in the embodiments of the present application, and the processor 404 executes the software programs and modules stored in the memory 402, thereby running various functional applications and performing data processing, that is, implementing the method for outputting a mouth shape image described above. The memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 402 may further include memory located remotely from the processor 404, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. In particular, but not exclusively, the memory 402 may store the program steps of the method for outputting a mouth shape image.
Optionally, the transmission device 406 is used to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission means 406 includes a network adapter (Network Interface Controller, NIC) that can be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 406 is a Radio Frequency (RF) module for communicating with the internet wirelessly.
In addition, the electronic device further includes: a display 408 for displaying an output process of the mouth shape image; and a connection bus 410 for connecting the respective module parts in the above-described electronic device.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:
S1, acquiring a first audio feature and a first image feature, wherein the first audio feature is a voice data feature of a first object, and the first image feature is an image sequence containing a mouth image of a target object;
S2, inputting the first audio features and the first image features into a trained audio driving mouth shape network model, wherein the audio driving mouth shape network model is a model obtained by training an initial audio driving mouth shape network model by using sample video data and sample mouth shape images;
S3, splicing the first audio features and the first image features, and then sequentially performing convolutional encoding and deep learning to obtain second image features;
S4, converting second image features with first resolution into target image features with second resolution through convolution and recombination among multiple channels based on the feature extraction and the sub-pixel convolution, wherein the first resolution is smaller than the second resolution;
S5, carrying out deconvolution coding on the target image features, and outputting a mouth shape image of the target object corresponding to the first audio features.
The real-time audio driving mouth shape system in the embodiments of the present application focuses on the back-end algorithm; its implementation involves no server for computation or any execution body other than the terminal equipment, and only uses equipment commonly used in the field, such as video acquisition equipment, audio receiving equipment and video display equipment, in the information acquisition and display stages.
The foregoing embodiment numbers of the present application are merely for description and do not represent the relative merits of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present application.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The apparatus embodiments described above are merely exemplary; the division into units is merely a logical functional division, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be realized through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application, and it should be noted that modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application, and such modifications and adaptations are intended to fall within the scope of the present application.

Claims (12)

1. A method of outputting a mouth shape image, comprising:
acquiring a first audio feature and a first image feature, wherein the first audio feature is a voice data feature of a first object, and the first image feature is an image sequence containing a mouth image of a target object;
Inputting the first audio feature and the first image feature into a trained audio driving mouth shape network model, wherein the audio driving mouth shape network model is a model obtained by training an initial audio driving mouth shape network model by using sample video data and sample mouth shape images;
splicing the first audio feature and the first image feature and then sequentially performing convolutional encoding and deep learning to obtain a second image feature;
converting a second image feature with a first resolution into a target image feature with a second resolution through convolution and recombination among multiple channels based on a feature extraction and subpixel convolution method, wherein the first resolution is smaller than the second resolution;
and carrying out deconvolution coding on the target image features, and outputting a mouth shape image of the target object corresponding to the first audio features.
2. The method according to claim 1, wherein
the acquiring the first audio feature comprises:
extracting the first audio feature from recorded first audio data, or
acquiring second audio data in real time by using a sound receiving device, and extracting the first audio feature from the second audio data;
the acquiring the first image feature comprises:
extracting the first image feature from a video or a picture uploaded by the target object, or
obtaining a video or a picture from a database, and extracting the first image feature containing the target object, wherein the target object comprises: an end user, a real character other than the end user, a virtual character, and any cartoon character.
3. The method of claim 1, wherein converting the second image feature of the first resolution to the target image feature of the second resolution comprises:
inputting the second image feature into a pixel reorganization module, and obtaining feature maps of n² channels through convolution, wherein the size of each feature map is consistent with the second image feature, n is an up-sampling factor, the pixel reorganization module comprises a plurality of convolution layers and n² channels, and the weight of each channel is optimized in the training process;
and arranging, through a convolution operation and periodic shuffling, the first-resolution feature maps of the n² channels onto one large map to obtain the target image feature with the second resolution.
4. The method according to claim 1, wherein before inputting the first audio feature and the first image feature into the trained audio driving mouth shape network model, the method further comprises:
Training the initial audio driving mouth shape network model by using the sample video data and the sample mouth shape image to obtain the audio driving mouth shape network model.
5. The method according to claim 4, wherein the training the initial audio driving mouth shape network model by using the sample video data and the sample mouth shape images to obtain the audio driving mouth shape network model comprises:
obtaining initial video data containing the target object, processing the initial video data to obtain sample audio data without images and sample image data without audio, taking images corresponding to each frame of the sample audio data and containing real mouth shape pictures as labels, extracting sample audio features from the sample audio data, and converting the sample image data into a sample image sequence arranged according to a time sequence;
splicing the sample audio features and the sample image sequence to obtain sample splicing data, wherein each frame of sample image corresponds to the sample audio features under the image frame;
performing convolutional encoding on the sample spliced data, performing deep learning processing, inputting the sample spliced data into a pixel reorganizing module for resolution amplification, and outputting a predicted mouth shape picture corresponding to the sample audio characteristics after deconvolution encoding;
And adjusting parameters of the initial audio driving mouth shape network model according to errors of the predicted mouth shape picture and the label.
6. The method according to claim 5, wherein the adjusting parameters of the initial audio driving mouth shape network model according to errors between the predicted mouth shape picture and the label comprises:
and calculating the overall error between the predicted mouth shape picture and the label by using an L1 loss function, and calculating the local error of the predicted mouth shape picture and the mouth region in the label by using a mouth loss function, wherein the sum of the weight of the overall error and the weight of the local error is 1.
7. The method according to claim 4, wherein the training the initial audio driving mouth shape network model by using the sample video data and the sample mouth shape images to obtain the audio driving mouth shape network model comprises:
training the initial audio driving mouth shape network model by using first sample data to obtain a pre-training model, wherein the first sample data comprises general sample data of non-specific users;
training the pre-training model by using second sample data to obtain the audio driving mouth shape network model, wherein the second sample data comprises personalized data of a specific user.
8. An audio driving mouth shape network model, comprising:
the splicing module is configured to splice the first audio feature and the first image feature to obtain spliced data, wherein the first audio feature is a voice data feature of a first object, and the first image feature is an image sequence containing a mouth image of a target object;
the convolution coding module is configured to carry out convolution coding on the spliced data to obtain coded data;
the residual error module is configured to perform deep learning on the encoded data to obtain a second image characteristic;
the pixel reorganization module is configured to amplify the second image feature, and based on a feature extraction and sub-pixel convolution method, the second image feature with the first resolution is amplified into a target image feature with the second resolution through convolution and reorganization among multiple channels, wherein the first resolution is smaller than the second resolution;
and the deconvolution coding module is configured to perform deconvolution coding on the target image feature and output a mouth shape image of the target object corresponding to the first audio feature.
9. The audio driving mouth shape network model according to claim 8, wherein the pixel reorganization module is further configured to:
inputting the second image feature into the pixel reorganization module, and obtaining feature maps of n² channels through convolution, wherein the size of each feature map is consistent with the second image feature, n is an up-sampling factor, the pixel reorganization module comprises a plurality of convolution layers and n² channels, and the weight of each channel is optimized in the training process;
and arranging, through a convolution operation and periodic shuffling, the n² first-resolution feature maps onto one large map to obtain the target image feature with the second resolution.
10. An output device for a mouth shape image, comprising:
an acquisition module configured to acquire a first audio feature and a first image feature, wherein the first audio feature is a voice data feature of a first object, and the first image feature is an image sequence containing a mouth image of a target object;
the input module is configured to input the first audio feature and the first image feature into a trained audio driving mouth shape network model, wherein the audio driving mouth shape network model is a model obtained by training an initial audio driving mouth shape network model by using sample video data and sample mouth shape images;
The processing module is configured to splice the first audio feature and the first image feature and then sequentially perform convolutional encoding and deep learning to obtain a second image feature;
a conversion module configured to convert a second image feature of a first resolution to a target image feature of a second resolution, wherein the first resolution is less than the second resolution;
and the output module is configured to perform deconvolution coding on the target image features and output a mouth shape image of the target object corresponding to the first audio features.
11. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program, wherein the computer program is arranged to perform the method of any of claims 1 to 7 when run.
12. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of any of claims 1 to 7.
CN202310500594.1A 2023-05-06 2023-05-06 Method and device for outputting mouth shape image and audio driving mouth shape network model Pending CN116206008A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310500594.1A CN116206008A (en) 2023-05-06 2023-05-06 Method and device for outputting mouth shape image and audio driving mouth shape network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310500594.1A CN116206008A (en) 2023-05-06 2023-05-06 Method and device for outputting mouth shape image and audio driving mouth shape network model

Publications (1)

Publication Number Publication Date
CN116206008A true CN116206008A (en) 2023-06-02

Family

ID=86517728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310500594.1A Pending CN116206008A (en) 2023-05-06 2023-05-06 Method and device for outputting mouth shape image and audio driving mouth shape network model

Country Status (1)

Country Link
CN (1) CN116206008A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022106654A2 (en) * 2020-11-20 2022-05-27 Yepic AI Limited Methods and systems for video translation
CN113436609A (en) * 2021-07-06 2021-09-24 南京硅语智能科技有限公司 Voice conversion model and training method thereof, voice conversion method and system
CN114187547A (en) * 2021-12-03 2022-03-15 南京硅基智能科技有限公司 Target video output method and device, storage medium and electronic device
CN115035604A (en) * 2022-08-10 2022-09-09 南京硅基智能科技有限公司 Audio-driven character mouth shape method, model and training method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
于灵云: "Generation of highly natural face animation driven by text/speech", China Doctoral Dissertations Full-text Database, pages 72-86 *

Similar Documents

Publication Publication Date Title
CN107861938B (en) POI (Point of interest) file generation method and device and electronic equipment
CN111970513A (en) Image processing method and device, electronic equipment and storage medium
US11670015B2 (en) Method and apparatus for generating video
EP4138391A1 (en) Mimic compression method and apparatus for video image, and storage medium and terminal
CN111581437A (en) Video retrieval method and device
CN113592985B (en) Method and device for outputting mixed deformation value, storage medium and electronic device
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN110599533B (en) Quick monocular depth estimation method suitable for embedded platform
CN110675321A (en) Super-resolution image reconstruction method based on progressive depth residual error network
CN112653899A (en) Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
CN112562019A (en) Image color adjusting method and device, computer readable medium and electronic equipment
CN112950471A (en) Video super-resolution processing method and device, super-resolution reconstruction model and medium
CN110830807A (en) Image compression method, device and storage medium
CN109949217A (en) Video super-resolution method for reconstructing based on residual error study and implicit motion compensation
CN114245215A (en) Method, device, electronic equipment, medium and product for generating speaking video
CN108376234B (en) Emotion recognition system and method for video image
CN115376482A (en) Face motion video generation method and device, readable medium and electronic equipment
CN115209064A (en) Video synthesis method, device, equipment and storage medium
CN114926336A (en) Video super-resolution reconstruction method and device, computer equipment and storage medium
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
CN116206008A (en) Method and device for outputting mouth shape image and audio driving mouth shape network model
CN107221019B (en) Chart conversion method and device
CN111553961B (en) Method and device for acquiring line manuscript corresponding color map, storage medium and electronic device
CN114359100A (en) Image color enhancement method and device, storage medium and electronic equipment
CN113726976A (en) High-capacity graph hiding method and system based on coding-decoding network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20230602)