CN115690238A - Image generation and model training method, device, equipment and storage medium - Google Patents

Image generation and model training method, device, equipment and storage medium

Info

Publication number
CN115690238A
Authority
CN
China
Prior art keywords
image
encoder
feature
module
encoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211259334.1A
Other languages
Chinese (zh)
Inventor
周航
孙亚圣
何栋梁
刘经拓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211259334.1A priority Critical patent/CN115690238A/en
Publication of CN115690238A publication Critical patent/CN115690238A/en
Pending legal-status Critical Current


Abstract

The disclosure provides an image generation and model training method, device, equipment and storage medium, relating to the technical field of artificial intelligence, specifically to the technical fields of computer vision, augmented reality, virtual reality, deep learning and the like, and applicable to scenes such as the metaverse and virtual digital humans. The image generation method comprises the following steps: masking the lip region of a first image to obtain a second image; encoding the first image to obtain a first image feature; encoding the second image to obtain a second image feature; encoding the voice to obtain a voice feature; acquiring a fusion feature based on the first image feature, the second image feature and the voice feature; and decoding the fusion feature to generate a target image, wherein the target image is an image obtained after the voice drives the lip region of the first image. The present disclosure can improve image quality.

Description

Image generation and model training method, device, equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, augmented reality, virtual reality, deep learning and the like, can be applied to scenes such as the metaverse and virtual digital humans, and particularly relates to an image generation and model training method, device, equipment and storage medium.
Background
A virtual digital human is a product of the integration of information science and life science; it uses information-science methods to perform virtual simulation of the shape and functions of the human body at different levels. With the development of virtual digital human technology, the virtual image of the virtual digital human has become more and more lifelike. Virtual digital humans can be applied to scenes such as voice broadcasting.
Voice-driven lip shaping drives the lip shape of a virtual character according to the input voice while keeping the other facial information of the character unchanged.
Disclosure of Invention
The disclosure provides an image generation and model training method, an image generation and model training device and a storage medium.
According to an aspect of the present disclosure, there is provided an image generation method including: masking the lip region of the first image to obtain a second image; encoding the first image to obtain a first image characteristic; encoding the second image to obtain a second image characteristic; coding voice to obtain voice characteristics, wherein the voice is used for driving a lip region of the first image; acquiring a fusion feature based on the first image feature, the second image feature and the voice feature; and decoding the fusion features to generate a target image, wherein the target image is an image obtained after the voice drives the lip region of the first image.
According to another aspect of the present disclosure, there is provided a model training method, the model including: a first encoder, a second encoder, a third encoder, a fused encoder and a decoder, the method comprising: obtaining training samples, the training samples comprising: a real image, a reference image and a voice for driving a lip region of the real image sample; performing mask processing on the lip region of the real image to obtain a masked image; coding the reference image by adopting the first coder to obtain a first image characteristic; adopting the second encoder to encode the masked image so as to obtain a second image characteristic; coding the voice sample by adopting the third coder to obtain voice characteristics; acquiring a fusion feature based on the first image feature, the second image feature and the voice feature by using the fusion encoder; decoding the fused features with the decoder to generate a predicted image; constructing a total loss function based on the real image and the predicted image; adjusting model parameters of the first encoder, the second encoder, the third encoder, the fusion encoder, and the decoder based on the total loss function.
According to another aspect of the present disclosure, there is provided an image generating apparatus including: the mask module is used for performing mask processing on the lip region of the first image to obtain a second image; the first encoding module is used for encoding the first image to acquire first image characteristics; the second coding module is used for coding the second image to acquire second image characteristics; the third coding module is used for coding voice to acquire voice characteristics, wherein the voice is used for driving a lip region of the first image; a fusion coding module for obtaining fusion features based on the first image features, the second image features and the voice features; and the decoding module is used for decoding the fusion features to generate a target image, wherein the target image is an image obtained after the lip region of the first image is driven by the voice.
According to another aspect of the present disclosure, there is provided a model training apparatus, the model including: a first encoder, a second encoder, a third encoder, a fused encoder and a decoder, the apparatus comprising: an obtaining module, configured to obtain a training sample, where the training sample includes: a real image, a reference image and a voice for driving a lip region of the real image sample; the mask module is used for performing mask processing on the lip area of the real image to obtain a masked image; the first encoding module is used for encoding the reference image by adopting the first encoder so as to acquire a first image characteristic; the second coding module is used for coding the image after the mask by adopting the second coder so as to obtain a second image characteristic; the third coding module is used for coding the voice sample by adopting the third coder so as to obtain voice characteristics; a fusion coding module for acquiring a fusion feature based on the first image feature, the second image feature and the voice feature by using the fusion encoder; a decoding module, configured to perform decoding processing on the fusion feature by using the decoder to generate a predicted image; a construction module for constructing a total loss function based on the real image and the predicted image; an adjustment module to adjust model parameters of the first encoder, the second encoder, the third encoder, the fusion encoder, and the decoder based on the total loss function.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the above aspects.
According to the technical scheme of the disclosure, the image quality can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an application scenario suitable for use in accordance with an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an overall architecture for model-based generation of a target image provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a fourth encoder and a fifth encoder provided in accordance with an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a B-TB in a fourth encoder provided in accordance with an embodiment of the present disclosure;
fig. 7 is a schematic diagram of a CCF-TB in a fifth encoder provided in accordance with an embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 9 is an overall architecture diagram of a model training process provided in accordance with an embodiment of the present disclosure;
FIG. 10 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 11 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 12 is a schematic diagram of an electronic device for implementing an image generation method or a model training method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, for voice-driven lip generation, audio features and image features are usually concatenated and fed into a convolutional neural network, and the final image is generated by the convolutional neural network.
However, this simple concatenation-and-convolution approach leads to poor quality of the generated final image, and the effect is not ideal.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure, which provides an image generation method, including:
101. and masking the lip region of the first image to obtain a second image.
102. And carrying out encoding processing on the first image to obtain a first image characteristic.
103. And carrying out coding processing on the second image to acquire a second image characteristic.
104. And carrying out coding processing on voice to obtain voice characteristics, wherein the voice is used for driving a lip region of the first image.
105. And acquiring a fusion feature based on the first image feature, the second image feature and the voice feature.
106. And decoding the fusion features to generate a target image, wherein the target image is an image obtained after the voice drives the lip region of the first image.
The first image is an image to be driven by voice, and is usually a human face image. It will be appreciated that if the avatar is a character other than a person, the first image may also be a face image of another character, such as an animal character.
An image in which the lip region of the first image is subjected to mask processing may be referred to as a second image.
The preset region of the first image may be used as the lip region, for example, a region of preset size at the lower part of the first image, such as a lower w × h region, where w and h are both preset values.
The masking of the lip region may be to set the pixel values of the lip region to the pixel values corresponding to the black pixels, for example, to set all the pixel values of the lip region to 0.
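As an illustration only, the following is a minimal sketch of this masking step in Python; the array layout (H × W × 3) and the horizontal centering of the lower w × h block are assumptions, since the disclosure only states that a lower region of preset size is set to black.

```python
import numpy as np

def mask_lip_region(image: np.ndarray, w: int, h: int) -> np.ndarray:
    """Return a copy of `image` with a lower w x h lip region set to 0 (black)."""
    H, W = image.shape[:2]          # image assumed to be an (H, W, 3) uint8 array
    masked = image.copy()
    x0 = (W - w) // 2               # assume the lip block is horizontally centred
    masked[H - h:H, x0:x0 + w, :] = 0   # set all pixel values in the region to 0
    return masked
```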
After the first image and the second image are acquired, corresponding image features, which are referred to as a first image feature and a second image feature, may be extracted, respectively.
The first image and the second image may be processed by the same encoder or by different encoders to extract the corresponding image features. The encoder may be a pre-trained deep neural network model.
The voice refers to the voice of the first image to be driven, and the voice can be input into the encoder corresponding to the voice, and the corresponding output is the voice characteristic of the voice. The encoder that encodes speech may also be a pre-trained deep neural network model.
Steps 102, 103 and 104 have no timing constraint relationship; they may be performed in any order.
After the first image feature, the second image feature and the voice feature are obtained, they may be subjected to fusion processing, and the features obtained by performing fusion processing on the above three features may be referred to as fusion features. The fusion processing refers to integrating information of the three features, and can be performed by using a deep neural network, and the specific process of the fusion processing can be referred to in the following embodiments.
After the fusion feature is obtained, a decoder (decoder) may be used to decode the fusion feature, where the output of the decoder is a target image, and the target image is an image obtained by driving the first image with speech, that is, a final image. The decoder may also be a pre-trained deep neural network model.
In this embodiment, the target image is generated based on the fusion feature, which is obtained by performing fusion processing on the image feature and the voice feature, so that the image information and the voice information can be better expressed, and the quality of the generated target image is further improved. In addition, the image characteristics comprise the first image characteristics and the second image characteristics, so that the image information can be better expressed, and the quality of the target image is further improved.
For better understanding of the embodiments of the present disclosure, an application scenario to which the embodiments of the present disclosure are applicable is described below.
As shown in fig. 2, a user may input an image to be driven and a voice for driving at a client, the client transmits the image and its corresponding voice to a server, and the server performs voice-driven lip processing based on the image and its corresponding voice to generate a target image. And then, the server side can feed the target image back to the client side. The client may be deployed on a user terminal 201, and the user terminal may be a Personal Computer (PC), a notebook Computer, a mobile device (e.g., a mobile phone), and the like. The server may be deployed on the server 202, and the server may be a local server or a cloud server, and the server may be a single server or a server cluster. In addition, for example, if the user terminal where the client is located has the corresponding capability, the voice-driven lip processing may be executed locally at the user terminal.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
As shown in fig. 2, after the server acquires the image and the voice, it may process them by using the voice-driven lip technique to generate the target image after the voice-driven image.
The voice-driven lip technique can be applied to a video scene, and for a single-frame original image, voices at different times can be adopted to drive the original image to generate a target image at a corresponding time, and a video is generated based on the target images at different times.
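For illustration, a short sketch of this per-frame usage is given below; the generate_target_image function is a hypothetical wrapper around the image generation method described above, and the video assembly step is omitted.

```python
def drive_video(original_image, speech_clips, generate_target_image):
    """Drive a single original image with speech at different times and collect the frames."""
    frames = []
    for speech_t in speech_clips:                      # speech at time t
        frames.append(generate_target_image(original_image, speech_t))
    return frames                                      # frames can then be assembled into a video
```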
The voice-driven lip technique may be implemented by using a deep neural network model, and as shown in fig. 3, the overall architecture for generating the target image based on the model may include: a masking module 301 and a generator 302. The first image refers to an image to be voice-driven, and voice is used to drive a lip region of the first image. After voice-driven lip processing, a corresponding target image may be generated.
The masking module 301 is used for performing mask processing on the first image to obtain the second image.
The generator 302 includes: a first encoder, a second encoder, a third encoder, a fourth encoder, a fifth encoder and a decoder.
The first encoder is used for encoding the first image to obtain first image characteristics.
The second encoder is used for encoding the second image to obtain the second image characteristic.
The third encoder is used for encoding the voice to obtain voice characteristics.
Generally, to reduce the number of parameters, the first encoder and the second encoder may be selected as the same encoder, or encoders sharing parameters.
And the fourth encoder is used for encoding the first image characteristic to obtain a third image characteristic.
The fourth encoder is an encoder based on a self-attention mechanism, and may be selected as, for example, an encoder of a Transformer model.
And the fifth encoder is used for encoding the second image characteristic, the third image characteristic and the voice characteristic to obtain a fusion characteristic.
The fifth encoder is an encoder based on a mutual-attention (co-attention) mechanism and fusion processing.
The decoder is used for decoding the fusion features to generate a target image.
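As a rough illustration of how such a generator could be composed, the following PyTorch-style sketch wires the five encoders and the decoder together; the module interfaces, tensor shapes and the assumption that the fourth encoder returns one output per B-TB are illustrative assumptions rather than the disclosed implementation.

```python
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of generator 302: three feature encoders, a fusion stage, and a decoder."""
    def __init__(self, img_encoder, speech_encoder, fourth_encoder, fifth_encoder, decoder):
        super().__init__()
        self.first_encoder = img_encoder        # encodes the first image
        self.second_encoder = img_encoder       # same module here, i.e. shared parameters
        self.third_encoder = speech_encoder     # encodes the driving voice
        self.fourth_encoder = fourth_encoder    # self-attention encoder (B-TBs)
        self.fifth_encoder = fifth_encoder      # co-attention / fusion encoder (CCF-TBs)
        self.decoder = decoder                  # decodes the fusion feature into an image

    def forward(self, first_image, second_image, speech):
        f1 = self.first_encoder(first_image)    # first image feature
        f2 = self.second_encoder(second_image)  # second image feature (masked image)
        fa = self.third_encoder(speech)         # voice feature
        third_feats = self.fourth_encoder(f1)   # assumed: list of per-B-TB third image features
        fused = self.fifth_encoder(f2, third_feats, fa)   # fusion feature
        return self.decoder(fused)              # target image
```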
In combination with the application scenario, the present disclosure further provides an image generation method.
Fig. 4 is a schematic diagram according to a second embodiment of the present disclosure, which provides an image generation method, including:
401. and masking the lip region of the first image to obtain a second image.
The first image is an image to be driven by voice, and a user can select the first image according to the requirement of the user.
For example, a face image of a certain virtual person is selected as the first image.
In addition, after a certain face image is selected, the image can be preprocessed, and the preprocessed image is used as a first image.
The pre-processing may comprise: taking the center point between the two eyes in the selected face image as the center point of the preprocessed image, and cropping an image of a preset size (such as 256 × 256) as the preprocessed image.
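A minimal sketch of this cropping step is shown below; it assumes the eye positions are already available from a face landmark detector (the eye_left/eye_right inputs are hypothetical), and border padding is omitted.

```python
import numpy as np

def crop_face(image: np.ndarray, eye_left, eye_right, size: int = 256) -> np.ndarray:
    """Crop a size x size patch centred at the midpoint between the two eyes."""
    cx = int((eye_left[0] + eye_right[0]) / 2)   # (x, y) landmark coordinates assumed
    cy = int((eye_left[1] + eye_right[1]) / 2)
    half = size // 2
    y0 = max(cy - half, 0)
    x0 = max(cx - half, 0)
    return image[y0:y0 + size, x0:x0 + size]     # padding for border cases is omitted
```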
An image in which the lip region of the first image is subjected to mask processing may be referred to as a second image.
The preset region of the first image may be used as the lip region, for example, a region of preset size at the lower part of the first image, such as a lower w × h region, where w and h are both preset values.
The masking of the lip region may be to set the pixel values of the lip region to the pixel values corresponding to the black pixels, for example, to set all the pixel values of the lip region to 0.
402. Encoding the first image with a first encoder to obtain a first image feature.
403. Encoding the second image with a second encoder to obtain a second image feature.
The first encoder and the second encoder are both encoders for extracting image features, and may both be convolutional neural network models.
To reduce the number of parameters, the first encoder and the second encoder may be selected to be the same encoder. Of course, the first encoder and the second encoder may alternatively be different encoders.
404. Encoding the voice with a third encoder to obtain a voice feature.
The third encoder is an encoder for extracting speech features, and may be selected as, for example, a ResNet network model.
Steps 402 to 404 have no timing constraint relationship.
In this embodiment, the first encoder, the second encoder, and the third encoder may respectively acquire the first image feature, the second image feature, and the voice feature so as to generate the target image based on these features.
405. Encoding the first image feature with a fourth encoder to obtain a third image feature.
406. Encoding the second image feature, the third image feature and the voice feature with a fifth encoder to obtain a fusion feature.
In this embodiment, encoding processing is performed on the first image feature to obtain a third image feature, and then the third image feature is fused with other features, so that the feature corresponding to the first image can be better expressed, and the quality of the target image corresponding to the first image is further improved.
For the fourth encoder, the encoding process may include:
taking the first image feature as an input to a first encoding module included in a fourth encoder;
coding the input of each coding module by using each coding module in at least one coding module included in the fourth encoder to obtain the output of each coding module, and taking the output of each coding module as the input of the next coding module; and
and taking the output of each coding module as a group of third image characteristics.
The fourth encoder may be an encoder based on a self-attention mechanism, and may specifically be an encoder of a Transformer network. Accordingly, the encoding modules it comprises may be referred to as Basic Transformer Blocks (B-TB).
As shown in fig. 5, the fourth encoder includes M (M is a positive integer, which may be set) B-TBs. The output of the first B-TB serves as the input of the second B-TB, the output of the second B-TB serves as the input of the third B-TB, and so on. Wherein the input to the first B-TB is the first image characteristic output by the first encoder.
Each B-TB may include: a convolutional layer and N sub-block layers (N is a preset positive integer). For each B-TB, the input of the first sub-block layer is the output of the convolutional layer of the corresponding B-TB, the input of the second sub-block layer is the output of the first sub-block layer, the input of the third sub-block layer is the output of the second sub-block layer, and so on. The output of the last sub-block layer of each B-TB is the output of that B-TB.
In addition, the output of each B-TB also serves as a set of third image features.
Each sub-block layer includes: a Layer Normalization (LN) module, a Multi-head self attention (MSA) module, and a Multi-layer Perceptron (MLP) module.
As shown in fig. 6, for the k-th (k = 1, 2, ..., M) B-TB, the input of its l-th (l = 1, 2, ..., N) sub-block layer is F_r(k,l) and the output is F_r(k,l+1). If k = l = 1, F_r(k,l) is the first image feature; otherwise it is the output of the previous sub-block layer.
In the l-th sub-block layer of the k-th B-TB, the LN module performs LN processing on the input F_r(k,l); the MSA module performs MSA processing on the LN-processed image features; the MSA-processed features are added to the input image features; and the added features are then subjected to LN processing, MLP processing and another addition to obtain the corresponding output features F_r(k,l+1).
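For illustration, a minimal PyTorch sketch of one such sub-block layer is given below; the embedding size, head count and MLP expansion ratio are assumptions, and the convolutional layer of the B-TB is not shown.

```python
import torch.nn as nn

class BTBSubBlock(nn.Module):
    """One B-TB sub-block layer: LN -> MSA -> add, then LN -> MLP -> add."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: F_r(k, l), shape (batch, tokens, dim)
        h = self.ln1(x)
        a, _ = self.msa(h, h, h)               # MSA on the LN-processed features
        x = x + a                              # add to the input features
        x = x + self.mlp(self.ln2(x))          # LN, MLP and another addition -> F_r(k, l+1)
        return x
```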
In this embodiment, each set of third image features may be obtained by each encoding module included in the fourth encoder.
For the fifth encoder, the encoding process may include:
respectively taking each group of third image characteristics as the input of each coding module in at least one coding module included by a fifth coder, taking the second image characteristics as the input of a first coding module included by the fifth coder, and taking the voice characteristics as the input of a coding module to be fused; wherein the encoding module to be fused is at least part of at least one encoding module included in the fifth encoder;
coding the input of each coding module by using each coding module included in the fifth coder to obtain the output of each coding module, and using the output as the input of the next coding module;
taking an output of a last encoding module included in the fifth encoder as the fused feature.
Wherein the fifth encoder may be an encoder based on a co-attention mechanism and a fusion process.
The encoding module of the fifth encoder may be obtained by modifying a Transformer block, and may be referred to as a multimodal context fusion Transformer block (CCF-TB).
The number of CCF-TBs included in the fifth encoder may be the same as the number of B-TBs included in the fourth encoder, and as shown in fig. 5, the number of CCF-TBs and the number of B-TBs are both 5 as an example.
For the fifth encoder, the output of the first CCF-TB serves as the input of the second CCF-TB, the output of the second CCF-TB serves as the input of the third CCF-TB, and so on. The input of the first CCF-TB comprises a second image feature output by the second encoder, in addition, the input of each CCF-TB also comprises a corresponding group of third image features, and the input also comprises a voice feature for the encoding module to be fused.
The encoding modules to be fused are part or all of the CCF-TBs, and fig. 5 illustrates that the encoding modules to be fused are the last three CCF-TBs. In addition, the speech features may be input into the encoding module to be fused through a fully connected (Full Connection) network.
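The wiring described above can be sketched as follows, treating each CCF-TB as a black box; the single linear layer standing in for the fully connected network, the module interface and the choice of fusing the last three of five blocks follow the example in fig. 5 and are otherwise assumptions.

```python
import torch.nn as nn

class FifthEncoder(nn.Module):
    """Stack of M CCF-TBs; only the last `num_fused` blocks also receive the voice feature."""
    def __init__(self, ccf_tbs, speech_dim, dim, num_fused=3):
        super().__init__()
        self.ccf_tbs = nn.ModuleList(ccf_tbs)        # the M encoding modules of the fifth encoder
        self.num_fused = num_fused
        self.speech_fc = nn.Linear(speech_dim, dim)  # stands in for the fully connected network

    def forward(self, second_image_feature, third_features, speech_feature):
        x = second_image_feature                     # input of the first CCF-TB
        fa = self.speech_fc(speech_feature)
        m = len(self.ccf_tbs)
        for k, (block, f_r) in enumerate(zip(self.ccf_tbs, third_features)):
            fa_k = fa if k >= m - self.num_fused else None   # voice only enters the modules to be fused
            x = block(x, f_r, fa_k)                          # output becomes the input of the next module
        return x                                             # output of the last CCF-TB is the fusion feature
```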
Each CCF-TB may include: a convolutional layer and N sub-block layers (N is a preset positive integer). For each CCF-TB, the input to the first sub-block layer is the output of the convolutional layer of the corresponding CCF-TB, the input to the second sub-block layer is the output of the first sub-block layer, the input to the third sub-block layer is the output of the second sub-block layer, and so on. The output of the last sub-block layer of each CCF-TB is the output of that CCF-TB.
In addition, the output of the last CCF-TB is taken as the fusion feature.
Each sub-block layer includes: a Layer Normalization (LN) module, a Multi-head self Attention (MSA) module, a Multi-head common Attention (MMA) module, a Full Connection (FC) module, and a Multi-layer Perceptron (MLP) module.
As shown in fig. 7, for the k-th (k = 1, 2, ..., M) CCF-TB, the input of its l-th (l = 1, 2, ..., N) sub-block layer includes F_t(k,l), F_r(k,l) and f_a(k,l), and the output is F_t(k,l+1). If k = l = 1, F_t(k,l) is the second image feature; otherwise it is the output of the previous sub-block layer. F_r(k,l) is a third image feature and f_a(k,l) is a speech feature.
In the l-th sub-block layer of the k-th CCF-TB, an LN module performs LN processing on the input F_t(k,l) and an MSA module performs MSA processing on the LN-processed image features; another LN module performs LN processing on the input F_r(k,l) and an MMA module performs MMA processing on those LN-processed image features. For the encoding modules that are not to be fused (i.e., k = 1, 2), the MSA-processed and MMA-processed features are subjected to FC processing (denoted by FC1) and then to LN and MLP processing. For an encoding module to be fused (i.e., k > 2), the FC1-processed features are added to the speech feature, subjected to FC processing (denoted by FC2), and then to LN and MLP processing, to obtain the corresponding output features F_t(k,l+1).
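A rough PyTorch sketch of one CCF-TB sub-block layer under the description above is given below; the dimensions, the residual connection, the query/key/value roles in the MMA step and the exact placement of FC1 and FC2 are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CCFTBSubBlock(nn.Module):
    """One CCF-TB sub-block: MSA on F_t, MMA between F_t and F_r, optional fusion with the voice feature."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.ln_t = nn.LayerNorm(dim)
        self.ln_r = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # multi-head self attention
        self.mma = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # multi-head mutual attention
        self.fc1 = nn.Linear(2 * dim, dim)   # FC1: combines the MSA- and MMA-processed features
        self.fc2 = nn.Linear(dim, dim)       # FC2: used only in the encoding modules to be fused
        self.ln_out = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f_t, f_r, f_a=None):
        # f_t, f_r: (batch, tokens, dim); f_a: voice feature, assumed (batch, 1, dim)
        q = self.ln_t(f_t)
        sa, _ = self.msa(q, q, q)                      # MSA on the LN-processed F_t
        r = self.ln_r(f_r)
        ma, _ = self.mma(q, r, r)                      # MMA: queries from F_t, keys/values from F_r
        h = self.fc1(torch.cat([sa, ma], dim=-1))      # FC1 processing
        if f_a is not None:                            # encoding module to be fused (k > 2)
            h = self.fc2(h + f_a)                      # add the voice feature, then FC2
        return h + self.mlp(self.ln_out(h))            # LN and MLP processing with a residual connection
```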
In this embodiment, the fusion feature can be obtained by the encoding module included in the fifth encoder.
407. Decoding the fusion feature with a decoder to generate a target image.
The decoder is also a pre-trained deep neural network model, and specifically may be a convolutional neural network model. A corresponding target image may be generated by the decoder based on the features.
In this embodiment, the first image feature is encoded based on the self-attention mechanism to obtain the third image feature, so that the third image feature with better expression capability can be obtained. In addition, the voice features are integrated into at least part of coding modules of the fifth coder, so that the fusion of the voice features and the image features can be realized, the fusion features of the image information and the voice information can be further obtained, and the quality of the generated target image is improved.
The above embodiments refer to a generator, which may also be referred to as a generative model, and the following describes a training process of the generative model.
Fig. 8 is a schematic diagram according to a third embodiment of the present disclosure, which provides a model training method, where the model includes: a first encoder, a second encoder, a third encoder, a fused encoder and a decoder, the method comprising:
801. Obtaining training samples, the training samples comprising: a real image, a reference image and speech for driving a lip region of the real image.
802. Performing mask processing on the lip region of the real image to obtain a masked image.
803. Encoding the reference image with the first encoder to obtain a first image feature.
804. Encoding the masked image with the second encoder to obtain a second image feature.
805. Encoding the voice with the third encoder to obtain a voice feature.
806. Acquiring a fusion feature based on the first image feature, the second image feature and the voice feature with the fusion encoder.
807. Decoding the fusion feature with the decoder to generate a predicted image.
808. Constructing a total loss function based on the real image and the predicted image.
809. Adjusting model parameters of the first encoder, the second encoder, the third encoder, the fusion encoder, and the decoder based on the total loss function.
In the training process, the training samples may be collected in advance, for example, one frame of image in a certain video may be used as a real image, another frame of image in the video at a different time from the real image may be used as a reference image, and in addition, speech corresponding to the real image may also be collected.
The training process is similar to the model application process: when the model is applied, the image to be driven (the first image) can be used as the reference image, whereas when the model is trained, an image that belongs to the same video as the real image but is at a different time can be selected as the reference image.
Referring to the architecture diagram shown in fig. 9, for a certain video, each frame image in V = {I_1, ..., I_T} can be used as a real image. For each real image, an image at a different time from the real image can be selected as the reference image I_t. In addition, the voice a = {a_1, ..., a_T} corresponding to the real images can be selected.
In FIG. 9, the first encoder and the second encoder are the same and are denoted by E_e; the third encoder is denoted by E_a, and the decoder is denoted by Net_d. The fourth encoder comprises 5 B-TBs and the fifth encoder comprises 5 CCF-TBs.
Similar to the model application process, after the real image, the reference image and the voice are input into the generation model, the image output by the decoder can be called a predicted image and is denoted by I_t'. The generative model shown in FIG. 9 is a Transformer-based trunk network comprising a reference branch, a trunk branch and a voice branch.
After obtaining the predicted image, a total loss function can be constructed based on the real image and the predicted image.
The total loss function may be constructed based on the first loss function, the second loss function, and the third loss function, for example, by adding the three loss functions as the total loss function.
Wherein a first loss function can be constructed based on the real image and the predicted image; the first loss function may also be referred to as a reconstruction loss function, and may be based on an L2 norm between a pixel value of the real image and a pixel value of the predicted image, for example.
The real image and the predicted image may be input into a feature extraction network to obtain image features of the real image and image features of the predicted image, and a second loss function may be constructed based on the image features of the real image and the image features of the predicted image.
The feature extraction network may be a VGG network, and thus the second loss function may also be referred to as a VGG loss function, denoted by L_VGG; specifically, it may be constructed as an L2 norm between the two image features.
The real image and the predicted image may be input into a discrimination network to obtain a discrimination result, and a third loss function may be constructed based on the discrimination result.
The third loss function may be referred to as a generative adversarial loss function, denoted by L_GAN. Specifically, the related loss function of a generative adversarial network may be adopted.
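A hedged sketch of combining the three losses is shown below; the equal weighting, the use of mse_loss for the L2 terms, the discriminator interface (logits) and the omission of the discriminator's own training step are assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(real, pred, vgg_features, discriminator):
    """total = reconstruction (L2) + VGG feature (L_VGG) + adversarial generator loss (L_GAN)."""
    l_rec = F.mse_loss(pred, real)                                   # first loss: pixel-level L2
    l_vgg = F.mse_loss(vgg_features(pred), vgg_features(real))       # second loss: L2 on extracted features
    d_out = discriminator(pred)                                      # discrimination result for the predicted image
    l_gan = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))  # generator-side L_GAN
    return l_rec + l_vgg + l_gan                                     # e.g., a simple sum as the total loss
```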
After obtaining the total loss function, a Back Propagation (BP) algorithm may be used to adjust the relevant model parameters based on the total loss function.
After the number of adjustments reaches a preset value, the adjusted model parameters can be used as the final model parameters for image generation in the inference stage.
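For illustration, a minimal training step corresponding to this procedure is sketched below; the optimizer, learning rate, batch layout and stopping count are assumptions, and compute_total_loss stands for a function such as the loss sketch above.

```python
import torch

def train_generator(generator, dataloader, compute_total_loss, max_steps=100_000, lr=1e-4):
    """Simplified training pass: forward, total loss, back propagation, parameter update."""
    optimizer = torch.optim.Adam(generator.parameters(), lr=lr)
    step = 0
    for real, reference, masked, speech in dataloader:   # batch layout is an assumption
        pred = generator(reference, masked, speech)      # predicted image I_t'
        loss = compute_total_loss(real, pred)            # total of the three loss terms above
        optimizer.zero_grad()
        loss.backward()                                  # back propagation (BP)
        optimizer.step()                                 # adjust the model parameters
        step += 1
        if step >= max_steps:                            # stop after a preset number of adjustments
            break
    return generator
```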
In this embodiment, a predicted image is generated based on the fusion feature, which is obtained by performing fusion processing on the image feature and the voice feature, and image information and voice information can be better expressed, so that the image generation effect of the model can be improved. In addition, the image characteristics comprise the first image characteristics and the second image characteristics, so that image information can be better expressed, and the model effect is further improved.
In some embodiments, the fusion encoder comprises: a fourth encoder and a fifth encoder;
acquiring, by the fusion encoder, a fusion feature based on the first image feature, the second image feature, and the voice feature, including:
encoding the first image characteristics by adopting the fourth encoder to obtain third image characteristics;
and encoding the third image feature, the second image feature and the voice feature by using the fifth encoder to obtain the fusion feature.
In this embodiment, coding is performed on the first image feature to obtain a third image feature, and then the third image feature is fused with other features, so that features corresponding to the first image can be better expressed, and further, the quality of the model is improved.
In some embodiments, said encoding, with the fourth encoder, the first image feature to obtain a third image feature includes:
taking said first image feature as input to a first encoding module comprised by said fourth encoder;
coding the input of each coding module by using each coding module in at least one coding module included in the fourth encoder to obtain the output of each coding module, and taking the output of each coding module as the input of the next coding module; and
and taking the output of each coding module as a group of third image characteristics.
In this embodiment, each set of third image features may be obtained by each encoding module included in the fourth encoder.
In some embodiments, said encoding, with the fifth encoder, the third image feature, the second image feature, and the speech feature to obtain the fused feature includes:
taking each group of third image features as the input of each coding module in at least one coding module included in the fifth encoder, taking the second image features as the input of the first coding module included in the fifth encoder, and taking the voice features as the input of a coding module to be fused; wherein the encoding module to be fused is at least part of at least one encoding module included in the fifth encoder;
adopting each coding module included in the fifth coder to code the input of each coding module to obtain the output of each coding module, and using the output as the input of the next coding module;
taking an output of a last encoding module included in the fifth encoder as the fused feature.
In this embodiment, the fusion feature may be obtained by an encoding module included in the fifth encoder.
In some embodiments, said constructing a total loss function based on said real image and said predicted image comprises:
constructing a first loss function based on the real image and the predicted image;
inputting the real image and the predicted image into a feature extraction network to obtain the image features of the real image and the image features of the predicted image, and constructing a second loss function based on the image features of the real image and the image features of the predicted image;
inputting the real image and the predicted image into a discrimination network to obtain a discrimination result, and constructing a third loss function based on the discrimination result;
constructing the total loss function based on the first, second, and third loss functions.
In this embodiment, by constructing the first loss function, the second loss function, and the third loss function, and constructing the total loss function based on the three loss functions, the factors of multiple dimensions can be referred to when constructing the total loss function, and the model effect is improved.
Fig. 10 is a schematic diagram according to a fourth embodiment of the present disclosure, which provides an image generating apparatus 1000 including: a masking module 1001, a first encoding module 1002, a second encoding module 1003, a third encoding module 1004, a fused encoding module 1005, and a decoding module 1006.
The mask module 1001 is configured to perform mask processing on a lip region of a first image to obtain a second image; the first encoding module 1002 is configured to perform encoding processing on the first image to obtain a first image feature; the second encoding module 1003 is configured to perform encoding processing on the second image to obtain a second image feature; the third encoding module 1004 is configured to perform encoding processing on a voice to obtain a voice feature, where the voice is used to drive a lip region of the first image; the fusion coding module 1005 is configured to obtain a fusion feature based on the first image feature, the second image feature, and the voice feature; the decoding module 1006 is configured to perform decoding processing on the fusion feature to generate a target image, where the target image is an image obtained after the voice drives a lip region of the first image.
In this embodiment, the target image is generated based on the fusion feature, which is obtained by performing fusion processing on the image feature and the voice feature, so that the image information and the voice information can be better expressed, and the quality of the generated target image is further improved. In addition, the image characteristics comprise the first image characteristics and the second image characteristics, so that the image information can be better expressed, and the quality of the target image is further improved.
In some embodiments, the fusion encoding module 1005 includes: a fourth encoding module and a fifth encoding module. The fourth encoding module is used for encoding the first image characteristics to acquire third image characteristics; and the fifth coding module is used for coding the third image feature, the second image feature and the voice feature so as to obtain the fusion feature.
In this embodiment, encoding processing is performed on the first image feature to obtain a third image feature, and then the third image feature is fused with other features, so that the feature corresponding to the first image can be better expressed, and the quality of the target image corresponding to the first image is further improved.
In some embodiments, the fourth encoding module is further to: -taking said first image features as input to a first encoding module comprised by a fourth encoder; coding the input of each coding module by using each coding module in at least one coding module included in the fourth encoder to obtain the output of each coding module, and taking the output of each coding module as the input of the next coding module; and, taking the output of each encoding module as a set of third image features.
In this embodiment, each set of third image features may be obtained by each encoding module included in the fourth encoder.
In some embodiments, the fifth encoding module is further configured to: respectively taking each group of third image characteristics as the input of each coding module in at least one coding module included by a fifth coder, taking the second image characteristics as the input of a first coding module included by the fifth coder, and taking the voice characteristics as the input of a coding module to be fused; wherein the encoding module to be fused is at least part of at least one encoding module included in the fifth encoder; coding the input of each coding module by using each coding module included in the fifth coder to obtain the output of each coding module, and using the output as the input of the next coding module; taking an output of a last encoding module included in the fifth encoder as the fused feature.
In this embodiment, the fusion feature may be obtained by an encoding module included in the fifth encoder.
In some embodiments, the first encoding module is further configured to: adopting a first encoder to perform encoding processing on the first image so as to obtain first image characteristics; and/or the second encoding module is further configured to: adopting a second encoder to perform encoding processing on the second image so as to obtain second image characteristics; and/or the third encoding module is further configured to: and adopting a third encoder to perform encoding processing on the voice so as to obtain the voice characteristics.
In this embodiment, the first encoder, the second encoder, and the third encoder may respectively acquire the first image feature, the second image feature, and the voice feature so as to generate the target image based on these features.
Fig. 11 is a schematic diagram according to a fifth embodiment of the present disclosure, which provides a model training apparatus, the model including: a first encoder, a second encoder, a third encoder, a fused encoder and a decoder, the apparatus 1100 comprising: the encoding method comprises an obtaining module 1101, a masking module 1102, a first encoding module 1103, a second encoding module 1104, a third encoding module 1105, a fusion encoding module 1106, a decoding module 1107, a constructing module 1108, and an adjusting module 1109.
The obtaining module 1101 is configured to obtain a training sample, where the training sample includes: a real image, a reference image and a voice for driving a lip region of the real image sample; the mask module 1102 is configured to perform mask processing on the lip region of the real image to obtain a masked image; the first encoding module 1103 is configured to perform encoding processing on the reference image by using the first encoder to obtain a first image feature; the second encoding module 1104 is configured to perform encoding processing on the masked image by using the second encoder to obtain a second image feature; the third encoding module 1105 is configured to employ the third encoder to perform encoding processing on the voice sample to obtain voice features; the fusion coding module 1106 is configured to obtain a fusion feature based on the first image feature, the second image feature and the voice feature by using the fusion encoder; the decoding module 1107 is configured to perform decoding processing on the fusion feature by using the decoder to generate a predicted image; a construction module 1108 for constructing a total loss function based on the real image and the predicted image; an adjusting module 1109 is configured to adjust model parameters of the first encoder, the second encoder, the third encoder, the fusion encoder, and the decoder based on the total loss function.
In this embodiment, a predicted image is generated based on the fusion feature, which is obtained by performing fusion processing on the image feature and the voice feature, and image information and voice information can be better expressed, so that the image generation effect of the model can be improved. In addition, the image characteristics comprise the first image characteristics and the second image characteristics, so that image information can be better expressed, and the model effect is further improved.
In some embodiments, the fusion encoder comprises: a fourth encoder and a fifth encoder;
the fusion encoding module includes: a fourth encoding module and a fifth encoding module. The fourth encoding module is used for encoding the first image characteristics by adopting the fourth encoder to acquire third image characteristics; and the fifth coding module is used for coding the third image feature, the second image feature and the voice feature by adopting the fifth coder so as to obtain the fusion feature.
In this embodiment, coding is performed on the first image feature to obtain a third image feature, and then the third image feature is fused with other features, so that features corresponding to the first image can be better expressed, and further, the quality of the model is improved.
In some embodiments, the fourth encoding module is further to: -taking said first image feature as input to a first encoding module comprised by said fourth encoder; coding the input of each coding module by using each coding module in at least one coding module included in the fourth encoder to obtain the output of each coding module, and taking the output of each coding module as the input of the next coding module; and, taking the output of each encoding module as a set of third image features.
In this embodiment, each set of third image features may be obtained by each encoding module included in the fourth encoder.
In some embodiments, the fifth encoding module is further configured to: taking each group of third image features as the input of each coding module in at least one coding module included in the fifth encoder, taking the second image features as the input of the first coding module included in the fifth encoder, and taking the voice features as the input of the coding module to be fused; wherein the encoding module to be fused is at least part of at least one encoding module included in the fifth encoder; coding the input of each coding module by using each coding module included in the fifth coder to obtain the output of each coding module, and using the output as the input of the next coding module; taking an output of a last encoding module included in the fifth encoder as the fused feature.
In this embodiment, the fusion feature may be obtained by an encoding module included in the fifth encoder.
In some embodiments, the building module 1108 is further configured to: constructing a first loss function based on the real image and the predicted image; inputting the real image and the predicted image into a feature extraction network to obtain the image features of the real image and the image features of the predicted image, and constructing a second loss function based on the image features of the real image and the image features of the predicted image; inputting the real image and the predicted image into a discrimination network to obtain a discrimination result, and constructing a third loss function based on the discrimination result; constructing the total loss function based on the first, second, and third loss functions.
In this embodiment, by constructing the first loss function, the second loss function, and the third loss function, and constructing the total loss function based on the three loss functions, the factors of multiple dimensions can be referred to when constructing the total loss function, and the model effect is improved.
It is to be understood that in the disclosed embodiments, the same or similar elements in different embodiments may be referenced.
It is to be understood that "first", "second", and the like in the embodiments of the present disclosure are used for distinction only, and do not indicate the degree of importance, the order of timing, and the like.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 12 shows a schematic block diagram of an example electronic device 1200, which can be used to implement embodiments of the present disclosure. The electronic device 1200 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device 1200 may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the electronic apparatus 1200 includes a computing unit 1201 that can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data necessary for the operation of the electronic apparatus 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
Various components in the electronic device 1200 are connected to the I/O interface 1205, including: an input unit 1206 such as a keyboard, a mouse, or the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, optical disk, or the like; and a communication unit 1209 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the electronic device 1200 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1201 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 1201 performs the respective methods and processes described above, such as an image generation method or a model training method. For example, in some embodiments, the image generation method or the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the image generation method or the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the image generation method or the model training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), system on a chip (SOCs), complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable retrieval device, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of high management difficulty and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (23)

1. An image generation method, comprising:
masking the lip region of the first image to obtain a second image;
encoding the first image to obtain a first image feature;
encoding the second image to obtain a second image feature;
encoding voice to obtain a voice feature, wherein the voice is used for driving the lip region of the first image;
acquiring a fusion feature based on the first image feature, the second image feature and the voice feature;
and decoding the fusion feature to generate a target image, wherein the target image is an image obtained after the voice drives the lip region of the first image.
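For orientation only, the following PyTorch-style sketch walks through the generation flow of claim 1. Every name in it (mask_lip_region, first_encoder, fusion_encoder, the rectangular lip_box, and so on) is an illustrative assumption rather than the architecture disclosed above; the encoders and the decoder are treated as opaque callables supplied by the caller.

```python
# Hypothetical sketch of the generation flow of claim 1 (not the disclosed implementation).
import torch


def mask_lip_region(image: torch.Tensor, lip_box: tuple) -> torch.Tensor:
    """Zero out a rectangular lip region (y0, y1, x0, x1) of a (B, C, H, W) batch."""
    y0, y1, x0, x1 = lip_box
    masked = image.clone()
    masked[:, :, y0:y1, x0:x1] = 0.0
    return masked


def generate_target_image(first_image, voice, lip_box,
                          first_encoder, second_encoder, third_encoder,
                          fusion_encoder, decoder):
    second_image = mask_lip_region(first_image, lip_box)   # second image (masked lips)
    first_feat = first_encoder(first_image)                # first image feature
    second_feat = second_encoder(second_image)             # second image feature
    voice_feat = third_encoder(voice)                      # voice feature
    fused = fusion_encoder(first_feat, second_feat, voice_feat)
    return decoder(fused)                                  # voice-driven target image
```

Later sketches suggest possible internals for the individual encoders and the fusion encoder.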
2. The method of claim 1, wherein the acquiring a fusion feature based on the first image feature, the second image feature and the voice feature comprises:
encoding the first image feature to obtain a third image feature;
and encoding the third image feature, the second image feature and the voice feature to obtain the fusion feature.
3. The method of claim 2, wherein the encoding the first image feature to obtain a third image feature comprises:
taking the first image feature as an input of a first encoding module included in a fourth encoder;
encoding the input of each encoding module by using each encoding module in at least one encoding module included in the fourth encoder to obtain an output of each encoding module, and taking the output of each encoding module as an input of a next encoding module; and
taking the output of each encoding module as a group of third image features.
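A minimal reading of claim 3 is a chain of encoding modules in which every intermediate output is retained as one group of third image features. The sketch below illustrates that chaining; the convolutional blocks and channel sizes are assumptions chosen only to make the example runnable.

```python
# Hypothetical fourth encoder: keeps the output of every module as one feature group.
import torch.nn as nn


class FourthEncoder(nn.Module):
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        blocks = []
        in_ch = channels[0]
        for out_ch in channels[1:]:
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            in_ch = out_ch
        self.blocks = nn.ModuleList(blocks)

    def forward(self, first_image_feature):
        # first_image_feature is assumed to have channels[0] channels
        x = first_image_feature
        third_feature_groups = []
        for block in self.blocks:           # each module's output feeds the next module
            x = block(x)
            third_feature_groups.append(x)  # and is also kept as one group of features
        return third_feature_groups
```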
4. The method of claim 3, wherein the encoding the third image feature, the second image feature and the voice feature to obtain the fusion feature comprises:
taking each group of third image features as an input of each encoding module in at least one encoding module included in a fifth encoder, taking the second image feature as an input of a first encoding module included in the fifth encoder, and taking the voice feature as an input of an encoding module to be fused; wherein the encoding module to be fused is at least part of the at least one encoding module included in the fifth encoder;
encoding, by using each encoding module included in the fifth encoder, the input of each encoding module to obtain an output of each encoding module, and taking the output as an input of a next encoding module;
taking an output of a last encoding module included in the fifth encoder as the fusion feature.
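Claim 4 routes the second image feature into the first module of the fifth encoder, one group of third image features into every module, and the voice feature only into the modules to be fused. One possible realization is sketched below; the concatenation-plus-convolution fusion, the spatial broadcast of the voice vector, and the interpolation used to align feature maps are assumptions, not the disclosed design.

```python
# Hypothetical fifth encoder: fuses per-module image features and the voice feature.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionStage(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch, fuse_voice=False, voice_dim=0):
        super().__init__()
        self.fuse_voice = fuse_voice
        total = in_ch + skip_ch + (voice_dim if fuse_voice else 0)
        self.block = nn.Sequential(
            nn.Conv2d(total, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, x, third_feat, voice_feat=None):
        # align spatial sizes before concatenation (an assumption of this sketch)
        third_feat = F.interpolate(third_feat, size=x.shape[2:], mode="nearest")
        parts = [x, third_feat]
        if self.fuse_voice:
            b, d = voice_feat.shape
            parts.append(voice_feat.view(b, d, 1, 1).expand(b, d, *x.shape[2:]))
        return self.block(torch.cat(parts, dim=1))


class FifthEncoder(nn.Module):
    def __init__(self, stage_cfgs):
        super().__init__()
        self.stages = nn.ModuleList([FusionStage(**cfg) for cfg in stage_cfgs])

    def forward(self, second_feat, third_groups, voice_feat):
        x = second_feat                              # input of the first module
        for stage, third_feat in zip(self.stages, third_groups):
            x = stage(x, third_feat, voice_feat)     # each output feeds the next module
        return x                                     # last module's output = fusion feature
```

Setting fuse_voice=True on only some stages matches the claim's wording that the modules to be fused are at least part of the fifth encoder's modules.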
5. The method of any one of claims 1-4,
the encoding the first image to obtain a first image feature includes:
adopting a first encoder to perform encoding processing on the first image so as to obtain the first image feature; and/or,
the encoding the second image to obtain a second image feature includes:
adopting a second encoder to perform encoding processing on the second image so as to obtain the second image feature; and/or,
the encoding the voice to obtain the voice feature includes:
adopting a third encoder to perform encoding processing on the voice so as to obtain the voice feature.
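Claim 5 only requires that the three encoders be separate networks and leaves their internals open. The toy definitions below are one arbitrary choice made for illustration (small convolutional stacks for the two images, a 1-D convolutional stack over a mel-spectrogram for the voice); none of the layer choices come from the disclosure.

```python
# Hypothetical, deliberately small encoders for the first image, second image and voice.
import torch.nn as nn


def conv_image_encoder(out_ch=64):
    """(B, 3, H, W) image -> (B, out_ch, H/4, W/4) feature map."""
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(32, out_ch, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True))


def conv_voice_encoder(n_mels=80, out_dim=256):
    """(B, n_mels, T) mel-spectrogram -> (B, out_dim) voice feature vector."""
    return nn.Sequential(
        nn.Conv1d(n_mels, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv1d(128, out_dim, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool1d(1), nn.Flatten())


first_encoder = conv_image_encoder()    # encodes the first image
second_encoder = conv_image_encoder()   # encodes the masked second image
third_encoder = conv_voice_encoder()    # encodes the driving voice
```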
6. A method of model training, the model comprising: a first encoder, a second encoder, a third encoder, a fusion encoder and a decoder, the method comprising:
obtaining a training sample, the training sample comprising: a real image, a reference image, and a voice for driving a lip region of the real image;
masking the lip region of the real image to obtain a masked image;
encoding the reference image by using the first encoder to obtain a first image feature;
encoding the masked image by using the second encoder to obtain a second image feature;
encoding the voice by using the third encoder to obtain a voice feature;
acquiring a fusion feature based on the first image feature, the second image feature and the voice feature by using the fusion encoder;
decoding the fusion feature with the decoder to generate a predicted image;
constructing a total loss function based on the real image and the predicted image;
adjusting model parameters of the first encoder, the second encoder, the third encoder, the fusion encoder, and the decoder based on the total loss function.
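To make the training method of claim 6 concrete, the sketch below runs a single optimization step. The model object bundling the five sub-networks, the single optimizer over all of their parameters, the total_loss_fn callable, and the mask_lip_region helper from the earlier sketch are all assumptions made for readability.

```python
# Hypothetical single training step for the model of claim 6.
def training_step(batch, model, optimizer, total_loss_fn, lip_box):
    real_image, reference_image, voice = batch
    masked_image = mask_lip_region(real_image, lip_box)     # mask the lip region

    first_feat = model.first_encoder(reference_image)       # first image feature
    second_feat = model.second_encoder(masked_image)        # second image feature
    voice_feat = model.third_encoder(voice)                 # voice feature
    fused = model.fusion_encoder(first_feat, second_feat, voice_feat)
    predicted_image = model.decoder(fused)

    loss = total_loss_fn(real_image, predicted_image)       # total loss (see claim 10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                        # updates all five sub-networks
    return loss.item()
```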
7. The method of claim 6, wherein,
the fusion encoder includes: a fourth encoder and a fifth encoder;
the acquiring, by using the fusion encoder, a fusion feature based on the first image feature, the second image feature and the voice feature includes:
encoding the first image feature by using the fourth encoder to obtain a third image feature;
and encoding the third image feature, the second image feature and the voice feature by using the fifth encoder to obtain the fusion feature.
8. The method of claim 7, wherein said encoding, with said fourth encoder, said first image feature to obtain a third image feature comprises:
taking the first image feature as an input of a first encoding module included in the fourth encoder;
encoding the input of each encoding module by using each encoding module in at least one encoding module included in the fourth encoder to obtain an output of each encoding module, and taking the output of each encoding module as an input of a next encoding module; and
taking the output of each encoding module as a group of third image features.
9. The method of claim 8, wherein the encoding, by using the fifth encoder, the third image feature, the second image feature and the voice feature to obtain the fusion feature comprises:
taking each group of third image features as an input of each encoding module in at least one encoding module included in the fifth encoder, taking the second image feature as an input of a first encoding module included in the fifth encoder, and taking the voice feature as an input of an encoding module to be fused; wherein the encoding module to be fused is at least part of the at least one encoding module included in the fifth encoder;
encoding the input of each encoding module by using each encoding module included in the fifth encoder to obtain an output of each encoding module, and taking the output as an input of a next encoding module;
taking an output of a last encoding module included in the fifth encoder as the fusion feature.
10. The method according to any of claims 6-9, wherein said constructing a total loss function based on said real image and said predicted image comprises:
constructing a first loss function based on the real image and the predicted image;
inputting the real image and the predicted image into a feature extraction network to obtain the image features of the real image and the image features of the predicted image, and constructing a second loss function based on the image features of the real image and the image features of the predicted image;
inputting the real image and the predicted image into a discrimination network to obtain a discrimination result, and constructing a third loss function based on the discrimination result;
constructing the total loss function based on the first, second, and third loss functions.
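One common way to realize the three losses of claim 10 is an L1 pixel loss between the real and predicted images, an L1 loss between their features from a fixed feature extraction network, and a generator-side adversarial loss from the discrimination network, combined as a weighted sum. The concrete loss forms and weights below are assumptions; the claim itself does not fix them.

```python
# Hypothetical total loss combining pixel, feature (perceptual) and adversarial terms.
import torch
import torch.nn as nn

l1 = nn.L1Loss()
bce = nn.BCEWithLogitsLoss()


def total_loss(real, pred, feature_net, discriminator, w1=1.0, w2=1.0, w3=0.1):
    loss_pixel = l1(pred, real)                                  # first loss function
    loss_feature = l1(feature_net(pred), feature_net(real))      # second loss function
    logits_fake = discriminator(pred)                            # discrimination result
    loss_adv = bce(logits_fake, torch.ones_like(logits_fake))    # third loss function
    return w1 * loss_pixel + w2 * loss_feature + w3 * loss_adv
```

The discriminator itself would be trained with a separate objective, which claim 10 does not recite and the sketch therefore omits.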
11. An image generation apparatus comprising:
the mask module is used for performing mask processing on the lip region of the first image to obtain a second image;
the first encoding module is used for encoding the first image to acquire a first image feature;
the second encoding module is used for encoding the second image to acquire a second image feature;
the third encoding module is used for encoding voice to acquire a voice feature, wherein the voice is used for driving the lip region of the first image;
the fusion encoding module is used for acquiring a fusion feature based on the first image feature, the second image feature and the voice feature;
and the decoding module is used for decoding the fusion feature to generate a target image, wherein the target image is an image obtained after the lip region of the first image is driven by the voice.
12. The apparatus of claim 11, wherein the fusion encoding module comprises:
the fourth encoding module is used for encoding the first image feature to acquire a third image feature;
and the fifth encoding module is used for encoding the third image feature, the second image feature and the voice feature so as to obtain the fusion feature.
13. The apparatus of claim 12, wherein the fourth encoding module is further configured to:
taking the first image feature as an input to a first encoding module included in a fourth encoder;
encoding the input of each encoding module by using each encoding module in at least one encoding module included in the fourth encoder to obtain an output of each encoding module, and taking the output of each encoding module as an input of a next encoding module; and
taking the output of each encoding module as a group of third image features.
14. The apparatus of claim 12, wherein the fifth encoding module is further configured to:
taking each group of third image features as an input of each encoding module in at least one encoding module included in a fifth encoder, taking the second image feature as an input of a first encoding module included in the fifth encoder, and taking the voice feature as an input of an encoding module to be fused; wherein the encoding module to be fused is at least part of the at least one encoding module included in the fifth encoder;
encoding the input of each encoding module by using each encoding module included in the fifth encoder to obtain an output of each encoding module, and taking the output as an input of a next encoding module;
taking an output of a last encoding module included in the fifth encoder as the fusion feature.
15. The apparatus of any one of claims 11-14,
the first encoding module is further configured to: adopting a first encoder to perform encoding processing on the first image so as to obtain a first image feature; and/or,
the second encoding module is further configured to: adopting a second encoder to perform encoding processing on the second image so as to obtain a second image feature; and/or,
the third encoding module is further configured to: adopting a third encoder to perform encoding processing on the voice so as to obtain a voice feature.
16. A model training apparatus, the model comprising: a first encoder, a second encoder, a third encoder, a fusion encoder and a decoder, the apparatus comprising:
an obtaining module, configured to obtain a training sample, where the training sample includes: a real image, a reference image, and a voice for driving a lip region of the real image;
the mask module is used for performing mask processing on the lip region of the real image to obtain a masked image;
the first encoding module is used for encoding the reference image by adopting the first encoder so as to acquire a first image feature;
the second encoding module is used for encoding the masked image by adopting the second encoder so as to acquire a second image feature;
the third encoding module is used for encoding the voice by adopting the third encoder so as to acquire a voice feature;
a fusion encoding module for acquiring a fusion feature based on the first image feature, the second image feature and the voice feature by using the fusion encoder;
a decoding module, configured to perform decoding processing on the fusion feature by using the decoder to generate a predicted image;
a construction module for constructing a total loss function based on the real image and the predicted image;
an adjustment module to adjust model parameters of the first encoder, the second encoder, the third encoder, the fusion encoder, and the decoder based on the total loss function.
17. The apparatus of claim 16, wherein,
the fusion encoder includes: a fourth encoder and a fifth encoder;
the fusion encoding module includes:
a fourth encoding module, configured to perform encoding processing on the first image feature by using the fourth encoder to obtain a third image feature;
and a fifth encoding module, configured to perform encoding processing on the third image feature, the second image feature, and the voice feature by using the fifth encoder, so as to obtain the fusion feature.
18. The apparatus of claim 17, wherein the fourth encoding module is further configured to:
taking the first image feature as an input of a first encoding module included in the fourth encoder;
encoding the input of each encoding module by using each encoding module in at least one encoding module included in the fourth encoder to obtain an output of each encoding module, and taking the output of each encoding module as an input of a next encoding module; and
taking the output of each encoding module as a group of third image features.
19. The apparatus of claim 17, wherein the fifth encoding module is further configured to:
taking each group of third image features as an input of each encoding module in at least one encoding module included in the fifth encoder, taking the second image feature as an input of a first encoding module included in the fifth encoder, and taking the voice feature as an input of an encoding module to be fused; wherein the encoding module to be fused is at least part of the at least one encoding module included in the fifth encoder;
encoding the input of each encoding module by using each encoding module included in the fifth encoder to obtain an output of each encoding module, and taking the output as an input of a next encoding module;
taking an output of a last encoding module included in the fifth encoder as the fusion feature.
20. The apparatus of any of claims 16-19, wherein the build module is further to:
constructing a first loss function based on the real image and the predicted image;
inputting the real image and the predicted image into a feature extraction network to obtain the image features of the real image and the image features of the predicted image, and constructing a second loss function based on the image features of the real image and the image features of the predicted image;
inputting the real image and the predicted image into a discrimination network to obtain a discrimination result, and constructing a third loss function based on the discrimination result;
constructing the total loss function based on the first, second, and third loss functions.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-10.
CN202211259334.1A 2022-10-14 2022-10-14 Image generation and model training method, device, equipment and storage medium Pending CN115690238A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211259334.1A CN115690238A (en) 2022-10-14 2022-10-14 Image generation and model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211259334.1A CN115690238A (en) 2022-10-14 2022-10-14 Image generation and model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115690238A true CN115690238A (en) 2023-02-03

Family

ID=85066022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211259334.1A Pending CN115690238A (en) 2022-10-14 2022-10-14 Image generation and model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115690238A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385604A (en) * 2023-06-02 2023-07-04 摩尔线程智能科技(北京)有限责任公司 Video generation and model training method, device, equipment and storage medium
CN116385604B (en) * 2023-06-02 2023-12-19 摩尔线程智能科技(北京)有限责任公司 Video generation and model training method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN114445831A (en) Image-text pre-training method, device, equipment and storage medium
CN114549935A (en) Information generation method and device
JP7401606B2 (en) Virtual object lip driving method, model training method, related equipment and electronic equipment
CN115376211B (en) Lip driving method, lip driving model training method, device and equipment
CN113590858A (en) Target object generation method and device, electronic equipment and storage medium
CN113963359B (en) Text recognition model training method, text recognition device and electronic equipment
CN113889076A (en) Speech recognition and coding/decoding method, device, electronic equipment and storage medium
JP2023001926A (en) Method and apparatus of fusing image, method and apparatus of training image fusion model, electronic device, storage medium and computer program
CN114723760B (en) Portrait segmentation model training method and device and portrait segmentation method and device
CN115690238A (en) Image generation and model training method, device, equipment and storage medium
CN115601485A (en) Data processing method of task processing model and virtual character animation generation method
KR20240016368A (en) Features data encoding and decoding method and device
CN113869205A (en) Object detection method and device, electronic equipment and storage medium
CN112634413A (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
US11928855B2 (en) Method, device, and computer program product for video processing
CN116630369A (en) Unmanned aerial vehicle target tracking method based on space-time memory network
CN113240780B (en) Method and device for generating animation
CN111144492B (en) Scene map generation method for mobile terminal virtual reality and augmented reality
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN112950501A (en) Image noise reduction method, device and equipment based on noise field and storage medium
CN114187318A (en) Image segmentation method and device, electronic equipment and storage medium
CN110958417A (en) Method for removing compression noise of video call video based on voice clue
US20230239481A1 (en) Method, electronic device, and computer program product for processing data
Zhu et al. Application research on improved CGAN in image raindrop removal
CN114661953B (en) Video description generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination