CN115131849A - Image generation method and related device - Google Patents

Image generation method and related device

Info

Publication number
CN115131849A
CN115131849A (Application CN202210477320.0A)
Authority
CN
China
Prior art keywords
image frame
target
facial
feature
face
Prior art date
Legal status
Pending
Application number
CN202210477320.0A
Other languages
Chinese (zh)
Inventor
朱飞达
朱俊伟
储文青
邰颖
汪铖杰
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210477320.0A priority Critical patent/CN115131849A/en
Publication of CN115131849A publication Critical patent/CN115131849A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 - 3D [Three Dimensional] image rendering
    • G06T15/005 - General purpose rendering architectures
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The application discloses an image generation method and related equipment; related embodiments can be applied to various scenes such as cloud technology, artificial intelligence, intelligent traffic, and auxiliary driving. The method comprises the steps of acquiring an original face image frame of a target object and audio driving information corresponding to a target face image frame to be generated; performing spatial feature extraction on the original face image frame to obtain original facial spatial features; performing time sequence feature extraction on the audio driving information to obtain local facial pose features; and performing face reconstruction processing on the target object based on the original facial spatial features and the local facial pose features to generate the target face image frame. According to the method and the device, feature extraction can be performed on the audio driving information to capture local facial pose details of the target object, and the face in the original face image frame can then be adjusted based on the captured information to obtain the target face image frame corresponding to the audio driving information, thereby improving the generation efficiency and accuracy of the target face image frame.

Description

Image generation method and related device
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image generation method and a related device.
Background
With the development of computer technology, image processing technology is applied in more and more fields. For example, image processing technology may include image generation, in particular face image generation, which can be applied in fields such as animation.
In the current related art, if facial images of the same target object in different facial poses are to be acquired, a modeler and an animator are required to draw the facial image for each facial pose separately. This image generation approach is time-consuming and labor-intensive, and its image generation efficiency is low.
Disclosure of Invention
The embodiment of the application provides an image generation method and related equipment, wherein the related equipment can comprise an image generation device, electronic equipment, a computer readable storage medium and a computer program product, and the generation efficiency and accuracy of a target face image frame can be improved.
The embodiment of the application provides an image generation method, which comprises the following steps:
acquiring an original face image frame of a target object and audio driving information corresponding to a target face image frame to be generated;
extracting the spatial features of the original facial image frame to obtain the original facial spatial features corresponding to the original facial image frame;
performing time sequence feature extraction on the audio driving information to obtain a local facial posture feature corresponding to the target facial image frame;
and performing face reconstruction processing on the target object based on the original face space feature and the face local posture feature to generate the target face image frame.
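For illustration only, a minimal Python (PyTorch-style) sketch of how these four steps could be wired together is given below. The encoder and reconstructor modules passed in as arguments are hypothetical placeholders, not components defined in this application.

def generate_target_face_frame(original_frame, audio_driving_info,
                               spatial_encoder, audio_temporal_encoder,
                               face_reconstructor):
    # original_frame: (B, 3, H, W) original face image frame of the target object.
    # audio_driving_info: (B, T, ...) audio driving information for the frame to
    # be generated; the exact audio representation is an assumption.

    # Spatial feature extraction on the original face image frame.
    original_spatial_feat = spatial_encoder(original_frame)

    # Time sequence feature extraction on the audio driving information,
    # yielding the local facial pose feature of the target face image frame.
    local_pose_feat = audio_temporal_encoder(audio_driving_info)

    # Face reconstruction processing conditioned on both features produces
    # the target face image frame.
    return face_reconstructor(original_frame, original_spatial_feat,
                              local_pose_feat)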
Accordingly, an embodiment of the present application provides an image generating apparatus, including:
the acquisition unit is used for acquiring an original face image frame of a target object and audio driving information corresponding to a target face image frame to be generated;
the first extraction unit is used for extracting the spatial features of the original face image frame to obtain the original face spatial features corresponding to the original face image frame;
the second extraction unit is used for extracting time sequence characteristics of the audio driving information to obtain local facial posture characteristics corresponding to the target facial image frame;
and the reconstruction unit is used for carrying out face reconstruction processing on the target object based on the original face space characteristic and the face local posture characteristic to generate the target face image frame.
Optionally, in some embodiments of the present application, the second extraction unit may include an extraction subunit, a processing subunit, and a first fusion subunit, as follows:
the extraction subunit is configured to perform feature extraction on each audio frame in the audio driving information to obtain audio semantic feature information of each audio frame;
the processing subunit is used for processing the audio semantic feature information of each audio frame based on the audio semantic feature information of the preceding and following audio frames of that audio frame;
and the first fusion subunit is used for fusing the processed audio semantic feature information of each audio frame to obtain the local facial pose feature corresponding to the target facial image frame.
Optionally, in some embodiments of the present application, the reconstruction unit may include a second fusion subunit, a reconstruction subunit, and a generation subunit, as follows:
the second fusion subunit is configured to fuse the original facial spatial feature and the facial local pose feature to obtain a fused posterior spatial feature;
the reconstruction subunit is configured to perform face reconstruction processing on the target object based on the fused posterior spatial feature to obtain a reference face image frame corresponding to the target object;
a generating subunit, configured to generate the target face image frame based on the original face image frame, the fused posterior spatial feature, and the reference face image frame.
Optionally, in some embodiments of the present application, the reconstruction subunit may be specifically configured to perform face reconstruction processing on the target object based on the fused posterior spatial feature, so as to obtain a reconstructed three-dimensional face image corresponding to the target object; and rendering and mapping the reconstructed three-dimensional face image to obtain a reference face image frame corresponding to the target object.
Optionally, in some embodiments of the present application, the generating subunit may be specifically configured to perform multi-scale feature extraction on the original facial image frame, so as to obtain an original facial feature map under multiple scales corresponding to the original facial image frame; performing multi-scale feature extraction on the reference face image frame to obtain a reference face feature map under multiple scales corresponding to the reference face image frame; encoding and mapping the fused posterior spatial features to obtain hidden feature information corresponding to the fused posterior spatial features; and fusing the original facial feature maps under the multiple scales, the reference facial feature maps under the multiple scales and the hidden feature information to obtain the target facial image frame.
Optionally, in some embodiments of the present application, the step of "fusing the original facial feature maps at the multiple scales, the reference facial feature maps at the multiple scales, and the hidden feature information to obtain the target facial image frame" may include:
fusing the hidden feature information, an original facial feature map under a target scale and a reference facial feature map under the target scale to obtain a corresponding fused facial feature map under the target scale, wherein the target scale is selected from the multiple scales;
and fusing the corresponding fused face feature map under the target scale, the original face feature map under the adjacent scale and the reference face feature map under the adjacent scale to obtain the target face image frame.
Optionally, in some embodiments of the present application, the step "fusing the fused facial feature map corresponding to the target scale, the original facial feature map in the adjacent scale, and the reference facial feature map in the adjacent scale to obtain the target facial image frame" may include:
based on the hidden feature information, performing style modulation processing on the corresponding fused face feature map under the target scale to obtain modulated style features;
and fusing the modulated style features, the original facial feature map under the adjacent scale and the reference facial feature map under the adjacent scale to obtain the target facial image frame.
Optionally, in some embodiments of the application, the first extraction unit may be specifically configured to perform spatial feature extraction on the original face image frame through an image generation model, so as to obtain an original face spatial feature corresponding to the original face image frame;
the second extraction unit may be specifically configured to perform, by using the image generation model, time-series feature extraction on the audio driving information to obtain a local facial pose feature corresponding to the target facial image frame;
the reconstruction unit may be specifically configured to perform face reconstruction processing on the target object based on the original face spatial feature and the face local pose feature through the image generation model, so as to generate the target face image frame.
Optionally, in some embodiments of the present application, the image generation apparatus may further include a training unit, and the training unit may be configured to train the image generation model;
the training unit may be specifically configured to acquire training data, where the training data includes an original face image frame sample of a sample object, a target driving face image frame sample, and an audio driving information sample corresponding to the target driving face image frame sample; performing spatial feature extraction on the original facial image frame sample through a preset image generation model to obtain original facial spatial features corresponding to the original facial image frame sample; performing time sequence feature extraction on the audio driving information sample to obtain a local facial posture feature corresponding to the target driving facial image frame sample; based on the original face space feature and the face local posture feature, carrying out face reconstruction processing on the sample object to obtain a prediction driving face image frame; and adjusting parameters of a preset image generation model based on the target driving face image frame sample and the prediction driving face image frame to obtain a trained image generation model.
Optionally, in some embodiments of the present application, the step "adjusting parameters of a preset image generation model based on the target driving face image frame sample and the prediction driving face image frame to obtain a trained image generation model", may include:
performing spatial feature extraction on the target driving face image frame sample to obtain a target face spatial feature corresponding to the target driving face image frame sample;
determining first loss information based on the local facial pose characteristics corresponding to the target driving facial image frame samples and the target facial space characteristics;
determining second loss information based on the target drive face image frame sample and the predicted drive face image frame;
and adjusting parameters of a preset image generation model according to the first loss information and the second loss information to obtain a trained image generation model.
Optionally, in some embodiments of the present application, the step of "determining second loss information based on the target driving face image frame sample and the predicted driving face image frame" may include:
respectively predicting the probabilities that the target driving face image frame sample and the predicted driving face image frame belong to a real driving face image frame, and determining countermeasure loss information of a preset image generation model based on the probabilities;
determining reconstruction loss information of a preset image generation model based on a similarity between the target driving face image frame sample and the prediction driving face image frame;
respectively carrying out identity recognition on the target driving face image frame sample and the prediction driving face image frame, and determining identity loss information of a preset image generation model based on an identity recognition result;
and determining second loss information according to the countermeasure loss information, the reconstruction loss information and the identity loss information.
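As an illustration of how the above loss terms might be combined during training, the following sketch assumes a discriminator, a pretrained identity encoder, L1 distances, a non-saturating adversarial loss, and unit loss weights; all of these choices are assumptions, not details given in this application.

import torch
import torch.nn.functional as F

def total_training_loss(pred_frame, target_frame, local_pose_feat,
                        target_spatial_feat, discriminator, identity_encoder,
                        w_adv=1.0, w_rec=1.0, w_id=1.0, w_pose=1.0):
    # First loss: local facial pose feature vs. the spatial feature of the
    # target driving frame (assumed to share the same dimensionality, e.g.
    # the pose-related coefficient subset); L1 distance is an assumption.
    first_loss = F.l1_loss(local_pose_feat, target_spatial_feat)

    # Adversarial loss: probability that the predicted frame is a real
    # driving frame (non-saturating GAN loss is an assumption).
    fake_logits = discriminator(pred_frame)
    adv_loss = F.softplus(-fake_logits).mean()

    # Reconstruction loss: similarity between target and predicted frames.
    rec_loss = F.l1_loss(pred_frame, target_frame)

    # Identity loss: distance between identity embeddings of both frames.
    id_pred = F.normalize(identity_encoder(pred_frame), dim=-1)
    id_tgt = F.normalize(identity_encoder(target_frame), dim=-1)
    id_loss = (1.0 - (id_pred * id_tgt).sum(dim=-1)).mean()

    second_loss = w_adv * adv_loss + w_rec * rec_loss + w_id * id_loss
    return w_pose * first_loss + second_loss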
The electronic device provided by the embodiment of the application comprises a processor and a memory, wherein the memory stores a plurality of instructions, and the processor loads the instructions to execute the steps in the image generation method provided by the embodiment of the application.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the image generation method provided in the embodiments of the present application.
In addition, the embodiment of the present application also provides a computer program product, which includes a computer program or instructions, and the computer program or instructions, when executed by a processor, implement the steps in the image generation method provided by the embodiment of the present application.
The embodiment of the application provides an image generation method and related equipment, which can acquire an original face image frame of a target object and audio driving information corresponding to a target face image frame to be generated; extracting the spatial features of the original facial image frame to obtain the original facial spatial features corresponding to the original facial image frame; performing time sequence feature extraction on the audio driving information to obtain a local facial posture feature corresponding to the target facial image frame; and performing face reconstruction processing on the target object based on the original face space feature and the face local posture feature to generate the target face image frame. According to the method and the device, the audio driving information can be subjected to feature extraction, the facial posture detail information of the target object part is captured, and then the original facial image frame is subjected to facial adjustment based on the captured information, so that the target facial image frame corresponding to the audio driving information is obtained, and the generation efficiency and accuracy of the target facial image frame are improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1a is a scene schematic diagram of an image generation method provided in an embodiment of the present application;
FIG. 1b is a flowchart of an image generation method provided in an embodiment of the present application;
fig. 1c is an explanatory diagram of an image generation method provided in an embodiment of the present application;
fig. 1d is another illustrative diagram of an image generation method provided in an embodiment of the present application;
fig. 1e is a model structure diagram of an image generation method provided in the embodiment of the present application;
FIG. 1f is a diagram of another model structure of an image generation method according to an embodiment of the present application;
FIG. 1g is a diagram of another model structure of an image generation method provided in the embodiment of the present application;
FIG. 2 is another flow chart of an image generation method provided by an embodiment of the present application;
fig. 3 is a schematic structural diagram of an image generating apparatus provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides an image generation method and related equipment, and the related equipment can comprise an image generation device, electronic equipment, a computer readable storage medium and a computer program product. The image generating apparatus may be specifically integrated in an electronic device, and the electronic device may be a terminal or a server.
It is understood that the image generation method of the present embodiment may be executed on a terminal, may be executed on a server, or may be executed by both the terminal and the server. The above examples should not be construed as limiting the present application.
As shown in fig. 1a, the image generation method is performed by the terminal and the server together. The image generation system provided by the embodiment of the application comprises a terminal 10, a server 11 and the like; the terminal 10 and the server 11 are connected via a network, such as a wired or wireless network, wherein the image generating device may be integrated in the server.
The server 11 may be configured to: acquiring an original face image frame of a target object and audio driving information corresponding to a target face image frame to be generated; extracting spatial features of the original face image frame to obtain original face spatial features corresponding to the original face image frame; performing time sequence feature extraction on the audio driving information to obtain a local facial posture feature corresponding to the target facial image frame; and performing face reconstruction processing on the target object based on the original face space features and the face local posture features, generating a target face image frame, and sending the target face image frame to the terminal 10. The server 11 may be a single server, or may be a server cluster or a cloud server composed of a plurality of servers. In the image generation method or apparatus disclosed in the present application, a plurality of servers can be grouped into a blockchain, and the servers are nodes on the blockchain.
The terminal 10 may be configured to: the target face image frame transmitted by the server 11 is received. The terminal 10 may include a mobile phone, a smart television, a tablet Computer, a notebook Computer, a Personal Computer (PC), an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, or an aircraft. A client, which may be an application client or a browser client or the like, may also be provided on the terminal 10.
The above-described step of generating the target face image frame by the server 11 may be executed by the terminal 10.
The image generation method provided by the embodiment of the application relates to a computer vision technology and a voice technology in the field of artificial intelligence.
Among them, Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic, and the like.
Computer Vision (CV) technology is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to identify and measure targets, and to perform further graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent traffic, and other technologies, as well as common biometric identification technologies such as face recognition and fingerprint recognition.
The key technologies of Speech Technology include automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is becoming one of the most promising human-computer interaction modes.
The following are detailed descriptions. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
The embodiment will be described from the perspective of an image generation apparatus, which may be specifically integrated in an electronic device, which may be a server or a terminal or the like.
It should be understood that, in the specific implementation of the present application, when the above embodiments involve related data such as user information and are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
The embodiment can be applied to various scenes such as cloud technology, artificial intelligence, intelligent traffic, auxiliary driving and the like.
As shown in fig. 1b, the specific flow of the image generation method may be as follows:
101. Acquiring the original face image frame of the target object and the audio driving information corresponding to the target face image frame to be generated.
The target object may be an object whose facial pose is to be adjusted, the original facial image frame may be an image containing the face of the target object, and the target facial image frame may specifically be the facial image obtained by adjusting the facial pose in the original facial image frame based on the audio driving information. The facial pose mentioned here may specifically refer to facial expression information of the object, such as mouth shape and gaze, which is not limited in this embodiment.
The audio driving information is audio information used for adjusting the face posture of the original face image frame, and specifically, the audio driving information can be used for replacing the face posture of a target object in the original face image frame with a corresponding face posture when the target object speaks so as to obtain the target face image frame, wherein the audio information corresponding to the target object speaking is the audio driving information. The audio length corresponding to the audio driving information may be 1 second or 2 seconds, which is not limited in this embodiment.
In this embodiment, the change in the facial pose of the target object can be determined from information contained in the audio driving information, such as the mouth shape of the target object when speaking. In addition, the emotional change of the target object can be inferred from the speech content and volume contained in the audio driving information, which further determines the change in the facial pose of the target object. Therefore, by extracting audio semantic feature information from the audio driving information, the facial pose information of the target object while speaking can be acquired, and the target face image corresponding to the audio driving information can then be generated.
In a specific scenario, this embodiment may obtain multiple pieces of audio driving information of a target object and, for each piece of audio driving information, generate the corresponding target face image frame from that piece of audio driving information and the original face image frame. The target face image frames corresponding to the individual pieces of audio driving information are then spliced to generate a target face video segment of the target object. The target face video segment shows the facial pose changes of the target object while speaking, and the audio corresponding to that speech consists of the pieces of audio driving information. It should be noted that the subject in the target face video segment is the face of the object in the original face image frame, and the expression (particularly the mouth shape) of the target object in the generated target face video segment corresponds to each piece of audio driving information.
In an embodiment, the present application may also be applied in a video repair scenario. For example, if a speech video of a target object is damaged and some of its video frames are lost, the image generation method provided in the present application can generate the repaired missing frames from the other frames of the speech video and the corresponding audio information. The audio information used for repair may specifically be the audio clip within one second before and after the missing frame in the speech video, and this audio clip is the audio driving information in the above embodiment.
Specifically, as shown in fig. 1c, taking the target object as a human face as an example, a target face image frame generated based on an original face image frame and audio driving information is shown. Among them, the original face image frame may be regarded as an original face image to be driven.
102. Performing spatial feature extraction on the original face image frame to obtain the original facial spatial features corresponding to the original face image frame.
The original facial spatial features may specifically include three-dimensional (3D) facial coefficients corresponding to the original facial image frame; for example, they may include identity, lighting, texture, expression, pose, gaze, and the like. From these face coefficients, the face of the original face image frame can be reconstructed, as shown in fig. 1d.
The spatial feature extraction of the original face image frame may specifically be performing convolution processing, pooling processing, and the like on the original face image frame, which is not limited in this embodiment.
In this embodiment, the original facial image frame may be subjected to spatial feature extraction through an image feature extraction network. The image feature extraction network may specifically be a neural network model, such as a Visual Geometry Group network (VGGNet), a Residual Network (ResNet), or a Densely Connected Convolutional Network (DenseNet); the neural network of this embodiment is not limited to the types listed above.
Specifically, the image feature extraction network is pre-trained, and a three-dimensional face coefficient corresponding to a face image can be predicted through the image feature extraction network.
In a specific scenario, ResNet50 or another network structure may be used to extract the original facial spatial features corresponding to the original facial image frame, and the feature extraction process may be expressed by the following formula (1):

$\mathrm{coeff} = \mathcal{F}_{\mathrm{ResNet50}}(I_{\mathrm{face}})$    (1)

where $\mathrm{coeff}$ denotes the three-dimensional face coefficients, i.e., the original facial spatial features corresponding to the original facial image frame, $\mathcal{F}_{\mathrm{ResNet50}}$ denotes the ResNet50 network, and $I_{\mathrm{face}}$ denotes the original facial image frame.
After the initial original facial spatial features are extracted, features associated with the facial pose of the target object can specifically be screened and acquired from them; for example, three-dimensional facial coefficients such as identity, lighting, texture, expression, pose, and gaze can be extracted from the original facial spatial features and taken as the final original facial spatial features.
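For illustration only, the following is a minimal Python (PyTorch) sketch of the coefficient regression in formula (1) and the subsequent selection of coefficient groups. The ResNet-50 backbone comes from torchvision, while the total coefficient size and the slice boundaries are assumptions for illustration and are not values specified in this application.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class CoeffRegressor(nn.Module):
    """Predicts a flat vector of 3D face coefficients from a face image."""
    def __init__(self, coeff_dim=257):  # 257 is an illustrative total size
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, coeff_dim)
        self.backbone = backbone

    def forward(self, face_image):        # face_image: (B, 3, 224, 224)
        return self.backbone(face_image)  # coeff: (B, coeff_dim)

# Splitting the coefficient vector into named groups; the slice sizes below
# are assumptions for illustration only.
def split_coeff(coeff):
    return {
        "identity":   coeff[:, :80],
        "expression": coeff[:, 80:144],
        "texture":    coeff[:, 144:224],
        "pose":       coeff[:, 224:230],
        "lighting":   coeff[:, 230:257],
    }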
103. Performing time sequence feature extraction on the audio driving information to obtain the local facial pose feature corresponding to the target facial image frame.
Optionally, in this embodiment, the step of "performing time-series feature extraction on the audio driving information to obtain a local facial pose feature corresponding to the target facial image frame" may include:
extracting the characteristics of each audio frame in the audio driving information to obtain audio semantic characteristic information of each audio frame;
processing the audio semantic feature information of each audio frame based on the audio semantic feature information of the preceding and following audio frames of that audio frame;
and fusing the processed audio semantic feature information of each audio frame to obtain the local facial gesture feature corresponding to the target facial image frame.
The step of extracting features of each audio frame in the audio driving information to obtain audio semantic feature information of each audio frame may include: and carrying out convolution operation and pooling operation on each audio frame in the audio driving information through a neural network to obtain audio semantic feature information of each audio frame.
The audio semantic feature information of each audio frame can be processed based on the audio semantic feature information of the preceding and following audio frames by using a memory network model. The memory network model may be a Long Short-Term Memory network (LSTM), a two-layer Gated Recurrent Unit (GRU) network, or the like.
The LSTM can selectively forget part of the historical data through three gate structures (an input gate, a forget gate, and an output gate), add part of the current input data, and finally integrate the current state to produce an output state. The LSTM is well suited to extracting semantic features from time-series data and is often used to extract semantic features from context information in natural language processing tasks. The GRU is also a kind of recurrent neural network and, like the LSTM, is proposed to address long-term dependency and gradient problems in backpropagation.
Optionally, in this embodiment, the step of "fusing the audio semantic feature information of each processed audio frame to obtain the local facial pose feature corresponding to the target facial image frame" may include:
and performing weighted transformation on the audio semantic feature information of each processed audio frame to obtain the local facial posture feature corresponding to the target facial image frame.
It can be understood that other fusion methods may also be used to fuse the audio semantic feature information of each processed audio frame, for example, a splicing method may be used to splice the audio semantic feature information of each processed audio frame to obtain the local facial pose feature corresponding to the target facial image frame.
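As an illustration of the temporal feature extraction described above, the following sketch assumes mel-spectrogram audio frames, a per-frame convolutional encoder with pooling, a bidirectional LSTM over neighbouring frames, and a learned linear (weighted) fusion. All layer names, sizes, and the audio representation are assumptions rather than details given in this application.

import torch
import torch.nn as nn

class AudioPoseEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, pose_dim=70):
        super().__init__()
        # Per-frame feature extraction (convolution + pooling over the
        # spectrogram columns of each audio frame window).
        self.frame_encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Bidirectional LSTM lets each frame's feature be processed in the
        # context of the preceding and following audio frames.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True,
                            bidirectional=True)
        # Weighted (linear) transformation that fuses the per-frame features
        # into a single local facial pose feature for the target frame.
        self.fuse = nn.Linear(2 * hidden, pose_dim)

    def forward(self, audio):                    # audio: (B, T, n_mels, S)
        b, t, m, s = audio.shape
        x = audio.reshape(b * t, m, s)
        x = self.frame_encoder(x).squeeze(-1)    # (B*T, hidden)
        x = x.reshape(b, t, -1)
        x, _ = self.lstm(x)                      # (B, T, 2*hidden)
        pose_feat = self.fuse(x.mean(dim=1))     # (B, pose_dim)
        return pose_feat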
It should be noted that, in some embodiments, the local facial pose feature corresponding to the target facial image frame may also be obtained using text driving information corresponding to the target facial image frame instead of audio driving information. In a specific scenario, for example, when a speech video of a target object is damaged and some of its video frames are lost, the subtitle information corresponding to the missing frames can be used as the text driving information, and the remaining undamaged frames of the speech video can be used as the original face image frame, so that the repaired missing frames are generated from the original face image frame and the text driving information. Specifically, feature extraction may be performed on the text driving information to obtain the local facial pose feature corresponding to the target facial image frame to be generated; spatial feature extraction may be performed on the original facial image frame to obtain the corresponding original facial spatial features; and face reconstruction processing may then be performed based on the original facial spatial features and the local facial pose features to generate the missing target facial image frame.
104. Performing face reconstruction processing on the target object based on the original facial spatial feature and the local facial pose feature to generate the target face image frame.
Optionally, in this embodiment, the step "performing face reconstruction processing on the target object based on the original face spatial feature and the face local pose feature to generate the target face image frame" may include:
fusing the original face spatial feature and the face local pose feature to obtain a fused posterior spatial feature;
performing face reconstruction processing on the target object based on the fused posterior spatial features to obtain a reference face image frame corresponding to the target object;
generating the target facial image frame based on the original facial image frame, the fused posterior spatial feature, and the reference facial image frame.
The local facial pose feature contains part of the facial pose information of the target facial image frame to be generated; for example, it may include related three-dimensional facial coefficients such as expression, pose, and gaze. The original facial spatial features contain the facial pose information of the original facial image frame.
There are various ways to fuse the original facial spatial feature and the local facial pose feature; for example, the fusion may be splicing (concatenation) or weighted fusion, which is not limited in this embodiment.
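For illustration, the two fusion options above might look like the following sketch; the 0.5 weighting and the requirement that both vectors share a dimension in the weighted case are assumptions, not values given in this application.

import torch

def fuse_coefficients(original_spatial_feat, local_pose_feat, mode="concat"):
    # Fuse the original facial spatial feature with the local facial pose
    # feature; both modes below are illustrative assumptions.
    if mode == "concat":
        # Splicing: concatenate the two coefficient vectors along the last dim.
        return torch.cat([original_spatial_feat, local_pose_feat], dim=-1)
    # Weighted fusion: assumes both features share the same dimensionality.
    alpha = 0.5
    return alpha * original_spatial_feat + (1.0 - alpha) * local_pose_feat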
Optionally, in this embodiment, the step "performing face reconstruction processing on the target object based on the fused posterior spatial feature to obtain a reference face image frame corresponding to the target object" may include:
performing face reconstruction processing on the target object based on the fused posterior spatial features to obtain a reconstructed three-dimensional face image corresponding to the target object;
and rendering and mapping the reconstructed three-dimensional face image to obtain a reference face image frame corresponding to the target object.
Specifically, face reconstruction processing may be performed on the target object based on the fused posterior spatial features through a 3D Morphable Model (3DMM) or the like to obtain a reconstructed three-dimensional face image. The face reconstruction processing is also referred to as 3D reconstruction; it can represent the input two-dimensional face image with a 3D mesh (three-dimensional mesh model), which may include the vertex coordinates and colors of the three-dimensional mesh structure. In this embodiment, the 3D mesh (i.e., the reconstructed three-dimensional face image) may further be projected onto a two-dimensional plane by a rendering method, as shown in fig. 1d.
The texture and lighting of the reconstructed three-dimensional facial image can come from the original facial image frame, while its pose and expression can come from the audio driving information. The reconstructed three-dimensional facial image is then rendered and mapped, i.e., the three-dimensional image is projected onto a two-dimensional plane, to obtain the reference facial image frame corresponding to the target object.
Specifically, the fused posterior spatial features may include geometric features and texture features of the target object, and a reconstructed three-dimensional face image may be constructed from them. Here, the geometric feature may be understood as the coordinate information of the key points of the target object's 3D mesh structure, and the texture feature as a feature indicating the texture information of the target object. For example, position information of at least one facial key point can be extracted from the fused posterior spatial features and converted into geometric features, and texture features of the target object can be extracted from the fused posterior spatial features, as shown in formula (2):

$(S, T) = \mathcal{F}_{\mathrm{3DMM}}(\mathrm{Coeff}_{cat})$    (2)

where $\mathrm{Coeff}_{cat}$ denotes the fused posterior spatial features, $\mathcal{F}_{\mathrm{3DMM}}$ denotes the 3DMM model, $S$ denotes the geometric features, and $T$ denotes the texture features.
After the geometric features and the texture features are obtained, a three-dimensional object model of the target object (i.e., the reconstructed three-dimensional face image in the above embodiment) can be constructed and projected onto the two-dimensional plane to obtain the reference face image frame. There are various ways to obtain the reference facial image frame; for example, the three-dimensional model parameters of the target object may be determined from the geometric features and the texture features, the three-dimensional object model of the target object constructed based on these parameters, and the model projected onto the two-dimensional plane, as shown in formula (3):

$I_{rendered} = \mathcal{F}_{\mathrm{render}}(S, T)$    (3)

where $I_{rendered}$ is the reference facial image frame, $S$ denotes the geometric features, $T$ denotes the texture features, and $\mathcal{F}_{\mathrm{render}}$ denotes the function that transforms the three-dimensional model into a two-dimensional planar image.
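A minimal sketch of formulas (2) and (3) follows. It assumes a linear 3DMM with mean shape/texture vectors and PCA bases, and it leaves rasterization to a caller-supplied renderer (for example, a differentiable rasterizer); the basis names and dimensions are assumptions, not parameters defined in this application.

import torch

def reconstruct_mesh(coeff_cat, basis):
    # Formula (2): map fused coefficient groups to geometry S and texture T.
    # coeff_cat is assumed to be a dict of coefficient groups (e.g. from the
    # earlier split_coeff sketch); basis holds the mean shape/texture and PCA
    # bases of a 3DMM with illustrative shapes:
    #   shape_mean, tex_mean: (3N,), id_base: (3N, 80),
    #   exp_base: (3N, 64), tex_base: (3N, 80) for N mesh vertices.
    idc = coeff_cat["identity"]        # (B, 80)
    expc = coeff_cat["expression"]     # (B, 64)
    texc = coeff_cat["texture"]        # (B, 80)
    S = basis["shape_mean"] + idc @ basis["id_base"].T + expc @ basis["exp_base"].T
    T = basis["tex_mean"] + texc @ basis["tex_base"].T
    return S.view(S.shape[0], -1, 3), T.view(T.shape[0], -1, 3)

def render_reference_frame(S, T, renderer):
    # Formula (3): project the reconstructed 3D face onto the 2D plane.
    # renderer is assumed to be any callable that rasterizes vertices and
    # per-vertex colours into an image.
    return renderer(S, T)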
Optionally, in this embodiment, the step of "generating the target facial image frame based on the original facial image frame, the fused posterior spatial feature, and the reference facial image frame" may include:
carrying out multi-scale feature extraction on the original facial image frame to obtain an original facial feature map under multiple scales corresponding to the original facial image frame;
performing multi-scale feature extraction on the reference face image frame to obtain a reference face feature map under multiple scales corresponding to the reference face image frame;
encoding and mapping the fused posterior spatial features to obtain hidden feature information corresponding to the fused posterior spatial features;
and fusing the original facial feature maps under the multiple scales, the reference facial feature maps under the multiple scales and the hidden feature information to obtain the target facial image frame.
The spatial features of the original facial image frame or the reference facial image frame under each preset resolution can be obtained through multi-scale feature extraction, the image scales of the original facial feature maps corresponding to different resolutions are different, and similarly, the image scales of the reference facial feature maps corresponding to different resolutions are different. In this embodiment, by multi-scale extraction, an original spatial feature of an original face image frame and a reference spatial feature of a reference face image frame may be obtained, where the original spatial feature includes original face feature maps at multiple scales, and the reference spatial feature includes reference face feature maps at multiple scales, and therefore, both the original spatial feature of the original face image frame and the reference spatial feature of the reference face image frame are multi-layer spatial features (spatial features).
Various ways of extracting the original facial feature maps of the original facial image frame under multiple scales and the reference facial feature maps of the reference facial image frame under multiple scales are provided, which are specifically as follows:
for example, an encoding network (Enc Block) of a trained image generation model may be used to spatially encode the original face image frame and the reference face image frame at each preset resolution, so as to obtain an original face feature map and a reference face feature map at each resolution.
The encoding network (Enc Block) may include a plurality of sub-encoding networks, each corresponding to one preset resolution; the sub-encoding networks may be arranged in order of increasing resolution to obtain the encoding network, and when the original facial image frame and the reference facial image frame are input to the encoding network, each sub-encoding network outputs the spatial feature at its corresponding preset resolution. The encoding networks used for the original face image frame and the reference face image frame may be the same network or different networks, but different encoding networks share network parameters. The sub-encoding network may have a variety of structures; for example, it may consist of a single simple convolutional layer, or it may use another encoding network structure. The preset resolution may be set according to the actual application; for example, the resolutions may range from 4 × 4 to 512 × 512.
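The following sketch illustrates one possible form of such a multi-scale encoder; the channel counts, the number of levels, and the stride-2 convolutions are assumptions. The same module would be applied to both the original and the reference face image frames so that parameters are shared.

import torch
import torch.nn as nn

class MultiScaleEncoder(nn.Module):
    """Encodes a face image into feature maps at several resolutions."""
    def __init__(self, in_ch=3, base_ch=32, n_levels=7):
        super().__init__()
        blocks, ch = [], in_ch
        for i in range(n_levels):
            out_ch = min(base_ch * 2 ** i, 512)
            blocks.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, 3, stride=2, padding=1),  # halve resolution
                nn.LeakyReLU(0.2),
            ))
            ch = out_ch
        self.blocks = nn.ModuleList(blocks)

    def forward(self, image):               # image: (B, 3, 512, 512)
        feats = []
        x = image
        for block in self.blocks:
            x = block(x)
            feats.append(x)                  # 256x256, 128x128, ..., 4x4
        return feats                         # ordered from large scale to small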
The hidden feature information is specifically an intermediate feature w obtained by encoding and mapping the fused posterior spatial features. Different elements of the intermediate feature w control different visual features, which reduces the correlation between features (decoupling and feature separation). The encoding and mapping process can be implemented by extracting the hidden deep-level relationships underlying the surface features from the fused posterior spatial features and decoupling them, thereby obtaining a hidden feature (latent code). There are various ways to map the fused posterior spatial features into hidden feature information using the trained image generation model; for example, a mapping network of the trained image generation model can map the fused posterior spatial features into the hidden feature information (w).
Optionally, in this embodiment, the step of "fusing the original facial feature maps under the multiple scales, the reference facial feature maps under the multiple scales, and the hidden feature information to obtain the target facial image frame" may include:
fusing the hidden feature information, an original facial feature map under a target scale and a reference facial feature map under the target scale to obtain a corresponding fused facial feature map under the target scale, wherein the target scale is selected from the multiple scales;
and fusing the fused face feature map corresponding to the target scale, the original face feature map in the adjacent scale and the reference face feature map in the adjacent scale to obtain the target face image frame.
Optionally, in this embodiment, the step of fusing the fused facial feature map corresponding to the target scale, the original facial feature map in the adjacent scale, and the reference facial feature map in the adjacent scale to obtain the target facial image frame may include:
based on the hidden feature information, performing style modulation processing on the corresponding fused face feature map under the target scale to obtain modulated style features;
and fusing the modulated style features, the original facial feature map under the adjacent scale and the reference facial feature map under the adjacent scale to obtain the target facial image frame.
The adjacent scale may be the next larger scale than the target scale among the multiple scales. Specifically, if the multiple scales include 4 × 4, 8 × 8, 16 × 16, 32 × 32, and 64 × 64, then when the target scale is 16 × 16 the adjacent scale may be 32 × 32, and when the target scale is 4 × 4 the adjacent scale may be 8 × 8.
Specifically, the present embodiment may adjust the preset basic style feature based on the hidden feature information to obtain the modulated style feature. The preset basic style characteristics can be understood as style characteristics in a constant tensor (Const) preset in the image driving process. By style feature is understood feature information for generating an image of a particular style.
The style modulation processing method includes multiple ways, for example, adjusting the size of the basic style feature to obtain an initial style feature, modulating the hidden feature information to obtain a convolution weight corresponding to the initial style feature, and adjusting the initial style feature based on the convolution weight to obtain a modulated style feature.
The convolution weight may be understood as the weight information used when performing convolution processing on the initial style feature. There are various ways to perform the modulation processing on the hidden feature information; for example, a basic convolution weight may be obtained and adjusted based on the hidden feature information, so as to obtain the convolution weight corresponding to the initial style feature. The convolution weight adjustment based on the hidden feature information can mainly be realized by adopting the Mod and Demod modules in the decoding network of StyleGAN v2 (a style-based generation model).
After the hidden feature information is modulated, convolution weights can be obtained from the modulated information, and the initial style features can be adjusted in various ways; for example, the target style convolution network corresponding to the resolution of the basic facial image may be screened out from the style convolution networks (style_conv) of the trained image generation model, and the initial style features adjusted based on the convolution weights to obtain the modulated style features. The basic face image at the initial resolution is generated based on the preset basic style feature.
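A minimal sketch of such weight modulation and demodulation, in the spirit of StyleGAN2's Mod/Demod, is shown below; the epsilon, the affine mapping from w to per-channel scales, and the grouped-convolution trick are assumptions drawn from the general StyleGAN2 design rather than details given in this application.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    """Style-modulated convolution: the hidden feature w modulates the
    convolution weights that are then applied to the input feature map."""
    def __init__(self, in_ch, out_ch, w_dim, kernel=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel, kernel))
        self.affine = nn.Linear(w_dim, in_ch)   # maps w to per-channel scales
        self.padding = kernel // 2

    def forward(self, x, w):                    # x: (B, in_ch, H, W), w: (B, w_dim)
        b, c, h, wd = x.shape
        style = self.affine(w).view(b, 1, c, 1, 1)                     # Mod
        weight = self.weight.unsqueeze(0) * style                      # (B, out, in, k, k)
        demod = torch.rsqrt((weight ** 2).sum(dim=(2, 3, 4)) + 1e-8)   # Demod
        weight = weight * demod.view(b, -1, 1, 1, 1)
        # Grouped convolution applies a different weight per batch element.
        x = x.view(1, b * c, h, wd)
        weight = weight.view(-1, c, *self.weight.shape[2:])
        out = F.conv2d(x, weight, padding=self.padding, groups=b)
        return out.view(b, -1, h, wd)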
The step of fusing the modulated style features, the original facial feature map in the adjacent scale, and the reference facial feature map in the adjacent scale to obtain the target facial image frame may include:
fusing the modulated style features, the original facial feature map under the adjacent scale and the reference facial feature map under the adjacent scale to obtain a fused facial feature map under the adjacent scale;
and generating a target face image frame by using the fused face feature image and the basic face image under the adjacent scales.
The fused facial feature map can also be regarded as a fused style feature.
The step of generating the target face image frame by using the fused face feature map and the base face image at the adjacent scales may include: and taking the fused face feature map under the adjacent scale as a fused face feature map under a new target scale, returning to execute the step of performing style modulation processing on the fused face feature map corresponding to the target scale based on the hidden feature information to obtain modulated style features until the scale of the obtained target face image frame meets a preset scale condition.
The preset scale condition may specifically be that the scale of the target face image frame is the largest scale of the multiple scales.
Specifically, in this embodiment, based on the preset resolution, a target original spatial feature (that is, the original facial feature map at the target scale) may be screened from the original spatial features, and a target reference spatial feature (that is, the reference facial feature map at the target scale) may be screened from the reference spatial features. This can be done in various ways; for example, the original spatial features and the reference spatial features may be sorted by preset resolution, and based on the sorting information, the original spatial feature with the minimum resolution is screened from the original spatial features as the target original spatial feature, and the reference spatial feature with the minimum resolution is screened from the reference spatial features as the target reference spatial feature. After the target original spatial feature and the target reference spatial feature are screened out, they can be deleted from the original spatial features and the reference spatial features respectively, so that the spatial feature with the minimum remaining resolution can be screened from the original spatial features and the reference spatial features each time to obtain the next target original spatial feature and target reference spatial feature.
After the target original spatial feature and the target reference spatial feature are screened out, the modulated style feature, the target original spatial feature, and the target reference spatial feature may be fused in various ways; for example, they may be directly spliced to obtain the fused style feature at the current resolution, as shown in formula (4):

$F^{style}_{i+1} = \mathrm{Concat}\big(F^{style}_{i},\ F^{orig}_{i},\ F^{ref}_{i}\big)$    (4)

where $F^{style}_{i+1}$ is the fused style feature, which may serve as the style feature corresponding to the next preset resolution of the basic style feature, $F^{style}_{i}$ is the basic (modulated) style feature, $F^{orig}_{i}$ is the target original spatial feature at the preset resolution, $F^{ref}_{i}$ is the target reference spatial feature at the preset resolution, and $\mathrm{Concat}$ denotes concatenating or splicing the features, after which the style convolution network processes the concatenated feature.
After the fused style feature at the current resolution is obtained, a target face image frame at the target resolution may be generated based on the fused style feature and the base face image. There are various ways to do this. For example, a current face image may be generated based on the fused style feature and fused with the base face image to obtain a fused face image at the current resolution; the fused style feature is then taken as the preset basic style feature and the fused face image as the base face image, and the step of adjusting the preset basic style feature based on the hidden feature information is executed again, until the current resolution reaches the target resolution, at which point the target face image frame is obtained.
In the process of generating the target face image frame with the target resolution, it can be found that the current face image and the basic face image under different resolutions are sequentially superposed, and in the superposing process, the resolutions are sequentially increased, so that the high-definition target face image frame can be output.
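The coarse-to-fine procedure described above might be sketched as follows; the FusionBlock structure, the bilinear upsampling, the simple channel-wise modulation (standing in for Mod/Demod), and the final to_rgb mapping are all illustrative assumptions rather than the structure defined in this application.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """One decoder level: modulate the fused map with w, upsample it to the
    adjacent (larger) scale, then fuse it with the original and reference
    feature maps at that scale. Layer choices are assumptions."""
    def __init__(self, in_ch, skip_ch, out_ch, w_dim):
        super().__init__()
        self.affine = nn.Linear(w_dim, in_ch)
        self.conv = nn.Conv2d(in_ch + 2 * skip_ch, out_ch, 3, padding=1)

    def forward(self, fused, w, orig, ref):
        # Simple channel-wise style modulation stands in for Mod/Demod here.
        fused = fused * self.affine(w).unsqueeze(-1).unsqueeze(-1)
        fused = F.interpolate(fused, scale_factor=2, mode="bilinear",
                              align_corners=False)
        # Formula (4): concatenate with the original / reference maps at the
        # adjacent scale, then mix with a convolution.
        fused = torch.cat([fused, orig, ref], dim=1)
        return self.conv(fused)

def coarse_to_fine_generate(const_input, w, orig_feats, ref_feats, blocks,
                            to_rgb):
    # const_input: preset constant tensor at the smallest scale (e.g. 4x4).
    # orig_feats / ref_feats: feature maps ordered from the scale adjacent to
    # const_input up to the largest scale (e.g. the reversed output of a
    # multi-scale encoder). to_rgb maps the final fused feature map to the
    # target face image frame; all of this is an assumption.
    fused = const_input
    for block, orig, ref in zip(blocks, orig_feats, ref_feats):
        fused = block(fused, w, orig, ref)
    return to_rgb(fused)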
Optionally, a basic optical flow field at the initial resolution may be generated based on the basic style features, so as to output a target optical flow field at the target resolution. The basic optical flow field may be understood as a field at the initial resolution that indicates the motion of the facial image, which is equivalent to the movement of the facial key points. The target optical flow field may be output in various ways; for example, a basic optical flow field at the initial resolution may be generated based on the basic style features, and the modulated style features, the original spatial features and the reference spatial features are fused according to the basic optical flow field to obtain the target optical flow field at the target resolution.
There are various ways of fusing the modulated style features, the original spatial features and the reference spatial features; for example, a target original spatial feature may be screened from the original spatial features based on the preset resolution, a target reference spatial feature may be screened from the reference spatial features, the modulated style feature, the target original spatial feature and the target reference spatial feature are fused to obtain a fused style feature at the current resolution, and the target optical flow field at the target resolution is generated based on the fused style feature and the basic optical flow field.
For example, a current optical flow field may be generated based on the fused style feature and the basic optical flow field, the current optical flow field and the basic optical flow field are fused to obtain a fused optical flow field at the current resolution, the fused style feature is then taken as the preset basic style feature and the fused optical flow field as the basic optical flow field, and the process returns to the step of adjusting the preset basic style feature based on the hidden feature information, until the current resolution is the target resolution, so that the target optical flow field is obtained.
The target face image frame and the target optical flow field may be generated simultaneously based on the preset basic style features, or the target face image frame or the target optical flow field may be generated separately based on the preset basic style features. Taking the simultaneous generation of the target face image frame and the target optical flow field as an example, the preset basic style features, the basic face image and the basic optical flow field may be processed through the decoding network of the trained image generation model. The decoding network may include a decoding subnetwork corresponding to each preset resolution, with the resolution increasing from 4 × 4 to 512 × 512; the network structure of a decoding subnetwork may be as shown in fig. 1e. Each decoding subnetwork receives the fused style feature ($w_i$), the fused face image ($I_i$) and the fused optical flow field ($f_i$) output by the previous decoding subnetwork; the hidden feature information $w$ is used to modulate the convolution weight corresponding to $w_i$, and convolution processing is performed on $w_i$ with that weight to obtain the modulated style feature ($\hat{w}_i$). Based on the resolution corresponding to the decoding subnetwork, the target original spatial feature corresponding to the decoding subnetwork ($F_i^{s}$) is screened from the original spatial features, and the target reference spatial feature corresponding to the decoding subnetwork ($F_i^{r}$) is screened from the reference spatial features; $F_i^{s}$ and $F_i^{r}$ have the same spatial resolution as $\hat{w}_i$. Then $\hat{w}_i$, $F_i^{s}$ and $F_i^{r}$ are connected in series (concatenated) to obtain the fused style feature output by this decoding subnetwork ($w_{i+1}$). Based on $w_{i+1}$, a current face image and a current optical flow field are generated; the current face image is fused with the $I_i$ output by the previous decoding subnetwork to obtain the fused face image output by the current decoding subnetwork ($I_{i+1}$), and the current optical flow field is fused with the fused optical flow field ($f_i$) output by the previous decoding subnetwork to obtain the fused optical flow field output by the current decoding subnetwork ($f_{i+1}$). These are then passed to the next decoding subnetwork, until the decoding subnetwork corresponding to the target resolution outputs its fused face image and fused optical flow field, so that the fused face image at the target resolution (specifically, resolution 512 × 512) can be used as the target face image frame, and the fused optical flow field at the target resolution can be used as the target optical flow field.
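A minimal sketch of one such decoding subnetwork is given below, assuming PyTorch; reducing the style modulation to a channel-wise scaling, the 2× bilinear upsampling, and all layer widths are simplifications introduced for readability, not a faithful reproduction of fig. 1e.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecodeSubnetwork(nn.Module):
    """Simplified sketch of one decoding subnetwork: modulate the incoming fused style
    feature with the hidden feature w, splice the matching-resolution original/reference
    spatial features, and update the fused face image and fused optical flow field."""
    def __init__(self, ch, spatial_ch, w_dim=512):
        super().__init__()
        self.scale = nn.Linear(w_dim, ch)                    # modulation weights derived from w
        self.fuse = nn.Conv2d(ch + 2 * spatial_ch, ch, 3, padding=1)
        self.to_img = nn.Conv2d(ch, 3, 1)                    # current face image (RGB)
        self.to_flow = nn.Conv2d(ch, 2, 1)                   # current optical flow field (dx, dy)

    def forward(self, w_i, img_i, flow_i, w_latent, f_src, f_ref):
        # upsample everything coming from the previous (lower-resolution) subnetwork
        w_up = F.interpolate(w_i, scale_factor=2, mode='bilinear', align_corners=False)
        img_up = F.interpolate(img_i, scale_factor=2, mode='bilinear', align_corners=False)
        flow_up = F.interpolate(flow_i, scale_factor=2, mode='bilinear', align_corners=False)
        # modulated style feature w_hat_i (channel-wise scaling as a stand-in for weight modulation)
        w_hat = w_up * self.scale(w_latent).unsqueeze(-1).unsqueeze(-1)
        # splice the screened spatial features that share this subnetwork's resolution
        w_next = F.leaky_relu(self.fuse(torch.cat([f_src, f_ref, w_hat], dim=1)), 0.2)
        # generate the current face image / flow field and fuse them with the previous outputs
        img_next = img_up + self.to_img(w_next)              # fused face image I_{i+1}
        flow_next = flow_up + self.to_flow(w_next)           # fused optical flow field f_{i+1}
        return w_next, img_next, flow_next
```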
Optionally, in this embodiment, the step of "performing spatial feature extraction on the original face image frame to obtain an original face spatial feature corresponding to the original face image frame" may include:
extracting the spatial features of the original facial image frame through an image generation model to obtain the original facial spatial features corresponding to the original facial image frame;
the step of performing timing characteristic extraction on the audio driving information to obtain a local facial posture characteristic corresponding to the target facial image frame may include:
performing time sequence feature extraction on the audio driving information through the image generation model to obtain a local facial posture feature corresponding to the target facial image frame;
the step of performing face reconstruction processing on the target object based on the original face spatial feature and the face local pose feature to generate the target face image frame may include:
and performing face reconstruction processing on the target object based on the original face space feature and the face local posture feature through the image generation model to generate the target face image frame.
The image generation model may be a Visual Geometry Group Network (VGGNet), a Residual Network (ResNet), a Densely Connected Convolutional Network (DenseNet), and the like, but it should be understood that the image generation model of the present embodiment is not limited to the types listed above.
The image generation model may be trained from a plurality of sets of training data, and may be specifically provided to the image generation apparatus after being trained by another device, or may be trained by the image generation apparatus itself.
If the image generation device performs the training by itself, before the step of performing spatial feature extraction on the original face image frame through an image generation model to obtain an original face spatial feature corresponding to the original face image frame, the method may further include:
acquiring training data, wherein the training data comprises an original face image frame sample of a sample object, a target driving face image frame sample and an audio driving information sample corresponding to the target driving face image frame sample;
performing spatial feature extraction on the original facial image frame sample through a preset image generation model to obtain original facial spatial features corresponding to the original facial image frame sample;
performing time sequence feature extraction on the audio driving information sample to obtain a local facial posture feature corresponding to the target driving facial image frame sample;
based on the original face space feature and the face local posture feature, carrying out face reconstruction processing on the sample object to obtain a prediction driving face image frame;
and adjusting parameters of a preset image generation model based on the target driving face image frame sample and the prediction driving face image frame to obtain a trained image generation model.
The target driving face image frame sample may be regarded as tag information, and specifically may be a desired driving face image frame corresponding to the audio driving information sample.
There are various ways to obtain an original face image frame sample, a target driving face image frame sample, and an audio driving information sample corresponding to the target driving face image frame sample of a sample object, which is not limited in this embodiment.
For example, any two video frames containing the face of the sample object may be extracted from a speech video of the sample object, one frame may be used as the original face image frame sample and the remaining frame as the target driving face image frame sample, and the audio information in the speech video within 1 second before and after the target driving face image frame sample may be used as the audio driving information sample.
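A possible way to assemble such a training triplet is sketched below; the helper assumes the video frames and the audio waveform have already been decoded, and all variable names are illustrative.

```python
import random

def sample_training_triplet(video_frames, audio_track, fps, sample_rate):
    """Sketch of building one training sample from a talking-head video.
    `video_frames` is a list of face images, `audio_track` a 1-D waveform array;
    both are assumed to be pre-extracted (the decoding step itself is omitted)."""
    i, j = random.sample(range(len(video_frames)), 2)
    source_frame = video_frames[i]          # original face image frame sample I_source
    drive_frame = video_frames[j]           # target driving face image frame sample I_drive (GT)
    # audio window of +/- 1 second centred on the driving frame -> audio driving sample V_drive
    centre = int(j / fps * sample_rate)
    half = sample_rate                      # one second of samples on each side
    audio_window = audio_track[max(0, centre - half): centre + half]
    return source_frame, drive_frame, audio_window
```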
Optionally, in this embodiment, the step "adjusting parameters of a preset image generation model based on the target driving face image frame sample and the prediction driving face image frame to obtain a trained image generation model", may include:
performing spatial feature extraction on the target driving face image frame sample to obtain a target face spatial feature corresponding to the target driving face image frame sample;
determining first loss information based on the local facial pose characteristics corresponding to the target driving facial image frame samples and the target facial space characteristics;
determining second loss information based on the target drive face image frame sample and the predicted drive face image frame;
and adjusting parameters of a preset image generation model according to the first loss information and the second loss information to obtain the trained image generation model.
The spatial feature extraction performed on the target driving face image frame sample may be, for example, convolution processing and pooling processing performed on the target driving face image frame sample, which is not limited in this embodiment. The embodiment can perform spatial feature extraction on the target-driven facial image frame sample through a trained image feature extraction network.
The target spatial facial features corresponding to the extracted target-driven facial image frame samples may specifically include three-dimensional (3D, 3-dimensional) facial coefficients corresponding to the target-driven facial image frame samples, and for example, the target spatial features may specifically include identity information (identity), lighting (lighting), texture (texture), expression (expression), pose (position), gaze (size), and the like.
The embodiment can use the target face space feature corresponding to the target driving face image frame sample as a supervision signal of the local face posture feature extracted from the audio driving information sample. Specifically, the present embodiment may calculate a vector distance between the local facial pose feature corresponding to the target driving facial image frame sample and the spatial feature of the target face, and determine the first loss information according to the vector distance, where the larger the vector distance is, the larger the loss value corresponding to the first loss information is, and conversely, the smaller the vector distance is, the smaller the loss value corresponding to the first loss information is.
The step of adjusting parameters of a preset image generation model according to the first loss information and the second loss information to obtain a trained image generation model may include:
fusing the first loss information and the second loss information to obtain total loss information;
and adjusting parameters of a preset image generation model according to the total loss information to obtain a trained image generation model.
There are various fusion manners of the first loss information and the second loss information, which is not limited in this embodiment, for example, the fusion manner may be weighted fusion or the like.
The training process of the preset image generation model includes first calculating the total loss information, then adjusting the parameters of the preset image generation model by using a back propagation algorithm, and optimizing the parameters of the image generation model based on the total loss information, so that the loss value corresponding to the total loss information becomes smaller than a preset loss value, thereby obtaining the trained image generation model. For example, the higher the accuracy requirement on the image generation model, the smaller the preset loss value.
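The following sketch shows how one such optimization step might look in PyTorch, assuming a generator that also returns the audio-predicted local pose feature and a dictionary of loss callables; every module name, key name and batch layout is an assumption made for illustration.

```python
def train_step(gen, disc, gen_opt, disc_opt, batch, losses):
    """Sketch of one optimisation step driven by the fused total loss."""
    src, drive, audio, drive_coeffs = batch   # drive_coeffs: 3D coefficients of the GT frame

    pred, audio_pose = gen(src, audio)        # predicted frame and predicted local pose feature

    # discriminator step: distinguish the real driving frame from the generated one
    disc_opt.zero_grad()
    losses['adv_d'](disc(drive), disc(pred.detach())).backward()
    disc_opt.step()

    # generator step: fuse adversarial, reconstruction, first (3D) and identity losses
    gen_opt.zero_grad()
    total = (losses['adv_g'](disc(pred))
             + losses['rec'](pred, drive)
             + losses['first'](audio_pose, drive_coeffs)
             + losses['id'](pred, drive))
    total.backward()
    gen_opt.step()
    return float(total)
```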
Alternatively, in this embodiment, the step of "determining second loss information based on the target driving face image frame sample and the predicted driving face image frame" may include:
respectively predicting the probabilities that the target driving face image frame sample and the predicted driving face image frame belong to a real driving face image frame, and determining countermeasure loss information of a preset image generation model based on the probabilities;
determining reconstruction loss information of a preset image generation model based on the similarity between the target driving face image frame sample and the prediction driving face image frame;
respectively carrying out identity recognition on the target driving face image frame sample and the prediction driving face image frame, and determining identity loss information of a preset image generation model based on an identity recognition result;
and determining second loss information according to the countermeasure loss information, the reconstruction loss information and the identity loss information.
In this embodiment, the step of "predicting probabilities that the target driving face image frame sample and the predicted driving face image frame belong to real driving face image frames, respectively, and determining countermeasure loss information of a preset image generation model based on the probabilities" may include:
predicting first probability information that the target driving face image frame sample belongs to a real driving face image frame through a preset discrimination model;
predicting second probability information that the predicted driving face image frame belongs to a real driving face image frame through the preset discrimination model;
determining countermeasure loss information of a preset image generation model based on the first probability information and the second probability information.
The preset discrimination model is a discriminator D. In the training process, the target driving face image frame sample is a real image, the predicted driving face image frame is a generation result of the preset image generation model, and the discriminator needs to judge the generation result as false and the real image as true. The preset image generation model can be regarded as an integral driving network G; in the training and learning process, the image generated by the driving network G needs to be able to deceive the discriminator D, that is, the probability with which the discriminator D judges that a predicted driving face image frame generated by the driving network G belongs to a real driving face image frame should be as close to 1 as possible.
The input of the discriminator is either a real image or the output of the image generation model, and its aim is to distinguish the output of the image generation model from real images as reliably as possible, while the image generation model tries to deceive the discriminator. The image generation model and the discriminator are trained against each other, with their parameters continuously adjusted, so that the trained image generation model is obtained.
In this embodiment, the step of "determining reconstruction loss information of a preset image generation model based on the similarity between the target driving face image frame sample and the prediction driving face image frame" may include:
performing feature extraction on the target driving face image frame sample to obtain first feature information corresponding to the target driving face image frame sample;
feature extraction is carried out on the prediction driving face image frame, and second feature information corresponding to the prediction driving face image frame is obtained;
and determining reconstruction loss information of a preset image generation model according to the similarity between the first characteristic information and the second characteristic information.
The vector distance between the feature vector corresponding to the first feature information and the feature vector corresponding to the second feature information can be calculated, the similarity between the first feature information and the second feature information is determined according to the vector distance, and the larger the vector distance is, the lower the similarity is, and the larger the loss value corresponding to the reconstruction loss information is; conversely, the smaller the vector distance, the higher the similarity, and the smaller the loss value corresponding to the reconstruction loss information.
In this embodiment, the steps of "respectively performing identity recognition on the target driving face image frame sample and the prediction driving face image frame, and determining identity loss information of a preset image generation model based on an identity recognition result" may include:
performing identity recognition on the target driving face image frame sample to obtain a first identity recognition result;
performing identity recognition on the prediction driving face image frame to obtain a second identity recognition result;
and comparing the first identity recognition result with the second identity recognition result to obtain identity loss information of a preset image generation model.
And if the first identity recognition result is the same as the second identity recognition result, the identity loss information of the preset image generation model is 0.
Wherein the step of determining second loss information according to the countermeasure loss information, the reconstruction loss information, and the identity loss information may include:
and fusing the countermeasure loss information, the reconstruction loss information and the identity loss information to obtain second loss information.
There are various ways of fusing the countermeasure loss information, the reconstruction loss information, and the identity loss information, which is not limited in this embodiment. For example, the fusion method may be weighted fusion or the like.
Specifically, in the training process of the preset image generation model G, the original face image frame sample of the sample object may be denoted as $I_{source}$, the target driving face image frame sample as $I_{drive}$, the audio driving information sample corresponding to the target driving face image frame sample as $V_{drive}$, and the predicted driving face image frame generated by the preset image generation model as $G(I_{source}, V_{drive})$. If the preset discrimination model is denoted as D, the countermeasure loss information in the above embodiment may be represented by formula (5):

$$L_{GAN} = \mathbb{E}\big[\log D(I_{drive})\big] + \mathbb{E}\big[\log\big(1 - D(G(I_{source}, V_{drive}))\big)\big] \tag{5}$$

wherein $D(I_{drive})$ denotes the first probability information with which the discriminator predicts that the target driving face image frame sample belongs to a real driving face image frame in the above embodiment, $D(G(I_{source}, V_{drive}))$ denotes the second probability information with which the discriminator predicts that the predicted driving face image frame belongs to a real driving face image frame in the above embodiment, and $L_{GAN}$ denotes the countermeasure loss information of the preset image generation model.
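A common realization of this adversarial objective, here in its non-saturating binary cross-entropy form (which differs slightly from the log form written in formula (5)), is sketched below.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(d_real_logits, d_fake_logits):
    """Non-saturating GAN losses; logits come from the discriminator D applied to
    I_drive (real) and G(I_source, V_drive) (generated)."""
    d_loss = (F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
              + F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits)))
    g_loss = F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))
    return d_loss, g_loss
```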
And the reconstruction loss information may be expressed by formula (6):

$$L_{rec} = \big\| G(I_{source}, V_{drive}) - I_{drive} \big\|_1 + \mathrm{LPIPS}\big(G(I_{source}, V_{drive}),\ I_{drive}\big) \tag{6}$$

wherein the reconstruction loss information $L_{rec}$ uses an L1 loss function and an LPIPS (perceptual) loss function; its purpose is to make the image generated by the trained image generation model consistent with the real driving image (i.e. the target driving face image frame sample).
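Assuming the third-party lpips package for the perceptual term, formula (6) could be computed roughly as follows.

```python
import torch.nn.functional as F
import lpips   # perceptual similarity package; assumed available

# LPIPS perceptual metric on a pretrained backbone (assumption: the `lpips` package API)
perceptual = lpips.LPIPS(net='vgg')

def reconstruction_loss(pred_frame, drive_frame):
    """Sketch of formula (6): pixel-level L1 term plus LPIPS perceptual term."""
    l1_term = F.l1_loss(pred_frame, drive_frame)
    lpips_term = perceptual(pred_frame, drive_frame).mean()
    return l1_term + lpips_term
```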
The identity loss information may be expressed by formula (7):

$$L_{ID} = \big\| \phi_{id}\big(G(I_{source}, V_{drive})\big) - \phi_{id}(I_{drive}) \big\| \tag{7}$$

Since the identity information of the target face image frame generated by the trained image generation model should be consistent with the identity information of the driving image (i.e. the target driving face image frame corresponding to the audio driving information), the identity loss information $L_{ID}$ is set in the training process, wherein $\phi_{id}$ represents the identity extraction network.
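A sketch of this identity constraint is given below; the cosine-distance form and the face-recognition backbone id_net are assumptions, since the text only requires a distance between the two identity recognition results.

```python
import torch.nn.functional as F

def identity_loss(id_net, pred_frame, drive_frame):
    """Sketch of formula (7): compare identity embeddings of the generated frame and the
    driving frame; `id_net` stands in for the identity extraction network."""
    e_pred = F.normalize(id_net(pred_frame), dim=-1)
    e_drive = F.normalize(id_net(drive_frame), dim=-1)
    # cosine-distance form; an L1/L2 distance over the embeddings is an equally valid reading
    return (1.0 - (e_pred * e_drive).sum(dim=-1)).mean()
```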
The first loss information may be represented by formula (8):

$$L_{3d} = \big\| exp_{pred} - exp_{drive} \big\|_2 + \big\| pose_{pred} - pose_{drive} \big\|_2 + \big\| gaze_{pred} - gaze_{drive} \big\|_2 \tag{8}$$

The first loss information is an L2 loss function: the local facial pose feature predicted by the trained image generation model from the audio driving information should be the same as the target facial spatial feature of the driving image (i.e. the target driving face image frame corresponding to the audio driving information), so the first loss information is set in the training process. Here exp represents the expression feature, pose represents the pose feature, and gaze represents the eye (gaze) feature.
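Formula (8) could be computed as follows, assuming the predicted local pose feature and the target 3D coefficients are both provided as dictionaries keyed by coefficient group; the key names are illustrative.

```python
import torch.nn.functional as F

def first_loss(pred_pose, target_coeffs):
    """Sketch of formula (8): L2 supervision of the audio-predicted local facial pose
    feature by the expression / pose / gaze coefficients of the driving frame."""
    return sum(F.mse_loss(pred_pose[k], target_coeffs[k]) for k in ('exp', 'pose', 'gaze'))
```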
In some embodiments, the total loss information L may be as shown in equation (9):
$$L = L_{GAN} + L_{rec} + L_{3d} + L_{ID} \tag{9}$$
In a specific scenario, the whole training process of the preset image generation model may be as shown in fig. 1f. Specifically, any two video frames containing the face of the sample object may be extracted from a speech video of the sample object, one frame may be used as the original face image frame sample ($I_s$), the remaining frame as the target driving face image frame sample ($I_d$), and the audio information in the speech video within 1 second before and after the target driving face image frame sample as the audio driving information sample. It should be noted that the target driving face image frame sample ($I_d$) may be referred to as the ground truth (GT); in supervised machine learning, the GT represents the true label of the training set used for supervision.
In the specific training process of the preset image generation model, spatial feature extraction may be performed on the original face image frame sample through the image feature extraction network in the preset image generation model to obtain the original facial spatial features corresponding to the original face image frame sample, where the original facial spatial features may include facial three-dimensional coefficients such as identity, lighting and texture. Time-series feature extraction may be performed on the audio driving information sample through the audio semantic extraction network in the preset image generation model to obtain the local facial pose feature corresponding to the target driving face image frame sample, where the local facial pose feature may contain partial facial pose information such as pose, eye gaze and expression. The original facial spatial features and the local facial pose feature may then be fused to obtain the fused back facial spatial features, and face reconstruction processing may be performed on the sample object through the reconstruction network based on the fused back facial spatial features to obtain a reference face image frame of the sample object.
After the reference face image frame is obtained, the original face image frame sample, the reference face image frame and the fused back facial spatial features may be fused through the generation network in the preset image generation model to obtain the predicted driving face image frame. Specifically, this part of the generation network is mainly built on StyleGAN v2 and comprises an encoding network (Enc Block), a mapping network and a decoding network. The encoding network can perform multi-scale feature extraction on the original face image frame sample and the reference face image frame respectively to obtain multi-layer spatial features of the original face image frame sample and the reference face image frame; the lowest resolution output by the encoding network is 4 × 4 and the highest is 512 × 512, so that its outputs can be matched one by one with the features output by the decoding modules, and each encoding module may consist of a single simple convolution layer. The mapping network can map the fused back facial spatial features into the hidden feature w in a hidden space, and the decoding network decodes the encoded spatial features corresponding to each preset resolution. In the decoding process, the basic style feature, the basic face image and the basic optical flow field are generated from a constant tensor (const), and the basic face image and the current face image are superimposed in order of resolution from small to large, so that the predicted driving face image frame at the target resolution is obtained. Total loss information is then calculated based on the target driving face image frame sample and the predicted driving face image frame, and the parameters of the preset image generation model are adjusted based on the total loss information to obtain the trained image generation model.
When the preset image generation model is trained, the generation network (mapping network and decoding network) and the discriminator in the preset image generation model may be trained in advance, and the coding network needs to be trained from the beginning, so that, during training, the learning rates of the three networks are different, and the ratio of the learning rates may be set according to practical applications, for example, the ratio of the learning rates of the coding network, the generation network and the discriminator may be 100: 10: 1 or other ratio.
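In PyTorch, such per-network learning rates can be expressed with optimizer parameter groups; the base learning rate, the betas, and the module names encoder, mapping, decoder and discriminator below are assumptions standing in for the coding network, generation network and discriminator of the text.

```python
import torch

# assumed learning-rate ratio of 100 : 10 : 1 for encoder / generator / discriminator
base_lr = 1e-5
gen_opt = torch.optim.Adam([
    {'params': encoder.parameters(), 'lr': 100 * base_lr},   # coding network, trained from scratch
    {'params': mapping.parameters(), 'lr': 10 * base_lr},    # generation network, pre-trained
    {'params': decoder.parameters(), 'lr': 10 * base_lr},
], betas=(0.0, 0.99))
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=base_lr, betas=(0.0, 0.99))
```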
The trained image generation model can output the target face image frame and the target optical flow field simultaneously, and can also output the target face image frame or the target optical flow field independently.
Taking the case where the trained image generation model outputs a target face image frame as an example, the process of outputting the target face image frame at the target resolution through the trained image generation model may be as shown in fig. 1g. Specifically, spatial feature extraction may be performed on the original face image frame through the image feature extraction network in the image generation model to obtain the original facial spatial features corresponding to the original face image frame, where the original facial spatial features may include facial three-dimensional coefficients such as identity, lighting and texture. Time-series feature extraction is then performed on the audio driving information through the audio semantic extraction network in the image generation model to obtain the local facial pose feature corresponding to the target face image frame, where the local facial pose feature may comprise partial facial pose information such as pose, eye gaze and expression. The original facial spatial features and the local facial pose feature are further fused to obtain the fused back facial spatial features, and face reconstruction processing is performed on the target object through the reconstruction network based on the fused back facial spatial features to obtain a reference face image frame of the target object.
After the reference face image frame is obtained, the original face image frame, the reference face image frame and the fused back facial spatial features may be fused through the generation network in the image generation model to obtain the target face image frame. Specifically, this part of the generation network is mainly built on StyleGAN v2 and comprises an encoding network (Enc Block), a mapping network and a decoding network. The encoding network can perform multi-scale feature extraction on the original face image frame and the reference face image frame respectively to obtain multi-layer spatial features of the original face image frame and the reference face image frame; the lowest resolution output by the encoding network is 4 × 4 and the highest is 512 × 512, so that its outputs can be matched one by one with the features output by the decoding modules. The mapping network can map the fused back facial spatial features into the hidden feature w in a hidden space, and the decoding network decodes the encoded spatial features corresponding to each preset resolution. In the decoding process, the basic style feature, the basic face image and the basic optical flow field are generated from a constant tensor (const), and the basic face image and the current face image are superimposed in order of resolution from small to large, so that the target face image frame at the target resolution is obtained; the texture of the target face image frame is consistent with the texture of the original face image frame, and its pose and mouth shape conform to the audio driving information.
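Pulling the stages together, inference might be organized as in the sketch below; the sub-module names on model mirror the description above but are assumptions, not an actual API.

```python
import torch

@torch.no_grad()
def drive_face(model, source_frame, audio_window):
    """End-to-end inference sketch for one original face image frame and one audio window."""
    src_feat = model.image_encoder(source_frame)            # original facial spatial features
    pose_feat = model.audio_encoder(audio_window)           # local facial pose features
    fused = torch.cat([src_feat, pose_feat], dim=-1)        # fused back facial spatial features
    ref_frame = model.reconstruct(fused)                    # rendered reference face image frame
    target_frame = model.generator(source_frame, ref_frame, fused)   # target face image frame
    return target_frame
```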
As can be seen from the above, the present embodiment may obtain the original face image frame of the target object and the audio driving information corresponding to the target face image frame to be generated; extracting the spatial features of the original facial image frame to obtain the original facial spatial features corresponding to the original facial image frame; performing time sequence feature extraction on the audio driving information to obtain a local facial posture feature corresponding to the target facial image frame; and performing face reconstruction processing on the target object based on the original face space feature and the face local posture feature to generate the target face image frame. According to the method and the device, the audio driving information can be subjected to feature extraction, the facial posture detail information of the target object part is captured, and then the original facial image frame is subjected to facial adjustment based on the captured information, so that the target facial image frame corresponding to the audio driving information is obtained, and the generation efficiency and accuracy of the target facial image frame are improved.
The method described in the foregoing embodiment will be described in further detail below with an example in which the image generating apparatus is specifically integrated in a server.
An embodiment of the present application provides an image generation method, and as shown in fig. 2, a specific flow of the image generation method may be as follows:
201. the server acquires an original face image frame of a target object and audio driving information corresponding to a target face image frame to be generated.
The target object may be an object whose facial pose is to be adjusted, the original facial image frame may be an image including a face of the target object, and the target facial image frame may specifically be a corresponding facial image obtained by adjusting the facial pose in the original facial image frame based on the audio driving information. The facial gesture mentioned here may specifically refer to facial expressions, such as the mouth shape, eye spirit and other facial information of the subject, which is not limited in this embodiment.
The audio driving information is audio information used for adjusting the face posture of the original face image frame, and specifically, the audio driving information can be used for replacing the face posture of a target object in the original face image frame with a corresponding face posture when the target object speaks so as to obtain the target face image frame, wherein the audio information corresponding to the target object speaking is the audio driving information. In this embodiment, the change of the facial pose of the target object can be determined by using the change information, such as the mouth shape, of the target object when the target object speaks, which is included in the audio driving information; in addition, the emotion change of the target object can be judged by utilizing the speaking content and the volume of the target object contained in the audio driving information, and the face posture change of the target object is further determined, so that the face posture information of the target object during speaking can be acquired by extracting the audio semantic feature information from the audio driving information, and a target face image corresponding to the audio driving information is further generated.
202. And the server extracts the spatial features of the original face image frame to obtain the original face spatial features corresponding to the original face image frame.
The original facial spatial feature may specifically include a three-dimensional (3D, 3-dimensional) facial coefficient corresponding to the original facial image frame, and for example, may specifically include identity information (identity), lighting (lighting), texture (texture), expression (expression), pose (position), gaze (size), and the like.
203. And the server extracts time sequence characteristics of the audio driving information to obtain local facial posture characteristics corresponding to the target facial image frame.
Optionally, in this embodiment, the step of "performing time-series feature extraction on the audio driving information to obtain a local facial pose feature corresponding to the target facial image frame" may include:
extracting the characteristics of each audio frame in the audio driving information to obtain audio semantic characteristic information of each audio frame;
processing the audio semantic feature information of each audio frame based on the audio semantic feature information of the front and rear audio frames of each audio frame;
and fusing the processed audio semantic feature information of each audio frame to obtain the local facial gesture feature corresponding to the target facial image frame.
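As one possible realization of this time-series extraction, per-frame audio features can be placed in the context of their neighbouring frames by a bidirectional recurrent layer before being fused into a single local facial pose feature; in the sketch below, the mel-spectrogram input, the GRU, and all feature sizes are assumptions.

```python
import torch
import torch.nn as nn

class AudioPoseEncoder(nn.Module):
    """Sketch of the time-series feature extraction: per-audio-frame semantic features are
    mixed with the features of preceding and following frames, then pooled into one local
    facial pose feature vector."""
    def __init__(self, in_dim=80, hidden=256, out_dim=64):
        super().__init__()
        self.frame_net = nn.Linear(in_dim, hidden)            # per-audio-frame semantic feature
        self.context = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, out_dim)            # expression / pose / gaze code

    def forward(self, mel_frames):                            # (batch, num_frames, in_dim)
        per_frame = torch.relu(self.frame_net(mel_frames))
        contextual, _ = self.context(per_frame)               # mixes each frame with its neighbours
        return self.head(contextual.mean(dim=1))              # fuse frames into one pose feature
```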
It should be noted that, in some embodiments, the local facial pose feature corresponding to the target face image frame may also be obtained by using text driving information corresponding to the target face image frame instead of using audio driving information. In a specific scenario, for example, when a certain speech video of a target object is damaged, and a part of video frames in the speech video are lost, the subtitle information corresponding to the lost video frames can be used as the text-driven information, and other video frames that are not lost in the speech video can be used as the original face image frame and the text-driven information to generate a repaired lost video frame. Specifically, feature extraction can be performed on the text-driven information to obtain local facial posture features corresponding to a target facial image frame to be generated; and extracting the spatial features of the original face image frame to obtain the original face spatial features corresponding to the original face image frame, and performing face reconstruction processing based on the original face spatial features and the face local posture features to generate the lost target face image frame.
204. And the server fuses the original face spatial features and the face local attitude features to obtain fused back face spatial features.
The local facial pose features include part of facial pose information in a target facial image frame to be generated, for example, the local facial pose features may include related three-dimensional facial coefficients such as expression (expression), pose (position), catch (size), and the like; and the original facial spatial features contain facial pose information for the original facial image frame.
There are various fusion manners of the original facial spatial feature and the facial local pose feature, for example, the fusion manner may be a stitching process, or a weighted fusion, and the like, which is not limited in this embodiment.
205. And the server carries out face reconstruction processing on the target object based on the fused back face space features to obtain a reference face image frame corresponding to the target object.
Optionally, in this embodiment, the step "performing face reconstruction processing on the target object based on the fused posterior spatial feature to obtain a reference face image frame corresponding to the target object" may include:
performing face reconstruction processing on the target object based on the fused back face space features to obtain a reconstructed three-dimensional face image corresponding to the target object;
and rendering and mapping the reconstructed three-dimensional face image to obtain a reference face image frame corresponding to the target object.
The texture and lighting of the reconstructed three-dimensional face image may come from the original face image frame, while its pose and expression may come from the audio driving information; the reconstructed three-dimensional face image is then subjected to rendering and mapping processing, whereby the three-dimensional image is projected onto a two-dimensional plane to obtain the reference face image frame corresponding to the target object.
206. The server generates the target facial image frame based on the original facial image frame, the fused posterior spatial feature, and the reference facial image frame.
Optionally, in this embodiment, the step of "generating the target facial image frame based on the original facial image frame, the fused posterior spatial feature, and the reference facial image frame" may include:
carrying out multi-scale feature extraction on the original facial image frame to obtain an original facial feature map under multiple scales corresponding to the original facial image frame;
performing multi-scale feature extraction on the reference face image frame to obtain a reference face feature map under multiple scales corresponding to the reference face image frame;
encoding and mapping the fused posterior spatial features to obtain hidden feature information corresponding to the fused posterior spatial features;
and fusing the original facial feature maps under the multiple scales, the reference facial feature maps under the multiple scales and the hidden feature information to obtain the target facial image frame.
The spatial features of the original facial image frame or the reference facial image frame under each preset resolution can be obtained through multi-scale feature extraction, the image scales of the original facial feature maps corresponding to different resolutions are different, and similarly, the image scales of the reference facial feature maps corresponding to different resolutions are different. In this embodiment, by multi-scale extraction, an original spatial feature of an original facial image frame and a reference spatial feature of a reference facial image frame may be obtained, the original spatial feature including an original facial feature map at multiple scales, and the reference spatial feature including a reference facial feature map at multiple scales, so that the original spatial feature of the original facial image frame and the reference spatial feature of the reference facial image frame are both multilayer spatial features (spatial features).
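A minimal encoder producing such a multi-scale feature pyramid might look as follows; the strided-convolution design and channel widths are assumptions rather than the Enc Block described elsewhere in this document.

```python
import torch.nn as nn

class PyramidEncoder(nn.Module):
    """Sketch of multi-scale feature extraction: each level halves the resolution, so every
    preset resolution has a matching feature map for later fusion."""
    def __init__(self, channels=(3, 32, 64, 128, 256, 512)):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.LeakyReLU(0.2))
            for c_in, c_out in zip(channels[:-1], channels[1:])])

    def forward(self, image):
        feats = {}
        x = image
        for block in self.blocks:
            x = block(x)
            feats[x.shape[-1]] = x        # keyed by spatial resolution, e.g. feats[8], feats[4]
        return feats
```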
Optionally, in this embodiment, the step of "fusing the original facial feature maps under the multiple scales, the reference facial feature maps under the multiple scales, and the hidden feature information to obtain the target facial image frame" may include:
fusing the hidden feature information, an original facial feature map under a target scale and a reference facial feature map under the target scale to obtain a corresponding fused facial feature map under the target scale, wherein the target scale is selected from the multiple scales;
and fusing the fused face feature map corresponding to the target scale, the original face feature map in the adjacent scale and the reference face feature map in the adjacent scale to obtain the target face image frame.
Optionally, in this embodiment, the step of fusing the fused facial feature map corresponding to the target scale, the original facial feature map in the adjacent scale, and the reference facial feature map in the adjacent scale to obtain the target facial image frame may include:
based on the hidden feature information, performing style modulation processing on the corresponding fused face feature map under the target scale to obtain modulated style features;
and fusing the modulated style features, the original face feature map under the adjacent scale and the reference face feature map under the adjacent scale to obtain the target face image frame.
The adjacent scale may be a scale larger than the target scale among the multiple scales. Specifically, if the multiple scales include 4 × 4, 8 × 8, 16 × 16, 32 × 32 and 64 × 64, then when the target scale is 16 × 16 the adjacent scale may be 32 × 32, and when the target scale is 4 × 4 the adjacent scale may be 8 × 8.
The step of fusing the modulated style features, the original facial feature map in the adjacent scale, and the reference facial feature map in the adjacent scale to obtain the target facial image frame may include:
fusing the modulated style features, the original facial feature map under the adjacent scale and the reference facial feature map under the adjacent scale to obtain a fused facial feature map under the adjacent scale;
and generating a target face image frame by using the fused face feature image and the basic face image under the adjacent scales.
Optionally, in this embodiment, the step of "performing spatial feature extraction on the original face image frame to obtain an original face spatial feature corresponding to the original face image frame" may include:
extracting spatial features of the original face image frame through an image generation model to obtain original face spatial features corresponding to the original face image frame;
the step of performing timing characteristic extraction on the audio driving information to obtain a local facial posture characteristic corresponding to the target facial image frame may include:
performing time sequence feature extraction on the audio driving information through the image generation model to obtain a local facial posture feature corresponding to the target facial image frame;
the step of performing face reconstruction processing on the target object based on the original face spatial feature and the face local pose feature to generate the target face image frame may include:
and performing face reconstruction processing on the target object based on the original face space feature and the face local posture feature through the image generation model to generate the target face image frame.
The image generation model may be trained from a plurality of sets of training data, and may be specifically provided to the image generation apparatus after being trained by another device, or may be trained by the image generation apparatus itself.
If the image generation device performs the training by itself, before the step of performing spatial feature extraction on the original face image frame through an image generation model to obtain an original face spatial feature corresponding to the original face image frame, the method may further include:
acquiring training data, wherein the training data comprises an original face image frame sample, a target driving face image frame sample and an audio driving information sample corresponding to the target driving face image frame sample of a sample object;
performing spatial feature extraction on the original facial image frame sample through a preset image generation model to obtain original facial spatial features corresponding to the original facial image frame sample;
performing time sequence feature extraction on the audio driving information sample to obtain a local facial posture feature corresponding to the target driving facial image frame sample;
based on the original face space feature and the face local posture feature, carrying out face reconstruction processing on the sample object to obtain a prediction driving face image frame;
and adjusting parameters of a preset image generation model based on the target driving face image frame sample and the prediction driving face image frame to obtain a trained image generation model.
The target driving face image frame sample may be regarded as tag information, and specifically may be a desired driving face image frame corresponding to the audio driving information sample.
There are various ways to obtain an original face image frame sample, a target driving face image frame sample, and an audio driving information sample corresponding to the target driving face image frame sample of a sample object, which is not limited in this embodiment.
For example, any two video frames containing the face of the sample object may be extracted from a speech video of the sample object, one of the two frames may be used as the original face image frame sample and the remaining frame as the target driving face image frame sample, and the audio information in the speech video within 1 second before and after the target driving face image frame sample may be used as the audio driving information sample.
Optionally, in this embodiment, the step "adjusting parameters of a preset image generation model based on the target driving face image frame sample and the prediction driving face image frame to obtain a trained image generation model", may include:
performing spatial feature extraction on the target driving face image frame sample to obtain a target face spatial feature corresponding to the target driving face image frame sample;
determining first loss information based on the local facial pose characteristics corresponding to the target driving facial image frame sample and the target facial spatial characteristics;
determining second loss information based on the target drive face image frame sample and the predicted drive face image frame;
and adjusting parameters of a preset image generation model according to the first loss information and the second loss information to obtain the trained image generation model.
The step of adjusting parameters of a preset image generation model according to the first loss information and the second loss information to obtain a trained image generation model may include:
fusing the first loss information and the second loss information to obtain total loss information;
and adjusting parameters of a preset image generation model according to the total loss information to obtain a trained image generation model.
There are various fusion manners of the first loss information and the second loss information, which is not limited in this embodiment, for example, the fusion manner may be weighted fusion or the like.
The training process of the preset image generation model includes first calculating the total loss information, then adjusting the parameters of the preset image generation model by using a back propagation algorithm, and optimizing the parameters of the image generation model based on the total loss information, so that the loss value corresponding to the total loss information becomes smaller than a preset loss value, thereby obtaining the trained image generation model. The preset loss value may be set according to the actual situation; for example, the higher the accuracy requirement on the image generation model, the smaller the preset loss value.
Alternatively, in this embodiment, the step of "determining second loss information based on the target driving face image frame sample and the prediction driving face image frame" may include:
respectively predicting the probabilities that the target driving face image frame sample and the predicted driving face image frame belong to a real driving face image frame, and determining countermeasure loss information of a preset image generation model based on the probabilities;
determining reconstruction loss information of a preset image generation model based on the similarity between the target driving face image frame sample and the prediction driving face image frame;
respectively carrying out identity recognition on the target driving face image frame sample and the prediction driving face image frame, and determining identity loss information of a preset image generation model based on an identity recognition result;
and determining second loss information according to the countermeasure loss information, the reconstruction loss information and the identity loss information.
As can be seen from the above, in this embodiment, the server may obtain the original facial image frame of the target object and the audio driving information corresponding to the target facial image frame to be generated; extracting the spatial features of the original facial image frame to obtain the original facial spatial features corresponding to the original facial image frame; performing time sequence feature extraction on the audio driving information to obtain a local facial posture feature corresponding to the target facial image frame; fusing the original face space feature and the face local posture feature to obtain a fused back face space feature; performing face reconstruction processing on the target object based on the fused posterior spatial features to obtain a reference face image frame corresponding to the target object; generating the target facial image frame based on the original facial image frame, the fused posterior spatial feature, and the reference facial image frame. According to the method and the device, the audio driving information can be subjected to feature extraction, the facial posture detail information of the target object part is captured, and then the original facial image frame is subjected to facial adjustment based on the captured information, so that the target facial image frame corresponding to the audio driving information is obtained, and the generation efficiency and accuracy of the target facial image frame are improved.
In order to better implement the above method, an embodiment of the present application further provides an image generation apparatus, as shown in fig. 3, the image generation apparatus may include an acquisition unit 301, a first extraction unit 302, a second extraction unit 303, and a reconstruction unit 304, as follows:
(1) an acquisition unit 301;
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring an original face image frame of a target object and audio driving information corresponding to a target face image frame to be generated.
(2) A first extraction unit 302;
the first extraction unit is used for extracting the spatial features of the original face image frame to obtain the original face spatial features corresponding to the original face image frame.
(3) A second extraction unit 303;
and the second extraction unit is used for extracting time sequence characteristics of the audio driving information to obtain local facial posture characteristics corresponding to the target facial image frame.
Optionally, in some embodiments of the present application, the second extraction unit may include an extraction subunit, a processing subunit, and a first fusion subunit, as follows:
the extraction subunit is configured to perform feature extraction on each audio frame in the audio driving information to obtain audio semantic feature information of each audio frame;
the processing subunit is used for processing the audio semantic feature information of each audio frame based on the audio semantic feature information of the front and rear audio frames of each audio frame;
and the first fusion subunit is used for fusing the processed audio semantic feature information of each audio frame to obtain the local facial pose feature corresponding to the target facial image frame.
(4) A reconstruction unit 304;
and the reconstruction unit is used for carrying out face reconstruction processing on the target object based on the original face space characteristic and the face local posture characteristic to generate the target face image frame.
Optionally, in some embodiments of the present application, the reconstruction unit may include a second fusion subunit, a reconstruction subunit, and a generation subunit, as follows:
the second fusion subunit is configured to fuse the original facial spatial feature and the facial local pose feature to obtain a fused posterior spatial feature;
the reconstruction subunit is configured to perform face reconstruction processing on the target object based on the fused posterior spatial feature to obtain a reference face image frame corresponding to the target object;
a generating subunit configured to generate the target face image frame based on the original face image frame, the fused posterior spatial feature, and the reference face image frame.
Optionally, in some embodiments of the present application, the reconstruction subunit may be specifically configured to perform face reconstruction processing on the target object based on the fused posterior spatial feature, so as to obtain a reconstructed three-dimensional face image corresponding to the target object; and rendering and mapping the reconstructed three-dimensional face image to obtain a reference face image frame corresponding to the target object.
Optionally, in some embodiments of the present application, the generating subunit may be specifically configured to perform multi-scale feature extraction on the original facial image frame, so as to obtain an original facial feature map under multiple scales corresponding to the original facial image frame; performing multi-scale feature extraction on the reference face image frame to obtain a reference face feature map under multiple scales corresponding to the reference face image frame; encoding and mapping the fused posterior spatial features to obtain hidden feature information corresponding to the fused posterior spatial features; and fusing the original facial feature maps under the multiple scales, the reference facial feature maps under the multiple scales and the hidden feature information to obtain the target facial image frame.
Optionally, in some embodiments of the present application, the step of fusing the original facial feature maps at the multiple scales, the reference facial feature maps at the multiple scales, and the hidden feature information to obtain the target facial image frame may include:
fusing the hidden feature information, an original facial feature map under a target scale and a reference facial feature map under the target scale to obtain a corresponding fused facial feature map under the target scale, wherein the target scale is selected from the multiple scales;
and fusing the corresponding fused face feature map under the target scale, the original face feature map under the adjacent scale and the reference face feature map under the adjacent scale to obtain the target face image frame.
Optionally, in some embodiments of the present application, the step "fusing the fused facial feature map corresponding to the target scale, the original facial feature map in the adjacent scale, and the reference facial feature map in the adjacent scale to obtain the target facial image frame" may include:
based on the hidden feature information, performing style modulation processing on the corresponding fused face feature map under the target scale to obtain modulated style features;
and fusing the modulated style features, the original facial feature map under the adjacent scale and the reference facial feature map under the adjacent scale to obtain the target facial image frame.
Optionally, in some embodiments of the present application, the first extraction unit may be specifically configured to perform spatial feature extraction on the original facial image frame through an image generation model, so as to obtain an original facial spatial feature corresponding to the original facial image frame;
the second extraction unit may be specifically configured to perform time-series feature extraction on the audio driving information through the image generation model to obtain a local facial pose feature corresponding to the target facial image frame;
the reconstruction unit may be specifically configured to perform, by using the image generation model, face reconstruction processing on the target object based on the original face spatial feature and the face local pose feature, and generate the target face image frame.
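A compact sketch of the time sequence feature extraction performed on the audio driving information, assuming per-frame audio semantic features (e.g., spectrogram frames) are mixed with their neighbouring frames by a temporal convolution and pooled into one local facial posture feature; all sizes and module names are illustrative:

```python
import torch
import torch.nn as nn

class AudioPoseEncoder(nn.Module):
    """Refines per-frame audio semantic features with their neighbouring frames
    and pools them into one local facial posture feature; sizes are illustrative."""

    def __init__(self, audio_dim=80, hidden=256, pose_dim=64):
        super().__init__()
        self.frame_encoder = nn.Linear(audio_dim, hidden)
        # Temporal convolution lets each frame see its preceding/following frames.
        self.temporal = nn.Conv1d(hidden, hidden, kernel_size=5, padding=2)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def forward(self, audio_frames):
        # audio_frames: (B, T, audio_dim), e.g. frames around the target frame
        h = torch.relu(self.frame_encoder(audio_frames))   # per-frame semantic features
        h = torch.relu(self.temporal(h.transpose(1, 2)))   # mix neighbouring frames
        pooled = h.mean(dim=-1)                            # fuse over time
        return self.to_pose(pooled)                        # local facial posture feature

# pose = AudioPoseEncoder()(torch.randn(1, 16, 80))  # -> (1, 64)
```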
Optionally, in some embodiments of the present application, the image generation apparatus may further include a training unit, and the training unit may be configured to train the image generation model;
the training unit may be specifically configured to acquire training data, where the training data includes an original face image frame sample of a sample object, a target driving face image frame sample, and an audio driving information sample corresponding to the target driving face image frame sample; performing spatial feature extraction on the original facial image frame sample through a preset image generation model to obtain original facial spatial features corresponding to the original facial image frame sample; performing time sequence feature extraction on the audio driving information sample to obtain a local facial gesture feature corresponding to the target driving facial image frame sample; based on the original face space feature and the face local posture feature, carrying out face reconstruction processing on the sample object to obtain a prediction driving face image frame; and adjusting parameters of a preset image generation model based on the target driving face image frame sample and the prediction driving face image frame to obtain a trained image generation model.
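For reference, one possible shape of a single training step under the procedure above is sketched below; `model`, its `extract_spatial_feature` method, and the way the posture-related part of the target spatial feature is sliced out are all assumptions made for the sketch, not the patent's concrete training code:

```python
import torch.nn.functional as F

def train_step(model, optimizer, batch):
    """One schematic parameter update for the preset image generation model."""
    orig_frame, target_frame, audio_sample = batch
    # Forward pass: predicted driving face image frame and the posture feature
    # inferred from the audio driving information sample.
    pred_frame, pred_pose = model(orig_frame, audio_sample)

    # First loss: audio-derived local facial posture feature vs. the corresponding
    # part of the spatial feature extracted from the target driving frame sample.
    target_spatial = model.extract_spatial_feature(target_frame)
    loss_first = F.l1_loss(pred_pose, target_spatial[:, : pred_pose.shape[1]])

    # Second loss: image-level supervision (adversarial/reconstruction/identity
    # terms are sketched separately further below); a plain L1 term stands in here.
    loss_second = F.l1_loss(pred_frame, target_frame)

    loss = loss_first + loss_second
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```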
Optionally, in some embodiments of the present application, the step "adjusting parameters of a preset image generation model based on the target driving face image frame sample and the prediction driving face image frame to obtain a trained image generation model" may include:
performing spatial feature extraction on the target driving face image frame sample to obtain a target face spatial feature corresponding to the target driving face image frame sample;
determining first loss information based on the local facial pose characteristics corresponding to the target driving facial image frame samples and the target facial space characteristics;
determining second loss information based on the target drive face image frame sample and the predicted drive face image frame;
and adjusting parameters of a preset image generation model according to the first loss information and the second loss information to obtain the trained image generation model.
Optionally, in some embodiments of the application, the step of "determining second loss information based on the target driving face image frame sample and the predicted driving face image frame" may include:
respectively predicting the probabilities that the target driving face image frame sample and the predicted driving face image frame belong to a real driving face image frame, and determining the antagonistic loss information of a preset image generation model based on the probabilities;
determining reconstruction loss information of a preset image generation model based on a similarity between the target driving face image frame sample and the prediction driving face image frame;
respectively carrying out identity recognition on the target driving face image frame sample and the prediction driving face image frame, and determining identity loss information of a preset image generation model based on an identity recognition result;
and determining second loss information according to the countermeasure loss information, the reconstruction loss information and the identity loss information.
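The second loss described above (countermeasure/adversarial, reconstruction, and identity terms) can be sketched as follows; the discriminator, the identity encoder, and the loss weights are placeholders rather than the patent's exact models:

```python
import torch
import torch.nn.functional as F

def second_loss(discriminator, id_encoder, target_frame, pred_frame,
                w_adv=0.1, w_rec=1.0, w_id=1.0):
    """Countermeasure (adversarial) + reconstruction + identity terms."""
    # Adversarial term: the generator wants the predicted frame classified as real.
    logits_fake = discriminator(pred_frame)
    adv = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))

    # Reconstruction term: similarity between target and predicted driving frames.
    rec = F.l1_loss(pred_frame, target_frame)

    # Identity term: keep the identity embeddings of both frames close.
    id_t = F.normalize(id_encoder(target_frame), dim=-1)
    id_p = F.normalize(id_encoder(pred_frame), dim=-1)
    ident = 1.0 - (id_t * id_p).sum(dim=-1).mean()

    return w_adv * adv + w_rec * rec + w_id * ident
```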
As can be seen from the above, in this embodiment, the acquisition unit 301 may acquire the original face image frame of the target object and the audio driving information corresponding to the target face image frame to be generated; the first extraction unit 302 may perform spatial feature extraction on the original face image frame to obtain the original face spatial feature corresponding to the original face image frame; the second extraction unit 303 may perform time sequence feature extraction on the audio driving information to obtain the local facial posture feature corresponding to the target face image frame; and the reconstruction unit 304 may perform face reconstruction processing on the target object based on the original face spatial feature and the face local posture feature to generate the target face image frame. In this way, the embodiments of the present application can perform feature extraction on the audio driving information, capture the local facial posture details of the target object, and then adjust the face in the original face image frame based on the captured information, so as to obtain the target face image frame corresponding to the audio driving information, thereby improving the generation efficiency and accuracy of the target face image frame.
An embodiment of the present application further provides an electronic device, as shown in fig. 4, which is a schematic structural diagram of the electronic device according to an embodiment of the present application. The electronic device may be a terminal or a server. Specifically:
the electronic device may include components such as a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 does not constitute a limitation of the electronic device, which may include more or fewer components than those shown, combine some components, or have a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, performs various functions of the electronic device and processes data by operating or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the electronic device, and the like. Further, the memory 402 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may further include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further includes a power supply 403 for supplying power to the various components. Preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system. The power supply 403 may further include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other components.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
acquiring an original face image frame of a target object and audio driving information corresponding to a target face image frame to be generated; extracting the spatial features of the original facial image frame to obtain the original facial spatial features corresponding to the original facial image frame; performing time sequence feature extraction on the audio driving information to obtain a local facial posture feature corresponding to the target facial image frame; and performing face reconstruction processing on the target object based on the original face space feature and the face local posture feature to generate the target face image frame.
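Put together, the four operations above amount to the following inference sketch; the `model` object and its sub-module names are hypothetical stand-ins for the trained image generation model:

```python
import torch

@torch.no_grad()
def generate_target_frame(model, orig_frame, audio_window):
    """End-to-end inference following the four operations above."""
    spatial_feat = model.extract_spatial_feature(orig_frame)             # spatial features
    pose_feat = model.extract_pose_from_audio(audio_window)              # time sequence features
    target_frame = model.reconstruct_face(orig_frame, spatial_feat, pose_feat)  # reconstruction
    return target_frame
```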
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
As can be seen from the above, this embodiment may acquire the original face image frame of the target object and the audio driving information corresponding to the target face image frame to be generated; perform spatial feature extraction on the original face image frame to obtain the original face spatial feature corresponding to the original face image frame; perform time sequence feature extraction on the audio driving information to obtain the local facial posture feature corresponding to the target face image frame; and perform face reconstruction processing on the target object based on the original face spatial feature and the face local posture feature to generate the target face image frame. In this way, the embodiments of the present application can perform feature extraction on the audio driving information, capture the local facial posture details of the target object, and then adjust the face in the original face image frame based on the captured information, so as to obtain the target face image frame corresponding to the audio driving information, thereby improving the generation efficiency and accuracy of the target face image frame.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present application provides a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the image generation methods provided by the embodiments of the present application. For example, the instructions may perform the steps of:
acquiring an original face image frame of a target object and audio driving information corresponding to a target face image frame to be generated; extracting the spatial features of the original facial image frame to obtain the original facial spatial features corresponding to the original facial image frame; performing time sequence feature extraction on the audio driving information to obtain a local facial posture feature corresponding to the target facial image frame; and performing face reconstruction processing on the target object based on the original face space feature and the face local posture feature to generate the target face image frame.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps in any image generation method provided in the embodiments of the present application, beneficial effects that can be achieved by any image generation method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described again here.
According to an aspect of the application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read by a processor of the computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the methods provided in the various alternative implementations of the image generation aspect described above.
The image generation method and related devices provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and its core ideas. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (15)

1. An image generation method, comprising:
acquiring an original face image frame of a target object and audio driving information corresponding to a target face image frame to be generated;
extracting the spatial features of the original facial image frame to obtain the original facial spatial features corresponding to the original facial image frame;
performing time sequence feature extraction on the audio driving information to obtain a local facial posture feature corresponding to the target facial image frame;
and performing face reconstruction processing on the target object based on the original face space feature and the face local posture feature to generate the target face image frame.
2. The method of claim 1, wherein the performing time-series feature extraction on the audio driving information to obtain a local facial pose feature corresponding to the target facial image frame comprises:
performing feature extraction on each audio frame in the audio driving information to obtain audio semantic feature information of each audio frame;
processing the audio semantic feature information of each audio frame based on the audio semantic feature information of the preceding and following audio frames of the audio frame;
and fusing the processed audio semantic feature information of each audio frame to obtain the local facial gesture feature corresponding to the target facial image frame.
3. The method of claim 1, wherein the performing face reconstruction processing on the target object based on the original face spatial feature and the face local posture feature to generate the target face image frame comprises:
fusing the original face spatial feature and the face local posture feature to obtain a fused posterior spatial feature;
performing face reconstruction processing on the target object based on the fused posterior spatial feature to obtain a reference face image frame corresponding to the target object;
generating the target facial image frame based on the original facial image frame, the fused posterior spatial feature, and the reference facial image frame.
4. The method according to claim 3, wherein performing facial reconstruction processing on the target object based on the fused posterior spatial feature to obtain a reference facial image frame corresponding to the target object comprises:
performing face reconstruction processing on the target object based on the fused posterior spatial feature to obtain a reconstructed three-dimensional face image corresponding to the target object;
and rendering and mapping the reconstructed three-dimensional face image to obtain a reference face image frame corresponding to the target object.
5. The method of claim 3, wherein generating the target facial image frame based on the original facial image frame, the fused posterior spatial feature, and the reference facial image frame comprises:
carrying out multi-scale feature extraction on the original facial image frame to obtain an original facial feature map under multiple scales corresponding to the original facial image frame;
performing multi-scale feature extraction on the reference face image frame to obtain a reference face feature map under multiple scales corresponding to the reference face image frame;
encoding and mapping the fused posterior spatial features to obtain hidden feature information corresponding to the fused posterior spatial features;
and fusing the original facial feature maps under the multiple scales, the reference facial feature maps under the multiple scales and the hidden feature information to obtain the target facial image frame.
6. The method according to claim 5, wherein the fusing the original facial feature maps at the plurality of scales, the reference facial feature maps at the plurality of scales, and the hidden feature information to obtain the target facial image frame comprises:
fusing the hidden feature information, the original facial feature map under the target scale and the reference facial feature map under the target scale to obtain a fused facial feature map corresponding to the target scale, wherein the target scale is selected from the multiple scales;
and fusing the fused face feature map corresponding to the target scale, the original face feature map in the adjacent scale and the reference face feature map in the adjacent scale to obtain the target face image frame.
7. The method according to claim 6, wherein the fusing the corresponding fused facial feature map at the target scale, the original facial feature map at the adjacent scale, and the reference facial feature map at the adjacent scale to obtain the target facial image frame comprises:
based on the hidden feature information, performing style modulation processing on the corresponding fused face feature map under the target scale to obtain modulated style features;
and fusing the modulated style features, the original facial feature map under the adjacent scale and the reference facial feature map under the adjacent scale to obtain the target facial image frame.
8. The method according to claim 1, wherein the performing spatial feature extraction on the original facial image frame to obtain an original facial spatial feature corresponding to the original facial image frame comprises:
extracting the spatial features of the original facial image frame through an image generation model to obtain the original facial spatial features corresponding to the original facial image frame;
the extracting time sequence features of the audio driving information to obtain local facial posture features corresponding to the target facial image frame includes:
performing time sequence feature extraction on the audio driving information through the image generation model to obtain a local facial posture feature corresponding to the target facial image frame;
the facial reconstruction processing is performed on the target object based on the original facial space feature and the facial local posture feature, and the target facial image frame is generated, including:
and performing face reconstruction processing on the target object based on the original face space feature and the face local posture feature through the image generation model to generate the target face image frame.
9. The method of claim 8, wherein before the performing spatial feature extraction on the original face image frame through the image generation model to obtain the original face spatial feature corresponding to the original face image frame, the method further comprises:
acquiring training data, wherein the training data comprises an original face image frame sample of a sample object, a target driving face image frame sample and an audio driving information sample corresponding to the target driving face image frame sample;
extracting spatial features of the original face image frame sample through a preset image generation model to obtain original face spatial features corresponding to the original face image frame sample;
performing time sequence feature extraction on the audio driving information sample to obtain a local facial posture feature corresponding to the target driving facial image frame sample;
based on the original face space feature and the face local posture feature, carrying out face reconstruction processing on the sample object to obtain a prediction driving face image frame;
and adjusting parameters of a preset image generation model based on the target driving face image frame sample and the prediction driving face image frame to obtain a trained image generation model.
10. The method of claim 9, wherein the adjusting parameters of a preset image generation model based on the target driving face image frame sample and the predicted driving face image frame to obtain a trained image generation model comprises:
performing spatial feature extraction on the target driving face image frame sample to obtain a target face spatial feature corresponding to the target driving face image frame sample;
determining first loss information based on the local facial pose characteristics corresponding to the target driving facial image frame sample and the target facial spatial characteristics;
determining second loss information based on the target drive face image frame sample and the predicted drive face image frame;
and adjusting parameters of a preset image generation model according to the first loss information and the second loss information to obtain a trained image generation model.
11. The method of claim 10, wherein determining second loss information based on the target drive face image frame sample and the predictive drive face image frame comprises:
respectively predicting the probabilities that the target driving face image frame sample and the predicted driving face image frame belong to a real driving face image frame, and determining countermeasure loss information of a preset image generation model based on the probabilities;
determining reconstruction loss information of a preset image generation model based on a similarity between the target driving face image frame sample and the prediction driving face image frame;
respectively carrying out identity recognition on the target driving face image frame sample and the prediction driving face image frame, and determining identity loss information of a preset image generation model based on an identity recognition result;
and determining second loss information according to the countermeasure loss information, the reconstruction loss information and the identity loss information.
12. An image generation apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring an original face image frame of a target object and audio driving information corresponding to a target face image frame to be generated;
the first extraction unit is used for extracting the spatial features of the original face image frame to obtain the original face spatial features corresponding to the original face image frame;
the second extraction unit is used for extracting time sequence characteristics of the audio driving information to obtain local facial posture characteristics corresponding to the target facial image frame;
and the reconstruction unit is used for carrying out face reconstruction processing on the target object based on the original face space characteristic and the face local posture characteristic to generate the target face image frame.
13. An electronic device comprising a memory and a processor; the memory stores an application program, and the processor is configured to execute the application program in the memory to perform the operations of the image generation method according to any one of claims 1 to 11.
14. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the image generation method according to any one of claims 1 to 11.
15. A computer program product comprising a computer program or instructions, characterized in that the computer program or instructions, when executed by a processor, implement the steps in the image generation method of any of claims 1 to 11.
CN202210477320.0A 2022-05-04 2022-05-04 Image generation method and related device Pending CN115131849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210477320.0A CN115131849A (en) 2022-05-04 2022-05-04 Image generation method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210477320.0A CN115131849A (en) 2022-05-04 2022-05-04 Image generation method and related device

Publications (1)

Publication Number Publication Date
CN115131849A (en) 2022-09-30

Family

ID=83376849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210477320.0A Pending CN115131849A (en) 2022-05-04 2022-05-04 Image generation method and related device

Country Status (1)

Country Link
CN (1) CN115131849A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116342835A (en) * 2023-03-31 2023-06-27 华院计算技术(上海)股份有限公司 Face three-dimensional surface grid generation method, device, computing equipment and storage medium
CN116361502A (en) * 2023-05-31 2023-06-30 深圳兔展智能科技有限公司 Image retrieval method, device, computer equipment and storage medium
CN116361502B (en) * 2023-05-31 2023-08-01 深圳兔展智能科技有限公司 Image retrieval method, device, computer equipment and storage medium
CN117474807A (en) * 2023-12-27 2024-01-30 科大讯飞股份有限公司 Image restoration method, device, equipment and storage medium
CN117807154A (en) * 2024-02-28 2024-04-02 成都菲宇科技有限公司 Time sequence data visualization method, device and medium for display system
CN117807154B (en) * 2024-02-28 2024-04-30 成都菲宇科技有限公司 Time sequence data visualization method, device and medium for display system

Similar Documents

Publication Publication Date Title
CN115205949B (en) Image generation method and related device
CN108961369B (en) Method and device for generating 3D animation
CN115131849A (en) Image generation method and related device
CN110659582A (en) Image conversion model training method, heterogeneous face recognition method, device and equipment
JP2016218999A (en) Method for training classifier to detect object represented in image of target environment
CN111047548A (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN110852256B (en) Method, device and equipment for generating time sequence action nomination and storage medium
KR20120093981A (en) Robust object recognition by dynamic modeling in augmented reality
CN113705316A (en) Method, device and equipment for acquiring virtual image and storage medium
CN115797606B (en) 3D virtual digital human interaction action generation method and system based on deep learning
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN110598019A (en) Repeated image identification method and device
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN112668608A (en) Image identification method and device, electronic equipment and storage medium
Gadasin et al. Application of Convolutional Neural Networks for Three-Dimensional Reconstruction of the Geometry of Objects in the Image
CN114973349A (en) Face image processing method and training method of face image processing model
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN113537267A (en) Method and device for generating countermeasure sample, storage medium and electronic equipment
CN112633425B (en) Image classification method and device
CN115018215A (en) Population residence prediction method, system and medium based on multi-modal cognitive map
CN114943799A (en) Face image processing method and device and computer readable storage medium
CN113824989A (en) Video processing method and device and computer readable storage medium
CN115131636A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN114283290B (en) Training of image processing model, image processing method, device, equipment and medium
CN117576279B (en) Digital person driving method and system based on multi-mode data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination