CN115131636A - Image processing method, image processing device, electronic equipment and computer readable storage medium - Google Patents

Image processing method, image processing device, electronic equipment and computer readable storage medium

Info

Publication number
CN115131636A
CN115131636A
Authority
CN
China
Prior art keywords
image
features
original
dimensional
driving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210476124.1A
Other languages
Chinese (zh)
Inventor
朱飞达
朱俊伟
储文青
邰颖
汪铖杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210476124.1A
Publication of CN115131636A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses an image processing method, an image processing device, electronic equipment and a computer readable storage medium. After an original image of an original object and a driving image of a driving object are obtained, object features of the original object are extracted from the original image and posture features of the driving object are extracted from the driving image; the object features and the posture features are fused to obtain three-dimensional image features, and a three-dimensional object image is constructed based on the three-dimensional image features; spatial features at each preset resolution are extracted from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image; and the original spatial features, the three-dimensional spatial features and preset basic style features are fused to obtain a target object image. The scheme can improve the accuracy of image processing, and the embodiment of the invention can be applied to various scenes such as cloud technology, artificial intelligence, intelligent traffic, and driving assistance.

Description

Image processing method, image processing device, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to an image processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In recent years, with the development of internet technology, image processing has become more and more diversified; for example, the posture of an object in an original image can be adjusted to the posture of the object in a driving image. Existing image processing methods represent the posture through the key-point trajectories of key points in the image, or through local affine transformations around the key points, and thereby complete the posture migration.
In the course of research and practice on the prior art, the inventors of the present invention found that, with the existing image processing method, the keypoint trajectory cannot accurately capture the local pose details of the object in the object image, and is influenced by the resolution of the object image, so that a high-resolution image cannot be restored, and therefore, the accuracy of image processing is greatly reduced.
Disclosure of Invention
The embodiment of the invention provides an image processing method, an image processing device, electronic equipment and a computer readable storage medium, which can improve the accuracy of image processing.
An image processing method comprising:
acquiring an original image of an original object and a driving image of a driving object, wherein the driving image is a template image used for performing body posture adjustment on the original object;
extracting object features of the original object from the original image, and extracting posture features of the driving object from the driving image;
fusing the object features and the posture features to obtain three-dimensional image features, and constructing a three-dimensional object image based on the three-dimensional image features;
extracting spatial features under each preset resolution from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image;
and fusing the original spatial features, the three-dimensional spatial features and the preset basic style features to obtain a target object image under a target resolution, wherein the target object image is an object image obtained by replacing the body posture of the original object with the body posture of the driving object.
Optionally, the present scheme may further provide an image processing method, including:
acquiring an original face image and a face driving image, wherein the face driving image is a template image used for adjusting the face posture in the original face image;
extracting facial image features from the original facial image, and extracting facial posture features from the facial driving image;
fusing the facial image features and the facial posture features to obtain three-dimensional facial features, and constructing a three-dimensional facial image based on the three-dimensional facial features;
extracting spatial features under each preset resolution from the original face image and the three-dimensional face image to obtain the original spatial features of the original face image and the three-dimensional spatial features of the three-dimensional face image;
and fusing the original spatial features, the three-dimensional spatial features and preset basic style features to obtain a target face image under a target resolution, wherein the target face image is an image obtained by replacing the face posture in the original face image with the face posture in the face driving image.
Accordingly, an embodiment of the present invention provides an image processing apparatus, including:
an acquisition unit, configured to acquire an original image of an original object and a driving image of a driving object, wherein the driving image is a template image used for adjusting the body posture of the original object;
a first extraction unit configured to extract an object feature of the original object in the original image and extract an attitude feature of the driving object in the driving image;
the construction unit is used for fusing the object features and the posture features to obtain three-dimensional image features, and constructing a three-dimensional object image based on the three-dimensional image features;
the second extraction unit is used for extracting the spatial features under each preset resolution from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image;
and the fusion unit is used for fusing the original space characteristic, the three-dimensional space characteristic and the preset basic style characteristic to obtain a target object image under a target resolution, wherein the target object image is an object image obtained by replacing the body posture of the original object with the body posture of the driving object.
Optionally, in some embodiments, the fusion unit may be specifically configured to generate a basic object image at an initial resolution based on the preset basic style feature; mapping the three-dimensional image features into hidden features by adopting a trained image generation model, and adjusting the preset basic style features based on the hidden features; and according to the basic object image, fusing the adjusted style characteristic, the original spatial characteristic and the three-dimensional spatial characteristic to obtain a target object image under the target resolution.
Optionally, in some embodiments, the fusion unit may be specifically configured to screen a target original spatial feature from the original spatial features and screen a target three-dimensional spatial feature from the three-dimensional spatial features based on the preset resolution; fusing the adjusted style characteristic, the target original spatial characteristic and the target three-dimensional spatial characteristic to obtain a fused style characteristic under the current resolution; and generating a target object image under the target resolution based on the fused style features and the basic object image.
Optionally, in some embodiments, the fusion unit may be specifically configured to generate a current object image based on the fused style features, and fuse the current object image and the basic object image to obtain a fusion object image at the current resolution; taking the fused style features as the preset basic style features, and taking the fused object image as the basic object image; and returning to the step of adjusting the preset basic style characteristics based on the hidden characteristics until the current resolution is the target resolution, so as to obtain the target object image.
Optionally, in some embodiments, the fusion unit may be specifically configured to adjust the size of the basic style feature to obtain an initial style feature; modulating the hidden features to obtain convolution weights corresponding to the initial style features; and adjusting the initial style characteristics based on the convolution weight to obtain adjusted style characteristics.
Optionally, in some embodiments, the fusion unit may be specifically configured to screen a target style convolutional network corresponding to the resolution of the base object image from a style convolutional network of the trained image generation model; and performing convolution processing on the initial style features by adopting the target style convolution network based on the convolution weight to obtain the adjusted style features.
Optionally, in some embodiments, the fusion unit may be specifically configured to generate a basic optical-flow field at the initial resolution based on the basic style features; and according to the basic optical flow field, fusing the adjusted style characteristics, the original spatial characteristics and the three-dimensional spatial characteristics to obtain a target optical flow field under the target resolution.
Optionally, in some embodiments, the image processing apparatus may further include a training unit, where the training unit may be specifically configured to obtain a video sample, and screen out any two frames of video frames from the video sample as an original image sample and a driving image sample, respectively; adopting a preset image generation model to carry out object driving on the original image sample and the driving image sample to obtain a prediction object image; and converging the preset image generation model based on the prediction object image and the driving image sample to obtain a trained image generation model.
Optionally, in some embodiments, the training unit may be specifically configured to detect the prediction object image and the driving image sample respectively, and determine, based on a detection result, countermeasure loss information of the preset image generation model; determining reconstruction loss information of the preset image generation model based on the prediction object image and the driving image sample; and fusing the antagonistic loss information and the reconstruction loss information, and converging the preset image generation model based on the fused loss information to obtain a trained image generation model.
Optionally, in some embodiments, the training unit may be specifically configured to calculate an image similarity between the prediction object image and a driving image sample, so as to obtain perceptual loss information of the preset image generation model; comparing the prediction object image with the driving image sample to obtain image loss information of the preset image generation model; and fusing the perception loss information and the image loss information to obtain the reconstruction loss information of the preset image generation model.
Optionally, in some embodiments, the construction unit may be specifically configured to perform image feature extraction on the original image and the driving image, respectively, to obtain an original image feature of the original image and a driving image feature of the driving image; screening the object characteristics of the original object from the original image characteristics; and screening out the attitude characteristic of the driving object from the driving image characteristic.
Optionally, in some embodiments, the first extracting unit may be specifically configured to perform image feature extraction on the original image and the driving image, respectively, to obtain an original image feature of the original image and a driving image feature of the driving image; screening out object features of the original object from the original image features; and screening out the attitude characteristic of the driving object from the driving image characteristic.
In addition, an electronic device is further provided in an embodiment of the present invention, and includes a processor and a memory, where the memory stores an application program, and the processor is configured to run the application program in the memory to implement the image processing method provided in the embodiment of the present invention.
In addition, the embodiment of the present invention further provides a computer-readable storage medium, where a plurality of instructions are stored, and the instructions are suitable for being loaded by a processor to perform the steps in any one of the image processing methods provided by the embodiment of the present invention.
After an original image of an original object and a driving image of a driving object are obtained, object features of the original object are extracted from the original image and posture features of the driving object are extracted from the driving image; the object features and the posture features are fused to obtain three-dimensional image features, and a three-dimensional object image is constructed based on the three-dimensional image features; spatial features at each preset resolution are then extracted from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image, and the original spatial features, the three-dimensional spatial features and the preset basic style features are fused to obtain a target object image at a target resolution. Because the scheme constructs the three-dimensional object image based on the three-dimensional image features extracted from the original image and the driving image, the local posture details of the driving object can be accurately captured; and because the target object image at the target resolution is generated from the spatial features of the original image and the three-dimensional reconstructed image at different resolutions, the resolution of the generated target object image is ensured. Therefore, the accuracy of image processing can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
Fig. 1 is a schematic scene diagram of an image processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an image processing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of body pose adjustment provided by embodiments of the present invention;
FIG. 4 is a schematic illustration of a 3D reconstruction provided by an embodiment of the present invention;
fig. 5 is a schematic network structure diagram of a decoding subnetwork provided in an embodiment of the present invention;
FIG. 6 is a schematic diagram of an overall training process of a preset image generation model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a process for using a trained image model provided by an embodiment of the invention;
FIG. 8 is a schematic flow chart of an image processing method according to an embodiment of the present invention;
FIG. 9 is a schematic flow chart of an image processing method according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of facial pose adjustment provided by embodiments of the present invention;
FIG. 11 is a schematic diagram of a 3D reconstruction of a face provided by an embodiment of the invention;
fig. 12 is a schematic flowchart of a process of training a preset image generation model through a human face image according to an embodiment of the present invention;
FIG. 13 is a schematic flow chart illustrating a process of performing face driving on a face image by using a trained image generation model according to an embodiment of the present invention;
fig. 14 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention;
FIG. 15 is a schematic diagram of another structure of an image processing apparatus according to an embodiment of the present invention;
fig. 16 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The embodiment of the invention provides an image processing method and device, electronic equipment and a computer-readable storage medium. The image processing apparatus may be integrated into an electronic device, and the electronic device may be a server or a terminal.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Network acceleration service (CDN), big data, an artificial intelligence platform, and the like. The terminal includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, an aircraft, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein. The embodiment of the invention can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent traffic, driving assistance and the like.
For example, referring to fig. 1, taking as an example that the image processing apparatus is integrated in an electronic device: after acquiring an original image of an original object and a driving image of a driving object, the electronic device extracts object features of the original object from the original image and posture features of the driving object from the driving image; then the object features and the posture features are fused to obtain three-dimensional image features, and a three-dimensional object image is constructed based on the three-dimensional image features; next, spatial features at each preset resolution are extracted from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image; finally, the original spatial features, the three-dimensional spatial features and the preset basic style features are fused to obtain the target object image at the target resolution, thereby improving the accuracy of image processing.
The image processing method provided by the embodiment of the application relates to the computer vision direction in the field of artificial intelligence. The embodiment of the application can perform feature extraction on the original image and the driving image, and perform object driving on the original image based on the driving image, so as to obtain a target object image and the like under a target resolution.
Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence software technology mainly comprises a computer vision technology, a machine learning/deep learning direction and the like.
Computer Vision (CV) technology is a science that studies how to make machines "see"; more specifically, it refers to machine vision in which a computer replaces human eyes to identify and measure targets, and further performs image processing, so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, and the like.
It should be understood that the specific implementations of the present application involve related data such as the original object image and the driving object image; when the following embodiments of the present application are applied to specific products or technologies, permission or approval needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
The embodiment will be described from the perspective of an image processing apparatus, which may be specifically integrated in an electronic device, where the electronic device may be a server or a terminal; the terminal may include a tablet Computer, a notebook Computer, a Personal Computer (PC), a wearable device, a virtual reality device, or other intelligent devices capable of performing image processing.
An image processing method, comprising:
acquiring an original image of an original object and a driving image of a driving object, the driving image being a template image for performing body pose adjustment on the original object, extracting object characteristics of an original object from an original image, extracting attitude characteristics of a driving object from a driving image, fusing the object characteristics and the attitude characteristics to obtain three-dimensional image characteristics, constructing a three-dimensional object image based on the three-dimensional image characteristics, extracting spatial features under each preset resolution from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image, fusing the original spatial features, the three-dimensional spatial features and preset basic style features to obtain a target object image under a target resolution, the target object image is an object image in which the body posture of the original object is replaced with the body posture of the driving object.
As shown in fig. 2, the specific flow of the image processing method is as follows:
101. An original image of an original object and a driving image of a driving object are acquired.
The original object may be an object whose body posture is to be adjusted, and the original image may be an image including the original object. The driving image is used for performing body posture adjustment on the original object; adjusting the body posture of the original object may be understood as replacing the body posture of the original object with the body posture of the driving object, so as to obtain the target object image. The object may be of various types, for example, a human body, a facial object, an animal, or the like; taking the object as a human body as an example, the process of adjusting the body posture may be as shown in fig. 3.
The manner of acquiring the original image of the original object and the driving image of the driving object may be various, and specifically, the manner may be as follows:
for example, an original image of an original object and a driving image of a driving object uploaded by a terminal may be directly acquired, or any two frames of video frames including the object may be extracted from a video, one frame may be used as the original image, the object in the original image may be used as the original object, the remaining frame may be used as the driving image, and the object in the driving image may be used as the driving object, or an object image including the object may be screened from image data, and any two images may be screened from the object image as the original image and the driving image, respectively.
102. The object features of the original object are extracted from the original image, and the attitude features of the driving object are extracted from the driving image.
The object features may be understood as feature information characterizing the original object, for example, identity (identity), lighting (lighting), texture (texture), and the like. The posture features may be understood as feature information indicating the posture of the driving object; different types of driving objects have different posture features. For example, when the driving object is a human body, the posture feature may be the body posture (pose) of the human body, and when the driving object is a face, the posture features may be the face posture (pose), gaze (gaze), expression (expression), and the like.
The object feature of the original object is extracted from the original image, and the posture feature of the driving object is extracted from the driving image in various ways, which may specifically be as follows:
for example, image feature extraction may be performed on the original image and the driving image, respectively, to obtain an original image feature of the original image and a driving image feature of the driving image, to screen an object feature of the original object from the original image feature, and to screen an attitude feature of the driving object from the driving image feature.
For example, an image feature extraction network of the trained image generation model may be adopted to extract 3D coefficients (coeff) from the original image and the driving image, respectively, the 3D coefficients extracted from the original image are used as the original image features, and the 3D coefficients extracted from the driving image are used as the driving image features.
The 3D coefficients may be understood as three-dimensional parameters of the object in the image; different objects correspond to different 3D coefficients. When the object in the image is a human body, the 3D coefficients may be identity (identity), lighting (lighting), texture (texture) and pose (pose); when the object in the image is a facial object, the 3D coefficients may be face identity (identity), lighting (lighting), texture (texture), expression (expression), pose (pose) and gaze (gaze). The network structure of the image feature extraction network for extracting the 3D coefficients may be various, for example, ResNet50 or another network structure; taking the network structure of the image feature extraction network as ResNet50 as an example, the process of extracting the 3D coefficients may be as shown in formula (1):
coeff = ResNet50(I_lq)   (1)

wherein coeff is the 3D coefficient and I_lq is either the original image or the driving image.
After the original image features and the driving image features are extracted, the object features of the original object can be screened out directly from the original image features; for example, taking the original object as a human body, parameters such as identity, lighting and texture can be extracted directly from the original image features, so as to obtain the object features of the original object. The posture features of the driving object can be screened out from the driving image features; for example, if the driving object is a human body, parameters such as the pose can be screened out directly from the driving image features, so as to obtain the posture features of the driving object.
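As a concrete illustration of this extraction step, the following is a minimal PyTorch sketch assuming a ResNet-50 backbone that regresses a flat coefficient vector, which is then sliced into object features (identity, lighting, texture) and a pose feature. The coefficient dimension and slice boundaries are illustrative assumptions, not the patent's actual layout.

```python
# Minimal sketch (PyTorch): regressing 3D coefficients with a ResNet-50 backbone
# and splitting them into object features and a pose feature.
import torch
import torch.nn as nn
import torchvision.models as models

class CoeffExtractor(nn.Module):
    def __init__(self, coeff_dim=257):
        super().__init__()
        backbone = models.resnet50()
        backbone.fc = nn.Linear(backbone.fc.in_features, coeff_dim)
        self.backbone = backbone

    def forward(self, image):            # image: (B, 3, H, W)
        return self.backbone(image)      # coeff: (B, coeff_dim)

def split_coeff(coeff):
    # Hypothetical layout: [identity | texture | lighting | pose]
    identity, texture = coeff[:, :80], coeff[:, 80:160]
    lighting, pose    = coeff[:, 160:187], coeff[:, 187:]
    return {"identity": identity, "texture": texture,
            "lighting": lighting, "pose": pose}

extractor = CoeffExtractor()
coeff_src = extractor(torch.randn(1, 3, 256, 256))   # original image
coeff_drv = extractor(torch.randn(1, 3, 256, 256))   # driving image
object_feats = {k: v for k, v in split_coeff(coeff_src).items() if k != "pose"}
pose_feat = split_coeff(coeff_drv)["pose"]
```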
103. And fusing the object features and the posture features to obtain three-dimensional image features, and constructing a three-dimensional object image based on the three-dimensional image features.
A three-dimensional image feature is understood to be an image feature that reconstructs a three-dimensional object.
The three-dimensional object image may be an object image obtained by rendering the three-dimensional image features into a three-dimensional object model and projecting the model onto a two-dimensional plane; this may also be understood as 3D reconstruction, and the reconstruction process may be as shown in fig. 4.
The object feature and the posture feature may be fused in various ways, and specifically, the ways may be as follows:
for example, the object feature and the posture feature may be directly spliced to obtain the three-dimensional image feature, or weighting parameters of the object feature and the posture feature may be obtained, the object feature and the posture feature are weighted based on the weighting parameters, respectively, to obtain a weighted object feature and a weighted posture feature, and the weighted object feature and the weighted posture feature are fused to obtain the three-dimensional image feature.
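The two fusion options just described can be sketched as follows; the weighting parameters and feature dimensions are illustrative assumptions.

```python
# Minimal sketch of fusing object features and posture features.
import torch

def fuse_concat(object_feat, pose_feat):
    # Direct splicing (concatenation) along the feature dimension.
    return torch.cat([object_feat, pose_feat], dim=-1)

def fuse_weighted(object_feat, pose_feat, w_obj=1.0, w_pose=1.0):
    # Weight each feature first, then splice the weighted features.
    return torch.cat([w_obj * object_feat, w_pose * pose_feat], dim=-1)

three_d_feat = fuse_concat(torch.randn(1, 227), torch.randn(1, 30))  # -> (1, 257)
```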
After the object features and the posture features are fused, three-dimensional image features can be obtained based on the fusion to construct a three-dimensional object image, and various ways for constructing the three-dimensional object image can be provided, for example, the three-dimensional image features can be converted into the geometric features and the texture features of the three-dimensional object, a three-dimensional object model of the three-dimensional object is constructed according to the geometric features and the texture features, and the three-dimensional object model is projected to a two-dimensional plane to obtain the three-dimensional object image of the three-dimensional object.
The geometric feature may be understood as coordinate information of a key point of a 3D mesh structure of the three-dimensional object, the texture feature may be understood as a feature indicating texture information of the three-dimensional object, and various ways of converting the three-dimensional image feature into the geometric feature and the texture feature of the three-dimensional object may be provided, for example, position information of at least one key point may be extracted from the three-dimensional image feature, the position information of the key point may be converted into the geometric feature, texture information of the original object may be extracted from the three-dimensional image feature, and the texture feature may be extracted from the texture information, and a specific conversion process may be as shown in formula (2):
(S, T) = Reconstruct(coeff)   (2)

wherein S is the geometric feature, T is the texture feature, coeff is the three-dimensional image feature, and Reconstruct denotes the conversion from the three-dimensional image feature to the geometric and texture features.
After the geometric features and the texture features are converted, a three-dimensional object model of the three-dimensional object can be constructed, and the three-dimensional object model is projected onto a two-dimensional plane, so as to obtain a three-dimensional object image, and various ways of obtaining the three-dimensional object image can be provided, for example, three-dimensional model parameters of the three-dimensional object can be determined according to the geometric features and the texture features, based on the three-dimensional model parameters, a three-dimensional object model of the three-dimensional object is constructed, and the three-dimensional object model is projected onto the two-dimensional plane, so as to obtain the three-dimensional object image, which can be specifically represented by formula (3):
I_3d = Render(S, T)   (3)

wherein I_3d is the three-dimensional object image, S is the geometric feature, T is the texture feature, and Render denotes constructing the three-dimensional object model and projecting it onto the two-dimensional plane.
The three-dimensional object image can be constructed by performing 3D reconstruction of the three-dimensional object based on the three-dimensional image features, and the reconstruction method may be various; for example, an SMPL reconstruction method, a 3DMM reconstruction method, or another 3D reconstruction method may be adopted.
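A schematic sketch of the reconstruction step (formulas (2) and (3)) is given below, assuming a 3DMM-style linear-basis formulation and an extremely simplified orthographic projection purely for illustration; the basis tensors and the projection are placeholders, not the patent's actual renderer.

```python
# Schematic sketch: map three-dimensional image features to geometry S and
# texture T via linear bases, then project the colored vertices onto a 2D plane.
import torch

def reconstruct_geometry_texture(feats, mean_shape, id_basis, mean_tex, tex_basis):
    # S: 3D key-point coordinates of the object mesh; T: per-vertex texture.
    S = mean_shape + id_basis @ feats["identity"]     # (3N,)
    T = mean_tex + tex_basis @ feats["texture"]       # (3N,)
    return S.view(-1, 3), T.view(-1, 3)

def project_to_image(S, T, image_size=256):
    # Extremely simplified orthographic projection: scatter vertex colors onto a
    # blank canvas at the projected (x, y) locations.
    img = torch.zeros(3, image_size, image_size)
    xy = ((S[:, :2] + 1.0) * 0.5 * (image_size - 1)).long().clamp(0, image_size - 1)
    img[:, xy[:, 1], xy[:, 0]] = T.clamp(0, 1).t()
    return img

n_vertices = 1000
S, T = reconstruct_geometry_texture(
    {"identity": torch.randn(80), "texture": torch.randn(80)},
    torch.zeros(3 * n_vertices), torch.randn(3 * n_vertices, 80),
    torch.full((3 * n_vertices,), 0.5), torch.randn(3 * n_vertices, 80) * 0.01)
three_d_object_image = project_to_image(S, T)   # (3, 256, 256)
```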
104. And extracting the spatial features under each preset resolution from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image.
The spatial feature may be understood as representing spatial information in each preset resolution in the original image or the three-dimensional object image, and thus, both the original spatial feature of the original image and the three-dimensional spatial feature of the three-dimensional object may be multi-layer spatial features (spatial features).
The method for extracting the spatial features under each preset resolution from the original image and the three-dimensional object image may be various, and specifically may be as follows:
for example, an encoding network (Enc Block) of the trained image generation model may be used to perform spatial encoding on the original image and the three-dimensional object image at each preset resolution, so as to obtain an original spatial feature of the original image and a three-dimensional spatial feature of the three-dimensional object image at each resolution.
The encoding network (Enc Block) may include a plurality of sub-encoding networks, each corresponding to a preset resolution; the sub-encoding networks may be arranged in order of resolution from small to large to form the encoding network, and when the original image and the three-dimensional object image are input to the encoding network for encoding, each sub-encoding network outputs the spatial feature at its corresponding preset resolution. The encoding networks for the original image and the three-dimensional object image may be the same network or different networks, but different networks share network parameters. The structure of the sub-encoding network may be various; for example, it may consist of a single simple convolutional layer, or another encoding network structure may be adopted. The preset resolutions may be set according to the actual application, for example, from 4 × 4 to 512 × 512.
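The shared encoding network can be sketched roughly as follows, assuming one simple convolutional layer per preset resolution and a feature pyramid keyed by resolution; channel counts and the resolution list are illustrative assumptions.

```python
# Minimal sketch of the shared encoding network (Enc Block): one convolution per
# preset resolution, producing a pyramid of spatial features keyed by resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionEncoder(nn.Module):
    def __init__(self, resolutions=(512, 256, 128, 64, 32, 16, 8, 4), channels=64):
        super().__init__()
        self.resolutions = resolutions
        self.blocks = nn.ModuleList(
            [nn.Conv2d(3 if i == 0 else channels, channels, 3, padding=1)
             for i in range(len(resolutions))])

    def forward(self, image):                       # image: (B, 3, 512, 512) assumed
        feats, x = {}, image
        for res, block in zip(self.resolutions, self.blocks):
            x = F.interpolate(x, size=(res, res), mode="bilinear", align_corners=False)
            x = F.leaky_relu(block(x), 0.2)
            feats[res] = x                           # spatial feature at this preset resolution
        return feats

encoder = MultiResolutionEncoder()
original_spatial = encoder(torch.randn(1, 3, 512, 512))   # original image
three_d_spatial  = encoder(torch.randn(1, 3, 512, 512))   # shared weights, as stated above
```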
105. And fusing the original spatial features, the three-dimensional spatial features and the preset basic style features to obtain a target object image under the target resolution.
The target object image is an object image in which the body posture of the original object has been replaced with the body posture of the driving object. The body posture may be understood as posture information on the body of the object; for example, taking the object as a human body, the body posture may include the pose of the human body, and taking the object as a facial object, namely a human face, it may include the face posture, gaze, expression, and the like.
The method for fusing the original spatial features, the three-dimensional spatial features and the preset basic style features can be various, and specifically can be as follows:
for example, a basic object image under an initial resolution may be generated based on a preset basic style feature, a three-dimensional image feature is mapped to a hidden feature by using a trained image generation model, the preset basic style feature is adjusted based on the hidden feature, and the adjusted style feature, an original spatial feature and a three-dimensional spatial feature are fused according to the basic object image to obtain a target object image under a target resolution.
The preset basic style characteristics can be understood as style characteristics in a constant tensor (Const) preset in the image driving process. The style characteristics are characteristic information for generating an image of a specific style.
The hidden features may be understood as intermediate features w obtained by encoding the three-dimensional image features; different elements of the intermediate features w control different visual features, so that the correlation between features is reduced (decoupling and feature separation). The encoding process can extract, from the three-dimensional image features, the hidden deep-level relationships underlying the surface features and decouple these relationships, so that the hidden features (latent codes) can be obtained. There may be multiple ways of mapping the three-dimensional image features into the hidden features by adopting the trained image generation model; for example, a mapping network of the trained image generation model may be employed to map the three-dimensional image features directly into the hidden features (w).
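A minimal sketch of such a mapping network is shown below, assuming a small StyleGAN-style MLP from the three-dimensional image features (coeff) to the hidden feature w; layer sizes are illustrative.

```python
# Minimal sketch of a mapping network from 3D image features to hidden features w.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, coeff_dim=257, w_dim=512, depth=4):
        super().__init__()
        layers, in_dim = [], coeff_dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
            in_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, coeff):
        return self.net(coeff)      # hidden feature w

w = MappingNetwork()(torch.randn(1, 257))   # (1, 512)
```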
After the hidden features are mapped, the preset basic style features can be adjusted based on the hidden features, the adjustment can be understood as style modulation, and the adjustment mode can be various, for example, the size of the basic style features is adjusted to obtain initial style features, the hidden features are modulated to obtain convolution weights corresponding to the initial style features, and the initial style features are adjusted based on the convolution weights to obtain adjusted style features.
The convolution weight may be understood as the weight information used when performing convolution processing on the initial style features. There may be various ways of modulating the hidden feature; for example, a basic convolution weight may be obtained and adjusted based on the hidden feature, so as to obtain the convolution weight corresponding to the initial style features. The convolution weight adjustment based on the hidden feature can mainly be realized by the Mod and Demod modules in the decoding network of StyleGAN2 (a style migration model).
After the hidden features are modulated to obtain the convolution weights, the initial style features can be adjusted in various ways; for example, a target style convolutional network corresponding to the resolution of the basic object image is screened out from the style convolutional networks (StyleConv) of the trained image generation model, and the target style convolutional network performs convolution processing on the initial style features based on the convolution weights, so as to obtain the adjusted style features.
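The modulation described above can be sketched in the spirit of StyleGAN2's modulate/demodulate convolution, as below; the affine layer, tensor shapes and grouped-convolution trick are illustrative assumptions rather than the patent's exact implementation.

```python
# Sketch of a modulated style convolution: the hidden feature w scales the base
# convolution weight per input channel (Mod), the weight is re-normalized (Demod),
# and the resized initial style feature is convolved with the result.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedStyleConv(nn.Module):
    def __init__(self, in_ch, out_ch, w_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01)
        self.affine = nn.Linear(w_dim, in_ch)   # hidden feature w -> per-channel scales

    def forward(self, style_feat, w):
        B, in_ch, H, W = style_feat.shape
        scale = self.affine(w).view(B, 1, in_ch, 1, 1)                   # Mod
        weight = self.weight.unsqueeze(0) * scale                        # (B, out, in, 3, 3)
        demod = torch.rsqrt((weight ** 2).sum(dim=[2, 3, 4]) + 1e-8)     # Demod
        weight = weight * demod.view(B, -1, 1, 1, 1)
        x = style_feat.view(1, B * in_ch, H, W)                          # grouped-conv trick
        out = F.conv2d(x, weight.view(-1, in_ch, 3, 3), padding=1, groups=B)
        return out.view(B, -1, H, W)

conv = ModulatedStyleConv(in_ch=64, out_ch=64, w_dim=512)
adjusted = conv(torch.randn(2, 64, 16, 16), torch.randn(2, 512))   # adjusted style features
```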
After the preset basic style features are adjusted, the adjusted style features, the original spatial features and the three-dimensional spatial features can be fused according to the basic object image, the fusion mode can be various, for example, based on the preset resolution, the target original spatial features are screened out from the original spatial features, the target three-dimensional spatial features are screened out from the three-dimensional spatial features, the adjusted style features, the target original spatial features and the target three-dimensional spatial features are fused to obtain the fused style features under the current resolution, and based on the fused style features and the basic object image, the target object image under the target resolution is generated.
The target original spatial features may be screened out from the original spatial features, and the target three-dimensional spatial features screened out from the three-dimensional spatial features, based on the preset resolution in various ways. For example, the original spatial features and the three-dimensional spatial features may be sorted based on the preset resolutions; based on the sorting information, the original spatial feature with the smallest resolution may be screened out from the original spatial features as the target original spatial feature, and the three-dimensional spatial feature with the smallest resolution may be screened out from the three-dimensional spatial features as the target three-dimensional spatial feature. After the target original spatial feature and the target three-dimensional spatial feature are screened out, they can be deleted from the original spatial features and the three-dimensional spatial features respectively, so that each time the spatial features with the smallest remaining resolution are screened out from the original spatial features and the three-dimensional spatial features, thereby obtaining the target original spatial feature and the target three-dimensional spatial feature.
After the target original spatial feature and the target three-dimensional spatial feature are screened out, the adjusted style feature, the target original spatial feature and the target three-dimensional spatial feature may be fused in a variety of ways, for example, the target original spatial feature, the target three-dimensional spatial feature and the adjusted style feature may be directly spliced to obtain the fused style feature under the current resolution, which may be specifically shown in formula (4):
F_style^(i+1) = StyleConv(Concat(F_style^i, F_s^i, F_3d^i))   (4)

wherein F_style^(i+1) is the fused style feature, which can be the style feature corresponding to the next preset resolution of the basic style feature; F_style^i is the basic style feature; F_s^i is the target original spatial feature at the preset resolution; F_3d^i is the target three-dimensional spatial feature at the preset resolution; Concat represents connecting (splicing) the features in series; and StyleConv is the style convolutional network.
After the fused style features at the current resolution are obtained, a target object image at the target resolution may be generated based on the fused style features and the basic object image, and there are various ways of generating the target object image at the target resolution, for example, the current object image may be generated based on the fused style features, and the current object image and the basic object image are fused to obtain a fused object image at the current resolution, and the fused style features are used as preset basic style features, and the fused object image is used as the basic object image, and the step of adjusting the preset basic style features based on the hidden features is returned to be executed until the current resolution is the target resolution, so as to obtain the target object image.
In the process of generating the target object image at the target resolution, the current object image and the basic object image at different resolutions are superimposed in sequence, and the resolution increases at each step, so that a high-definition target object image can be output. Taking the target object image as a target human body image as an example, compared with digital human driving based on CG modeling, the target human body image generated by this scheme is more realistic (no uncanny valley effect) and more efficient.
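The coarse-to-fine loop described in the preceding paragraphs can be sketched as follows; all callables (adjust_style, style_block, to_image) are hypothetical placeholders standing in for the trained modules, and the stand-ins at the bottom exist only so the sketch runs.

```python
# Sketch of the coarse-to-fine generation loop: at each preset resolution the style
# feature is modulated with w, spliced with the spatial features for that resolution,
# turned into a current image, and accumulated onto the upsampled previous image.
import torch
import torch.nn.functional as F

def generate_target_image(const_style, w, original_spatial, three_d_spatial,
                          resolutions, adjust_style, style_block, to_image):
    style, image = const_style, None
    for res in resolutions:                                   # e.g. 4, 8, ..., 512
        adjusted = adjust_style(style, w)                     # modulate with hidden feature w
        adjusted = F.interpolate(adjusted, size=(res, res),
                                 mode="bilinear", align_corners=False)
        fused_style = torch.cat(
            [adjusted, original_spatial[res], three_d_spatial[res]], dim=1)
        fused_style = style_block(fused_style, res)           # style convolution at this resolution
        current = to_image(fused_style)                       # current object image
        if image is None:
            image = current                                   # basic object image at initial resolution
        else:
            image = current + F.interpolate(image, size=current.shape[-2:],
                                            mode="bilinear", align_corners=False)
        style = fused_style                                   # fused style becomes the next base style
    return image                                              # target object image at target resolution

# Dumb stand-ins so the sketch runs end to end; a real model would learn these.
chans, resolutions = 16, [4, 8, 16]
spa = {r: torch.randn(1, chans, r, r) for r in resolutions}
out = generate_target_image(
    torch.randn(1, chans, 4, 4), torch.randn(1, 512), spa, spa, resolutions,
    adjust_style=lambda s, w: s,
    style_block=lambda x, r: F.interpolate(x[:, :chans], size=(r, r),
                                           mode="bilinear", align_corners=False),
    to_image=lambda x: x[:, :3])
```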
Optionally, a basic optical flow field at the initial resolution may be generated based on the basic style features, so as to output a target optical flow field at the target resolution. The basic optical flow field may be understood as a field at the initial resolution used to indicate the movement of the same key points in the object image. The target optical flow field may be output in various ways; for example, a basic optical flow field at the initial resolution may be generated based on the basic style features, and the adjusted style features, the original spatial features and the three-dimensional spatial features may be fused according to the basic optical flow field, so as to obtain the target optical flow field at the target resolution.
The adjusted style features, the original spatial features and the three-dimensional spatial features may be fused in various ways; for example, the target original spatial features may be screened out from the original spatial features and the target three-dimensional spatial features screened out from the three-dimensional spatial features based on the preset resolution, the adjusted style features, the target original spatial features and the target three-dimensional spatial features may be fused to obtain the fused style features at the current resolution, and the target optical flow field at the target resolution may be generated based on the fused style features and the basic optical flow field.
For example, the current optical flow field may be generated based on the post-fusion style features and the basic optical flow field, the current optical flow field and the basic optical flow field are fused to obtain a fusion optical flow field at the current resolution, the post-fusion style features are used as preset basic style features, the fusion optical flow field is used as the basic optical flow field, the step of adjusting the preset basic style features based on the hidden features is executed in a return mode, and the target optical flow field is obtained until the current resolution is the target resolution.
The target object image and the target optical flow field may be generated simultaneously based on the preset basic style features, or the target object image or the target optical flow field may be generated separately based on the preset style features. Taking simultaneous generation of the target object image and the target optical flow field as an example, the preset basic style features, the basic object image and the basic optical flow field may be processed through the decoding network of the trained image generation model. The decoding network may comprise a decoding sub-network corresponding to each preset resolution, the resolution increasing from 4 to 512, and the decoding sub-networks are connected in sequence. As shown in fig. 5, a decoding sub-network may receive the fused style feature, the fusion object image (I_i) and the fused optical flow field (f_i) output by the previous decoding sub-network. The hidden feature w is modulated to obtain the convolution weight corresponding to the received style feature, and the received style feature is convolved based on this convolution weight to obtain the adjusted style feature. Based on the resolution corresponding to the decoding sub-network, the target original spatial feature corresponding to this sub-network is screened out from the original spatial features, and the target three-dimensional spatial feature corresponding to this sub-network is screened out from the three-dimensional spatial features. The adjusted style feature, the target original spatial feature and the target three-dimensional spatial feature are connected in series (concatenated), so that the fused style feature output by this decoding sub-network can be obtained. Then, based on this fused style feature, the current object image and the current optical flow field are generated; the current object image is fused with I_i output by the previous decoding sub-network to obtain the fusion object image (I_(i+1)) output by the current decoding sub-network, and the current optical flow field is fused with the fused optical flow field (f_i) output by the previous decoding sub-network to obtain the fused optical flow field (f_(i+1)) output by the current decoding sub-network. These are then output to the next decoding sub-network, until the decoding sub-network corresponding to the target resolution outputs its fused image and fused optical flow field, so that the fused image at the target resolution can be used as the target object image, and the fused optical flow field at the target resolution can be used as the target optical flow field.
Optionally, the trained image generation model may be set according to practical application, and it should be noted that, the trained image generation model may be set in advance by a maintenance person, or may be trained by an image processing apparatus, that is, before the step "mapping the three-dimensional image features into the hidden features by using the trained image generation model", the image processing method may further include:
the method comprises the steps of obtaining a video sample, screening out any two frames of video frames from the video sample to be used as an original image sample and a driving image sample respectively, carrying out object driving on the original image sample and the driving image sample by adopting a preset image generation model to obtain a predicted object image, converging the preset image generation model based on the predicted object image and the driving image sample to obtain a trained image generation model, wherein the method specifically comprises the following steps:
(1) and acquiring a video sample, and screening any two frames of video frames in the video sample to be used as an original image sample and a driving image sample respectively.
The video sample acquisition method may be various, and specifically may be as follows:
for example, a video sample uploaded by the terminal may be obtained, or at least one candidate video may be obtained from a video database or a network, and any one of the candidate videos is screened out as the video sample, or at least one candidate video may be obtained from the video database or the network, and a video including the object is screened out from the candidate videos, so as to obtain the video sample.
After the video sample is obtained, any two frames of video frames can be screened out from the video sample to be used as the original image sample and the driving image sample respectively, and the screening out of the original image sample and the driving image sample can be performed in various ways, for example, the video sample is subjected to framing processing to obtain a video frame set corresponding to the video sample, any two frames of video frames are screened out from the video frame set, and the two frames of video frames are used as the original image sample and the driving image sample respectively.
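A small sketch of this frame-pair sampling, assuming OpenCV for decoding the video sample; the file name is hypothetical.

```python
# Sketch: split the video sample into frames and draw two arbitrary frames as the
# original image sample and the driving image sample.
import random
import cv2

def sample_training_pair(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    idx_src, idx_drv = random.sample(range(len(frames)), 2)   # any two distinct frames
    return frames[idx_src], frames[idx_drv]                   # original sample, driving sample

# original_sample, driving_sample = sample_training_pair("video_sample.mp4")
```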
(2) And carrying out object driving on the original image sample and the driving image sample by adopting a preset image generation model to obtain a predicted object image.
For example, a sample object feature may be extracted from an original image sample, a sample posture feature may be extracted from a driving image sample, the sample object feature and the sample posture feature may be fused to obtain a three-dimensional image sample feature, a three-dimensional object image sample may be constructed based on the three-dimensional image sample feature, a spatial feature at each preset resolution may be extracted from the original image sample and the three-dimensional object image sample to obtain an original spatial sample feature of the original image sample and a three-dimensional spatial sample feature of the three-dimensional object image sample, and the original spatial sample feature, the three-dimensional spatial sample feature, and a preset basic style feature may be fused to obtain a predicted object image at a target resolution.
(3) And converging the preset image generation model based on the prediction object image and the driving image sample to obtain the trained image generation model.
For example, the prediction object image and the driving image sample are detected respectively, and based on the detection result, the countermeasure loss information of the preset image generation model is determined, based on the prediction object image and the driving image sample, the reconstruction loss information of the preset image generation model is determined, the countermeasure loss information and the reconstruction loss information are fused, and based on the fused loss information, the preset image generation model is converged, so as to obtain the trained image generation model.
For example, the discriminator D of the image generation model can be used to discriminate between the driving image sample GT and the prediction object image, where a generated image should be discriminated as false and a real image as true; the discrimination result and the detection probability of the discrimination result are thereby obtained and used as the detection result.
After the predicted object image and the driving image sample are detected, the countermeasure loss information of the preset image generation model can be determined based on the detection result, and the manner of determining the countermeasure loss information may be various, for example, the detection probability of each discrimination result can be extracted from the detection result, the countermeasure parameters of the predicted object image and the driving image sample can be determined based on the detection probability, and the countermeasure parameters are fused to obtain the countermeasure loss information, which may be specifically as shown in formula (5):
L_GAN = log(D(GT)) + log(1 - D(G(input)))  (5)
wherein L_GAN is the countermeasure loss information, D(GT) is the detection result of the driving image sample, D(G(input)) is the detection result of the prediction object image, GT is the driving image sample, input is the input original image sample and driving image sample, D is the discriminator, and G is the driving network (generation network).
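A minimal sketch of this countermeasure (adversarial) term is shown below, assuming a standard logistic GAN formulation in PyTorch; the discriminator object and the exact loss form are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(discriminator, ground_truth, predicted):
    """The discriminator scores the real driving image sample (GT) as true and the
    generated prediction object image as false; the driving network is then trained
    so that its output is scored as true."""
    real_score = discriminator(ground_truth)        # D(GT)
    fake_score = discriminator(predicted.detach())  # D(G(input)), detached for the D update

    d_loss = F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score)) \
           + F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score))

    g_score = discriminator(predicted)
    g_loss = F.binary_cross_entropy_with_logits(g_score, torch.ones_like(g_score))
    return d_loss, g_loss
```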
The reconstruction loss information may be understood as loss information between a reconstructed prediction object image and a reconstructed driving image, and various ways of determining the reconstruction loss information may be provided, for example, image similarity between the prediction object image and a driving image sample may be calculated to obtain perceptual loss information of a preset image generation model, the prediction object image and the driving image are compared to obtain image loss information of the preset image generation model, and the perceptual loss information and the image loss information are fused to obtain reconstruction loss information of the preset image generation model.
The perceptual loss information is used to indicate the image similarity between the prediction object image and the driving image sample, and there are various ways to calculate the perceptual loss information, for example, LPIPS loss (a perceptual loss function) may be used to calculate the image similarity between the prediction object image and the driving image sample, and the image similarity is used as the perceptual loss information.
The image loss information is used to indicate a feature distance between image features of the prediction object image and the driving image sample, and there are various ways to compare the prediction object image with the driving image sample, for example, the object image feature may be extracted from the prediction object image, the driving image feature may be extracted from the driving image sample, and the feature distance between the object image feature and the driving image feature may be calculated, so as to obtain the image loss information of the preset image generation model. For example, L1 loss may be used to calculate the feature distance between the object image feature and the driving image feature, so as to obtain the image loss information of the preset image generation model.
After the perceptual loss information and the image loss information are calculated, the perceptual loss information and the image loss information may be fused to obtain reconstruction loss information, and the fusion manner may be various, for example, the perceptual loss information and the image loss information may be directly added to obtain the reconstruction loss information, which may be specifically represented by formula (6):
L_rec = |G(input) - GT|_1 + |LPIPS(G(input)) - LPIPS(GT)|_1  (6)
wherein L_rec is the reconstruction loss information, GT is the driving image sample, input is the input original image sample and driving image sample, G is the driving network (generation network), and |LPIPS(G(input)) - LPIPS(GT)|_1 is the perceptual loss information.
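As a sketch of formula (6), the reconstruction loss could be computed with an L1 image term plus a perceptual term; the example below uses the publicly available lpips package as the perceptual metric, which is an assumed choice rather than the exact network used by the scheme.

```python
import torch
import lpips  # pip install lpips; perceptual similarity metric (assumed choice)

perceptual_metric = lpips.LPIPS(net="vgg")

def reconstruction_loss(predicted, ground_truth):
    """L_rec = |G(input) - GT|_1 + perceptual distance, as in formula (6)."""
    image_loss = torch.mean(torch.abs(predicted - ground_truth))       # L1 image term
    perceptual_loss = perceptual_metric(predicted, ground_truth).mean()
    return image_loss + perceptual_loss
```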
After the reconstruction loss information and the countermeasure loss information are determined, the reconstruction loss information and the countermeasure loss information may be fused, and there are various fusion manners, for example, the reconstruction loss information and the countermeasure loss information may be directly added to obtain the fused loss information, which may be specifically shown in formula (7):
L = L_GAN + L_rec  (7)
wherein L is the fused loss information, L_GAN is the countermeasure loss information, and L_rec is the reconstruction loss information.
Or, weighting parameters of the reconstruction loss information and the countermeasure loss information may be acquired, the reconstruction loss information and the countermeasure loss information are weighted based on the weighting parameters, so as to obtain weighted reconstruction loss information and weighted countermeasure loss information, and the weighted reconstruction loss information and weighted countermeasure loss information are fused to obtain fused loss information.
After the post-fusion loss information is obtained, the pre-set image generation model may be converged based on the post-fusion loss information to obtain a trained image generation model, and the pre-set image generation model may be converged in various manners, for example, a gradient descent algorithm may be used to update network parameters of the pre-set image generation model based on the post-fusion loss information to converge the pre-set image generation model, so as to obtain the trained image generation model, or other convergence algorithms may be used to update network parameters of the pre-set image generation model based on the post-fusion loss information, so as to obtain the trained image generation model.
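Continuing the sketches above (and reusing the reconstruction_loss helper from the formula (6) example), one possible training step that converges the model by gradient descent on the fused loss of formula (7) could look as follows; the optimizer choice and the update order are assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(generator, discriminator, g_optimizer, d_optimizer,
                  original_sample, driving_sample):
    """One hypothetical update; the driving image sample also serves as GT."""
    predicted = generator(original_sample, driving_sample)

    # Discriminator update: real driving image -> true, generated image -> false.
    real_score = discriminator(driving_sample)
    fake_score = discriminator(predicted.detach())
    d_loss = F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score)) \
           + F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score))
    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # Driving network update with the fused loss L = L_GAN + L_rec.
    adv_score = discriminator(predicted)
    g_adv = F.binary_cross_entropy_with_logits(adv_score, torch.ones_like(adv_score))
    fused_loss = g_adv + reconstruction_loss(predicted, driving_sample)
    g_optimizer.zero_grad()
    fused_loss.backward()
    g_optimizer.step()
    return fused_loss.item()
```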
When the preset image generation model is trained, as shown in fig. 6, two frames are arbitrarily selected from the same video, one frame being used as the original image sample (I_s) and the other frame as the driving image sample (I_d); the driving image frame is also used as the target generation result (Ground Truth), thereby obtaining the training data. After the training data is obtained, the process of training the preset image generation model is generally divided into two parts: one part is 3D reconstruction, and the other part is generation of the prediction object image based on the generation network (driving network). In the 3D reconstruction part, ResNet50 is used as the 3D coefficient prediction network; the ResNet50 network extracts identity, shadow and texture information from the original image sample (I_s) and extracts the pose from the driving image sample (I_d), and a 3D image is reconstructed from the combination of the 3D coefficients by the SMPL method, wherein the texture and light shadow of the 3D image come from the original image sample, and the pose comes from the driving image sample. The generation network part is mainly built based on Stylegan v2 and includes a coding network (Enc Block), a mapping network and a decoding network. The coding network extracts multiple layers of spatial features from the 3D image and the driving image samples respectively; the lowest output resolution is 4 x 4 and the highest resolution is 512 x 512, so that the features can be matched one to one with the outputs of the decoding module. Each coding module is composed of a simple one-layer convolutional network. The mapping network maps the combination of 3D coefficients into an implicit feature w in the implicit space. The decoding network decodes the spatial features corresponding to each preset resolution obtained through spatial coding; in the decoding process, a basic style feature, a basic object image and a basic optical flow field are generated from a constant tensor (const), and the basic object image and the current object image are superposed resolution by resolution from low to high, so as to obtain the prediction object image at the target resolution. The preset image generation model is converged based on the prediction object image and the driving image sample, so as to obtain the trained image generation model. When the preset image generation model is trained, the generation network (mapping network and decoding network) and the discriminator in the preset image generation model may be trained in advance, while the coding network needs to be trained from scratch; therefore, during training, the learning rates of the three networks are different, and the ratio of the learning rates may be set according to practical applications, for example, the ratio of the learning rates of the coding network, the generation network and the discriminator may be 100: 10: 1 or another ratio.
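One possible way to realize the different learning rates mentioned above (coding network : generation network : discriminator = 100 : 10 : 1) is through separate optimizer parameter groups; encoder, mapping_net, decoder and discriminator below are placeholder modules, and the base learning rate is an arbitrary illustrative value.

```python
import torch

base_lr = 1e-5  # illustrative base value, not specified by the scheme

gen_optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 100 * base_lr},      # coding network, trained from scratch
    {"params": mapping_net.parameters(), "lr": 10 * base_lr},   # pretrained generation network
    {"params": decoder.parameters(), "lr": 10 * base_lr},
])
disc_optimizer = torch.optim.Adam(discriminator.parameters(), lr=1 * base_lr)
```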
The trained image generation model can output the target object image and the target optical flow field simultaneously, or output the target object image or the target optical flow field separately. Taking the output of the target object image as an example, the process of the trained image generation model outputting the target object image at the target resolution may be as shown in fig. 7, where the body posture of the target image output by the trained image generation model is consistent with that of the driving image, and the texture is consistent with that of the original image. Compared with fig. 6, the biggest difference between the training process and the use process of the trained image generation model is that, in the use process, the input original image and driving image may be any object images, and the objects in the original image and the driving image may be the same object or different objects, while in the training process, in order to improve the training accuracy of the model, different video frames in the same video are generally used as the original image sample and the driving image sample respectively.
As can be seen from the above, in the embodiment of the present application, after an original image of an original object and a driving image of a driving object are obtained, an object feature of the original object is extracted from the original image, a posture feature of the driving object is extracted from the driving image, then, the object feature and the posture feature are fused to obtain a three-dimensional image feature, a three-dimensional object image is constructed based on the three-dimensional image feature, then, a spatial feature at each preset resolution is extracted from the original image and the three-dimensional object image to obtain an original spatial feature of the original image and a three-dimensional spatial feature of the three-dimensional object image, and then, the original spatial feature, the three-dimensional spatial feature and a preset basic style feature are fused to obtain a target object image at a target resolution; according to the scheme, the three-dimensional object image can be constructed based on the three-dimensional object features extracted from the original image and the driving image, so that the body posture details of the driving object can be accurately captured, the target object image under the target resolution is generated through the spatial features of the driving image and the three-dimensional reconstruction image under different resolutions, the resolution of the generated target object image is ensured, and therefore the accuracy of image processing can be improved.
The method described in the above examples is further illustrated in detail below by way of example.
In this embodiment, an example will be described in which the image processing apparatus is specifically integrated in an electronic device, the electronic device is a server, and both the original object and the driving object are human bodies.
As shown in fig. 8, a specific flow of an image processing method is as follows:
201. the server acquires an original image of an original human body and a driving image for driving the human body.
For example, the server may directly obtain an original image of an original human body and a driving image of a driving human body uploaded by the terminal, or may extract any two frames of video frames including the human body from the video, use one frame as the original image, use the human body in the original image as the original human body, use the remaining frame as the driving image, and use the human body in the driving image as the driving human body, or may further screen out a human body image including the human body from the image data, and screen out any two images from the human body image as the original image and the driving image, and so on.
202. The server extracts the human body characteristics of the original human body from the original image and extracts the posture characteristics of the driving human body from the driving image.
For example, the server may extract a 3D coefficient (coeff) from the original image and the driving image respectively using the ResNet50 network of the trained image generation model, take the 3D coefficient extracted from the original image as the original image feature, and take the 3D coefficient extracted from the driving image as the driving image feature. Parameters such as identity, shadow and texture are extracted from the original image features, so as to obtain the human body features of the original human body, and parameters such as posture are screened from the driving image features, so as to obtain the posture features of the driving human body.
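For illustration only, the split of the predicted 3D coefficients into an identity/texture/shadow group and a posture group could be organized as below; the coefficient layout and dimensions are assumptions made for the example, not values taken from the scheme.

```python
import torch

# Assumed coefficient layout: [identity | texture | shadow | pose]; dims are illustrative.
COEFF_SLICES = {
    "identity": slice(0, 80),
    "texture": slice(80, 160),
    "shadow": slice(160, 187),
    "pose": slice(187, 193),
}

def build_3d_coefficients(coeff_original: torch.Tensor, coeff_driving: torch.Tensor):
    """Human body features come from the original image coefficients; posture
    features come from the driving image coefficients."""
    body_feat = torch.cat([coeff_original[..., COEFF_SLICES[k]]
                           for k in ("identity", "texture", "shadow")], dim=-1)
    pose_feat = coeff_driving[..., COEFF_SLICES["pose"]]
    return torch.cat([body_feat, pose_feat], dim=-1)  # three-dimensional image features
```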
203. And the server fuses the human body characteristics and the posture characteristics to obtain three-dimensional image characteristics.
For example, the server may directly stitch the human body features and the posture features to obtain three-dimensional image features, or may further obtain weighting parameters of the human body features and the posture features, weight the human body features and the posture features respectively based on the weighting parameters to obtain weighted human body features and weighted posture features, and fuse the weighted human body features and the weighted posture features to obtain three-dimensional image features.
204. And the server constructs a three-dimensional human body image based on the three-dimensional image characteristics.
For example, the server may extract position information of at least one key point from the three-dimensional image features, convert the position information of the key point into geometric features, extract texture information of an original human body from the three-dimensional image features, and extract texture features from the texture information, where a specific conversion process may be as shown in formula (2). Determining three-dimensional model parameters of the three-dimensional human body according to the geometric characteristics and the textural characteristics, reconstructing the three-dimensional human body model of the three-dimensional human body by adopting an SMPL reconstruction method based on the three-dimensional model parameters, and projecting the three-dimensional human body model to a two-dimensional plane so as to obtain a three-dimensional human body image, wherein the three-dimensional human body image can be specifically shown as a formula (3).
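To illustrate only the final projection step, below is a sketch of projecting reconstructed 3D vertices onto a 2D image plane with a simple orthographic camera; the actual SMPL reconstruction and texture rendering are outside the scope of this sketch, and the scale and translation parameters are assumptions.

```python
import numpy as np

def project_to_plane(vertices_3d: np.ndarray, scale: float = 1.0,
                     translation=(0.0, 0.0)) -> np.ndarray:
    """Orthographically project reconstructed (N, 3) vertices onto the 2D plane.
    A full implementation would rasterize the textured mesh into the
    three-dimensional human body image; only the geometric projection is shown."""
    xy = vertices_3d[:, :2]                      # drop depth for an orthographic camera
    return scale * xy + np.asarray(translation)  # (N, 2) image-plane coordinates
```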
205. The server extracts the spatial features under each preset resolution from the original image and the three-dimensional human body image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional human body image.
For example, the server may use an encoding network (Enc Block) of the trained image generation model to perform spatial encoding on the original image and the three-dimensional human body image at each preset resolution, so as to obtain an original spatial feature of the original image and a three-dimensional spatial feature of the three-dimensional human body image at each resolution, where the preset resolution may be 4 × 4 to 512 × 512.
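A minimal sketch of a multi-resolution encoder that produces one spatial feature per preset resolution from 512 x 512 down to 4 x 4 is given below; the channel counts follow the "one simple convolution layer" description only loosely and are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SpatialEncoder(nn.Module):
    """Produces a spatial feature at each preset resolution (512x512 down to 4x4)."""
    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, feat_channels, 3, stride=1, padding=1)
        # 7 downsampling blocks: 512 -> 256 -> 128 -> 64 -> 32 -> 16 -> 8 -> 4
        self.down_blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(feat_channels, feat_channels, 3, stride=2, padding=1),
                          nn.LeakyReLU(0.2))
            for _ in range(7)
        ])

    def forward(self, image):
        features = {}
        x = self.stem(image)
        features[x.shape[-1]] = x        # feature at the input resolution (512)
        for block in self.down_blocks:
            x = block(x)
            features[x.shape[-1]] = x    # 256, 128, ..., 4
        return features

# Usage sketch: encode the original image and the three-dimensional human body image.
# encoder = SpatialEncoder()
# original_spatial = encoder(original_image)          # dict: resolution -> feature
# spatial_3d = encoder(three_dimensional_body_image)
```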
206. And the server fuses the original spatial features, the three-dimensional spatial features and the preset basic style features to obtain a target human body image under the target resolution.
For example, the server may generate a basic human body image at an initial resolution based on the preset basic style features, and use the mapping network of the trained image generation model to directly map the three-dimensional image features into hidden features (w).
The server adjusts the size of the basic style feature to obtain an initial style feature, acquires a basic convolution weight, and adjusts the convolution weight based on the hidden features, thereby obtaining the convolution weight corresponding to the initial style feature. A target style convolution network corresponding to the resolution of the basic human body image is screened out from the style convolution networks (StyleConv) of the trained image generation model, and the initial style feature is adjusted based on the convolution weight to obtain the adjusted style feature.
The server sorts the original spatial features and the three-dimensional spatial features respectively based on the preset resolutions, screens out the original spatial feature with the minimum resolution from the original spatial features as the target original spatial feature based on the sorting information, and screens out the three-dimensional spatial feature with the minimum resolution from the three-dimensional spatial features as the target three-dimensional spatial feature. After the target original spatial feature and the target three-dimensional spatial feature are screened out, the target original spatial feature can be deleted from the original spatial features and the target three-dimensional spatial feature deleted from the three-dimensional spatial features, so that the spatial feature with the minimum resolution can be screened out from the remaining original spatial features and three-dimensional spatial features each time, thereby obtaining the target original spatial feature and the target three-dimensional spatial feature at each resolution.
And the server splices the target original spatial feature, the target three-dimensional spatial feature and the adjusted style feature, so as to obtain the fused style feature under the current resolution, which can be specifically shown in formula (4). And generating a current human body image based on the fused style characteristics, fusing the current human body image and the basic human body image to obtain a fused human body image under the current resolution, taking the fused style characteristics as preset basic style characteristics, taking the fused human body image as the basic human body image, returning to execute the step of adjusting the preset basic style characteristics based on the hidden characteristics until the current resolution is the target resolution, and obtaining the target human body image.
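The coarse-to-fine fusion loop just described can be summarized with the following sketch; style_convs and to_rgb are placeholder modules standing in for the StyleConv layers and image superposition of the generation network (each style_convs[res] is assumed to upsample its input to resolution res and modulate it with w), so this is an illustration rather than a verbatim implementation.

```python
import torch
import torch.nn.functional as F

def decode_to_target(style_feat, w, base_image, original_spatial, spatial_3d,
                     style_convs, to_rgb, target_resolution=512):
    """Hypothetical coarse-to-fine decoding: at each resolution, adjust the style
    feature with the hidden feature w, splice in the two spatial features of that
    resolution, generate a current image and superpose it on the basic image."""
    resolution = base_image.shape[-1]          # initial resolution, e.g. 4
    while True:
        adjusted = style_convs[resolution](style_feat, w)       # adjusted style feature
        fused = torch.cat([adjusted,
                           original_spatial[resolution],        # target original spatial feature
                           spatial_3d[resolution]], dim=1)      # target 3D spatial feature
        current_image = to_rgb[resolution](fused, w)            # current human body image
        base_image = F.interpolate(base_image, size=current_image.shape[-2:],
                                   mode="bilinear", align_corners=False) + current_image
        style_feat = fused                     # fused style feature -> next basic style feature
        if resolution >= target_resolution:
            return base_image                  # target image at the target resolution
        resolution *= 2
```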
Optionally, the server may further generate a basic optical flow field at the initial resolution based on the basic style features. Target original spatial features are screened out from the original spatial features and target three-dimensional spatial features are screened out from the three-dimensional spatial features based on the preset resolution; the adjusted style features are fused with the target original spatial features and the target three-dimensional spatial features to obtain the fused style features at the current resolution; a current optical flow field is generated based on the fused style features and fused with the basic optical flow field to obtain the fused optical flow field at the current resolution; then the fused style features are taken as the preset basic style features and the fused optical flow field as the basic optical flow field, and the step of adjusting the preset basic style features based on the hidden features is executed again until the current resolution is the target resolution, so as to obtain the target optical flow field.
For the target human body image and the target optical flow field, the server may generate the target human body image and the target optical flow field simultaneously based on preset basic style characteristics, or may generate the target human body image and the target optical flow field separately based on preset style characteristics.
Optionally, before the server uses the trained image generation model to map the three-dimensional image features into the hidden features, the server may also train a preset image generation model to obtain the trained image generation model, which may specifically be as follows:
(1) the server obtains the video samples, and any two frames of video frames are screened out from the video samples to be used as an original image sample and a driving image sample respectively.
For example, the server may obtain a video sample uploaded by the terminal, or may obtain at least one candidate video from a video database or a network, and filter out any one of the candidate videos as the video sample, or may also obtain at least one candidate video from the video database or the network, and filter out videos including a human body from the candidate videos, so as to obtain the video sample.
(2) And the server adopts a preset image generation model to carry out human body driving on the original image sample and the driving image sample to obtain a predicted human body image.
For example, the server may extract a sample human body feature from an original image sample, extract a sample posture feature from a driving image sample, fuse the sample human body feature and the sample posture feature to obtain a three-dimensional image sample feature, construct a three-dimensional human body image sample based on the three-dimensional image sample feature, extract a spatial feature at each preset resolution from the original image sample and the three-dimensional human body image sample to obtain an original spatial sample feature of the original image sample and a three-dimensional spatial sample feature of the three-dimensional human body image sample, and fuse the original spatial sample feature, the three-dimensional spatial sample feature and a preset basic style feature to obtain a predicted human body image at a target resolution.
(3) And the server converges the preset image generation model based on the predicted human body image and the driving image sample to obtain the trained image generation model.
For example, the server may use the discriminator D of the trained image generation model to discriminate the driving image sample GT and the predicted human body image, and determine that the generated result is false and the real image is true, so as to obtain a discrimination result and a detection probability of the discrimination result, and use the discrimination result and the detection probability as the detection result.
The server extracts the detection probability of each discrimination result from the detection results, determines the confrontation parameters of the predicted human body image and the driving image sample based on the detection probabilities, and fuses the confrontation parameters to obtain the confrontation loss information, which can be specifically shown in formula (5).
The server can adopt LPIPS loss to calculate the image similarity between the predicted human body image and the driving image sample, and the image similarity is used as perceptual loss information. Human body image features are extracted from the predicted human body image, driving image features are extracted from the driving image sample, and the feature distance between the human body image features and the driving image features is calculated, so as to obtain the image loss information of the preset image generation model. For example, L1 loss may be used to calculate the feature distance between the human body image features and the driving image features, so as to obtain the image loss information of the preset image generation model. The reconstruction loss information can be obtained by adding the perceptual loss information and the image loss information, which may be specifically expressed as formula (6). The reconstruction loss information and the countermeasure loss information are added to obtain the fused loss information, which may be specifically expressed as formula (7).
And the server adopts a gradient descent algorithm to update the network parameters of the preset image generation model based on the loss information after fusion so as to converge the preset image generation model, thereby obtaining the trained image generation model, or adopts other convergence algorithms to update the network parameters of the preset image generation model based on the loss information after fusion, thereby obtaining the trained image generation model.
The generation network (mapping network and decoding network) and the discriminator in the preset image generation model can be trained in advance, while the coding network needs to be trained from scratch, so that the learning rates of the three networks are different during training; for example, the learning rate ratio of the coding network, the generation network and the discriminator is 100: 10: 1.
As can be seen from the above, in the embodiment of the application, after the server acquires the original image of the original human body and the driving image of the driving human body, the human body feature of the original human body is extracted from the original image, the posture feature of the driving human body is extracted from the driving image, then, the human body feature and the posture feature are fused to obtain the three-dimensional image feature, the three-dimensional human body image is constructed based on the three-dimensional image feature, then, the spatial feature under each preset resolution is extracted from the original image and the three-dimensional human body image to obtain the original spatial feature of the original image and the three-dimensional spatial feature of the three-dimensional human body image, and then, the original spatial feature, the three-dimensional spatial feature and the preset basic style feature are fused to obtain the target human body image under the target resolution; according to the scheme, the three-dimensional human body image can be constructed based on the three-dimensional human body features extracted from the original image and the driving image, so that the body posture details of the driving human body can be accurately captured, the target human body image under the target resolution is generated through the spatial features of the driving image and the three-dimensional reconstruction image under different resolutions, the resolution of the generated target human body image is ensured, and the accuracy of image processing can be improved.
The method described in the above examples is further illustrated in detail below by way of example.
In the present embodiment, an example will be described in which the image processing apparatus is specifically integrated in an electronic device, and both the original object and the driving object are faces.
An image processing method comprising:
acquiring an original face image and a face driving image, the face driving image being a template image for adjusting a face pose in the original face image; extracting facial image features from the original face image, and extracting facial posture features from the face driving image; fusing the facial image features and the facial posture features to obtain three-dimensional facial features, and constructing a three-dimensional face image based on the three-dimensional facial features; extracting spatial features under each preset resolution from the original face image and the three-dimensional face image to obtain original spatial features of the original face image and three-dimensional spatial features of the three-dimensional face image; and fusing the original spatial features, the three-dimensional spatial features and preset basic style features to obtain a target face image under a target resolution, the target face image being an image in which the face pose in the original face image is replaced with the face pose in the face driving image.
As shown in fig. 9, the specific flow of the image processing method is as follows:
301. an original face image and a face driving image are acquired.
The original face image is a face image whose face posture needs to be adjusted, and the face driving image is a template image used for adjusting the face posture in the original face image. The facial pose adjustment of the original facial image can be understood as replacing the facial expressions, eye movements, poses and the like in the original facial image with the facial expressions, eye movements, poses and the like of the face driving image; taking a human face as an example, the facial pose adjustment process can be as shown in fig. 10.
The manner of acquiring the original face image and the face driving image may be various, and specifically, the manner may be as follows:
for example, the original face image and the face driving image uploaded by the terminal may be directly acquired, or any two frames of video frames containing faces may be extracted from the video, one frame may be used as the original face image, and the remaining frame may be used as the face driving image, or the face image containing faces may be screened from the image data, and any two frames may be screened from the face image as the original face image and the face driving image, respectively.
302. Facial image features are extracted from the original facial image, and facial pose features are extracted from the face-driven image.
Among them, the facial image features are feature information for characterizing the identity (identity), lighting (lighting), texture (texture), and the like of the original facial image, and the facial pose features are feature information for characterizing the facial pose (pose), gaze (gaze), expression (expression), and the like of the face driving image.
For example, the original face image and the face driving image may be subjected to image feature extraction by an image feature extraction network (ResNet50) of the trained image generation model to obtain original image features and driving image features, feature information such as identity, shadow, texture, and the like may be screened out from the original image features as face image features, and face pose features may be screened out from the driving image features.
The image feature extraction method may refer to the feature extraction methods for the original image and the driving image, which are not described in detail herein.
303. And fusing the facial image features and the facial posture features to obtain three-dimensional facial features, and constructing a three-dimensional facial image based on the three-dimensional facial features.
The three-dimensional facial features can be understood as image features for three-dimensional reconstruction of the face.
The three-dimensional face image may be a face image obtained by rendering the three-dimensional facial features into a three-dimensional face model and projecting the three-dimensional face model onto a two-dimensional plane, which may also be understood as 3D face reconstruction; the reconstruction process may be as shown in fig. 11.
The mode of fusing the facial image features and the facial pose features can be various, and specifically, the mode can be as follows:
for example, the facial image features and the facial pose features may be directly spliced to obtain three-dimensional facial features, or weighting parameters of the facial image features and the facial pose features may be obtained, the facial image features and the facial pose features are weighted based on the weighting parameters, respectively, to obtain weighted rear facial image features and weighted rear facial pose features, and the weighted rear facial image features and the weighted rear facial pose features are fused to obtain three-dimensional facial features.
After the facial image features and the facial pose features are fused to obtain the three-dimensional facial features, a three-dimensional facial image can be constructed based on the three-dimensional facial features, and the three-dimensional facial image can be constructed in various ways.
The method for constructing the three-dimensional face image may refer to a construction process of the three-dimensional object image, which is not described in detail herein.
304. And extracting the spatial features under each preset resolution from the original face image and the three-dimensional face image to obtain the original spatial features of the original face image and the three-dimensional spatial features of the three-dimensional face image.
For example, an encoding network (Enc Block) of the trained image generation model may be used to perform spatial encoding on the original face image and the three-dimensional face image at each preset resolution, so as to obtain an original spatial feature of the original face image and a three-dimensional spatial feature of the three-dimensional face image at each resolution.
The specific process of spatial coding may refer to a method of performing spatial coding on an original image and a three-dimensional object image, which is not described herein any more.
305. And fusing the original spatial features, the three-dimensional spatial features and the preset basic style features to obtain a target face image under the target resolution.
Wherein the target face image is an image in which the face pose in the original face image is replaced with the face pose in the face driving image. The face pose can be understood as the state information of the face region; taking a human face as an example, the face pose can be the gaze, expression, facial posture and the like of the face.
The method for fusing the original spatial features, the three-dimensional spatial features and the preset basic style features can be various, and specifically can be as follows:
for example, a basic object image at an initial resolution may be generated based on a preset basic style feature, a three-dimensional image feature is mapped to a hidden feature by using a trained image generation model, the preset basic style feature is adjusted based on the hidden feature, and the adjusted style feature, an original spatial feature and a three-dimensional spatial feature are fused according to the basic object image to obtain a target face image at a target resolution.
The process of generating the target face image with the target resolution by fusing the original spatial features, the three-dimensional spatial features and the preset basic style features may refer to the process of generating the target object image with the target resolution, and is not described in detail herein.
Optionally, the basic optical flow field at the initial resolution may be generated based on the basic style features, so as to output the target optical flow field at the target resolution, which may be specifically described above, and thus is not described herein any more.
Optionally, the trained image generation model may be set according to practical application, and it should be noted that, the trained image generation model may be set in advance by a maintenance person, or may be trained by an image processing apparatus, and the training process may be as follows:
the method comprises the steps of obtaining a video sample, screening out any two frames of video frames from the video sample to be used as an original face image sample and a face driving image sample respectively, carrying out face driving on the original face image sample and the face driving image sample by adopting a preset image generation model to obtain a predicted face image, converging the preset image generation model based on the predicted face image and the face driving image sample to obtain a trained image generation model, wherein the method specifically comprises the following steps:
(1) and acquiring a video sample, and screening any two frames of video frames in the video sample to be used as an original face image sample and a face driving image sample respectively.
The method for obtaining the video sample may refer to the above, and is not repeated here, after obtaining the video sample, the original face image sample and the face driving image sample may be screened out from the video sample, and the method for screening the original face image sample and the face driving image sample may be various, and specifically may be as follows:
for example, a video sample is subjected to framing processing, so that a video frame set corresponding to the video sample is obtained, any two video frames containing a face area are screened out from the video frame set, and the two video frames are respectively used as an original face image sample and a face driving image sample.
(2) And carrying out face driving on the original face image sample and the face driving image sample by adopting a preset image generation model to obtain a predicted face image.
For example, a sample face image feature may be extracted from an original face image sample, a sample face posture feature may be extracted from a face driving image sample, the sample face image feature and the sample face posture feature may be fused to obtain a three-dimensional face sample feature, a three-dimensional face image sample may be constructed based on the three-dimensional face sample feature, a spatial feature at each preset resolution may be extracted from the original image sample and the three-dimensional face image sample to obtain an original spatial sample feature of the original face image sample and a three-dimensional spatial sample feature of the three-dimensional face image sample, and the original spatial sample feature, the three-dimensional spatial sample feature, and a preset basic style feature may be fused to obtain a predicted face image at a target resolution.
(3) And converging the preset image generation model based on the predicted face image and the face driving image sample to obtain the trained image generation model.
For example, the predicted face image and the face driving image sample are detected respectively, the countermeasure loss information of the preset image generation model is determined based on the detection result, the reconstruction loss information of the preset image generation model is determined based on the predicted face image and the face driving image sample, the identity loss information of the preset image generation model is determined according to the predicted face image and the face driving image, the countermeasure loss information, the reconstruction loss information and the identity loss information are fused, the preset image generation model is converged based on the fused loss information, and the trained image generation model is obtained.
The manner of determining the countermeasure loss information and the reconstruction loss information based on the predicted face image and the face driving image sample may refer to the manner of determining the countermeasure loss information and the reconstruction loss information based on the predicted target image and the driving image sample, which is detailed above and is not repeated herein.
Wherein the identity loss information may be understood as difference information between the identity information of the reconstructed predicted face image and the identity information of the face-driven image sample. For example, facial identity features may be extracted from the predicted facial image samples, facial driving identity features may be extracted from the facial driving image samples, feature similarity between the facial identity features and the facial driving identity features may be calculated, and identity loss information of the preset image generation model may be determined based on the feature similarity, which may be specifically represented by formula (8):
L_ID = 1 - sim(E_id(G(input)), E_id(GT))  (8)

wherein L_ID is the identity loss information, sim(·, ·) is the feature similarity measure, E_id is the identity (ID) feature extraction network, input is the input original face image sample and face driving image sample, GT is the face driving image sample, and G is the driving network.
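A minimal sketch of this identity loss is given below, using cosine similarity as the feature similarity measure; the identity feature extraction network is a placeholder (any pretrained face-recognition embedding network could play this role), and the exact similarity choice is an assumption.

```python
import torch
import torch.nn.functional as F

def identity_loss(id_network, predicted_face, driving_face_gt):
    """Identity loss based on the feature similarity between the identity features
    of the predicted face image and of the face driving image sample."""
    pred_id = id_network(predicted_face)    # facial identity features
    gt_id = id_network(driving_face_gt)     # face driving identity features
    similarity = F.cosine_similarity(pred_id, gt_id, dim=-1)
    return (1.0 - similarity).mean()        # smaller when the identities match
```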
After the reconstruction loss information, the countermeasure loss information, and the identity loss information are determined, the reconstruction loss information, the countermeasure loss information, and the identity loss information may be fused, and there are various fusion manners, for example, the reconstruction loss information, the countermeasure loss information, and the identity loss information may be directly added to obtain the fused loss information, which may be specifically shown as formula (9):
L = L_GAN + L_rec + L_ID  (9)
wherein L is the fused loss information, L_GAN is the countermeasure loss information, L_rec is the reconstruction loss information, and L_ID is the identity loss information.
Or, weighting parameters of the countermeasure loss information, the reconstruction loss information and the identity loss information may be obtained, the countermeasure loss information, the reconstruction loss information and the identity loss information are weighted based on the weighting parameters, and the weighted countermeasure loss information, the weighted reconstruction loss information and the weighted identity loss information are fused to obtain fused loss information.
After the fused loss information is obtained, the preset image generation model can be converged based on the fused loss information, and the convergence process can be referred to as the above, which is not described any more herein.
Taking the object as a face as an example, when the preset image generation model is trained, as shown in fig. 12, two frames containing a face are arbitrarily selected from the same video during training, one frame being used as the original face image sample (I_s) and the other frame as the face driving image sample (I_d); the face driving image frame is also used as the target generation result (Ground Truth), thereby obtaining the training data. After the training data is obtained, the process of training the preset image generation model is divided into two parts as a whole: one part is 3D reconstruction of the human face, and the other part is generation of the predicted face image based on the generation network (driving network). In the 3D face reconstruction part, ResNet50 is used as the 3D coefficient prediction network; the ResNet50 network extracts identity, shadow and texture information from the original face image sample (I_s) and extracts expression, pose and the like from the face driving image sample (I_d), and a 3D face image is reconstructed from the combination of the 3D coefficients by the SMPL method, wherein the identity, texture and light shadow of the 3D face image come from the original face image sample, and the pose, expression and the like come from the face driving image sample. The generation network part is mainly built based on Stylegan v2 and includes a coding network (Enc Block), a mapping network and a decoding network. The coding network extracts multiple layers of spatial features from the 3D face image and the face driving image samples respectively; the lowest output resolution is 4 x 4 and the highest resolution is 512 x 512, so that the features can be matched one to one with the outputs of the decoding module. Each coding module is composed of a simple one-layer convolutional network. The mapping network maps the combination of 3D coefficients into an implicit feature w in the implicit space. The decoding network decodes the spatial features corresponding to each preset resolution obtained through spatial coding; in the decoding process, a basic style feature, a basic face image and a basic optical flow field are generated from a constant tensor (const), and the basic face image and the current face image are superposed resolution by resolution from low to high, so as to obtain the predicted face image at the target resolution. The preset image generation model is converged based on the predicted face image and the face driving image sample, thereby obtaining the trained image generation model. When the preset image generation model is trained, the generation network (mapping network and decoding network) and the discriminator in the preset image generation model can be trained in advance, while the coding network needs to be trained from scratch, so that the learning rates of the three networks are different during training; the ratio of the learning rates can be set according to practical applications, for example, the learning rate ratio of the coding network, the generation network and the discriminator may be 100: 10: 1, or another ratio.
The trained image generation model can output the target face image and the target optical flow field simultaneously, or output the target face image or the target optical flow field separately. Taking the output of the target face image as an example, the process of the trained image generation model outputting the target face image at the target resolution may be as shown in fig. 13, where the facial pose, expression and gaze of the target face image output by the trained image generation model are consistent with the face driving image, and the identity, shadow and texture are consistent with the original face image. Compared with fig. 12, the biggest difference between the use process and the training process of the trained image generation model is that, in the use process, the input original face image and face driving image can be face images of any objects, and the objects in the original face image and the face driving image may be the same object or different objects, while in the training process, in order to improve the training precision of the model, different video frames in the same video are generally used as the original face image sample and the face driving image sample, and the objects in the original face image sample and the face driving image sample are often the same object.
Optionally, in some embodiments, a current original image and a current driving image may also be obtained, and area detection may be performed on the current original image and the current driving image; when it is detected that the current original image and the current driving image include a face area and an object area, an image corresponding to the face area may be segmented from the current original image to obtain an original face image, an image corresponding to the object area is segmented from the current original image to obtain an original image corresponding to the original object, an image corresponding to the face area is segmented from the current driving image to obtain a face driving image, and an image corresponding to the object area is segmented from the current driving image to obtain a driving image corresponding to the driving object. Then, an object image at a target resolution is generated from the original image and the driving image, and a face image at the target resolution is generated from the original face image and the face driving image. The object image and the face image are fused to obtain a target image under the target resolution, wherein the target image is obtained by replacing the body posture and the face posture in the current original image with the body posture and the face posture in the current driving image.
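A high-level sketch of this combined driving is given below; detect_regions, drive_body, drive_face and paste_region are placeholder helpers assumed for illustration, not part of the scheme's interface.

```python
def drive_full_image(current_original, current_driving,
                     detect_regions, drive_body, drive_face, paste_region):
    """Hypothetical combination of body driving and face driving: segment the
    face and body areas, drive each separately at the target resolution, then
    fuse the results back into one target image."""
    orig_face, orig_body = detect_regions(current_original)  # segmented face / body images
    drv_face, drv_body = detect_regions(current_driving)

    body_image = drive_body(orig_body, drv_body)  # object image at the target resolution
    face_image = drive_face(orig_face, drv_face)  # face image at the target resolution

    # Fuse: paste the driven face back onto the driven body to obtain the target image.
    return paste_region(body_image, face_image)
```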
Taking the face region as a face region and the body region as a human body region as an example, the method of performing image driving according to the current original image and the current driving image can be regarded as combining human body driving and human face driving, so that a high-definition target image can be generated, and the accuracy of image processing is greatly increased.
The process of generating the face image at the target resolution according to the original face image and the face driving image may refer to the above, and is not described in detail herein.
As can be seen from the above, in the embodiment of the present application, after an original face image and a face driving image are obtained, face image features are extracted from the original face image, face posture features are extracted from the face driving image, then, the face image features and the face posture features are fused to obtain three-dimensional face features, a three-dimensional face image is constructed based on the three-dimensional face features, then, spatial features under each preset resolution are extracted from the original face image and the three-dimensional face image to obtain original spatial features of the original face image and three-dimensional spatial features of the three-dimensional face image, and the original spatial features, the three-dimensional spatial features and preset basic style features are fused to obtain a target face image under a target resolution; according to the scheme, the three-dimensional face image can be constructed based on the three-dimensional face features extracted from the original face image and the face driving image, so that details such as the face posture, expression and gaze of the driving face can be accurately captured, the target face image under the target resolution is generated through the spatial features of the face driving image and the three-dimensional reconstruction image under different resolutions, the resolution of the generated target face image is ensured, and therefore the accuracy of image processing can be improved.
In order to better implement the above method, the embodiment of the present invention further provides an image processing apparatus, which may be integrated in an electronic device, such as a server or a terminal, and the terminal may include a tablet computer, a notebook computer, and/or a personal computer.
For example, as shown in fig. 14, the image processing apparatus may include an acquisition unit 401, a first extraction unit 402, a construction unit 403, a second extraction unit 404, and a fusion unit 405 as follows:
(1) an acquisition unit 401;
an acquiring unit 401 configured to acquire an original image of an original object and a driving image of a driving object, where the driving image is a template image for performing body posture adjustment on the original object;
for example, the obtaining unit 401 may be specifically configured to obtain an original image of an original object and a driving image of a driving object uploaded by a terminal, or may extract any two frames of video frames including the object from a video, take one frame as the original image, take the object in the original image as the original object, take the remaining one frame as the driving image, and take the object in the driving image as the driving object, or may further screen out an object image including the object from image data, and screen out any two pieces of the object image as the original image and the driving image, respectively.
(2) A first extraction unit 402;
a first extraction unit 402, configured to extract an object feature of an original object in an original image, and extract an attitude feature of a driving object in a driving image.
For example, the first extracting unit 402 may be specifically configured to perform image feature extraction on the original image and the driving image, respectively, to obtain an original image feature of the original image and a driving image feature of the driving image, screen an object feature of the original object from the original image feature, and screen an attitude feature of the driving object from the driving image feature.
(3) A construction unit 403;
a constructing unit 403, configured to fuse the object feature and the pose feature to obtain a three-dimensional image feature, and construct a three-dimensional object image based on the three-dimensional image feature.
For example, the constructing unit 403 may be specifically configured to fuse the object feature and the pose feature to obtain a three-dimensional image feature, convert the three-dimensional image feature into a geometric feature and a texture feature of the three-dimensional object, construct a three-dimensional object model of the three-dimensional object according to the geometric feature and the texture feature, and project the three-dimensional object model onto a two-dimensional plane to obtain a three-dimensional object image of the three-dimensional object.
(4) A second extraction unit 404;
the second extracting unit 404 is configured to extract a spatial feature at each preset resolution from the original image and the three-dimensional object image, so as to obtain an original spatial feature of the original image and a three-dimensional spatial feature of the three-dimensional object image.
For example, the second extracting unit 404 may be specifically configured to perform spatial encoding on the original image and the three-dimensional object image at each preset resolution by using an encoding network of the trained image generation model, so as to obtain an original spatial feature of the original image and a three-dimensional spatial feature of the three-dimensional object image at each resolution.
(5) A fusion unit 405;
the fusion unit 405 is configured to fuse the original spatial feature, the three-dimensional spatial feature, and the preset basic style feature to obtain a target object image at a target resolution, where the target object image is an object image obtained by replacing the body posture of the original object with the body posture of the driving object.
For example, the fusion unit 405 may be specifically configured to generate a basic object image at an initial resolution based on a preset basic style feature, map a three-dimensional image feature to a hidden feature by using a trained image generation model, adjust the preset basic style feature based on the hidden feature, screen a target original spatial feature from an original spatial feature based on the preset resolution, screen a target three-dimensional spatial feature from the three-dimensional spatial feature, fuse the adjusted style feature, the target original spatial feature and the target three-dimensional spatial feature to obtain a fused style feature at a current resolution, and generate a target object image at the target resolution based on the fused style feature and the basic object image.
Optionally, in some embodiments, the image processing apparatus may further include a training unit 406, as shown in fig. 15, which may specifically be as follows:
the training unit 406 is configured to train a preset image generation model to obtain a trained image generation model.
For example, the training unit 406 may be specifically configured to obtain a video sample, screen out any two frames of video frames from the video sample to serve as an original image sample and a driving image sample, perform object driving on the original image sample and the driving image sample by using a preset image generation model to obtain a prediction object image, and converge the preset image generation model based on the prediction object image and the driving image sample to obtain a trained image generation model.
In specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily, and implemented as the same or several entities, and specific implementations of the above units may refer to the foregoing method embodiment, which is not described herein again.
As can be seen from the above, after the acquiring unit 401 acquires the original image of the original object and the driving image of the driving object, the first extraction unit 402 extracts an object feature of an original object in an original image, and extracts an attitude feature of a driving object in a driving image, then, the construction unit 403 fuses the object feature and the pose feature to obtain a three-dimensional image feature, constructs a three-dimensional object image based on the three-dimensional image feature, then, the second extraction unit 404 extracts spatial features at each preset resolution from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image, then, the fusion unit 405 fuses the original spatial features, the three-dimensional spatial features and the preset basic style features to obtain a target object image at a target resolution; according to the scheme, the three-dimensional object image can be constructed on the basis of the three-dimensional object features extracted from the original image and the driving image, so that the layout posture details of the driving object can be accurately captured, the target object image under the target resolution is generated through the spatial features of the driving image and the three-dimensional reconstruction image under different resolutions, the resolution of the generated target object image is ensured, and the accuracy of image processing can be improved.
An embodiment of the present invention further provides an electronic device, as shown in fig. 16, which shows a schematic structural diagram of the electronic device according to the embodiment of the present invention, specifically:
the electronic device may include components such as a processor 501 with one or more processing cores, a memory 502 with one or more computer-readable storage media, a power supply 503, and an input unit 504. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 16 does not constitute a limitation of the electronic device, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components. Wherein:
the processor 501 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by operating or executing software programs and/or modules stored in the memory 502 and calling data stored in the memory 502. Optionally, processor 501 may include one or more processing cores; preferably, the processor 501 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 501.
The memory 502 may be used to store software programs and modules, and the processor 501 executes various functional applications and data processing by operating the software programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the electronic device, and the like. Further, the memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 501 with access to the memory 502.
The electronic device further comprises a power supply 503 for supplying power to each component. Preferably, the power supply 503 may be logically connected to the processor 501 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system. The power supply 503 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
The electronic device may also include an input unit 504, where the input unit 504 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 501 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application programs stored in the memory 502, thereby implementing various functions as follows:
acquiring an original image of an original object and a driving image of a driving object, the driving image being a template image for performing body posture adjustment on the original object; extracting object features of the original object from the original image and posture features of the driving object from the driving image; fusing the object features and the posture features to obtain three-dimensional image features, and constructing a three-dimensional object image based on the three-dimensional image features; extracting spatial features under each preset resolution from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image; and fusing the original spatial features, the three-dimensional spatial features and the preset basic style features to obtain a target object image under a target resolution, the target object image being an object image in which the body posture of the original object is replaced with the body posture of the driving object.
For example, the electronic device obtains an original image of an original object and a driving image of a driving object uploaded by a terminal; alternatively, any two frames of video frames containing the object may be extracted from a video, one frame being taken as the original image with the object therein as the original object, and the remaining frame being taken as the driving image with the object therein as the driving object; or object images containing the object may be screened out from image data, and any two images may be screened out from the object images as the original image and the driving image, respectively. Image features of the original image and the driving image are respectively extracted to obtain the original image features of the original image and the driving image features of the driving image; the object features of the original object are screened from the original image features, and the posture features of the driving object are screened from the driving image features. The object features and the posture features are fused to obtain three-dimensional image features, the three-dimensional image features are converted into geometric features and texture features of the three-dimensional object, a three-dimensional object model of the three-dimensional object is constructed according to the geometric features and the texture features, and the three-dimensional object model is projected onto a two-dimensional plane to obtain a three-dimensional object image of the three-dimensional object. The coding network of the trained image generation model is adopted to respectively perform spatial coding on the original image and the three-dimensional object image under each preset resolution, so that the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image under each resolution can be obtained. A basic object image under an initial resolution is generated based on the preset basic style features; the trained image generation model is adopted to map the three-dimensional image features into hidden features, and the preset basic style features are adjusted based on the hidden features; target original spatial features are screened from the original spatial features based on the preset resolution, and target three-dimensional spatial features are screened from the three-dimensional spatial features; the adjusted style features, the target original spatial features and the target three-dimensional spatial features are fused to obtain fused style features under the current resolution, and the target object image under the target resolution is generated based on the fused style features and the basic object image.
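A minimal end-to-end sketch of this flow is given below. The component modules assumed here (identity/pose encoders, a 3D reconstructor, a differentiable renderer, a multi-resolution spatial encoder and a style-based generator) are placeholders; their names and call signatures are illustrative only and are not taken from the embodiment.

    # High-level pipeline sketch under assumed component interfaces.
    import torch
    import torch.nn as nn

    class DrivingPipeline(nn.Module):
        def __init__(self, encoder, reconstructor, renderer, spatial_encoder, generator):
            super().__init__()
            self.encoder = encoder                  # identity / pose feature extraction
            self.reconstructor = reconstructor      # 3D features -> geometry and texture
            self.renderer = renderer                # projects the 3D model onto a 2D plane
            self.spatial_encoder = spatial_encoder  # per-resolution spatial features
            self.generator = generator              # style-based image generator

        def forward(self, original, driving):
            obj_feat = self.encoder.identity(original)  # object features of the original object
            pose_feat = self.encoder.pose(driving)      # posture features of the driving object
            three_d_feat = torch.cat([obj_feat, pose_feat], dim=-1)

            geometry, texture = self.reconstructor(three_d_feat)
            rendered = self.renderer(geometry, texture)  # three-dimensional object image

            orig_feats = self.spatial_encoder(original)  # dict: resolution -> features
            rendered_feats = self.spatial_encoder(rendered)
            return self.generator(three_d_feat, orig_feats, rendered_feats)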
For specific implementation of the above operations, reference may be made to the foregoing embodiments, and details are not described herein again.
As can be seen from the above, in the embodiment of the present application, after an original image of an original object and a driving image of a driving object are obtained, the object features of the original object are extracted from the original image and the posture features of the driving object are extracted from the driving image; then, the object features and the posture features are fused to obtain three-dimensional image features, and a three-dimensional object image is constructed based on the three-dimensional image features; next, spatial features at each preset resolution are extracted from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image; and then the original spatial features, the three-dimensional spatial features and the preset basic style features are fused to obtain a target object image at a target resolution. According to this scheme, the three-dimensional object image can be constructed based on the three-dimensional object features extracted from the original image and the driving image, so that the body posture details of the driving object can be accurately captured; the target object image at the target resolution is then generated from the spatial features of the original image and the three-dimensional reconstructed image at different resolutions, which ensures the resolution of the generated target object image and thereby improves the accuracy of image processing.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the embodiment of the present invention provides a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any image processing method provided by the embodiment of the present invention. For example, the instructions may perform the steps of:
acquiring an original image of an original object and a driving image of a driving object, the driving image being a template image for performing body posture adjustment on the original object; extracting object features of the original object from the original image and posture features of the driving object from the driving image; fusing the object features and the posture features to obtain three-dimensional image features, and constructing a three-dimensional object image based on the three-dimensional image features; extracting spatial features under each preset resolution from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image; and fusing the original spatial features, the three-dimensional spatial features and the preset basic style features to obtain a target object image under a target resolution, the target object image being an object image in which the body posture of the original object is replaced with the body posture of the driving object.
For example, an original image of an original object and a driving image of a driving object uploaded by a terminal are acquired; alternatively, any two frames of video frames containing the object may be extracted from a video, one frame being taken as the original image with the object therein as the original object, and the remaining frame being taken as the driving image with the object therein as the driving object; or object images containing the object may be screened out from image data, and any two images may be screened out from the object images as the original image and the driving image, respectively. Image features of the original image and the driving image are respectively extracted to obtain the original image features of the original image and the driving image features of the driving image; the object features of the original object are screened from the original image features, and the posture features of the driving object are screened from the driving image features. The object features and the posture features are fused to obtain three-dimensional image features, the three-dimensional image features are converted into geometric features and texture features of the three-dimensional object, a three-dimensional object model of the three-dimensional object is constructed according to the geometric features and the texture features, and the three-dimensional object model is projected onto a two-dimensional plane to obtain a three-dimensional object image of the three-dimensional object. The coding network of the trained image generation model is adopted to respectively perform spatial coding on the original image and the three-dimensional object image under each preset resolution, so that the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image under each resolution can be obtained. A basic object image under an initial resolution is generated based on the preset basic style features; the trained image generation model is adopted to map the three-dimensional image features into hidden features, and the preset basic style features are adjusted based on the hidden features; target original spatial features are screened from the original spatial features based on the preset resolution, and target three-dimensional spatial features are screened from the three-dimensional spatial features; the adjusted style features, the target original spatial features and the target three-dimensional spatial features are fused to obtain fused style features under the current resolution, and the target object image under the target resolution is generated based on the fused style features and the basic object image.
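For the spatial-coding step in particular, a possible multi-resolution encoder could look like the sketch below; the channel sizes, the number of levels and the dictionary keyed by resolution are assumptions made for illustration, not the coding network of the embodiment.

    # A possible multi-resolution spatial encoder (channel sizes and levels are assumptions).
    import torch
    import torch.nn as nn

    class SpatialEncoder(nn.Module):
        """Downsamples the input repeatedly and keeps one feature map per
        preset resolution, e.g. 256 -> 128 -> 64 -> 32 -> 16."""
        def __init__(self, in_ch=3, base_ch=64, levels=4):
            super().__init__()
            blocks, ch = [], in_ch
            for i in range(levels):
                out_ch = base_ch * (2 ** i)
                blocks.append(nn.Sequential(
                    nn.Conv2d(ch, out_ch, 3, stride=2, padding=1),
                    nn.LeakyReLU(0.2),
                ))
                ch = out_ch
            self.blocks = nn.ModuleList(blocks)

        def forward(self, image):
            feats, x = {}, image
            for block in self.blocks:
                x = block(x)
                feats[x.shape[-1]] = x  # key each feature map by its spatial resolution
            return feats

    encoder = SpatialEncoder()
    features = encoder(torch.randn(1, 3, 256, 256))
    print(sorted(features.keys()))  # [16, 32, 64, 128]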
For specific implementation of the above operations, reference may be made to the foregoing embodiments, and details are not described herein again.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps in any image processing method provided in the embodiment of the present invention, the beneficial effects that can be achieved by any image processing method provided in the embodiment of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described again here.
According to an aspect of the application, a computer program product or a computer program is provided, the computer program product or the computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the methods provided in the various alternative implementations of the image processing aspect or the object-driven or face-driven aspect described above.
The image processing method, the image processing apparatus, the electronic device, and the computer-readable storage medium provided by the embodiments of the present invention are described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the description of the embodiments is only intended to help understand the method and core idea of the present invention. Meanwhile, for those skilled in the art, there may be variations in the specific implementations and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (17)

1. An image processing method, comprising:
acquiring an original image of an original object and a driving image of a driving object, wherein the driving image is a template image used for performing body posture adjustment on the original object;
extracting object features of the original object from the original image, and extracting posture features of the driving object from the driving image;
fusing the object features and the posture features to obtain three-dimensional image features, and constructing a three-dimensional object image based on the three-dimensional image features;
extracting spatial features under each preset resolution from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image;
and fusing the original spatial features, the three-dimensional spatial features and the preset basic style features to obtain a target object image under a target resolution, wherein the target object image is an object image obtained by replacing the body posture of the original object with the body posture of the driving object.
2. The image processing method according to claim 1, wherein the fusing the original spatial features, the three-dimensional spatial features and the preset basic style features to obtain the target object image under the target resolution comprises:
generating a basic object image under an initial resolution based on the preset basic style features;
mapping the three-dimensional image features into hidden features by adopting a trained image generation model, and adjusting the preset basic style features based on the hidden features;
and according to the basic object image, fusing the adjusted style features, the original spatial features and the three-dimensional spatial features to obtain the target object image under the target resolution.
3. The image processing method according to claim 2, wherein the fusing the adjusted style features, the original spatial features and the three-dimensional spatial features according to the basic object image to obtain the target object image under the target resolution comprises:
screening target original spatial features from the original spatial features based on the preset resolution, and screening target three-dimensional spatial features from the three-dimensional spatial features;
fusing the adjusted style features, the target original spatial features and the target three-dimensional spatial features to obtain fused style features under the current resolution;
and generating the target object image under the target resolution based on the fused style features and the basic object image.
4. The image processing method according to claim 3, wherein the generating the target object image under the target resolution based on the fused style features and the basic object image comprises:
generating a current object image based on the fused style features, and fusing the current object image and the basic object image to obtain a fused object image under the current resolution;
taking the fused style features as the preset basic style features, and taking the fused object image as the basic object image;
and returning to the step of adjusting the preset basic style features based on the hidden features until the current resolution is the target resolution, and obtaining the target object image.
5. The image processing method according to claim 2, wherein the adjusting the preset basic style features based on the hidden features comprises:
adjusting the size of the preset basic style features to obtain initial style features;
modulating the hidden features to obtain convolution weights corresponding to the initial style features;
and adjusting the initial style features based on the convolution weights to obtain adjusted style features.
6. The image processing method according to claim 5, wherein the adjusting the initial style features based on the convolution weights to obtain adjusted style features comprises:
screening out a target style convolution network corresponding to the resolution of the basic object image from the style convolution networks of the trained image generation model;
and performing convolution processing on the initial style features by adopting the target style convolution network based on the convolution weights to obtain the adjusted style features.
7. The image processing method according to claim 2, further comprising:
generating a basic optical flow field under the initial resolution based on the preset basic style features;
and according to the basic optical flow field, fusing the adjusted style features, the original spatial features and the three-dimensional spatial features to obtain a target optical flow field under the target resolution.
8. The image processing method according to claim 2, wherein before the mapping the three-dimensional image features into hidden features by adopting the trained image generation model, the method further comprises:
acquiring a video sample, and screening any two frames of video frames from the video sample to be used as an original image sample and a driving image sample respectively;
adopting a preset image generation model to carry out object driving on the original image sample and the driving image sample to obtain a prediction object image;
and converging the preset image generation model based on the prediction object image and the driving image sample to obtain a trained image generation model.
9. The image processing method according to claim 8, wherein the converging the preset image generation model based on the prediction object image and the driving image sample to obtain a trained image generation model comprises:
respectively detecting the prediction object image and the driving image sample, and determining adversarial loss information of the preset image generation model based on the detection results;
determining reconstruction loss information of the preset image generation model based on the prediction object image and the driving image sample;
and fusing the adversarial loss information and the reconstruction loss information, and converging the preset image generation model based on the fused loss information to obtain a trained image generation model.
10. The image processing method according to claim 9, wherein the determining reconstruction loss information of the preset image generation model based on the prediction object image and the driving image sample comprises:
calculating the image similarity of the prediction object image and the driving image sample to obtain the perception loss information of the preset image generation model;
comparing the prediction object image with the driving image sample to obtain image loss information of the preset image generation model;
and fusing the perception loss information and the image loss information to obtain the reconstruction loss information of the preset image generation model.
11. The image processing method according to any one of claims 1 to 10, wherein the constructing a three-dimensional object image based on the three-dimensional image feature comprises:
converting the three-dimensional image features into geometric features and texture features of the three-dimensional object;
constructing a three-dimensional object model of the three-dimensional object according to the geometric features and the texture features;
and projecting the three-dimensional object model to a two-dimensional plane to obtain a three-dimensional object image of the three-dimensional object.
12. The image processing method according to any one of claims 1 to 10, wherein the extracting object features of the original object from the original image and extracting posture features of the driving object from the driving image comprises:
respectively extracting image features of the original image and the driving image to obtain the original image features of the original image and the driving image features of the driving image;
screening the object features of the original object from the original image features;
and screening the posture features of the driving object from the driving image features.
13. An image processing method, comprising:
acquiring an original face image and a face driving image, wherein the face driving image is a template image used for adjusting the face posture in the original face image;
extracting facial image features from the original face image, and extracting facial posture features from the face driving image;
fusing the facial image features and the facial posture features to obtain three-dimensional facial features, and constructing a three-dimensional face image based on the three-dimensional facial features;
extracting spatial features under each preset resolution from the original face image and the three-dimensional face image to obtain the original spatial features of the original face image and the three-dimensional spatial features of the three-dimensional face image;
and fusing the original spatial features, the three-dimensional spatial features and the preset basic style features to obtain a target face image under a target resolution, wherein the target face image is an image obtained by replacing the face posture in the original face image with the face posture in the face driving image.
14. An image processing apparatus characterized by comprising:
an acquisition unit, used for acquiring an original image of an original object and a driving image of a driving object, wherein the driving image is a template image used for performing body posture adjustment on the original object;
a first extraction unit, configured to extract object features of the original object from the original image and posture features of the driving object from the driving image;
the construction unit is used for fusing the object features and the posture features to obtain three-dimensional image features, and constructing a three-dimensional object image based on the three-dimensional image features;
the second extraction unit is used for extracting the spatial features under each preset resolution from the original image and the three-dimensional object image to obtain the original spatial features of the original image and the three-dimensional spatial features of the three-dimensional object image;
and the fusion unit is used for fusing the original spatial features, the three-dimensional spatial features and the preset basic style features to obtain a target object image under a target resolution, wherein the target object image is an object image obtained by replacing the body posture of the original object with the body posture of the driving object.
15. An electronic device comprising a processor and a memory, the memory storing an application program, the processor being configured to run the application program in the memory to perform the steps of the image processing method according to any one of claims 1 to 13.
16. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the steps in the image processing method according to any of claims 1 to 13.
17. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the image processing method according to any one of claims 1 to 13.
CN202210476124.1A 2022-04-29 2022-04-29 Image processing method, image processing device, electronic equipment and computer readable storage medium Pending CN115131636A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210476124.1A CN115131636A (en) 2022-04-29 2022-04-29 Image processing method, image processing device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210476124.1A CN115131636A (en) 2022-04-29 2022-04-29 Image processing method, image processing device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115131636A true CN115131636A (en) 2022-09-30

Family

ID=83376842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210476124.1A Pending CN115131636A (en) 2022-04-29 2022-04-29 Image processing method, image processing device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115131636A (en)

Similar Documents

Publication Publication Date Title
CN115205949B (en) Image generation method and related device
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN111754396B (en) Face image processing method, device, computer equipment and storage medium
CN111369681A (en) Three-dimensional model reconstruction method, device, equipment and storage medium
CN111241989A (en) Image recognition method and device and electronic equipment
CN114339409B (en) Video processing method, device, computer equipment and storage medium
CN115131849A (en) Image generation method and related device
CN112767554A (en) Point cloud completion method, device, equipment and storage medium
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN108846343B (en) Multi-task collaborative analysis method based on three-dimensional video
CN112990154B (en) Data processing method, computer equipment and readable storage medium
CN113706577A (en) Image processing method and device and computer readable storage medium
CN114972016A (en) Image processing method, image processing apparatus, computer device, storage medium, and program product
CN113705295A (en) Object posture migration method, device, equipment and storage medium
CN112052759A (en) Living body detection method and device
CN114973349A (en) Face image processing method and training method of face image processing model
Wang et al. Face aging on realistic photos by generative adversarial networks
CN112819689A (en) Training method of face attribute editing model, face attribute editing method and equipment
Gadasin et al. Application of Convolutional Neural Networks for Three-Dimensional Reconstruction of the Geometry of Objects in the Image
CN115601710A (en) Examination room abnormal behavior monitoring method and system based on self-attention network architecture
Wu et al. [Retracted] 3D Film Animation Image Acquisition and Feature Processing Based on the Latest Virtual Reconstruction Technology
CN117094895B (en) Image panorama stitching method and system
CN116958027A (en) Three-dimensional industrial abnormality detection method and device, storage medium and electronic equipment
CN115131636A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination