CN117252791A - Image processing method, device, electronic equipment and storage medium

Info

Publication number: CN117252791A
Application number: CN202311163136.XA
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: image, portrait, face, target, portrait image
Inventors: 王凡祎, 苏婧文
Applicant/Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Legal status: Pending

Classifications

    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • G06T7/11 - Region-based segmentation
    • G06V10/774 - Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06T2207/20081 - Training; learning
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G06T2207/20221 - Image fusion; image merging
    • G06T2207/30201 - Face


Abstract

The application discloses an image processing method, an image processing apparatus, an electronic device and a storage medium. The image processing method comprises: acquiring a first portrait image, a second portrait image and a target keyword corresponding to a target portrait style, wherein the first portrait image includes a first face and the second portrait image includes a human body in a target pose; generating, by a pre-trained large language model, a third portrait image in the target portrait style based on the target keyword and the second portrait image, wherein the pose of the human body in the third portrait image matches the target pose; and replacing the face in the third portrait image with the first face, based on the first portrait image, to obtain a fourth portrait image. The method improves the efficiency of portrait image generation.

Description

Image processing method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular to an image processing method, an image processing apparatus, an electronic device, and a storage medium.
Background
With the continuous development of image processing technology, methods have emerged that use artificial intelligence (AI) to edit portrait images in images or videos, for example for face swapping, pose adjustment and style modification. In the related art, a portrait image is generally edited based on AI-generated content (Artificial Intelligence Generated Content, AIGC) techniques to produce the portrait image a user requires, but the efficiency of editing portrait images based on AIGC is currently low.
Disclosure of Invention
The application provides an image processing method, an image processing apparatus, an electronic device and a storage medium, which can improve the efficiency of generating portrait images.
In a first aspect, an embodiment of the present application provides an image processing method, the method including: acquiring a first portrait image, a second portrait image and a target keyword corresponding to a target portrait style, wherein the first portrait image includes a first face and the second portrait image includes a human body in a target pose; generating, by a pre-trained large language model, a third portrait image in the target portrait style based on the target keyword and the second portrait image, wherein the pose of the human body in the third portrait image matches the target pose; and replacing the face in the third portrait image with the first face, based on the first portrait image, to obtain a fourth portrait image.
In a second aspect, an embodiment of the present application provides an image processing apparatus including an image acquisition module, an image generation module and a face replacement module. The image acquisition module is configured to acquire a first portrait image, a second portrait image and a target keyword corresponding to a target portrait style, wherein the first portrait image includes a first face and the second portrait image includes a human body in a target pose; the image generation module is configured to generate, by a pre-trained large language model, a third portrait image in the target portrait style based on the target keyword and the second portrait image, wherein the pose of the human body in the third portrait image matches the target pose; and the face replacement module is configured to replace the face in the third portrait image with the first face, based on the first portrait image, to obtain a fourth portrait image.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the image processing method provided in the first aspect above.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored therein program code that is callable by a processor to perform the image processing method provided in the first aspect described above.
According to the scheme provided by the present application, a first portrait image, a second portrait image and a target keyword corresponding to a target portrait style are obtained, where the first portrait image includes a first face and the second portrait image includes a human body in a target pose; a third portrait image in the target portrait style is generated through a pre-trained large language model based on the target keyword and the second portrait image, the pose of the human body in the third portrait image matching the target pose; and the face in the third portrait image is replaced with the first face based on the first portrait image to obtain a fourth portrait image. In this way, after a portrait image is generated from the portrait image corresponding to the required human pose and the keyword corresponding to the target portrait style, face replacement is performed on the generated image according to the required face, and a portrait image matching the required target portrait style, face and human pose is obtained.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for describing the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 shows a flow diagram of an image processing method according to an embodiment of the present application.
Fig. 2 shows a flow diagram of an image processing method according to another embodiment of the present application.
Fig. 3 shows a flow diagram of an image processing method according to a further embodiment of the present application.
Fig. 4 shows a flow diagram of an image processing method according to a further embodiment of the present application.
Fig. 5 shows a flow diagram of an image processing method according to yet another embodiment of the present application.
Fig. 6 shows a flow diagram of an image processing method according to yet another embodiment of the present application.
Fig. 7 shows an interface schematic provided in an embodiment of the present application.
Fig. 8 is a flowchart of an image processing method according to still another embodiment of the present application.
Fig. 9 shows a block diagram of an image processing apparatus according to an embodiment of the present application.
Fig. 10 is a block diagram of an electronic device for performing an image processing method according to an embodiment of the present application.
Fig. 11 shows a storage unit, provided in an embodiment of the present application, for storing or carrying program code that implements the image processing method.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings.
In recent years, with the development of image processing technology, applications such as film special effects and social networking have created a need to generate portrait images with a specified portrait style, a specified human pose and a specified face. This need can be understood as editing a user's portrait image so that the portrait style of the resulting image is the required style, the pose of the human body in the image is the required pose, and the face in the image is the face the user requires.
In the related art, a portrait image required by a user is generally obtained through controllable generation based on AIGC. In such controllable generation, a Low-Rank Adaptation (LoRA) model is added on top of an open-source Stable Diffusion (SD) model: according to the style selected by the user and a large number of face images of the required face provided by the user, a LoRA model corresponding to that face is trained, so that the trained LoRA model performs customized fine-tuning of the SD model; the combined LoRA and SD models then output a portrait image in the style selected by the user, the face in the portrait image being the face provided by the user. However, in this manner, generating a portrait image matching the required style and face requires the user to provide a large number of face images of the required face; if there are not enough face images, the user must shoot new ones, and whenever the required face changes, a new LoRA model must be trained on newly provided face images. Each generation is therefore slow and consumes considerable time. In addition, this manner only satisfies requirements on the style and face of the generated portrait image; it cannot satisfy requirements on different human poses in the portrait image.
To solve these problems, the inventors propose the image processing method, apparatus, electronic device and storage medium provided by the embodiments of the present application: after a portrait image is generated from a portrait image corresponding to the required human pose and a keyword corresponding to the target portrait style, face replacement is performed on the generated portrait image according to the required face, so that a portrait image matching the required target portrait style, face and human pose is obtained. The specific image processing method is described in detail in the following embodiments.
Referring to fig. 1, fig. 1 is a flowchart of an image processing method provided by an embodiment of the present application. In a specific embodiment, the image processing method is applied to an image processing apparatus 700 as shown in fig. 9 and to an electronic device 100 (fig. 10) provided with the image processing apparatus 700. The following describes the specific flow of this embodiment taking an electronic device as an example; it can be understood that the electronic device may be a smartphone, a tablet computer, a notebook computer, an e-book reader or the like, which is not limited herein. The flowchart shown in fig. 1 is described in detail below; the image processing method may specifically include the following steps:
Step S110: acquire a first portrait image, a second portrait image and a target keyword corresponding to a target portrait style, wherein the first portrait image includes a first face and the second portrait image includes a human body in a target pose.
In the embodiment of the present application, the electronic device may acquire a first portrait image including a first face, a second portrait image including a human body in a target pose, and a keyword corresponding to a target portrait style, where the first face is the face required in the portrait image to be generated, the target pose is the human pose required in the portrait image to be generated, and the target portrait style is the portrait style corresponding to the portrait image to be generated. That is, the electronic device may acquire the first portrait image to provide a reference for the required face, acquire the second portrait image to provide a reference for the required human pose, and acquire the target keyword corresponding to the target portrait style to prompt the large language model with the portrait style of the portrait image that needs to be generated.
In some embodiments, the electronic device may obtain the first portrait image and the second portrait image by photographing; it may also obtain them from locally stored images, download the required images from a corresponding server through a wireless network, a data network or the like, or receive the first portrait image and the second portrait image transmitted from other electronic devices.
In one possible implementation, when the electronic device is a mobile terminal provided with a camera, such as a smartphone, a tablet computer or a smartwatch, it can capture images through a front camera or a rear camera to obtain the first portrait image and the second portrait image. For example, if the user needs to generate a portrait image corresponding to the user's own face, a portrait image including the user's face can be obtained in self-portrait mode and used as the first portrait image; if the user needs to generate a portrait image of a human body in the target pose, the second portrait image can be obtained by photographing a human body in that pose.
In one possible implementation, the album of the electronic device may store a plurality of portrait images, and the electronic device may determine the first portrait image and the second portrait image from the portrait images stored in the album according to operations input by the user.
In one possible implementation, the electronic device may also access an image repository in the cloud through a wireless network, a data network or the like, and download the first portrait image and the second portrait image from the image repository according to operations input by the user.
Of course, the specific manner in which the electronic device obtains the first portrait image and the second portrait image is not limited herein.
In some cases, if the face in a single portrait image is the first face and the pose of the human body in it is the target pose, that portrait image may serve as both the first portrait image and the second portrait image; that is, the two may be the same image. For example, when a user wishes to generate a portrait image with the user's own face and pose, the user striking the target pose may be photographed to obtain a portrait image that includes both the first face and a human body in the target pose, and the resulting portrait image may be used as both the first portrait image and the second portrait image.
In some embodiments, when the electronic device obtains the target keyword corresponding to the target portrait style, it may determine, according to a detected user operation, a keyword input by the user as the target keyword; or it may determine the target portrait style selected by the user according to a detected user operation and then acquire the target keyword corresponding to that style. The keyword corresponding to a portrait style may be the name of the style; for example, if portrait style 1 is named professional photo, the keyword corresponding to portrait style 1 may be "professional photo".
In addition, it should be noted that the order in which the electronic device acquires the first portrait image, the second portrait image and the target keyword is not limited. For example, the first portrait image may be acquired first, then the second portrait image, then the target keyword; or the target keyword first, then the first portrait image, then the second portrait image; or the target keyword first, then the second portrait image, then the first portrait image. These are merely examples and do not limit the actual execution order.
Step S120: generate, by a pre-trained large language model, a third portrait image in the target portrait style based on the target keyword and the second portrait image, wherein the pose of the human body in the third portrait image matches the target pose.
In the embodiment of the present application, after the second portrait image and the target keyword are obtained, the human pose and the portrait style of the portrait image to be generated are known, so a third portrait image may be generated through a pre-trained large language model based on the target keyword and the second portrait image. The generated third portrait image not only has the target portrait style but also includes a human body in the target pose. The pre-trained large language model may be an AIGC-based model that can generate an image from input text: the target keyword can serve as the prompt to be input into the model; after the prompt is provided, the model understands the user intent behind it and generates a portrait image matching that intent. Since the second portrait image includes a human body in the target pose, the pose information of the human body in the second portrait image can be input into the model to constrain the human pose in the generated portrait image.
In some embodiments, the large language model may be a diffusion model, i.e. a progressively denoising network that restores an image step by step from a noisy image, thereby generating an image progressively from random noise. Optionally, the model may be an SD model. The SD model is a diffusion model operating in a latent space: in training, a picture is noised by forward diffusion into a normally distributed vector z, text features are fused in, and reverse diffusion then recovers a picture; in the inference stage, the forward diffusion and noising process is discarded, and reverse diffusion is performed directly on the text features and a random normally distributed vector to obtain a picture corresponding to the text features. When the SD model is used to generate the third portrait image based on the target keyword and the second portrait image, a LoRA model and a ControlNet model may be combined to further control the SD model, so that it accepts both prompt and image inputs and is steered to generate a portrait image of the target portrait style in which the human pose matches the target pose.
Step S130: replace the face in the third portrait image with the first face, based on the first portrait image, to obtain a fourth portrait image.
In the embodiment of the present application, when the third portrait image of the target portrait style whose human pose matches the target pose is generated by the large language model, the face of the generated image is not constrained by any image including the required first face; the model generates a face at random according to the faces in the portrait images used during its training. In other words, the face in the current third portrait image is not the required face. Therefore, the face in the third portrait image may be replaced with the first face based on the first portrait image to obtain a fourth portrait image, yielding a portrait image whose portrait style is the target portrait style, whose human pose matches the target pose and whose face matches the first face. As can be appreciated, to generate a portrait image of a given style with a target face in the related art, a LoRA model corresponding to the target face is usually trained on a large number of portrait images including that face, so that the trained LoRA model performs customized fine-tuning of the SD model; the combined LoRA and SD models then output a portrait image of the corresponding style whose face matches the target face. In the present scheme, by contrast, the required face is obtained through face replacement, so no such per-face training is needed.
In some embodiments, the face in the third portrait image may be replaced with the first face in the first portrait image through a pre-trained face replacement model to obtain the fourth portrait image. The face replacement model may be a self-encoding deep neural network based on encoding and decoding. Of course, the specific way of replacing the face in the third portrait image is not limited: for example, the faces in the first portrait image and the third portrait image may be identified and feature extraction performed on the identified faces; after key points such as the eyes, nose, mouth and chin are located, a transformation matrix is calculated from the key-point information to convert the face in the first portrait image into the same pose and expression as the face in the third portrait image, and the face region of the first portrait image is then placed into the face region of the third portrait image, completing the face replacement.
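For illustration only, the key-point-based replacement just described can be sketched with OpenCV; the landmark arrays and face mask below are hypothetical inputs produced by any face detector, not components named by the patent:

```python
import cv2
import numpy as np

def swap_face(src_img, dst_img, src_landmarks, dst_landmarks, dst_face_mask):
    """Warp the source face onto the destination face using key points.

    src_landmarks / dst_landmarks: hypothetical (N, 2) float32 arrays of
    corresponding key points (eyes, nose, mouth, chin) in the two images.
    dst_face_mask: binary (H, W) mask of the face region in dst_img.
    """
    # Estimate the transformation matrix mentioned in the text: a similarity
    # transform mapping the source key points onto the destination key points.
    M, _ = cv2.estimateAffinePartial2D(src_landmarks, dst_landmarks)
    h, w = dst_img.shape[:2]
    warped = cv2.warpAffine(src_img, M, (w, h))
    # Place the warped source face region into the destination face region.
    out = dst_img.copy()
    mask = dst_face_mask.astype(bool)
    out[mask] = warped[mask]
    return out
```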
According to the image processing method provided by this embodiment, after a portrait image is generated from the portrait image corresponding to the required human pose and the keyword corresponding to the target portrait style, face replacement is performed on the generated portrait image according to the required face, so that a portrait image matching the required target portrait style, face and human pose can be obtained; and since the user does not need to provide a large number of images including the face, the convenience of generating the portrait image the user requires is improved.
Referring to fig. 2, fig. 2 is a flowchart of an image processing method provided by another embodiment of the present application. The image processing method is applied to the electronic device. The flowchart shown in fig. 2 is described in detail below; the image processing method may specifically include the following steps:
Step S210: acquire a first portrait image, a second portrait image and a target keyword corresponding to a target portrait style, wherein the first portrait image includes a first face and the second portrait image includes a human body in a target pose.
In the embodiment of the present application, for step S210, reference may be made to the content of the other embodiments, which is not repeated herein.
Step S220: acquire a skeleton key-point image of the human body in the second portrait image.
In the embodiment of the present application, when the third portrait image is generated based on the target keyword and the second portrait image through the large language model, a skeleton key-point image of the human body in the second portrait image may be acquired, so that the skeleton key-point image and the target keyword together serve as inputs of the large language model, and the model generates the portrait image according to these input conditions.
In some implementations, the skeleton key-point image of the human body in the second portrait image may be acquired through human pose estimation, which accurately links the detected key points of the human body in a picture to estimate the human pose. Alternatively, the skeleton key-point image may be extracted by a skeleton-point extraction algorithm such as OpenPose. Of course, the specific manner of acquiring the skeleton key-point image from the second portrait image is not limited herein.
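As an illustrative sketch (the patent does not prescribe a specific toolkit), the controlnet_aux package wraps OpenPose and turns a portrait photo into exactly this kind of skeleton key-point image; the checkpoint name and file paths are assumptions:

```python
from PIL import Image
from controlnet_aux import OpenposeDetector

# Public annotator checkpoint from the lllyasviel ControlNet release (assumed).
detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

second_portrait = Image.open("second_portrait.jpg")  # hypothetical input path
skeleton_image = detector(second_portrait)           # stick-figure key-point map
skeleton_image.save("skeleton.png")
```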
Step S230: generate a third portrait image in the target portrait style based on the target keyword and the skeleton key-point image through a pre-trained Stable Diffusion (SD) model, a low-rank adaptation (LoRA) model and a ControlNet model, wherein the pose of the human body in the third portrait image matches the target pose.
In the embodiment of the present application, the third portrait image may be generated based on the target keyword and the skeleton key-point image through the pre-trained SD, LoRA and ControlNet models. After the target keyword is input into the SD model, the LoRA model matching the target keyword (i.e. the LoRA model corresponding to the target portrait style) may be invoked; in addition, the skeleton key-point image serves as the input of the ControlNet model, so that the SD, LoRA and ControlNet models combine to generate the third portrait image. The LoRA model is obtained by fine-tuning the cross-attention layers in the UNet module of the SD model; the ControlNet model is obtained by training all layers of the UNet module under constraint conditions. The LoRA and ControlNet models can be regarded as plug-ins of the SD model that give it additional input conditions and constrain it to generate a portrait image whose style is the target portrait style and whose human pose matches the target pose. The SD model in the embodiment of the present application may be, for example, Stable Diffusion v1.5 or Stable Diffusion v2.1.
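A minimal sketch of this SD + LoRA + ControlNet combination, assuming the open-source diffusers library; the checkpoint names and the LoRA weight file are illustrative, not artifacts of the patent:

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# OpenPose-conditioned ControlNet for SD v1.5 (public checkpoint, assumed).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Inject the style LoRA matched to the target keyword (hypothetical file).
pipe.load_lora_weights("professional_photo_lora.safetensors")

# The target keyword acts as the prompt; the skeleton image constrains the pose.
skeleton_image = Image.open("skeleton.png")
third_portrait = pipe("professional photo", image=skeleton_image,
                      num_inference_steps=30).images[0]
third_portrait.save("third_portrait.png")
```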
The ControlNet model further controls the overall behavior of the whole neural network by manipulating the input conditions of neural network blocks, where a neural network block is a set of neural layers frequently grouped together as a unit when building a network, such as a "resnet" block, a "conv-bn-relu" block, a multi-head attention block or a Transformer block. The network structure of the ControlNet model can be divided into two parts, a locked copy and a trainable copy. The locked copy fixes the original weights of Stable Diffusion, retaining the image generation capability Stable Diffusion has already learned; the parameters of the trainable copy are initialized to the corresponding parameters of the locked copy, and these parameters are updated when the ControlNet model is trained.
In some embodiments, the LoRA model is trained as follows: acquire a first portrait data set including a plurality of portrait images for each of a plurality of portrait styles, the target portrait style being among them; then, with the model parameters of the SD model fixed, train the initial LoRA model on the first portrait data set to obtain the trained LoRA model.
In the above embodiment, the LoRA model acts on the cross-attention layers of the SD model, which relate the prompt describing the image to the image representation. With the parameters of the SD model frozen, training on portrait images of one portrait style yields a LoRA model corresponding to that style; when the LoRA model is used, its parameters are injected into the SD model, changing the style of the portrait images the SD model generates so that they have that portrait style. After a LoRA model is trained in this manner for each portrait style, a LoRA model corresponding to each portrait style is obtained.
In one possible implementation, when the first portrait data set is acquired, the plurality of portrait images for each portrait style may be generated by the pre-trained SD model, for example for a cartoon style, a professional photo style, an ancient-Chinese style and so on. Because the sample portrait images used to train the LoRA model are generated by the SD model itself, the trained LoRA model can more accurately steer the SD model toward the corresponding portrait style; in addition, the cost and time of collecting sample portrait images of different styles are saved.
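A minimal training sketch for a style LoRA under the freeze-the-SD-weights recipe described above, assuming the diffusers and peft libraries; style_dataloader is a placeholder for a loader yielding pre-computed VAE latents and text embeddings of one style's portrait images, not part of the patent:

```python
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline
from peft import LoraConfig

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet = pipe.unet
unet.requires_grad_(False)  # fix the SD model parameters, as described above

# Inject trainable low-rank adapters into the cross-attention projections.
unet.add_adapter(LoraConfig(r=8, lora_alpha=8,
                            target_modules=["to_q", "to_k", "to_v", "to_out.0"]))
opt = torch.optim.AdamW([p for p in unet.parameters() if p.requires_grad],
                        lr=1e-4)

style_dataloader = []  # placeholder: yields (latents, text_emb) pairs

for latents, text_emb in style_dataloader:
    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (latents.shape[0],), device=latents.device)
    noisy = pipe.scheduler.add_noise(latents, noise, t)
    # Standard denoising objective; only the LoRA parameters receive gradients.
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)
    loss.backward(); opt.step(); opt.zero_grad()
```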
In some embodiments, the SD model is trained as follows: acquire a second portrait data set including a large number of portrait images of different styles, and train the initial SD model on this data set to obtain the trained SD model. The initial SD model may be an open-source SD model. It can be appreciated that an open-source SD model generally has the capability of generating different types of images, for example plant images, building images, scenery images and portrait images; therefore, to better strengthen its portrait generation capability, the initial SD model can be trained on a large number of portrait images of different styles, so that it generates portrait images with higher accuracy.
It should be noted that the training of the SD, LoRA and ControlNet models in the embodiments of the present application may be performed in advance; the trained models can then be used whenever a portrait image needs to be generated, so the models do not need to be retrained for every generation.
In addition, the process of generating the third portrait image may be performed by a server. For example, after acquiring the second portrait image and the target keyword, the electronic device may transmit them to a server on which the pre-trained SD, LoRA and ControlNet models are deployed; the server performs the generation process, and the electronic device then receives the third portrait image returned by the server.
Step S240: replace the face in the third portrait image with the first face, based on the first portrait image, to obtain a fourth portrait image.
In the embodiment of the present application, for step S240, reference may be made to the content of the other embodiments, which is not repeated herein.
According to this image processing method, the portrait image corresponding to the required human pose is obtained and its skeleton key-point image extracted; a portrait image is then generated from the skeleton key-point image and the keyword corresponding to the target portrait style through the pre-trained SD, LoRA and ControlNet models, and face replacement is performed on the generated image according to the required face, so that a portrait image matching the required target portrait style, face and human pose can be obtained. During this process, no model needs to be trained on a large number of portrait images including the required face, so the requirements of users on portrait style, face and human pose are met while the generation speed of portrait images is improved; and since the user does not need to provide a large number of images including the face, the convenience of generating the portrait image the user requires is also improved.
Referring to fig. 3, fig. 3 is a flowchart of an image processing method provided by another embodiment of the present application. The image processing method is applied to the electronic device. The flowchart shown in fig. 3 is described in detail below; the image processing method may specifically include the following steps:
Step S310: acquire a first portrait image, a second portrait image and a target keyword corresponding to a target portrait style, wherein the first portrait image includes a first face and the second portrait image includes a human body in a target pose.
Step S320: generate, by a pre-trained large language model, a third portrait image in the target portrait style based on the target keyword and the second portrait image, wherein the pose of the human body in the third portrait image matches the target pose.
In the embodiment of the present application, for steps S310 and S320, reference may be made to the content of the other embodiments, which is not repeated herein.
Step S330: acquire the face feature of the first portrait image as a first face feature and the face feature of the third portrait image as a second face feature.
In the embodiment of the present application, when face replacement is performed on the third portrait image generated by the large language model, the replacement may be performed through the ROOP face-swapping algorithm. First, face features may be extracted from the first portrait image and the third portrait image respectively, so as to obtain a first face feature corresponding to the first face in the first portrait image and a second face feature corresponding to the face in the third portrait image. A face feature may be feature information of the face region in a portrait image, for example a multi-dimensional feature vector.
In some embodiments, a face feature extraction method may be used to extract the face features of the first portrait image and the third portrait image. For example, the feature extraction network of a preset image processing model may perform convolutional feature extraction on the face region to obtain an initial face feature, and a fully connected layer may then process the initial face feature into a multi-dimensional feature vector, which is used as the face feature.
In some embodiments, before the face features of the first and third portrait images are extracted, the first face in the first portrait image may be aligned with the face in the third portrait image, so that the positions and angles of the two faces match, facilitating subsequent feature fusion and generation. Face detection may use a deep learning algorithm: by training a face detector, the face position in an image can be identified automatically. Face alignment may be achieved by calibrating key points on the detected faces and then aligning the two faces through an affine transformation, so that the faces in the first and third portrait images keep consistent positions and angles.
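For illustration (ROOP itself builds on the insightface toolkit, though the patent names no library), detection, key-point alignment and embedding extraction can be performed in one call; the model pack and file paths below are assumptions:

```python
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")  # public insightface model pack (assumed)
app.prepare(ctx_id=0, det_size=(640, 640))

first_img = cv2.imread("first_portrait.jpg")   # hypothetical paths
third_img = cv2.imread("third_portrait.png")

# get() runs face detection, landmark alignment and embedding extraction.
first_face = app.get(first_img)[0]
third_face = app.get(third_img)[0]

first_feature = first_face.normed_embedding    # 512-dimensional identity vector
second_feature = third_face.normed_embedding
```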
Step S340: perform feature fusion on the first face feature and the second face feature to obtain a fused face feature as a third face feature.
In the embodiment of the present application, after the first face feature corresponding to the first face in the first portrait image and the second face feature corresponding to the face in the third portrait image are obtained, feature fusion may be performed on the first face feature and the second face feature to obtain a fused face feature, which is used as the third face feature.
In some embodiments, the first face feature may be converted into a preset number of first basic face features, and the convolution layer corresponding to each first basic face feature determined; the second face feature may likewise be converted into a preset number of second basic face features, and the convolution layer corresponding to each second basic face feature determined; the first basic face features and the second basic face features are then fused according to the convolution layers to obtain the fused face feature.
In one possible implementation, a mapping network may be used to map the multi-dimensional first face feature to a plurality of first basic face features, each first basic face feature serving as a noise input for one of a plurality of convolution layers, so that the convolution layer corresponding to each first basic face feature can be determined. For example, a 512-dimensional face vector may be mapped to a 14×512-dimensional face vector (at 256×256 resolution), thereby obtaining 14 basic face features; these are input as the noise of 14 convolution layers respectively, so that the convolution layer corresponding to each basic face feature is determined. Similarly, the second face feature can be converted into the preset number of second basic face features in this manner, and the convolution layer corresponding to each second basic face feature determined.
In one possible implementation, when the first and second basic face features are fused according to the convolution layers, a first basic face feature corresponding to the first convolution layer may be selected from the first basic face features as a first target basic face feature, a second basic face feature corresponding to the first convolution layer selected from the second basic face features as a second target basic face feature, and that convolution layer used to fuse the two into a fused basic face feature. Repeating these steps for the different convolution layers fuses the first and second face features layer by layer, yielding the fused face feature.
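An illustrative PyTorch sketch of this mapping-and-layer-wise-fusion idea, following the 512 to 14×512 example above; the mapping network here is untrained and the equal blend weight is an assumption, not the patent's specification:

```python
import torch
import torch.nn as nn

NUM_LAYERS = 14  # one basic face feature per convolution layer, as above

# Maps a 512-d face feature to NUM_LAYERS per-layer "basic face features".
mapping = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                        nn.Linear(512, NUM_LAYERS * 512))

def fuse(first_feature: torch.Tensor, second_feature: torch.Tensor,
         blend: float = 0.5) -> torch.Tensor:
    f1 = mapping(first_feature).view(NUM_LAYERS, 512)
    f2 = mapping(second_feature).view(NUM_LAYERS, 512)
    # Fuse layer by layer; each fused row would be fed to its corresponding
    # convolution layer of the generator as a noise/style input.
    return blend * f1 + (1.0 - blend) * f2

third_feature = fuse(torch.randn(512), torch.randn(512))
print(third_feature.shape)  # torch.Size([14, 512])
```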
Step S350: generate a first face image based on the third face feature.
In the embodiment of the present application, after the third face feature is obtained, a new face image may be generated as the first face image based on the fused third face feature. The third face feature may be input into the generative model of a generative adversarial network (GAN) to obtain the new face image. A GAN is a deep learning model applicable in unsupervised image translation (UIT) scenarios and is built from at least two models, a generative model and a discriminative model, whose mutual game learning produces a high-quality output. During GAN training, the objective of the generative model G is to generate pictures real enough to deceive the discriminative model D, while the objective of the discriminative model D is to distinguish the pictures generated by G from real pictures as far as possible. The generative model and the discriminative model thus form a dynamic game process that trains the generative model, so that the generated first face image is more lifelike.
Step S360: perform face replacement on the third portrait image based on the first face image to obtain the fourth portrait image.
In the embodiment of the present application, after the first face image is obtained, face replacement may be performed on the third portrait image based on the first face image to obtain the fourth portrait image; that is, the face in the third portrait image is replaced with the face in the first face image, and the resulting new portrait image is used as the fourth portrait image.
In some embodiments, face region segmentation may be performed on the first face image and the third portrait image to obtain a first face region corresponding to the first face image and a second face region corresponding to the third portrait image, and the second face region in the third portrait image is then replaced with the first face region. It can be understood that, after the new face image (i.e. the first face image) is obtained, face replacement can be completed by identifying the face regions in the first face image and the third portrait image and replacing the face region in the third portrait image with that of the first face image.
It should be noted that the above process of performing face replacement on the third portrait image may also be performed by a server; in that case, the electronic device may receive the fourth portrait image obtained by the server in the above manner.
According to this image processing method, after a portrait image is generated from the portrait image corresponding to the required human pose and the keyword corresponding to the target portrait style, face replacement is performed on the generated portrait image through the ROOP face-swapping algorithm according to the required face, so that a portrait image matching the required target portrait style, face and human pose can be obtained; and since the user does not need to provide a large number of images including the face, the convenience of generating the portrait image the user requires is also improved.
Referring to fig. 4, fig. 4 is a flowchart of an image processing method provided by still another embodiment of the present application. The image processing method is applied to the electronic device. The flowchart shown in fig. 4 is described in detail below; the image processing method may specifically include the following steps:
Step S410: acquire a first portrait image, a second portrait image and a target keyword corresponding to a target portrait style, wherein the first portrait image includes a first face and the second portrait image includes a human body in a target pose.
Step S420: generate, by a pre-trained large language model, a third portrait image in the target portrait style based on the target keyword and the second portrait image, wherein the pose of the human body in the third portrait image matches the target pose.
Step S430: replace the face in the third portrait image with the first face, based on the first portrait image, to obtain a fourth portrait image.
In the embodiment of the present application, for steps S410 to S430, reference may be made to the content of the other embodiments, which is not repeated herein.
Step S440: perform face region segmentation on the first portrait image and the fourth portrait image to obtain a third face region of the first portrait image and a fourth face region of the fourth portrait image.
In the embodiment of the present application, in order to make the face in the finally obtained portrait image closer to the first face in the required first portrait image, after face replacement is performed on the generated third portrait image according to the required face to obtain the fourth portrait image matching the required target portrait style, face and human pose, the first portrait image may be used again to perform face fusion on the fourth portrait image through a face fusion algorithm, that is, to fuse the face in the fourth portrait image with the face in the first portrait image. First, face region segmentation may be performed on the first portrait image and the fourth portrait image to obtain a third face region of the first portrait image and a fourth face region of the fourth portrait image.
Step S450: perform face fusion on the first portrait image and the fourth portrait image based on the third face region and the fourth face region to obtain a fifth portrait image.
In the embodiment of the present application, after the third face region and the fourth face region are obtained, face fusion may be performed on the first portrait image and the fourth portrait image based on the two regions to obtain a fifth portrait image, so that the face in the fourth portrait image retains more features of the first face and the face in the resulting new portrait image is closer to the first face.
In some embodiments, after color correction and face alignment are performed on the third face region and the fourth face region, the two regions are fused, and the fused face image then replaces the fourth face region in the fourth portrait image to obtain the fifth portrait image. Of course, the specific manner of performing face fusion on the first portrait image and the fourth portrait image based on the third and fourth face regions is not limited herein.
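One common realization of the color-corrected fusion step is Poisson blending; a minimal OpenCV sketch, under the assumption that the first face region has already been aligned to the fourth portrait image (the patent does not prescribe this particular operator):

```python
import cv2
import numpy as np

fourth_img = cv2.imread("fourth_portrait.png")  # hypothetical paths
first_img = cv2.imread("first_aligned.png")     # first face, aligned to fourth_img

# Binary mask of the face region, e.g. from a face-parsing model (assumed).
face_mask = cv2.imread("face_mask.png", cv2.IMREAD_GRAYSCALE)

# Blend the first-face region into the fourth portrait around the mask
# centroid; seamlessClone performs the color correction implicitly.
ys, xs = np.nonzero(face_mask)
center = (int(xs.mean()), int(ys.mean()))
fifth_img = cv2.seamlessClone(first_img, fourth_img, face_mask,
                              center, cv2.NORMAL_CLONE)
cv2.imwrite("fifth_portrait.png", fifth_img)
```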
It should be noted that the above process of performing face fusion on the fourth portrait image may also be performed by a server; in that case, the electronic device may receive the fifth portrait image obtained by the server in the above manner.
According to this image processing method, after a portrait image is generated from the portrait image corresponding to the required human pose and the keyword corresponding to the target portrait style, face replacement is performed on the generated portrait image according to the required face, so that a portrait image matching the required target portrait style, face and human pose can be obtained; the user does not need to provide a large number of images including the face, which also improves the convenience of generating the required portrait image. In addition, after the portrait image is generated, the face in the fourth portrait image is fused with the face in the first portrait image, so that the face in the finally obtained portrait image is closer to the required face and the accuracy of the generated portrait image is improved.
Referring to fig. 5, fig. 5 is a flowchart of an image processing method provided by yet another embodiment of the present application. The image processing method is applied to the electronic device. The flowchart shown in fig. 5 is described in detail below; the image processing method may specifically include the following steps:
Step S510: acquire a first portrait image, a second portrait image and a target keyword corresponding to a target portrait style, wherein the first portrait image includes a first face and the second portrait image includes a human body in a target pose.
Step S520: generate, by a pre-trained large language model, a third portrait image in the target portrait style based on the target keyword and the second portrait image, wherein the pose of the human body in the third portrait image matches the target pose.
Step S530: replace the face in the third portrait image with the first face, based on the first portrait image, to obtain a fourth portrait image.
In the embodiment of the present application, for steps S510 to S530, reference may be made to the content of the other embodiments, which is not repeated herein.
Step S540: acquire a first prompt text for editing the fourth portrait image.
In the embodiment of the present application, after face replacement is performed on the generated third portrait image according to the required face to obtain the fourth portrait image matching the required target portrait style, face and human pose, the user may still need to edit the generated portrait image, for example to change the clothing, hairstyle, colors or accessories of the portrait. A first prompt text for editing the fourth portrait image may therefore be acquired, so that the portrait image can be edited according to the first prompt text to obtain the portrait image the user requires.
Step S550: perform image editing on the fourth portrait image through a pre-trained denoising diffusion implicit model according to the first prompt text to obtain at least one fifth portrait image.
In the embodiment of the present application, after the first prompt text is obtained, image editing may be performed on the fourth portrait image according to the first prompt text through a pre-trained denoising diffusion implicit model (Denoising Diffusion Implicit Model, DDIM) to obtain at least one fifth portrait image. The DDIM is mainly used for editing images generated by the SD model; its core is to edit the image based on the cross-attention maps of the UNet, so that a more realistic edited image can be obtained.
In some embodiments, the first prompt text may include a plurality of prompt words, each of which prompts the DDIM to perform a different edit; therefore, after the fourth portrait image is edited through the DDIM according to the first prompt text, a plurality of fifth portrait images with different edits applied can be obtained. For example, if the first prompt text includes prompt words corresponding to yellow, white, red and blue clothes, then after the fourth portrait image is edited through the DDIM according to the first prompt text, a portrait image with yellow clothes, a portrait image with white clothes, a portrait image with red clothes and a portrait image with blue clothes can be obtained.
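As a hedged approximation (true DDIM-inversion editing manipulates the UNet cross-attention maps and is more involved), a DDIM-scheduled image-to-image pass in diffusers shows the one-edit-per-prompt-word pattern; the checkpoint, strength value and file paths are assumptions:

```python
import torch
from PIL import Image
from diffusers import DDIMScheduler, StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

fourth_portrait = Image.open("fourth_portrait.png").convert("RGB")

# One edited fifth portrait image per prompt word in the first prompt text.
for colour in ["yellow", "white", "red", "blue"]:
    edited = pipe(prompt=f"portrait, wearing {colour} clothes",
                  image=fourth_portrait, strength=0.4).images[0]
    edited.save(f"fifth_portrait_{colour}.png")
```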
It should be noted that the above process of editing the fourth portrait image through the DDIM may also be performed by a server; in that case, the electronic device may receive the fifth portrait images obtained by the server in the above manner.
According to this image processing method, after a portrait image is generated from the portrait image corresponding to the required human pose and the keyword corresponding to the target portrait style, face replacement is performed on the generated portrait image according to the required face, so that a portrait image matching the required target portrait style, face and human pose can be obtained; the user does not need to provide a large number of images including the face, which also improves the convenience of generating the required portrait image. In addition, after the portrait image is generated, it is further edited through the DDIM according to the acquired prompt text, which satisfies the user's need to edit the generated portrait image.
Referring to fig. 6, fig. 6 is a flowchart illustrating an image processing method according to still another embodiment of the present application. The image processing method is applied to the above electronic device and will be described in detail with reference to the flowchart shown in fig. 6. The image processing method may specifically include the following steps.
Step S610: displaying a first interface, wherein the first interface includes a first upload control, a second upload control, and a style selection control.
In this embodiment of the present application, referring to fig. 7, when the electronic device obtains the first portrait image, the second portrait image, and the target keyword corresponding to the target portrait style, a first interface A1 may be displayed, where the first interface includes a first upload control A2, a second upload control A3, and a style selection control A4. The user may thus input the first portrait image through the first upload control A2, input the second portrait image through the second upload control A3, and input the target keyword corresponding to the target portrait style through the style selection control A4.
Step S620: responding to an operation on the first upload control, and acquiring the selected portrait image from an image library as the first portrait image.
In this embodiment of the present application, after the electronic device displays the first interface, the operation input in the first interface may be detected, and in the case that the operation for the first upload control is detected, for example, a click operation, a long press operation, or the like, the selected portrait image may be obtained from the image library as the first portrait image in response to the operation.
In some embodiments, the electronic device may display a presentation interface of the image library in response to an operation for the first upload control, and then obtain the selected portrait image as the first portrait image according to a selection operation for the portrait image in the presentation interface.
In a possible implementation manner, the presentation interface may further include a shooting control, and the electronic device may display a shooting interface in response to an operation on the shooting control and perform image shooting according to a shooting operation, so as to obtain the shot image as the first portrait image.
Step S630: responding to an operation on the second upload control, and acquiring the selected portrait image from the image library as the second portrait image.
In this embodiment of the present application, after the electronic device displays the first interface, the operation input in the first interface may be detected, and in a case where an operation for the second upload control is detected, for example, a click operation, a long press operation, or the like, the selected portrait image may be obtained from the image library as the second portrait image in response to the operation.
In some embodiments, the electronic device may display a presentation interface of the image library in response to an operation for the second upload control, and then obtain the selected portrait image as the second portrait image according to a selection operation for the portrait image in the presentation interface.
In a possible implementation manner, the presentation interface may further include a shooting control, and the electronic device may display a shooting interface in response to an operation on the shooting control and perform image shooting according to a shooting operation, so as to obtain the shot image as the second portrait image.
Step S640: responding to an operation on the style selection control, and acquiring a target keyword corresponding to the selected target portrait style.
In this embodiment of the present application, after the electronic device displays the first interface, the electronic device may detect an operation input in the first interface, and in a case that an operation for a style selection control is detected, for example, a click operation, a long press operation, or the like, may respond to the operation to obtain a target keyword corresponding to the selected target portrait style.
In some embodiments, the electronic device may display a style selection list in response to an operation for the style selection control, determine a target portrait style selected by the user according to a selection operation for the style selection list, and then obtain a keyword corresponding to the target portrait style as a target keyword.
In some embodiments, the electronic device may display a keyword input interface in response to an operation for the style selection control, and according to an input operation in the keyword input interface, may obtain a keyword input by a user, and use the keyword as a target keyword corresponding to a target portrait style.
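As a minimal sketch of the selection logic described above, the following shows how a selected style or a typed keyword might resolve to the target keyword; the style names and keyword strings are illustrative assumptions, not styles defined by this application.

```python
# Minimal sketch: resolving the target keyword from a style selection
# list or from a keyword typed in the keyword input interface.
# Style names and keyword strings are illustrative assumptions.
from typing import Optional

STYLE_KEYWORDS = {
    "ancient costume": "ancient costume portrait, hanfu, studio lighting",
    "cyberpunk": "cyberpunk portrait, neon lights, futuristic city",
    "oil painting": "oil painting portrait, classical, soft brushwork",
}

def get_target_keyword(selected_style: str,
                       typed_keyword: Optional[str] = None) -> str:
    # A keyword entered in the keyword input interface takes precedence
    # over the style selection list.
    if typed_keyword:
        return typed_keyword
    return STYLE_KEYWORDS[selected_style]
```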
It should be noted that the order of steps S620, S630, and S640 is not limited. That is, the electronic device may perform step S620 first, then step S630, and then step S640; or it may perform step S620 first, then step S640, and then step S630. This order is merely an example and does not limit the actual execution order; for example, the electronic device may also perform steps S620, S630, and S640 in parallel.
Step S650: generating a third portrait image with the target portrait style based on the target keyword and the second portrait image through a pre-trained large language model, wherein the pose of the human body in the third portrait image matches the target pose.
Step S660: replacing the face in the third portrait image with the first face based on the first portrait image to obtain a fourth portrait image.
In this embodiment of the present application, for steps S650 to S660, reference may be made to the content of other embodiments, which is not repeated herein.
According to the image processing method provided in this embodiment, the user can input, in the first interface, the portrait image corresponding to the required human body pose, the keyword of the required target portrait style, and the portrait image corresponding to the required face. A portrait image is then generated according to the portrait image corresponding to the required human body pose and the keyword of the required target portrait style, and face replacement is performed on the generated portrait image according to the required face, so that a portrait image matching the required target portrait style, face, and human body pose can be obtained. In this image processing process, there is no need to process a large number of portrait images including the required face, so the user's requirements on portrait style, face, and human body pose can be met while the generation speed of the portrait image is improved; moreover, the user does not need to provide a large number of images including the face, which improves the convenience of generating the required portrait image.
The image processing method provided in the embodiment of the present application is described below with reference to the accompanying drawings.
Referring to fig. 8, an initial SD model may first be trained based on a number of portrait images of different styles. Portrait images of different styles are then generated based on the trained SD model to obtain a portrait data set, and the LoRA model is trained based on this portrait data set. In the application process, skeleton key point extraction may be performed on a portrait image of the target pose to obtain a skeleton key point image; then, based on the skeleton key point image and the keyword corresponding to the target portrait style, a portrait image whose human body pose matches the target pose and whose style is the target portrait style is generated through the trained SD model, LoRA model, and ControlNet model. Next, based on the portrait image including the target face, the ROOP face-swapping algorithm is applied to the generated portrait image to obtain a portrait image whose human body pose matches the target pose and whose face matches the target face. The generated portrait image and the portrait image including the target face may further undergo face fusion, so that the face-swapped portrait image includes more characteristics of the target face. Finally, image editing may be performed on the face-swapped portrait image based on the DDIM to obtain an edited portrait image, meeting the user's image editing needs for the generated portrait image. A sketch of the generation stage of this pipeline is given below.
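For illustration only, the sketch uses the open-source diffusers and controlnet_aux libraries; the checkpoint names and the LoRA weight path are assumptions of the sketch, and the ROOP face swap is represented only by a comment rather than reproduced.

```python
# Sketch of the generation stage in fig. 8: skeleton keypoint extraction,
# then SD + LoRA + ControlNet guided generation. Checkpoint names and the
# LoRA path are illustrative assumptions.
import torch
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("loras/target-portrait-style")  # hypothetical LoRA weights

# Skeleton key point image extracted from the portrait of the target pose.
pose_image = openpose(Image.open("target_pose_portrait.png"))

# Portrait image whose pose matches the target pose, in the target style.
third_portrait = pipe(
    prompt="portrait, target style keyword",  # target keyword goes here
    image=pose_image,
    num_inference_steps=30,
).images[0]

# Face swap with ROOP (not reproduced here), then optional face fusion
# and DDIM-based editing would follow, as described above.
```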
Referring to fig. 9, a block diagram of an image processing apparatus 700 according to an embodiment of the present application is shown. The image processing apparatus 700 is applied to the above-described electronic device and includes: an image acquisition module 710, an image generation module 720, and a face replacement module 730. The image acquisition module 710 is configured to acquire a first portrait image, a second portrait image, and a target keyword corresponding to a target portrait style, where the first portrait image includes a first face and the second portrait image includes a human body with a target pose. The image generation module 720 is configured to generate, through a pre-trained large language model and based on the target keyword and the second portrait image, a third portrait image with the target portrait style, where the pose of the human body in the third portrait image matches the target pose. The face replacement module 730 is configured to replace the face in the third portrait image with the first face based on the first portrait image to obtain a fourth portrait image.
In some embodiments, the face replacement module 730 may be specifically configured to: acquire the face feature of the first portrait image as a first face feature and the face feature of the third portrait image as a second face feature; perform feature fusion on the first face feature and the second face feature to obtain a fused face feature as a third face feature; generate a first face image based on the third face feature; and perform face replacement on the third portrait image based on the first face image to obtain the fourth portrait image. A sketch of the fusion step follows.
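The plain weighted average below is one simple instance of such feature fusion; the weighting factor alpha and the embedding source are illustrative assumptions.

```python
# Sketch of the feature-fusion step: a weighted average of two face
# embeddings. The embeddings are assumed to come from any face
# recognition backbone; alpha is an illustrative weighting factor.
import numpy as np

def fuse_face_features(first_feature: np.ndarray,
                       second_feature: np.ndarray,
                       alpha: float = 0.7) -> np.ndarray:
    """Blend the required face feature (first) with the generated one (second)."""
    third_feature = alpha * first_feature + (1.0 - alpha) * second_feature
    # Re-normalize so the fused vector stays on the embedding hypersphere.
    return third_feature / np.linalg.norm(third_feature)
```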
In a possible implementation manner, the face replacement module 730 may be further configured to perform face region segmentation on the first face image and the third portrait image to obtain a first face region corresponding to the first face image and a second face region corresponding to the third portrait image, and to replace the second face region in the third portrait image with the first face region.
In one possible implementation, the face replacement module 730 may also be configured to align the first face in the first portrait image with a face in the third portrait image.
In some embodiments, the image processing apparatus 700 may further include a face segmentation module and a face fusion module. The face segmentation module may be configured to, after the face in the third portrait image is replaced with the first face based on the first portrait image to obtain the fourth portrait image, perform face region segmentation on the first portrait image and the fourth portrait image to obtain a third face region corresponding to the first portrait image and a fourth face region corresponding to the fourth portrait image. The face fusion module is configured to perform face fusion on the first portrait image and the fourth portrait image based on the third face region and the fourth face region to obtain a fifth portrait image. One conventional way to realize such mask-based fusion is sketched below.
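The sketch assumes Poisson blending as provided by OpenCV, equal-size 8-bit BGR images, and a binary face mask produced by a face parsing model; none of these choices are prescribed by this application.

```python
# Sketch of mask-based face fusion via OpenCV Poisson blending. Inputs
# are assumed to be equal-size 8-bit BGR arrays; the mask is a binary
# uint8 map of the fourth portrait's face region (e.g., from face parsing).
import cv2
import numpy as np

def fuse_faces(first_portrait: np.ndarray,
               fourth_portrait: np.ndarray,
               face_mask: np.ndarray) -> np.ndarray:
    ys, xs = np.nonzero(face_mask)
    center = (int(xs.mean()), int(ys.mean()))  # center of the face region
    return cv2.seamlessClone(first_portrait, fourth_portrait,
                             face_mask, center, cv2.NORMAL_CLONE)
```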
In some embodiments, the image processing apparatus 700 may further include a portrait editing module. The portrait editing module is configured to, after the face in the third portrait image is replaced with the first face based on the first portrait image to obtain the fourth portrait image, acquire a first prompt text for editing the fourth portrait image, and perform image editing on the fourth portrait image through the pre-trained denoising diffusion implicit model according to the first prompt text to obtain at least one fifth portrait image.
In some embodiments, the image generation module 720 may be specifically configured to acquire a skeleton key point image of the human body in the second portrait image, and to generate, through a pre-trained stable diffusion SD model, a low-rank adaptation LoRA model, and a control network ControlNet model, and based on the target keyword and the skeleton key point image, a third portrait image with the target portrait style, where the pose of the human body in the third portrait image matches the target pose.
In a possible implementation manner, the image processing apparatus 700 may further include a first training module. The first training module is configured to acquire a first portrait data set, the first portrait data set including a plurality of portrait images corresponding to each of a plurality of portrait styles, the plurality of portrait styles including the target portrait style; and, with the model parameters of the SD model fixed, train an initial LoRA model through the first portrait data set to obtain trained LoRA models corresponding to different portrait styles. A minimal sketch of this frozen-base training setup follows.
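In the sketch, a single linear layer is wrapped with a low-rank adapter; the rank and scale values are illustrative assumptions, and only the adapter parameters remain trainable.

```python
# Minimal sketch of the LoRA training setup: the base model's own weights
# are frozen and only the low-rank adapter matrices receive gradients.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # fix the SD model parameters
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# During training, only the down/up matrices appear in the optimizer, e.g.:
# optimizer = torch.optim.AdamW(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```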
Optionally, the first training module may be further configured to generate, for each portrait style, the plurality of portrait images corresponding to that portrait style through the SD model.
In a possible implementation manner, the image processing apparatus 700 may further include a second training module. The second training module may be configured to acquire a second portrait data set, the second portrait data set including a plurality of portrait images of different styles, and to train an initial SD model based on the second portrait data set to obtain the trained SD model.
In some implementations, the image acquisition module 710 may be specifically configured to display a first interface including a first upload control, a second upload control, and a style selection control; responding to the operation of the first uploading control, and acquiring a selected portrait image from an image library as the first portrait image; responding to the operation of the second uploading control, and acquiring the selected portrait image from an image library as the second portrait image; and responding to the operation of the style selection control, and acquiring a target keyword corresponding to the selected target portrait style.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In several embodiments provided herein, the coupling of the modules to each other may be electrical, mechanical, or other.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
In summary, according to the solution provided in this application, a first portrait image, a second portrait image, and a target keyword corresponding to a target portrait style are acquired, where the first portrait image includes a first face and the second portrait image includes a human body with a target pose; a third portrait image with the target portrait style is generated based on the target keyword and the second portrait image through a pre-trained large language model, the pose of the human body in the third portrait image matching the target pose; and the face in the third portrait image is replaced with the first face based on the first portrait image to obtain a fourth portrait image. In this way, after the portrait image is generated according to the portrait image corresponding to the required human body pose and the keyword corresponding to the target portrait style, face replacement is performed on the generated portrait image according to the required face, so that a portrait image matching the required target portrait style, face, and human body pose can be obtained.
Referring to fig. 10, a block diagram of an electronic device according to an embodiment of the present application is shown. The electronic device 100 may be a smart phone, a tablet computer, a notebook computer, an electronic book reader, or another electronic device capable of running application programs. The electronic device 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more applications, where the one or more applications may be stored in the memory 120 and configured to be executed by the one or more processors 110, and the one or more applications are configured to perform the methods described in the foregoing method embodiments.
The processor 110 may include one or more processing cores. The processor 110 connects various parts of the electronic device 100 using various interfaces and lines, and performs various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and invoking data stored in the memory 120. Optionally, the processor 110 may be implemented in at least one hardware form of digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 110 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the modem may alternatively not be integrated into the processor 110 and may be implemented by a separate communication chip.
The memory 120 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created by the electronic device 100 in use (such as a phonebook, audio and video data, and chat log data), and the like.
Referring to fig. 11, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable medium 800 has stored therein program code which can be invoked by a processor to perform the methods described in the method embodiments described above.
The computer readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium 800 includes a non-transitory computer-readable storage medium. The computer readable storage medium 800 has storage space for program code 810 that performs any of the method steps described above. The program code can be read from or written into one or more computer program products. The program code 810 may, for example, be compressed in a suitable form.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. An image processing method, the method comprising:
acquiring a first portrait image, a second portrait image, and a target keyword corresponding to a target portrait style, wherein the first portrait image comprises a first face, and the second portrait image comprises a human body with a target pose;
generating a third portrait image with the target portrait style based on the target keyword and the second portrait image through a pre-trained large language model, wherein the pose of a human body in the third portrait image matches the target pose;
and replacing the face in the third portrait image with the first face based on the first portrait image to obtain a fourth portrait image.
2. The method of claim 1, wherein the replacing the face in the third portrait image with the first face based on the first portrait image to obtain a fourth portrait image comprises:
acquiring the face feature of the first portrait image as a first face feature and the face feature of the third portrait image as a second face feature;
performing feature fusion on the first face feature and the second face feature to obtain a fused face feature serving as a third face feature;
generating a first face image based on the third face feature;
and carrying out face replacement on the third portrait image based on the first portrait image to obtain the fourth portrait image.
3. The method of claim 2, wherein the performing face replacement on the third portrait image based on the first face image to obtain the fourth portrait image comprises:
performing face region segmentation on the first face image and the third portrait image to obtain a first face region corresponding to the first face image and a second face region corresponding to the third portrait image;
and replacing the second face region in the third portrait image with the first face region.
4. The method of claim 2, wherein prior to the acquiring the face feature of the first portrait image as a first face feature and the face feature of the third portrait image as a second face feature, the method further comprises:
and aligning the first face in the first portrait image with the face in the third portrait image.
5. The method of claim 1, wherein after replacing the face in the third portrait image with the first face based on the first portrait image, the method further comprises:
performing face region segmentation on the first portrait image and the fourth portrait image to obtain a third face region corresponding to the first portrait image and a fourth face region corresponding to the fourth portrait image;
and performing face fusion on the first portrait image and the fourth portrait image based on the third face region and the fourth face region to obtain a fifth portrait image.
6. The method of claim 1, wherein after replacing the face in the third portrait image with the first face based on the first portrait image, the method further comprises:
acquiring a first prompt text for editing the fourth portrait image;
and according to the first prompt text, performing image editing on the fourth portrait image through a pre-trained denoising diffusion implicit model to obtain at least one fifth portrait image.
7. The method of any of claims 1-6, wherein the generating a third portrait image with the target portrait style based on the target keyword and the second portrait image through a pre-trained large language model comprises:
acquiring a skeleton key point image of a human body in the second portrait image;
and generating a third portrait image with the target portrait style based on the target keyword and the skeleton key point image through a pre-trained stable diffusion SD model, a low-rank adaptation LoRA model, and a control network ControlNet model, wherein the pose of a human body in the third portrait image matches the target pose.
8. The method of claim 7, wherein the LoRA model is trained by:
acquiring a first portrait data set, wherein the first portrait data set comprises a plurality of portrait images corresponding to each portrait style in a plurality of portrait styles, and the target portrait style is included in the plurality of portrait styles;
and training, with the model parameters of the SD model fixed, an initial LoRA model through the first portrait data set to obtain trained LoRA models corresponding to different portrait styles.
9. The method of claim 8, wherein the acquiring a first portrait data set comprises:
generating, for each of the portrait styles, a plurality of portrait images corresponding to the portrait style through the SD model.
10. The method according to claim 7, wherein the SD model is trained by:
acquiring a second portrait data set, wherein the second portrait data set comprises a plurality of portrait images, and the plurality of portrait images comprise portrait images of different styles;
and training the initial SD model based on the second portrait data set to obtain the trained SD model.
11. The method according to any one of claims 1-6, wherein the acquiring a first portrait image, a second portrait image, and a target keyword corresponding to a target portrait style comprises:
displaying a first interface, wherein the first interface comprises a first uploading control, a second uploading control and a style selection control;
responding to the operation of the first uploading control, and acquiring a selected portrait image from an image library as the first portrait image;
responding to the operation of the second uploading control, and acquiring the selected portrait image from an image library as the second portrait image;
and responding to the operation of the style selection control, and acquiring a target keyword corresponding to the selected target portrait style.
12. An image processing apparatus, characterized in that the apparatus comprises: an image acquisition module, an image generation module and a face replacement module, wherein,
the image acquisition module is used for acquiring a first portrait image, a second portrait image, and a target keyword corresponding to a target portrait style, wherein the first portrait image comprises a first face, and the second portrait image comprises a human body with a target pose;
the image generation module is used for generating a third portrait image with the target portrait style based on the target keyword and the second portrait image through a pre-trained large language model, wherein the pose of a human body in the third portrait image matches the target pose;
and the face replacement module is used for replacing the face in the third portrait image with the first face based on the first portrait image to obtain a fourth portrait image.
13. An electronic device, comprising:
one or more processors;
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-11.
14. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program code, which is callable by a processor for executing the method according to any one of claims 1-11.
CN202311163136.XA 2023-09-08 2023-09-08 Image processing method, device, electronic equipment and storage medium Pending CN117252791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311163136.XA CN117252791A (en) 2023-09-08 2023-09-08 Image processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311163136.XA CN117252791A (en) 2023-09-08 2023-09-08 Image processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117252791A true CN117252791A (en) 2023-12-19

Family

ID=89127203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311163136.XA Pending CN117252791A (en) 2023-09-08 2023-09-08 Image processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117252791A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118172432A (en) * 2024-03-29 2024-06-11 浙江吉利控股集团有限公司 Gesture adjustment method and device, electronic equipment and storage medium
CN118334161B (en) * 2024-06-11 2024-08-30 海马云(天津)信息技术有限公司 Photo picture generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination