CN116958326A - Image editing method, device, electronic equipment and storage medium - Google Patents

Image editing method, device, electronic equipment and storage medium

Info

Publication number
CN116958326A
Authority
CN
China
Prior art keywords
image
target text
edited
embedded vector
fine
Prior art date
Legal status
Pending
Application number
CN202311018253.7A
Other languages
Chinese (zh)
Inventor
张严浩
刘鹏
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202311018253.7A
Publication of CN116958326A
Status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60Editing figures and text; Combining figures or text

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses an image editing method and apparatus, an electronic device, and a storage medium, relating to the technical field of image processing. The method includes: acquiring an image to be edited and a target text; performing image reconstruction on the image to be edited based on the target text, and fine-tuning the target text based on the image reconstruction loss during reconstruction to obtain a fine-tuned target text; obtaining an attention feature map corresponding to the image to be edited based on the fine-tuned target text; and editing the image to be edited based on the attention feature map to obtain a target image. By combining fine-tuning of the target text with attention-feature-map editing to obtain the target image, the application realizes controllable image editing according to the target text and meets user requirements in practical applications.

Description

Image editing method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image editing method, an image editing device, an electronic device, and a storage medium.
Background
With the continuous development of artificial intelligence technology, image editing is widely applied in fields such as games, animation, and web design. However, current image editing is often uncontrollable, so the edited image frequently fails to meet user requirements.
Disclosure of Invention
In view of the above, the present application proposes an image editing method, apparatus, electronic device, and storage medium to solve the above problems.
In a first aspect, an embodiment of the present application provides an image editing method, including: acquiring an image to be edited and a target text; performing image reconstruction on the image to be edited based on the target text, and fine-tuning the target text based on the image reconstruction loss during reconstruction to obtain a fine-tuned target text; and obtaining an attention feature map corresponding to the image to be edited based on the fine-tuned target text, and editing the image to be edited based on the attention feature map to obtain a target image.
In a second aspect, an embodiment of the present application provides an image editing apparatus, including: an information acquisition module configured to acquire an image to be edited and a target text; a text fine-tuning module configured to perform image reconstruction on the image to be edited based on the target text, and to fine-tune the target text based on the image reconstruction loss during reconstruction to obtain a fine-tuned target text; and an image editing module configured to obtain an attention feature map corresponding to the image to be edited based on the fine-tuned target text, and to edit the image to be edited based on the attention feature map to obtain a target image.
In a third aspect, an embodiment of the present application provides an electronic device comprising a memory and a processor, the memory coupled to the processor, the memory storing instructions that when executed by the processor perform the above-described method.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having program code stored therein, the program code being callable by a processor to perform the above method.
According to the image editing method and apparatus, the electronic device, and the storage medium provided herein, an image to be edited and a target text are acquired; image reconstruction is performed on the image to be edited based on the target text, and the target text is fine-tuned based on the image reconstruction loss during reconstruction to obtain a fine-tuned target text; an attention feature map corresponding to the image to be edited is obtained based on the fine-tuned target text, and the image to be edited is edited based on the attention feature map to obtain a target image. By combining fine-tuning of the target text with attention-feature-map editing, controllable image editing is achieved according to the target text, meeting user requirements in practical applications.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flowchart of an image editing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a target text and an image to be edited according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a target image generated based on the prior art according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a target image generated according to the manner provided by the present embodiment;
FIG. 5 is a schematic diagram of a target text and an image to be edited according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a target image generated based on a prior art method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a target image generated according to the manner provided by the present embodiment;
FIG. 8 is a flowchart of an image editing method according to an embodiment of the present application;
FIG. 9 is a flowchart of an image editing method according to an embodiment of the present application;
FIG. 10 is a flowchart of step S330 of the image editing method shown in FIG. 9 of the present application;
FIG. 11 is a flowchart of an image editing method according to an embodiment of the present application;
FIG. 12 is a flowchart of an image editing method according to an embodiment of the present application;
FIG. 13 is a flowchart of an image editing method according to an embodiment of the present application;
FIG. 14 is a flowchart of step S640 of the image editing method shown in FIG. 13 of the present application;
FIG. 15 is a schematic diagram of an image editing method according to an embodiment of the present application;
FIG. 16 is a block diagram showing an image editing apparatus according to an embodiment of the present application;
FIG. 17 shows a block diagram of an electronic device for performing an image editing method according to an embodiment of the present application;
FIG. 18 illustrates a storage unit for storing or carrying program code for implementing an image editing method according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
In current research on generative models, compared with the traditional generative adversarial network (GAN), the diffusion model achieves more stable and diverse generation through its distinctive diffusion process and has gradually become the mainstream of current generative models. A representative example is Stable Diffusion (a stable diffusion model that realizes text-to-image conversion), which introduces a VAE on top of the classical diffusion model to move the denoising process into a latent space, and adds extra conditions such as text, masks, and images during training. This greatly reduces the computational resources required for training and inference while maintaining excellent generation performance and enhancing the controllability of the generated results. Such models can also be pre-trained and fine-tuned on different tasks and datasets, giving them high transferability. They have therefore been widely used in fields such as image generation and text generation.
Although current diffusion models perform impressively in image generation, they still have shortcomings in image editing tasks. Image editing typically requires keeping the overall structure unchanged while making fine modifications to a specific region of the image, which places high demands on the controllability and accuracy of the model. Existing diffusion models support local image editing by inputting a mask of the region to be edited; however, this requires manually selecting the mask or invoking a semantic segmentation model to generate it, which is a serious drawback in terms of editing efficiency and diversity. Without adding new conditions and retraining the diffusion model, directly realizing local image editing from text semantic information is the best choice when both editing efficiency and practical operating difficulty are considered. However, achieving local image editing guided directly by text semantic information still faces challenges and difficulties, some of which are as follows:
First, there is a very strong correlation between an image generated by an LDM (latent diffusion model) and its conditional text, so editing specific image regions based on self-attention and cross-attention feature maps is a good solution. However, such methods require a precise descriptive text for the original image in order to construct the feature relationships, and image editing is then accomplished by modifying or extending that text. Although no image mask needs to be provided, two pieces of text are still required to carry out the editing, which is not concise. Moreover, limited by the cross-attention between the descriptive text and the image, such methods can hardly achieve action modification in images.
Second, methods that fine-tune the text vector in semantic space, integrating new semantic information while maintaining the original image layout, also perform well and have clear advantages in action editing. However, such methods need to fine-tune the diffusion model (fine-tuning means slightly changing its parameters to improve performance on the given image) on the basis of a single image, which greatly increases the time required for editing and harms usability in real scenarios. Worse, fine-tuning the diffusion model on a single picture can destroy the model's characteristics, seriously impairing its semantic expression capability and robustness.
As noted in the second problem, the usability of a diffusion model after fine-tuning is greatly reduced. Therefore, some approaches reintroduce the pre-fine-tuning diffusion model alongside the fine-tuned one and weight the noise predictions of the two using a classifier-free guidance strategy, which can alleviate the problems caused by single-image fine-tuning. However, this requires maintaining both a fine-tuned diffusion model and the original diffusion model, making the expected effect difficult to achieve in terms of both time and space efficiency.
To solve the above problems, the inventors, through long-term research, discovered and propose the image editing method and apparatus, electronic device, and storage medium provided by the embodiments of the present application, which obtain the target image by combining fine-tuning of the target text with attention-feature-map editing, thereby realizing controllable image editing according to the target text with simpler operation and meeting user requirements in practical applications. The specific image editing method is described in detail in the following embodiments.
Referring to FIG. 1, FIG. 1 is a flowchart illustrating an image editing method according to an embodiment of the application. The method obtains the target image by combining fine-tuning of the target text with attention-feature-map editing, thereby realizing controllable image editing according to the target text with simpler operation and meeting user requirements in practical applications. In a specific embodiment, the image editing method is applied to the image editing apparatus 200 shown in FIG. 16 and to the electronic device 100 (FIG. 17) configured with the image editing apparatus 200. The following describes the specific flow of this embodiment by taking an electronic device as an example; it will be understood that the electronic device applied in this embodiment may include a smartphone, a tablet computer, a desktop computer, an in-vehicle terminal, a game console, a wearable electronic device, and the like, which is not limited herein. The flowchart shown in FIG. 1 is described in detail below; the image editing method may specifically include the following steps:
Step S110: and acquiring an image to be edited and a target text.
In this embodiment, an image to be edited and a target text may be acquired. Optionally, the image to be edited may be an original image that the user desires to edit; for example, it may be an image of a dog standing on a lawn. Optionally, the target text may describe the result the user desires the editing to achieve; for example, it may be the text "a cat standing on a lawn", "a dog standing in a desert", "a dog sitting on a lawn", or "a dog lying on a lawn", which is not limited herein.
In some embodiments, the number of images to be edited may be one or more. When the number of the images to be edited is one, the number of the target texts can be one, and the image to be edited corresponds to one target text; when the number of the images to be edited is plural, the number of the target texts may be plural, and the plural images to be edited and the plural target texts are in one-to-one correspondence.
As one implementation, the electronic device may acquire the image to be edited in advance and store it locally; accordingly, when editing is required, the image to be edited may be read from local storage. For example, the image may have been captured by the electronic device's camera and stored locally, or downloaded by the electronic device from a server over a network and stored locally. As another implementation, the image to be edited may not be stored locally; in this case, the electronic device may acquire it from a server over a network when editing is required.
In some implementations, the electronic device can obtain text entered by a user and take it as the target text; alternatively, the electronic device may acquire voice input from the user, convert it into text through a speech-to-text technique, and use the converted text as the target text.
As an embodiment, in the case of obtaining the input text, the input text may be directly determined as the target text; alternatively, in the case of obtaining an input text, it is possible to extract key information from the input text and determine the extracted key information as a target text.
Step S120: and carrying out image reconstruction on the image to be edited based on the target text, and carrying out fine adjustment on the target text based on image reconstruction loss in the image reconstruction process to obtain a fine-adjusted target text.
In order to strengthen the correlation between the target text and the target image obtained after the image to be edited is edited based on the target text, a correlation between the target text and the image to be edited can be constructed. Optionally, this correlation can be constructed by fine-tuning the target text during the reconstruction of the image to be edited based on the target text, so as to ensure the correlation between the attention feature map and the image to be edited during inference.
In this embodiment, when the image to be edited and the target text are obtained, the image to be edited may be reconstructed based on the target text, and the target text may be fine-tuned based on the image reconstruction loss during reconstruction to obtain the fine-tuned target text.
In some embodiments, when the image to be edited and the target text are obtained, the image to be edited may be reconstructed based on the target text, and the corresponding image reconstruction loss during reconstruction may be calculated, so that the target text is fine-tuned based on the image reconstruction loss to obtain the fine-tuned target text.
As one implementation, when the image to be edited and the target text are obtained, the target text and the image to be edited may be processed by an image reconstruction algorithm, and the corresponding image reconstruction loss during reconstruction may be calculated, so that the target text is fine-tuned based on the image reconstruction loss to obtain the fine-tuned target text.
As another implementation, when the image to be edited and the target text are obtained, the target text and the image to be edited may be processed by an image reconstruction model (such as a diffusion model), and the corresponding image reconstruction loss during reconstruction may be calculated, so that the target text is fine-tuned based on the image reconstruction loss to obtain the fine-tuned target text.
Step S130: and obtaining an attention characteristic diagram corresponding to the image to be edited based on the fine-tuned target text, and editing the image to be edited based on the attention characteristic diagram to obtain a target image.
In this embodiment, when the fine-tuned target text is obtained, an attention feature map corresponding to the image to be edited may be obtained based on the fine-tuned target text, and the image to be edited may be edited based on the attention feature map to obtain the target image. The target image is the image obtained after the image to be edited is edited based on the target text; that is, it realizes the editing goal indicated by the target text.
In some embodiments, when the fine-tuned target text is obtained, the image to be edited may be reconstructed based on the fine-tuned target text, so that the attention feature map corresponding to the image to be edited is obtained while perfect reconstruction of the image to be edited is achieved, thereby constructing the feature relationship between the fine-tuned target text and the image to be edited.
As one implementation, when the fine-tuned target text is obtained, the fine-tuned target text and the image to be edited may be processed by an image reconstruction algorithm, so that the attention feature map corresponding to the image to be edited is obtained while perfect reconstruction of the image to be edited is achieved.
As another implementation, when the fine-tuned target text is obtained, the fine-tuned target text and the image to be edited may be processed by an image reconstruction model, so that the attention feature map corresponding to the image to be edited is obtained while perfect reconstruction of the image to be edited is achieved.
In some embodiments, editing the image to be edited based on the attention feature map may include: adjusting part or all of the image to be edited based on the attention feature map to obtain the target image; or replacing part or all of the image to be edited based on the attention feature map to obtain the target image; this is not limited herein.
Referring to FIG. 2 to FIG. 4: as shown in FIG. 2, the image to be edited may be an image of a dog standing on a lawn, and the target text may be "a cat standing on a lawn". Without combining the target-text fine-tuning and attention-feature-map editing of this embodiment, the generated target image may be as shown in FIG. 3: although the result is an image of a cat standing on a lawn, the cat differs greatly in appearance and posture from the dog in the image to be edited, so the editing effect is poor. With the target-text fine-tuning and attention-feature-map editing of this embodiment, the generated target image may be as shown in FIG. 4: the result is an image of a cat standing on the lawn whose appearance and posture are strongly correlated with the dog in the image to be edited, so the editing effect is good.
Referring to FIG. 5 to FIG. 7: as shown in FIG. 5, the image to be edited may be an image of a dog squatting on a lawn, and the target text may be "a dog standing on a lawn" or "a dog lying on a lawn". Without combining the target-text fine-tuning and attention-feature-map editing of this embodiment, the generated target image may be as shown in FIG. 6: although the result is an image of a dog standing on a lawn, the dog differs greatly in appearance from the dog in the image to be edited, so the editing effect is poor. With the target-text fine-tuning and attention-feature-map editing of this embodiment, the generated target image may be as shown in FIG. 7: the result is an image of a dog lying on the lawn whose appearance is strongly correlated with the dog in the image to be edited, so the editing effect is good.
According to the image editing method provided by this embodiment of the application, an image to be edited and a target text are acquired; the image to be edited is reconstructed based on the target text, and the target text is fine-tuned based on the image reconstruction loss during reconstruction to obtain a fine-tuned target text; an attention feature map corresponding to the image to be edited is obtained based on the fine-tuned target text, and the image to be edited is edited based on the attention feature map to obtain the target image. By combining fine-tuning of the target text with attention-feature-map editing, controllable image editing is realized according to the target text with simpler operation, meeting user requirements in practical applications.
Referring to FIG. 8, FIG. 8 is a flowchart illustrating an image editing method according to an embodiment of the application. The flowchart shown in FIG. 8 is described in detail below; the image editing method may specifically include the following steps:
step S210: and acquiring an image to be edited and a target text.
For the specific description of step S210, refer to step S110; details are not repeated here.
Step S220: an embedded vector of the target text is determined.
In this embodiment, when the target text is acquired, an embedded vector (conditional text embedding) of the target text may be determined.
In some embodiments, when the target text is obtained, a feature vector may be extracted from the target text to obtain its embedded vector. Optionally, one feature-vector extraction manner may be preset, and feature extraction may be performed on the target text based on this preset manner to obtain the embedded vector of the target text. Alternatively, multiple feature-vector extraction manners may be preset; when the target text is obtained, one of them may be selected and used to perform feature extraction on the target text to obtain its embedded vector. The extraction manner may be selected based on the target text, for example based on its text content or its text format, which is not limited herein. An illustrative extraction sketch is given below.
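As an illustration of one such extraction manner, the sketch below obtains a conditional text embedding with a CLIP-style text encoder of the kind used by Stable Diffusion; the checkpoint name and the transformers API usage are assumptions for illustration, not components specified by the application.

```python
# Hedged sketch: extracting the embedded vector (conditional text embedding)
# of the target text with a CLIP-style text encoder, as used by Stable
# Diffusion. The checkpoint name is an assumption, not part of the patent.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def embed_target_text(target_text: str) -> torch.Tensor:
    tokens = tokenizer(
        target_text,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        # shape (1, sequence_length, hidden_dim): one vector per token
        return text_encoder(tokens.input_ids).last_hidden_state

cond_embedding = embed_target_text("a cat standing on a lawn")
```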
Step S230: and carrying out image reconstruction on the image to be edited based on the embedded vector of the target text, and carrying out fine adjustment on the embedded vector of the target text based on image reconstruction loss in the image reconstruction process to obtain the target text after fine adjustment.
In this embodiment, when the embedded vector of the target text and the image to be edited are obtained, the image to be edited may be reconstructed based on the embedded vector of the target text, and the embedded vector may be fine-tuned based on the image reconstruction loss during reconstruction to obtain the fine-tuned target text.
In some embodiments, when the embedded vector of the target text and the image to be edited are obtained, the image to be edited may be reconstructed based on the embedded vector, and the corresponding image reconstruction loss during reconstruction may be calculated, so that the embedded vector is fine-tuned based on this loss to obtain the fine-tuned target text.
As one implementation, the embedded vector of the target text and the image to be edited may be processed by an image reconstruction algorithm, and the corresponding image reconstruction loss during reconstruction may be calculated, so that the embedded vector is fine-tuned based on this loss to obtain the fine-tuned target text.
As another implementation, the embedded vector of the target text and the image to be edited may be processed by an image reconstruction model, and the corresponding image reconstruction loss during reconstruction may be calculated, so that the embedded vector is fine-tuned based on this loss to obtain the fine-tuned target text.
Step S240: and obtaining an attention characteristic diagram corresponding to the image to be edited based on the fine-tuned target text, and editing the image to be edited based on the attention characteristic diagram to obtain a target image.
For the specific description of step S240, refer to step S130; details are not repeated here.
According to the image editing method provided by this embodiment, an image to be edited and a target text are acquired; an embedded vector of the target text is determined; the image to be edited is reconstructed based on the embedded vector, and the embedded vector is fine-tuned based on the image reconstruction loss during reconstruction to obtain a fine-tuned target text; an attention feature map corresponding to the image to be edited is obtained based on the fine-tuned target text, and the image to be edited is edited based on the attention feature map to obtain the target image. Compared with the image editing method shown in FIG. 1, this embodiment further determines the embedded vector of the target text and uses it for image reconstruction, improving the subsequent fine-tuning of the target text and thereby the editing effect.
Referring to FIG. 9, FIG. 9 is a flowchart illustrating an image editing method according to an embodiment of the application. The flowchart shown in FIG. 9 is described in detail below; the image editing method may specifically include the following steps:
step S310: and acquiring an image to be edited and a target text.
For the specific description of step S310, refer to step S110; details are not repeated here.
Step S320: an embedded vector of the target text is determined.
For the specific description of step S320, refer to step S220; details are not repeated here.
Step S330: and carrying out image reconstruction on the image to be edited based on the embedded vector of the target text through a diffusion model, and carrying out fine adjustment on the embedded vector of the target text based on image reconstruction loss in the image reconstruction process to obtain the target text after fine adjustment.
In this embodiment, when the embedded vector of the target text and the image to be edited are obtained, the image to be edited may be reconstructed through a diffusion model based on the embedded vector, and the embedded vector may be fine-tuned based on the image reconstruction loss during reconstruction to obtain the fine-tuned target text. Optionally, the diffusion model may be a DDIM (denoising diffusion implicit model, an iterative implicit diffusion model that improves sampling efficiency).
For a given target text and image to be edited, realizing highly controllable image editing without changing the diffusion model weights depends heavily on editing the attention feature maps during the inference stage. However, an image generated by the DDIM process (including the forward, noising, process and the reverse, denoising, process) under text-conditional guidance without additional processing is distorted to a greater extent relative to the original image, which significantly reduces the correlation between its attention feature map and the original image. Therefore, in this implementation, the embedded vector of the target text can be fine-tuned before image-editing inference, so as to ensure the correlation between the attention feature map and the image to be edited during diffusion-model inference.
Here, performing image reconstruction on the image to be edited based on the embedded vector of the target text through a diffusion model may include: completing the whole diffusion-model process, that is, both the forward process and the reverse process (for example, the complete DDIM process), based on the embedded vector of the target text and the image to be edited.
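For context, the deterministic update at the core of both DDIM processes can be written as follows; this is the standard DDIM formulation, supplied here as background rather than quoted from the application:

$$z_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\frac{z_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(z_t, t, c)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1-\bar{\alpha}_{t-1}}\;\epsilon_\theta(z_t, t, c)$$

where $\epsilon_\theta(z_t, t, c)$ is the noise predicted by the diffusion model under text condition $c$ and $\bar{\alpha}_t$ is the cumulative noise schedule. Running this update with $t$ increasing gives the forward (noising, or inversion) process, and with $t$ decreasing gives the reverse (denoising) process.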
In some embodiments, performing image reconstruction through the diffusion model and fine-tuning the embedded vector of the target text based on the image reconstruction loss may include: in the DDIM process conditioned on the target text, taking the embedded vector of the target text as a learnable parameter, and using as the loss function the mean square error between the noise generated under the target-text condition and the noise generated in the corresponding step of the standard DDIM process, thereby fine-tuning the embedded vector to obtain the fine-tuned target text. On this basis, the feature relationship between the fine-tuned target text and the image to be edited is constructed while perfect image reconstruction is realized, ensuring the correlation between the attention feature map and the image to be edited.
Referring to FIG. 10, FIG. 10 is a flowchart of step S330 of the image editing method shown in FIG. 9. The flow shown in FIG. 10 is described in detail below; the method may specifically include the following steps:
step S331: based on the image to be edited, first noise data in a forward direction of the diffusion model is determined, and second noise data in a reverse direction of the diffusion model with the embedded vector of the target text as a learnable parameter is determined.
This embodiment may perform image reconstruction on the image to be edited using the standard diffusion-model process, which includes the standard forward process (noising) and the standard reverse process (denoising). Optionally, the diffusion model may be a DDIM, and the standard process may include the DDIM forward process and the DDIM reverse process.
In some embodiments, when the embedded vector of the target text and the image to be edited are obtained, noise data in the forward process of the diffusion model may be determined based on the image to be edited and used as the first noise data, and noise data in the reverse process of the diffusion model, with the embedded vector of the target text as a learnable parameter, may be determined and used as the second noise data.
As one implementation, the image to be edited may be noised in the forward process of the diffusion model to obtain the first noise data, and in the reverse process of the diffusion model, with the embedded vector of the target text as a learnable parameter, the noise predicted for the image obtained from the forward process may be taken as the second noise data.
In some embodiments, determining the first and second noise data based on the image to be edited may include: performing noise sampling and superposition on the image to be edited in the forward process of the diffusion model to obtain the first noise data; and guiding the reverse process of the diffusion model with the embedded vector of the target text as a learnable parameter to obtain the second noise data output by the diffusion model.
As one implementation, the image reconstruction of the image to be edited may follow the standard diffusion-model procedure. Specifically, in the forward process, noise may be sampled and superposed on the image to be edited to obtain the noised image and the first noise data; the reverse process of the diffusion model is then guided with the embedded vector of the target text as a learnable parameter to denoise the noised image, obtaining the second noise data output by the diffusion model.
Step S332: a mean square error between the first noise data and the second noise data is calculated.
In some embodiments, when the first noise data and the second noise data are obtained, the mean square error between them may be calculated.
As one implementation, a mean square error algorithm may be preset, and when the first and second noise data are obtained, the mean square error between them may be calculated by this algorithm.
Step S333: and fine tuning the embedded vector of the target text based on a preset loss function and the mean square error to obtain the fine tuned target text.
In some embodiments, when the mean square error between the first and second noise data is obtained, the embedded vector of the target text may be fine-tuned based on the preset loss function and the mean square error to obtain the fine-tuned target text.
As one implementation, a loss function may be preset; when the mean square error between the first and second noise data is obtained, it may be evaluated through the preset loss function to update the parameter to be optimized, namely the embedded vector of the target text, thereby fine-tuning the embedded vector to obtain the fine-tuned target text.
As an example, when the image to be edited and the embedded vector of the target text are obtained, noise sampling and superposition may be performed on the image to be edited to generate the noised latent code $Z_{t+1}$. Then, with the embedded vector $\theta$ of the target text as a learnable parameter under the target-text condition, the mean square error between the diffusion model output $\Psi_{\theta}(Z_{t+1})$ and the standard data $Z_t^{*}$ from the forward process of the diffusion model is used as the loss function to fine-tune the embedded vector of the target text, obtaining the fine-tuned target text. The denoising process of the diffusion model guided by the fine-tuned embedded vector then outputs a perfect reconstruction of the image to be edited, thereby constructing the feature relationship between the target text and the image to be edited by means of the attention feature maps.
Optionally, the preset loss function can be written as:

$$\min_{\theta}\; \Gamma_{mse}\!\left( Z_t^{*},\; \Psi_{\theta}\!\left( Z_{t+1} \right) \right)$$

where $Z_t^{*}$ is the reconstruction-standard latent code at the current time step $t$ (obtained from the forward process), $\theta$ is the embedded vector of the target text and is also the parameter to be optimized by fine-tuning, $\Psi_{\theta}(Z_{t+1})$ is the latent code at time step $t$ predicted by the diffusion model from the latent code at time $t+1$ (i.e., its estimate of $Z_t$), and $\Gamma_{mse}$ denotes the mean square error.
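A minimal sketch of this fine-tuning loop is given below, assuming a diffusers-style UNet noise predictor and DDIM-style scheduler; the names `unet`, `scheduler`, and `z0` are illustrative assumptions rather than components named by the application.

```python
# Hedged sketch of step S330: fine-tuning the embedded vector of the target
# text by image reconstruction. Assumes a diffusers-style UNet2DConditionModel
# (`unet`) and a DDIM-style scheduler (`scheduler`); these names are
# illustrative, not taken from the patent.
import torch
import torch.nn.functional as F

def finetune_text_embedding(z0, cond_embedding, unet, scheduler,
                            steps=500, lr=1e-3):
    # z0: latent of the image to be edited; cond_embedding: embedded vector
    # of the target text, treated here as the learnable parameter theta.
    theta = cond_embedding.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,),
                          device=z0.device)
        noise = torch.randn_like(z0)              # "first noise data"
        z_t = scheduler.add_noise(z0, noise, t)   # forward (noising) process
        # "second noise data": noise predicted under the target-text condition
        pred = unet(z_t, t, encoder_hidden_states=theta).sample
        loss = F.mse_loss(pred, noise)            # Gamma_mse reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return theta.detach()                         # fine-tuned embedding
```

In this sketch only the text embedding receives gradients; the diffusion model weights stay frozen, consistent with the premise of editing without changing the model weights.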
Step S340: and obtaining an attention characteristic diagram corresponding to the image to be edited based on the fine-tuned target text, and editing the image to be edited based on the attention characteristic diagram to obtain a target image.
For the specific description of step S340, refer to step S130; details are not repeated here.
According to the image editing method provided by this embodiment, an image to be edited and a target text are acquired; an embedded vector of the target text is determined; the image to be edited is reconstructed through a diffusion model based on the embedded vector, and the embedded vector is fine-tuned based on the image reconstruction loss during reconstruction to obtain a fine-tuned target text; an attention feature map corresponding to the image to be edited is obtained based on the fine-tuned target text, and the image to be edited is edited based on the attention feature map to obtain the target image. Compared with the image editing method shown in FIG. 1, this embodiment further determines the embedded vector of the target text and reconstructs the image to be edited through the diffusion model based on it, improving the subsequent fine-tuning of the target text and thereby the editing effect.
Referring to FIG. 11, FIG. 11 is a flowchart illustrating an image editing method according to an embodiment of the application. The flowchart shown in FIG. 11 is described in detail below; the image editing method may specifically include the following steps:
Step S410: and acquiring an image to be edited and a target text.
Step S420: and carrying out image reconstruction on the image to be edited based on the target text, and carrying out fine adjustment on the target text based on image reconstruction loss in the image reconstruction process to obtain a fine-adjusted target text.
For the specific descriptions of steps S410 to S420, refer to steps S110 to S120; details are not repeated here.
Step S430: and obtaining a self-attention characteristic diagram corresponding to the image to be edited based on the target text after fine adjustment, and obtaining a cross-attention characteristic diagram between the target text after fine adjustment and the image to be edited based on the target text after fine adjustment.
In this embodiment, when the fine-tuned target text is obtained, a self-attention feature map corresponding to the image to be edited may be obtained based on it, as well as a cross-attention feature map between the fine-tuned target text and the image to be edited. The self-attention feature map mainly represents the layout features of the image to be edited, so editing based on it effectively preserves the overall layout of the image and makes the editing more controllable. The cross-attention feature map focuses on representing the layout relationship between the fine-tuned target text and the image to be edited, so editing based on it better fuses the target-text features into the region to be edited, largely preserving the information of the image to be edited during editing. A toy illustration of both maps follows the next paragraph.
In some embodiments, when the image to be edited and the fine-tuned target text are obtained, they may be input into a diffusion model to obtain the self-attention feature map corresponding to the image to be edited and the cross-attention feature map between the fine-tuned target text and the image to be edited.
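The self-contained toy computation below illustrates where the two kinds of feature map come from inside one attention block of the denoiser; the shapes and variable names are illustrative assumptions, not the application's architecture.

```python
# Hedged, self-contained illustration of the two attention feature maps in
# one denoiser block: self-attention (image layout) and cross-attention
# (text-to-image correspondence). Shapes are illustrative assumptions.
import torch

def attention_feature_maps(img_tokens: torch.Tensor,
                           text_tokens: torch.Tensor):
    # img_tokens:  (n_img, d) spatial features of the latent image
    # text_tokens: (n_txt, d) token features of the fine-tuned target text
    d = img_tokens.shape[-1]
    scale = d ** -0.5
    # self-attention map: image queries against image keys; it mainly
    # encodes the layout of the image to be edited
    self_attn = torch.softmax(img_tokens @ img_tokens.T * scale, dim=-1)
    # cross-attention map: image queries against text keys; it encodes
    # which image regions each target-text token governs
    cross_attn = torch.softmax(img_tokens @ text_tokens.T * scale, dim=-1)
    return self_attn, cross_attn  # (n_img, n_img), (n_img, n_txt)

self_map, cross_map = attention_feature_maps(torch.randn(64 * 64, 320),
                                             torch.randn(77, 320))
```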
Step S440: and editing the image to be edited based on the self-attention characteristic diagram and the cross-attention characteristic diagram to obtain the target image.
In this embodiment, when the self-attention feature map corresponding to the image to be edited and the cross-attention feature map between the fine-tuned target text and the image to be edited are obtained, the image to be edited may be edited based on both to obtain the target image.
In some embodiments, editing the image to be edited based on the self-attention and cross-attention feature maps may include: adjusting part or all of the image to be edited based on the two feature maps to obtain the target image; or replacing part or all of the image to be edited based on the two feature maps to obtain the target image; this is not limited herein.
According to the image editing method provided by this embodiment, an image to be edited and a target text are acquired; the image to be edited is reconstructed based on the target text, and the target text is fine-tuned based on the image reconstruction loss during reconstruction to obtain a fine-tuned target text; a self-attention feature map corresponding to the image to be edited and a cross-attention feature map between the fine-tuned target text and the image to be edited are obtained based on the fine-tuned target text; and the image to be edited is edited based on both feature maps to obtain the target image. Compared with the image editing method shown in FIG. 1, this embodiment also determines the self-attention feature map of the image to be edited and the cross-attention feature map between the fine-tuned target text and the image to be edited, so that image editing can largely preserve the original information of the image while increasing the controllability of the editing.
Referring to FIG. 12, FIG. 12 is a flowchart illustrating an image editing method according to an embodiment of the application. The flowchart shown in FIG. 12 is described in detail below; the image editing method may specifically include the following steps:
Step S510: and acquiring an image to be edited and a target text.
Step S520: and carrying out image reconstruction on the image to be edited based on the target text, and carrying out fine adjustment on the target text based on image reconstruction loss in the image reconstruction process to obtain a fine-adjusted target text.
For the specific descriptions of steps S510 to S520, refer to steps S110 to S120; details are not repeated here.
Step S530: and determining the embedded vector of the trimmed target text.
In this embodiment, when the trimmed target text is acquired, an embedded vector (conditional text embedding) of the trimmed target text may be determined.
In some embodiments, in the case of acquiring the trimmed target text, the feature vector may be extracted from the trimmed target text to obtain the embedded vector of the trimmed target text. Optionally, a feature vector extraction manner may be preset, and in the case of acquiring the trimmed target text, feature extraction may be performed on the trimmed target text based on the preset feature vector extraction manner, so as to obtain an embedded vector of the trimmed target text. Alternatively, a plurality of feature vector extraction manners may be preset, and when the trimmed target text is obtained, one feature vector extraction manner may be selected from the preset plurality of feature vector extraction manners, and feature extraction is performed on the trimmed target text by using the selected feature vector extraction manner, so as to obtain an embedded vector of the trimmed target text, where the feature vector extraction manner may be selected based on the trimmed target text, for example, may be based on text content of the trimmed target text, may be based on a target format of the trimmed target text, and the like, and is not limited herein.
Step S540: and obtaining an attention characteristic diagram corresponding to the image to be edited based on the embedded vector of the target text after fine adjustment, and editing the image to be edited based on the attention characteristic diagram to obtain the target image.
In this embodiment, when the embedded vector of the fine-tuned target text is obtained, an attention feature map corresponding to the image to be edited may be obtained based on it, and the image to be edited may be edited based on the attention feature map to obtain the target image.
According to the image editing method provided by this embodiment, an image to be edited and a target text are acquired; the image to be edited is reconstructed based on the target text, and the target text is fine-tuned based on the image reconstruction loss during reconstruction to obtain a fine-tuned target text; an embedded vector of the fine-tuned target text is determined; an attention feature map corresponding to the image to be edited is obtained based on this embedded vector, and the image to be edited is edited based on the attention feature map to obtain the target image. Compared with the image editing method shown in FIG. 1, this embodiment further determines the embedded vector of the fine-tuned target text to obtain the attention feature map, improving the accuracy of the determined attention feature map and thereby the editing effect.
Referring to FIG. 13, FIG. 13 is a flowchart illustrating an image editing method according to an embodiment of the application. The flowchart shown in FIG. 13 is described in detail below; the image editing method may specifically include the following steps:
step S610: and acquiring an image to be edited and a target text.
Step S620: and carrying out image reconstruction on the image to be edited based on the target text, and carrying out fine adjustment on the target text based on image reconstruction loss in the image reconstruction process to obtain a fine-adjusted target text.
For the specific descriptions of steps S610 to S620, refer to steps S110 to S120; details are not repeated here.
Step S630: and determining the embedded vector of the trimmed target text.
For the specific description of step S630, refer to step S530; details are not repeated here.
Step S640: and obtaining an attention characteristic diagram corresponding to the image to be edited based on the embedded vector of the target text after fine adjustment through a diffusion model, and editing the image to be edited based on the attention characteristic diagram to obtain the target image.
In this embodiment, when the image to be edited and the embedded vector of the fine-tuned target text are obtained, the attention feature map corresponding to the image to be edited may be obtained through a diffusion model based on this embedded vector, and the image to be edited may be edited based on the attention feature map to obtain the target image. Optionally, the diffusion model may be a DDIM.
When the embedded vector of the fine-tuned target text is obtained, an attention feature map corresponding to the image to be edited can be obtained through the diffusion model based on this embedded vector; the diffusion-model process conditioned on the target text (such as a DDIM process) is then edited through the attention feature map, realizing functions such as replacing a specific object, modifying an action, converting the style, and replacing the background, all while keeping the layout of the image to be edited.
As one implementation, when the embedded vector of the fine-tuned target text is obtained, the embedded vector of the target text and the embedded vector of the fine-tuned target text can be input into the diffusion model simultaneously, and during the diffusion-model process (forward and reverse), the attention feature maps obtained under the guidance of the fine-tuned target text are substituted into the forward computation carried out under the guidance of the original target text, combining the layout features of the image to be edited with the semantic features of the target text and thereby realizing controllable editing of the image layout. It can be understood that this embodiment needs no additional descriptive text, no fine-tuning of the diffusion model, and no extra diffusion model that would increase the computational load; controllable image editing is achieved with only the single target text required for the edit. In addition, this embodiment performs well in difficult areas of image editing such as action editing, and can provide efficient, controllable image editing capability in both scientific research and engineering applications.
Referring to FIG. 14, FIG. 14 is a flowchart of step S640 of the image editing method shown in FIG. 13. The flow shown in FIG. 14 is described in detail below; the method may specifically include the following steps:
step S641: and inputting the embedded vector of the target text and the embedded vector of the trimmed target text into the diffusion model.
In some embodiments, where the embedded vector of the trimmed target text is obtained, then the embedded vector of the target text and the embedded vector of the trimmed target text may be input into a diffusion model. Alternatively, in the case of obtaining the embedded vector of the trimmed target text, the embedded vector of the target text and the embedded vector of the trimmed target text may be packaged in the same batch and input into the diffusion model.
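Packing the two embeddings into one batch might look like the following, assuming a diffusers-style UNet call; the variable names are illustrative assumptions, not the application's named components.

```python
# Hedged sketch of step S641: both embedded vectors are packaged in the same
# batch and pushed through the denoiser in one call. `unet`, `z_t`, and `t`
# are illustrative assumptions (diffusers-style call).
import torch

batch_emb = torch.cat([emb_target, emb_finetuned], dim=0)  # (2, seq, dim)
z_in = torch.cat([z_t, z_t], dim=0)                        # same latent for both branches
noise_pred = unet(z_in, t, encoder_hidden_states=batch_emb).sample
pred_target, pred_finetuned = noise_pred.chunk(2)          # split per branch
```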
Step S642: and guiding the reverse process of the diffusion model by using the embedded vector of the fine-tuned target text, and obtaining the attention characteristic diagram in the embedded vector branch of the fine-tuned target text.
Wherein, in the case of inputting the embedded vector of the target text and the embedded vector of the trimmed target text into the diffusion model, then one of the branches of the inverse process (denoising process) of the diffusion model is performed under the guidance of the embedded vector of the target text, that is, the inverse process of the diffusion model is performed with the embedded vector of the target text as a learnable parameter; the other branch of the inverse process of the diffusion model is performed under the guidance of the embedded vector of the trimmed target text, i.e. the inverse process of the diffusion model is performed with the embedded vector of the trimmed target text as a learnable parameter.
In this embodiment, when the two embedded vectors are input into the diffusion model, the reverse process of the diffusion model can be guided by the embedded vector of the fine-tuned target text, and the attention feature map can be collected in the branch of the fine-tuned embedded vector, for example with a forward hook as sketched below.
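The attention feature map of the fine-tuned branch can be collected, for example, with a forward hook. The following sketch uses a toy attention layer as a stand-in for the attention layers of a real diffusion UNet; all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    """Placeholder for an attention layer of a diffusion UNet."""
    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x, ctx=None):
        ctx = x if ctx is None else ctx   # self-attention or cross-attention
        scores = self.q(x) @ self.k(ctx).transpose(-1, -2) / x.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)
        self.last_attn = attn             # exposed so the hook can read it
        return attn @ self.v(ctx)

captured = {}

def capture_hook(module, inputs, output):
    # Keep only batch index 0, i.e. the branch guided by the embedded
    # vector of the fine-tuned target text.
    captured["attn"] = module.last_attn[0].detach()

layer = TinyAttention()
handle = layer.register_forward_hook(capture_hook)
x = torch.randn(2, 16, 64)     # batch of two branches: [fine-tuned, original]
_ = layer(x)
handle.remove()
print(captured["attn"].shape)  # torch.Size([16, 16])
```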
Step S643: replace the attention feature map into the forward computing process of the diffusion model guided by the embedded vector of the target text, and obtain the target image.
In some embodiments, once the attention feature map is obtained, it may be replaced into the forward computing process of the diffusion model guided by the embedded vector of the target text, so as to obtain the target image.
As one implementation, the attention feature map includes a self-attention feature map of the image to be edited and a cross-attention feature map between the fine-tuned target text and the image to be edited. Let branch 1 of the diffusion model denote the denoising process guided by the embedded vector of the fine-tuned target text, and branch 2 the denoising process guided by the embedded vector of the target text. The self-attention feature map mainly represents the layout features of the image to be edited, so replacing the self-attention feature map of branch 1 into branch 2 during denoising effectively preserves the overall layout of the image to be edited under the target-text guidance and makes the editing more controllable. The cross-attention feature map focuses on the layout relationship between the target text and the image to be edited, so replacing the cross-attention feature map of branch 1 into branch 2 fuses the target text features into the region to be edited, realizing the edit while retaining as much information of the image to be edited as possible. A sketch of such a replacement controller follows.
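The sketch below illustrates the replacement policy just described: self-attention and cross-attention maps stored from branch 1 (the fine-tuned branch) override the corresponding maps of branch 2. The AttentionReplacer class and its layer naming are hypothetical conveniences for this sketch, not part of the application.

```python
import torch

class AttentionReplacer:
    """Stores branch-1 attention maps and serves them to branch 2."""
    def __init__(self):
        self.stored_self = {}    # layer name -> self-attention feature map
        self.stored_cross = {}   # layer name -> cross-attention feature map

    def store(self, name, attn, is_cross):
        (self.stored_cross if is_cross else self.stored_self)[name] = attn

    def replace(self, name, attn, is_cross):
        # Self-attention carries the layout of the image to be edited;
        # cross-attention carries the text-to-image correspondence.
        store = self.stored_cross if is_cross else self.stored_self
        return store.get(name, attn)   # fall back to the branch's own map

replacer = AttentionReplacer()
replacer.store("down.0", torch.eye(16), is_cross=False)        # from branch 1
swapped = replacer.replace("down.0", torch.randn(16, 16), is_cross=False)
```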
According to the image editing method provided by this embodiment of the application, the image to be edited and the target text are acquired; the image to be edited is reconstructed based on the target text, and the target text is fine-tuned based on the image reconstruction loss in the reconstruction process to obtain the fine-tuned target text; the embedded vector of the fine-tuned target text is determined; the attention feature map corresponding to the image to be edited is obtained through the diffusion model based on that embedded vector; and the image to be edited is edited based on the attention feature map to obtain the target image. Compared with the image editing method shown in fig. 1, this embodiment further determines the embedded vector of the fine-tuned target text and obtains the attention feature map through the diffusion model based on that vector, which improves the accuracy of the determined attention feature map and thus the image editing effect.
Referring to fig. 15, as shown in fig. 15, a target text and an image to be edited are first given. The embedded vector of the target text is then fine-tuned in the DDIM process, and the fine-tuned target text is obtained when the number of fine-tuning steps reaches a predetermined value. Next, the embedded vector of the target text and the embedded vector of the fine-tuned target text are input into the DDIM, and during denoising the attention feature map in the branch of the fine-tuned embedded vector is replaced into the forward computing process of the branch guided by the target text, so as to obtain the target image.
Referring to fig. 16, fig. 16 is a block diagram illustrating an image editing apparatus according to an embodiment of the application. The image editing apparatus 200 includes: an information acquisition module 210, a text fine tuning module 220, and an image editing module 230, wherein:
The information acquisition module 210 is configured to acquire an image to be edited and a target text.
The text fine tuning module 220 is configured to perform image reconstruction on the image to be edited based on the target text, and fine-tune the target text based on the image reconstruction loss in the image reconstruction process, so as to obtain a fine-tuned target text.
Further, the text fine tuning module 220 includes: a first embedded vector determination sub-module and a text fine tuning sub-module, wherein:
The first embedded vector determination sub-module is configured to determine the embedded vector of the target text.
The text fine tuning sub-module is configured to perform image reconstruction on the image to be edited based on the embedded vector of the target text, and fine-tune the embedded vector of the target text based on the image reconstruction loss in the image reconstruction process, so as to obtain the fine-tuned target text.
Further, the text fine tuning sub-module includes: a text fine tuning unit, wherein:
The text fine tuning unit is configured to perform, through the diffusion model, image reconstruction on the image to be edited based on the embedded vector of the target text, and fine-tune the embedded vector of the target text based on the image reconstruction loss in the image reconstruction process, so as to obtain the fine-tuned target text.
Further, the text fine tuning unit includes: a noise data obtaining subunit, a mean square error calculating subunit, and a target text fine tuning subunit, wherein:
The noise data obtaining subunit is configured to determine, based on the image to be edited, first noise data in the forward process of the diffusion model, and to determine second noise data in the reverse process of the diffusion model in which the embedded vector of the target text serves as a learnable parameter.
Further, the noise data obtaining subunit includes: a first noise data obtaining subunit and a second noise data obtaining subunit, wherein:
The first noise data obtaining subunit is configured to perform noise sampling and superposition on the image to be edited in the forward process of the diffusion model, so as to obtain the first noise data.
The second noise data obtaining subunit is configured to guide the reverse process of the diffusion model with the embedded vector of the target text as a learnable parameter, so as to obtain the second noise data output by the diffusion model.
The mean square error calculating subunit is configured to calculate the mean square error between the first noise data and the second noise data.
The target text fine tuning subunit is configured to fine-tune the embedded vector of the target text based on a preset loss function and the mean square error, so as to obtain the fine-tuned target text. A minimal sketch of this fine-tuning loop follows.
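The following sketch puts the three subunits above together: the first noise data is sampled in the forward process, the second noise data is predicted in the reverse process with the text embedding as the only learnable parameter, and the embedding is updated with a mean square error loss. The linear predictor, tensor shapes, and schedule are toy assumptions standing in for a frozen diffusion model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

predictor = nn.Linear(64, 64)                  # stand-in for the frozen diffusion model
for p in predictor.parameters():
    p.requires_grad_(False)

x0 = torch.randn(16, 64)                       # toy latent of the image to be edited
emb = torch.randn(77, 64, requires_grad=True)  # embedded vector of the target text
opt = torch.optim.Adam([emb], lr=1e-3)
alphas_cumprod = (1 - torch.linspace(1e-4, 0.02, 1000)).cumprod(dim=0)

for step in range(100):                        # fine-tuning iterations
    t = torch.randint(0, 1000, (1,)).item()
    a_t = alphas_cumprod[t]
    first_noise = torch.randn_like(x0)         # sampled in the forward process
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * first_noise  # noised image
    # Second noise: reverse-process prediction conditioned on the embedding,
    # which is the only learnable parameter here.
    second_noise = predictor(x_t + emb.mean(dim=0, keepdim=True))
    loss = F.mse_loss(second_noise, first_noise)   # preset loss: mean square error
    opt.zero_grad()
    loss.backward()
    opt.step()
# After the loop, emb plays the role of the embedded vector of the
# fine-tuned target text.
```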
The image editing module 230 is configured to obtain an attention feature map corresponding to the image to be edited based on the fine-tuned target text, and to edit the image to be edited based on the attention feature map, so as to obtain a target image.
Further, the image editing module 230 includes: an attention feature map obtaining sub-module and a first image editing sub-module, wherein:
The attention feature map obtaining sub-module is configured to obtain a self-attention feature map corresponding to the image to be edited based on the fine-tuned target text, and to obtain a cross-attention feature map between the fine-tuned target text and the image to be edited based on the fine-tuned target text.
The first image editing sub-module is configured to edit the image to be edited based on the self-attention feature map and the cross-attention feature map, so as to obtain the target image.
Further, the image editing module 230 includes: a second embedded vector determination sub-module and a second image editing sub-module, wherein:
The second embedded vector determination sub-module is configured to determine the embedded vector of the fine-tuned target text.
The second image editing sub-module is configured to obtain the attention feature map corresponding to the image to be edited based on the embedded vector of the fine-tuned target text, and to edit the image to be edited based on the attention feature map, so as to obtain the target image.
Further, the second image editing sub-module includes: an image editing unit, wherein:
The image editing unit is configured to obtain, through the diffusion model, the attention feature map corresponding to the image to be edited based on the embedded vector of the fine-tuned target text, and to edit the image to be edited based on the attention feature map, so as to obtain the target image.
Further, the image editing unit includes: an embedded vector input subunit, an attention feature map obtaining subunit, and an image editing subunit, wherein:
The embedded vector input subunit is configured to input the embedded vector of the target text and the embedded vector of the fine-tuned target text into the diffusion model.
The attention feature map obtaining subunit is configured to guide the reverse process of the diffusion model with the embedded vector of the fine-tuned target text, so as to obtain the attention feature map in the branch of the fine-tuned embedded vector.
The image editing subunit is configured to replace the attention feature map into the forward computing process of the diffusion model guided by the embedded vector of the target text, so as to obtain the target image.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In the several embodiments provided by the present application, the coupling between modules may be electrical, mechanical, or in other forms.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
Referring to fig. 17, a block diagram of an electronic device 100 according to an embodiment of the application is shown. The electronic device 100 may be a smart phone, a tablet computer, an electronic book reader, or another device capable of running application programs. The electronic device 100 of the present application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs, where the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, the one or more application programs being configured to perform the methods described in the foregoing method embodiments.
The processor 110 may include one or more processing cores. The processor 110 uses various interfaces and lines to connect the parts of the electronic device 100, and performs the functions of the electronic device 100 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 120 and invoking the data stored in the memory 120. Optionally, the processor 110 may be implemented in at least one hardware form of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 110 may integrate one of, or a combination of several of, a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing the content to be displayed; and the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 110 and may instead be implemented by a single communication chip alone.
The memory 120 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, and an image playing function), instructions for implementing the foregoing method embodiments, and the like; the data storage area may store data created by the electronic device 100 in use (such as a phone book, audio and video data, and chat log data) and the like.
Referring to fig. 18, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable storage medium 300 stores program code that can be invoked by a processor to perform the methods described in the foregoing method embodiments.
The computer readable storage medium 300 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium 300 includes a non-transitory computer-readable storage medium. The computer readable storage medium 300 has storage space for program code 310 that performs any of the method steps in the methods described above. The program code can be read from or written into one or more computer program products. The program code 310 may, for example, be compressed in a suitable form.
In summary, according to the image editing method, apparatus, electronic device, and storage medium provided by the embodiments of the application, the image to be edited and the target text are acquired; the image to be edited is reconstructed based on the target text, and the target text is fine-tuned based on the image reconstruction loss in the reconstruction process to obtain the fine-tuned target text; the attention feature map corresponding to the image to be edited is obtained based on the fine-tuned target text; and the image to be edited is edited based on the attention feature map to obtain the target image. By combining the fine-tuning of the target text with the attention feature map editing, controllable image editing according to the target text is realized with a simple operation flow, meeting user requirements in practical applications.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (12)

1. An image editing method, the method comprising:
acquiring an image to be edited and a target text;
performing image reconstruction on the image to be edited based on the target text, and fine-tuning the target text based on an image reconstruction loss in the image reconstruction process, to obtain a fine-tuned target text; and
obtaining an attention feature map corresponding to the image to be edited based on the fine-tuned target text, and editing the image to be edited based on the attention feature map, to obtain a target image.
2. The method according to claim 1, wherein the performing image reconstruction on the image to be edited based on the target text, and fine-tuning the target text based on the image reconstruction loss in the image reconstruction process, to obtain the fine-tuned target text, comprises:
determining an embedded vector of the target text; and
performing image reconstruction on the image to be edited based on the embedded vector of the target text, and fine-tuning the embedded vector of the target text based on the image reconstruction loss in the image reconstruction process, to obtain the fine-tuned target text.
3. The method according to claim 2, wherein the performing image reconstruction on the image to be edited based on the embedded vector of the target text, and fine-tuning the embedded vector of the target text based on the image reconstruction loss in the image reconstruction process, to obtain the fine-tuned target text, comprises:
performing, through a diffusion model, image reconstruction on the image to be edited based on the embedded vector of the target text, and fine-tuning the embedded vector of the target text based on the image reconstruction loss in the image reconstruction process, to obtain the fine-tuned target text.
4. The method according to claim 3, wherein the performing, through the diffusion model, image reconstruction on the image to be edited based on the embedded vector of the target text, and fine-tuning the embedded vector of the target text based on the image reconstruction loss in the image reconstruction process, to obtain the fine-tuned target text, comprises:
determining, based on the image to be edited, first noise data in a forward process of the diffusion model, and determining second noise data in a reverse process of the diffusion model in which the embedded vector of the target text serves as a learnable parameter;
calculating a mean square error between the first noise data and the second noise data; and
fine-tuning the embedded vector of the target text based on a preset loss function and the mean square error, to obtain the fine-tuned target text.
5. The method according to claim 4, wherein the determining, based on the image to be edited, first noise data in the forward process of the diffusion model, and determining second noise data in the reverse process of the diffusion model in which the embedded vector of the target text serves as a learnable parameter, comprises:
performing noise sampling and superposition on the image to be edited in the forward process of the diffusion model, to obtain the first noise data; and
guiding the reverse process of the diffusion model with the embedded vector of the target text as a learnable parameter, to obtain the second noise data output by the diffusion model.
6. The method according to claim 1, wherein the obtaining an attention feature map corresponding to the image to be edited based on the fine-tuned target text, and editing the image to be edited based on the attention feature map, to obtain a target image, comprises:
obtaining a self-attention feature map corresponding to the image to be edited based on the fine-tuned target text, and obtaining a cross-attention feature map between the fine-tuned target text and the image to be edited based on the fine-tuned target text; and
editing the image to be edited based on the self-attention feature map and the cross-attention feature map, to obtain the target image.
7. The method according to claim 1, wherein the obtaining an attention feature map corresponding to the image to be edited based on the fine-tuned target text, and editing the image to be edited based on the attention feature map, to obtain a target image, comprises:
determining an embedded vector of the fine-tuned target text; and
obtaining the attention feature map corresponding to the image to be edited based on the embedded vector of the fine-tuned target text, and editing the image to be edited based on the attention feature map, to obtain the target image.
8. The method according to claim 7, wherein the obtaining the attention feature map corresponding to the image to be edited based on the embedded vector of the fine-tuned target text, and editing the image to be edited based on the attention feature map, to obtain the target image, comprises:
obtaining, through a diffusion model, the attention feature map corresponding to the image to be edited based on the embedded vector of the fine-tuned target text, and editing the image to be edited based on the attention feature map, to obtain the target image.
9. The method according to claim 8, wherein the obtaining, through the diffusion model, the attention feature map corresponding to the image to be edited based on the embedded vector of the fine-tuned target text, and editing the image to be edited based on the attention feature map, to obtain the target image, comprises:
inputting the embedded vector of the target text and the embedded vector of the fine-tuned target text into the diffusion model;
guiding the reverse process of the diffusion model with the embedded vector of the fine-tuned target text, to obtain the attention feature map in the branch of the embedded vector of the fine-tuned target text; and
replacing the attention feature map into the forward computing process of the diffusion model guided by the embedded vector of the target text, to obtain the target image.
10. An image editing apparatus, the apparatus comprising:
an information acquisition module, configured to acquire an image to be edited and a target text;
a text fine tuning module, configured to perform image reconstruction on the image to be edited based on the target text, and fine-tune the target text based on an image reconstruction loss in the image reconstruction process, to obtain a fine-tuned target text; and
an image editing module, configured to obtain an attention feature map corresponding to the image to be edited based on the fine-tuned target text, and edit the image to be edited based on the attention feature map, to obtain a target image.
11. An electronic device, comprising a memory and a processor, wherein the memory is coupled to the processor and stores instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1-9.
12. A computer readable storage medium, wherein the computer readable storage medium stores program code which can be invoked by a processor to perform the method of any one of claims 1-9.
CN202311018253.7A 2023-08-11 2023-08-11 Image editing method, device, electronic equipment and storage medium Pending CN116958326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311018253.7A CN116958326A (en) 2023-08-11 2023-08-11 Image editing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311018253.7A CN116958326A (en) 2023-08-11 2023-08-11 Image editing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116958326A true CN116958326A (en) 2023-10-27

Family

ID=88444536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311018253.7A Pending CN116958326A (en) 2023-08-11 2023-08-11 Image editing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116958326A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392260A (en) * 2023-12-13 2024-01-12 深圳须弥云图空间科技有限公司 Image generation method and device
CN117392260B (en) * 2023-12-13 2024-04-16 深圳须弥云图空间科技有限公司 Image generation method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination