CN117576258A - Image processing method, device, electronic equipment and storage medium - Google Patents

Image processing method, device, electronic equipment and storage medium

Info

Publication number
CN117576258A
CN117576258A (application CN202311526888.8A)
Authority
CN
China
Prior art keywords
image
text information
target
reconstructed
mapping relation
Prior art date
Legal status
Pending
Application number
CN202311526888.8A
Other languages
Chinese (zh)
Inventor
王凡祎
苏婧文
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202311526888.8A priority Critical patent/CN117576258A/en
Publication of CN117576258A publication Critical patent/CN117576258A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an image processing method, an image processing apparatus, an electronic device, and a storage medium. The method includes: acquiring a first image and a second image; performing first image reconstruction processing on the first image using first text information as a prompt word to obtain a first reconstructed image, the first text information being identification information corresponding to the first image; determining, from the first text information and the first reconstructed image, a target mapping relation used during the first image reconstruction processing; and performing second image reconstruction processing on the second image according to the target mapping relation and second text information to obtain a second reconstructed image whose subject object is the first subject object, the second text information being identification information corresponding to the second image, with the first and second image reconstruction processing using the same image reconstruction algorithm. On the premise of faithfully reconstructing the background of an image, the method and apparatus can seamlessly embed the specified subject in that background, generating an image with both a specified subject and a specified background.

Description

Image processing method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image processing method, an image processing device, an electronic device, and a storage medium.
Background
With the rapid development of image editing technology, matting and image synthesis are particularly favored by users. Matting refers to the technique of separating the foreground from the background of an image; seamlessly compositing the extracted region onto a target image is called image synthesis. By editing an image through matting and image synthesis, the target image desired by the user can be obtained.
However, matting and image-synthesis methods in the related art generally require combining another image as a reference image. Moreover, these methods can only generate a specified subject; they cannot simultaneously generate a specified subject and a specified background, and therefore cannot meet users' image editing needs.
Disclosure of Invention
An object of embodiments of the present invention is to provide an image processing method, an image processing apparatus, an electronic device, and a storage medium, so as to solve the technical problem that methods in the related art cannot simultaneously generate a specified subject and a specified background.
In a first aspect, an embodiment of the present invention provides an image processing method, including:
acquiring a first image and a second image;
performing first image reconstruction processing on the first image by using first text information as a prompt word to obtain a first reconstructed image, wherein the first text information is identification information corresponding to the first image, and the first reconstructed image is an image reconstructed by the first image;
Determining a target mapping relation in the first image reconstruction processing process according to the first text information and the first reconstructed image, wherein the target mapping relation is a mapping relation between the first text information and a first main object, and the first main object is a main object in the first reconstructed image;
and carrying out second image reconstruction processing on the second image according to the target mapping relation and the second text information to obtain a second reconstructed image of which the main object is the first main object, wherein the second text information is identification information corresponding to the second image, and the first image reconstruction processing and the second image reconstruction processing adopt the same image reconstruction algorithm.
In a second aspect, an embodiment of the present invention provides an image processing apparatus including:
the acquisition module is used for acquiring the first image and the second image;
the first processing module is used for carrying out first image reconstruction processing on the first image by taking first text information as a prompt word to obtain a first reconstructed image, wherein the first text information is identification information corresponding to the first image, and the first reconstructed image is an image reconstructed by the first image;
The determining module is used for determining a target mapping relation in the first image reconstruction processing process according to the first text information and the first reconstructed image, wherein the target mapping relation is a mapping relation between the first text information and a first main object, and the first main object is a main object in the first reconstructed image;
the second processing module is used for carrying out second image reconstruction processing on the second image according to the target mapping relation and second text information to obtain a second reconstructed image of which the main object is the first main object, the second text information is identification information corresponding to the second image, and the first image reconstruction processing and the second image reconstruction processing adopt the same image reconstruction algorithm.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the image processing methods described above when the computer program is executed by the processor.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the image processing methods described above.
The embodiment of the invention provides an image processing method, an image processing device, electronic equipment and a storage medium.
Drawings
Fig. 1 is a schematic flow chart of an image processing method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of another image processing method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a first image reconstruction process according to an embodiment of the present invention;
fig. 4 is a schematic view of an application scenario of an image processing method according to an embodiment of the present invention;
fig. 5 is a schematic structural view of an image processing apparatus according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of another structure of an image processing apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
fig. 8 is a schematic diagram of another structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and its variations as used herein are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
In the related art, image editing is generally achieved through matting and image synthesis; however, such methods generally require combining another image as a reference image. Moreover, they can only generate a specified subject and cannot simultaneously generate a specified subject and a specified background, so they cannot meet users' image editing needs.
Therefore, how to simultaneously generate a specified subject and a specified background for an image is a technical problem to be solved.
To solve the above technical problems in the related art, an embodiment of the present invention provides an image processing method. Referring to fig. 1, fig. 1 is a schematic flow chart of an image processing method according to an embodiment of the invention; the method includes steps 101 to 104.
step 101, a first image and a second image are acquired.
In this embodiment, the first image is an image containing the subject object to be used, that is, an image containing a subject specified by the user; the composite image obtained after subsequent synthesis contains the subject in the first image. The second image is an image containing the subject object to be replaced and the background to be reconstructed, that is, an image containing a background specified by the user; the composite image obtained after subsequent synthesis contains the background in the second image.
It should be noted that the subject object in this embodiment may be the foreground of an image, and the background may be the background region of the image. The subject may be a whole subject such as a person, a building, or an animal, or a local part of such a subject, for example a person's face; no specific limitation is imposed, as long as it is the subject specified by the user.
And 102, carrying out first image reconstruction processing on the first image by adopting the first text information as a prompt word to obtain a first reconstructed image.
The first text information in this embodiment is identification information corresponding to the first image, and the first reconstructed image is an image reconstructed from the first image. Specifically, the first text information may be specific and unique identification information that characterizes the first image alone. The first reconstructed image may be an image obtained by applying noise-adding processing and noise-removing processing to the first image, and has the same content as the first image.
The first image reconstruction processing in this embodiment may be performed with an image reconstruction algorithm or with a CNN (convolutional neural network)-based model; any approach that faithfully reconstructs the image may be used, and no specific limitation is imposed here.
And step 103, determining a target mapping relation in the reconstruction processing process of the first image according to the first text information and the first reconstructed image.
After the first image reconstruction processing of the first image is completed under the guidance of the first text information, a first reconstructed image specified by the first text information is obtained, and this first reconstructed image has the same content as the first image. At this point the target mapping relation of this embodiment can be determined. Specifically, the target mapping relation is the mapping relation between the first text information and the first subject object, where the first subject object is the subject object in the first reconstructed image.
In this embodiment, the target mapping relation amounts to a binding between the first text information and the first subject object: if the first text information is later input as the prompt word and the same first image reconstruction processing is applied, the reconstructed image will contain the first subject object corresponding to the first text information, achieving the goal of generating a specified subject object.
And 104, performing second image reconstruction processing on the second image according to the target mapping relation and the second text information to obtain a second reconstructed image with the main object being the first main object.
The second text information provided in this embodiment is identification information corresponding to the second image, and the first image reconstruction process and the second image reconstruction process adopt the same image reconstruction algorithm. In particular, the second text information may be specific and unique identification information for characterizing the second image alone.
Because the first and second image reconstruction processing use the same image reconstruction algorithm, their mapping relations share the same definition (that is, the same type of mapping object) while specifying different objects. The mapping relation can therefore be modified during the second image reconstruction processing so that the final reconstructed image contains the specified subject object while the specified background of the second image remains unchanged. In this way the subject specified by the user can be seamlessly embedded in the background while the background of the image is faithfully reconstructed, realizing generation of a specified subject and a specified background.
Since the target mapping relation between the first text information and the first subject object has already been determined, using the first text information as the prompt word with the same processing as the first image reconstruction would yield the first subject object in the reconstructed image. Therefore, while the second text information is used as the prompt word to guide the second image reconstruction processing, the mapping relation used in that processing can be adjusted to the target mapping relation, yielding a second reconstructed image that contains both the specified subject object (the first subject object from the first image) and the specified background (the background of the second image).
Thus, with the method provided by the embodiment of the invention, the subject of the image can be seamlessly embedded in the background while the background is faithfully reconstructed, realizing generation of a specified subject and a specified background and improving the user experience.
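To make the flow of steps 101 to 104 concrete, the following toy Python sketch stands in for the method with plain dictionaries in place of real images and a real reconstruction model; the function names and data are illustrative assumptions, not from the patent:

```python
def first_reconstruction(first_image, first_text):
    """Toy stand-in for steps 102-103: 'reconstruct' the first image and
    record the target mapping relation binding the prompt word to the
    subject object of the first image."""
    first_reconstructed = dict(first_image)          # content is unchanged
    target_mapping = {first_text: first_reconstructed["subject"]}
    return first_reconstructed, target_mapping

def second_reconstruction(second_image, second_text, target_mapping, first_text):
    """Toy stand-in for step 104: reconstruct the second image, but take
    the subject object from the target mapping relation instead."""
    return {"subject": target_mapping[first_text],
            "background": second_image["background"]}

first = {"subject": "cat", "background": "studio"}   # subject to keep
second = {"subject": "dog", "background": "beach"}   # background to keep
_, mapping = first_reconstruction(first, "a photo of A")
result = second_reconstruction(second, "a photo of B", mapping, "a photo of A")
# result pairs the first image's subject with the second image's background
```

The point of the sketch is purely structural: the subject is carried by the mapping relation learned from the first image, while the background comes untouched from the second image.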
In some embodiments, a Stable Diffusion (SD) model may be used to perform the image reconstruction processing. Specifically, referring to fig. 2, fig. 2 is another flow chart of an image processing method provided in an embodiment of the present invention; as shown in fig. 2, the method includes steps 201 to 206.
In step 201, a first image and a second image are acquired.
In this embodiment, in order to increase the speed of learning the mapping relationship between the first image and the first text information by the subsequent text-to-image model, before step 201, this embodiment may further include: and carrying out matting processing on the initial image containing the main object to be used so as to extract the first main object in the initial image and obtain a first image only containing the first main object.
By setting the first image to contain only the first subject object, the subsequent text-to-image model learns only the mapping relation between the first text information and the first subject object. This speeds up training of the text-to-image model on the first image, and ensures that only the subject object is modified in subsequent image reconstruction, which speeds up the overall reconstruction process.
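As a hedged illustration of this matting preprocessing (not the patent's actual implementation), a subject can be extracted with a binary foreground mask; the array shapes and values below are hypothetical:

```python
import numpy as np

def extract_subject(image, mask):
    """Keep only the masked subject; zero out background pixels.
    image: (H, W, 3) uint8 array; mask: (H, W) with 1 on the subject."""
    return image * mask[..., None].astype(image.dtype)

# Toy 2x2 image whose left column is the subject
image = np.full((2, 2, 3), 200, dtype=np.uint8)
mask = np.array([[1, 0], [1, 0]], dtype=np.uint8)
subject_only = extract_subject(image, mask)
```

In practice the mask would come from a matting model rather than being hand-written, but the masking step itself is this simple broadcast multiply.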
Step 202, adopting first text information as a prompt word, and calling a target text-generated graph model to carry out image reconstruction processing on a first image until the target text-generated graph model generates a first reconstructed image.
The first text information is identification information corresponding to a first image, and the first reconstructed image is an image reconstructed from the first image.
The target text-to-image model used in this embodiment is a pre-trained text-to-image model that can edit a given image according to a prompt word (prompt) to generate the image specified by the prompt. Specifically, the first text information is used as the prompt word to guide the image reconstruction of the target text-to-image model until it generates a first reconstructed image with the same content as the first image. The model thereby learns the target mapping relation between the first text information and the first image: when the first text information is input into the model again, it generates the first image corresponding to the first text information, achieving the goal of generating a specified subject object.
Step 203, determining a first mapping relation adopted in the process of performing image reconstruction processing on the first image by the target text-to-image model according to the first text information and the first reconstructed image.
The first mapping relation is the mapping relation between the first text information and the first subject object, where the first subject object is the subject object in the first reconstructed image. It is determined, from the first text information and the first reconstructed image, as the mapping relation used by the target text-to-image model while reconstructing the first image.
And 204, determining a second mapping relation adopted in the process of reconstructing the second image by the target text-to-image model according to the second text information and the second image.
In this embodiment, the image reconstruction processing is performed on the second image in the same manner as the image reconstruction processing is performed on the first image, so as to determine the second mapping relationship used in the image reconstruction processing of the second image. That is, the second text information is used as a prompt word, and the target text-to-image model is called to carry out image reconstruction processing on the second image, so that a reconstructed image corresponding to the second image is obtained; and then determining a second mapping relation adopted in the process of carrying out image reconstruction processing on the second image by the target text-to-image model according to the second text information and the reconstructed image corresponding to the second image.
Step 205, adjusting the second mapping relation according to the first mapping relation, so as to adjust the second text information in the second mapping relation into the first text information, and obtain a third mapping relation.
In this embodiment, the second text information in the second mapping relation may be directly modified into the first text information to obtain the third mapping relation. Specifically, either the second text information itself or its vector in the second mapping relation may be modified, as long as the modified entry occupies the position corresponding to the second text information and the data types before and after modification are the same; no specific limitation is imposed here.
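The adjustment in step 205 can be sketched with a toy dictionary standing in for the mapping relation; the keys and embedding vectors here are hypothetical placeholders:

```python
import numpy as np

def build_third_mapping(second_mapping, second_text, first_text, first_vector):
    """Replace the entry keyed by the second text information with the
    first text information and its embedding vector, leaving every other
    entry (e.g. the background) untouched."""
    third = {k: v for k, v in second_mapping.items() if k != second_text}
    third[first_text] = first_vector
    return third

second_mapping = {"a photo of B": np.zeros(4), "background": np.ones(4)}
third_mapping = build_third_mapping(second_mapping, "a photo of B",
                                    "a photo of A", np.full(4, 0.5))
```

Only the subject entry is rewritten; the background entry survives unchanged, which is what lets the second reconstruction keep the second image's background.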
And 206, adopting the second text information as a prompt word, and calling a target text-to-image graph model to perform image reconstruction processing on the second image according to the third mapping relation to obtain a second reconstructed image with the main object being the first main object.
The second text information is identification information corresponding to the second image, and the first and second image reconstruction processing use the same image reconstruction algorithm. It should be noted that the second text information and the first text information must share the same format: for example, if the first text information is "a photo of A", the second text information should be "a photo of B", where "a photo of __" is the specified format and A and B are the different subject objects indicated in the two images.
Since the target text-to-image model has already learned the target mapping relation between the first text information and the first image, the first subject object can be obtained by modifying the subject-object part of the mapping relation used by the model into the first text information; the model then outputs the first subject object corresponding to the first text information, achieving the goal of generating a specified subject object.
Specifically, in this embodiment, in the process of performing image reconstruction processing on the second image by using the second text information as the prompt word, the second mapping relationship used by the target text-to-image model is modified correspondingly, so that the second text information in the second mapping relationship used in the process of performing image reconstruction processing on the second image by the target text-to-image model is modified into the first text information in the first mapping relationship.
As an alternative embodiment, the first image reconstruction process provided in the present embodiment may include a noise adding process and a noise removing process, and in particular, referring to fig. 3, fig. 3 is a schematic flow diagram of a first image reconstruction process provided in the present embodiment, and as shown in the drawing, the first image reconstruction process provided in the present embodiment may include steps 301 to 304;
and step 301, inputting the first text information as a prompt word into a target guidance network for prediction processing to obtain a guidance embedded vector.
The target guidance network in this embodiment may be a CLIP network. It is mainly used to extract the embedding vector of the prompt word, i.e., the guidance embedding vector, which is fed into the UNet of the target diffusion model (Stable Diffusion, SD) as a guidance condition via a cross-attention mechanism, instructing the target diffusion model to generate the specified image.
It should be noted that the target diffusion model in this embodiment and the text-to-image model in the above embodiment may be the same model, referred to below as the SD model.
Step 302, inputting the first image into a target diffusion model for noise adding processing, and obtaining a noise added image of the first image.
The noise-adding process in this embodiment is the forward diffusion process of the target diffusion model: Gaussian noise is added to the input image, yielding a noised image.
Step 303, determining a cross-attention value from the guidance embedding vector and the embedding vector of the noised image.
In this embodiment, the parameters Q (query), K (key), and V (value) required to compute the cross-attention value are as follows: the query is an intermediate feature of the UNet in the target diffusion model, and the key and value are taken from the guidance embedding vector. The cross-attention value is then computed from Q, K, and V.
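A minimal NumPy sketch of this cross-attention computation follows. The learned Q/K/V projection matrices of a real SD UNet are omitted for brevity (an assumption), so the raw features serve directly as Q, K, and V:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(unet_features, guidance_emb):
    """Q comes from UNet intermediate features; K and V come from the
    guidance (prompt) embedding, as described in step 303."""
    Q, K, V = unet_features, guidance_emb, guidance_emb
    d_k = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (n_image_tokens, n_text_tokens)
    return attn @ V

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))   # 4 spatial tokens from the UNet
emb = rng.normal(size=(3, 8))     # 3 prompt tokens from the guidance network
out = cross_attention(feats, emb)
```

Each row of the attention map says how strongly one image location attends to each prompt token; the subject-swap described later works by editing exactly this map's association between prompt tokens and image regions.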
And step 304, calling a target diffusion model, and denoising the denoised image according to the cross attention value to obtain a first reconstructed image.
In this embodiment, the target diffusion model is called to denoise the noised image according to the cross-attention value, using the same number of denoising steps as noise-adding steps: however much noise was added to the input image, the same amount is removed, achieving faithful reconstruction of the input image.
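The "same amount of noise added, same amount removed" idea can be illustrated with the closed-form DDPM noising step. The idealised denoiser below assumes the noise is known exactly, which is a deliberate simplification of a real diffusion model (which must predict it):

```python
import numpy as np

def add_noise(x0, alpha_bar, eps):
    """Closed-form DDPM forward step: x_t = sqrt(a)*x0 + sqrt(1-a)*eps."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def remove_noise(xt, alpha_bar, eps):
    """Idealised inverse: with the noise eps known exactly, the input is
    recovered, mirroring the perfect reconstruction described above."""
    return (xt - np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha_bar)

rng = np.random.default_rng(42)
x0 = rng.normal(size=8)           # stand-in for the input image
eps = rng.normal(size=8)          # Gaussian noise
xt = add_noise(x0, 0.3, eps)      # noised image
x0_rec = remove_noise(xt, 0.3, eps)
```

In the actual method the UNet's noise prediction, steered by the cross-attention value, plays the role of `eps` during denoising.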
In this embodiment, the first mapping relationship provided in this embodiment may be a first cross-attention mapping relationship. Specifically, the step of determining, according to the first text information and the first reconstructed image, the target mapping relationship in the first image reconstruction processing procedure provided in this embodiment may be: and determining a first cross attention mapping relation adopted in the image reconstruction processing process of the target diffusion model on the first image and the first text information according to the first text information and the first reconstructed image.
The first cross-attention mapping relationship provided in this embodiment characterizes a correspondence between the first text information and the first subject object in the first reconstructed image.
The second mapping relationship provided in this embodiment may be a second cross-attention mapping relationship. Specifically, the step of performing the second image reconstruction processing on the second image according to the target mapping relationship and the second text information to obtain the second reconstructed image with the subject object being the first subject object may be: invoking a target diffusion model and a target guidance network, performing image reconstruction processing on the second image and the second text information to obtain an initial reconstructed image, and extracting a second cross attention mapping relation adopted in the image reconstruction processing process; according to the first cross attention mapping relation, the second cross attention mapping relation is adjusted so as to adjust the corresponding relation between the second text information in the second cross attention mapping relation and the second main body object into the corresponding relation between the first text information and the first main body object, and a target cross attention mapping relation is obtained; and calling a target diffusion model to adopt a target cross attention mapping relation, and carrying out image reconstruction processing on the second image and the second text information to obtain a second reconstructed image in which the second main object is replaced by the first main object.
Wherein the second subject object is a subject object in the initial reconstructed image.
In this way, by adjusting the cross-attention mapping relation between the second text information and the second subject object in the initial reconstructed image, the second subject object can be replaced with the first subject object, achieving the goal of replacing the subject object in the reconstructed image. In practice, the target diffusion model already contains prior knowledge of the first subject object, so the subject can be replaced by modifying the prompt-word part of the mapping relation used when reconstructing the second image with the target diffusion model. Specifically, the second text information "a photo of B" may be modified to "a photo of A", so that a second reconstructed image in which the second subject object is replaced with the first subject object can be generated.
As an optional embodiment, this embodiment may train the SD model using either of two methods, Textual Inversion or DreamBooth, so that the SD model learns the mapping relationship between the prompt word and the subject object, thereby making it possible to generate the specified subject object in subsequent image reconstruction processing. Textual Inversion mainly adjusts the embedding vector of the prompt word so that the prompt word establishes a mapping relationship with the subject object in the image output by the SD model; this method does not require adjusting the SD model itself. DreamBooth mainly adjusts the class recognition of the SD model so that the SD model recognizes the input prompt word as an independent class, thereby establishing a mapping relationship between the prompt word and the subject object in the image output by the SD model; this method does require adjusting the SD model.
By training the SD model in either of the above two ways, the SD model can learn the mapping relationship between the prompt word and the subject object, so that the specified subject object can be generated in the subsequent image reconstruction processing. It should be noted that this embodiment is not limited to the two methods mentioned above for making the SD model learn the mapping relationship between the prompt word and the subject object; any other method capable of learning this mapping relationship may also be used, and no specific limitation is made here.
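The defining property of Textual Inversion — only the prompt token's embedding is optimized while the model stays frozen — can be shown with a heavily simplified scalar toy. All names and values here are hypothetical; a scalar multiplication stands in for the frozen SD denoiser, and gradient descent on a squared error stands in for the diffusion training loss.

```python
# Toy sketch of the Textual Inversion idea (hypothetical, not the SD pipeline):
# optimize ONLY the embedding of the new prompt token; the model is frozen.

def train_token_embedding(frozen_w, target, steps=200, lr=0.1):
    emb = 0.0                        # embedding of the new prompt token, learnable
    for _ in range(steps):
        pred = frozen_w * emb        # frozen "model" maps embedding -> prediction
        grad = 2 * frozen_w * (pred - target)   # d/d_emb of (pred - target)^2
        emb -= lr * grad             # update the embedding only
    return emb

emb = train_token_embedding(frozen_w=2.0, target=6.0)
# emb converges near 3.0 so that frozen_w * emb matches the target;
# frozen_w is never modified, mirroring the frozen SD model.
```

DreamBooth differs precisely on this point: it updates the model weights (the analogue of `frozen_w`) instead of just the embedding.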
As another optional embodiment, this embodiment may further use a Low-Rank Adaptation (LoRA) method to implement controllable generation of an image. The low-rank adaptation method mainly adds a bypass network to the original large model network, trains the bypass network while keeping the parameters of the original large model network frozen, and finally fuses the outputs of the two networks, for example at the cross-attention module, thereby realizing controllable image generation in which the generated image contains both the specified subject object and the specified background.
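The frozen-path-plus-bypass structure can be sketched numerically. This is a minimal illustration under stated assumptions (rank-1 bypass, plain-list matrices, hypothetical names), not the patent's network: the frozen weight W keeps producing its output, the trainable low-rank factors B and A form the bypass, and the two results are summed at the output.

```python
# Hypothetical LoRA-style forward pass: y = W x + (alpha / rank) * B A x

def matmul(m, v):
    return [sum(r[i] * v[i] for i in range(len(v))) for r in m]

def lora_forward(W, A, B, x, alpha=1.0, rank=1):
    base = matmul(W, x)                  # frozen original network path
    bypass = matmul(B, matmul(A, x))     # low-rank bypass: (d x r) @ (r x d)
    scale = alpha / rank
    return [b + scale * p for b, p in zip(base, bypass)]  # fuse both outputs

W = [[1.0, 0.0], [0.0, 1.0]]             # frozen 2x2 weight (identity here)
A = [[1.0, 1.0]]                         # rank-1 down-projection (1 x 2)
B = [[0.5], [0.0]]                       # rank-1 up-projection (2 x 1)
y = lora_forward(W, A, B, [2.0, 3.0])
# base = [2, 3]; bypass = B @ (A @ x) = B @ [5] = [2.5, 0]; y = [4.5, 3.0]
```

Only A and B would receive gradients during training, which is why the adaptation is cheap relative to fine-tuning W itself.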
In this embodiment, the prompt word corresponding to the subject object may be modified at the cross-attention module during the image reconstruction processing of the second image, so as to achieve the purpose of replacing the subject object in the reconstructed image. The specific modification process is the same as the above-mentioned method of modifying the mapping relationship, and is not repeated here.
As an alternative embodiment, this embodiment may also employ a diffusion guidance approach, such as Classifier-free Guidance Diffusion, to achieve controllable generation of images. Specifically, referring to fig. 4, fig. 4 is a schematic view of an application scenario of an image processing method according to an embodiment of the present invention. As shown in fig. 4, a first image containing a first subject object and a second image whose subject object needs to be replaced are first acquired according to the user's requirements. Then, prompt words of the same format are set for the first image and the second image; as shown in fig. 4, the prompt word of the first image may be "a photo of A", and the prompt word of the second image may be "a photo of B".
The SD model may then be trained on the first image and the prompt word of the first image according to the Textual Inversion or DreamBooth method, so that the SD model contains prior knowledge of the first subject object in the first image.
Then, the guidance coefficient w in Classifier-free Guidance Diffusion is set to 1, i.e., only the input prompt word is used as the condition; the diffusion step length is set to a preset number of steps, such as 50 steps; and the second image and its prompt word are subjected to noise adding processing using the DDIM (Denoising Diffusion Implicit Models) Inversion method, so as to obtain the intermediate latent variable of each noise adding step during the noise adding processing of the second image. DDIM Inversion is mainly guided by conditions, thereby realizing controllable noise adding, so that the noise in the noised image contains useful specific information.
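The two ingredients of this step can be sketched with scalar toys (hypothetical values, one latent dimension instead of an image tensor): classifier-free guidance with w = 1 reduces to the purely conditional prediction, and a deterministic DDIM step maps the latent x_t to x_{t+1}, producing the intermediate latent variables mentioned above.

```python
import math

# Scalar sketch (hypothetical): classifier-free guidance and one DDIM step.

def cfg(eps_uncond, eps_cond, w):
    # classifier-free guidance: eps = eps_u + w * (eps_c - eps_u)
    return eps_uncond + w * (eps_cond - eps_uncond)

def ddim_invert_step(x_t, eps, alpha_t, alpha_next):
    # deterministic DDIM update: project to x0, then re-noise at the next level
    x0 = (x_t - math.sqrt(1 - alpha_t) * eps) / math.sqrt(alpha_t)
    return math.sqrt(alpha_next) * x0 + math.sqrt(1 - alpha_next) * eps

# With w = 1 only the prompt condition acts, as in the text above.
assert abs(cfg(0.2, 0.8, w=1.0) - 0.8) < 1e-12

# One inversion step yields one of the intermediate latent variables that are
# later used as the Null-text optimization benchmark.
latent = ddim_invert_step(x_t=1.0, eps=0.5, alpha_t=0.9, alpha_next=0.8)
```

Because the update is deterministic (no injected noise), iterating it over the preset 50 steps records a reproducible latent trajectory.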
The intermediate latent variables obtained during the DDIM Inversion noise adding process are used as the optimization benchmark for Null-text Inversion. The guidance coefficient w in Classifier-free Guidance Diffusion is set to a preset value greater than 1, for example 7.5 (the specific value may be set according to actual requirements, for example whichever value yields a good image reconstruction effect, and is not limited here), and the null-text unconditional embedding vector of the unconditional guidance part is optimized to obtain an optimized unconditional embedding vector. The noised image is then denoised using the optimized unconditional embedding vector, so that a reconstructed second image, i.e., the initial reconstructed image shown in fig. 4, can be obtained. Null-text Inversion mainly takes the intermediate latent variables obtained in the DDIM Inversion noise adding process as the optimization target to optimize the unconditional embedding vector, so that the noised image is denoised step by step with the optimized unconditional embedding vector, following the steps of the noise adding process (for example, 50 steps), and the original image can be reconstructed almost perfectly.
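The core of the Null-text optimization can be shown with a scalar toy (hypothetical values, a scalar standing in for the unconditional embedding vector): with w > 1 the guided prediction drifts away from the recorded DDIM trajectory, so the unconditional embedding is optimized until the guided prediction matches the benchmark again.

```python
# Toy sketch (hypothetical) of Null-text Inversion: optimize the unconditional
# ("null-text") embedding phi so the guided prediction matches the target
# recorded during DDIM Inversion. The model weights are untouched.

def optimize_null_embedding(eps_cond, target_eps, w=7.5, steps=100, lr=0.01):
    phi = 0.0                                    # unconditional embedding (scalar)
    for _ in range(steps):
        eps_uncond = phi                         # toy model: embedding -> prediction
        guided = eps_uncond + w * (eps_cond - eps_uncond)
        grad = 2 * (guided - target_eps) * (1 - w)   # d guided / d phi = 1 - w
        phi -= lr * grad
    return phi

phi = optimize_null_embedding(eps_cond=0.8, target_eps=0.5, w=7.5)
# At the optimum: phi + 7.5 * (0.8 - phi) == 0.5, so phi converges to 5.5/6.5.
```

In the real method one such embedding is optimized per denoising step, which is why the 50-step denoising can retrace the 50-step noising trajectory almost perfectly.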
In order to realize the generation of the specified subject object and background, that is, to replace the second subject object with the first subject object in the initial reconstructed image shown in fig. 4 to obtain the second reconstructed image, the prompt word needs to be adjusted while the SD model performs image reconstruction on the second image; that is, the mapping relationship between the prompt word and the subject object, for example the mapping relationship of the cross-attention mechanism, is modified or replaced, so as to replace the second subject object in the initial reconstructed image with the first subject object. Specifically, "a photo of B" may be modified to "a photo of A" to effect the replacement of the subject object in the image. Because the SD model contains prior knowledge of the first subject object, the replacement of the subject object in the generated reconstructed image can be realized by modifying the "B" in the prompt word to "A".
The adjustment processing of the prompt word in this embodiment mainly sets three hyperparameters, namely cross_replace_steps, self_replace_steps, and eq_parameters. Here, cross_replace_steps and self_replace_steps represent the numbers of steps over which the Cross-Attention Map and the Self-Attention Map are replaced during the denoising process, and eq_parameters represents the strength of the replacement. The generated image is greatly affected by self_replace_steps, and 0.4 may be used as its default value in this embodiment. In practical applications, however, the values of the three hyperparameters may be set according to requirements, which is not limited here.
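How these three hyperparameters gate the replacement can be sketched as follows. The function names and the interpretation of the hyperparameters as fractions of the total step count are assumptions for illustration (mirroring common prompt-editing implementations), not a definitive reading of the patent.

```python
# Hypothetical sketch: cross-/self-attention maps are replaced only during an
# initial fraction of the diffusion steps, and eq_parameters rescales the
# attention values of selected tokens to control replacement strength.

def use_replaced_map(step, total_steps, cross_replace_steps=0.8,
                     self_replace_steps=0.4):
    """Return (replace_cross_map, replace_self_map) for this step."""
    return (step < cross_replace_steps * total_steps,
            step < self_replace_steps * total_steps)

def amplify(attn_map, eq_parameters):
    # eq_parameters: token -> amplification strength for its attention values
    return {tok: [v * eq_parameters.get(tok, 1.0) for v in vals]
            for tok, vals in attn_map.items()}

total = 50
flags = [use_replaced_map(s, total) for s in range(total)]
# cross maps are replaced for the first 40 steps, self maps for the first 20
# (0.4 * 50), matching the default self_replace_steps value mentioned above.
```

Larger replace fractions keep more of the original layout; eq_parameters then tunes how strongly the swapped-in token asserts itself.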
In summary, an embodiment of the present invention provides an image processing method. The method includes: acquiring a first image and a second image; performing first image reconstruction processing on the first image using first text information as a prompt word to obtain a first reconstructed image, where the first text information is identification information corresponding to the first image; determining a target mapping relationship in the first image reconstruction processing according to the first text information and the first reconstructed image; and performing second image reconstruction processing on the second image according to the target mapping relationship and second text information to obtain a second reconstructed image whose subject object is the first subject object, where the second text information is identification information corresponding to the second image, and the first image reconstruction processing and the second image reconstruction processing adopt the same image reconstruction algorithm. The method can ensure that the subject of the image is seamlessly embedded into the background while the background of the image is reconstructed faithfully, thereby realizing the generation of the specified subject and background of the image.
The method according to the above embodiment will be further described from the point of view of an image processing apparatus, which may be implemented as a separate entity or may be implemented as an integrated electronic device, such as a terminal, which may include a mobile phone, a tablet computer, etc.
In order to solve the same technical problems, the present embodiment also provides an image processing apparatus. Specifically, referring to fig. 5, fig. 5 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention, and as shown in fig. 5, an image processing apparatus 500 according to an embodiment of the present invention includes: an acquisition module 501, a first processing module 502, a determination module 503, and a second processing module 504;
wherein, the acquiring module 501 is configured to acquire a first image and a second image.
The first processing module 502 is configured to perform a first image reconstruction process on a first image by using first text information as a prompt word, so as to obtain a first reconstructed image, where the first text information is identification information corresponding to the first image, and the first reconstructed image is an image reconstructed from the first image.
The determining module 503 is configured to determine, according to the first text information and the first reconstructed image, a target mapping relationship in a reconstruction process of the first image, where the target mapping relationship is a mapping relationship between the first text information and a first subject object, and the first subject object is a subject object in the first reconstructed image.
The second processing module 504 is configured to perform a second image reconstruction process on the second image according to the target mapping relationship and the second text information, so as to obtain a second reconstructed image of which the subject object is the first subject object, where the second text information is identification information corresponding to the second image, and the first image reconstruction process and the second image reconstruction process adopt the same image reconstruction algorithm.
In some embodiments, the first processing module 502 provided in this embodiment is specifically configured to: use the first text information as a prompt word, and invoke the target text-to-image model to perform image reconstruction processing on the first image until the target text-to-image model generates the first reconstructed image.
In some embodiments, the determining module 503 provided in this embodiment is specifically configured to: and determining a first mapping relation adopted in the process of carrying out image reconstruction processing on the first image by the target text-to-image model according to the first text information and the first reconstructed image.
The second processing module 504 provided in this embodiment is specifically configured to: determine, according to the second text information and the second image, a second mapping relationship adopted when the target text-to-image model performs image reconstruction processing on the second image; adjust the second mapping relationship according to the first mapping relationship so as to adjust the second text information in the second mapping relationship into the first text information, thereby obtaining a third mapping relationship; and use the second text information as a prompt word, and invoke the target text-to-image model to perform image reconstruction processing on the second image according to the third mapping relationship, so as to obtain a second reconstructed image whose subject object is the first subject object.
As an optional embodiment, the first image reconstruction processing provided in this embodiment includes noise adding processing and denoising processing, and the first processing module 502 provided in this embodiment is specifically further configured to: input the first text information as a prompt word into the target guidance network for prediction processing to obtain a guidance embedding vector; input the first image into the target diffusion model for noise adding processing to obtain a noised image of the first image; determine a cross-attention value from the guidance embedding vector and the embedding vector of the noised image; and invoke the target diffusion model to denoise the noised image according to the cross-attention value, so as to obtain the first reconstructed image.
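The cross-attention value between the guidance embedding and the noised-image embedding can be illustrated with a minimal sketch. The vectors and names are hypothetical and the dimensions are tiny; this shows only the standard scaled dot-product form, not the patent's actual network.

```python
import math

# Hypothetical sketch: cross-attention between one noised-image position
# (query) and the guidance embedding vectors of the prompt tokens (keys).

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]                    # scaled dot products
    weights = softmax(scores)                     # attention over prompt tokens
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]       # weighted sum of values

query = [1.0, 0.0]                  # embedding of one noised-image position
keys = [[1.0, 0.0], [0.0, 1.0]]     # guidance embeddings of two prompt tokens
values = [[1.0, 0.0], [0.0, 1.0]]
out = cross_attention(query, keys, values)
# the first token matches the query better, so out[0] > out[1]
```

These per-token weights are exactly the quantities collected into the cross-attention mapping relationship that the later modules adjust.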
In this embodiment, the determining module 503 provided in this embodiment is specifically configured to: determine, according to the first text information and the first reconstructed image, a first cross-attention mapping relationship adopted when the target diffusion model performs image reconstruction processing on the first image and the first text information, where the first cross-attention mapping relationship characterizes the correspondence between the first text information and the first subject object in the first reconstructed image.
The second processing module 504 provided in this embodiment is specifically further configured to: invoke the target diffusion model and the target guidance network to perform image reconstruction processing on the second image and the second text information to obtain an initial reconstructed image, and extract the second cross-attention mapping relationship adopted during the image reconstruction processing; adjust the second cross-attention mapping relationship according to the first cross-attention mapping relationship, so that the correspondence between the second text information and the second subject object in the second cross-attention mapping relationship is adjusted into the correspondence between the first text information and the first subject object, thereby obtaining a target cross-attention mapping relationship, where the second subject object is the subject object in the initial reconstructed image; and invoke the target diffusion model to perform image reconstruction processing on the second image and the second text information using the target cross-attention mapping relationship, so as to obtain a second reconstructed image in which the second subject object is replaced with the first subject object.
As an optional embodiment, in order to increase the training rate of the text-to-image model on the first image and to lay the groundwork for modifying only the subject object of the image in the subsequent image reconstruction processing, the rate of the entire image reconstruction process is increased. Referring to fig. 6, fig. 6 is another schematic structural diagram of an image processing apparatus according to an embodiment of the present invention. As shown in fig. 6, the image processing apparatus 500 according to an embodiment of the present invention further includes: a matting module 505;
the matting module 505 is configured to perform matting processing on an initial image containing a subject object to be used, so as to extract a first subject object in the initial image, and obtain a first image only containing the first subject object.
In the implementation, each module and/or unit may be implemented as an independent entity, or may be combined arbitrarily and implemented as the same entity or a plurality of entities, where the implementation of each module and/or unit may refer to the foregoing method embodiment, and the specific beneficial effects that may be achieved may refer to the beneficial effects in the foregoing method embodiment, which are not described herein again.
In addition, referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device may be a mobile terminal, such as a smart phone, a tablet computer, or the like. As shown in fig. 7, the electronic device 700 includes a processor 701, a memory 702. The processor 701 is electrically connected to the memory 702.
The processor 701 is a control center of the electronic device 700, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device 700 and processes data by running or loading application programs stored in the memory 702, and calling data stored in the memory 702, thereby performing overall monitoring of the electronic device 700.
In this embodiment, the processor 701 in the electronic device 700 loads instructions corresponding to the processes of one or more application programs into the memory 702 according to the following steps, and the processor 701 executes the application program stored in the memory 702, so as to implement any step of the image processing method provided in the foregoing embodiment.
The electronic device 700 may implement the steps in any embodiment of the image processing method provided by the embodiment of the present invention, so that the beneficial effects that any one of the image processing methods provided by the embodiment of the present invention can implement are described in detail in the previous embodiments, and are not described herein.
Referring to fig. 8, fig. 8 is another schematic structural diagram of an electronic device according to an embodiment of the present invention, and fig. 8 is a specific structural block diagram of the electronic device according to the embodiment of the present invention, where the electronic device may be used to implement the image processing method provided in the above embodiment. The electronic device 800 may be a mobile terminal such as a smart phone or a notebook computer.
The RF circuit 810 is configured to receive and transmit electromagnetic waves, and to perform mutual conversion between the electromagnetic waves and electrical signals, thereby communicating with a communication network or other devices. RF circuitry 810 may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a Subscriber Identity Module (SIM) card, memory, and so forth. The RF circuitry 810 may communicate with various networks such as the internet, intranets, and wireless networks, or with other devices via wireless networks. The wireless network may include a cellular telephone network, a wireless local area network, or a metropolitan area network. The wireless network may use various communication standards, protocols, and technologies including, but not limited to, Global System for Mobile Communication (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (WCDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Wireless Fidelity (Wi-Fi) (e.g., IEEE 802.11a, IEEE 802.11b, IEEE 802.11g and/or IEEE 802.11n), Voice over Internet Protocol (VoIP), Worldwide Interoperability for Microwave Access (Wi-Max), other protocols for mail, instant messaging, and short messaging, as well as any other suitable communication protocols, even including those not yet developed.
The memory 820 may be used to store software programs and modules, such as program instructions/modules corresponding to the image processing methods in the above embodiments, and the processor 880 executes the software programs and modules stored in the memory 820 to thereby perform various functional applications and image processing methods.
Memory 820 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 820 may further include memory located remotely from processor 880, which may be connected to electronic device 800 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input unit 830 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, the input unit 830 may include a touch-sensitive surface 831 as well as other input devices 832. The touch-sensitive surface 831, also referred to as a touch screen or touch pad, may collect touch operations thereon or thereabout by a user (e.g., operations of the user on the touch-sensitive surface 831 or thereabout by using any suitable object or accessory such as a finger, stylus, etc.), and actuate the corresponding connection device according to a predetermined program. Alternatively, touch-sensitive surface 831 can include both a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device and converts it into touch point coordinates, which are then sent to the processor 880 and can receive commands from the processor 880 and execute them. In addition, the touch-sensitive surface 831 can be implemented using a variety of types, such as resistive, capacitive, infrared, and surface acoustic waves. In addition to the touch-sensitive surface 831, the input unit 830 may also include other input devices 832. In particular, other input devices 832 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, mouse, joystick, etc.
The display unit 840 may be used to display information entered by a user or provided to a user as well as various graphical user interfaces of the electronic device 800, which may be composed of graphics, text, icons, video, and any combination thereof. The display unit 840 may include a display panel 841, and optionally, the display panel 841 may be configured in the form of an LCD (Liquid Crystal Display ), an OLED (Organic Light-Emitting Diode), or the like. Further, touch-sensitive surface 831 can overlay display panel 841, and upon detection of a touch operation thereon or thereabout by touch-sensitive surface 831, is communicated to processor 880 for determining the type of touch event, whereupon processor 880 provides a corresponding visual output on display panel 841 based on the type of touch event. Although in the figures, touch-sensitive surface 831 and display panel 841 are implemented as two separate components, in some embodiments touch-sensitive surface 831 may be integrated with display panel 841 to implement input and output functions.
The electronic device 800 may also include at least one sensor 850, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 841 according to the brightness of ambient light, and a proximity sensor that may generate an interrupt when the flip cover is opened or closed. As one kind of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes), and can detect the magnitude and direction of gravity when stationary; it can be used for applications that recognize the attitude of the mobile phone (such as horizontal/vertical screen switching, related games, and magnetometer attitude calibration), vibration-recognition related functions (such as pedometer and tapping), and the like. Other sensors that may also be configured in the electronic device 800, such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, are not described in detail here.
Audio circuitry 860, speakers 861, and microphone 862 may provide an audio interface between the user and the electronic device 800. The audio circuit 860 may transmit the received electrical signal converted from audio data to the speaker 861, and the electrical signal is converted into a sound signal by the speaker 861 to be output; on the other hand, the microphone 862 converts the collected sound signals into electrical signals, which are received by the audio circuit 860 and converted into audio data, which are processed by the audio data output processor 880 and transmitted to, for example, another terminal via the RF circuit 810, or which are output to the memory 820 for further processing. Audio circuitry 860 may also include an ear bud jack to provide communication of peripheral headphones with electronic device 800.
The electronic device 800, via the transmission module 870 (e.g., a Wi-Fi module), may help the user send and receive e-mails, browse web pages, access streaming media, and the like; it provides the user with wireless broadband internet access. Although the transmission module 870 is shown in the figures, it is understood that it is not a necessary component of the electronic device 800 and may be omitted entirely as desired within the scope of not changing the essence of the invention.
The processor 880 is a control center of the electronic device 800, connects various parts of the entire cellular phone using various interfaces and lines, and performs various functions of the electronic device 800 and processes data by running or executing software programs and/or modules stored in the memory 820, and calling data stored in the memory 820, thereby performing overall monitoring of the electronic device. Optionally, processor 880 may include one or more processing cores; in some embodiments, processor 880 may integrate an application processor that primarily handles operating systems, user interfaces, applications, and the like, with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 880.
The electronic device 800 also includes a power supply 890 (e.g., a battery) that provides power to the various components, and in some embodiments, may be logically connected to the processor 880 via a power management system to perform functions such as managing charging, discharging, and power consumption via the power management system. Power supply 890 may also include one or more of any components of a dc or ac power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, etc.
Although not shown, the electronic device 800 further includes a camera (e.g., front camera, rear camera), a bluetooth module, etc., which are not described herein. In particular, in this embodiment, the display unit of the electronic device is a touch screen display, and the mobile terminal further includes a memory, and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by the one or more processors to implement any step of the image processing method provided in the foregoing embodiment.
In the implementation, each module may be implemented as an independent entity, or may be combined arbitrarily, and implemented as the same entity or several entities, and the implementation of each module may be referred to the foregoing method embodiment, which is not described herein again.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor. To this end, an embodiment of the present invention provides a storage medium in which a plurality of instructions capable of implementing any of the steps of the image processing method provided in the above embodiment when executed by a processor are stored.
Wherein the storage medium may include: read Only Memory (ROM), random access Memory (RAM, random Access Memory), magnetic or optical disk, and the like.
The steps in any embodiment of the image processing method provided by the embodiment of the present invention can be executed by the instructions stored in the storage medium, so that the beneficial effects that can be achieved by any image processing method provided by the embodiment of the present invention can be achieved, and detailed descriptions of the previous embodiments are omitted herein.
The foregoing describes in detail the image processing method, apparatus, electronic device, and storage medium provided in the embodiments of the present application; specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope according to the idea of the present application; in view of the above, the content of this description should not be construed as limiting the present application. Moreover, it will be apparent to those skilled in the art that various modifications and variations can be made without departing from the principles of the present invention, and such modifications and variations are also considered to be within the scope of the invention.

Claims (10)

1. An image processing method, comprising:
acquiring a first image and a second image;
performing first image reconstruction processing on the first image by using first text information as a prompt word to obtain a first reconstructed image, wherein the first text information is identification information corresponding to the first image, and the first reconstructed image is an image reconstructed by the first image;
determining a target mapping relation in the first image reconstruction processing process according to the first text information and the first reconstructed image, wherein the target mapping relation is a mapping relation between the first text information and a first main object, and the first main object is a main object in the first reconstructed image;
and carrying out second image reconstruction processing on the second image according to the target mapping relation and the second text information to obtain a second reconstructed image of which the main object is the first main object, wherein the second text information is identification information corresponding to the second image, and the first image reconstruction processing and the second image reconstruction processing adopt the same image reconstruction algorithm.
2. The method according to claim 1, wherein the step of performing the first image reconstruction processing on the first image using the first text information as a prompt word to obtain the first reconstructed image comprises:
using the first text information as a prompt word, invoking a target text-to-image model to perform image reconstruction processing on the first image until the target text-to-image model generates the first reconstructed image.
3. The method according to claim 2, wherein the step of determining a target mapping relation during the first image reconstruction process from the first text information and the first reconstructed image comprises:
determining a first mapping relation adopted in the process of carrying out image reconstruction processing on the first image by the target text-to-image model according to the first text information and the first reconstructed image;
and the step of performing the second image reconstruction processing on the second image according to the target mapping relation and the second text information to obtain the second reconstructed image whose main object is the first main object comprises:
determining a second mapping relation adopted in the process of reconstructing the second image by the target text-to-image model according to the second text information and the second image;
adjusting the second mapping relation according to the first mapping relation, so as to replace the second text information in the second mapping relation with the first text information, thereby obtaining a third mapping relation;
and using the second text information as a prompt word, invoking the target text-to-image model to perform image reconstruction processing on the second image according to the third mapping relation, to obtain a second reconstructed image whose main object is the first main object.
4. The method of claim 1, wherein the first image reconstruction processing comprises a noise-adding process and a denoising process;
the step of performing a first image reconstruction process on the first image by using the first text information as a prompt word to obtain a first reconstructed image comprises the following steps:
inputting the first text information as a prompt word into a target guidance network for prediction processing to obtain a guidance embedding vector;
inputting the first image into a target diffusion model for noise-adding processing to obtain a noise-added image of the first image;
determining a cross-attention value between the guidance embedding vector and the embedding vector of the noise-added image;
and invoking the target diffusion model to perform denoising processing on the noise-added image according to the cross-attention value, to obtain the first reconstructed image.
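The noise-adding step in claim 4 corresponds to a forward-diffusion step; a minimal sketch is given below, assuming the common DDPM formulation x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε. The schedule value, array shapes, and function names are illustrative assumptions, not the patent's actual model.

```python
# Illustrative sketch of the noise-adding process of claim 4, assuming the
# standard DDPM forward process x_t = sqrt(a_bar)*x_0 + sqrt(1-a_bar)*eps.
# The schedule value (0.5) and shapes are assumptions, not the patent's.
import numpy as np

def add_noise(x0, alpha_bar_t, rng):
    """Forward diffusion: blend the clean image with Gaussian noise."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(1)
x0 = np.ones((8, 8))                            # toy "first image"
noisy = add_noise(x0, alpha_bar_t=0.5, rng=rng)  # noise-added image
```

In a real pipeline the denoising pass would then run the diffusion model in reverse from `noisy`, conditioned on the guidance embedding; that model is not sketched here.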
5. The method of claim 4, wherein the step of determining a target mapping relationship during the first image reconstruction process from the first text information and the first reconstructed image comprises:
determining, according to the first text information and the first reconstructed image, a first cross-attention mapping relation adopted by the target diffusion model in performing image reconstruction processing on the first image and the first text information, wherein the first cross-attention mapping relation represents a correspondence between the first text information and the first main object in the first reconstructed image;
and the step of performing the second image reconstruction processing on the second image according to the target mapping relation and the second text information to obtain the second reconstructed image whose main object is the first main object comprises:
invoking the target diffusion model and the target guidance network to perform image reconstruction processing on the second image and the second text information to obtain an initial reconstructed image, and extracting a second cross-attention mapping relation adopted during that image reconstruction processing;
adjusting the second cross-attention mapping relation according to the first cross-attention mapping relation, so as to change the correspondence between the second text information and a second main object in the second cross-attention mapping relation into the correspondence between the first text information and the first main object, thereby obtaining a target cross-attention mapping relation, wherein the second main object is the main object in the initial reconstructed image;
and invoking the target diffusion model to perform image reconstruction processing on the second image and the second text information using the target cross-attention mapping relation, to obtain a second reconstructed image in which the second main object is replaced by the first main object.
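The cross-attention manipulation of claims 4–5 can be illustrated with a small numpy sketch: queries come from the noise-added image embedding, keys from the text embedding, and a "target cross-attention mapping relation" is formed by grafting a column of the first map onto the second. All shapes, the choice of subject-token index, and the renormalization step are assumptions for illustration only, not the patent's implementation.

```python
# Toy sketch of swapping a cross-attention map between two prompts, in the
# spirit of claims 4-5. Shapes and the subject-token index are assumed.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_map(image_emb, text_emb):
    """A = softmax(Q K^T / sqrt(d)): queries from image latents,
    keys from text tokens; shape (pixels, tokens)."""
    d = text_emb.shape[-1]
    return softmax(image_emb @ text_emb.T / np.sqrt(d), axis=-1)

rng = np.random.default_rng(0)
d = 8                                    # embedding width (assumed)
noisy_img1 = rng.normal(size=(16, d))    # 16 latent "pixels", image 1
noisy_img2 = rng.normal(size=(16, d))    # image 2
text1 = rng.normal(size=(4, d))          # 4 prompt tokens, first text
text2 = rng.normal(size=(4, d))          # second text

attn1 = cross_attention_map(noisy_img1, text1)  # first mapping relation
attn2 = cross_attention_map(noisy_img2, text2)  # second mapping relation

# "Adjusting the second mapping relation according to the first": reuse
# the first map's subject-token column (index 0, assumed) inside the
# second map, then renormalize each row back to a distribution.
target_attn = attn2.copy()
target_attn[:, 0] = attn1[:, 0]
target_attn /= target_attn.sum(axis=-1, keepdims=True)
```

Conditioning the second denoising pass on `target_attn` instead of `attn2` is what, schematically, steers the second reconstruction toward the first subject object.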
6. The method of claim 1, wherein prior to the step of acquiring the first image and the second image, the method further comprises:
performing matting processing on an initial image containing the subject object to be used, so as to extract a first subject object from the initial image and obtain a first image containing only the first subject object.
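A minimal stand-in for the matting step of claim 6, assuming the subject region is already given as a binary mask (a real system would produce this mask with a segmentation or alpha-matting model; the shapes and names here are illustrative only):

```python
# Simplified matting sketch for claim 6: given a binary subject mask,
# zero out the background so only the first subject object remains.
# Image shape (H, W, C) uint8 is an assumption for illustration.
import numpy as np

def matte_subject(initial_image, subject_mask):
    """Keep pixels where the mask is True; zero everything else."""
    return np.where(subject_mask[..., None], initial_image, 0)

initial = np.full((4, 4, 3), 200, dtype=np.uint8)  # toy initial image
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                              # assumed subject region

first_image = matte_subject(initial, mask)          # subject-only image
```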
7. An image processing apparatus, comprising:
the acquisition module is used for acquiring the first image and the second image;
the first processing module is used for carrying out first image reconstruction processing on the first image by taking first text information as a prompt word to obtain a first reconstructed image, wherein the first text information is identification information corresponding to the first image, and the first reconstructed image is an image reconstructed by the first image;
the determining module is used for determining a target mapping relation in the first image reconstruction processing process according to the first text information and the first reconstructed image, wherein the target mapping relation is a mapping relation between the first text information and a first main object, and the first main object is a main object in the first reconstructed image;
the second processing module is used for performing second image reconstruction processing on the second image according to the target mapping relation and second text information to obtain a second reconstructed image whose main object is the first main object, wherein the second text information is identification information corresponding to the second image, and the first image reconstruction processing and the second image reconstruction processing adopt the same image reconstruction algorithm.
8. The apparatus of claim 7, wherein the first processing module is further configured to: use the first text information as a prompt word and invoke a target text-to-image model to perform image reconstruction processing on the first image until the target text-to-image model generates the first reconstructed image.
9. An electronic device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the steps in the method according to any one of claims 1 to 6.
CN202311526888.8A 2023-11-15 2023-11-15 Image processing method, device, electronic equipment and storage medium Pending CN117576258A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311526888.8A CN117576258A (en) 2023-11-15 2023-11-15 Image processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311526888.8A CN117576258A (en) 2023-11-15 2023-11-15 Image processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117576258A true CN117576258A (en) 2024-02-20

Family

ID=89891076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311526888.8A Pending CN117576258A (en) 2023-11-15 2023-11-15 Image processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117576258A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118314238A (en) * 2024-06-07 2024-07-09 中山大学 Text inverse mapping-based personalized image generation method with background embedding and feature separation


Similar Documents

Publication Publication Date Title
US10402157B2 (en) Volume processing method and device and storage medium
CN108418969B (en) Antenna feed point switching method and device, storage medium and electronic equipment
CN111554321A (en) Noise reduction model training method and device, electronic equipment and storage medium
WO2019015573A1 (en) Unlocking control method and related product
CN117576258A (en) Image processing method, device, electronic equipment and storage medium
CN111182236A (en) Image synthesis method and device, storage medium and terminal equipment
CN111158815B (en) Dynamic wallpaper blurring method, terminal and computer readable storage medium
WO2022089512A1 (en) Load control method and apparatus, and device
CN108259233A (en) Graphics processor GPU method for parameter configuration and mobile terminal in a kind of mobile terminal
CN109346102B (en) Method and device for detecting audio beginning crackle and storage medium
CN111897916B (en) Voice instruction recognition method, device, terminal equipment and storage medium
CN111355991B (en) Video playing method and device, storage medium and mobile terminal
CN111563838B (en) Image processing method and electronic equipment
CN114140655A (en) Image classification method and device, storage medium and electronic equipment
CN117333573A (en) AIGC-based image light and shadow processing method, device, equipment and storage medium
CN110047076B (en) Image information processing method and device and storage medium
CN107679460A (en) The self-learning method and Related product of face
CN112468725B (en) Photo shooting method and device, storage medium and mobile terminal
CN112379857B (en) Audio data processing method and device, storage medium and mobile terminal
CN112989890B (en) Image detection method, device and storage medium
CN114237941A (en) Data processing method, data processing device, storage medium and electronic equipment
CN108280816B (en) Gaussian filtering method and mobile terminal
CN116612751A (en) Intention recognition method, device, electronic equipment and storage medium
CN116563334A (en) Target tracking method, device, electronic equipment and storage medium
CN116896663A (en) Data processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination