WO2023060434A1

WO2023060434A1 - Text-based image editing method, and electronic device

Info

Publication number: WO2023060434A1
Application number: PCT/CN2021/123272
Authority: WO
Inventors: 程俊; 吴福祥; 宋呈群
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2021-10-12
Filing date: 2021-10-12
Publication date: 2023-04-20

Abstract

A text-based image editing method, and an electronic device. The method comprises: acquiring a global image feature and a local image feature of a target source image, and a global sentence feature and a sentence word feature of target text; and according to the global image feature, the local image feature, the global sentence feature and the sentence word feature, editing the target source image on the basis of an image editing model, so as to obtain a target edited image, wherein the image editing model comprises a sampling and encoding module and at least one cascaded generation module. An intermediate editing result of an image editing process can be output by using an encoding module, and if the intermediate editing result does not meet user requirements, the image editing model can adjust the intermediate editing result, and inputs the adjusted intermediate result into the at least one cascaded generation module, thereby solving the problem of an image editing result that is output by a ManiGAN failing to meet the user requirements.

Description

A text-based image editing method and electronic device

technical field

The present application relates to the field of image processing, in particular to a text-based image editing method and electronic equipment.

Background technique

As we all know, text-based image editing is a technology to edit source images according to a given text. This technology is a research hotspot in the field of multimedia and has important application value. The existing attention-based Generative Adversarial Network (Manipulating Attention Generative Adversarial Network, ManiGAN) is used for image editing based on the text description content of the source image to be edited. However, when ManiGAN edits the source image according to the input text, the output image editing results often do not meet the user's requirements.

Therefore, how to make the image editing results output by ManiGAN meet user requirements is an urgent problem to be solved.

technical problem

One of the purposes of the embodiments of the present application is: a text-based image editing method and electronic equipment, aiming at solving the problem that the image editing results output by the existing ManiGAN do not meet user requirements.

technical solution

The technical scheme that the embodiment of the present application adopts is:

In the first aspect, a text-based image editing method is provided, including: acquiring the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text; according to the overall image features, the The image local feature, the sentence overall feature and the sentence word feature, edit the target source image based on the image editing model to obtain the target editing image; wherein, the image editing model includes: a sampling encoding module and at least one A cascaded generation module; the processing of the target source image by the image editing model includes: using the sampling encoding module to perform sampling encoding processing on the overall feature of the image, the overall feature of the sentence, and the local feature of the image , obtain a first edited image, and output the first edited image; in response to a user instruction, input the first edited image, the image local features and the sentence word features into the at least one cascaded generation module Perform high-dimensional visual feature extraction to obtain the target edited image, or input the target source image, the local features of the image and the sentence word features into the at least one cascaded generation module for high-dimensional vision feature extraction to obtain the target edited image.

The above method can be executed by a chip on an electronic device. Compared with the existing ManiGAN that edits the source image to be edited according to the text and directly outputs the editing results that may not meet the user's requirements, this application introduces a sampling encoding module into the existing ManiGAN to form an improved ManiGAN; The edited result (ie the first edited image) is output to facilitate the user to judge whether the intermediate edited result meets the requirements. If the intermediate edited result meets the requirements, the intermediate edited result will continue to be passed to at least one cascaded generation module; if it does not meet the requirements, it will not The intermediate result will continue to be delivered to at least one cascaded generation module, but the target source image will be used to replace the intermediate editing result and continue to be passed to at least one cascaded generation module. It can be seen that when the improved ManiGAN edits the target source image according to the target text, it can control the intermediate editing results, and promptly remove the intermediate editing results that do not meet the requirements, so as to prevent the output of the previous level from affecting the subsequent results. The accuracy of the first-level output results, so as to edit the target edited image more in line with the requirements for the user.

Optionally, the generating module includes: a first automatic decoder, a first self-attention module, a second upsampling module and a second automatic encoder, the first automatic decoder is used to restore the high-dimensional visual feature, to obtain the first high-dimensional feature image, the input information is the first edited image or the output information of the previous layer generation module; the first self-attention module is used for the first high-dimensional feature image and The sentence feature is fused and processed to obtain the sentence semantic information feature; the second up-sampling module is used to perform feature fusion and up-sampling processing on the sentence semantic information feature to obtain a second up-sampling result; the second up-sampling module Two automatic encoders are used to perform high-dimensional visual feature extraction on the second upsampling result to obtain output information. When the generation module is the last generation module in the at least one cascaded generation module, the The output information is the target edited image.

Optionally, the first self-attention module includes: a self-attention layer and a first noisy affine combination module, and the self-attention layer is used to analyze the first high-dimensional feature image and the sentence word Feature fusion of features; the first noisy affine combination module is used to perform feature fusion on the output result of the self-attention layer and the splicing result of the first high-dimensional feature image, as well as the local features of the image.

In the above-mentioned first self-attention module, the first noisy affine combination module is introduced. The first noisy affine combination module can enhance the reliability of the generation module to edit images by introducing Gaussian noise, thereby avoiding the generation module due to the The presence of random noise affects the reliability of the editing results.

Optionally, the sampling encoding module includes: a first up-sampling module and a first automatic encoder, the first up-sampling module is used to perform an overall feature of the image, the overall feature of the sentence, and the local feature of the image Perform upsampling processing to obtain a first upsampling result; the first automatic encoder is configured to generate a first edited image according to the first upsampling result.

Optionally, the first upsampling module includes: a plurality of identical upsampling layers, a second noisy affine combination module and a third noisy affine combination module, and the input of the first upsampling module is the sentence The overall feature, the overall feature of the image and the local feature of the image, among the two adjacent up-sampling layers in the plurality of identical up-sampling layers, the input of the latter up-sampling layer is the output of the previous up-sampling layer ; The second noisy affine combination module is located in the middle of any two upsampling layers in the plurality of identical upsampling layers, and is used to output the result of the previous upsampling layer in the any two upsampling layers Perform feature fusion with the local features of the image; the third noisy affine combination module is used to perform feature fusion on the output result of the last upsampling layer in the plurality of identical upsampling layers and the local features of the image .

Introducing the second noisy affine combination module and the third noisy affine combination module into the first upsampling module can further enhance the visual features of the output results of different upsampling layers in the first upsampling module.

Optionally, the image editing model further includes: a detail modification model, configured to modify the details of the target edited image; the detail modification model is used to process the image local features, the sentence word features, and the Editing the image of the target to obtain the corrected image of the target; the detail correction model includes: a first detail correction module, a second detail correction module, a fusion module and a generator, wherein the first detail correction module is used to partially modify the image feature, the first random noise, and the sentence word feature are modified in detail to obtain the first detail feature; the second detail correction module is used to modify the image local feature corresponding to the target edited image, the second random noise, and the Sentence word features are modified in detail to obtain the second detailed features; the fusion module is used to perform feature fusion on the first detailed features and the second detailed features; the generator is used for outputting results according to the fusion module The target corrected image is generated.

Adding the detail correction model to the above image editing model can further modify and enhance the details of the target editing image output by the image editing model, so as to obtain a high-resolution target correction image.

Optionally, the first detail modification module includes a fourth noisy affine combination module, a fifth noisy affine combination module, a sixth noisy affine combination module, a second self-attention module, and a first residual network and the first linear network; the fourth noisy affine combination module is used to perform feature fusion on the first random noise and the local features of the image to obtain the first fusion feature; the second self-attention module It is used to perform feature fusion on the first fusion feature and the sentence word feature; the output result of the second self-attention module and the splicing of the first random noise by the fifth noisy affine combination module The results and the local features of the image are subjected to feature fusion; the first residual network is used to extract visual features from the output of the fifth noisy affine combination module; the first linear network is used to extract the visual features of the performing linear transformation on the local features of the image; the sixth noisy affine combination module is used to perform feature fusion on the output result of the first residual network and the output result of the first linear network.

Adding a plurality of noisy affine combination modules to the first detail correction module can enhance the reliability of the detail correction model.

Optionally, the second detail modification module includes a seventh noisy affine combination module, an eighth noisy affine combination module, a ninth noisy affine combination module, a third self-attention module, and a second residual network and a second linear network; the seventh noisy affine combination module is used to perform feature fusion on the second random noise and the image local features corresponding to the target edited image to obtain the first fusion feature; the second Three self-attention modules are used to carry out feature fusion to the first fusion feature and the sentence word feature;

The eighth noisy affine combination module performs feature fusion on the output result of the third self-attention module, the mosaic result of the second random noise, and the image local features corresponding to the target edited image; the The second residual network is used to perform visual feature extraction on the output result of the eighth noisy affine combination module; the second linear network is used to linearly transform the image local features corresponding to the target edited image; The ninth noisy affine combination module is used to perform feature fusion on the output result of the second residual network and the output result of the second linear network.

Adding a plurality of noisy affine combination modules to the second detail correction module can enhance the reliability of the detail correction model.

Optionally, the training method of the detail correction model includes: training the generator of the detail correction model according to the conditional generator loss function, the unconditional generator loss function and the semantic comparison function; according to the conditional discriminator loss function function and an unconditional discriminator loss function to train the discriminator of the minutiae revision model.

The above-mentioned generator that trains the detail correction model according to the conditional generator loss function, the unconditional generator loss function and the semantic comparison function can make the image editing result generated by the generator (ie, the target edited image) more conform to the target text description content and user requirements. Training the discriminator of the detail modification model according to the conditional discriminator loss function and the unconditional discriminator loss function can make the recognition result of the discriminator more accurate.

Optionally, the training method of the image editing model includes: using a preset loss function and a training set to train the initial model to obtain the image editing model; wherein, the preset loss function includes and N sub-networks Corresponding sub-functions and loss functions of N-1 automatic codecs, the initial model includes N sub-networks, and the N sub-networks are initial models corresponding to the sampling encoding module and at least one generation module respectively; training In the process, when the output image of the i-th sub-network does not meet the preset conditions, the sub-function corresponding to the i+1-th sub-network and the loss function of the i-th to i+1-th automatic codec are used, Train the initial model, 0≤i<N.

In the process of training the above image editing model, the training of the N sub-networks is to skip the previous sub-networks whose output results do not meet the requirements through the automatic codec and give priority to training the subsequent sub-networks, because the target source image is used when skipping. It is not an intermediate editing result (for example, the first edited image), therefore, it is possible to avoid the potential error results output by the previous sub-network from propagating to the subsequent sub-network. The priority training of the later sub-network can bring better update gradients to the front-level sub-network, so that the convergence effect of the front-level sub-network is better.

In a second aspect, an electronic device is provided, including a module for performing any one of the methods in the first aspect.

In a third aspect, a computer-readable storage medium is provided, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes any one of the first aspect. Methods.

Description of drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the following will briefly introduce the accompanying drawings that need to be used in the embodiments or exemplary technical descriptions. Obviously, the accompanying drawings in the following descriptions are only for this application. For some embodiments, those skilled in the art can also obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic flow chart of a text-based image editing method in an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of an image editing model in an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a detail correction model in an embodiment of the present invention;

Fig. 4 is a schematic diagram of the processing process of the target source image by the image editing model in the embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of the present invention.

Embodiment of this application

In the following description, specific details such as specific system structures and technologies are presented for the purpose of illustration rather than limitation, so as to thoroughly understand the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It should be understood that when used in this specification and the appended claims, the term "comprising" indicates the presence of described features, integers, steps, operations, elements and/or components, but does not exclude one or more other Presence or addition of features, wholes, steps, operations, elements, components and/or collections thereof.

It should also be understood that the term "and/or" used in the description of the present application and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

In addition, in the description of the specification and appended claims of the present application, the terms "first", "second", "third" and so on are only used to distinguish descriptions, and should not be understood as indicating or implying relative importance.

Reference to "one embodiment" or "some embodiments" or the like in the specification of the present application means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Accordingly, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in other embodiments," etc., in various places in this specification are not necessarily all References to the same embodiment mean "one or more but not all" unless specifically stated otherwise. The terms "including", "comprising", "having" and variations thereof mean "including but not limited to", unless specifically stated otherwise.

Text-based image editing is a research hotspot in the field of multimedia and has important application value. ManiGAN is used for image editing based on the text description content of the source image to be edited. However, the existing ManiGAN cannot process the intermediate results of image editing when editing the source image according to the text description content, which often leads to the image editing results output by ManiGAN not meeting user requirements. In order to solve the problem that the image editing results output by the existing ManiGAN are more in line with user requirements, this application introduces an automatic codec into the multi-level generation confrontation network of the existing ManiGAN. The automatic codec can output the intermediate editing results to the user for convenience. The user can directly control the edited result of the intermediate output of ManiGAN, so as to obtain the target edited image that is more in line with the user's requirements.

The present application will be described in further detail below in conjunction with the accompanying drawings and specific embodiments.

In order to solve the problem that the image editing results output by ManiGAN do not meet the user's requirements, this application proposes a text-based image editing method, as shown in Figure 1, the method is executed by electronic equipment, and the method includes:

S101. Acquire overall image features and local image features of a target source image, and overall sentence features and sentence word features of a target text.

Exemplarily, the electronic device acquires the overall image features and local image features of the target source image, and the overall sentence features and sentence word features of the target text. The target source image comes from the MS-COCO (Microsoft Common Objects in Context) dataset and the CUB200 dataset; the above target text refers to the text information that records the user editing the target source image, for example, the target source image is a bird, if the user If you want to dye the bird's feathers red and its head yellow, you can record these editing requirements in the target text in the form of words, that is, the specific content of the target text is to dye the bird's feathers Red and head dyed yellow.

The above-mentioned overall image features (i.e., global image features) refer to the features that can represent the entire image, and are used to describe the overall features such as image color and shape, such as color features, texture features, and shape features; the above-mentioned image local features (i.e., local image features) feature) is the local expression of image features, which reflects the local characteristics of the image.

Usually, the visual geometry group (Visual Geometry Group, VGG) network can be used to extract the image overall features and image local features of the target source image; a special recurrent neural network (Recurrent Neural Network, RNN), that is, long Short-term memory (Long short-term memory, LSTM) network to extract the overall sentence features and sentence word features of the target text.

S102, according to the image overall feature, image local feature, sentence overall feature and sentence word feature, edit the target source image based on the image editing model to obtain the target editing image; wherein, the image editing model includes: a sampling coding module and at least one cascade The generation module of the image editing model includes: using the sampling encoding module to sample and encode the overall features of the image, the overall features of the sentence and the local features of the image to obtain the first edited image and output the first edited image; In response to user instructions, input the first edited image, image local features and sentence word features into at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image, or input the target source image, image local features and sentence The word feature is input to at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image.

Exemplarily, as shown in FIG. 2 , the target source image I is a bird picture with a size of 128×128 and a channel number of 3. In the bird picture, the feathers on the belly of the bird are white, and the mouth is Gray, the feathers on the head and neck are gray and white; the VGG network performs feature extraction on the 128×128 target source image I, and obtains the image overall feature c _I and image local feature M I corresponding to the target source image _I , wherein, the overall feature c _I of the image is a 128×1 column vector; the size of the local feature M _I of the image is 128×128 and the number of channels is 128; for another example, the specific content of the target text T is the The feathers turn yellow, the mouth turns yellow, and the feathers on the head and neck turn gray and yellow; the LSTM network performs feature extraction on the target text T, and obtains the overall sentence feature c _T and the sentence corresponding to the target text T Word feature M _T , where the overall sentence feature c _T and sentence word feature M _T are both 128×1 column vectors.

As shown in FIG. 2 , the image editing model 201 includes: a sample encoding module G ₀₀ and at least one cascaded generation module (eg, G ₀₁ , G ₀₂ ), wherein the sample encoding module G ₀₀ includes a first upsampling module F ₀ and the first autoencoder G ₀ , the first upsampling module F ₀ performs upsampling processing on the image overall feature c _I and the sentence overall feature c _T to obtain the first upsampling result; the first autoencoder G ₀ The first upsampling result is encoded and a first edited image is generated

Optionally, as shown in FIG. 2, the first upsampling module _F0 includes: a plurality of identical upsampling layers, a second noisy affine combination module 2011a and a third noisy affine combination module 2011b, the first upsampling The input of the sampling module F ₀ is the overall sentence feature c _T , the overall image feature c _I and the image local feature M _I , among the two adjacent up-sampling layers in multiple identical up-sampling layers, the input of the latter up-sampling layer is the output of the previous up-sampling layer; the second noisy affine combination module 2011a is located in the middle of any two up-sampling layers in multiple identical up-sampling layers, and is used for the previous up-sampling layer in any two up-sampling layers The output result and the image local feature _MI are subjected to feature fusion; the third noisy affine combination module 2011b is used to perform feature fusion on the output result of the last upsampling layer in multiple identical upsampling layers and the image local feature _MI .

The above multiple identical upsampling layers may be three upsampling layers or four upsampling layers, which is not limited in this application, and the user may set the specific number of multiple upsampling layers according to actual needs. This application only takes 4 upsampling layers included in the first upsampling module F ₀ in _FIG . The overall feature c _I and the image local feature M _I perform the process of feature extraction and feature fusion.

The above-mentioned second noisy affine combination module 2011a can be located in the middle of any two upsampling layers among the 4 upsampling layers in the first upsampling module _F0 , for example, in FIG. 2, the four upsampling layers of the first upsampling module _F0 In the upsampling layer, the first three upsampling layers perform upsampling processing on the input 128×1 image overall feature c _I and the 128×1 sentence overall feature c _T , and the output size is 32×32 and the number of channels is 64. Internal features (i.e. internal features of 32×32×64); the second noisy affine combination module 2011a is located between the third upsampling layer and the fourth upsampling layer in the first upsampling module F ₀ ; the second band The noise affine combination module 2011a enhances the visual features of the upsampling results output by the third upsampling layer (that is, the internal features of 32×32×64 output by the first three upsampling layers) according to the local image feature M _I to obtain 32× The first enhanced upsampling result of 32×64; the first enhanced upsampling result is first subjected to upsampling processing of the fourth upsampling layer and outputs 64×64×32 visual features, and the 64×64×32 visual features are then The third noisy affine combination module 2011b further enhances the visual features and outputs the enhanced 64×64×32 visual features; the third noisy affine combination module 2011b performs the fourth _upper The upsampling result output by the sampling layer (that is, the output result of the last upsampling layer among multiple identical upsampling layers) is subjected to visual feature enhancement.

The above-mentioned first autoencoder _G0 performs feature extraction and encoding processing on the enhanced 64×64×32 visual features output by the third noisy affine combination module 2011b, and the output size is 64×64 and the number of channels is First edited image of 3

(i.e. the first edited image of 64×64×3

). Users can directly observe the first edited image of 64×64×3

To judge whether it meets the requirements; if the first edited image of 64×64×3

Meet the user's requirements, then the first edited image of 64×64×3

Input at least one cascaded generation module for subsequent processing, as described in (I) in Figure 2; if the first edited image of 64 × 64 × 3

If it does not meet the user's requirements, the first edited image of 64×64×3

Discard and replace the first edited image with the target source image I of 128×128×3

Input to the generation module G ₀₁ for subsequent editing processing, as described in (II) in Figure 2, where the first edited image generated by the sampling encoding module G ₀₀

was discarded, therefore, the network structure included in the sampling coding module G ₀₀ is represented by a dotted box to illustrate that due to the first edited image

Does not meet the user's requirements, so the sampling encoding module G ₀₀ generates the first edited image

Steps above are skipped, and the target source image I of 128×128×3 is directly input into the generation module G ₀₁ for image editing. It can be seen that, because the sampling encoding module G ₀₀ generates the first edited image

Directly replace the first edited image with the 128×128×3 target source image I that does not meet the requirements

Input to the generation module G ₀₁ for image editing, thus avoiding the first edited image that does not meet the user's requirements

(i.e. wrong first edited image

) continue to propagate backwards.

For example, if the user determines that the first edited image of 64×64×3

If the requirements are met, the user will send a confirmation instruction to the electronic device, and the electronic device will convert the first edited image of 64×64×3 according to the confirmation instruction received from the user.

Input at least one cascaded generation module to perform high-dimensional visual feature extraction to obtain the target edited image, as described in (I) in Figure 2; if the user judges the first edited image of 64×64×3

If the requirements are not met, the user will send a rejection instruction to the electronic device, and the electronic device will convert the first edited image of 64×64×3 according to the rejection instruction received from the user.

Discard and input the 128×128×3 target source image I into at least one cascaded generation module (for example, generation module G ₀₁ ) for high-dimensional visual feature extraction to obtain the target edited image, as shown in (II) in Figure 2 stated.

The above-mentioned at least one cascaded generation module may be one generation module or two generation modules, which is not limited in this application, and the number of generation modules can be set by the user according to actual needs. This application only takes the example of the image editing model in Figure 2 including two generating modules to illustrate that the two generating modules cooperate with the sampling and encoding module to further image edit the intermediate editing result (for example, the first edited image) to generate the target edit. image process.

Exemplarily, as shown in FIG. 2 , the above generation module G ₀₁ includes: a first automatic decoder E ₀ , a first self-attention module 2012c, a second upsampling module F ₁ and a second automatic encoder G ₁ , the first An automatic decoder E ₀ is used to restore the first edited image

(that is, the high-dimensional visual features of the input information) to obtain the first high-dimensional feature image; the first self-attention module 2012c is used to fuse and splicing the first high-dimensional feature image and sentence word feature _MT to obtain sentence semantics Information features; the second upsampling module _F1 is used to perform feature fusion and upsampling processing on the semantic information features of the sentence to obtain the second upsampling result; the second autoencoder _G1 is used to perform high-dimensional processing on the second upsampling result Visual feature extraction to obtain output information.

In Figure 2, the generation module G ₀₁ for the first edited image

(e.g. the above 64×64×3 first edited image

) process is as follows: the first automatic decoder E ₀ for the above-mentioned first edited image of 64 × 64 × 3

Perform high-dimensional visual feature restoration to obtain the first high-dimensional feature image with a size of 64×64 and a channel number of 32 (ie, 64×64×32); after that, the first self-attention module 2012c performs 64×64×32 The first high-dimensional feature image, 128×1 sentence word feature M _T and 128×128×128 image local feature M _I perform feature fusion and splicing processing, and generate high-dimensional visual features with fine-grained sentence semantic information (i.e. the sentence semantic information feature); the size of the sentence semantic information feature is 64×64 and the number of channels is 32 (ie 64×64×32); after that, the second upsampling module F ₁ pairs 64×64×32 The semantic information features of the sentence are subjected to feature fusion and upsampling processing to generate a visual feature image with a size of 128×128 and a channel number of 32 (ie 128×128×32) (ie the second upsampling result); the second upsampling Sampling module F ₁ includes two residual networks and an upsampling layer, where the two residual networks are used to fuse the semantic information features of the sentence, and the upsampling layer is used to improve the spatial resolution of the image; finally, the second Autoencoder G ₁ performs high-dimensional visual feature extraction and encoding processing on the second upsampling result to generate a second edited image with a size of 128×128 and a channel number of 3 (ie 128×128×3)

Optionally, the second upsampling result may first pass through the noisy affine combination module 2012b, and the noisy affine combination module 2012b fuses the second upsampling result with the image local feature _MI of 128×128×128, and Generate a visual feature image with a size of 128×128 and a channel number of 32 (that is, 128×128×32); after that, the second automatic encoder G ₁ performs high-dimensional visual feature extraction on the visual feature image of 128×128×32 and encoding processing, the output size is 128×128 and the number of channels is 3 for the second edited image

(i.e. the second edited image of 128×128×3

).

The above-mentioned first self-attention module 2012c includes: a self-attention layer F _Atten and a first band noise affine combination module 2012a, and the self-attention layer F _Atten is used for feature fusion to the first high-dimensional feature image and sentence word features; The first noisy affine combination module 2012a is used to perform feature fusion on the output result of the self-attention layer F _Atten and the splicing result of the first high-dimensional feature image, as well as the image local feature M _I. For example, the self-attention layer F _Atten performs feature fusion on the first high-dimensional feature image of 64×64×32 and sentence word features of 128×1, and stitches the fusion result with the first high-dimensional feature image; the first The noisy affine combination module 2012a performs feature fusion on the mosaic result and the local feature M _I of the 128×128×128 image.

The second edited image of 128×128×3 output by the second autoencoder _G1

Users can directly observe the second edited image of 128×128×3

to determine whether it meets the requirements.

For example, if the user determines that the second edited image of 128×128×3

If the requirements are met, the user will send a confirmation instruction to the electronic device, and the electronic device will convert the second edited image of 128×128×3 according to the confirmation instruction received from the user.

Input to the generation module of the next cascade for high-dimensional visual feature extraction to obtain the target edited image; if the user judges the second edited image of 128×128×3

If the requirements are not met, the user will send a rejection instruction to the electronic device, and the electronic device will generate a second edited image of 128×128×3 by the generation module G ₀₁ according to the rejection instruction sent by the user

Abandoned (i.e. do not consider generating module G ₀₀ to generate the first edited image

) and input the 128×128×3 target source image into the next cascaded generation module (for example, the generation module G ₀₂ ) for high-dimensional visual feature extraction to obtain the target edited image, thus avoiding the first 2 Edit the image

(i.e. wrong second edited image

) continue to propagate backwards.

Exemplarily, as shown in FIG. 2 , the next cascaded generation module G ₀₂ after the above generation module G ₀₁ includes: a second automatic decoder E ₁ , a second self-attention module 2013c, and a third upsampling module F ₂ and the generator G ₂ , the second automatic decoder E ₁ is used to recover the second edited image

(i.e. the output information of the previous layer generation module G ₀₁ ) high-dimensional visual features to obtain the second high-dimensional feature image; the second self-attention module 2013c is used to perform the second high-dimensional feature image and sentence word feature _MT Fusion and splicing processing, output second edited image

Corresponding sentence semantic information features; the third upsampling module F ₂ is used to edit the second image

The corresponding sentence semantic information features are subjected to feature fusion and upsampling processing to obtain a third upsampling result; the generator G ₂ is used to extract high-dimensional visual features from the third upsampling result.

The generation module G ₀₂ in Fig. 2 for the second edited image

(e.g. the second edited image of 128×128×3 above

) process is as follows: the second automatic decoder E ₁ for the above-mentioned 128 * 128 * 3 second editing image

Perform high-dimensional visual feature recovery to obtain the second high-dimensional feature image with a size of 128×128 and a channel number of 32 (i.e. 128×128×32); after that, the second self-attention module 2013c performs 128×128×32 The second high-dimensional feature image, 128×1 sentence word feature M _T and 128×128×128 image local feature M _I perform feature fusion and splicing processing, and generate high-dimensional visual features with fine-grained sentence semantic information (i.e. the second edited image

corresponding sentence semantic information feature); the second editing image

The size of the corresponding sentence semantic information feature is 128×128× and the number of channels is 32 (ie 128×128××32); after that, the third upsampling module F ₂ pairs the sentence semantic information of 128×128××32 Perform feature fusion and upsampling processing to generate a visual feature image with a size of 256×256 and a channel number of 32 (ie 256×256×32) (ie the third upsampling result); the third upsampling module F ₂ Contains two residual networks and an upsampling layer, where the two residual networks are used to fuse the semantic information features of the sentence, and the upsampling layer is used to improve the spatial resolution of the image; finally, the generator G ₂ pair Three upsampling results are subjected to high-dimensional visual feature extraction and encoding processing to generate a third edited image with a size of 256×256 and a channel number of 3 (ie 256×256×3)

Optionally, the third upsampling result of 256×256×32 may first pass through the noisy affine combination module 2013b, and the noisy affine combination module 2013b combines the third upsampling result of 256×256×32 with the 128×128 ×128 image local feature M _I to fuse, and generate a visual feature image with a size of ₂₅₆ ×256 and a channel number of 32 (that is, 256×256×32); The visual feature image is subjected to high-dimensional visual feature extraction and encoding processing to obtain a third edited image with a size of 256×256 and a channel number of 3

(i.e. the third edited image of 256×256×3

).

The above-mentioned second self-attention module 2013c includes: self-attention layer F _Atten and affine combination module 2012a with noise, and self-attention layer F _Atten is used to carry out feature fusion to the second high-dimensional feature image and sentence word features; The affine combination module 2012a is used to perform feature fusion on the output result of the self-attention layer F _Atten and the splicing result of the second high-dimensional feature image, as well as the image local feature M _I . For example, the self-attention layer F _Atten performs feature fusion on the second high-dimensional feature image of 128×128×32 and sentence word features of 128×1, and stitches the fusion result with the second high-dimensional feature image; with noise The affine combination module 2013a performs feature fusion on the splicing result and the 128×128×128 image local feature M _I .

The third edited image of 256×256×3 output by generator G ₂

Users can directly observe the third edited image of 256×256×3

to determine whether it meets the requirements. If the user judges the third edited image of 256×256×3

If the requirements are met, the user will send a confirmation instruction to the electronic device, and the electronic device will convert the third edited image of 256×256×3 according to the confirmation instruction received from the user.

Input to the next-level network to continue processing or directly output to the user (that is, since the generation module G ₀₂ is the last generation module in the two cascaded generation modules, the output of the generator G ₂ in the generation module G ₀₂ information for the target edited image (i.e. the third edited image

)); if the user judges the third edited image of 256×256×3

If the requirements are not met, the user will send a rejection instruction to the electronic device, and the electronic device will convert the third edited image of 256×256×3 according to the rejection instruction received from the user.

Discard and re-use the image editing model to repeat the aforementioned process on the 128×128×3 target source image.

The core algorithm of the above-mentioned first noisy affine combination module 2012a is as follows:

F _NAC (h,M)＝h⊙W _rnd (M)+b _rnd (M), (1)

Wherein, F _NAC (h, M) represents the core algorithm function of the first noisy affine combination module 2012a, W _rnd (M)=W ₂ (W ₁ (M)+noise), b _rnd (M)=b ₂ (b ₁ (M)+noise), through W _rnd (M) and b _rnd (M) to calculate the weight and deviation of the regional feature M of the target source image (for example, the aforementioned image local feature M _I ), h represents the visual feature (For example, the splicing result of the output of the self-attention layer F _Atten and the first high-dimensional feature image), noise is Gaussian noise, and ⊙ is the Hadamard element product. The first noisy affine combination module 2012a can enhance the reliability of the sampling coding module and the generation module to edit images by introducing Gaussian noise, so that the sampling coding module and the generation module will not affect the reliability of the editing results due to the random noise in the image sex.

It is hereby explained that, except for the first noisy affine combination module 2012a, other noisy affine combination modules mentioned in this application (for example, the second noisy affine combination module, the third noisy affine combination module, etc. ) is the same as the core algorithm of the first noisy affine combination module 2012a, which will not be repeated in this application.

Exemplarily, the training method of the image editing model includes: using a preset loss function and a training set to train the initial model to obtain an image editing model; wherein, the preset loss function includes sub-functions corresponding to N sub-networks and N-1 automatic codec loss functions, the initial model includes N sub-networks, the N sub-networks are the initial models corresponding to the sampling encoding module and at least one generation module respectively; during the training process, when the output image of the i-th sub-network is not When the preset conditions are met, the initial model is trained using the sub-functions corresponding to the i+1-th to N-th sub-networks and the i-th to i+1-th automatic codec loss functions, 0≤i<N.

For example, as shown in FIG. 2 , the training method of the image editing model 201 includes: using MS-COCO and CUB200 datasets and a preset loss function to generate an adversarial form (for example, discriminator D ₀ , discriminator D ₁ and the discriminator D ₂ ) to train to obtain the image editing model 201; the above-mentioned initial model refers to the network model before the image editing model 201 is not trained; the initial model includes N=3 sub-networks and two (N-1=3- 1) Automatic codec, these three sub-networks are sampling encoding module G ₀₀ , generating module G ₀₁ and generating module G ₀₂ ; the two automatic codecs are autoencoder G ₀ -autodecoder E ₀ and automatic Encoder G ₁ -automatic decoder E ₁ ; the above-mentioned preset loss functions are loss functions of three self-networks and two loss functions of automatic codecs. In the training process, for example, N=3, when the output image of the first (i.e. i=1) sub-network (e.g., sampling encoding module G ₀₀ ) does not meet the preset conditions, the second (i.e. i+1= 1+1) to the sub-function corresponding to the third sub-network (that is, the loss function corresponding to the generation module G ₀₁ and the loss function corresponding to the generation module G ₀₂ ) and the first to the second automatic codec loss function (that is, the automatic Encoder G ₀ - the loss function of auto-decoder E ₀ and Auto-encoder G ₁ - the loss function of auto-decoder E ₁ ), trains the initial model.

When training the initial model, the following three preset loss functions are used for training. The three preset loss functions are as follows:

The first preset loss function:

The second preset loss function:

The third preset loss function:

In the above formula, b represents the number of sub-networks skipped during training, i=0,1,2; L _G,i represents the loss function of the corresponding level in the ManiGAN network, and L _G,0 is the location of the sampling encoding module G ₀₀ The loss function corresponding to the level, L _G,1 is the loss function corresponding to the level where the generation module G ₀₁ is located, and L _G,2 is the loss function corresponding to the level where the generation module G ₀₂ is located; the above

In the formula, I' represents the target edited image, and I'～ _PG,i(I,T) means that the target edited image I' is generated by the initial model given the training image I and the randomly selected target text T .

The loss function for the above automatic codec is

and

in,

is the loss function of autoencoder G ₀ -autodecoder E ₀ ,

is the loss function of autoencoder G ₁ -autodecoder E ₁ ; G _i consists of 3×3 convolutional layer and tanh activation function, E _i contains atanh function, 3×3 convolutional layer, Leaky ReLU layer and instance normalization Layer (Instance Normalization). The automatic codec composed of G _i and E _i feeds back the intermediate editing results to the user, and the user can directly control the intermediate results generated by different sub-networks to prevent wrong intermediate editing results from affecting the output of the entire image editing model 201 accuracy.

Exemplarily, taking the training of the initial model with the first preset loss function as an example, when the first preset loss function is training the initial model, no sub-network needs to be skipped. That is, input the image overall feature c _I and image local feature M _I of the training image, and the sentence overall feature c _T of the training text into the sample encoding module G ₀₀ , and the sample encoding module G ₀₀ outputs the first training edited image, if the user Judging that the first training edited image meets the user's requirements, the first training edited image is input into the generation module G ₀₁ ; the generation module G ₀₁ edits the first training edited image and the image local feature _MI of the training image and generates The second training editing image; if the _{user judges that the second training editing image meets the user's requirements, then the second training editing image is input into the generation module G 02} _; Feature _M1 performs editing processing and generates a third training edited image; if the user judges that the third training edited image meets the user's requirements, then the training of the initial model according to the first preset loss function is completed; if the user judges that the third training If the edited image does not meet the user's requirements, repeat the above training process. Since the autoencoder G ₀ -autodecoder E ₀ and autoencoder G ₁ -autodecoder E ₁ are distributed between the sampling encoding module G ₀₀ , the generating module G ₀₁ and the generating module G ₀₂ (see Figure 2), so , when training the initial model according to the first preset loss function, autoencoder G ₀ -autodecoder E ₀ and autoencoder G ₁ -autodecoder E ₁ are also trained.

Exemplarily, taking the second preset loss function to train the initial model as an example, when the second preset loss function trains the initial model, it is necessary to skip the sampling encoding module G ₀₀ to process the image overall feature c _I and image local features of the training image feature M _I , and the process of training the overall sentence feature c _T of the text. Instead, the training image is directly input into the generation module _G01 ; the generation module _G01 edits the training image and the image local feature _MI of the training image and generates a fourth training edited image; if the user judges that the fourth training edited image conforms to If the user requires, the fourth training edited image is input into the generation module _G02 ; the generation module _G02 edits the fourth training edited image and the image local feature _MI of the training image and generates the fifth training edited image; if the user If it is judged that the fifth training edited image meets the user's requirements, the training of the initial model according to the second preset loss function is completed; if the user judges that the fifth training edited image does not meet the user's requirements, the above training process is repeated. Since there is only autoencoder _G1 -autodecoder _E1 (see Figure 2) between the generation module _G01 and the generation module _G02 , when training the initial model according to the second preset loss function, it also trains Autoencoder G ₁ -Autodecoder E ₁ .

Exemplarily, taking the third preset loss function to train the initial model as an example, when the third preset loss function trains the initial model, it is necessary to skip the sampling encoding module G ₀₀ to process the image overall feature c _I and image local features of the training image feature M _I , and the process of the overall sentence feature c _T of the training text, and the process of editing the edited result output by the sample encoding module G ₀₀ and the image local feature M _I of the training image by the skip generation module G ₀₁ . Instead, the training image is directly input into the generation module _G02 ; the generation module _G02 edits the training image and the image local feature _MI of the training image and generates a sixth training edited image; if the user judges that the sixth training edited image If the user's requirements are met, the training of the initial model is completed according to the third preset loss function; if the user judges that the sixth training edited image does not meet the user's requirements, the above training process is repeated. Since the generating module G ₀₂ does not have an automatic codec, when training the initial model according to the third preset loss function, it is not necessary to train an automatic codec.

Due to the characteristics of the multi-layer confrontation network (that is, the above N sub-networks) that are difficult to train and converge, when training the initial model, the automatic codec is used to randomly skip the low-level confrontation network (for example, skip the sampling encoding module G ₀₀ ) instead of Prioritize the training of the high-level network (for example, the generation module G ₀₁ , or the generation module G ₀₂ ), for example, when b=1, skip the sample encoding module G ₀₀ , and give priority to training the generation module G ₀₁ and the generation module G ₀₂ ; when b = 2, the sampling encoding module G ₀₀ and the generating module G ₀₁ are skipped, and the above training method has the following advantages: a) Since the input of the next-level sub-network uses the original training image instead of the low-level sub-network when the low-level sub-network is skipped The editing results that do not meet the user's requirements (that is, wrong editing results) output by the network, therefore, can prevent the potential wrong editing results generated by the low-level sub-network from propagating to the high-level sub-network; b) Prioritizing the training of the high-level sub-network can bring more To better update the gradient, so that the convergence effect of the low-level sub-network can be better.

Exemplarily, the image editing model also includes: a detail correction model (Symmetrical Detail Correction Module, SCDM), which is used to modify the details of the target edited image; the detail correction model is used to process image local features, sentence word features and target edited images, Obtain the target correction image; the detail correction model includes: a first detail correction module, a second detail correction module, a fusion module and a generator, wherein the first detail correction module is used for image local features, first random noise and sentence word features Perform detail modification to obtain the first detail feature; the second detail correction module is used to modify the image local features, the second random noise and sentence word features corresponding to the target edited image to obtain the second detail feature; the fusion module is used to The first detail feature and the second detail feature are subjected to feature fusion; the generator is used to generate the target correction image according to the output result of the fusion module.

As shown in FIG. 2, the image editing model 201 also includes a detail revision model 2014, which is generated by the generation module _G02 according to the image local feature M _I of the target source image I and the sentence word feature M _T of the text description T. The target edited image is modified and corrected for details, and the target corrected image is obtained

The detail correction model 2014 includes: a first detail correction module 301, a second detail correction module 302, a fusion module F _fuse , and a generator G _0S , wherein the first detail correction module 301 performs an image local feature M _I , the first random noise noise1 and sentence word feature _MT carry out detailed modification, obtain _the first detailed feature, and input the first detailed feature to the fusion module F _fuse ; Three edited images

) corresponding to the image local features

The second random noise noise2 and the sentence word feature _MT carry out detail modification, obtain the second detail feature, and input the second detail feature to the fusion module F _fuse ; the fusion module F _fuse performs feature on the first detail feature and the second detail feature Fusion, and input the fusion result into the generator G _0S ; the generator G _0S encodes the output result of the fusion module F _fuse and generates the target correction image

Exemplarily, as shown in FIG. 3 , the first detail correction module 301 includes a fourth noisy affine combination module 3011, a fifth noisy affine combination module 3013, a sixth noisy affine combination module 3015, a second self- Attention module 3012, the first residual network 3014 and the first linear network 3016; wherein, the fourth noisy affine combination module 3011 performs feature fusion on the first random noise noise1 and image local features _MI to obtain the first fusion feature , and input the first fusion feature to the second self-attention module 3012; the second self-attention module 3012 performs feature fusion on the first fusion feature and sentence word feature _MT , and performs fusion results with the first random noise noise1 Splicing to obtain the splicing result; the splicing result is input into the fifth band noise affine combination module 3013; the fifth band noise affine combination module 3013 performs feature fusion on the splicing result and the image local feature _MI , and fuses the fusion result Input to the first residual network 3014; the first residual network 3014 performs visual feature extraction on the fusion result output by the fifth band noise affine combination module 3013, and inputs the visual feature extraction result to the sixth band noise affine combination Module 3015; the first linear network 3016 performs linear transformation on the image local feature _MI , and the linear transformation result is input into the sixth band noise affine combination module 3015; the sixth band noise affine combination module 3015 performs a linear transformation on the first residual network 3014 The output visual feature extraction result and the linear transformation result output by the first linear network 3016 are subjected to feature fusion to obtain the first detail feature x _I .

Exemplarily, as shown in FIG. 3 , the second detail correction module 302 includes a seventh noisy affine combination module 3021, an eighth noisy affine combination module 3023, a ninth noisy affine combination module 3025, a third self Attention module 3022, the second residual network 3024 and the second linear network 3026; wherein, the seventh band noise affine combination module 3021 is used for the image local features corresponding to the second random noise noise2 and the target edited image (for example, the generation module G ₀₂ The third edited image output

Corresponding image local features

) to perform feature fusion to obtain the first fusion feature, and input the first fusion feature to the third self-attention module 3022; the third self-attention module 3022 performs feature fusion to the first fusion feature and sentence word feature M _T , And splicing the feature fusion result with the second random noise noise2, and finally input the splicing result to the eighth noisy affine combination module 3023; Image local features (for example, the third edited image output by the generation module G ₀₂

Corresponding image local features

) to perform feature fusion, and input the fusion result to the second residual network 3024; the second residual network 3024 performs visual feature extraction on the fusion result of the output of the eighth noisy affine combination module, and the visual feature extraction result Input to the ninth band noise affine combination module 3025; the second linear network 3026 to the image local features corresponding to the target edited image (for example, the third edited image output by the generation module G ₀₂

Corresponding image local features

) to carry out linear transformation, and input the linear transformation result into the ninth band noise affine combination module 3025; the ninth band noise affine combination module 3025 extracts the visual feature results of the second residual network 3024 output and the second linear network The linear transformation result output by 3026 is subjected to feature fusion to obtain the second detail feature

The local features of the above image

is to use the VGG network to edit the third image

obtained by feature extraction.

The above fusion module F _fuse for the first detailed feature x _I output by the sixth noisy affine combination module 3015 and the second detailed feature output by the ninth noisy affine combination module 3025

Carry out feature fusion, and input the fusion result into the generator G _0S ; the generator G _0S encodes the output fusion result of the fusion module F _fuse and generates the target correction image

Optionally, the input of the first detail correction module 301 and the second detail correction module 302 can be exchanged, that is, the input of the first detail correction module 301 can be the target edited image (for example, the third output of the generation module _G02 edit image

) corresponding to the image local features

The second random noise noise2 and the sentence word feature _MT ; the input of the second detail modification module 302 may be the image local feature M _I , the first random noise noise1 and the sentence word feature _MT . Correspondingly, the output of the first detail correction module 301 is the second detail feature

The output of the first detail modification module 301 is the first detail feature x _I .

In Figure 3, the core algorithm of the fusion module F _fuse is as follows:

Among them, F _residual is the residual network, β ₁ and β ₂ are the first detailed feature x _I and the second detailed feature of the input of the linear network layer

A pair of attention weights resulting from the computation. The fusion module F _fuse can adaptively choose whether to modify the image features based on the target source image or based on the target edited image generated by the image editing model, so as to enhance the detail features of the modified image. The above generator G _0S converts the output of the fusion module F _fuse into the final target corrected image

Exemplarily, the training method of the detail correction model 2014 includes: training the generator of the detail correction model according to the conditional generator loss function, the unconditional generator loss function and the semantic comparison function; according to the conditional discriminator loss function and the unconditional The discriminator loss function trains the discriminator of the detail correction model; where the training data sets are MS-COCO data set and CUB200 data set; the conditional generator loss function is as follows:

Among them, L _Gs,0 is the conditional generator loss function, L _Ds,0 is the conditional discriminator loss function; D _S is the discriminator of the detail correction model in the image editing model;

refers to the target corrected image

It is generated by the detail correction model (SDCM) given the training image I and the description text T _I of the training image I; I~P _data means that the training image I is sampled from real data; T is randomly selected the target text of

is the semantic contrast function, the purpose is to make the target correction image

The description text T _I that is closer to the training image I than the randomly selected target text T. the above

The function is defined as:

Among them, ρ _c is the contrast control threshold; _Lcorre is the correlation function in ControlGAN, which is used to describe the matching degree between the training text and the target corrected image.

The above unconditional generator loss function is as follows:

Among them, L _Gs,1 is the unconditional generator loss function, L _Ds,1 is the unconditional discriminator loss function; D _S is the discriminator of the detail correction module in the image editing model;

refers to the target corrected image

is generated by a detail correction model (SDCM) given a training image I and a randomly selected target text T. The total loss functions of the generator and the discriminator in the detail revision model 2014 are respectively:

Among them, L _Gs is the total loss function of the generator, L _Ds is the total loss function of the discriminator; L _ControlGAN is the multimodal loss function of text and image, which is used to improve the matching degree of image editing results and target text; the above

Image local features

It is to use the VGG network to correct the image of the target

The feature extraction is obtained. The image local feature M _I is obtained by using the VGG network to extract the feature of the target source image I. The function L _DAMSM is the text and image similarity function defined in ManiGAN; the L _reg function is the image editing model. The defined regularization term is used to strengthen the modification effect,

During the training of the detail correction model, the training image I and the target correction image are randomly used

replace each other to speed up the training process of the detail correction model.

For ease of understanding, the overall flow of the text-based image editing method provided by the present application will be described as an example below with reference to FIG. 4 .

As shown in Figure 4, the configurable editing part means that the user can control the intermediate editing results output by the sampling encoding module G ₀₀ , generating module G ₀₁ and generating module G ₀₂ in the image editing model, and by judging the intermediate editing results Instead, some modules are selectively skipped. The intermediate editing results that do not meet the user's requirements are replaced by the target source image, so as to realize the editing process of the target source image according to the target text. In the following, through the sub-flow charts of (a) to (h) in Fig. 4, we will illustrate how the user randomly controls the intermediate editing results output by the sample encoding module G ₀₀ , the generation module G ₀₁ and the generation module G ₀₂ in the image editing model process.

In Figure 4, (a) complete process: the user inputs the image overall features and image local features of the target source image into the image editing model, as well as the sentence overall features and sentence word features of the target text, and the sampling coding module G in the image editing model _00, the output of the first edited image meets the requirements of the user; the sample encoding module G ₀₀ inputs the first edited image into the generation module G ₀₁ , and the generation module G ₀₁ processes the first edited image and outputs the second edited image that also meets the user’s requirements. Requirements; input the second edited image to the generation module G ₀₂ ; the third edited image generated by the generation module G ₀₂ to process the second edited image also meets the requirements of the user, and the generation module G ₀₂ converts the third edited image (ie, the target edited image image) is input into the detail correction model (SCDM), and the detail correction model (SCDM) performs detail correction on the third edited image to obtain the target correction image.

In Figure 4, (b) skip G ₀₀ : the user inputs the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text into the image editing model, and the sampling code in the image editing model The first edited image output by module G ₀₀ does not meet the user's requirements; the first edited image output by the sample encoding module G ₀₀ is discarded, and the target source image is used instead of the first edited image to be input to the generation module G ₀₁ , and the generation module G ₀₁ is The target source image is processed and the second edited image output meets the user's requirements; the second edited image is input to the generation module G ₀₂ ; the generation module G ₀₂ processes the second edited image and generates a third edited image that also meets the user's requirements , the generating module G ₀₂ inputs the third edited image (ie, the target edited image) into the detail correction model (SCDM), and the detail correction model (SCDM) performs detail correction on the third edited image to obtain the target corrected image.

In Fig. 4, (c) G ₀₀ and G ₀₁ are skipped: the user inputs the image overall features and image local features of the target source image into the image editing model, and the sentence overall features and sentence word features of the target text, and the image editing model The first edited image output by the sample encoding module G ₀₀ does not meet the user's requirements; the first edited image output by the sample encoding module G ₀₀ is discarded, and the target source image is used instead of the first edited image to be input to the generation module G ₀₁ , and the generation module G ₀₁ processes the target source image and outputs the second edited image that does not meet the user's requirements; the generation module G ₀₁ discards the second edited image, and uses the target source image instead of the second edited image to input to the generation module G ₀₂ ; generates Module G ₀₂ processes the target source image and generates a third edited image that meets the user's requirements. The generation module G ₀₂ inputs the third edited image (i.e., the target edited image) into the detail correction model (SCDM), and the detail correction model ( SCDM) to correct the details of the third edited image and obtain the target corrected image.

In Figure 4, (d) only use SCDM (i.e. skip G ₀₀ , G ₀₁ and G ₀₂ ): the user inputs the overall image features and local image features of the target source image, and the overall sentence features of the target text into the image editing model and sentence word features, the first edited image output by the sampling encoding module G ₀₀ in the image editing model does not meet the user's requirements; the first edited image output by the sampling encoding module G ₀₀ is discarded, and the target source image is used to replace the first edited image input In the generation module G ₀₁ , the second edited image output by the generation module G ₀₁ after processing the target source image also does not meet the user’s requirements; the generation module G ₀₁ discards the second edited image and replaces the second edited image with the target source image The image is input to the generation module G ₀₂ ; the generation module G ₀₂ processes the target source image and generates a third edited image that does not meet the user's requirements, and the generation module G ₀₂ discards the third edited image and replaces the third edited image with the target source image The edited image is input into the detail correction model (SCDM), which corrects the details of the target source image according to the image local features of the target source image and the sentence word features of the target text to obtain the target corrected image.

In Figure 4, (e) SCDM is skipped: the user inputs the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text into the image editing model, and the sample encoding module in the image editing model The first edited image output by G00 meets user requirements; the sampling encoding module G00 inputs the first edited image into the generation module G01, and the generation module G01 processes the first edited image and outputs a second edited image that also meets user requirements; The second edited image is input to the generating module G02; the generating module G02 processes the second edited image and generates a third edited image that also meets user requirements, and the third edited image is the target edited image.

In Fig. 4, (f) repeats G ₀₁ : the user inputs the image overall features and image local features of the target source image into the image editing model, and the sentence overall features and sentence word features of the target text, and the sample encoding module in the image editing model The first edited image output by G ₀₀ meets the user's requirements; the sample encoding module G ₀₀ inputs the first edited image to the generation module G ₀₁ , and the generation module G ₀₁ processes the first edited image and outputs the second edited image that does not conform to User requirements; the generation module G ₀₁ can be used repeatedly until the second edited image that meets the user's requirements is output; for example, the generation module G ₀₁ reprocesses according to the first edited result and generates a new second edited image that meets the requirements, and The new second edited image is input to the generation module _G02 ; the generation module _G02 processes the new second edited image and generates a third edited image that also meets the user's requirements, and the generation module _G02 converts the third edited image (i.e. The target edited image) is input into the detail correction model (SCDM), and the detail correction model (SCDM) performs detail correction on the third edited image to obtain the target corrected image.

In Fig. 4, (g) repeat G ₀₂ : the user inputs the image overall features and image local features of the target source image into the image editing model, as well as the sentence overall features and sentence word features of the target text, and the sample encoding module in the image editing model The first edited image output by G ₀₀ meets the requirements of the user; the sample encoding module G ₀₀ inputs the first edited image into the generation module G ₀₁ , and the generation module G ₀₁ processes the first edited image and outputs the second edited image that meets the user’s requirements. Requirements; the generation module _G02 processes the second edited image and generates a third edited image that does not meet the user's requirements, and the generation module _G02 can be used repeatedly until the third edited image that meets the user's requirements (ie, the target edited image) is output ; For example, the generation module G ₀₂ re-processes according to the second editing result and generates a new third edited image that meets the requirements, and inputs the new third edited image to the generation module G ₀₂ ; the generation module G ₀₂ converts the new third edited image The third edited image (ie, the target edited image) is input into the detail correction model (SCDM), and the detail correction model (SCDM) performs detail correction on the new third edited image to obtain the target corrected image.

In Figure 4, (h) repeat SCDM: the user inputs the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text into the image editing model, and the sampling coding module G in the image editing model The first edited image output by ₀₀ meets the user's requirements; the sample encoding module G ₀₀ inputs the first edited image into the generation module G ₀₁ , and the generation module G ₀₁ processes the first edited image and outputs the second edited image that meets the user's requirements ; The generation module G ₀₂ processes the second edited image and generates a third edited image that meets user requirements, and the generation module G ₀₂ inputs the third edited image (ie, the target edited image) into the detail correction model (SCDM), and the detail The correction model (SCDM) corrects the details of the third edited image and obtains the fourth edited image that does not meet the user's requirements. The detail correction model (SCDM) can be used repeatedly until the output of the fourth edited image that meets the user's requirements (ie, the target corrected image ); for example, the detail correction model (SCDM) reprocesses according to the third editing result and generates a new fourth edited image that meets the requirements, and the new fourth edited image is the target corrected image.

Compared with the prior art, the above-mentioned technical solution proposed by the application has the following beneficial effects:

Compared with the existing ManiGAN that edits the source image to be edited according to the text and directly outputs the editing results that may not meet the user's requirements, this application introduces a sampling encoding module into the existing ManiGAN to form an improved ManiGAN; The edited result (ie the first edited image) is output to facilitate the user to judge whether the intermediate edited result meets the requirements. If the intermediate edited result meets the requirements, the intermediate edited result will continue to be passed to at least one cascaded generation module; if it does not meet the requirements, it will not The intermediate result will continue to be delivered to at least one cascaded generation module, but the target source image will be used to replace the intermediate editing result and continue to be passed to at least one cascaded generation module. It can be seen that when the improved ManiGAN edits the target source image according to the target text, it can control the intermediate editing results, and promptly remove the intermediate editing results that do not meet the requirements, so as to prevent the output of the previous level from affecting the subsequent results. The accuracy of the first-level output results, so as to edit the target edited image more in line with the requirements for the user.

FIG. 5 shows a schematic structural diagram of an electronic device provided by the present application. The dotted line in Fig. 5 indicates that the unit or the module is optional. The electronic device 500 may be used to implement the methods described in the foregoing method embodiments. The electronic device 500 may be a terminal device or a server or a chip.

The electronic device 500 includes one or more processors 501, and the one or more processors 501 can support the electronic device 500 to implement the method in the method embodiment corresponding to FIG. 1 . Processor 501 may be a general purpose processor or a special purpose processor. For example, the processor 501 may be a central processing unit (central processing unit, CPU). The CPU can be used to control the electronic device 500, execute software programs, and process data of the software programs. The electronic device 500 may further include a communication unit 505, configured to implement input (reception) and output (send) of signals.

For example, the electronic device 500 may be a chip, and the communication unit 505 may be an input and/or output circuit of the chip, or the communication unit 505 may be a communication interface of the chip, and the chip may serve as a component of a terminal device.

For another example, the electronic device 500 may be a terminal device, and the communication unit 505 may be a transceiver of the terminal device, or the communication unit 505 may be a transceiver circuit of the terminal device.

The electronic device 500 may include one or more memories 502, on which a program 504 is stored, and the program 504 may be run by the processor 501 to generate instructions 503, so that the processor 501 executes the methods described in the above method embodiments according to the instructions 503. Optionally, data (such as the ID of the chip to be tested) may also be stored in the memory 502 . Optionally, the processor 501 may also read data stored in the memory 502, the data may be stored in the same storage address as the program 504, and the data may also be stored in a different storage address from the program 504.

The processor 501 and the memory 502 can be set separately, and can also be integrated together, for example, integrated on a system-on-chip (system on chip, SOC) of the terminal device.

For a specific manner in which the processor 501 executes the method for starting the burn-in test, reference may be made to relevant descriptions in the method embodiments.

It should be understood that the steps in the foregoing method embodiments may be implemented by logic circuits in the form of hardware or instructions in the form of software in the processor 501 . The processor 501 may be a CPU, a digital signal processor (digital signal processor, DSP), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, such as discrete gates, transistor logic devices or discrete hardware components .

The present application also provides a computer program product, which implements the method described in any method embodiment in the present application when the computer program product is executed by the processor 501 .

The computer program product can be stored in the memory 502 , such as a program 504 , and the program 504 is finally converted into an executable object file executable by the processor 501 through processes such as preprocessing, compiling, assembling and linking.

The present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a computer, the method described in any method embodiment in the present application is implemented. The computer program may be a high-level language program or an executable object program.

The computer readable storage medium is, for example, the memory 502 . The memory 502 may be a volatile memory or a nonvolatile memory, or, the memory 502 may include both a volatile memory and a nonvolatile memory. Among them, the non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electronically programmable Erase Programmable Read-Only Memory (Electrically EPROM, EEPROM) or Flash. The volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of illustration and not limitation, many forms of RAM are available such as Static Random Access Memory (Static RAM, SRAM), Dynamic Random Access Memory (Dynamic RAM, DRAM), Synchronous Dynamic Random Access Memory (Synchronous DRAM, SDRAM) ), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous connection dynamic random access memory (Synchlink DRAM, SLDRAM) And direct memory bus random access memory (Direct Rambus RAM, DRRAM).

Those skilled in the art can clearly understand that for the convenience and brevity of description, the specific working process and technical effects of the devices and equipment described above can refer to the corresponding processes and technical effects in the foregoing method embodiments, here No longer.

In the several embodiments provided in this application, the disclosed systems, devices and methods may be implemented in other ways. For example, some features of the method embodiments described above may be omitted, or not implemented. The device embodiments described above are only illustrative, and the division of units is only a logical function division. In actual implementation, there may be other division methods, and multiple units or components may be combined or integrated into another system. In addition, the coupling between the various units or the coupling between the various components may be direct coupling or indirect coupling, and the above coupling includes electrical, mechanical or other forms of connection.

The above-mentioned embodiments are only used to illustrate the technical solution of the present application, but not to limit it. Although the present application has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that: they can still modify the technical solutions described in the aforementioned embodiments, or perform equivalent replacements for some of the technical features, and these Any modification or replacement that does not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the embodiments of the present application shall be included within the protection scope of the present application.

Claims

A text-based image editing method, characterized in that the method comprises:

Obtain the image overall features and image local features of the target source image, as well as the sentence overall features and sentence word features of the target text;

Editing the target source image based on an image editing model according to the overall feature of the image, the partial feature of the image, the overall feature of the sentence and the feature of the sentence word, to obtain a target edited image;

Wherein, the image editing model includes: a sampling encoding module and at least one cascaded generation module; the processing of the target source image by the image editing model includes: using the sampling encoding module to analyze the overall characteristics of the image, Sampling and encoding processing is performed on the overall feature of the sentence and the local feature of the image to obtain a first edited image, and output the first edited image; in response to a user instruction, the first edited image, the local feature of the image and the The sentence word features are input into the at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image, or the target source image, the image local features and the sentence word features are input Perform high-dimensional visual feature extraction in the at least one cascaded generation module to obtain the target edited image.
The method according to claim 1, wherein the generating module comprises: a first automatic decoder, a first self-attention module, a second upsampling module and a second automatic encoder,

The first automatic decoder is used to restore the high-dimensional visual features of the input information to obtain the first high-dimensional feature image, and the input information is the first edited image or the output information of the generation module of the previous layer;

The first self-attention module is used to fuse and concatenate the first high-dimensional feature image and the sentence word features to obtain sentence semantic information features;

The second upsampling module is used to perform feature fusion and upsampling processing on the semantic information features of the sentence to obtain a second upsampling result;

The second automatic encoder is used to perform high-dimensional visual feature extraction on the second upsampling result to obtain output information, when the generation module is the last generation module in the at least one cascaded generation module , the output information is the target edited image.
The method according to claim 2, wherein the first self-attention module comprises: a self-attention layer and a first noisy affine combination module,

The self-attention layer is used to carry out feature fusion to the first high-dimensional feature image and the sentence word features;

The first noisy affine combination module is used to perform feature fusion on the output result of the self-attention layer, the splicing result of the first high-dimensional feature image, and the local features of the image.
The method according to any one of claims 1 to 3, wherein the sampling encoding module comprises: a first upsampling module and a first automatic encoder,

The first upsampling module is used to perform upsampling processing on the overall feature of the image, the overall feature of the sentence, and the local feature of the image to obtain a first upsampling result;

The first automatic encoder is used to generate a first edited image according to the first upsampling result.
The method according to claim 4, wherein the first upsampling module comprises: a plurality of identical upsampling layers, the second band noise affine combination module and the third band noise affine combination module,

The input of the first upsampling module is the overall feature of the sentence, the overall feature of the image and the local features of the image. In the two adjacent upsampling layers in the same upsampling layer, the latter upsampling The input of the layer is the output of the previous upsampling layer;

The second noisy affine combination module is located in the middle of any two upsampling layers in the plurality of identical upsampling layers, and is used to output the result and performing feature fusion on the local features of the image;

The third noisy affine combination module is used to perform feature fusion on the output result of the last upsampling layer in the plurality of identical upsampling layers and the local feature of the image.
The method according to any one of claims 1 to 3, wherein the image editing model further comprises: a detail modification model, configured to modify the details of the target edited image;

The detail correction model is used to process the image local features, the sentence word features and the target edited image to obtain the target corrected image;

The detail correction model includes: a first detail correction module, a second detail correction module, a fusion module and a generator, wherein the first detail correction module is used for local features of the image, the first random noise and the Sentence word features are modified in detail to obtain the first detailed feature;

The second detail correction module is used to modify the details of the image local features corresponding to the target edited image, the second random noise and the sentence word features to obtain the second detail features;

The fusion module is used to perform feature fusion on the first detailed features and the second detailed features;

The generator is used to generate the target correction image according to the output result of the fusion module.
The method according to claim 6, wherein the first detail correction module comprises a fourth noisy affine combination module, a fifth noisy affine combination module, a sixth noisy affine combination module, a second Self-attention module, first residual network and first linear network;

The fourth noisy affine combination module is used to perform feature fusion on the first random noise and the local features of the image to obtain a first fusion feature;

The second self-attention module is used to carry out feature fusion to the first fusion feature and the sentence word feature;

The fifth noisy affine combination module performs feature fusion on the output result of the second self-attention module, the mosaic result of the first random noise, and the local features of the image;

The first residual network is used to extract visual features from the output of the fifth noisy affine combination module;

The first linear network is used to linearly transform the local features of the image;

The sixth noisy affine combination module is used to perform feature fusion on the output result of the first residual network and the output result of the first linear network.
The method according to claim 6, wherein the second detail correction module comprises a seventh noisy affine combination module, an eighth noisy affine combination module, a ninth noisy affine combination module, a third Self-attention module, second residual network and second linear network;

The seventh noisy affine combination module is used to perform feature fusion on the second random noise and the image local features corresponding to the target edited image to obtain the first fusion feature;

The third self-attention module is used to carry out feature fusion to the first fusion feature and the sentence word feature;

The eighth noisy affine combination module performs feature fusion on the output result of the third self-attention module, the concatenation result of the second random noise, and the image local features corresponding to the target edited image;

The second residual network is used to extract visual features from the output of the eighth noisy affine combination module;

The second linear network is used to linearly transform the image local features corresponding to the target edited image;

The ninth noisy affine combination module is used to perform feature fusion on the output result of the second residual network and the output result of the second linear network.
The method according to claim 6, wherein the training method of the detail correction model comprises:

training a generator of said minutiae correction model based on a conditional generator loss function, an unconditional generator loss function, and a semantic contrast function;

The discriminator of the detail revision model is trained according to the conditional discriminator loss function and the unconditional discriminator loss function.
The method according to any one of claims 1 to 3, wherein the training method of the image editing model comprises:

Using a preset loss function and training set to train the initial model to obtain the image editing model;

Wherein, the preset loss function includes sub-functions corresponding to N sub-networks and loss functions of N-1 automatic codecs, the initial model includes N sub-networks, and the N sub-networks are the sampling Initial models respectively corresponding to the encoding module and at least one generating module;

During the training process, when the output image of the i-th sub-network does not meet the preset conditions, the sub-function corresponding to the i+1-th sub-network and the loss function of the i-th to i+1-th automatic codec are used , train the initial model, 0≤i<N.
An electronic device, characterized in that the device includes a processor and a memory, the memory is used to store a computer program, and the processor is used to call and run the computer program from the memory, so that the device performs A method as claimed in any one of claims 1 to 10.
A computer-readable storage medium, characterized in that, a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the processor is made to execute any one of claims 1 to 10. described method.