WO2023060434A1 - Text-based image editing method, and electronic device - Google Patents

Text-based image editing method, and electronic device

Info

Publication number
WO2023060434A1
WO2023060434A1 · PCT/CN2021/123272 · CN2021123272W
Authority
WO
WIPO (PCT)
Prior art keywords
image
module
feature
features
upsampling
Prior art date
Application number
PCT/CN2021/123272
Other languages
French (fr)
Chinese (zh)
Inventor
程俊
吴福祥
宋呈群
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Priority to PCT/CN2021/123272 priority Critical patent/WO2023060434A1/en
Publication of WO2023060434A1 publication Critical patent/WO2023060434A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present application relates to the field of image processing, in particular to a text-based image editing method and electronic equipment.
  • Text-based image editing is a technique for editing a source image according to a given text.
  • This technology is a research hotspot in the field of multimedia and has important application value.
  • The existing attention-based Manipulating Attention Generative Adversarial Network (ManiGAN) edits images according to a text description, but the image editing results it outputs often do not meet the user's requirements.
  • One purpose of the embodiments of the present application is to provide a text-based image editing method and an electronic device, aiming to solve the problem that the image editing results output by the existing ManiGAN do not meet user requirements.
  • In a first aspect, a text-based image editing method is provided, including: acquiring the overall image features and local image features of a target source image, and the overall sentence features and sentence word features of a target text; and, according to the overall image features, the local image features, the overall sentence features and the sentence word features, editing the target source image with an image editing model to obtain a target edited image.
  • The image editing model includes a sampling encoding module and at least one cascaded generation module.
  • The processing of the target source image by the image editing model includes: using the sampling encoding module to perform sampling and encoding on the overall image features, the overall sentence features and the local image features to obtain a first edited image, and outputting the first edited image; then, in response to a user instruction, either inputting the first edited image, the local image features and the sentence word features into the at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image, or inputting the target source image, the local image features and the sentence word features into the at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image.
  • the above method can be executed by a chip on an electronic device.
  • this application introduces a sampling encoding module into the existing ManiGAN to form an improved ManiGAN;
  • The intermediate edited result (i.e., the first edited image) is shown to the user: if it meets the user's requirements, it continues to be passed to the at least one cascaded generation module; if it does not meet the requirements, it is not passed on, and the target source image is used in its place and passed to the at least one cascaded generation module instead.
  • When the improved ManiGAN edits the target source image according to the target text, the user can therefore control the intermediate editing results and promptly discard intermediate results that do not meet the requirements, preventing the output of a previous stage from degrading the accuracy of subsequent stages and ultimately producing a target edited image that better matches the user's requirements.
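  • As a rough illustration of this acceptance/skip control flow, the following is a minimal Python sketch; run_sampling_encoder, generation_modules and user_accepts are assumed helper names used only for exposition and are not defined in the patent.

```python
def edit_image(source_image, image_features, sentence_features, user_accepts):
    """Hedged sketch of the improved ManiGAN control flow described above.

    `user_accepts` is a callback that shows an intermediate result to the
    user and returns True if it meets the user's requirements.
    """
    # Sampling encoding module G_00 produces the first edited image.
    intermediate = run_sampling_encoder(image_features, sentence_features)

    # If the intermediate result is rejected, fall back to the source image.
    if not user_accepts(intermediate):
        intermediate = source_image

    # Cascaded generation modules (G_01, G_02, ...) refine the result.
    for generate in generation_modules:
        candidate = generate(intermediate, image_features, sentence_features)
        # A rejected stage output is replaced by the source image (the skip case).
        intermediate = candidate if user_accepts(candidate) else source_image

    return intermediate
```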
  • The generation module includes a first automatic decoder, a first self-attention module, a second upsampling module and a second automatic encoder. The first automatic decoder is used to restore the high-dimensional visual features of its input information, which is either the first edited image or the output information of the previous-level generation module, to obtain a first high-dimensional feature image. The first self-attention module is used to fuse the first high-dimensional feature image with the sentence word features to obtain sentence semantic information features. The second upsampling module is used to perform feature fusion and upsampling on the sentence semantic information features to obtain a second upsampling result. The second automatic encoder is used to perform high-dimensional visual feature extraction on the second upsampling result to obtain the output information.
  • When the generation module is the last generation module in the at least one cascaded generation module, the output information is the target edited image.
  • The first self-attention module includes a self-attention layer and a first noisy affine combination module. The self-attention layer is used to fuse the first high-dimensional feature image with the sentence word features; the first noisy affine combination module is used to fuse the concatenation of the self-attention layer's output with the first high-dimensional feature image, together with the local image features.
  • The first noisy affine combination module is introduced because, by injecting Gaussian noise, it can enhance the reliability of the generation module when editing images, preventing random noise present in the image from degrading the reliability of the editing results.
  • The sampling encoding module includes a first upsampling module and a first automatic encoder. The first upsampling module is used to perform upsampling on the overall image features, the overall sentence features and the local image features to obtain a first upsampling result; the first automatic encoder is configured to generate the first edited image from the first upsampling result.
  • The first upsampling module includes a plurality of identical upsampling layers, a second noisy affine combination module and a third noisy affine combination module. The input of the first upsampling module is the overall sentence features, the overall image features and the local image features; for any two adjacent upsampling layers among the plurality of identical upsampling layers, the input of the latter upsampling layer is the output of the former upsampling layer. The second noisy affine combination module is located between any two of the plurality of identical upsampling layers and is used to fuse the output of the earlier of those two upsampling layers with the local image features. The third noisy affine combination module is used to fuse the output of the last of the plurality of identical upsampling layers with the local image features.
  • Introducing the second noisy affine combination module and the third noisy affine combination module into the first upsampling module can further enhance the visual features of the output results of different upsampling layers in the first upsampling module.
  • The image editing model further includes a detail correction model, configured to modify the details of the target edited image. The detail correction model processes the local image features, the sentence word features and the target edited image to obtain a target corrected image. The detail correction model includes a first detail correction module, a second detail correction module, a fusion module and a generator: the first detail correction module is used to perform detail correction on the local image features, a first random noise and the sentence word features to obtain first detail features; the second detail correction module is used to perform detail correction on the local image features corresponding to the target edited image, a second random noise and the sentence word features to obtain second detail features; the fusion module is used to fuse the first detail features and the second detail features; and the generator is used to generate the target corrected image from the output of the fusion module.
  • Adding the detail correction model to the above image editing model can further modify and enhance the details of the target edited image output by the image editing model, so as to obtain a high-resolution target corrected image.
  • The first detail correction module includes a fourth noisy affine combination module, a fifth noisy affine combination module, a sixth noisy affine combination module, a second self-attention module, a first residual network and a first linear network.
  • The fourth noisy affine combination module is used to fuse the first random noise with the local image features to obtain a first fusion feature.
  • The second self-attention module is used to fuse the first fusion feature with the sentence word features.
  • The fifth noisy affine combination module fuses the concatenation of the second self-attention module's output with the first random noise, together with the local image features.
  • The first residual network is used to extract visual features from the output of the fifth noisy affine combination module.
  • The first linear network is used to perform a linear transformation on the local image features.
  • The sixth noisy affine combination module is used to fuse the output of the first residual network with the output of the first linear network.
  • Adding a plurality of noisy affine combination modules to the first detail correction module can enhance the reliability of the detail correction model.
  • The second detail correction module includes a seventh noisy affine combination module, an eighth noisy affine combination module, a ninth noisy affine combination module, a third self-attention module, a second residual network and a second linear network.
  • The seventh noisy affine combination module is used to fuse the second random noise with the local image features corresponding to the target edited image to obtain a first fusion feature.
  • The third self-attention module is used to fuse that first fusion feature with the sentence word features.
  • The eighth noisy affine combination module fuses the concatenation of the third self-attention module's output with the second random noise, together with the local image features corresponding to the target edited image; the second residual network is used to extract visual features from the output of the eighth noisy affine combination module; the second linear network is used to perform a linear transformation on the local image features corresponding to the target edited image; and the ninth noisy affine combination module is used to fuse the output of the second residual network with the output of the second linear network.
  • Adding a plurality of noisy affine combination modules to the second detail correction module can enhance the reliability of the detail correction model.
  • The training method of the detail correction model includes: training the generator of the detail correction model according to a conditional generator loss function, an unconditional generator loss function and a semantic contrast function; and training the discriminator of the detail correction model according to a conditional discriminator loss function and an unconditional discriminator loss function.
  • Training the generator of the detail correction model according to the conditional generator loss function, the unconditional generator loss function and the semantic contrast function makes the image editing result produced by the generator (i.e., the target edited image) conform better to the content described by the target text and to the user's requirements.
  • Training the discriminator of the detail correction model according to the conditional discriminator loss function and the unconditional discriminator loss function makes the discriminator's recognition results more accurate.
  • The training method of the image editing model includes: training an initial model using a preset loss function and a training set to obtain the image editing model. The preset loss function includes sub-functions corresponding to N sub-networks and loss functions of N-1 automatic codecs; the initial model includes the N sub-networks, which are the initial models corresponding to the sampling encoding module and the at least one generation module, respectively. During training, when the output image of the i-th sub-network does not meet the preset conditions, the initial model is trained using the sub-functions corresponding to the (i+1)-th to N-th sub-networks and the loss functions of the i-th to (i+1)-th automatic codecs, where 0 < i < N.
  • In training the N sub-networks, earlier sub-networks whose outputs do not meet the requirements are skipped via the automatic codecs, and the later sub-networks are trained with priority. Because the target source image, rather than an intermediate editing result (for example, the first edited image), is used when skipping, potentially erroneous outputs of earlier sub-networks are prevented from propagating to later sub-networks.
  • Training the later sub-networks first also provides better update gradients to the earlier sub-networks, so that the earlier sub-networks converge better.
  • In another aspect, an electronic device is provided, including a module for performing any one of the methods of the first aspect.
  • In another aspect, a computer-readable storage medium is provided, which stores a computer program; when the computer program is executed by a processor, the processor performs any one of the methods of the first aspect.
  • Fig. 1 is a schematic flow chart of a text-based image editing method in an embodiment of the present invention
  • Fig. 2 is a schematic structural diagram of an image editing model in an embodiment of the present invention.
  • Fig. 3 is a schematic structural diagram of a detail correction model in an embodiment of the present invention.
  • Fig. 4 is a schematic diagram of the processing process of the target source image by the image editing model in the embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of the present invention.
  • references to "one embodiment” or “some embodiments” or the like in the specification of the present application means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Accordingly, appearances of the phrases “in one embodiment,” “in some embodiments,” “in other embodiments,” “in other embodiments,” etc., in various places in this specification are not necessarily all References to the same embodiment mean “one or more but not all” unless specifically stated otherwise.
  • the terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless specifically stated otherwise.
  • Text-based image editing is a research hotspot in the field of multimedia and has important application value.
  • ManiGAN is used for image editing based on the text description content of the source image to be edited.
  • the existing ManiGAN cannot process the intermediate results of image editing when editing the source image according to the text description content, which often leads to the image editing results output by ManiGAN not meeting user requirements.
  • this application introduces an automatic codec into the multi-level generation confrontation network of the existing ManiGAN.
  • The automatic codec can output the intermediate editing results to the user so that the user can conveniently review them.
  • the user can directly control the edited result of the intermediate output of ManiGAN, so as to obtain the target edited image that is more in line with the user's requirements.
  • To this end, this application proposes a text-based image editing method. As shown in Figure 1, the method is executed by an electronic device and includes:
  • the electronic device acquires the overall image features and local image features of the target source image, and the overall sentence features and sentence word features of the target text.
  • The target source image comes from the MS-COCO (Microsoft Common Objects in Context) dataset and the CUB200 dataset. The target text is text that records how the user wants to edit the target source image. For example, if the target source image shows a bird and the user wants to dye the bird's feathers red and its head yellow, these editing requirements can be recorded in the target text in the form of words; that is, the specific content of the target text is to dye the bird's feathers red and its head yellow.
  • The overall image features are features that represent the entire image and describe global properties such as colour and shape, for example colour features, texture features and shape features; the local image features are local expressions of the image features and reflect the local characteristics of the image.
  • A Visual Geometry Group (VGG) network can be used to extract the overall image features and local image features of the target source image, and a special recurrent neural network (RNN), namely a long short-term memory (LSTM) network, can be used to extract the overall sentence features and sentence word features of the target text.
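  • As a rough illustration of this feature-extraction step, the following PyTorch-style sketch extracts a global image vector and a local feature map with a pretrained VGG, and sentence/word features with an LSTM. The layer choices, feature dimensions and variable names are assumptions for exposition, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained VGG backbone (the exact weights argument depends on the torchvision version).
vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()

def image_features(img):                      # img: (1, 3, 128, 128)
    """Return (global feature vector, local feature map) from VGG activations."""
    feats = img
    for layer in list(vgg)[:16]:              # stop at an intermediate conv block (assumed)
        feats = layer(feats)                  # local feature map, e.g. (1, 256, 32, 32)
    global_feat = feats.mean(dim=(2, 3))      # pooled global descriptor
    return global_feat, feats

class TextEncoder(nn.Module):
    """Bidirectional LSTM text encoder producing word features and a sentence feature."""
    def __init__(self, vocab_size, emb_dim=128, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, tokens):                # tokens: (1, seq_len)
        out, (h, _) = self.lstm(self.embed(tokens))
        word_feats = out                      # (1, seq_len, 2*hidden): sentence word features
        sent_feat = torch.cat([h[-2], h[-1]], dim=-1)  # (1, 2*hidden): overall sentence feature
        return sent_feat, word_feats
```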
  • The image editing model includes a sampling encoding module and at least one cascaded generation module.
  • The processing of the target source image by the image editing model includes: using the sampling encoding module to sample and encode the overall image features, the overall sentence features and the local image features to obtain the first edited image and output it; then, in response to a user instruction, either inputting the first edited image, the local image features and the sentence word features into the at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image, or inputting the target source image, the local image features and the sentence word features into the at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image.
  • For example, the target source image I is a picture of a bird with a size of 128×128 and 3 channels; the feathers on the bird's belly are white, its beak is gray, and the feathers on its head and neck are gray and white. The VGG network performs feature extraction on the 128×128 target source image I and obtains the overall image feature c_I and the local image feature M_I corresponding to the target source image I, where the overall image feature c_I is a 128×1 column vector and the local image feature M_I has a size of 128×128 with 128 channels. As another example, the specific content of the target text T is that the belly feathers turn yellow, the beak turns yellow, and the feathers on the head and neck turn gray and yellow; the LSTM network performs feature extraction on the target text T and obtains the overall sentence feature c_T and the sentence word feature M_T corresponding to the target text T, where the overall sentence feature c_T is a 128×1 column vector.
  • As shown in FIG. 2, the image editing model 201 includes a sampling encoding module G_00 and at least one cascaded generation module (e.g., G_01 and G_02). The sampling encoding module G_00 includes a first upsampling module F_0 and a first autoencoder G_0: the first upsampling module F_0 performs upsampling on the overall image feature c_I and the overall sentence feature c_T to obtain the first upsampling result, and the first autoencoder G_0 encodes the first upsampling result and generates the first edited image.
  • The first upsampling module F_0 includes a plurality of identical upsampling layers, a second noisy affine combination module 2011a and a third noisy affine combination module 2011b. The input of the first upsampling module F_0 is the overall sentence feature c_T, the overall image feature c_I and the local image feature M_I; for any two adjacent upsampling layers among the plurality of identical upsampling layers, the input of the latter upsampling layer is the output of the former upsampling layer. The second noisy affine combination module 2011a is located between two of the identical upsampling layers and fuses the output of the earlier of those two layers with the local image feature M_I; the third noisy affine combination module 2011b fuses the output of the last of the identical upsampling layers with the local image feature M_I.
  • the above multiple identical upsampling layers may be three upsampling layers or four upsampling layers, which is not limited in this application, and the user may set the specific number of multiple upsampling layers according to actual needs.
  • This application takes the four upsampling layers included in the first upsampling module F_0 in FIG. 2 only as an example to illustrate the process of feature extraction and feature fusion performed on the overall sentence feature c_T, the overall image feature c_I and the local image feature M_I.
  • The second noisy affine combination module 2011a can be located between any two of the four upsampling layers in the first upsampling module F_0. For example, in FIG. 2, the first three of the four upsampling layers of F_0 perform upsampling on the input 128×1 overall image feature c_I and 128×1 overall sentence feature c_T and output internal features with a size of 32×32 and 64 channels (i.e., 32×32×64 internal features). In this example the second noisy affine combination module 2011a is located between the third and fourth upsampling layers of F_0; it enhances the visual features of the upsampling result output by the third upsampling layer (that is, the 32×32×64 internal features output by the first three upsampling layers) according to the local image feature M_I, obtaining a first enhanced upsampling result of 32×32×64. The first enhanced upsampling result is then upsampled by the fourth upsampling layer, which outputs 64×64×32 visual features, and the third noisy affine combination module 2011b further enhances the upsampling result output by the fourth upsampling layer (that is, the output of the last of the identical upsampling layers) according to the local image feature M_I and outputs enhanced 64×64×32 visual features. Finally, the first autoencoder G_0 performs feature extraction and encoding on the enhanced 64×64×32 visual features output by the third noisy affine combination module 2011b and outputs a first edited image with a size of 64×64 and 3 channels (i.e., a 64×64×3 first edited image).
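  • The following PyTorch-style sketch mirrors this walkthrough of the sampling encoding module G_00 (upsampling layers with noisy affine combination modules in between, followed by the first autoencoder). It is a minimal illustration under the shapes stated above; the layer definitions, including the NoisyAffineCombination block, are assumptions and not the patent's precise architecture.

```python
import torch
import torch.nn as nn

class NoisyAffineCombination(nn.Module):
    """Illustrative stand-in for a noisy affine combination module: it predicts a
    per-pixel scale/shift from the local image features and injects Gaussian noise."""
    def __init__(self, feat_ch, cond_ch):
        super().__init__()
        self.scale = nn.Conv2d(cond_ch, feat_ch, 3, padding=1)
        self.shift = nn.Conv2d(cond_ch, feat_ch, 3, padding=1)

    def forward(self, h, local_feats):
        cond = nn.functional.interpolate(local_feats, size=h.shape[-2:])
        noise = torch.randn_like(h)                      # Gaussian noise injection
        return h * self.scale(cond) + self.shift(cond) + noise

class SamplingEncoder(nn.Module):
    """G_00 sketch: three upsampling stages -> ACM 2011a -> fourth stage -> ACM 2011b -> G_0."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128 + 128, 4 * 4 * 64)       # fuse c_I (128) and c_T (128)
        self.up_1_to_3 = nn.Sequential(                  # 4x4 -> 32x32, 64 channels
            *[nn.Sequential(nn.Upsample(scale_factor=2),
                            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
              for _ in range(3)])
        self.acm_a = NoisyAffineCombination(64, 128)     # module 2011a
        self.up_4 = nn.Sequential(nn.Upsample(scale_factor=2),
                                  nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())
        self.acm_b = NoisyAffineCombination(32, 128)     # module 2011b
        self.g0 = nn.Sequential(nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())  # first autoencoder G_0

    def forward(self, c_I, c_T, M_I):                    # c_I, c_T: (B,128); M_I: (B,128,128,128)
        h = self.fc(torch.cat([c_I, c_T], dim=1)).view(-1, 64, 4, 4)
        h = self.acm_a(self.up_1_to_3(h), M_I)           # 32x32x64 internal features, enhanced
        h = self.acm_b(self.up_4(h), M_I)                # 64x64x32 visual features, enhanced
        return self.g0(h)                                # 64x64x3 first edited image
```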
  • The user judges whether the 64×64×3 first edited image meets the requirements. If it does, the 64×64×3 first edited image is input into the at least one cascaded generation module for subsequent processing, as shown at (I) in FIG. 2. If it does not, the 64×64×3 first edited image is discarded and replaced by the 128×128×3 target source image I, which is input to the generation module G_01 for subsequent editing, as shown at (II) in FIG. 2. In that case the first edited image generated by the sampling encoding module G_00 is discarded, so the network structure of G_00 is drawn with a dotted box to indicate that, because the first edited image does not meet the user's requirements, the step in which G_00 generates the first edited image is skipped, and the 128×128×3 target source image I directly replaces the unsatisfactory first edited image as the input to the generation module G_01 for image editing. This prevents a first edited image that does not meet the user's requirements (i.e., an erroneous first edited image) from continuing to propagate backwards.
  • If the user determines that the 64×64×3 first edited image meets the requirements, the user sends a confirmation instruction to the electronic device, and the electronic device, according to the received confirmation instruction, inputs the 64×64×3 first edited image into the at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image, as shown at (I) in FIG. 2. If the user judges that the 64×64×3 first edited image does not meet the requirements, the user sends a rejection instruction to the electronic device, and the electronic device, according to the received rejection instruction, discards the first edited image and inputs the target source image in its place.
  • the above-mentioned at least one cascaded generation module may be one generation module or two generation modules, which is not limited in this application, and the number of generation modules can be set by the user according to actual needs.
  • This application takes the image editing model in Figure 2, which includes two generation modules, only as an example to illustrate the process in which the two generation modules cooperate with the sampling encoding module to further edit the intermediate editing result (for example, the first edited image) and generate the target edited image.
  • The generation module G_01 includes a first automatic decoder E_0, a first self-attention module 2012c, a second upsampling module F_1 and a second automatic encoder G_1. The first automatic decoder E_0 is used to restore the high-dimensional visual features of the first edited image (that is, of the input information) to obtain a first high-dimensional feature image; the first self-attention module 2012c fuses and concatenates the first high-dimensional feature image with the sentence word features M_T to obtain sentence semantic information features; the second upsampling module F_1 performs feature fusion and upsampling on the sentence semantic information features to obtain a second upsampling result; and the second autoencoder G_1 performs high-dimensional visual feature extraction on the second upsampling result to obtain the output information.
  • The processing of the first edited image (e.g., the 64×64×3 first edited image above) by the generation module G_01 is as follows. The first automatic decoder E_0 performs high-dimensional visual feature restoration on the 64×64×3 first edited image to obtain a first high-dimensional feature image with a size of 64×64 and 32 channels (i.e., 64×64×32). The first self-attention module 2012c then performs feature fusion and concatenation on the 64×64×32 first high-dimensional feature image, the 128×1 sentence word feature M_T and the 128×128×128 local image feature M_I, generating 64×64×32 high-dimensional visual features carrying fine-grained sentence semantic information (i.e., the sentence semantic information features). The second upsampling module F_1 performs feature fusion and upsampling on the 64×64×32 sentence semantic information features to generate a visual feature image with a size of 128×128 and 32 channels (i.e., 128×128×32), which is the second upsampling result. The second upsampling module F_1 includes two residual networks and an upsampling layer, where the two residual networks fuse the sentence semantic information features and the upsampling layer increases the spatial resolution of the image. The second autoencoder G_1 performs high-dimensional visual feature extraction and encoding on the second upsampling result to generate a second edited image with a size of 128×128 and 3 channels (i.e., 128×128×3). Optionally, the second upsampling result may first pass through the noisy affine combination module 2012b, which fuses the second upsampling result with the 128×128×128 local image feature M_I and generates a 128×128×32 visual feature image; the second automatic encoder G_1 then performs high-dimensional visual feature extraction and encoding on that 128×128×32 visual feature image and outputs the 128×128×3 second edited image.
  • The first self-attention module 2012c includes a self-attention layer F_Atten and the first noisy affine combination module 2012a. The self-attention layer F_Atten is used to fuse the first high-dimensional feature image with the sentence word features, and the first noisy affine combination module 2012a is used to fuse the concatenation of F_Atten's output with the first high-dimensional feature image, together with the local image feature M_I. Concretely, the self-attention layer F_Atten fuses the 64×64×32 first high-dimensional feature image with the 128×1 sentence word features and concatenates the fusion result with the first high-dimensional feature image; the first noisy affine combination module 2012a then fuses that concatenation result with the 128×128×128 local image feature M_I.
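  • A minimal PyTorch-style sketch of one such generation module (decoder, self-attention with a noisy affine combination, upsampling and encoder) is given below, following the shapes in the walkthrough above. It reuses the illustrative NoisyAffineCombination block from the earlier sketch; all layer choices are assumptions, not the patent's exact design.

```python
class GenerationModule(nn.Module):
    """Sketch of generation module G_01: E_0 -> self-attention + ACM 2012a -> F_1 -> ACM 2012b -> G_1."""
    def __init__(self, ch=32, word_dim=128, local_ch=128):
        super().__init__()
        self.decoder = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())   # E_0: image -> features
        self.word_proj = nn.Linear(word_dim, ch)                                  # project word features
        self.acm_attn = NoisyAffineCombination(2 * ch, local_ch)                  # module 2012a on concat
        self.fuse = nn.Sequential(                                                # F_1: fusion blocks
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.upsample = nn.Upsample(scale_factor=2)                               # 64x64 -> 128x128
        self.acm_out = NoisyAffineCombination(ch, local_ch)                       # module 2012b (optional)
        self.encoder = nn.Sequential(nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh())   # G_1: 128x128x3 output

    def forward(self, prev_image, M_I, M_T):        # M_T assumed (B, L, word_dim)
        h = self.decoder(prev_image)                # first high-dimensional feature image
        B, C, H, W = h.shape
        words = self.word_proj(M_T)                 # (B, L, C)
        attn = torch.softmax(h.flatten(2).transpose(1, 2) @ words.transpose(1, 2), dim=-1)
        context = (attn @ words).transpose(1, 2).view(B, C, H, W)   # word context per location
        h = self.acm_attn(torch.cat([h, context], dim=1), M_I)      # fuse concat(h, attention) with M_I
        h = self.upsample(self.fuse(h))                             # second upsampling result
        h = self.acm_out(h, M_I)
        return self.encoder(h)                                      # second edited image (128x128x3)
```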
  • The 128×128×3 second edited image output by the second autoencoder G_1 can be observed directly by the user to determine whether it meets the requirements. If the user determines that the 128×128×3 second edited image meets the requirements, the user sends a confirmation instruction to the electronic device, and the electronic device, according to the received confirmation instruction, inputs the 128×128×3 second edited image into the next cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image. If the user judges that the 128×128×3 second edited image does not meet the requirements, the user sends a rejection instruction to the electronic device, and the electronic device, according to the received rejection instruction, discards the second edited image generated by the generation module G_01 and uses the target source image in its place as the input to the next generation module.
  • The next cascaded generation module G_02 after the generation module G_01 includes a second automatic decoder E_1, a second self-attention module 2013c, a third upsampling module F_2 and a generator G_2. The second automatic decoder E_1 is used to restore the high-dimensional visual features of the second edited image (i.e., the output information of the previous-level generation module G_01) to obtain a second high-dimensional feature image; the second self-attention module 2013c fuses and concatenates the second high-dimensional feature image with the sentence word feature M_T and outputs the sentence semantic information features corresponding to the second edited image; the third upsampling module F_2 performs feature fusion and upsampling on those sentence semantic information features to obtain a third upsampling result; and the generator G_2 extracts high-dimensional visual features from the third upsampling result.
  • The processing of the second edited image (e.g., the 128×128×3 second edited image above) by the generation module G_02 in FIG. 2 is as follows. The second automatic decoder E_1 performs high-dimensional visual feature restoration on the 128×128×3 second edited image to obtain a second high-dimensional feature image with a size of 128×128 and 32 channels (i.e., 128×128×32). The second self-attention module 2013c then performs feature fusion and concatenation on the second high-dimensional feature image, the 128×1 sentence word feature M_T and the 128×128×128 local image feature M_I, generating high-dimensional visual features with fine-grained sentence semantic information, namely the sentence semantic information features corresponding to the second edited image, with a size of 128×128 and 32 channels (i.e., 128×128×32). The third upsampling module F_2 performs feature fusion and upsampling on the 128×128×32 sentence semantic information features to generate a visual feature image with a size of 256×256 and 32 channels (i.e., 256×256×32), which is the third upsampling result. The third upsampling module F_2 contains two residual networks and an upsampling layer, where the two residual networks fuse the sentence semantic information features and the upsampling layer increases the spatial resolution of the image. Finally, the generator G_2 performs high-dimensional visual feature extraction and encoding on the third upsampling result to generate a third edited image with a size of 256×256 and 3 channels (i.e., 256×256×3). Optionally, the 256×256×32 third upsampling result may first pass through the noisy affine combination module 2013b, which fuses it with the 128×128×128 local image feature M_I and generates a 256×256×32 visual feature image; that visual feature image is then subjected to high-dimensional visual feature extraction and encoding to obtain the 256×256×3 third edited image.
  • The second self-attention module 2013c includes a self-attention layer F_Atten and a noisy affine combination module 2013a. The self-attention layer F_Atten is used to fuse the second high-dimensional feature image with the sentence word features, and the noisy affine combination module 2013a is used to fuse the concatenation of F_Atten's output with the second high-dimensional feature image, together with the local image feature M_I. Concretely, the self-attention layer F_Atten fuses the 128×128×32 second high-dimensional feature image with the 128×1 sentence word features and concatenates the fusion result with the second high-dimensional feature image; the noisy affine combination module 2013a then fuses that concatenation result with the 128×128×128 local image feature M_I.
  • The 256×256×3 third edited image output by the generator G_2 can be observed directly by the user to determine whether it meets the requirements. If the user judges that the 256×256×3 third edited image meets the requirements, the user sends a confirmation instruction to the electronic device, and the electronic device, according to the received confirmation instruction, inputs the 256×256×3 third edited image into the next-level network for further processing or outputs it directly to the user; that is, since the generation module G_02 is the last of the two cascaded generation modules, the output information of the generator G_2 in G_02 is the target edited image (i.e., the third edited image). If the user judges that the 256×256×3 third edited image does not meet the requirements, the user sends a rejection instruction to the electronic device, and the electronic device, according to the received rejection instruction, discards the 256×256×3 third edited image and re-uses the image editing model to repeat the aforementioned process on the 128×128×3 target source image.
  • The core algorithm of the first noisy affine combination module 2012a fuses the input visual features with the local image features through an affine combination into which Gaussian noise is introduced.
  • By introducing Gaussian noise, the first noisy affine combination module 2012a can enhance the reliability with which the sampling encoding module and the generation module edit images, so that random noise present in the image does not compromise the reliability of their editing results.
  • The core algorithm of the other noisy affine combination modules mentioned in this application (for example, the second noisy affine combination module, the third noisy affine combination module, etc.) is the same as that of the first noisy affine combination module 2012a and is not repeated here.
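  • The exact formula for this core algorithm appears as an equation image in the original publication. Purely as an orientation, an affine combination with Gaussian noise in the spirit of ManiGAN's affine combination module could take the form below, where h is the input visual feature, v is the local image feature, W(·) and b(·) are learned scale and shift functions, and ε is Gaussian noise; all symbols here are illustrative assumptions rather than the patent's notation.

    h' = W(v) ⊙ h + b(v) + ε,   with ε ~ N(0, σ²I)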
  • The training method of the image editing model includes: training an initial model using a preset loss function and a training set to obtain the image editing model. The preset loss function includes sub-functions corresponding to N sub-networks and N-1 automatic codec loss functions; the initial model includes the N sub-networks, which are the initial models corresponding to the sampling encoding module and the at least one generation module, respectively. During training, when the output image of the i-th sub-network does not meet the preset conditions, the initial model is trained using the sub-functions corresponding to the (i+1)-th to N-th sub-networks and the loss functions of the i-th to (i+1)-th automatic codecs, where 0 < i < N.
  • For example, the training method of the image editing model 201 includes: using the MS-COCO and CUB200 datasets and a preset loss function to train in an adversarial manner (for example, with discriminators D_0, D_1 and D_2) to obtain the image editing model 201.
  • The initial model refers to the network model before the image editing model 201 has been trained.
  • The two automatic codecs are the autoencoder G_0 / auto-decoder E_0 pair and the autoencoder G_1 / auto-decoder E_1 pair.
  • The preset loss function consists of loss functions for the three sub-networks and two automatic codec loss functions; that is, N = 3.
  • The first and second automatic codec loss functions are the loss function of autoencoder G_0 / auto-decoder E_0 and the loss function of autoencoder G_1 / auto-decoder E_1, respectively.
  • the three preset loss functions are as follows:
  • In the first preset loss function, L_G,i represents the loss function of the corresponding level in the ManiGAN network: L_G,0 is the loss function corresponding to the level of the sampling encoding module G_00, L_G,1 is the loss function corresponding to the level of the generation module G_01, and L_G,2 is the loss function corresponding to the level of the generation module G_02. I' denotes the target edited image, and I' ~ P_G,i(I, T) means that the target edited image I' is generated by the initial model given the training image I and the randomly selected target text T.
  • The automatic codec loss functions are the loss function of autoencoder G_0 / auto-decoder E_0 and the loss function of autoencoder G_1 / auto-decoder E_1.
  • G_i consists of a 3×3 convolutional layer and a tanh activation function, and E_i contains an atanh function, a 3×3 convolutional layer, a Leaky ReLU layer and an instance normalization layer.
  • The automatic codec composed of G_i and E_i feeds the intermediate editing results back to the user, so the user can directly control the intermediate results generated by the different sub-networks and prevent erroneous intermediate editing results from affecting the output accuracy of the entire image editing model 201.
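  • Based on the layer composition just described, a minimal PyTorch-style sketch of one such G_i / E_i codec pair might look as follows; channel counts and the exact layer ordering are assumptions.

```python
class CodecPair(nn.Module):
    """Sketch of an automatic codec pair: G_i maps features to an RGB image,
    E_i maps the image back to features (layer composition as described above)."""
    def __init__(self, feat_ch=32):
        super().__init__()
        # G_i: 3x3 convolution followed by tanh, producing a 3-channel image in [-1, 1].
        self.G = nn.Sequential(nn.Conv2d(feat_ch, 3, 3, padding=1), nn.Tanh())
        # E_i: atanh (applied in encode_to_features), 3x3 convolution, LeakyReLU, instance norm.
        self.E = nn.Sequential(nn.Conv2d(3, feat_ch, 3, padding=1),
                               nn.LeakyReLU(0.2),
                               nn.InstanceNorm2d(feat_ch))

    def decode_to_image(self, feats):
        return self.G(feats)

    def encode_to_features(self, image):
        # atanh undoes the tanh of G_i before the convolutional layers of E_i.
        return self.E(torch.atanh(image.clamp(-0.999, 0.999)))
```

  • Under this sketch, a natural codec loss would be a reconstruction term such as the distance between E_i(G_i(h)) and h, though the patent's exact codec loss formulas are not reproduced here.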
  • When the initial model is trained with the first preset loss function, no sub-network needs to be skipped. That is, the overall image feature c_I and local image feature M_I of the training image and the overall sentence feature c_T of the training text are input into the sampling encoding module G_00, which outputs a first training edited image. If the user judges that the first training edited image meets the requirements, it is input into the generation module G_01; G_01 edits the first training edited image together with the local image feature M_I of the training image and generates a second training edited image. If the user judges that the second training edited image meets the requirements, it is input into the generation module G_02; G_02 edits the second training edited image together with the local image feature M_I of the training image and generates a third training edited image. If the user judges that the third training edited image meets the requirements, the training of the initial model according to the first preset loss function is complete; if not, the above training process is repeated.
  • The codec pairs G_0/E_0 and G_1/E_1 are distributed between the sampling encoding module G_00, the generation module G_01 and the generation module G_02 (see Figure 2), so when the initial model is trained according to the first preset loss function, G_0/E_0 and G_1/E_1 are also trained.
  • When the initial model is trained with the second preset loss function, the training image is input directly into the generation module G_01; G_01 edits the training image together with its local image feature M_I and generates a fourth training edited image. If the user judges that the fourth training edited image meets the requirements, it is input into the generation module G_02; G_02 edits the fourth training edited image together with the local image feature M_I of the training image and generates a fifth training edited image. If the user judges that the fifth training edited image meets the requirements, the training of the initial model according to the second preset loss function is complete; if not, the above training process is repeated.
  • When the initial model is trained with the third preset loss function, the training image is input directly into the generation module G_02; G_02 edits the training image together with its local image feature M_I and generates a sixth training edited image. If the user judges that the sixth training edited image meets the requirements, the training of the initial model according to the third preset loss function is complete; if not, the above training process is repeated. Since the generation module G_02 is not followed by an automatic codec, no automatic codec needs to be trained when the initial model is trained according to the third preset loss function.
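  • The following Python sketch summarises this staged training scheme (skipping rejected earlier sub-networks and training the later ones with priority). The losses are referenced only abstractly, since their exact forms are given as equations in the original publication; all function and method names here are illustrative assumptions.

```python
def train_step(training_image, training_text, subnetworks, codec_losses, user_accepts):
    """One illustrative pass over the N cascaded sub-networks (N == len(subnetworks)).

    Sub-network 0 stands for the sampling encoding module G_00; the rest stand for
    the cascaded generation modules. If the output of sub-network i is rejected,
    the training image replaces it, and the losses from the later levels (plus the
    bridging codec loss, if any) drive the update.
    """
    losses = []
    current = None
    for i, subnet in enumerate(subnetworks):
        stage_input = current if current is not None else training_image
        output = subnet(stage_input, training_text)
        if user_accepts(output):
            losses.append(subnet.level_loss(output, training_text))  # level loss L_G,i (assumed method)
            current = output
        else:
            # Skip this level: discard its output and fall back to the source image,
            # including the codec loss bridging levels i and i+1 when one exists.
            if i < len(codec_losses):
                losses.append(codec_losses[i](training_image))
            current = None
    return sum(losses)
```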
  • The image editing model also includes a detail correction model (Symmetrical Detail Correction Module, SCDM), which is used to modify the details of the target edited image. The detail correction model processes the local image features, the sentence word features and the target edited image to obtain the target corrected image. The detail correction model includes a first detail correction module, a second detail correction module, a fusion module and a generator: the first detail correction module performs detail correction on the local image features, a first random noise and the sentence word features to obtain first detail features; the second detail correction module performs detail correction on the local image features corresponding to the target edited image, a second random noise and the sentence word features to obtain second detail features; the fusion module fuses the first detail features with the second detail features; and the generator generates the target corrected image from the output of the fusion module.
  • As shown in FIG. 3, the image editing model 201 also includes a detail correction model 2014, which, according to the local image feature M_I of the target source image I and the sentence word feature M_T of the text description T, performs detail modification and correction on the target edited image generated by the generation module G_02 to obtain the target corrected image. The detail correction model 2014 includes a first detail correction module 301, a second detail correction module 302, a fusion module F_fuse and a generator G_0S. The first detail correction module 301 performs detail correction on the local image feature M_I, the first random noise noise1 and the sentence word feature M_T, obtains the first detail features, and inputs them to the fusion module F_fuse. The second detail correction module 302 performs detail correction on the local image features corresponding to the target edited image (for example, the local image features corresponding to the third edited image output by G_02), the second random noise noise2 and the sentence word feature M_T, obtains the second detail features, and inputs them to the fusion module F_fuse. The fusion module F_fuse fuses the first detail features with the second detail features and inputs the fusion result into the generator G_0S.
  • The first detail correction module 301 includes a fourth noisy affine combination module 3011, a fifth noisy affine combination module 3013, a sixth noisy affine combination module 3015, a second self-attention module 3012, a first residual network 3014 and a first linear network 3016. The fourth noisy affine combination module 3011 fuses the first random noise noise1 with the local image feature M_I to obtain a first fusion feature and inputs it to the second self-attention module 3012; the second self-attention module 3012 fuses the first fusion feature with the sentence word feature M_T and concatenates the fusion result with the first random noise noise1 to obtain a concatenation result, which is input into the fifth noisy affine combination module 3013; the fifth noisy affine combination module 3013 fuses the concatenation result with the local image feature M_I and inputs the fusion result to the first residual network 3014. The first residual network 3014 extracts visual features from that result, the first linear network 3016 performs a linear transformation on the local image feature M_I, and the sixth noisy affine combination module 3015 fuses the output of the first residual network 3014 with the output of the first linear network 3016 to obtain the first detail features.
  • The second detail correction module 302 includes a seventh noisy affine combination module 3021, an eighth noisy affine combination module 3023, a ninth noisy affine combination module 3025, a third self-attention module 3022, a second residual network 3024 and a second linear network 3026. The seventh noisy affine combination module 3021 fuses the second random noise noise2 with the local image features corresponding to the target edited image (for example, the local image features corresponding to the third edited image output by the generation module G_02) to obtain a first fusion feature and inputs it to the third self-attention module 3022; the third self-attention module 3022 fuses the first fusion feature with the sentence word feature M_T, concatenates the fusion result with the second random noise noise2, and inputs the concatenation result to the eighth noisy affine combination module 3023. The eighth noisy affine combination module 3023 fuses the concatenation result with the local image features corresponding to the target edited image, the second residual network 3024 extracts visual features from its output, the second linear network 3026 performs a linear transformation on the local image features corresponding to the target edited image, and the ninth noisy affine combination module 3025 fuses the output of the second residual network 3024 with the output of the second linear network 3026 to obtain the second detail features.
  • The fusion module F_fuse performs feature fusion on the first detail feature x_I output by the sixth noisy affine combination module 3015 and the second detail feature output by the ninth noisy affine combination module 3025, and inputs the fusion result into the generator G_0S; the generator G_0S encodes the fusion result output by F_fuse and generates the target corrected image.
  • The inputs of the first detail correction module 301 and the second detail correction module 302 can be exchanged. That is, the input of the first detail correction module 301 can be the local image features corresponding to the target edited image (for example, the third edited image output by G_02), the second random noise noise2 and the sentence word feature M_T, while the input of the second detail correction module 302 can be the local image feature M_I, the first random noise noise1 and the sentence word feature M_T.
  • In that swapped case the output of the first detail correction module 301 is the second detail feature; otherwise the output of the first detail correction module 301 is the first detail feature x_I.
  • In the fusion module, F_residual denotes the residual network, and the pair of attention weights is computed from the first detail feature x_I and the second detail feature that are input to the linear network layer. With these weights, the fusion module F_fuse can adaptively choose whether to base the modification of the image features on the target source image or on the target edited image generated by the image editing model, thereby enhancing the detail features of the corrected image. The generator G_0S then converts the output of the fusion module F_fuse into the final target corrected image.
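  • As a rough illustration of the symmetric structure and the weighted fusion just described, the sketch below combines two detail branches with softmax attention weights. It reuses the illustrative NoisyAffineCombination block from the earlier sketch, and every layer and weight shown is an assumption rather than the patent's exact design.

```python
class DetailBranch(nn.Module):
    """One detail-correction branch (simplified): noise fused with local features,
    coarse word-feature fusion, residual refinement combined with transformed local features."""
    def __init__(self, local_ch=128, feat_ch=32, word_dim=128):
        super().__init__()
        self.noise_proj = nn.Conv2d(1, feat_ch, 1)
        self.acm_in = NoisyAffineCombination(feat_ch, local_ch)    # fuses noise features with local features
        self.word_proj = nn.Linear(word_dim, feat_ch)
        self.residual = nn.Sequential(nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(feat_ch, feat_ch, 3, padding=1))
        self.linear = nn.Conv2d(local_ch, feat_ch, 1)              # stand-in for the linear network
        self.acm_out = NoisyAffineCombination(feat_ch, local_ch)

    def forward(self, local_feats, noise, M_T):                    # noise: (B,1,H,W); M_T: (B,L,word_dim)
        h = self.acm_in(self.noise_proj(noise), local_feats)
        h = h + self.word_proj(M_T.mean(dim=1))[:, :, None, None]  # coarse stand-in for self-attention fusion
        lin = self.linear(nn.functional.interpolate(local_feats, size=h.shape[-2:]))
        return self.acm_out(self.residual(h) + lin, local_feats)   # branch detail features

class DetailCorrectionModel(nn.Module):
    """SCDM sketch: two symmetric detail branches, attention-weighted fusion, generator G_0S."""
    def __init__(self, local_ch=128, feat_ch=32):
        super().__init__()
        self.branch_source = DetailBranch(local_ch, feat_ch)       # module 301 (source-image branch)
        self.branch_edited = DetailBranch(local_ch, feat_ch)       # module 302 (edited-image branch)
        self.attn = nn.Linear(2 * feat_ch, 2)                      # produces the pair of attention weights
        self.generator = nn.Sequential(nn.Conv2d(feat_ch, 3, 3, padding=1), nn.Tanh())  # G_0S

    def forward(self, M_I, M_I_edited, M_T, noise1, noise2):
        x_source = self.branch_source(M_I, noise1, M_T)            # first detail features x_I
        x_edited = self.branch_edited(M_I_edited, noise2, M_T)     # second detail features
        pooled = torch.cat([x_source.mean((2, 3)), x_edited.mean((2, 3))], dim=1)
        w = torch.softmax(self.attn(pooled), dim=1)                # adaptive choice between the two branches
        fused = w[:, 0, None, None, None] * x_source + w[:, 1, None, None, None] * x_edited
        return self.generator(fused)                               # target corrected image
```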
  • The training method of the detail correction model 2014 includes: training the generator of the detail correction model according to the conditional generator loss function, the unconditional generator loss function and the semantic contrast function; and training the discriminator of the detail correction model according to the conditional discriminator loss function and the unconditional discriminator loss function. The training datasets are the MS-COCO dataset and the CUB200 dataset. In these loss functions, the symbols are as follows.
  • L_Gs,0 is the conditional generator loss function, and L_Ds,0 is the conditional discriminator loss function.
  • D_S is the discriminator of the detail correction model in the image editing model.
  • I ~ P_data means that the training image I is sampled from real data, and T is the randomly selected target text.
  • The semantic contrast function is intended to make the target corrected image closer to T_I, the description text of the training image I, than to the randomly selected target text T; a contrast control threshold governs this comparison.
  • L_corre is the correlation function from ControlGAN, used to describe the degree of matching between the training text and the target corrected image.
  • L_Gs,1 is the unconditional generator loss function, and L_Ds,1 is the unconditional discriminator loss function; D_S is again the discriminator of the detail correction model in the image editing model.
  • L_Gs is the total loss function of the generator, and L_Ds is the total loss function of the discriminator.
  • L_ControlGAN is the multimodal text-image loss function, used to improve the degree of matching between the image editing results and the target text.
  • The local image features corresponding to the target corrected image are obtained by performing feature extraction on the target corrected image with the VGG network, and the local image feature M_I is obtained by performing feature extraction on the target source image I with the VGG network.
  • The function L_DAMSM is the text-image similarity function defined in ManiGAN, and L_reg is a regularization term defined in the image editing model to strengthen the modification effect. During the training of the detail correction model, the training image I and the target corrected image are randomly substituted for each other to speed up the training process of the detail correction model.
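  • The exact loss formulas are given as equations in the original publication and are not reproduced above. Purely as an orientation, conditional and unconditional adversarial terms of the kind used in ManiGAN-style models commonly take standard forms such as the following; these are assumed illustrative forms, not the patent's definitions.

    L_Gs,0 = -E_{I' ~ P_G}[ log D_S(I', T) ],      L_Gs,1 = -E_{I' ~ P_G}[ log D_S(I') ]
    L_Ds,0 = -E_{I ~ P_data}[ log D_S(I, T_I) ] - E_{I' ~ P_G}[ log(1 - D_S(I', T)) ]
    L_Ds,1 = -E_{I ~ P_data}[ log D_S(I) ] - E_{I' ~ P_G}[ log(1 - D_S(I')) ]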
  • The configurable editing part means that the user can control the intermediate editing results output by the sampling encoding module G_00, the generation module G_01 and the generation module G_02 in the image editing model and, by judging those intermediate results, selectively skip some modules. Intermediate editing results that do not meet the user's requirements are replaced by the target source image, thereby realizing the process of editing the target source image according to the target text.
  • The generation module G_02 inputs the third edited image (i.e., the target edited image) into the detail correction model (SCDM), and the detail correction model performs detail correction on the third edited image to obtain the target corrected image.
  • (b) G_00 is skipped: the user inputs the overall and local image features of the target source image and the overall sentence and sentence word features of the target text into the image editing model, but the first edited image output by the sampling encoding module G_00 does not meet the user's requirements. The first edited image is therefore discarded, and the target source image is input to the generation module G_01 in its place. The generation module G_01 processes the target source image and outputs a second edited image that meets the user's requirements; the second edited image is input to the generation module G_02, which processes it and generates a third edited image that also meets the user's requirements. The generation module G_02 then inputs the third edited image (i.e., the target edited image) into the detail correction model (SCDM), which performs detail correction on it to obtain the target corrected image.
  • (c) G 00 and G 01 are skipped: the user inputs the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text, into the image editing model; the first edited image output by the sampling encoding module G 00 does not meet the user's requirements, so it is discarded and the target source image is input to the generation module G 01 in its place; the generation module G 01 processes the target source image and outputs a second edited image that also does not meet the user's requirements, so the second edited image is discarded and the target source image is input to the generation module G 02 in its place; the generation module G 02 processes the target source image and generates a third edited image that meets the user's requirements; the generation module G 02 then inputs the third edited image (i.e., the target edited image) into the detail correction model (SCDM), which performs detail correction on it to obtain the target corrected image.
  • (f) G 01 is repeated: the user inputs the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text, into the image editing model; the first edited image output by the sampling encoding module G 00 meets the user's requirements, so the sampling encoding module G 00 inputs the first edited image to the generation module G 01 ; the generation module G 01 processes the first edited image but outputs a second edited image that does not meet the user's requirements; the generation module G 01 can be used repeatedly until a second edited image that meets the user's requirements is output; for example, the generation module G 01 reprocesses the first editing result and generates a new second edited image that meets the requirements, and the new second edited image is input to the generation module G 02 ; the generation module G 02 processes the new second edited image and generates a third edited image that also meets the user's requirements, and then inputs the third edited image (i.e., the target edited image) into the detail correction model (SCDM), which performs detail correction on it to obtain the target corrected image.
  • (g) G 02 is repeated: the user inputs the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text, into the image editing model; the first edited image output by the sampling encoding module G 00 meets the user's requirements; the sampling encoding module G 00 inputs the first edited image into the generation module G 01 , which processes it and outputs a second edited image that meets the user's requirements; the generation module G 02 processes the second edited image but generates a third edited image that does not meet the user's requirements; the generation module G 02 can be used repeatedly until a third edited image that meets the user's requirements (i.e., the target edited image) is output; for example, the generation module G 02 reprocesses the second editing result and generates a new third edited image that meets the requirements; the generation module G 02 then inputs the new third edited image (i.e., the target edited image) into the detail correction model (SCDM), which performs detail correction on it to obtain the target corrected image.
  • the detail correction model (SCDM) is repeated: the first edited image output by the sampling encoding module G 00 meets the user's requirements; the sampling encoding module G 00 inputs the first edited image into the generation module G 01 , which processes it and outputs a second edited image that meets the user's requirements; the generation module G 02 processes the second edited image and generates a third edited image that meets the user's requirements; the generation module G 02 inputs the third edited image (i.e., the target edited image) into the detail correction model (SCDM), but the detail correction performed on the third edited image yields a fourth edited image that does not meet the user's requirements; the detail correction model can be used repeatedly until a fourth edited image that meets the user's requirements (i.e., the target corrected image) is output; for example, the detail correction model reprocesses the third editing result and generates a new fourth edited image that meets the requirements, and this new fourth edited image is the target corrected image.
  • this application introduces a sampling encoding module into the existing ManiGAN to form an improved ManiGAN, instead of directly outputting an editing result that may not meet the user's requirements as the existing ManiGAN does.
  • the sampling encoding module outputs the intermediate editing result (i.e., the first edited image) so that the user can judge whether it meets the requirements; if it does, the intermediate editing result is passed on to the at least one cascaded generation module; if it does not, the intermediate editing result is not passed on, and the target source image is passed to the at least one cascaded generation module in its place.
  • when the improved ManiGAN edits the target source image according to the target text, it can therefore control the intermediate editing results and promptly remove those that do not meet the requirements, preventing an unsatisfactory output of one stage from degrading the accuracy of the next stage's output, so that a target edited image that better meets the user's requirements is produced.
  • a first noisy affine combination module is introduced into the first self-attention module; by introducing Gaussian noise, the first noisy affine combination module enhances the reliability with which the generation module edits images, preventing random noise present in the image from degrading the reliability of the editing results.
  • Introducing the second noisy affine combination module and the third noisy affine combination module into the first upsampling module can further enhance the visual features of the output results of different upsampling layers in the first upsampling module.
  • Adding the detail correction model to the above image editing model can further modify and enhance the details of the target editing image output by the image editing model, so as to obtain a high-resolution target correction image.
  • Adding a plurality of noisy affine combination modules to the first detail correction module can enhance the reliability of the detail correction model.
  • Adding a plurality of noisy affine combination modules to the second detail correction module can enhance the reliability of the detail correction model.
  • training the generator of the detail correction model with the conditional generator loss function, the unconditional generator loss function and the semantic contrast function makes the image editing result produced by the generator (i.e., the target edited image) conform better to the content described by the target text and to the user's requirements.
  • Training the discriminator of the detail modification model according to the conditional discriminator loss function and the unconditional discriminator loss function can make the recognition result of the discriminator more accurate.
  • when the N sub-networks are trained, the automatic codecs are used to skip a preceding sub-network whose output does not meet the requirements and to train the following sub-networks first; because the target source image, rather than an intermediate editing result (for example, the first edited image), is used when skipping, potentially erroneous outputs of a preceding sub-network are prevented from propagating to the following sub-networks.
  • training the later sub-networks first also provides better update gradients to the earlier sub-networks, so that the earlier sub-networks converge better.
  • FIG. 5 shows a schematic structural diagram of an electronic device provided by the present application.
  • the dotted line in Fig. 5 indicates that the unit or the module is optional.
  • the electronic device 500 may be used to implement the methods described in the foregoing method embodiments.
  • the electronic device 500 may be a terminal device or a server or a chip.
  • the electronic device 500 includes one or more processors 501, and the one or more processors 501 can support the electronic device 500 to implement the method in the method embodiment corresponding to FIG. 1 .
  • Processor 501 may be a general purpose processor or a special purpose processor.
  • the processor 501 may be a central processing unit (central processing unit, CPU).
  • the CPU can be used to control the electronic device 500, execute software programs, and process data of the software programs.
  • the electronic device 500 may further include a communication unit 505, configured to implement input (reception) and output (send) of signals.
  • the electronic device 500 may be a chip, and the communication unit 505 may be an input and/or output circuit of the chip, or the communication unit 505 may be a communication interface of the chip, and the chip may serve as a component of a terminal device.
  • the electronic device 500 may be a terminal device, and the communication unit 505 may be a transceiver of the terminal device, or the communication unit 505 may be a transceiver circuit of the terminal device.
  • the electronic device 500 may include one or more memories 502, on which a program 504 is stored, and the program 504 may be run by the processor 501 to generate instructions 503, so that the processor 501 executes the methods described in the above method embodiments according to the instructions 503.
  • data (such as the ID of the chip to be tested) may also be stored in the memory 502 .
  • the processor 501 may also read data stored in the memory 502, the data may be stored in the same storage address as the program 504, and the data may also be stored in a different storage address from the program 504.
  • the processor 501 and the memory 502 can be set separately, and can also be integrated together, for example, integrated on a system-on-chip (system on chip, SOC) of the terminal device.
  • the steps in the foregoing method embodiments may be implemented by logic circuits in the form of hardware or instructions in the form of software in the processor 501 .
  • the processor 501 may be a CPU, a digital signal processor (digital signal processor, DSP), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, such as discrete gates, transistor logic devices or discrete hardware components .
  • the present application also provides a computer program product, which implements the method described in any method embodiment in the present application when the computer program product is executed by the processor 501 .
  • the computer program product can be stored in the memory 502 , such as a program 504 , and the program 504 is finally converted into an executable object file executable by the processor 501 through processes such as preprocessing, compiling, assembling and linking.
  • the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a computer, the method described in any method embodiment in the present application is implemented.
  • the computer program may be a high-level language program or an executable object program.
  • the computer readable storage medium is, for example, the memory 502 .
  • the memory 502 may be a volatile memory or a nonvolatile memory, or, the memory 502 may include both a volatile memory and a nonvolatile memory.
  • the non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electronically programmable Erase Programmable Read-Only Memory (Electrically EPROM, EEPROM) or Flash.
  • the volatile memory can be Random Access Memory (RAM), which acts as external cache memory.
  • the RAM may be, for example, static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink dynamic random access memory (SLDRAM) or direct Rambus random access memory (Direct Rambus RAM).
  • the disclosed systems, devices and methods may be implemented in other ways. For example, some features of the method embodiments described above may be omitted, or not implemented.
  • the device embodiments described above are only illustrative, and the division of units is only a logical function division. In actual implementation, there may be other division methods, and multiple units or components may be combined or integrated into another system.
  • the coupling between the various units or the coupling between the various components may be direct coupling or indirect coupling, and the above coupling includes electrical, mechanical or other forms of connection.

Abstract

A text-based image editing method, and an electronic device. The method comprises: acquiring a global image feature and a local image feature of a target source image, and a global sentence feature and a sentence word feature of target text; and according to the global image feature, the local image feature, the global sentence feature and the sentence word feature, editing the target source image on the basis of an image editing model, so as to obtain a target edited image, wherein the image editing model comprises a sampling and encoding module and at least one cascaded generation module. An intermediate editing result of an image editing process can be output by using an encoding module, and if the intermediate editing result does not meet user requirements, the image editing model can adjust the intermediate editing result, and inputs the adjusted intermediate result into the at least one cascaded generation module, thereby solving the problem of an image editing result that is output by a ManiGAN failing to meet the user requirements.

Description

一种基于文本的图像编辑方法和电子设备A text-based image editing method and electronic device 技术领域technical field
本申请涉及图像处理领域,尤其涉及一种基于文本的图像编辑方法和电子设备。The present application relates to the field of image processing, in particular to a text-based image editing method and electronic equipment.
背景技术Background technique
众所周知,基于文本的图像编辑是一种根据给定文本编辑源图像的技术,该技术是多媒体领域的研究热点并具有重要的应用价值。现有基于注意的生成对抗网络(ManipulatingAttention Generative Adversarial Network,ManiGAN)用于根据文本描述内容对待编辑的源图像进行图像编辑。但是,ManiGAN在根据输入文本对源图像进行编辑时,输出的图像编辑结果往往不符合用户要求。As we all know, text-based image editing is a technology to edit source images according to a given text. This technology is a research hotspot in the field of multimedia and has important application value. The existing attention-based Generative Adversarial Network (Manipulating Attention Generative Adversarial Network, ManiGAN) is used for image editing based on the text description content of the source image to be edited. However, when ManiGAN edits the source image according to the input text, the output image editing results often do not meet the user's requirements.
因此,如何使得ManiGAN输出的图像编辑结果符合用户要求是当前急需解决的问题。Therefore, how to make the image editing results output by ManiGAN meet user requirements is an urgent problem to be solved.
技术问题technical problem
本申请实施例的目的之一在于:一种基于文本的图像编辑方法和电子设备,旨在解决现有ManiGAN输出的图像编辑结果不符合用户要求的问题。One of the purposes of the embodiments of the present application is: a text-based image editing method and electronic equipment, aiming at solving the problem that the image editing results output by the existing ManiGAN do not meet user requirements.
技术解决方案technical solution
本申请实施例采用的技术方案是:The technical scheme that the embodiment of the present application adopts is:
第一方面,提供了一种基于文本的图像编辑方法,包括:获取目标源图像的图像整体特征和图像局部特征,以及目标文本的句子整体特征和句子词特征;根据所述图像整体特征、所述图像局部特征、所述句子整体特征和所述句子词特征,基于图像编辑模型对所述目标源图像进行编辑,得到目标编辑图像;其中,所述图像编辑模型包括:采样编码模块和至少一个级联的生成模块;所述图像编辑模型对所述目标源图像的处理过程包括:利用所述采样编码模块对所述图像整体特征、所述句子整体特征和所述图像局部特征进行采样编码处理,得到第一编辑图像,并输出所述第一编辑图像;响应于用户指令,将所述第一编辑图像、所述图像局部特征和所述句子词特征输入所述至少一个级联的生成模块中进行高维视觉特征提取,得到所述目标编辑图像,或者将所述目标源图像、所述图像局部特征和所述句子词特征输入到所述至少一个级联的生成模块中进行高维视觉特征提取,得到所述目标编辑图像。In the first aspect, a text-based image editing method is provided, including: acquiring the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text; according to the overall image features, the The image local feature, the sentence overall feature and the sentence word feature, edit the target source image based on the image editing model to obtain the target editing image; wherein, the image editing model includes: a sampling encoding module and at least one A cascaded generation module; the processing of the target source image by the image editing model includes: using the sampling encoding module to perform sampling encoding processing on the overall feature of the image, the overall feature of the sentence, and the local feature of the image , obtain a first edited image, and output the first edited image; in response to a user instruction, input the first edited image, the image local features and the sentence word features into the at least one cascaded generation module Perform high-dimensional visual feature extraction to obtain the target edited image, or input the target source image, the local features of the image and the sentence word features into the at least one cascaded generation module for high-dimensional vision feature extraction to obtain the target edited image.
上述方法可以由电子设备上的芯片执行。相比现有ManiGAN根据文本对待编辑的源图像进行编辑并直接输出可能不符合用户要求的编辑结果,本申请在现有ManiGAN中引入采样编码模块,形成改进的ManiGAN;该采用编码模块会将中间编辑结果(即第一编辑图像)输出,以方便用户判断该中间编辑结果是否符合要求,若符合要求,则将中间编辑结果继续向至少一个级联的生成模块传递;若不符合要求,则不会将该中间结果继续向至少一个级联的生成模块传递,而是用目标源图像代替该中间编辑结果继续向至少一个级联 的生成模块传递。由此可见,改进的ManiGAN在根据目标文本对目标源图像编辑时,可以对中间编辑结果进行控制,并及时剔除不符合要求的中间编辑结果,以防止前一级输出不符合要求的结果影响后一级输出结果的准确性,从而为用户编辑出更加符合要求的目标编辑图像。The above method can be executed by a chip on an electronic device. Compared with the existing ManiGAN that edits the source image to be edited according to the text and directly outputs the editing results that may not meet the user's requirements, this application introduces a sampling encoding module into the existing ManiGAN to form an improved ManiGAN; The edited result (ie the first edited image) is output to facilitate the user to judge whether the intermediate edited result meets the requirements. If the intermediate edited result meets the requirements, the intermediate edited result will continue to be passed to at least one cascaded generation module; if it does not meet the requirements, it will not The intermediate result will continue to be delivered to at least one cascaded generation module, but the target source image will be used to replace the intermediate editing result and continue to be passed to at least one cascaded generation module. It can be seen that when the improved ManiGAN edits the target source image according to the target text, it can control the intermediate editing results, and promptly remove the intermediate editing results that do not meet the requirements, so as to prevent the output of the previous level from affecting the subsequent results. The accuracy of the first-level output results, so as to edit the target edited image more in line with the requirements for the user.
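As a minimal illustration of the user-controlled skipping described above, the control flow can be written as a short loop; the module and callback names below are illustrative assumptions rather than part of the disclosed implementation, and the text features that each module also consumes are omitted for brevity.

    def edit_with_user_control(source_image, modules, user_accepts):
        # modules: ordered callables standing in for G00, G01, G02 (illustrative names)
        # user_accepts: callback emulating the user's confirm/reject instruction
        current = source_image
        for module in modules:
            candidate = module(current)      # intermediate editing result of this stage
            if user_accepts(candidate):
                current = candidate          # accepted: pass the intermediate result onward
            else:
                current = source_image       # rejected: pass the target source image instead
        return current                       # output of the final stage, i.e. the target edited image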
可选地,所述生成模块包括:第一自动解码器、第一自注意力模块、第二上采样模块和第二自动编码器,所述第一自动解码器用于恢复输入信息的高维视觉特征,得到第一高维特征图像,所述输入信息为第一编辑图像或者为前一层生成模块的输出信息;所述第一自注意力模块用于对所述第一高维特征图像和所述句子词特征进行融合以及处理,得到句子语义信息特征;所述第二上采样模块用于对所述句子语义信息特征进行特征融合和上采样处理,得到第二上采样结果;所述第二自动编码器用于对所述第二上采样结果进行高维视觉特征提取,得到输出信息,当所述生成模块为所述至少一个级联的生成模块中的最后一级生成模块时,所述输出信息为目标编辑图像。Optionally, the generating module includes: a first automatic decoder, a first self-attention module, a second upsampling module and a second automatic encoder, the first automatic decoder is used to restore the high-dimensional visual feature, to obtain the first high-dimensional feature image, the input information is the first edited image or the output information of the previous layer generation module; the first self-attention module is used for the first high-dimensional feature image and The sentence feature is fused and processed to obtain the sentence semantic information feature; the second up-sampling module is used to perform feature fusion and up-sampling processing on the sentence semantic information feature to obtain a second up-sampling result; the second up-sampling module Two automatic encoders are used to perform high-dimensional visual feature extraction on the second upsampling result to obtain output information. When the generation module is the last generation module in the at least one cascaded generation module, the The output information is the target edited image.
可选地,所述第一自注意力模块包括:自注意力层和第一带噪声仿射组合模块,所述自注意力层用于对所述第一高维特征图像和所述句子词特征进行特征融合;所述第一带噪声仿射组合模块用于对所述自注意力层的输出结果与所述第一高维特征图像的拼接结果,以及所述图像局部特征进行特征融合。Optionally, the first self-attention module includes: a self-attention layer and a first noisy affine combination module, and the self-attention layer is used to analyze the first high-dimensional feature image and the sentence word Feature fusion of features; the first noisy affine combination module is used to perform feature fusion on the output result of the self-attention layer and the splicing result of the first high-dimensional feature image, as well as the local features of the image.
在上述第一自注意力模块中引入第一带噪声仿射组合模块,该第一带噪声仿射组合模块通过引入高斯噪声能够增强生成模块编辑图像的可靠性,从而避免了生成模块因图像中存在随机噪声而影响编辑结果可靠性的情况出现。In the above-mentioned first self-attention module, the first noisy affine combination module is introduced. The first noisy affine combination module can enhance the reliability of the generation module to edit images by introducing Gaussian noise, thereby avoiding the generation module due to the The presence of random noise affects the reliability of the editing results.
可选地,所述采样编码模块包括:第一上采样模块和第一自动编码器,所述第一上采样模块用于对所述图像整体特征、所述句子整体特征和所述图像局部特征进行上采样处理,得到第一上采样结果;所述第一自动编码器用于根据所述第一上采样结果生成第一编辑图像。Optionally, the sampling encoding module includes: a first up-sampling module and a first automatic encoder, the first up-sampling module is used to perform an overall feature of the image, the overall feature of the sentence, and the local feature of the image Perform upsampling processing to obtain a first upsampling result; the first automatic encoder is configured to generate a first edited image according to the first upsampling result.
可选地,所述第一上采样模块包括:多个相同的上采样层、第二带噪声仿射组合模块和第三带噪声仿射组合模块,第一上采样模块的输入是所述句子整体特征、所述图像整体特征和所述图像局部特征,所述多个相同的上采样层中相邻的两个上采样层中,后一上采样层的输入是前一上采样层的输出;所述第二带噪声仿射组合模块位于所述多个相同的上采样层中任意两个上采样层中间,用于对所述任意两个上采样层中前一上采样层输出的结果和所述图像局部特征进行特征融合;所述第三带噪声仿射组合模块用于对所述多个相同的上采样层中最后一个上采样层的输出结果和所述图像局部特征进行特征融合。Optionally, the first upsampling module includes: a plurality of identical upsampling layers, a second noisy affine combination module and a third noisy affine combination module, and the input of the first upsampling module is the sentence The overall feature, the overall feature of the image and the local feature of the image, among the two adjacent up-sampling layers in the plurality of identical up-sampling layers, the input of the latter up-sampling layer is the output of the previous up-sampling layer ; The second noisy affine combination module is located in the middle of any two upsampling layers in the plurality of identical upsampling layers, and is used to output the result of the previous upsampling layer in the any two upsampling layers Perform feature fusion with the local features of the image; the third noisy affine combination module is used to perform feature fusion on the output result of the last upsampling layer in the plurality of identical upsampling layers and the local features of the image .
在第一上采样模块中引入第二带噪声仿射组合模块和第三带噪声仿射组合模块可以进一步对第一上采样模块中不同上采样层的输出结果进行视觉特征增强。Introducing the second noisy affine combination module and the third noisy affine combination module into the first upsampling module can further enhance the visual features of the output results of different upsampling layers in the first upsampling module.
可选地,所述图像编辑模型还包括:细节修正模型,用于对所述目标编辑图像进行细节修改;所述细节修正模型用于处理所述图像局部特征、所述句子词特征和所述目标编辑图像,得到目标修正图像;所述细节修正模型包括:第一细节修正模块、第二细节修正模块、融合模块和生成器,其中,所述第一细节修正模块用于对所述图像局部特征、第一随 机噪声和所述句子词特征进行细节修改,得到第一细节特征;所述第二细节修正模块用于对所述目标编辑图像对应的图像局部特征、第二随机噪声和所述句子词特征进行细节修改,得到第二细节特征;所述融合模块用于对所述第一细节特征和所述第二细节特征进行特征融合;所述生成器用于根据所述融合模块的输出结果生成所述目标修正图像。Optionally, the image editing model further includes: a detail modification model, configured to modify the details of the target edited image; the detail modification model is used to process the image local features, the sentence word features, and the Editing the image of the target to obtain the corrected image of the target; the detail correction model includes: a first detail correction module, a second detail correction module, a fusion module and a generator, wherein the first detail correction module is used to partially modify the image feature, the first random noise, and the sentence word feature are modified in detail to obtain the first detail feature; the second detail correction module is used to modify the image local feature corresponding to the target edited image, the second random noise, and the Sentence word features are modified in detail to obtain the second detailed features; the fusion module is used to perform feature fusion on the first detailed features and the second detailed features; the generator is used for outputting results according to the fusion module The target corrected image is generated.
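A rough sketch of the data flow through the detail correction model described above is given below; it only wires the two detail correction branches, the fusion module and the generator together, with every submodule passed in as a stand-in, so the class and argument names are assumptions rather than the disclosed architecture.

    import torch
    import torch.nn as nn

    class DetailCorrectionModel(nn.Module):
        # Sketch: two detail-correction branches, a fusion module and a generator.
        def __init__(self, branch_source, branch_edited, fusion, generator):
            super().__init__()
            self.branch_source = branch_source   # first detail correction module (source-image branch)
            self.branch_edited = branch_edited   # second detail correction module (edited-image branch)
            self.fusion = fusion                 # fuses the first and second detail features
            self.generator = generator           # produces the target corrected image

        def forward(self, source_local, edited_local, word_feats):
            noise1 = torch.randn_like(source_local)             # first random noise
            noise2 = torch.randn_like(edited_local)             # second random noise
            detail1 = self.branch_source(source_local, noise1, word_feats)
            detail2 = self.branch_edited(edited_local, noise2, word_feats)
            return self.generator(self.fusion(detail1, detail2))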
在上述图像编辑模型中增加细节修正模型能够对图像编辑模型输出的目标编辑图像进行进一步的细节修改和增强,从而得到高分辨率的目标修正图像。Adding the detail correction model to the above image editing model can further modify and enhance the details of the target editing image output by the image editing model, so as to obtain a high-resolution target correction image.
可选地,所述第一细节修正模块包括第四带噪声仿射组合模块、第五带噪声仿射组合模块、第六带噪声仿射组合模块、第二自注意力模块、第一残差网络和第一线性网络;所述第四带噪声仿射组合模块用于对所述第一随机噪声和所述图像局部特征进行特征融合,得到第一融合特征;所述第二自注意力模块用于对所述第一融合特征和所述句子词特征进行特征融合;所述第五带噪声仿射组合模块对所述第二自注意力模块的输出结果与所述第一随机噪声的拼接结果、以及所述图像局部特征进行特征融合;所述第一残差网络用于对所述第五带噪声仿射组合模块的输出结果进行视觉特征提取;所述第一线性网络用于对所述图像局部特征进行线性变换;所述第六带噪声仿射组合模块用于对所述第一残差网络的输出结果和所述第一线性网络的输出结果进行特征融合。Optionally, the first detail modification module includes a fourth noisy affine combination module, a fifth noisy affine combination module, a sixth noisy affine combination module, a second self-attention module, and a first residual network and the first linear network; the fourth noisy affine combination module is used to perform feature fusion on the first random noise and the local features of the image to obtain the first fusion feature; the second self-attention module It is used to perform feature fusion on the first fusion feature and the sentence word feature; the output result of the second self-attention module and the splicing of the first random noise by the fifth noisy affine combination module The results and the local features of the image are subjected to feature fusion; the first residual network is used to extract visual features from the output of the fifth noisy affine combination module; the first linear network is used to extract the visual features of the performing linear transformation on the local features of the image; the sixth noisy affine combination module is used to perform feature fusion on the output result of the first residual network and the output result of the first linear network.
在上述第一细节修正模块中增加多个带噪声仿射组合模块,可以增强细节修正模型的可靠性。Adding a plurality of noisy affine combination modules to the first detail correction module can enhance the reliability of the detail correction model.
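For concreteness, one way the first detail correction module described above could be composed is sketched below; each noisy affine combination module, the self-attention module, the residual network and the linear network are passed in as stand-ins, so only the wiring is asserted here and the layer definitions are assumptions.

    import torch
    import torch.nn as nn

    class FirstDetailCorrectionModule(nn.Module):
        def __init__(self, nacm4, nacm5, nacm6, attention, residual, linear):
            super().__init__()
            self.nacm4, self.nacm5, self.nacm6 = nacm4, nacm5, nacm6
            self.attention = attention   # second self-attention module
            self.residual = residual     # first residual network (visual feature extraction)
            self.linear = linear         # first linear network (linear transform of M_I)

        def forward(self, image_local, noise, word_feats):
            fused = self.nacm4(noise, image_local)                        # first fusion feature
            attended = self.attention(fused, word_feats)                  # fuse with sentence word features
            h = self.nacm5(torch.cat([attended, noise], dim=1), image_local)
            h = self.residual(h)                                          # visual feature extraction
            return self.nacm6(h, self.linear(image_local))                # first detail feature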
可选地,所述第二细节修正模块包括第七带噪声仿射组合模块、第八带噪声仿射组合模块、第九带噪声仿射组合模块、第三自注意力模块、第二残差网络和第二线性网络;所述第七带噪声仿射组合模块用于对所述第二随机噪声和所述目标编辑图像对应的图像局部特征进行特征融合,得到第一融合特征;所述第三自注意力模块用于对所述第一融合特征和所述句子词特征进行特征融合;Optionally, the second detail modification module includes a seventh noisy affine combination module, an eighth noisy affine combination module, a ninth noisy affine combination module, a third self-attention module, and a second residual network and a second linear network; the seventh noisy affine combination module is used to perform feature fusion on the second random noise and the image local features corresponding to the target edited image to obtain the first fusion feature; the second Three self-attention modules are used to carry out feature fusion to the first fusion feature and the sentence word feature;
所述第八带噪声仿射组合模块对所述第三自注意力模块的输出结果与所述第二随机噪声的拼接结果、以及所述目标编辑图像对应的图像局部特征进行特征融合;所述第二残差网络用于对所述第八带噪声仿射组合模块的输出结果进行视觉特征提取;所述第二线性网络用于对所述目标编辑图像对应的图像局部特征进行线性变换;所述第九带噪声仿射组合模块用于对所述第二残差网络的输出结果和所述第二线性网络的输出结果进行特征融合。The eighth noisy affine combination module performs feature fusion on the output result of the third self-attention module, the mosaic result of the second random noise, and the image local features corresponding to the target edited image; the The second residual network is used to perform visual feature extraction on the output result of the eighth noisy affine combination module; the second linear network is used to linearly transform the image local features corresponding to the target edited image; The ninth noisy affine combination module is used to perform feature fusion on the output result of the second residual network and the output result of the second linear network.
在上述第二细节修正模块中增加多个带噪声仿射组合模块,可以增强细节修正模型的可靠性。Adding a plurality of noisy affine combination modules to the second detail correction module can enhance the reliability of the detail correction model.
可选地,所述细节修正模型的训练方式包括:根据有条件的生成器损失函数、无条件的生成器损失函数和语义对比函数训练所述细节修正模型的生成器;根据有条件的判别器损失函数和无条件的判别器损失函数训练所述细节修正模型的判别器。Optionally, the training method of the detail correction model includes: training the generator of the detail correction model according to the conditional generator loss function, the unconditional generator loss function and the semantic comparison function; according to the conditional discriminator loss function function and an unconditional discriminator loss function to train the discriminator of the minutiae revision model.
上述根据有条件的生成器损失函数、无条件的生成器损失函数和语义对比函数训练所述细节修正模型的生成器,能够使得生成器生成的图像编辑结果(即目标编辑图像)更加符合目标文本描述的内容和用户要求。上述根据有条件的判别器损失函数和无条件的判别器损失函数训练所述细节修正模型的判别器,能够使得判断器的识别结果更加准确。The above-mentioned generator that trains the detail correction model according to the conditional generator loss function, the unconditional generator loss function and the semantic comparison function can make the image editing result generated by the generator (ie, the target edited image) more conform to the target text description content and user requirements. Training the discriminator of the detail modification model according to the conditional discriminator loss function and the unconditional discriminator loss function can make the recognition result of the discriminator more accurate.
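One way to read the semantic contrast function is as a similarity margin that pushes the corrected image's embedding towards the description text T I of the training image and away from a randomly selected target text T; the encoder producing the embeddings and the margin value below are assumptions made for illustration only.

    import torch
    import torch.nn.functional as F

    def semantic_contrast_loss(img_emb, text_match_emb, text_rand_emb, margin=0.2):
        # img_emb: embedding of the target corrected image
        # text_match_emb: embedding of the description text T_I of the training image I
        # text_rand_emb: embedding of a randomly selected target text T
        sim_match = F.cosine_similarity(img_emb, text_match_emb, dim=-1)
        sim_rand = F.cosine_similarity(img_emb, text_rand_emb, dim=-1)
        # hinge: penalise cases where the random text is not at least `margin` less similar
        return torch.clamp(margin + sim_rand - sim_match, min=0).mean()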
可选地,所述图像编辑模型的训练方法包括:利用预设的损失函数和训练集对初始模型进行训练,得到所述图像编辑模型;其中,所述预设的损失函数包括与N个子网络分别对应的子函数和N-1个自动编解码器的损失函数,所述初始模型包括N个子网络,所述N个子网络为所述采样编码模块和至少一个生成模块分别对应的初始模型;训练过程中,当第i个子网络的输出图像不满足预设条件时,采用第i+1至第N个子网络对应的子函数和第i个至第i+1个自动编解码器的损失函数,对初始模型进行训练,0≤i<N。Optionally, the training method of the image editing model includes: using a preset loss function and a training set to train the initial model to obtain the image editing model; wherein, the preset loss function includes and N sub-networks Corresponding sub-functions and loss functions of N-1 automatic codecs, the initial model includes N sub-networks, and the N sub-networks are initial models corresponding to the sampling encoding module and at least one generation module respectively; training In the process, when the output image of the i-th sub-network does not meet the preset conditions, the sub-function corresponding to the i+1-th sub-network and the loss function of the i-th to i+1-th automatic codec are used, Train the initial model, 0≤i<N.
在上述图像编辑模型训练的过程中,对N个子网络进行训练是通过自动编解码器跳过输出结果不符合要求的前级子网络而优先训练后级子网络,由于跳过时使用目标源图像而不是中间编辑结果(比如,第一编辑图像),因此,能够避免潜在的前级子网络输出的错误结果向后级子网络传播。后级子网络优先训练能够给前级子网络带来更好的更新梯度,从而使得前级子网络的收敛效果更好。In the process of training the above image editing model, the training of the N sub-networks is to skip the previous sub-networks whose output results do not meet the requirements through the automatic codec and give priority to training the subsequent sub-networks, because the target source image is used when skipping. It is not an intermediate editing result (for example, the first edited image), therefore, it is possible to avoid the potential error results output by the previous sub-network from propagating to the subsequent sub-network. The priority training of the later sub-network can bring better update gradients to the front-level sub-network, so that the convergence effect of the front-level sub-network is better.
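The skip-aware selection of loss terms during training can be sketched as follows; how the preset condition is checked and how the codec losses are paired with a skipped stage are assumptions, since the paragraph above only fixes which sub-functions participate when the i-th sub-network's output is rejected.

    def select_training_loss(outputs, meets_condition, sub_losses, codec_losses):
        # outputs: the N sub-network outputs for this training step
        # meets_condition: list of booleans, one per sub-network output
        # sub_losses: N per-sub-network loss callables; codec_losses: N-1 codec loss callables
        n = len(outputs)
        bad = next((i for i, ok in enumerate(meets_condition) if not ok), n)
        if bad == n:                                     # no output was rejected
            return sum(loss(out) for loss, out in zip(sub_losses, outputs))
        # train the sub-networks after the rejected one first
        total = sum(sub_losses[j](outputs[j]) for j in range(bad + 1, n))
        if bad < len(codec_losses):
            total = total + codec_losses[bad](outputs[bad])   # codec bridging the skipped stage
        return total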
第二方面,提供了一种电子设备,包括用于执行第一方面中任一种方法的模块。In a second aspect, an electronic device is provided, including a module for performing any one of the methods in the first aspect.
第三方面,提供了一种计算机可读存储介质,所述计算机可读存储介质存储了计算机程序,当所述计算机程序被处理器执行时,使得处理器执行第一方面中任一项所述的方法。In a third aspect, a computer-readable storage medium is provided, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes any one of the first aspect. Methods.
附图说明Description of drawings
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例或示范性技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其它的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present application, the following will briefly introduce the accompanying drawings that need to be used in the embodiments or exemplary technical descriptions. Obviously, the accompanying drawings in the following descriptions are only for this application. For some embodiments, those skilled in the art can also obtain other drawings based on these drawings without creative efforts.
图1为本发明实施例中一种基于文本的图像编辑方法流程示意图;Fig. 1 is a schematic flow chart of a text-based image editing method in an embodiment of the present invention;
图2为本发明实施例中图像编辑模型的结构示意图;Fig. 2 is a schematic structural diagram of an image editing model in an embodiment of the present invention;
图3为本发明实施例中细节修正模型的结构示意图;Fig. 3 is a schematic structural diagram of a detail correction model in an embodiment of the present invention;
图4为本发明实施例中图像编辑模型对目标源图像的处理过程示意图;Fig. 4 is a schematic diagram of the processing process of the target source image by the image editing model in the embodiment of the present invention;
图5为本发明实施例中一种电子设备的结构示意图。FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of the present invention.
本申请的实施方式Embodiment of this application
以下描述中,为了说明而不是为了限定,提出了诸如特定系统结构、技术之类的具体细节,以便透彻理解本申请实施例。然而,本领域的技术人员应当清楚,在没有这些具体细节的其它实施例中也可以实现本申请。在其它情况中,省略对众所周知的系统、装置以及方法的详细说明,以免不必要的细节妨碍本申请的描述。In the following description, specific details such as specific system structures and technologies are presented for the purpose of illustration rather than limitation, so as to thoroughly understand the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
应当理解,当在本申请说明书和所附权利要求书中使用时,术语“包括”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。It should be understood that when used in this specification and the appended claims, the term "comprising" indicates the presence of described features, integers, steps, operations, elements and/or components, but does not exclude one or more other Presence or addition of features, wholes, steps, operations, elements, components and/or collections thereof.
还应当理解,在本申请说明书和所附权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。It should also be understood that the term "and/or" used in the description of the present application and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
另外,在本申请说明书和所附权利要求书的描述中,术语“第一”、“第二”、“第三”等仅用于区分描述,而不能理解为指示或暗示相对重要性。In addition, in the description of the specification and appended claims of the present application, the terms "first", "second", "third" and so on are only used to distinguish descriptions, and should not be understood as indicating or implying relative importance.
在本申请说明书中描述的参考“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。因此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。术语“包括”、“包含”、“具有”及它们的变形都意味着“包括但不限于”,除非是以其他方式另外特别强调。Reference to "one embodiment" or "some embodiments" or the like in the specification of the present application means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Accordingly, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in other embodiments," etc., in various places in this specification are not necessarily all References to the same embodiment mean "one or more but not all" unless specifically stated otherwise. The terms "including", "comprising", "having" and variations thereof mean "including but not limited to", unless specifically stated otherwise.
基于文本的图像编辑是多媒体领域的研究热点并具有重要的应用价值。ManiGAN用于根据文本描述内容对待编辑的源图像进行图像编辑。但是,现有ManiGAN在根据文本描述内容对源图像进行编辑时无法对图像编辑的中间结果进行处理,往往导致ManiGAN输出的图像编辑结果不符合用户要求。本申请为了解决现有ManiGAN输出的图像编辑结果更加符合用户要求,在现有ManiGAN的多层级生成对抗网络中引入自动编解码器,该自动编解码器可以将中间编辑结果输出给用户,以方便用户对ManiGAN中间输出的编辑结果进行直接控制,从而得到更加符合用户要求的目标编辑图像。Text-based image editing is a research hotspot in the field of multimedia and has important application value. ManiGAN is used for image editing based on the text description content of the source image to be edited. However, the existing ManiGAN cannot process the intermediate results of image editing when editing the source image according to the text description content, which often leads to the image editing results output by ManiGAN not meeting user requirements. In order to solve the problem that the image editing results output by the existing ManiGAN are more in line with user requirements, this application introduces an automatic codec into the multi-level generation confrontation network of the existing ManiGAN. The automatic codec can output the intermediate editing results to the user for convenience. The user can directly control the edited result of the intermediate output of ManiGAN, so as to obtain the target edited image that is more in line with the user's requirements.
下面结合附图和具体实施例对本申请做进一步详细说明。The present application will be described in further detail below in conjunction with the accompanying drawings and specific embodiments.
为了解决ManiGAN输出的图像编辑结果不符合用户要求的问题,本申请提出了一种基于文本的图像编辑方法,如图1所示,该方法由电子设备执行,该方法包括:In order to solve the problem that the image editing results output by ManiGAN do not meet the user's requirements, this application proposes a text-based image editing method, as shown in Figure 1, the method is executed by electronic equipment, and the method includes:
S101,获取目标源图像的图像整体特征和图像局部特征,以及目标文本的句子整体特征和句子词特征。S101. Acquire overall image features and local image features of a target source image, and overall sentence features and sentence word features of a target text.
示例性地,电子设备获取目标源图像的图像整体特征和图像局部特征,以及目标文本的句子整体特征和句子词特征。该目标源图像来自MS-COCO(即Microsoft Common Objects in Context)数据集和CUB200数据集;上述目标文本是指记录用户编辑目标源图像的文字信息,比如,目标源图像为一只鸟,若用户想要将这只鸟的羽毛染成红色和头部染成黄色,则可将这些编辑需求以文字的形式记录在目标文本中,即该目标文本的具体内容为将这只鸟的羽毛染成红色和头部染成黄色。Exemplarily, the electronic device acquires the overall image features and local image features of the target source image, and the overall sentence features and sentence word features of the target text. The target source image comes from the MS-COCO (Microsoft Common Objects in Context) dataset and the CUB200 dataset; the above target text refers to the text information that records the user editing the target source image, for example, the target source image is a bird, if the user If you want to dye the bird's feathers red and its head yellow, you can record these editing requirements in the target text in the form of words, that is, the specific content of the target text is to dye the bird's feathers Red and head dyed yellow.
上述图像整体特征(即全局图像特征)是指能表示整幅图像的特征,用于描述图像颜色和形状等整体特征,比如,颜色特征、纹理特征和形状特征;上述图像局部特征(即局部图像特征)是图像特征的局部表达,它反映了图像上具有的局部特性。The above-mentioned overall image features (i.e., global image features) refer to the features that can represent the entire image, and are used to describe the overall features such as image color and shape, such as color features, texture features, and shape features; the above-mentioned image local features (i.e., local image features) feature) is the local expression of image features, which reflects the local characteristics of the image.
通常情况下,可利用视觉几何组(Visual Geometry Group,VGG)网络提取上述目标源图像的图像整体特征和图像局部特征;可利用一种特殊的循环神经网络(Recurrent Neural Network,RNN),即长短期记忆(Long short-term memory,LSTM)网络,来提取目标文本的句子整体特征和句子词特征。Usually, the visual geometry group (Visual Geometry Group, VGG) network can be used to extract the image overall features and image local features of the target source image; a special recurrent neural network (Recurrent Neural Network, RNN), that is, long Short-term memory (Long short-term memory, LSTM) network to extract the overall sentence features and sentence word features of the target text.
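S101 can be prototyped with off-the-shelf building blocks; the sketch below assumes a VGG-16 backbone for the global and local image features and a word-level LSTM for the sentence features, and the layer cut-off, embedding size and vocabulary size are illustrative choices rather than values taken from this disclosure.

    import torch
    import torch.nn as nn
    from torchvision.models import vgg16

    class FeatureExtractors(nn.Module):
        def __init__(self, embed_dim=128, vocab_size=5000):
            super().__init__()
            vgg = vgg16(weights=None)                 # pretrained weights could be loaded instead
            self.local_cnn = vgg.features[:16]        # early conv blocks -> spatial local features
            self.global_cnn = vgg.features            # full conv stack -> global descriptor
            self.global_proj = nn.Linear(512, embed_dim)
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)

        def forward(self, image, token_ids):
            local_feat = self.local_cnn(image)                     # image local features M_I
            pooled = self.global_cnn(image).mean(dim=(2, 3))       # global average pooling
            global_feat = self.global_proj(pooled)                 # image overall feature c_I
            word_feats, (h, _) = self.lstm(self.embed(token_ids))  # sentence word features M_T
            sent_feat = h[-1]                                      # sentence overall feature c_T
            return global_feat, local_feat, sent_feat, word_feats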
S102,根据图像整体特征、图像局部特征、句子整体特征和句子词特征,基于图像编辑模型对目标源图像进行编辑,得到目标编辑图像;其中,图像编辑模型包括:采样编码模块和至少一个级联的生成模块;图像编辑模型对目标源图像的处理过程包括:利用采样 编码模块对图像整体特征、句子整体特征和图像局部特征进行采样编码处理,得到第一编辑图像,并输出第一编辑图像;响应于用户指令,将第一编辑图像、图像局部特征和句子词特征输入至少一个级联的生成模块中进行高维视觉特征提取,得到目标编辑图像,或者将目标源图像、图像局部特征和句子词特征输入到至少一个级联的生成模块中进行高维视觉特征提取,得到目标编辑图像。S102, according to the image overall feature, image local feature, sentence overall feature and sentence word feature, edit the target source image based on the image editing model to obtain the target editing image; wherein, the image editing model includes: a sampling coding module and at least one cascade The generation module of the image editing model includes: using the sampling encoding module to sample and encode the overall features of the image, the overall features of the sentence and the local features of the image to obtain the first edited image and output the first edited image; In response to user instructions, input the first edited image, image local features and sentence word features into at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image, or input the target source image, image local features and sentence The word feature is input to at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image.
示例性地,如图2所示,目标源图像I是一张尺寸大小为128×128,通道数为3的小鸟图片,在该小鸟图片中,小鸟腹部的羽毛为白色,嘴巴为灰色,头部和颈部的羽毛均为灰白相间色;VGG网络对该128×128的目标源图像I进行特征提取,得到该目标源图像I对应的图像整体特征c I和图像局部特征M I,其中,该图像整体特征c I是一个128×1的列向量;图像局部特征M I的尺寸大小为128×128且通道数为128;又比如,目标文本T的具体内容为将小鸟腹部羽毛变成黄色、嘴巴变为黄色,以及头部和颈部的羽毛均变为灰黄相间色;LSTM网络对该目标文本T进行特征提取,得到目标文本T对应的句子整体特征c T和句子词特征M T,其中,句子整体特征c T和句子词特征M T均为128×1的列向量。 Exemplarily, as shown in FIG. 2 , the target source image I is a bird picture with a size of 128×128 and a channel number of 3. In the bird picture, the feathers on the belly of the bird are white, and the mouth is Gray, the feathers on the head and neck are gray and white; the VGG network performs feature extraction on the 128×128 target source image I, and obtains the image overall feature c I and image local feature M I corresponding to the target source image I , wherein, the overall feature c I of the image is a 128×1 column vector; the size of the local feature M I of the image is 128×128 and the number of channels is 128; for another example, the specific content of the target text T is the The feathers turn yellow, the mouth turns yellow, and the feathers on the head and neck turn gray and yellow; the LSTM network performs feature extraction on the target text T, and obtains the overall sentence feature c T and the sentence corresponding to the target text T Word feature M T , where the overall sentence feature c T and sentence word feature M T are both 128×1 column vectors.
As shown in FIG. 2, the image editing model 201 includes: a sampling encoding module G 00 and at least one cascaded generation module (for example, G 01 and G 02 ), where the sampling encoding module G 00 includes a first upsampling module F 0 and a first autoencoder G 0 ; the first upsampling module F 0 performs upsampling processing on the image overall feature c I and the sentence overall feature c T to obtain a first upsampling result, and the first autoencoder G 0 encodes the first upsampling result and generates the first edited image.
可选地,如图2所示,第一上采样模块F 0包括:多个相同的上采样层、第二带噪声仿射组合模块2011a和第三带噪声仿射组合模块2011b,第一上采样模块F 0的输入是句子整体特征c T、图像整体特征c I和图像局部特征M I,多个相同的上采样层中相邻的两个上采样层中,后一上采样层的输入是前一上采样层的输出;第二带噪声仿射组合模块2011a位于多个相同的上采样层中任意两个上采样层中间,用于对任意两个上采样层中前一上采样层输出的结果和图像局部特征M I进行特征融合;第三带噪声仿射组合模块2011b用于对多个相同的上采样层中最后一个上采样层的输出结果和图像局部特征M I进行特征融合。 Optionally, as shown in FIG. 2, the first upsampling module F0 includes: a plurality of identical upsampling layers, a second noisy affine combination module 2011a and a third noisy affine combination module 2011b, the first upsampling The input of the sampling module F 0 is the overall sentence feature c T , the overall image feature c I and the image local feature M I , among the two adjacent up-sampling layers in multiple identical up-sampling layers, the input of the latter up-sampling layer is the output of the previous up-sampling layer; the second noisy affine combination module 2011a is located in the middle of any two up-sampling layers in multiple identical up-sampling layers, and is used for the previous up-sampling layer in any two up-sampling layers The output result and the image local feature MI are subjected to feature fusion; the third noisy affine combination module 2011b is used to perform feature fusion on the output result of the last upsampling layer in multiple identical upsampling layers and the image local feature MI .
上述多个相同的上采样层可以是三个上采样层,也可以是四个上采样层,本申请对此不作任何限定,用户可以根据实际需求设定多个上采样层的具体数量。本申请仅以图2中第一上采样模块F 0中包含4个上采样层为例,来说明该4个上采样层结合第二带噪声仿射组合模块2011a对句子整体特征c T、图像整体特征c I和图像局部特征M I进行特征提取以及特征融合的处理过程。 The above multiple identical upsampling layers may be three upsampling layers or four upsampling layers, which is not limited in this application, and the user may set the specific number of multiple upsampling layers according to actual needs. This application only takes 4 upsampling layers included in the first upsampling module F 0 in FIG . The overall feature c I and the image local feature M I perform the process of feature extraction and feature fusion.
上述第二带噪声仿射组合模块2011a可以位于第一上采样模块F 0中4个上采样层中任意两个上采样层中间,比如,图2中,第一上采样模块F 0的4个上采样层中,前三个上采样层对输入128×1的图像整体特征c I和128×1的句子整体特征c T进行上采样处理,输出尺寸大小为32×32且通道数为64的内部特征(即32×32×64的内部特征);第二带噪声仿射组合模块2011a位于第一上采样模块F 0中第三上采样层和第四上采样层之间;该第二带噪声仿射组合模块2011a根据图像局部特征M I对第三上采样层输出的上采样结果(即前三个 上采样层输出的32×32×64的内部特征)进行视觉特征增强,得到32×32×64的第一增强上采样结果;该第一增强上采样结果先经过第四上采样层的上采样处理并输出64×64×32的视觉特征,该64×64×32的视觉特征再经过第三带噪声仿射组合模块2011b进行进一步地视觉特征增强并输出增强后的64×64×32的视觉特征;该第三带噪声仿射组合模块2011b根据图像局部特征M I对第四上采样层输出的上采样结果(即多个相同的上采样层中最后一个上采样层的输出结果)进行视觉特征增强。 The above-mentioned second noisy affine combination module 2011a can be located in the middle of any two upsampling layers among the 4 upsampling layers in the first upsampling module F0 , for example, in FIG. 2, the four upsampling layers of the first upsampling module F0 In the upsampling layer, the first three upsampling layers perform upsampling processing on the input 128×1 image overall feature c I and the 128×1 sentence overall feature c T , and the output size is 32×32 and the number of channels is 64. Internal features (i.e. internal features of 32×32×64); the second noisy affine combination module 2011a is located between the third upsampling layer and the fourth upsampling layer in the first upsampling module F 0 ; the second band The noise affine combination module 2011a enhances the visual features of the upsampling results output by the third upsampling layer (that is, the internal features of 32×32×64 output by the first three upsampling layers) according to the local image feature M I to obtain 32× The first enhanced upsampling result of 32×64; the first enhanced upsampling result is first subjected to upsampling processing of the fourth upsampling layer and outputs 64×64×32 visual features, and the 64×64×32 visual features are then The third noisy affine combination module 2011b further enhances the visual features and outputs the enhanced 64×64×32 visual features; the third noisy affine combination module 2011b performs the fourth upper The upsampling result output by the sampling layer (that is, the output result of the last upsampling layer among multiple identical upsampling layers) is subjected to visual feature enhancement.
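The shape bookkeeping above (a 32×32×64 internal feature after the third upsampling layer and a 64×64×32 feature after the fourth) can be reproduced with the following wiring sketch of the sampling encoding module; the convolutional make-up of each upsampling layer and the internals of the noisy affine combination modules are assumptions, and both NACMs are passed in as stand-ins.

    import torch
    import torch.nn as nn

    def up_block(in_ch, out_ch):
        # one upsampling layer: nearest-neighbour upsample followed by a 3x3 convolution
        return nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                             nn.Conv2d(in_ch, out_ch, 3, padding=1),
                             nn.BatchNorm2d(out_ch),
                             nn.ReLU(inplace=True))

    class SamplingEncodingModule(nn.Module):
        def __init__(self, nacm_mid, nacm_last, autoencoder):
            super().__init__()
            self.fc = nn.Linear(256, 4 * 4 * 512)            # concatenated 128-d c_I and 128-d c_T
            self.ups = nn.ModuleList([up_block(512, 256),    # 4x4 -> 8x8
                                      up_block(256, 128),    # 8x8 -> 16x16
                                      up_block(128, 64),     # 16x16 -> 32x32 (32x32x64 internal feature)
                                      up_block(64, 32)])     # 32x32 -> 64x64 (64x64x32 visual feature)
            self.nacm_mid = nacm_mid                         # second NACM, between the 3rd and 4th layer
            self.nacm_last = nacm_last                       # third NACM, after the 4th layer
            self.autoencoder = autoencoder                   # first autoencoder G0 -> 64x64x3 image

        def forward(self, c_img, c_sent, m_local):
            h = self.fc(torch.cat([c_img, c_sent], dim=1)).view(-1, 512, 4, 4)
            for i, up in enumerate(self.ups):
                h = up(h)
                if i == 2:
                    h = self.nacm_mid(h, m_local)            # visual-feature enhancement with M_I
            h = self.nacm_last(h, m_local)
            return self.autoencoder(h)                       # the first edited image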
The first autoencoder G 0 performs feature extraction and encoding on the enhanced 64×64×32 visual features output by the third noisy affine combination module 2011b, and outputs a first edited image with a size of 64×64 and 3 channels (i.e., a 64×64×3 first edited image). The user can judge whether the 64×64×3 first edited image meets the requirements by inspecting it directly. If it meets the user's requirements, the 64×64×3 first edited image is input into the at least one cascaded generation module for subsequent processing, as shown at (I) in FIG. 2. If it does not meet the user's requirements, the 64×64×3 first edited image is discarded and the 128×128×3 target source image I is input to the generation module G 01 in its place for subsequent editing, as shown at (II) in FIG. 2; because the first edited image generated by the sampling encoding module G 00 is discarded, the network structure contained in the sampling encoding module G 00 is drawn with a dashed box to indicate that, since the first edited image does not meet the user's requirements, the step in which the sampling encoding module G 00 generates the first edited image is skipped and the 128×128×3 target source image I is input directly into the generation module G 01 for image editing. It can thus be seen that, when the first edited image generated by the sampling encoding module G 00 does not meet the requirements, directly replacing it with the 128×128×3 target source image I as the input to the generation module G 01 prevents a first edited image that does not meet the user's requirements (i.e., an erroneous first edited image) from propagating further through the model.
For example, if the user judges that the 64×64×3 first edited image meets the requirements, the user sends a confirmation instruction to the electronic device, and the electronic device, according to the received confirmation instruction, inputs the 64×64×3 first edited image into the at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image, as shown at (I) in FIG. 2. If the user judges that the 64×64×3 first edited image does not meet the requirements, the user sends a rejection instruction to the electronic device, and the electronic device, according to the received rejection instruction, discards the 64×64×3 first edited image and inputs the 128×128×3 target source image I into the at least one cascaded generation module (for example, the generation module G 01 ) for high-dimensional visual feature extraction to obtain the target edited image, as shown at (II) in FIG. 2.
上述至少一个级联的生成模块可以是1个生成模块,也可以是两个生成模块,本申请对此不作任何限定,用户可以根据实际需求设定生成模块的数量。本申请仅以图2中图像编辑模型包含2个生成模块为例,来说明该两个生成模块配合采样编码模块对中间编辑结果(比如,第一编辑图像)进行进一步地图像编辑以生成目标编辑图像的过程。The above-mentioned at least one cascaded generation module may be one generation module or two generation modules, which is not limited in this application, and the number of generation modules can be set by the user according to actual needs. This application only takes the example of the image editing model in Figure 2 including two generating modules to illustrate that the two generating modules cooperate with the sampling and encoding module to further image edit the intermediate editing result (for example, the first edited image) to generate the target edit. image process.
Exemplarily, as shown in FIG. 2, the generation module G 01 includes: a first automatic decoder E 0 , a first self-attention module 2012c, a second upsampling module F 1 and a second autoencoder G 1 . The first automatic decoder E 0 is used to recover the high-dimensional visual features of the first edited image (i.e., the input information) to obtain a first high-dimensional feature image; the first self-attention module 2012c is used to fuse and concatenate the first high-dimensional feature image and the sentence word features M T to obtain sentence semantic information features; the second upsampling module F 1 is used to perform feature fusion and upsampling on the sentence semantic information features to obtain a second upsampling result; and the second autoencoder G 1 is used to perform high-dimensional visual feature extraction on the second upsampling result to obtain the output information.
The processing of the first edited image (for example, the 64×64×3 first edited image above) by the generation module G_01 in Figure 2 is as follows. The first auto-decoder E_0 performs high-dimensional visual feature restoration on the 64×64×3 first edited image to obtain a first high-dimensional feature image of size 64×64 with 32 channels (i.e., 64×64×32). The first self-attention module 2012c then performs feature fusion and concatenation on the 64×64×32 first high-dimensional feature image, the 128×1 sentence word feature M_T, and the 128×128×128 image local feature M_I, generating a high-dimensional visual feature carrying fine-grained sentence semantic information (i.e., the sentence semantic information feature); this sentence semantic information feature has size 64×64 and 32 channels (i.e., 64×64×32). Next, the second upsampling module F_1 performs feature fusion and upsampling on the 64×64×32 sentence semantic information feature to generate a visual feature image of size 128×128 with 32 channels (i.e., 128×128×32), namely the second upsampling result; the second upsampling module F_1 contains two residual networks and one upsampling layer, where the two residual networks fuse the sentence semantic information features and the upsampling layer raises the spatial resolution of the image. Finally, the second auto-encoder G_1 performs high-dimensional visual feature extraction and encoding on the second upsampling result to generate a second edited image of size 128×128 with 3 channels (i.e., 128×128×3).
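The upsampling module just described (two residual networks followed by one upsampling layer) can be sketched roughly as follows, assuming 32-channel features and nearest-neighbour upsampling; this is an illustration, not the publication's exact architecture.

```python
import torch
import torch.nn as nn

class UpsampleModule(nn.Module):
    """Two residual blocks that fuse the sentence semantic information features,
    then a 2x upsampling layer that raises the spatial resolution."""
    def __init__(self, channels=32):
        super().__init__()
        self.res1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.LeakyReLU(0.2),
                                  nn.Conv2d(channels, channels, 3, padding=1))
        self.res2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.LeakyReLU(0.2),
                                  nn.Conv2d(channels, channels, 3, padding=1))
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):
        x = x + self.res1(x)      # residual fusion
        x = x + self.res2(x)
        return self.up(x)         # double the spatial resolution

# shape check: a 64x64x32 sentence semantic feature becomes a 128x128x32 visual feature
print(UpsampleModule()(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 128, 128])
```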
Optionally, the second upsampling result may first pass through the noisy affine combination module 2012b, which fuses the second upsampling result with the 128×128×128 image local feature M_I and generates a visual feature image of size 128×128 with 32 channels (i.e., 128×128×32); the second auto-encoder G_1 then performs high-dimensional visual feature extraction and encoding on the 128×128×32 visual feature image and outputs the second edited image of size 128×128 with 3 channels (i.e., the 128×128×3 second edited image).
The first self-attention module 2012c includes a self-attention layer F_Atten and the first noisy affine combination module 2012a. The self-attention layer F_Atten performs feature fusion on the first high-dimensional feature image and the sentence word features; the first noisy affine combination module 2012a performs feature fusion on the concatenation of the output of F_Atten with the first high-dimensional feature image, together with the image local feature M_I. For example, the self-attention layer F_Atten fuses the 64×64×32 first high-dimensional feature image with the 128×1 sentence word features and concatenates the fusion result with the first high-dimensional feature image; the first noisy affine combination module 2012a then fuses the concatenation result with the 128×128×128 image local feature M_I.
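A rough sketch of how such a self-attention module could be wired is given below: attention from the visual features to the word features, concatenation with the original visual features, and fusion with the image local features. A plain convolution stands in for the noisy affine combination of Equation (1) described later, and all names and channel sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionModule(nn.Module):
    """Sketch: F_Atten attends from visual features to sentence word features; the attended
    result is concatenated with the visual features and then fused with the image local
    feature M_I (a convolution replaces the noisy affine combination used in the model)."""
    def __init__(self, channels=32, word_dim=128, local_channels=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.word_proj = nn.Linear(word_dim, channels)
        self.fuse = nn.Conv2d(2 * channels + local_channels, channels, 3, padding=1)

    def forward(self, visual, words, local_feats):
        b, c, h, w = visual.shape
        q = visual.flatten(2).transpose(1, 2)                 # B x HW x C
        kv = self.word_proj(words)                            # B x L x C
        attended, _ = self.attn(q, kv, kv)
        attended = attended.transpose(1, 2).view(b, c, h, w)
        concat = torch.cat([attended, visual], dim=1)         # concatenate with visual features
        local = F.interpolate(local_feats, size=(h, w))       # align M_I spatially
        return self.fuse(torch.cat([concat, local], dim=1))
```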
For the 128×128×3 second edited image output by the second auto-encoder G_1, the user can judge whether it meets the requirements by directly observing it.
For example, if the user determines that the 128×128×3 second edited image meets the requirements, the user sends a confirmation instruction to the electronic device, and the electronic device, upon receiving the confirmation instruction, inputs the 128×128×3 second edited image into the next cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image. If the user determines that the 128×128×3 second edited image does not meet the requirements, the user sends a rejection instruction to the electronic device, and the electronic device, upon receiving the rejection instruction, discards the 128×128×3 second edited image generated by the generation module G_01 (that is, the first edited image generated by the generation module G_00 is no longer considered) and inputs the 128×128×3 target source image into the next cascaded generation module (for example, generation module G_02) for high-dimensional visual feature extraction to obtain the target edited image, thereby preventing a second edited image that does not meet the user's requirements (i.e., an erroneous second edited image) from continuing to propagate backward.
Exemplarily, as shown in Figure 2, the next cascaded generation module G_02 after the generation module G_01 includes a second auto-decoder E_1, a second self-attention module 2013c, a third upsampling module F_2, and a generator G_2. The second auto-decoder E_1 is used to restore the high-dimensional visual features of the second edited image (i.e., the output information of the preceding generation module G_01) to obtain a second high-dimensional feature image; the second self-attention module 2013c is used to fuse and concatenate the second high-dimensional feature image with the sentence word feature M_T and output the sentence semantic information feature corresponding to the second edited image; the third upsampling module F_2 is used to perform feature fusion and upsampling on that sentence semantic information feature to obtain a third upsampling result; and the generator G_2 is used to perform high-dimensional visual feature extraction on the third upsampling result.
The processing of the second edited image (for example, the 128×128×3 second edited image above) by the generation module G_02 in Figure 2 is as follows. The second auto-decoder E_1 performs high-dimensional visual feature restoration on the 128×128×3 second edited image to obtain a second high-dimensional feature image of size 128×128 with 32 channels (i.e., 128×128×32). The second self-attention module 2013c then performs feature fusion and concatenation on the 128×128×32 second high-dimensional feature image, the 128×1 sentence word feature M_T, and the 128×128×128 image local feature M_I, generating a high-dimensional visual feature carrying fine-grained sentence semantic information (i.e., the sentence semantic information feature corresponding to the second edited image); this feature has size 128×128 and 32 channels (i.e., 128×128×32). Next, the third upsampling module F_2 performs feature fusion and upsampling on the 128×128×32 sentence semantic information feature to generate a visual feature image of size 256×256 with 32 channels (i.e., 256×256×32), namely the third upsampling result; the third upsampling module F_2 contains two residual networks and one upsampling layer, where the two residual networks fuse the sentence semantic information features and the upsampling layer raises the spatial resolution of the image. Finally, the generator G_2 performs high-dimensional visual feature extraction and encoding on the third upsampling result to generate a third edited image of size 256×256 with 3 channels (i.e., 256×256×3).
Optionally, the 256×256×32 third upsampling result may first pass through the noisy affine combination module 2013b, which fuses the 256×256×32 third upsampling result with the 128×128×128 image local feature M_I and generates a visual feature image of size 256×256 with 32 channels (i.e., 256×256×32); the generator G_2 then performs high-dimensional visual feature extraction and encoding on the 256×256×32 visual feature image to obtain the third edited image of size 256×256 with 3 channels (i.e., the 256×256×3 third edited image).
The second self-attention module 2013c includes a self-attention layer F_Atten and a noisy affine combination module 2013a. The self-attention layer F_Atten performs feature fusion on the second high-dimensional feature image and the sentence word features; the noisy affine combination module 2013a performs feature fusion on the concatenation of the output of F_Atten with the second high-dimensional feature image, together with the image local feature M_I. For example, the self-attention layer F_Atten fuses the 128×128×32 second high-dimensional feature image with the 128×1 sentence word features and concatenates the fusion result with the second high-dimensional feature image; the noisy affine combination module 2013a then fuses the concatenation result with the 128×128×128 image local feature M_I.
For the 256×256×3 third edited image output by the generator G_2, the user can judge whether it meets the requirements by directly observing it. If the user determines that the 256×256×3 third edited image meets the requirements, the user sends a confirmation instruction to the electronic device, and the electronic device, upon receiving the confirmation instruction, inputs the 256×256×3 third edited image into the next-level network for further processing or outputs it to the user directly (that is, since the generation module G_02 is the last of the two cascaded generation modules, the output of the generator G_2 in the generation module G_02 is the target edited image, i.e., the third edited image). If the user determines that the 256×256×3 third edited image does not meet the requirements, the user sends a rejection instruction to the electronic device, and the electronic device, upon receiving the rejection instruction, discards the 256×256×3 third edited image and uses the image editing model again to repeat the foregoing processing on the 128×128×3 target source image.
The core algorithm of the first noisy affine combination module 2012a is as follows:

F_NAC(h, M) = h ⊙ W_rnd(M) + b_rnd(M),      (1)

where F_NAC(h, M) denotes the core algorithm function of the first noisy affine combination module 2012a, W_rnd(M) = W_2(W_1(M) + noise), and b_rnd(M) = b_2(b_1(M) + noise); W_rnd(M) and b_rnd(M) compute the weight and bias of the region feature M of the target source image (for example, the aforementioned image local feature M_I), h denotes the visual feature (for example, the concatenation of the output of the self-attention layer F_Atten with the first high-dimensional feature image), noise is Gaussian noise, and ⊙ is the Hadamard (element-wise) product. By introducing Gaussian noise, the first noisy affine combination module 2012a strengthens the robustness of the images edited by the sampling encoding module and the generation modules, so that random noise present in the image does not compromise the reliability of the editing results.
It is noted that, apart from the first noisy affine combination module 2012a, the other noisy affine combination modules mentioned in this application (for example, the second noisy affine combination module, the third noisy affine combination module, and so on) use the same core algorithm as the first noisy affine combination module 2012a, which is not repeated here.
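A minimal sketch of Equation (1) is given below, assuming that the weight and bias mappings W_1, W_2, b_1, b_2 are realized as 1x1 convolutions over the region features and that Gaussian noise is injected between the two mappings; layer shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyAffineCombination(nn.Module):
    """F_NAC(h, M) = h ⊙ W_rnd(M) + b_rnd(M), with W_rnd(M) = W2(W1(M) + noise)
    and b_rnd(M) = b2(b1(M) + noise), where noise is Gaussian."""
    def __init__(self, region_channels=128, feat_channels=32):
        super().__init__()
        self.w1 = nn.Conv2d(region_channels, feat_channels, 1)
        self.w2 = nn.Conv2d(feat_channels, feat_channels, 1)
        self.b1 = nn.Conv2d(region_channels, feat_channels, 1)
        self.b2 = nn.Conv2d(feat_channels, feat_channels, 1)

    def forward(self, h, m):
        m = F.interpolate(m, size=h.shape[-2:])        # align the region features M with h
        wm, bm = self.w1(m), self.b1(m)
        weight = self.w2(wm + torch.randn_like(wm))    # Gaussian noise in the weight branch
        bias = self.b2(bm + torch.randn_like(bm))      # Gaussian noise in the bias branch
        return h * weight + bias                       # Hadamard product plus bias
```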
Exemplarily, the training method of the image editing model includes: training an initial model with a preset loss function and a training set to obtain the image editing model. The preset loss function includes sub-functions corresponding respectively to N sub-networks and N-1 auto-encoder/decoder loss functions, and the initial model includes the N sub-networks, which are the initial models corresponding respectively to the sampling encoding module and the at least one generation module. During training, when the output image of the i-th sub-network does not satisfy a preset condition, the sub-functions corresponding to the (i+1)-th to N-th sub-networks and the i-th to (i+1)-th auto-encoder/decoder loss functions are used to train the initial model, where 0 ≤ i < N.
For example, as shown in Figure 2, the training method of the image editing model 201 includes: training in a generative adversarial manner with the MS-COCO and CUB200 datasets and the preset loss functions, with the help of discriminators (for example, discriminator D_0, discriminator D_1, and discriminator D_2), to obtain the image editing model 201. The initial model refers to the network model of the image editing model 201 before training; it includes N = 3 sub-networks and two (N-1 = 3-1) auto-encoder/decoder pairs. The three sub-networks are the sampling encoding module G_00, the generation module G_01, and the generation module G_02; the two auto-encoder/decoder pairs are auto-encoder G_0 with auto-decoder E_0, and auto-encoder G_1 with auto-decoder E_1. The preset loss functions are the loss functions of the three sub-networks and the loss functions of the two auto-encoder/decoder pairs. During training, with N = 3, when the output image of the first sub-network (i.e., i = 1, for example the sampling encoding module G_00) does not satisfy the preset condition, the sub-functions corresponding to the second (i.e., i+1 = 1+1) to third sub-networks (i.e., the loss function corresponding to the generation module G_01 and the loss function corresponding to the generation module G_02) and the first to second auto-encoder/decoder loss functions (i.e., the loss function of auto-encoder G_0 with auto-decoder E_0 and the loss function of auto-encoder G_1 with auto-decoder E_1) are used to train the initial model.
When the initial model is trained, the following three preset loss functions are used; their expressions are given as equation images in the original publication:

First preset loss function (equation image)

Second preset loss function (equation image)

Third preset loss function (equation image)
In the above formulas, b denotes the number of sub-networks skipped during training, and i = 0, 1, 2. L_G,i denotes the loss function of the corresponding level in the ManiGAN network: L_G,0 is the loss function corresponding to the level of the sampling encoding module G_00, L_G,1 is the loss function corresponding to the level of the generation module G_01, and L_G,2 is the loss function corresponding to the level of the generation module G_02. In these expressions, I' denotes the target edited image, and I' ~ P_G,i(I, T) means that the target edited image I' is generated by the initial model given the training image I and a randomly selected target text T.
The loss functions of the auto-encoder/decoder pairs comprise the loss function of auto-encoder G_0 with auto-decoder E_0 and the loss function of auto-encoder G_1 with auto-decoder E_1 (both given as equation images in the original). G_i consists of a 3×3 convolutional layer and a tanh activation function; E_i contains an atanh function, a 3×3 convolutional layer, a Leaky ReLU layer, and an instance normalization layer. The auto-encoder/decoder pair formed by G_i and E_i feeds the intermediate editing results back to the user, and the user can directly control the intermediate results generated by the different sub-networks, preventing erroneous intermediate editing results from affecting the accuracy of the output of the whole image editing model 201.
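Following the description of G_i and E_i above, a sketch of one auto-encoder/decoder pair is shown below; the channel count and the clamping applied before atanh are assumptions added for numerical safety.

```python
import torch
import torch.nn as nn

class AutoEncoderG(nn.Module):
    """G_i: a 3x3 convolution followed by tanh, mapping features to a 3-channel image."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, feats):
        return torch.tanh(self.conv(feats))

class AutoDecoderE(nn.Module):
    """E_i: atanh, a 3x3 convolution, Leaky ReLU and instance normalization,
    restoring high-dimensional features from an intermediate edited image."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, 3, padding=1)
        self.act = nn.LeakyReLU(0.2)
        self.norm = nn.InstanceNorm2d(channels)

    def forward(self, image):
        x = torch.atanh(image.clamp(-0.999, 0.999))   # invert the tanh output range
        return self.norm(self.act(self.conv(x)))
```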
Exemplarily, take training the initial model with the first preset loss function as an example; with the first preset loss function, no sub-network needs to be skipped. That is, the image overall feature c_I and image local feature M_I of the training image, together with the sentence overall feature c_T of the training text, are input into the sampling encoding module G_00, and the sampling encoding module G_00 outputs a first training edited image. If the user judges that the first training edited image meets the user's requirements, the first training edited image is input into the generation module G_01; the generation module G_01 edits the first training edited image together with the image local feature M_I of the training image and generates a second training edited image. If the user judges that the second training edited image meets the user's requirements, the second training edited image is input into the generation module G_02; the generation module G_02 edits the second training edited image together with the image local feature M_I of the training image and generates a third training edited image. If the user judges that the third training edited image meets the user's requirements, training the initial model according to the first preset loss function is complete; if the user judges that the third training edited image does not meet the user's requirements, the above training process is repeated. Since auto-encoder G_0 with auto-decoder E_0 and auto-encoder G_1 with auto-decoder E_1 are distributed between the sampling encoding module G_00, the generation module G_01, and the generation module G_02 (see Figure 2), training the initial model according to the first preset loss function also trains auto-encoder G_0 with auto-decoder E_0 and auto-encoder G_1 with auto-decoder E_1.
Exemplarily, take training the initial model with the second preset loss function as an example; with the second preset loss function, the step in which the sampling encoding module G_00 processes the image overall feature c_I and image local feature M_I of the training image and the sentence overall feature c_T of the training text is skipped. Instead, the training image is input directly into the generation module G_01; the generation module G_01 edits the training image together with the image local feature M_I of the training image and generates a fourth training edited image. If the user judges that the fourth training edited image meets the user's requirements, the fourth training edited image is input into the generation module G_02; the generation module G_02 edits the fourth training edited image together with the image local feature M_I of the training image and generates a fifth training edited image. If the user judges that the fifth training edited image meets the user's requirements, training the initial model according to the second preset loss function is complete; if the user judges that the fifth training edited image does not meet the user's requirements, the above training process is repeated. Since only auto-encoder G_1 with auto-decoder E_1 lies between the generation module G_01 and the generation module G_02 (see Figure 2), training the initial model according to the second preset loss function also trains auto-encoder G_1 with auto-decoder E_1.
Exemplarily, take training the initial model with the third preset loss function as an example; with the third preset loss function, the step in which the sampling encoding module G_00 processes the image overall feature c_I and image local feature M_I of the training image and the sentence overall feature c_T of the training text is skipped, and so is the step in which the generation module G_01 edits the output of the sampling encoding module G_00 together with the image local feature M_I of the training image. Instead, the training image is input directly into the generation module G_02; the generation module G_02 edits the training image together with the image local feature M_I of the training image and generates a sixth training edited image. If the user judges that the sixth training edited image meets the user's requirements, training the initial model according to the third preset loss function is complete; if the user judges that the sixth training edited image does not meet the user's requirements, the above training process is repeated. Since the generation module G_02 has no auto-encoder/decoder pair of its own, no auto-encoder/decoder needs to be trained when the initial model is trained according to the third preset loss function.
Because a multi-level adversarial network (i.e., the above N sub-networks) is difficult to train and to make converge, the auto-encoder/decoder pairs are used during training of the initial model to randomly skip lower-level adversarial networks (for example, skipping the sampling encoding module G_00) and to preferentially train the higher-level networks (for example, the generation module G_01 or the generation module G_02). For instance, when b = 1, the sampling encoding module G_00 is skipped and the generation modules G_01 and G_02 are trained preferentially; when b = 2, the sampling encoding module G_00 and the generation module G_01 are skipped. This training method has the following advantages: a) when a lower-level sub-network is skipped, the input of the next-level sub-network is the original training image rather than an editing result output by the lower-level sub-network that fails to meet the user's requirements (i.e., an erroneous editing result), so erroneous editing results potentially generated by lower-level sub-networks are prevented from propagating to higher-level sub-networks; b) preferentially training the higher-level sub-networks provides better update gradients for the lower-level sub-networks, so the lower-level sub-networks converge better.
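This skipping strategy can be pictured schematically as follows: draw the number of leading sub-networks to skip, feed the training image directly into the first non-skipped stage, and optimize only the losses of the remaining stages together with the auto-encoder/decoder losses between them. The function below is a control-flow sketch; the sub-network interfaces and loss callables are placeholders, not the publication's implementation.

```python
import random

def training_step(subnets, stage_losses, codec_losses, image, feats):
    """subnets: e.g. [G00, G01, G02]; stage_losses[i]: adversarial loss of stage i;
    codec_losses[i]: reconstruction loss of the auto-encoder/decoder pair after stage i."""
    b = random.choice([0, 1, 2])          # number of leading sub-networks skipped this step
    x = image                             # skipped stages are bypassed with the source image
    total = 0.0
    for i, net in enumerate(subnets):
        if i < b:
            continue                      # skip this lower-level sub-network
        x = net(x, feats)                 # intermediate editing result of stage i
        total = total + stage_losses[i](x)
        if i < len(subnets) - 1:
            total = total + codec_losses[i](x)   # train the codec between stage i and i+1
    return total                          # backpropagate to update the non-skipped stages
```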
Exemplarily, the image editing model further includes a detail correction model (Symmetrical Detail Correction Module, SCDM) for modifying the details of the target edited image. The detail correction model processes the image local features, the sentence word features, and the target edited image to obtain a target corrected image. The detail correction model includes a first detail correction module, a second detail correction module, a fusion module, and a generator. The first detail correction module performs detail modification on the image local features, a first random noise, and the sentence word features to obtain a first detail feature; the second detail correction module performs detail modification on the image local features corresponding to the target edited image, a second random noise, and the sentence word features to obtain a second detail feature; the fusion module performs feature fusion on the first detail feature and the second detail feature; and the generator generates the target corrected image according to the output of the fusion module.
As shown in Figure 2, the image editing model 201 further includes a detail correction model 2014, which, according to the image local feature M_I of the target source image I and the sentence word feature M_T of the text description T, modifies and corrects the details of the target edited image generated by the generation module G_02 to obtain the target corrected image. The detail correction model 2014 includes a first detail correction module 301, a second detail correction module 302, a fusion module F_fuse, and a generator G_0S. The first detail correction module 301 performs detail modification on the image local feature M_I, a first random noise noise1, and the sentence word feature M_T to obtain the first detail feature, which is input into the fusion module F_fuse; the second detail correction module 302 performs detail modification on the image local feature corresponding to the target edited image (for example, the third edited image output by the generation module G_02), a second random noise noise2, and the sentence word feature M_T to obtain the second detail feature, which is input into the fusion module F_fuse; the fusion module F_fuse performs feature fusion on the first detail feature and the second detail feature and inputs the fusion result into the generator G_0S; and the generator G_0S encodes the output of the fusion module F_fuse and generates the target corrected image.
Exemplarily, as shown in Figure 3, the first detail correction module 301 includes a fourth noisy affine combination module 3011, a fifth noisy affine combination module 3013, a sixth noisy affine combination module 3015, a second self-attention module 3012, a first residual network 3014, and a first linear network 3016. The fourth noisy affine combination module 3011 performs feature fusion on the first random noise noise1 and the image local feature M_I to obtain a first fused feature, which is input into the second self-attention module 3012; the second self-attention module 3012 performs feature fusion on the first fused feature and the sentence word feature M_T and concatenates the fusion result with the first random noise noise1 to obtain a concatenation result, which is input into the fifth noisy affine combination module 3013; the fifth noisy affine combination module 3013 performs feature fusion on the concatenation result and the image local feature M_I and inputs the fusion result into the first residual network 3014; the first residual network 3014 performs visual feature extraction on the fusion result output by the fifth noisy affine combination module 3013 and inputs the extraction result into the sixth noisy affine combination module 3015; the first linear network 3016 performs a linear transformation on the image local feature M_I and inputs the transformation result into the sixth noisy affine combination module 3015; and the sixth noisy affine combination module 3015 performs feature fusion on the visual feature extraction result output by the first residual network 3014 and the linear transformation result output by the first linear network 3016 to obtain the first detail feature x_I.
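A compressed sketch of one such detail-correction branch is shown below, chaining simplified noisy affine combinations, a self-attention step, and a residual-network stand-in in the order described above; all layer shapes, the form of the noise input, and the simplified nac helper are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def nac(h, m, to_w, to_b):
    """Simplified stand-in for the noisy affine combination of Equation (1):
    scale and shift h by noisy mappings of the region features m."""
    m = F.interpolate(m, size=h.shape[-2:])
    return h * to_w(m + torch.randn_like(m)) + to_b(m + torch.randn_like(m))

class DetailCorrectionBranch(nn.Module):
    """One branch: NAC(noise, M) -> self-attention with word features -> concatenation
    with the noise -> NAC with M -> residual net -> NAC with a linear transform of M."""
    def __init__(self, local_c=128, word_dim=128, c=32):
        super().__init__()
        self.noise_proj = nn.Conv2d(1, c, 3, padding=1)                 # lift the noise map
        self.w4 = nn.Conv2d(local_c, c, 1)
        self.b4 = nn.Conv2d(local_c, c, 1)                              # 4th NAC mappings
        self.attn = nn.MultiheadAttention(c, 4, batch_first=True)
        self.word_proj = nn.Linear(word_dim, c)
        self.w5 = nn.Conv2d(local_c, 2 * c, 1)
        self.b5 = nn.Conv2d(local_c, 2 * c, 1)                          # 5th NAC mappings
        self.res = nn.Sequential(nn.Conv2d(2 * c, c, 3, padding=1), nn.LeakyReLU(0.2))
        self.linear = nn.Conv2d(local_c, c, 1)                          # linear transform of M
        self.w6 = nn.Conv2d(c, c, 1)
        self.b6 = nn.Conv2d(c, c, 1)                                    # 6th NAC mappings

    def forward(self, noise, local_feats, word_feats):
        z = self.noise_proj(noise)
        h = nac(z, local_feats, self.w4, self.b4)
        b, c, hh, ww = h.shape
        q = h.flatten(2).transpose(1, 2)
        kv = self.word_proj(word_feats)
        attended, _ = self.attn(q, kv, kv)
        h = torch.cat([attended.transpose(1, 2).view(b, c, hh, ww), z], dim=1)
        h = self.res(nac(h, local_feats, self.w5, self.b5))
        lin = self.linear(F.interpolate(local_feats, size=h.shape[-2:]))
        return nac(h, lin, self.w6, self.b6)                            # detail feature
```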
Exemplarily, as shown in Figure 3, the second detail correction module 302 includes a seventh noisy affine combination module 3021, an eighth noisy affine combination module 3023, a ninth noisy affine combination module 3025, a third self-attention module 3022, a second residual network 3024, and a second linear network 3026. The seventh noisy affine combination module 3021 performs feature fusion on the second random noise noise2 and the image local feature corresponding to the target edited image (for example, the image local feature corresponding to the third edited image output by the generation module G_02) to obtain a first fused feature, which is input into the third self-attention module 3022; the third self-attention module 3022 performs feature fusion on the first fused feature and the sentence word feature M_T and concatenates the fusion result with the second random noise noise2, and the concatenation result is input into the eighth noisy affine combination module 3023; the eighth noisy affine combination module 3023 performs feature fusion on the concatenation result and the image local feature corresponding to the target edited image and inputs the fusion result into the second residual network 3024; the second residual network 3024 performs visual feature extraction on the fusion result output by the eighth noisy affine combination module 3023 and inputs the extraction result into the ninth noisy affine combination module 3025; the second linear network 3026 performs a linear transformation on the image local feature corresponding to the target edited image and inputs the transformation result into the ninth noisy affine combination module 3025; and the ninth noisy affine combination module 3025 performs feature fusion on the visual feature extraction result output by the second residual network 3024 and the linear transformation result output by the second linear network 3026 to obtain the second detail feature. The image local feature corresponding to the target edited image is obtained by performing feature extraction on the third edited image with a VGG network.
The fusion module F_fuse performs feature fusion on the first detail feature x_I output by the sixth noisy affine combination module 3015 and the second detail feature output by the ninth noisy affine combination module 3025, and inputs the fusion result into the generator G_0S; the generator G_0S encodes the fusion result output by the fusion module F_fuse and generates the target corrected image.
Optionally, the inputs of the first detail correction module 301 and the second detail correction module 302 may be exchanged with each other: the input of the first detail correction module 301 may be the image local feature corresponding to the target edited image (for example, the image local feature corresponding to the third edited image output by the generation module G_02), the second random noise noise2, and the sentence word feature M_T, while the input of the second detail correction module 302 may be the image local feature M_I, the first random noise noise1, and the sentence word feature M_T. Correspondingly, the output of the first detail correction module 301 is then the second detail feature, and the output of the second detail correction module 302 is the first detail feature x_I.
In Figure 3, the core algorithm of the fusion module F_fuse (given as an equation image in the original) combines the first detail feature x_I and the second detail feature, where F_residual is a residual network and β_1 and β_2 are a pair of attention weights computed by linear network layers from the input first detail feature x_I and second detail feature. The fusion module F_fuse can adaptively choose whether to modify the image features based on the target source image or based on the target edited image generated by the image editing model, thereby enhancing the detail features of the modified image. The generator G_0S converts the output of the fusion module F_fuse into the final target corrected image.
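Because the exact fusion formula appears only as an equation image, the sketch below is an assumption based on the textual description: linear layers produce the attention weights β_1 and β_2 from the two detail features, the features are blended with those weights, and a residual network F_residual refines the blend.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """F_fuse sketch: adaptively weight the source-image detail feature x_I and the
    edited-image detail feature, then refine the blend with a residual network."""
    def __init__(self, channels=32):
        super().__init__()
        self.score_src = nn.Conv2d(channels, 1, 1)    # linear layer producing the beta_1 logit
        self.score_edit = nn.Conv2d(channels, 1, 1)   # linear layer producing the beta_2 logit
        self.residual = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                      nn.LeakyReLU(0.2),
                                      nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x_src, x_edit):
        logits = torch.cat([self.score_src(x_src), self.score_edit(x_edit)], dim=1)
        beta = torch.softmax(logits, dim=1)           # a pair of attention weights beta_1, beta_2
        blended = beta[:, :1] * x_src + beta[:, 1:] * x_edit
        return blended + self.residual(blended)       # F_residual refinement
```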
Exemplarily, the training of the detail correction model 2014 includes: training the generator of the detail correction model with a conditional generator loss function, an unconditional generator loss function, and a semantic contrast function; and training the discriminator of the detail correction model with a conditional discriminator loss function and an unconditional discriminator loss function. The training datasets are the MS-COCO dataset and the CUB200 dataset. The conditional generator and discriminator loss functions are as follows (given as an equation image in the original):

Here, L_Gs,0 is the conditional generator loss function and L_Ds,0 is the conditional discriminator loss function; D_S is the discriminator of the detail correction model in the image editing model; the target corrected image is generated by the detail correction model (SCDM) given the training image I and the description text T_I of that training image; I ~ P_data means that the training image I is sampled from real data; T is a randomly selected target text; and the semantic contrast function is intended to make the target corrected image closer to the description text T_I of the training image I than to the randomly selected target text T. The semantic contrast function is defined as follows (given as an equation image in the original), where ρ_c is a contrast control threshold and L_corre is the correlation function in ControlGAN, which describes how well the training text matches the target corrected image.
The unconditional generator and discriminator loss functions are as follows (given as an equation image in the original), where L_Gs,1 is the unconditional generator loss function and L_Ds,1 is the unconditional discriminator loss function; D_S is the discriminator of the detail correction module in the image editing model; and the target corrected image is generated by the detail correction model (SCDM) given the training image I and a randomly selected target text T. The total loss functions of the generator and the discriminator in the detail correction model 2014 are likewise given as equation images in the original. In them, L_Gs is the total loss function of the generator and L_Ds is the total loss function of the discriminator; L_ControlGAN is the multimodal loss function over text and images, used to improve the match between the image editing result and the target text; the image local feature corresponding to the target corrected image is obtained by performing feature extraction on the target corrected image with a VGG network, and the image local feature M_I is obtained by performing feature extraction on the target source image I with a VGG network; the function L_DAMSM is the text-image similarity function defined in ManiGAN; and the L_reg function is the regularization term defined in the image editing model, used to strengthen the modification effect. During training of the detail correction model, the training image I and the target corrected image are randomly substituted for each other to accelerate the training process of the detail correction model.
For ease of understanding, the overall flow of the text-based image editing method provided by this application is described below by way of example with reference to Figure 4.
As shown in Figure 4, the configurable editing part means that the user can control the intermediate editing results output by the sampling encoding module G_00, the generation module G_01, and the generation module G_02 in the image editing model, and can selectively skip certain modules by judging the intermediate editing results. An intermediate editing result that does not meet the user's requirements is replaced by the target source image, so that the target source image is edited according to the target text. The sub-flowcharts (a) to (h) in Figure 4 are used below to explain how the user controls the intermediate editing results output by the sampling encoding module G_00, the generation module G_01, and the generation module G_02 in the image editing model.
In Figure 4, (a) complete flow: the user inputs the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text, into the image editing model. The first edited image output by the sampling encoding module G_00 in the image editing model meets the user's requirements; the sampling encoding module G_00 inputs the first edited image into the generation module G_01, and the second edited image that the generation module G_01 produces from the first edited image also meets the user's requirements. The second edited image is input into the generation module G_02; the third edited image that the generation module G_02 produces from the second edited image also meets the user's requirements, so the generation module G_02 inputs the third edited image (i.e., the target edited image) into the detail correction model (SCDM), which performs detail correction on the third edited image to obtain the target corrected image.
In Figure 4, (b) skip G_00: the user inputs the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text, into the image editing model. The first edited image output by the sampling encoding module G_00 does not meet the user's requirements; the first edited image is discarded, and the target source image is input into the generation module G_01 in its place. The second edited image that the generation module G_01 produces from the target source image meets the user's requirements and is input into the generation module G_02; the third edited image that the generation module G_02 produces from the second edited image also meets the user's requirements, so the generation module G_02 inputs the third edited image (i.e., the target edited image) into the detail correction model (SCDM), which performs detail correction on the third edited image to obtain the target corrected image.
In Figure 4, (c) skip G_00 and G_01: the user inputs the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text, into the image editing model. The first edited image output by the sampling encoding module G_00 does not meet the user's requirements; it is discarded, and the target source image is input into the generation module G_01 in its place. The second edited image that the generation module G_01 produces from the target source image also does not meet the user's requirements; the generation module G_01 discards the second edited image, and the target source image is input into the generation module G_02 in its place. The third edited image that the generation module G_02 produces from the target source image meets the user's requirements, so the generation module G_02 inputs the third edited image (i.e., the target edited image) into the detail correction model (SCDM), which performs detail correction on the third edited image to obtain the target corrected image.
In Figure 4, (d) SCDM only (i.e., skip G_00, G_01, and G_02): the user inputs the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text, into the image editing model. The first edited image output by the sampling encoding module G_00 does not meet the user's requirements; it is discarded, and the target source image is input into the generation module G_01 in its place. The second edited image that the generation module G_01 produces from the target source image also does not meet the user's requirements; it is discarded, and the target source image is input into the generation module G_02 in its place. The third edited image that the generation module G_02 produces from the target source image also does not meet the user's requirements; it is discarded, and the target source image is input into the detail correction model (SCDM) in its place. The detail correction model (SCDM) performs detail correction on the target source image according to the image local features of the target source image and the sentence word features of the target text, and obtains the target corrected image.
In Figure 4, (e) skip SCDM: the user inputs the image overall features and image local features of the target source image, and the sentence overall features and sentence word features of the target text, into the image editing model. The first edited image output by the sampling encoding module G_00 meets the user's requirements; the sampling encoding module G_00 inputs the first edited image into the generation module G_01, and the second edited image that the generation module G_01 produces from the first edited image also meets the user's requirements. The second edited image is input into the generation module G_02; the third edited image that the generation module G_02 produces from the second edited image also meets the user's requirements, and this third edited image is the target edited image.
In Fig. 4, (f) G01 is repeated: the user inputs the image overall features and image local features of the target source image, together with the sentence overall features and sentence word features of the target text, into the image editing model. The first edited image output by the sampling encoding module G00 meets the user's requirements, so G00 inputs it into the generation module G01. The second edited image produced by G01 from the first edited image does not meet the user's requirements; G01 can be reused repeatedly until it outputs a second edited image that does. For example, G01 reprocesses the first editing result and generates a new second edited image that meets the requirements, and this new second edited image is input into the generation module G02. The third edited image generated by G02 from the new second edited image also meets the user's requirements; G02 inputs this third edited image (i.e., the target edited image) into the detail correction model (SCDM), which performs detail correction on it and obtains the target corrected image.
In Fig. 4, (g) G02 is repeated: the user inputs the image overall features and image local features of the target source image, together with the sentence overall features and sentence word features of the target text, into the image editing model. The first edited image output by the sampling encoding module G00 meets the user's requirements, so G00 inputs it into the generation module G01. The second edited image produced by G01 from the first edited image also meets the user's requirements. The third edited image generated by G02 from the second edited image does not meet the user's requirements; G02 can be reused repeatedly until it outputs a third edited image (i.e., the target edited image) that does. For example, G02 reprocesses the second editing result and generates a new third edited image that meets the requirements; G02 then inputs this new third edited image (i.e., the target edited image) into the detail correction model (SCDM), which performs detail correction on it and obtains the target corrected image.
In Fig. 4, (h) the SCDM is repeated: the user inputs the image overall features and image local features of the target source image, together with the sentence overall features and sentence word features of the target text, into the image editing model. The first edited image output by the sampling encoding module G00 meets the user's requirements, so G00 inputs it into the generation module G01. The second edited image produced by G01 from the first edited image also meets the user's requirements, as does the third edited image generated by G02 from the second edited image. G02 inputs the third edited image (i.e., the target edited image) into the detail correction model (SCDM). The fourth edited image obtained by the SCDM after detail correction of the third edited image does not meet the user's requirements; the SCDM can be reused repeatedly until it outputs a fourth edited image (i.e., the target corrected image) that does. For example, the SCDM reprocesses the third editing result and generates a new fourth edited image that meets the requirements; this new fourth edited image is the target corrected image.
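To make the interactive control flow of the Fig. 4 scenarios concrete, the following minimal sketch shows how an intermediate result could be accepted, discarded in favor of the target source image, or regenerated. The function names (sample_encode, generators, scdm) and the ask_user() callback are illustrative assumptions, not the patent's actual interface.

```python
def edit_with_user_control(src_img, img_global, img_local, sent_global, sent_words,
                           sample_encode, generators, scdm, ask_user, max_retries=3):
    # Stage G00: the sampling encoding module produces the first edited image.
    current = sample_encode(img_global, sent_global, img_local)
    if not ask_user(current):
        current = src_img                      # discard and fall back to the source image

    # Cascaded generation modules (G01, G02, ...): accept, repeat, or skip.
    for gen in generators:
        out = gen(current, img_local, sent_words)
        accepted, retries = ask_user(out), 0
        while not accepted and retries < max_retries:
            out = gen(current, img_local, sent_words)   # repeat this module (Fig. 4 (f)/(g))
            accepted, retries = ask_user(out), retries + 1
        current = out if accepted else src_img          # skip: pass the source image on

    # Detail correction model (SCDM) refines the accepted result (Fig. 4 (h)).
    refined = scdm(current, img_local, sent_words)
    while not ask_user(refined):
        refined = scdm(current, img_local, sent_words)  # repeat the SCDM until accepted
    return refined
```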
Compared with the prior art, the technical solutions proposed in the present application provide the following beneficial effects:
Compared with the existing ManiGAN, which edits the source image according to the text and directly outputs an editing result that may not meet the user's requirements, the present application introduces a sampling encoding module into the existing ManiGAN to form an improved ManiGAN. The sampling encoding module outputs the intermediate editing result (i.e., the first edited image) so that the user can judge whether it meets the requirements. If it does, the intermediate editing result is passed on to the at least one cascaded generation module; if it does not, the intermediate result is not passed on, and the target source image is passed to the at least one cascaded generation module in its place. It can thus be seen that, when editing the target source image according to the target text, the improved ManiGAN can control the intermediate editing results and promptly discard those that do not meet the requirements, preventing an unsatisfactory output of a preceding stage from degrading the accuracy of the output of the following stage, and thereby producing a target edited image that better meets the user's requirements.
The first noisy affine combination module is introduced into the above first self-attention module. By injecting Gaussian noise, the first noisy affine combination module can enhance the reliability of the images edited by the generation module, thereby preventing random noise present in the image from compromising the reliability of the editing result.
Introducing the second noisy affine combination module and the third noisy affine combination module into the first upsampling module can further enhance the visual features of the outputs of the different upsampling layers in the first upsampling module.
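As one possible reading of the noisy affine combination modules described above, the following PyTorch-style sketch predicts a per-pixel scale and shift from the conditioning features (for example, the image local features) and injects Gaussian noise into the hidden features. The layer sizes and noise strength are assumptions for illustration only, not values taken from the patent.

```python
import torch
import torch.nn as nn

class NoisyAffineCombination(nn.Module):
    # Conditions a scale and a shift on the image local features and applies
    # them to the hidden features after adding zero-mean Gaussian noise.
    def __init__(self, hidden_ch, cond_ch, noise_std=1.0):
        super().__init__()
        self.scale = nn.Conv2d(cond_ch, hidden_ch, kernel_size=3, padding=1)
        self.shift = nn.Conv2d(cond_ch, hidden_ch, kernel_size=3, padding=1)
        self.noise_std = noise_std

    def forward(self, hidden, cond):
        noise = torch.randn_like(hidden) * self.noise_std    # Gaussian noise injection
        hidden = hidden + noise
        return self.scale(cond) * hidden + self.shift(cond)  # affine combination
```

A block like this could sit between two upsampling layers, taking the previous layer's feature map as `hidden` and the image local features (resized to the same spatial resolution) as `cond`.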
Adding the detail correction model to the above image editing model enables further detail modification and enhancement of the target edited image output by the image editing model, so as to obtain a high-resolution target corrected image.
Adding multiple noisy affine combination modules to the above first detail correction module can enhance the reliability of the detail correction model.
Adding multiple noisy affine combination modules to the above second detail correction module can likewise enhance the reliability of the detail correction model.
Training the generator of the detail correction model according to the conditional generator loss function, the unconditional generator loss function and the semantic contrast function makes the image editing result generated by the generator (i.e., the target edited image) conform better to the content described by the target text and to the user's requirements. Training the discriminator of the detail correction model according to the conditional discriminator loss function and the unconditional discriminator loss function makes the discriminator's recognition results more accurate.
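A hedged sketch of what such a training objective could look like is given below: an unconditional and a conditional adversarial term for the generator plus a semantic contrast term that pulls the edited-image embedding toward the text embedding, and conditional plus unconditional terms for the discriminator. The exact loss forms and weights used by the detail correction model are not specified here; the ones below are assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_uncond, d_fake_cond, img_emb, txt_emb, lambda_sem=1.0):
    # Unconditional and conditional adversarial terms: the generator tries
    # to make the discriminator label its outputs as real.
    l_uncond = F.binary_cross_entropy_with_logits(
        d_fake_uncond, torch.ones_like(d_fake_uncond))
    l_cond = F.binary_cross_entropy_with_logits(
        d_fake_cond, torch.ones_like(d_fake_cond))
    # Semantic contrast term: pull the edited-image embedding toward the text embedding.
    l_sem = (1.0 - F.cosine_similarity(img_emb, txt_emb, dim=-1)).mean()
    return l_uncond + l_cond + lambda_sem * l_sem

def discriminator_loss(d_real_uncond, d_real_cond, d_fake_uncond, d_fake_cond):
    # Conditional plus unconditional terms: real images (optionally paired with
    # the matching text) should score high, generated images low.
    real, fake = torch.ones_like, torch.zeros_like
    return (F.binary_cross_entropy_with_logits(d_real_uncond, real(d_real_uncond)) +
            F.binary_cross_entropy_with_logits(d_fake_uncond, fake(d_fake_uncond)) +
            F.binary_cross_entropy_with_logits(d_real_cond, real(d_real_cond)) +
            F.binary_cross_entropy_with_logits(d_fake_cond, fake(d_fake_cond)))
```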
In the above training process of the image editing model, the N sub-networks are trained such that, by means of the automatic codecs, a preceding sub-network whose output does not meet the requirements is skipped and the subsequent sub-networks are trained with priority. Because the target source image, rather than the intermediate editing result (for example, the first edited image), is used when skipping, erroneous outputs of a preceding sub-network are prevented from propagating to the subsequent sub-networks. Training the subsequent sub-networks first provides better update gradients to the preceding sub-networks, so that the preceding sub-networks converge better.
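The following simplified sketch illustrates this selective training idea: sub-networks whose outputs fail the check forward the target source image instead, and only the sub-functions of the sub-networks after the last failure, plus the bridging autoencoder loss, contribute to the update. The meets_condition() check and the loss signatures are assumptions for illustration, not the patent's actual training code.

```python
def selective_training_loss(subnets, sub_losses, codec_losses, src_img, text_feats,
                            meets_condition):
    # Run the cascade, replacing any rejected intermediate result with the
    # target source image so errors do not propagate to later sub-networks.
    outputs, current, last_failed = [], src_img, -1
    for i, net in enumerate(subnets):
        out = net(current, text_feats)
        outputs.append(out)
        if meets_condition(out):
            current = out
        else:
            current, last_failed = src_img, i
    # Only the sub-functions of the sub-networks after the last failure contribute,
    # plus the autoencoder loss that bridges the skipped stage.
    loss = sum(sub_losses[j](outputs[j]) for j in range(last_failed + 1, len(subnets)))
    if 0 <= last_failed < len(codec_losses):
        loss = loss + codec_losses[last_failed](outputs[last_failed], src_img)
    return loss
```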
Fig. 5 shows a schematic structural diagram of an electronic device provided by the present application. The dotted lines in Fig. 5 indicate that the corresponding unit or module is optional. The electronic device 500 may be used to implement the methods described in the foregoing method embodiments. The electronic device 500 may be a terminal device, a server, or a chip.
The electronic device 500 includes one or more processors 501, and the one or more processors 501 can support the electronic device 500 in implementing the method of the method embodiment corresponding to Fig. 1. The processor 501 may be a general-purpose processor or a special-purpose processor. For example, the processor 501 may be a central processing unit (CPU). The CPU may be used to control the electronic device 500, execute software programs, and process data of the software programs. The electronic device 500 may further include a communication unit 505 configured to implement input (reception) and output (transmission) of signals.
For example, the electronic device 500 may be a chip, and the communication unit 505 may be an input and/or output circuit of the chip, or the communication unit 505 may be a communication interface of the chip, and the chip may serve as a component of a terminal device.
For another example, the electronic device 500 may be a terminal device, and the communication unit 505 may be a transceiver of the terminal device, or the communication unit 505 may be a transceiver circuit of the terminal device.
The electronic device 500 may include one or more memories 502 storing a program 504, and the program 504 may be run by the processor 501 to generate instructions 503, so that the processor 501 executes, according to the instructions 503, the methods described in the foregoing method embodiments. Optionally, data may also be stored in the memory 502. Optionally, the processor 501 may also read the data stored in the memory 502; the data may be stored at the same storage address as the program 504, or at a different storage address from the program 504.
The processor 501 and the memory 502 may be provided separately, or may be integrated together, for example, integrated on a system on chip (SOC) of the terminal device.
For the specific manner in which the processor 501 executes the method, reference may be made to the relevant descriptions in the method embodiments.
It should be understood that the steps of the foregoing method embodiments may be completed by logic circuits in the form of hardware or by instructions in the form of software in the processor 501. The processor 501 may be a CPU, a digital signal processor (DSP), a field programmable gate array (FPGA), or another programmable logic device, for example, a discrete gate, a transistor logic device, or a discrete hardware component.
The present application further provides a computer program product which, when executed by the processor 501, implements the method described in any of the method embodiments of the present application.
The computer program product may be stored in the memory 502, for example as the program 504, and the program 504 is finally converted, through processes such as preprocessing, compiling, assembling and linking, into an executable object file that can be executed by the processor 501.
The present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a computer, the method described in any of the method embodiments of the present application is implemented. The computer program may be a high-level language program or an executable object program.
The computer-readable storage medium is, for example, the memory 502. The memory 502 may be a volatile memory or a non-volatile memory, or the memory 502 may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DRRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes and technical effects of the apparatuses and devices described above, reference may be made to the corresponding processes and technical effects in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in this application, the disclosed systems, apparatuses and methods may be implemented in other manners. For example, some features of the method embodiments described above may be omitted or not performed. The apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other division manners are possible in actual implementation; multiple units or components may be combined or integrated into another system. In addition, the coupling between units or between components may be direct coupling or indirect coupling, and the above coupling includes electrical, mechanical or other forms of connection.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (12)

  1. A text-based image editing method, characterized in that the method comprises:
    acquiring image overall features and image local features of a target source image, and sentence overall features and sentence word features of a target text;
    editing the target source image based on an image editing model according to the image overall features, the image local features, the sentence overall features and the sentence word features, to obtain a target edited image;
    wherein the image editing model comprises a sampling encoding module and at least one cascaded generation module; the processing of the target source image by the image editing model comprises: performing sampling encoding processing on the image overall features, the sentence overall features and the image local features by using the sampling encoding module to obtain a first edited image, and outputting the first edited image; and, in response to a user instruction, inputting the first edited image, the image local features and the sentence word features into the at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image, or inputting the target source image, the image local features and the sentence word features into the at least one cascaded generation module for high-dimensional visual feature extraction to obtain the target edited image.
  2. The method according to claim 1, wherein the generation module comprises a first automatic decoder, a first self-attention module, a second upsampling module and a second automatic encoder;
    the first automatic decoder is configured to restore high-dimensional visual features of input information to obtain a first high-dimensional feature image, the input information being the first edited image or the output information of a preceding generation module;
    the first self-attention module is configured to perform fusion and concatenation processing on the first high-dimensional feature image and the sentence word features to obtain sentence semantic information features;
    the second upsampling module is configured to perform feature fusion and upsampling processing on the sentence semantic information features to obtain a second upsampling result;
    the second automatic encoder is configured to perform high-dimensional visual feature extraction on the second upsampling result to obtain output information; when the generation module is the last generation module of the at least one cascaded generation module, the output information is the target edited image.
  3. The method according to claim 2, wherein the first self-attention module comprises a self-attention layer and a first noisy affine combination module;
    the self-attention layer is configured to perform feature fusion on the first high-dimensional feature image and the sentence word features;
    the first noisy affine combination module is configured to perform feature fusion on the concatenation result of the output of the self-attention layer and the first high-dimensional feature image, and on the image local features.
  4. The method according to any one of claims 1 to 3, wherein the sampling encoding module comprises a first upsampling module and a first automatic encoder;
    the first upsampling module is configured to perform upsampling processing on the image overall features, the sentence overall features and the image local features to obtain a first upsampling result;
    the first automatic encoder is configured to generate the first edited image according to the first upsampling result.
  5. The method according to claim 4, wherein the first upsampling module comprises a plurality of identical upsampling layers, a second noisy affine combination module and a third noisy affine combination module;
    the input of the first upsampling module is the sentence overall features, the image overall features and the image local features; for two adjacent upsampling layers among the plurality of identical upsampling layers, the input of the latter upsampling layer is the output of the former upsampling layer;
    the second noisy affine combination module is located between any two upsampling layers of the plurality of identical upsampling layers and is configured to perform feature fusion on the output of the former of the two upsampling layers and the image local features;
    the third noisy affine combination module is configured to perform feature fusion on the output of the last of the plurality of identical upsampling layers and the image local features.
  6. The method according to any one of claims 1 to 3, wherein the image editing model further comprises a detail correction model configured to perform detail modification on the target edited image;
    the detail correction model is configured to process the image local features, the sentence word features and the target edited image to obtain a target corrected image;
    the detail correction model comprises a first detail correction module, a second detail correction module, a fusion module and a generator, wherein the first detail correction module is configured to perform detail modification on the image local features, first random noise and the sentence word features to obtain first detail features;
    the second detail correction module is configured to perform detail modification on the image local features corresponding to the target edited image, second random noise and the sentence word features to obtain second detail features;
    the fusion module is configured to perform feature fusion on the first detail features and the second detail features;
    the generator is configured to generate the target corrected image according to the output of the fusion module.
  7. The method according to claim 6, wherein the first detail correction module comprises a fourth noisy affine combination module, a fifth noisy affine combination module, a sixth noisy affine combination module, a second self-attention module, a first residual network and a first linear network;
    the fourth noisy affine combination module is configured to perform feature fusion on the first random noise and the image local features to obtain a first fused feature;
    the second self-attention module is configured to perform feature fusion on the first fused feature and the sentence word features;
    the fifth noisy affine combination module performs feature fusion on the concatenation result of the output of the second self-attention module and the first random noise, and on the image local features;
    the first residual network is configured to perform visual feature extraction on the output of the fifth noisy affine combination module;
    the first linear network is configured to perform a linear transformation on the image local features;
    the sixth noisy affine combination module is configured to perform feature fusion on the output of the first residual network and the output of the first linear network.
  8. The method according to claim 6, wherein the second detail correction module comprises a seventh noisy affine combination module, an eighth noisy affine combination module, a ninth noisy affine combination module, a third self-attention module, a second residual network and a second linear network;
    the seventh noisy affine combination module is configured to perform feature fusion on the second random noise and the image local features corresponding to the target edited image to obtain a first fused feature;
    the third self-attention module is configured to perform feature fusion on the first fused feature and the sentence word features;
    the eighth noisy affine combination module performs feature fusion on the concatenation result of the output of the third self-attention module and the second random noise, and on the image local features corresponding to the target edited image;
    the second residual network is configured to perform visual feature extraction on the output of the eighth noisy affine combination module;
    the second linear network is configured to perform a linear transformation on the image local features corresponding to the target edited image;
    the ninth noisy affine combination module is configured to perform feature fusion on the output of the second residual network and the output of the second linear network.
  9. The method according to claim 6, wherein the training of the detail correction model comprises:
    training a generator of the detail correction model according to a conditional generator loss function, an unconditional generator loss function and a semantic contrast function;
    training a discriminator of the detail correction model according to a conditional discriminator loss function and an unconditional discriminator loss function.
  10. The method according to any one of claims 1 to 3, wherein the training of the image editing model comprises:
    training an initial model by using a preset loss function and a training set to obtain the image editing model;
    wherein the preset loss function comprises sub-functions respectively corresponding to N sub-networks and loss functions of N-1 automatic codecs, the initial model comprises the N sub-networks, and the N sub-networks are the initial models respectively corresponding to the sampling encoding module and the at least one generation module;
    during training, when the output image of the i-th sub-network does not satisfy a preset condition, the sub-functions corresponding to the (i+1)-th to N-th sub-networks and the loss functions of the i-th to (i+1)-th automatic codecs are used to train the initial model, where 0 ≤ i < N.
  11. An electronic device, characterized in that the device comprises a processor and a memory, the memory is configured to store a computer program, and the processor is configured to call and run the computer program from the memory, so that the device performs the method according to any one of claims 1 to 10.
  12. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the processor is caused to execute the method according to any one of claims 1 to 10.
PCT/CN2021/123272 2021-10-12 2021-10-12 Text-based image editing method, and electronic device WO2023060434A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/123272 WO2023060434A1 (en) 2021-10-12 2021-10-12 Text-based image editing method, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/123272 WO2023060434A1 (en) 2021-10-12 2021-10-12 Text-based image editing method, and electronic device

Publications (1)

Publication Number Publication Date
WO2023060434A1 true WO2023060434A1 (en) 2023-04-20

Family

ID=85988130

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123272 WO2023060434A1 (en) 2021-10-12 2021-10-12 Text-based image editing method, and electronic device

Country Status (1)

Country Link
WO (1) WO2023060434A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503517A (en) * 2023-06-27 2023-07-28 江西农业大学 Method and system for generating image by long text
CN116681810A (en) * 2023-08-03 2023-09-01 腾讯科技(深圳)有限公司 Virtual object action generation method, device, computer equipment and storage medium
CN116681630A (en) * 2023-08-03 2023-09-01 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137681A1 (en) * 2016-11-17 2018-05-17 Adobe Systems Incorporated Methods and systems for generating virtual reality environments from electronic documents
CN110443863A (en) * 2019-07-23 2019-11-12 中国科学院深圳先进技术研究院 Method, electronic equipment and the storage medium of text generation image
CN111144414A (en) * 2019-01-25 2020-05-12 邹玉平 Image processing method, related device and system
CN113093960A (en) * 2021-04-16 2021-07-09 南京维沃软件技术有限公司 Image editing method, editing device, electronic device and readable storage medium
CN113158630A (en) * 2021-03-15 2021-07-23 苏州科技大学 Text editing image method, storage medium, electronic device and system
CN113448477A (en) * 2021-08-31 2021-09-28 南昌航空大学 Interactive image editing method and device, readable storage medium and electronic equipment
CN113487629A (en) * 2021-07-07 2021-10-08 电子科技大学 Image attribute editing method based on structured scene and text description


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503517A (en) * 2023-06-27 2023-07-28 江西农业大学 Method and system for generating image by long text
CN116503517B (en) * 2023-06-27 2023-09-05 江西农业大学 Method and system for generating image by long text
CN116681810A (en) * 2023-08-03 2023-09-01 腾讯科技(深圳)有限公司 Virtual object action generation method, device, computer equipment and storage medium
CN116681630A (en) * 2023-08-03 2023-09-01 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium
CN116681810B (en) * 2023-08-03 2023-10-03 腾讯科技(深圳)有限公司 Virtual object action generation method, device, computer equipment and storage medium
CN116681630B (en) * 2023-08-03 2023-11-10 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2023060434A1 (en) Text-based image editing method, and electronic device
WO2022095682A1 (en) Text classification model training method, text classification method and apparatus, device, storage medium, and computer program product
US20230351748A1 (en) Image recognition method and system based on deep learning
WO2021127817A1 (en) Speech synthesis method, device, and apparatus for multilingual text, and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
WO2023072067A1 (en) Face attribute editing model training and face attribute editing methods
CN107463928A (en) Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM
CN114092758A (en) Text-based image editing method and electronic equipment
WO2023130650A1 (en) Image restoration method and apparatus, electronic device, and storage medium
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN116128894A (en) Image segmentation method and device and electronic equipment
CN115761075A (en) Face image generation method, device, equipment, medium and product
CN116524299A (en) Image sample generation method, device, equipment and storage medium
KR20200145315A (en) Method of predicting lip position for synthesizing a person's speech video based on modified cnn
CN117197271A (en) Image generation method, device, electronic equipment and storage medium
CN116681810A (en) Virtual object action generation method, device, computer equipment and storage medium
CN112149651A (en) Facial expression recognition method, device and equipment based on deep learning
CN114155417B (en) Image target identification method and device, electronic equipment and computer storage medium
CN114743539A (en) Speech synthesis method, apparatus, device and storage medium
CN115810073A (en) Virtual image generation method and device
CN112948582B (en) Data processing method, device, equipment and readable medium
WO2022021304A1 (en) Method for identifying highlight clip in video on basis of bullet screen, and terminal and storage medium
CN113420869B (en) Translation method based on omnidirectional attention and related equipment thereof
US11508369B2 (en) Method and apparatus with speech processing
CN111325068B (en) Video description method and device based on convolutional neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21960183

Country of ref document: EP

Kind code of ref document: A1