CN113448477A - Interactive image editing method and device, readable storage medium and electronic equipment - Google Patents

Interactive image editing method and device, readable storage medium and electronic equipment

Info

Publication number
CN113448477A
CN113448477A (application number CN202111008172.XA)
Authority
CN
China
Prior art keywords
image
text
features
attribute
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111008172.XA
Other languages
Chinese (zh)
Other versions
CN113448477B (en)
Inventor
李波
林枭
刘彬
刘奋成
赵旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Hangkong University
Lenovo New Vision Nanchang Artificial Intelligence Industrial Research Institute Co Ltd
Original Assignee
Nanchang Hangkong University
Lenovo New Vision Nanchang Artificial Intelligence Industrial Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Hangkong University and Lenovo New Vision Nanchang Artificial Intelligence Industrial Research Institute Co Ltd
Priority to CN202111008172.XA
Publication of CN113448477A
Application granted
Publication of CN113448477B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F 3/04845 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, for image manipulation, e.g. dragging, rotation, expansion or change of colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Processing Or Creating Images (AREA)

Abstract

An interactive image editing method, an interactive image editing apparatus, a readable storage medium and an electronic device are provided. The method comprises the following steps: extracting attribute features from an original image to obtain image attribute features; performing word embedding and context-semantic encoding on the descriptive text corresponding to the original image to obtain text features; fusing the image attribute features and the text features to obtain fused features; extracting overall structural features of the original image; performing spatial-attention fusion on the overall structural features and the fused features to obtain corrected structural features of the edited region; performing structural-feature completion of the non-edited region on the corrected structural features of the edited region to obtain corrected overall structural features; and inputting the corrected overall structural features into a generator, so that the generator generates an image matching the descriptive text under the guidance of the fused features.

Description

Interactive image editing method and device, readable storage medium and electronic equipment
Technical Field
The present invention relates to the field of image editing, and in particular, to an interactive image editing method, an interactive image editing apparatus, a readable storage medium, and an electronic device.
Background
Interactive image editing based on text descriptions aims to edit an image interactively according to a textual description. Natural language is one of the most important and common modes of human communication, and using textual descriptions to interactively edit images is an important research direction of modern artificial intelligence in the field of image processing.
Although existing methods have made some progress on text-based interactive image editing and can preliminarily understand the editing intention expressed in the text, ensuring the joint consistency of the spatial and textual attention of the edit while keeping the non-edited region decoupled remains the main difficulty.
Existing text-based image editing methods mainly encode the text information and the image data into a latent semantic manifold space through encoders, realize text-guided interactive editing in this high-level semantic manifold space by combining and operating on the text encoding and the image semantic attribute encoding, and finally generate the editing result through a decoder. Such methods are essentially extensions of text-to-image generation; they lack an explicit definition of and constraint on the edited and non-edited regions, so most results show obvious changes in the non-edited region, and the quality of the edited image is therefore low.
Disclosure of Invention
In view of the above, it is necessary to provide an interactive image editing method, an interactive image editing apparatus, a readable storage medium and an electronic device, to solve the problem that the quality of the edited image is low in prior-art text-based image editing methods.
An interactive image editing method comprising:
extracting attribute features of the original image to obtain image attribute features;
performing word embedding and encoding of context semantics on the descriptive text corresponding to the original image to obtain text characteristics;
fusing the image attribute features and the text features to obtain fused features;
extracting the integral structural features of the original image;
performing space attention fusion processing on the overall structural feature and the fusion feature to obtain a corrected structural feature of the edited region;
performing structural feature completion of a non-edited region on the corrected structural feature of the edited region to obtain a corrected overall structural feature;
inputting the corrected overall structural features into a generator so that the generator generates an image matched with the descriptive text based on the fusion feature guide.
Further, in the above interactive image editing method, the step of extracting the attribute features of the original image to obtain the image attribute features includes:
inputting the original image into an image attribute encoder, so that the image attribute encoder encodes it with Inception-v3 and outputs the last-layer vector to obtain global attribute features;
taking the global attribute features as input, using a set of multi-layer perceptrons defined according to a hyper-parameter to estimate a Gaussian mixture distribution whose dimensionality corresponds to the input image, and obtaining the image attribute features from this distribution.
Further, in the above interactive image editing method, the step of performing context semantic word embedding and encoding on the descriptive text corresponding to the original image to obtain the text feature includes:
mapping the descriptive text corresponding to the original image through a vocabulary to obtain a group of word indices, and embedding them to obtain word vectors of the descriptive text length;
and inputting the word vectors of the descriptive text length into a text encoder, and acquiring the output vector of each time-step node to obtain the text features.
Further, in the above interactive image editing method, the step of fusing the image attribute features and the text features to obtain fused features includes:
concatenating the image attribute features with each word vector in the text features along the column direction to obtain concatenated features;
inputting the concatenated features into a Bi-LSTM model, and taking the output of each time-step node of the Bi-LSTM model as the fused features of the corresponding word and the image attribute distribution;
and taking the hidden-layer output vector of the last node of the Bi-LSTM model as the image attribute-text fusion code, and passing the image attribute-text fusion code through a group of multi-layer perceptrons to decouple the parameter vector group corresponding to the fused image attribute distribution.
Further, in the above interactive image editing method, the step of inputting the corrected overall structural features into a generator so that the generator generates an image matching the descriptive text under the guidance of the fused features includes:
transferring the parameter vector group to the currently generated image as variable parameters in the generator structure;
inputting the corrected overall structural features into the generator, processing them through multiple upsampling and convolution blocks, and outputting an image matching the descriptive text.
Further, in the above interactive image editing method, before the step of extracting the attribute features of the original image, the method further includes:
constructing an interactive image editing model from an image attribute encoder, a text encoder, a content encoder, a fuser and a generator;
and training the constructed interactive image editing model in a cross-cycle manner.
Further, in the above interactive image editing method, before the step of extracting the attribute features of the original image, the method further includes:
and pre-training the image attribute encoder and the text encoder for mapping space alignment by adopting a DAMSM algorithm.
The invention also discloses an interactive image editing device, comprising:
the image attribute feature extraction module is used for extracting attribute features from the original image to obtain image attribute features;
the text feature coding module is used for performing word embedding and context-semantic encoding on the descriptive text corresponding to the original image to obtain text features;
the fusion module is used for fusing the image attribute features and the text features to obtain fused features;
the overall structure extraction module is used for extracting overall structural features of the original image;
the fusion processing module is used for performing spatial-attention fusion on the overall structural features and the fused features to obtain corrected structural features of the edited region;
the structure completion module is used for performing structural-feature completion of the non-edited region on the corrected structural features of the edited region to obtain corrected overall structural features;
and the input module is used for inputting the corrected overall structural features into the generator, so that the generator generates an image matching the descriptive text under the guidance of the fused features.
Further, the interactive image editing apparatus further includes:
the model building module is used for building an interactive image editing model from an image attribute encoder, a text encoder, a content encoder, a fuser and a generator;
and the model training module is used for training the constructed interactive image editing model in a cross-cycle manner.
Further, the interactive image editing apparatus further includes:
and the pre-training module is used for pre-training the alignment of the mapping space of the image attribute encoder and the text encoder by adopting a DAMSM algorithm.
The invention also discloses a readable storage medium on which a program is stored, which program, when executed by a processor, performs any of the methods described above.
The invention also discloses an electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the method according to any one of claims 1 to 7 when executing the program.
The method can separate the content features and the attribute features of an image. By integrating text semantic features into the image attribute features, it realizes truly text-constrained image editing, avoids the complexity and uncontrollability of regenerating an image from text, better preserves the regions irrelevant to the text description while modifying only the described object, and edits high-quality images at higher speed.
Drawings
FIG. 1 is a schematic diagram of an interactive image editing model according to an embodiment of the present invention;
FIG. 2 is a flowchart of an interactive image editing method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of the fuser according to an embodiment of the present invention;
FIG. 4 shows the results of a comparison experiment on image editing quality according to an embodiment of the present invention;
FIG. 5 is a visualization of an ablation experiment on the cycle-consistency training mode in an embodiment of the present invention;
FIG. 6 is a visualization of the effect of decoupling image attributes and content in an embodiment of the present invention;
fig. 7 is a block diagram of an interactive image editing apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
These and other aspects of embodiments of the invention will be apparent with reference to the following description and attached drawings. In the description and drawings, particular embodiments of the invention have been disclosed in detail as being indicative of some of the ways in which the principles of the embodiments of the invention may be practiced, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
The method of the invention requires that the model be trained on a data set of a certain object class, comprising single-object images of that class and a set of texts describing each image. There is no strict requirement on the input image size; a resolution of 256x256 is optimal, and the object to be edited should be salient in the image. The input text should be an English character string without any specific description format. The interactive image editing method in the embodiment of the present invention may be implemented with the interactive image editing model shown in fig. 1, where the model comprises an image attribute encoder EA, a text encoder ET, a content encoder EC, a fuser and a generator G; Lattr in the figure is the KL loss constraining the distribution of the edited attributes, and AdaIN (Adaptive Instance Normalization) denotes adaptive instance normalization.
Referring to fig. 2, an interactive image editing method according to an embodiment of the present invention includes steps S11 to S17.
In step S11, attribute features are extracted from the original image to obtain the image attribute features.
In order to better extract the attribute features of the image, an Inception-v3 network can be used as the core structure of the image attribute encoder; after the local and global features of the image are extracted, several different multi-layer perceptrons are used to estimate the distribution of the current features. The specific implementation steps are as follows:
S111: the original image is used as the input of the encoder and is encoded with Inception-v3 to obtain the local attribute features and the global attribute features of the image, which lie in real spaces whose dimensions are determined by the number of feature channels and the size of the local features, the original image itself being described by its number of channels and its image size;
S112: the global attribute features are taken as the input of a set of multi-layer perceptrons defined according to a hyper-parameter, namely the maximum number of attribute categories of the processed images; these perceptrons estimate a Gaussian mixture distribution whose dimensionality corresponds to the input image, and the resulting image attribute features are expressed as a set of output parameter vectors, one per attribute component.
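As an illustration of step S11, the following is a minimal PyTorch-style sketch of an image attribute encoder, assuming torchvision's Inception-v3 supplies the pooled global feature and a set of multi-layer perceptrons predicts Gaussian parameters for each attribute component. The class name, the number of attribute components, the hidden sizes and the frozen backbone are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageAttributeEncoder(nn.Module):
    """Sketch of the image attribute encoder EA (step S11):
    Inception-v3 features feed one small MLP per attribute component,
    each predicting the mean and log-variance of a Gaussian."""

    def __init__(self, n_attr: int = 10, attr_dim: int = 64):
        super().__init__()
        backbone = models.inception_v3(weights=None, aux_logits=True)
        backbone.fc = nn.Identity()          # keep the 2048-d pooled vector
        self.backbone = backbone
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(2048, 256), nn.ReLU(),
                          nn.Linear(256, 2 * attr_dim))
            for _ in range(n_attr)
        ])
        self.attr_dim = attr_dim

    def forward(self, image: torch.Tensor):
        # image: (B, 3, 299, 299) as expected by Inception-v3
        self.backbone.eval()                 # frozen backbone, illustrative simplification
        with torch.no_grad():
            global_feat = self.backbone(image)       # (B, 2048)
        params = []
        for head in self.heads:
            mu, log_var = head(global_feat).chunk(2, dim=-1)   # Gaussian parameters
            params.append((mu, log_var))
        return params                                # one (mu, log_var) per component


if __name__ == "__main__":
    enc = ImageAttributeEncoder()
    attrs = enc(torch.randn(2, 3, 299, 299))
    print(len(attrs), attrs[0][0].shape)             # 10 torch.Size([2, 64])
```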
In step S12, word embedding and context-semantic encoding are performed on the descriptive text corresponding to the original image to obtain the text features.
In specific implementation, the text is first given a primary word embedding, and then a recurrent neural network processes this primary embedding to obtain both a context-semantic word embedding of the text and a sentence embedding vector (global encoding) of the text. The specific implementation steps are as follows:
S121: the descriptive text is mapped through the vocabulary to obtain a group of word indices, which are embedded to obtain word vectors of the descriptive text length, whose dimension is the word vector dimension;
S122: a bidirectional long short-term memory model (Bi-LSTM) is adopted as the context-based text encoder; the word vectors are taken as input, and the output vector of each time-step node is taken as the context-dependent embedding of the word input at that node, giving the final word embedding of the text, namely the text features W;
S123: the hidden-layer output of the last time-step node of the Bi-LSTM is taken as the sentence encoding of the text and is used as a self-supervision variable in the DAMSM algorithm.
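As an illustration of step S12, the following is a minimal sketch of the text encoder ET: a word embedding layer followed by a bidirectional LSTM, whose per-step outputs serve as the contextual text features W and whose final hidden states serve as the sentence encoding. The vocabulary size, embedding dimension and hidden dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the text encoder ET (step S12): embedding + Bi-LSTM."""

    def __init__(self, vocab_size: int = 5000, emb_dim: int = 300, hid_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids: torch.Tensor):
        # word_ids: (B, T) indices obtained by vocabulary lookup
        vectors = self.embedding(word_ids)              # (B, T, emb_dim)
        word_feats, (h_n, _) = self.bilstm(vectors)     # (B, T, 2*hid_dim)
        # concatenate the last hidden states of both directions as the sentence code
        sent_code = torch.cat([h_n[0], h_n[1]], dim=-1) # (B, 2*hid_dim)
        return word_feats, sent_code                    # text features W, sentence encoding


if __name__ == "__main__":
    enc = TextEncoder()
    W, s = enc(torch.randint(0, 5000, (2, 12)))
    print(W.shape, s.shape)    # torch.Size([2, 12, 256]) torch.Size([2, 256])
```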
In step S13, the image attribute features and the text features are fused to obtain the fused features.
In specific implementation, the image attribute features obtained in step S11 and the text features obtained in step S12 are concatenated, a sequence-dependent fusion operation is performed by a recurrent network, the result of each time-step node and the hidden-layer result of the tail node are output, and the hidden-layer result is further passed through multi-layer perceptrons to obtain the fused image attribute distribution. The specific implementation process is as follows:
S131: the core structure of the fuser is shown in fig. 3, where MLP denotes a group of multi-layer perceptrons and LSTM a long short-term memory model; the image attribute features are concatenated column-wise with each word vector of the text features W (the i-th word vector being the vector of the i-th word of the text) to obtain the concatenated features;
S132: the image and text features are fused by a Bi-LSTM that takes the concatenated features as input; the hidden-layer state of the first time-step node is initialized with random noise to enhance the diversity of editing, and the output of each time-step node is taken as the fused feature of the corresponding word and the image attribute features;
S133: the hidden-layer output vector of the last node of the Bi-LSTM is taken as the image attribute-text fusion code, and this code is passed through a group of different multi-layer perceptrons to decouple the parameter vector group corresponding to the fused image attribute distribution.
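For steps S131-S133, the following is a minimal sketch of the fuser: the attribute code is concatenated with every word vector, the sequence is passed through a Bi-LSTM whose initial hidden state is random noise, and the final hidden state is decoupled into a parameter vector group by a set of multi-layer perceptrons. All dimensions and the number of parameter vectors are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Fuser(nn.Module):
    """Sketch of the fuser (step S13): concatenation + Bi-LSTM + MLP heads."""

    def __init__(self, attr_dim: int = 64, word_dim: int = 256,
                 hid_dim: int = 128, n_params: int = 10):
        super().__init__()
        self.bilstm = nn.LSTM(attr_dim + word_dim, hid_dim,
                              batch_first=True, bidirectional=True)
        self.hid_dim = hid_dim
        # one MLP per decoupled parameter vector of the fused attribute distribution
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * hid_dim, hid_dim), nn.ReLU(),
                          nn.Linear(hid_dim, attr_dim))
            for _ in range(n_params)
        ])

    def forward(self, attr_code: torch.Tensor, word_feats: torch.Tensor):
        # attr_code: (B, attr_dim); word_feats: (B, T, word_dim)
        B, T, _ = word_feats.shape
        # column-wise concatenation of the attribute code with every word vector
        tiled = attr_code.unsqueeze(1).expand(B, T, -1)
        seq = torch.cat([tiled, word_feats], dim=-1)          # (B, T, attr+word)
        # random-noise initial hidden state to diversify the editing
        h0 = torch.randn(2, B, self.hid_dim, device=seq.device)
        c0 = torch.zeros(2, B, self.hid_dim, device=seq.device)
        fused_seq, (h_n, _) = self.bilstm(seq, (h0, c0))      # per-word fused features
        fusion_code = torch.cat([h_n[0], h_n[1]], dim=-1)     # attribute-text fusion code
        params = [head(fusion_code) for head in self.heads]   # parameter vector group
        return fused_seq, params


if __name__ == "__main__":
    fused, params = Fuser()(torch.randn(2, 64), torch.randn(2, 12, 256))
    print(fused.shape, len(params), params[0].shape)
```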
In step S14, a CNN with a residual structure is adopted as the content encoder to extract the overall structural features of the original image.
In step S15, spatial-attention fusion is performed on the overall structural features and the fused features to obtain the corrected structural features of the edited region.
In step S16, structural-feature completion of the non-edited region is performed on the corrected structural features of the edited region to obtain the corrected overall structural features.
After the original image is encoded by the content encoder, the result is combined with the fused features obtained in step S13 by spatial-attention processing, which locates the positions on the content encoding that correspond to the fused features, while the positions not related to the fusion are recovered through a skip connection. The specific implementation process is as follows:
the original image is used as the input of the content encoder, whose output gives the overall structural features of the image;
the overall structural features and the image attribute-text fused features are subjected to spatial-attention fusion to obtain the corrected structural features of the edited region;
a skip-connection structure is adopted to complete the structural features of the regions unrelated to the editing on the corrected structural features of the edited region, which yields the corrected overall structural features.
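For steps S14-S16, the following is a minimal sketch of a residual content encoder combined with spatial-attention fusion and skip-connection completion. The concrete attention form (a sigmoid gate computed from the fused features) and all dimensions are illustrative assumptions; the patent only specifies that spatial-attention fusion selects the edited region and a skip connection completes the non-edited region.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)               # residual structure

class ContentEncoder(nn.Module):
    """Sketch of the content encoder EC (S14), spatial-attention fusion (S15)
    and skip-connection completion (S16)."""

    def __init__(self, feat_ch: int = 64, fuse_dim: int = 256):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, feat_ch, 4, stride=2, padding=1), nn.ReLU(),
            ResBlock(feat_ch), ResBlock(feat_ch))
        # project fused text-attribute features so they can gate spatial positions
        self.fuse_proj = nn.Linear(fuse_dim, feat_ch)

    def forward(self, image: torch.Tensor, fused: torch.Tensor):
        struct = self.encode(image)                           # overall structural features
        key = self.fuse_proj(fused)                           # (B, feat_ch)
        # spatial attention: similarity of every position to the fused feature
        attn = torch.einsum("bchw,bc->bhw", struct, key)
        attn = torch.sigmoid(attn).unsqueeze(1)               # (B, 1, H, W) edit mask
        # corrected structure of the edited region: inject the fused features there
        injected = key.unsqueeze(-1).unsqueeze(-1)            # (B, feat_ch, 1, 1)
        edited = attn * (struct + injected)
        # skip connection: the non-edited region keeps the original structure
        completed = edited + (1.0 - attn) * struct
        return completed, attn


if __name__ == "__main__":
    out, mask = ContentEncoder()(torch.randn(2, 3, 256, 256), torch.randn(2, 256))
    print(out.shape, mask.shape)
```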
In step S17, the corrected overall structural features are input into the generator, so that the generator generates an image matching the descriptive text under the guidance of the fused features.
The generator reprocesses the content encoding of the original image under the guidance of the fused features to generate the edited image. The specific implementation steps are as follows:
S171: Adaptive Instance Normalization (AdaIN) is adopted as the main normalization method of the generator, so that the parameter vector group is transferred, in a manner similar to style transfer, onto the currently generated image; the parameter vector group is adjusted by an affine transformation to the dimensions accepted by the generator and is used as variable parameters in the generator structure;
S172: the generator takes the corrected overall structural features as input and, after processing by multiple upsampling and convolution blocks, outputs the edited image matching the descriptive text.
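For step S17, the following is a minimal sketch of a generator built around adaptive instance normalization: the decoupled parameter vectors are mapped by an affine layer to per-channel scale and bias applied after instance normalization, and upsampling plus convolution blocks produce the output image. The number of blocks and the channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaIN(nn.Module):
    """Adaptive instance normalization: style parameters come from an affine
    transform of the attribute-text parameter vectors (step S171)."""

    def __init__(self, style_dim: int, ch: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(ch, affine=False)
        self.affine = nn.Linear(style_dim, 2 * ch)     # -> (scale, bias)

    def forward(self, x, style):
        scale, bias = self.affine(style).chunk(2, dim=-1)
        x = self.norm(x)
        return x * (1 + scale.unsqueeze(-1).unsqueeze(-1)) + bias.unsqueeze(-1).unsqueeze(-1)

class UpBlock(nn.Module):
    def __init__(self, in_ch, out_ch, style_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.adain = AdaIN(style_dim, out_ch)

    def forward(self, x, style):
        x = F.interpolate(x, scale_factor=2, mode="nearest")   # upsampling (S172)
        return F.relu(self.adain(self.conv(x), style))

class Generator(nn.Module):
    """Sketch of the generator G: corrected structural features in, edited image out."""

    def __init__(self, struct_ch: int = 64, style_dim: int = 64):
        super().__init__()
        self.blocks = nn.ModuleList([
            UpBlock(struct_ch, 64, style_dim),
            UpBlock(64, 32, style_dim),
        ])
        self.to_rgb = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, struct, style):
        x = struct
        for blk in self.blocks:
            x = blk(x, style)
        return torch.tanh(self.to_rgb(x))


if __name__ == "__main__":
    img = Generator()(torch.randn(2, 64, 64, 64), torch.randn(2, 64))
    print(img.shape)    # torch.Size([2, 3, 256, 256])
```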
It will be appreciated that the model needs to be trained prior to interactive image editing.
First, the image attribute encoder and the text encoder are jointly pre-trained; then the model is pre-trained according to steps S11-S17 for initialization; finally, the model is trained by the cross-cycle method. The specific implementation steps are as follows:
M1: A Deep Attentional Multimodal Similarity Model (DAMSM) is adopted to pre-train the alignment of the mapping spaces of the image attribute encoder (Attribute Encoder) and the text encoder (Text Encoder). The DAMSM algorithm specifically comprises the following steps:
N1: the text features and the image attribute features are multiplied and normalized with softmax along the word-embedding dimension, i.e. $s = W^{T} v$, where $W^{T}$ is the transpose of the text features $W$ and $v$ denotes the local image attribute features; the component of $s$ at position $(i, j)$ represents the similarity between the i-th word and the j-th image region, and $\bar{s}_{i,j} = \exp(s_{i,j}) / \sum_{k} \exp(s_{k,j})$ is the normalization of this similarity along the text direction (sentence length).
N2: the content vector of the joint region is computed, so that the relevance between each local region and each word of the text is obtained dynamically: $c_i = \sum_{j} \alpha_{i,j} v_j$ with $\alpha_{i,j} = \exp(\gamma_1 \bar{s}_{i,j}) / \sum_{k} \exp(\gamma_1 \bar{s}_{i,k})$, where the region content vector $c_i$ dynamically represents the relevance of the i-th word to each region of the image; $v_j$ is the j-th column of the local feature matrix, i.e., the feature of the j-th image region; $\alpha_{i,j}$ normalizes the result of the j-th region and the i-th word along the image spatial direction; and the hyper-parameter $\gamma_1$ determines the weight of the locally relevant sub-region features in the computation of the region content vector.
N3: the matching score of the image regions and the text words is computed from the relevance obtained in N2: $R(c_i, w_i) = c_i^{T} w_i / (\lVert c_i \rVert\,\lVert w_i \rVert)$ and $R(Q, D) = \log\bigl(\sum_{i} \exp(\gamma_2 R(c_i, w_i))\bigr)^{1/\gamma_2}$, where $w_i$ is the i-th word vector of the text and the hyper-parameter $\gamma_2$ amplifies the influence of text-image region pairs with high correlation on the score.
N4: with the score computation of N3, the conditional probability that each known image in a batch matches its text is computed over all sample pairs, and the conditional probability that each known text matches its image is obtained in the same way: $P(D_i \mid Q_i) = \exp(\gamma_3 R(Q_i, D_i)) / \sum_{j} \exp(\gamma_3 R(Q_i, D_j))$, where $Q_i$ and $D_i$ are the i-th image and the i-th text of the batch, and the hyper-parameter $\gamma_3$ smooths the computation and is set according to experiments.
N5: the loss is computed from these distributions: $\mathcal{L}_1^{w} = -\sum_{i} \log P(D_i \mid Q_i)$ and $\mathcal{L}_2^{w} = -\sum_{i} \log P(Q_i \mid D_i)$; the sentence-level losses $\mathcal{L}_1^{s}$ and $\mathcal{L}_2^{s}$ are obtained by replacing the text word embeddings and the local image features in all the related formulas with the text sentence embedding and the global image features.
The consistency of the image attribute encoder and the text encoder in the mapping encoding space is trained through the DAMSM algorithm.
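The word-level part of the DAMSM matching described in N1-N5 can be sketched as follows, assuming the standard DAMSM formulation; the feature dimensions and the values of the hyper-parameters gamma1, gamma2 and gamma3 are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def damsm_word_loss(words, regions, gamma1=5.0, gamma2=5.0, gamma3=10.0):
    """Word-level DAMSM matching loss sketch (steps N1-N5).

    words:   (B, T, D) contextual word features W from the text encoder
    regions: (B, D, N) local image attribute features (N image regions)
    """
    B = words.shape[0]
    mat = torch.zeros(B, B)
    for i in range(B):            # image i
        for j in range(B):        # text j
            w = words[j]                          # (T, D)
            v = regions[i]                        # (D, N)
            s = F.softmax(w @ v, dim=0)           # N1: word-region similarity,
                                                  #     normalized along the words
            alpha = F.softmax(gamma1 * s, dim=1)  # N2: attention over regions
            c = alpha @ v.t()                     # (T, D) region content vectors
            r = F.cosine_similarity(c, w, dim=-1) # N3: per-word match score
            mat[i, j] = torch.logsumexp(gamma2 * r, dim=0) / gamma2

    # N4: posteriors over the batch; N5: negative log-likelihood in both directions
    labels = torch.arange(B)
    loss_img2txt = F.cross_entropy(gamma3 * mat, labels)
    loss_txt2img = F.cross_entropy(gamma3 * mat.t(), labels)
    return loss_img2txt + loss_txt2img


if __name__ == "__main__":
    loss = damsm_word_loss(torch.randn(4, 12, 64), torch.randn(4, 64, 49))
    print(float(loss))
```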
M2: and pre-training all modules by taking training sample data as input of the model and taking reconstructed original images as targets according to the steps S11-S17 to initialize model parameters, wherein the training sample data comprises a plurality of images used for training and corresponding texts.
M3: and training the whole model by adopting a cross-cycle reconstruction mode. The model input is one batch comprising n tuples at a time, one image within each tuple
Figure 857939DEST_PATH_IMAGE046
And a corresponding text
Figure 407869DEST_PATH_IMAGE085
(ii) a Taking the reverse text and the sequential image in each batch to form a new tuple, namely each image
Figure 961341DEST_PATH_IMAGE046
Corresponding to a non-matching text
Figure 520499DEST_PATH_IMAGE086
Inputting the model according to the steps S11-S17 to obtain the text
Figure 108606DEST_PATH_IMAGE087
Matched edited images
Figure 95016DEST_PATH_IMAGE089
(ii) a All in each batch
Figure 899899DEST_PATH_IMAGE091
As a new input image and sequential text, a new tuple is composed, i.e. each image
Figure 731589DEST_PATH_IMAGE089
Corresponding to a non-matching text
Figure 298836DEST_PATH_IMAGE085
This text
Figure 331514DEST_PATH_IMAGE085
For editing the pre-image
Figure 984213DEST_PATH_IMAGE046
So that the restored image is assumed to be obtained after inputting the model according to the steps S11-S17
Figure 760539DEST_PATH_IMAGE092
Thus, assume an image
Figure 447872DEST_PATH_IMAGE092
Should approximate the original image as closely as possible
Figure 651451DEST_PATH_IMAGE094
. And adopting reconstruction of a matched text, cross reconstruction of a non-matched text and the matched text and similarity of attribute distribution of the image before and after editing as main self-supervision information to construct a loss function, and realizing training optimization of the model.
The objective function of the generator in the cross-training process is a combination, weighted by hyper-parameters, of the following terms: the cyclic reconstruction loss of the image; the reconstruction loss of the output image when the image is edited with its matching text; the reconstruction loss of the image itself after encoding and decoding; the KL distance between the attribute distribution of the edited image and the target attribute distribution; the adversarial loss of the generator; and the reconstruction loss of the attribute distribution after cyclic editing.
In these terms, the edited images are produced by the generator from each image paired with the text data arranged in reverse order along the batch dimension; the KL distance is computed between pairs of distributions, the distributions involved being the attribute distribution of the original image, the attribute distribution of the edited image and the reconstructed attribute distribution of the image after cyclic editing; the adversarial loss uses both the conditional and the unconditional discrimination results of the discriminator, with the reverse-order arrangement of the text T appearing in the conditional terms; expectations are taken over the data; C is the channel, W and H are the width and the height of the image, respectively, and CHW is the product of the channel, the width and the height of the image.
The corresponding discriminator objective function is built from the same conditional and unconditional discrimination results of the discriminator.
a circular cross training mode is adopted, so that the problem that the model is unsupervised during training in the editing task is solved.
Further, the following related experiments were performed on the optimized model.
The model in the embodiment of the invention was compared quantitatively with existing open-source work. The comparison was carried out against two methods, ManiGAN and TAGAN, on the Caltech-UCSD Birds 200 (CUB) data set. The CUB data set contains 8855 training images and 2933 test images. The quantitative metrics adopted by the invention are the Inception Score (IS), the text-image similarity (sim), the L1 pixel difference (diff) and the Manipulation Precision (MP). IS measures the quality and authenticity of the edited image, sim measures the similarity between the edited image and the input text, diff represents the pixel-level difference between the edited image and the original input image, and MP measures the overall editing effect and is defined in terms of sim and diff. According to the average scores of the three methods on the 2933 test images, the invention is superior to the existing methods on all four quantitative evaluation metrics. The highest MP value shows that the invention achieves the best text-image editing consistency, and the IS value reflects that the editing results of the invention are more realistic and natural.
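The exact MP formula is given in the original only as an expression in sim and diff; a common definition in the text-guided editing literature (e.g. ManiGAN) is MP = (1 - diff) x sim, which the short helper below assumes.

```python
def manipulation_precision(sim: float, diff: float) -> float:
    """Manipulation Precision under the assumed definition MP = (1 - diff) * sim,
    where sim is the text-image similarity of the edited image and diff is the
    L1 pixel difference to the original input (both in [0, 1])."""
    return (1.0 - diff) * sim


if __name__ == "__main__":
    # e.g. a well-edited image: high text similarity, low pixel change elsewhere
    print(manipulation_precision(sim=0.82, diff=0.10))   # 0.738
```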
In addition, a subjective user study of the editing results of the three models was designed. Fifty users aged 15 to 50 were invited to take part in a subjective visual quality survey; pairs of editing results from the three methods were displayed to the users randomly and alternately, and the users clicked on the better editing result. The experimental results show that more users preferred the editing results of the model of the invention.
The editing quality was also compared with that of existing open-source work, as shown in fig. 4. From visual observation, the invention achieves better results on target edits related to the semantics of the text description. In addition, since the algorithm of the invention does not generate images from scratch but only modifies the content referred to by the text description, the output does not change the overall form of the edited image, and regions irrelevant to the text description, especially the background, are better preserved.
To verify the effectiveness of the cycle-consistency training method of the present invention, an ablation experiment was performed on the cyclic training, as shown in fig. 5. The figure shows editing results output during model training, where n-ep denotes the number of iterations of the model over the whole training set, w/ cyc denotes the training mode with cycle consistency, and w/o cyc denotes the mode without cyclic training. According to the results, the model without the cycle-consistency constraint cannot edit effectively; owing to the loss of supervision, its output oscillates between a rough editing effect and a reconstruction of the original image, making the model difficult to converge.
The visualization of decoupling image content from attributes is shown in fig. 6. The experimental results verify that the invention can indeed separate information such as the appearance and the background of the image from attributes such as the colors described by the text, effectively decomposes the content features of the image, and demonstrates the effectiveness of the decoupling model.
The method encodes the text and the image attribute information into a latent-variable manifold space through encoders, and then uses a recurrent neural network to operate on the image attribute distribution through the text encoding, obtaining a text-conditioned image attribute distribution; an additional encoder encodes the content of the image, and an attention constraint based on the text-image attribute fused features is added to decouple the image structure of the edited and non-edited regions; the edited image is restored in a manner similar to style transfer by using Adaptive Instance Normalization as the main normalization in the generator structure; through cross-cycle training, the similarity of the attribute distributions of corresponding images before and after the cross is constrained to achieve the editing goal, while the reconstruction results of the images before and after the cycle are constrained to maintain the quality of the output images. Because a one-stage direct generation scheme is used and the image content is separated from the attributes, the method edits quickly, achieves a marked editing effect, decouples the editing target from the non-edited regions well, and is suitable for editing colour images containing a single object with natural-language text.
Referring to fig. 7, an interactive image editing apparatus according to an embodiment of the present invention includes:
the image attribute feature extraction module 10 is used for extracting attribute features from the original image to obtain image attribute features;
the text feature coding module 20 is used for performing word embedding and context-semantic encoding on the descriptive text corresponding to the original image to obtain text features;
the fusion module 30 is used for fusing the image attribute features and the text features to obtain fused features;
the overall structure extraction module 40 is used for extracting overall structural features of the original image;
the fusion processing module 50 is used for performing spatial-attention fusion on the overall structural features and the fused features to obtain corrected structural features of the edited region;
the structure completion module 60 is used for performing structural-feature completion of the non-edited region on the corrected structural features of the edited region to obtain corrected overall structural features;
and the input module 70 is used for inputting the corrected overall structural features into the generator, so that the generator generates an image matching the descriptive text under the guidance of the fused features.
Further, the interactive image editing apparatus further includes:
the model building module is used for building an interactive image editing model from an image attribute encoder, a text encoder, a content encoder, a fuser and a generator;
and the model training module is used for training the constructed interactive image editing model in a cross-cycle manner.
Further, the interactive image editing apparatus further includes:
and the pre-training module is used for pre-training the alignment of the mapping space of the image attribute encoder and the text encoder by adopting a DAMSM algorithm.
The implementation principle and the technical effects of the interactive image editing apparatus provided by the embodiment of the present invention are the same as those of the foregoing method embodiment; for the sake of brevity, details not mentioned in the apparatus embodiment can be found in the corresponding contents of the foregoing method embodiment.
The invention also proposes a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, implements the interactive image editing method as described above.
The invention also discloses an electronic device, which comprises a memory, a processor and a program which is stored on the memory and can run on the processor, wherein the processor realizes the interactive image editing method when executing the program.
Those of skill in the art will understand that the logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be viewed as implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An interactive image editing method, comprising:
extracting attribute features from an original image to obtain image attribute features;
performing word embedding and context-semantic encoding on the descriptive text corresponding to the original image to obtain text features;
fusing the image attribute features and the text features to obtain fused features;
extracting overall structural features of the original image;
performing spatial-attention fusion on the overall structural features and the fused features to obtain corrected structural features of the edited region;
performing structural-feature completion of the non-edited region on the corrected structural features of the edited region to obtain corrected overall structural features;
inputting the corrected overall structural features into a generator, so that the generator generates an image matching the descriptive text under the guidance of the fused features.
2. The interactive image editing method of claim 1, wherein said step of extracting the attribute features of the original image to obtain the image attribute features comprises:
inputting the original image into an image attribute encoder, so that the image attribute encoder encodes it with Inception-v3 and outputs the last-layer vector to obtain global attribute features;
taking the global attribute features as input, using a set of multi-layer perceptrons defined according to a hyper-parameter to estimate a Gaussian mixture distribution whose dimensionality corresponds to the input image, and obtaining the image attribute features from this distribution.
3. The interactive image editing method of claim 2, wherein said step of context-semantic word embedding and encoding the descriptive text corresponding to the original image to obtain text features comprises:
mapping the descriptive text corresponding to the original image through a vocabulary to obtain a group of word indices, and embedding them to obtain word vectors of the descriptive text length;
and inputting the word vectors of the descriptive text length into a text encoder, and acquiring the output vector of each time-step node to obtain the text features.
4. The interactive image editing method of claim 3, wherein said step of fusing said image attribute features with text features to obtain fused features comprises:
concatenating the image attribute features with each word vector in the text features along the column direction to obtain concatenated features;
inputting the concatenated features into a Bi-LSTM model, and taking the output of each time-step node of the Bi-LSTM model as the fused features of the corresponding word and the image attribute distribution;
and taking the hidden-layer output vector of the last node of the Bi-LSTM model as the image attribute-text fusion code, and passing the image attribute-text fusion code through a group of multi-layer perceptrons to decouple the parameter vector group corresponding to the fused image attribute distribution.
5. The interactive image editing method of claim 4, wherein the step of inputting the revised overall structural features into a generator to cause the generator to generate an image matching the descriptive text based on a fused feature guide comprises:
transferring the parameter vector group to the currently generated image as variable parameters in the generator structure;
inputting the corrected overall structural features into the generator, processing them through multiple upsampling and convolution blocks, and outputting an image matching the descriptive text.
6. The interactive image editing method of claim 2, wherein before the step of extracting the attribute features of the original image, the method further comprises:
constructing an interactive image editing model from an image attribute encoder, a text encoder, a content encoder, a fuser and a generator;
and training the constructed interactive image editing model in a cross-cycle manner.
7. The interactive image editing method of claim 3, wherein before the step of extracting the attribute features of the original image, the method further comprises:
pre-training the image attribute encoder and the text encoder with the DAMSM algorithm to align their mapping spaces.
8. An interactive image editing apparatus, comprising:
the image attribute feature extraction module is used for extracting attribute features from the original image to obtain image attribute features;
the text feature coding module is used for performing word embedding and context-semantic encoding on the descriptive text corresponding to the original image to obtain text features;
the fusion module is used for fusing the image attribute features and the text features to obtain fused features;
the overall structure extraction module is used for extracting overall structural features of the original image;
the fusion processing module is used for performing spatial-attention fusion on the overall structural features and the fused features to obtain corrected structural features of the edited region;
the structure completion module is used for performing structural-feature completion of the non-edited region on the corrected structural features of the edited region to obtain corrected overall structural features;
and the input module is used for inputting the corrected overall structural features into the generator, so that the generator generates an image matching the descriptive text under the guidance of the fused features.
9. A readable storage medium on which a program is stored, which program, when executed by a processor, carries out the method according to any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-7 when executing the program.
CN202111008172.XA 2021-08-31 2021-08-31 Interactive image editing method and device, readable storage medium and electronic equipment Active CN113448477B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111008172.XA CN113448477B (en) 2021-08-31 2021-08-31 Interactive image editing method and device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111008172.XA CN113448477B (en) 2021-08-31 2021-08-31 Interactive image editing method and device, readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113448477A true CN113448477A (en) 2021-09-28
CN113448477B CN113448477B (en) 2021-11-23

Family

ID=77819123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111008172.XA Active CN113448477B (en) 2021-08-31 2021-08-31 Interactive image editing method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113448477B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115375601A (en) * 2022-10-25 2022-11-22 四川大学 Decoupling expression traditional Chinese painting generation method based on attention mechanism
WO2023060434A1 (en) * 2021-10-12 2023-04-20 中国科学院深圳先进技术研究院 Text-based image editing method, and electronic device
CN116580127A (en) * 2023-07-13 2023-08-11 科大讯飞股份有限公司 Image generation method, device, electronic equipment and computer readable storage medium
CN116630480A (en) * 2023-07-14 2023-08-22 之江实验室 Interactive text-driven image editing method and device and electronic equipment
CN116704079A (en) * 2023-08-03 2023-09-05 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium
CN117726908A (en) * 2024-02-07 2024-03-19 青岛海尔科技有限公司 Training method and device for picture generation model, storage medium and electronic device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019812A (en) * 2018-02-27 2019-07-16 中国科学院计算技术研究所 A kind of user is from production content detection algorithm and system
CN110297890A (en) * 2018-03-21 2019-10-01 国际商业机器公司 It is obtained using the image that interactive natural language is talked with
CN110443863A (en) * 2019-07-23 2019-11-12 中国科学院深圳先进技术研究院 Method, electronic equipment and the storage medium of text generation image
CN111445545A (en) * 2020-02-27 2020-07-24 北京大米未来科技有限公司 Text-to-map method, device, storage medium and electronic equipment
CN111652093A (en) * 2020-05-21 2020-09-11 中国工商银行股份有限公司 Text image processing method and device
CN112132150A (en) * 2020-09-15 2020-12-25 上海高德威智能交通系统有限公司 Text string identification method and device and electronic equipment
US20210042474A1 (en) * 2019-03-29 2021-02-11 Beijing Sensetime Technology Development Co., Ltd. Method for text recognition, electronic device and storage medium
CN113158630A (en) * 2021-03-15 2021-07-23 苏州科技大学 Text editing image method, storage medium, electronic device and system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019812A (en) * 2018-02-27 2019-07-16 中国科学院计算技术研究所 A kind of user is from production content detection algorithm and system
CN110297890A (en) * 2018-03-21 2019-10-01 国际商业机器公司 It is obtained using the image that interactive natural language is talked with
US20210042474A1 (en) * 2019-03-29 2021-02-11 Beijing Sensetime Technology Development Co., Ltd. Method for text recognition, electronic device and storage medium
CN110443863A (en) * 2019-07-23 2019-11-12 中国科学院深圳先进技术研究院 Method, electronic equipment and the storage medium of text generation image
CN111445545A (en) * 2020-02-27 2020-07-24 北京大米未来科技有限公司 Text-to-map method, device, storage medium and electronic equipment
CN111652093A (en) * 2020-05-21 2020-09-11 中国工商银行股份有限公司 Text image processing method and device
CN112132150A (en) * 2020-09-15 2020-12-25 上海高德威智能交通系统有限公司 Text string identification method and device and electronic equipment
CN113158630A (en) * 2021-03-15 2021-07-23 苏州科技大学 Text editing image method, storage medium, electronic device and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANUBHAV KUMAR: "An efficient algorithm for text localization and extraction in complex video text images", 《2013 2ND INTERNATIONAL CONFERENCE ON INFORMATION MANAGEMENT IN THE KNOWLEDGE ECONOMY》 *
MA LONGLONG: "A survey of image captioning methods", 《JOURNAL OF CHINESE INFORMATION PROCESSING》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023060434A1 (en) * 2021-10-12 2023-04-20 中国科学院深圳先进技术研究院 Text-based image editing method, and electronic device
CN115375601A (en) * 2022-10-25 2022-11-22 四川大学 Decoupling expression traditional Chinese painting generation method based on attention mechanism
CN115375601B (en) * 2022-10-25 2023-02-28 四川大学 Decoupling expression traditional Chinese painting generation method based on attention mechanism
CN116580127A (en) * 2023-07-13 2023-08-11 科大讯飞股份有限公司 Image generation method, device, electronic equipment and computer readable storage medium
CN116580127B (en) * 2023-07-13 2023-12-01 科大讯飞股份有限公司 Image generation method, device, electronic equipment and computer readable storage medium
CN116630480A (en) * 2023-07-14 2023-08-22 之江实验室 Interactive text-driven image editing method and device and electronic equipment
CN116630480B (en) * 2023-07-14 2023-09-26 之江实验室 Interactive text-driven image editing method and device and electronic equipment
CN116704079A (en) * 2023-08-03 2023-09-05 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium
CN116704079B (en) * 2023-08-03 2023-09-29 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium
CN117726908A (en) * 2024-02-07 2024-03-19 青岛海尔科技有限公司 Training method and device for picture generation model, storage medium and electronic device
CN117726908B (en) * 2024-02-07 2024-05-24 青岛海尔科技有限公司 Training method and device for picture generation model, storage medium and electronic device

Also Published As

Publication number Publication date
CN113448477B (en) 2021-11-23

Similar Documents

Publication Publication Date Title
CN113448477B (en) Interactive image editing method and device, readable storage medium and electronic equipment
Frolov et al. Adversarial text-to-image synthesis: A review
Li et al. Multimodal foundation models: From specialists to general-purpose assistants
Jiang et al. Transferability in deep learning: A survey
US11657230B2 (en) Referring image segmentation
CN108416065B (en) Hierarchical neural network-based image-sentence description generation system and method
Awais et al. Foundational models defining a new era in vision: A survey and outlook
Sun et al. Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
CN113657124A (en) Multi-modal Mongolian Chinese translation method based on circulation common attention Transformer
CN111783457B (en) Semantic visual positioning method and device based on multi-modal graph convolutional network
CN114528898A (en) Scene graph modification based on natural language commands
CN112200664A (en) Repayment prediction method based on ERNIE model and DCNN model
He et al. Dynamic invariant-specific representation fusion network for multimodal sentiment analysis
CN114817564A (en) Attribute extraction method and device and storage medium
Valle Hands-On Generative Adversarial Networks with Keras: Your guide to implementing next-generation generative adversarial networks
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN115858756A (en) Shared emotion man-machine conversation system based on perception emotional tendency
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
Robertson et al. A Self-Adaptive Architecture for Image Understanding
Nasr et al. SemGAN: Text to Image Synthesis from Text Semantics using Attentional Generative Adversarial Networks
Li et al. Lightweight text-driven image editing with disentangled content and attributes
Wang et al. Entity-level text-guided image manipulation
Huang et al. Flexible entity marks and a fine-grained style control for knowledge based natural answer generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant