CN115035213A - Image editing method, device, medium and equipment - Google Patents

Image editing method, device, medium and equipment Download PDF

Info

Publication number
CN115035213A
CN115035213A (application CN202210561795.8A)
Authority
CN
China
Prior art keywords
image
text
editing
target
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210561795.8A
Other languages
Chinese (zh)
Inventor
周作为 (Zhou Zuowei)
Current Assignee
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202210561795.8A
Publication of CN115035213A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/60: Editing figures and text; Combining figures or text
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Embodiments of this application disclose an image editing method, apparatus, medium, and device. The method comprises: acquiring an original image to be edited and a target text corresponding to the original image; acquiring a pre-trained image editing model, where the image editing model comprises a convolutional layer, a text encoder, residual blocks, and a decoding unit; extracting global image features of the original image through the convolutional layer, and extracting global text features and word features of the target text through the text encoder; fusing the global image features and the global text features through the residual blocks to obtain hidden-layer image features; and integrating the hidden-layer image features and the word features through the decoding unit to obtain a target image edited from the original image. This scheme effectively addresses low image editing precision and incomplete matching between the output image and the input text description, improving the accuracy of image editing.

Description

Image editing method, device, medium and equipment
Technical Field
Embodiments of this application relate to the field of image processing, and in particular to an image editing method, apparatus, medium, and device.
Background
Deep learning has been successfully applied in many fields, including image, video, text, and speech processing. Its application to images in combination with text, however, is still at an early stage of development.
In the prior art, image and text features are mapped into the same semantic space through multi-modal representation learning, and a text encoder is obtained by training. A conditional generative adversarial network (GAN) is used as the base model: the text features encoded by the pre-trained text encoder and the preprocessed image are fed in, and the parameters of the network model are learned through adversarial training between the generator and the discriminator, finally yielding a model that meets the task requirements.
Image editing based on natural-language text descriptions is a heavily studied task in the field of conditional image synthesis combining images and text. The overall objective is to input an original image and a sentence of target text description and output an edited image, such that the output image satisfies the text description as a whole while details of the original image irrelevant to the description are preserved. However, a text encoder trained through multi-modal representation learning provides only global sentence features to the subsequent conditional GAN, so the image editing precision is low and the output image does not fully match the input text description.
Disclosure of Invention
Embodiments of this application provide an image editing method, apparatus, medium, and device.
In a first aspect, an embodiment of the present application provides an image editing method, where the method includes:
acquiring an original image to be edited and a target text corresponding to the original image; the target text is used for editing the original image;
acquiring a pre-trained image editing model; the image editing model comprises a convolution layer, a text encoder, a residual block and a decoding unit;
inputting the original image into the convolutional layer of the image editing model and extracting global image features of the original image through the convolutional layer, while inputting the target text into the text encoder and extracting global text features and word features of the target text through the text encoder; fusing the global image features and the global text features through the residual blocks to obtain hidden-layer image features; and integrating the hidden-layer image features and the word features through the decoding unit to obtain a target image edited from the original image.
In a second aspect, an embodiment of the present application provides an image editing apparatus, including:
an original image and target text acquisition module, configured to acquire an original image to be edited and a target text corresponding to the original image, where the target text is used for editing the original image;
the image editing model acquisition module is used for acquiring a pre-trained image editing model; the image editing model comprises a convolution layer, a text encoder, a residual block and a decoding unit;
a target image acquisition module, configured to input the original image into the convolutional layer of the image editing model, extract global image features of the original image through the convolutional layer, input the target text into the text encoder, and extract global text features and word features of the target text through the text encoder; fuse the global image features and the global text features through the residual blocks to obtain hidden-layer image features; and integrate the hidden-layer image features and the word features through the decoding unit to obtain a target image edited from the original image.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the image editing method described in this application.
In a fourth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the image editing method according to the embodiment of the present application.
In the technical solution provided by the embodiments of this application, an attention-based decoding unit is designed and added to the pre-trained image editing model. This effectively addresses low image editing precision and incomplete matching between the output image and the input text description: the edited image fully matches the input target text, regions irrelevant to the target text description keep their original appearance, and the accuracy of image editing is greatly improved.
Drawings
Fig. 1 is a flowchart of an image editing method according to an embodiment of the present application;
FIG. 2 is an overall architecture diagram of an image editing model according to an embodiment of the present invention;
FIG. 3 is a diagram of a decoding unit architecture provided by an embodiment of the present invention;
fig. 4 is a block diagram of an image editing apparatus according to a second embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures associated with the present application are shown in the drawings, not all of them.
Before discussing exemplary embodiments in greater detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
Fig. 1 is a flowchart of an image editing method provided in an embodiment of the present application, where the present embodiment is applicable to a scene in which an image is edited, and the method may be executed by an image editing apparatus provided in an embodiment of the present application, where the apparatus may be implemented by software and/or hardware, and may be integrated in an electronic device.
As shown in fig. 1, the image editing method includes:
s110, acquiring an original image to be edited and a target text corresponding to the original image; the target text is used for editing the original image;
the original image to be edited may be an original image captured by a camera on the electronic device, or may be a stored image selected from a local gallery of the electronic device according to a click operation of a user. The target text corresponding to the original image may be understood as text editing the original image so that the edited image satisfies its description as a whole. The target text language can be English or other languages; this is not a limitation of the present application. For example, the target text in this embodiment may be "a black bird with a read head", or "a black bird with red head".
S120, acquiring a pre-trained image editing model; the image editing model comprises a convolution layer, a text encoder, a residual block and a decoding unit.
The pre-trained image editing model may be a machine learning model that edits the original image according to the target text. The convolutional layer may be a network layer in a convolutional neural network that scans the image row by row or across rows with convolution kernels and extracts image features. A convolutional neural network is a class of feed-forward neural network that contains convolution operations and has a deep structure; a convolution kernel is a matrix of parameters learned by the network itself. The text encoder encodes target text features so that important visual information can be captured. The text encoder in this embodiment may be trained with a target loss function that measures the similarity between images and text, computing a matching loss separately for the global text features and the word features. A residual block is a standard building block that allows deeper neural networks to be trained by stacking many residual blocks together. The decoding unit may be an attention-based decoding unit for fine-grained editing of the image; an attention mechanism selectively focuses on part of the available information while ignoring the rest.
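For illustration only, the split into word features and a global text feature can be sketched with a toy encoder that looks up one embedding per token and mean-pools the result. This is a hypothetical numpy sketch, not the patent's trained text encoder, which would be a learned network trained with the image-text matching loss described above; all names and shapes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(token_ids, embedding):
    """Toy encoder: per-word features plus a mean-pooled global feature."""
    word_feats = embedding[token_ids]      # (T, D): one vector per word
    sent_feat = word_feats.mean(axis=0)    # (D,): global text feature
    return word_feats, sent_feat

vocab_size, dim = 100, 16
embedding = rng.normal(size=(vocab_size, dim))   # stand-in for learned embeddings
word_feats, sent_feat = encode_text(np.array([3, 17, 42]), embedding)
```

The key point carried over from the text is that the encoder returns both granularities: word features for the decoding unit and a global feature for the residual blocks.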
S130, inputting the original image into the convolutional layer of the image editing model and extracting global image features of the original image through the convolutional layer, while inputting the target text into the text encoder and extracting global text features and word features of the target text through the text encoder; fusing the global image features and the global text features through the residual blocks to obtain hidden-layer image features; and integrating the hidden-layer image features and the word features through the decoding unit to obtain a target image edited from the original image.
Fig. 2 is an overall architecture diagram of an image editing model according to an embodiment of the present invention. As shown in fig. 2, the overall architecture 200 of the image editing model provided in this embodiment includes a convolutional layer 210, residual blocks 220, and a decoding unit 230. The convolutional layer scans the original image row by row or across rows with convolution kernels and downsamples the extracted features to obtain the global image features of the original image. Downsampling takes one sample every few samples of a sequence to obtain a new, shorter sequence. Global image features are features over the whole image that describe overall properties such as the color and shape of the image or object. The text encoder, trained with the target loss function that measures the similarity between the original image and the target text, extracts the global text features and the word features of the target text.
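The scan-and-downsample step described above can be sketched minimally as follows. This is a naive, hypothetical illustration of a single convolution kernel with valid padding followed by stride-2 downsampling, not the patent's actual convolutional layer; the kernel and sizes are invented for the example.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Scan the image with one convolution kernel (valid padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def downsample(feat, stride=2):
    """Keep every stride-th sample in each direction (a shorter new sequence)."""
    return feat[::stride, ::stride]

img = np.arange(36, dtype=float).reshape(6, 6)   # toy single-channel image
kernel = np.ones((3, 3)) / 9.0                   # simple averaging kernel
feat = downsample(conv2d_valid(img, kernel))     # compact global feature map
```

In a real network the kernels are learned and many channels are stacked; the sketch only shows why convolution plus downsampling yields a small map summarizing the whole image.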
The global image features and the global text features may be fused by a cross-product operation on the global image feature matrix and the global text feature matrix, generating a new matrix. In this embodiment of the invention, the global image features and the global text features are fused through the residual blocks, and the fused features serve as the hidden-layer image features. Note that the image editing model may contain one or more residual blocks; their number is not limited in this embodiment. Specifically, with multiple residual blocks the network can be made deeper, increasing the parameters available to the global image features and global text features, raising the chance of fusion, and achieving a better fusion of the global image features and the global sentence features. A hidden layer is any layer other than the input and output layers; it neither receives signals from nor sends signals to the outside directly. In this embodiment, the hidden-layer image features are the image features obtained by fusing the global image features and the global text features.
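A minimal sketch of conditioning image features on a global text vector inside a residual block, under assumed shapes; the projection matrix and the ReLU are illustrative choices, not the patent's exact block.

```python
import numpy as np

rng = np.random.default_rng(1)

def residual_fuse(img_feat, txt_feat, W):
    """One residual block: condition the image features on the global
    text feature, then add the block input back (the residual path)."""
    cond = txt_feat @ W                    # (D,) @ (D, C) -> (C,)
    # broadcast the text condition over every spatial position, then ReLU
    fused = np.maximum(img_feat + cond[:, None, None], 0.0)
    return img_feat + fused                # residual connection

C, H, Wd, D = 8, 4, 4, 16                  # channels, height, width, text dim
img_feat = rng.normal(size=(C, H, Wd))     # global image features
txt_feat = rng.normal(size=(D,))           # global text feature
W = rng.normal(size=(D, C))                # hypothetical projection matrix
hidden = residual_fuse(img_feat, txt_feat, W)  # hidden-layer image features
```

Stacking several such blocks, as the paragraph notes, deepens the network while the residual connection keeps it trainable.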
In this embodiment of the invention, the decoding unit receives the word features from the text encoder and the hidden-layer image features from the residual blocks, and integrates them to generate the target image, that is, the image produced by editing the original image based on the target text. In this embodiment, optionally, the decoding unit includes an attention model and a channel attention mechanism unit. Integrating the hidden-layer image features and the word features through the decoding unit to obtain the target image then comprises: the attention model performs attention computation on the hidden-layer image features based on the word features to obtain a first image; the channel attention mechanism unit determines a weight for each channel of the first image based on the hidden-layer image features, weights the channel features of the corresponding channels to obtain a second image, and takes the second image as the target image edited from the original image. Optionally, the decoding unit further includes an upsampling unit and a convolutional layer. Before the attention model computes the first image, the upsampling unit upsamples the hidden-layer image features, and the convolutional layer in the decoding unit compresses the channels of the upsampled hidden-layer image features.
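The attention computation between hidden-layer image features and word features can be sketched as a pixels-attend-to-words softmax, assuming the word dimension has already been projected to the channel dimension. This is a hypothetical illustration, not the patent's attention model.

```python
import numpy as np

rng = np.random.default_rng(5)

def word_attention(hidden, word_feats):
    """Each spatial position attends over the word features and
    receives a word-weighted context vector (the 'first image')."""
    C, H, W = hidden.shape
    pix = hidden.reshape(C, H * W).T               # (HW, C)
    scores = pix @ word_feats.T                    # (HW, T) pixel-word similarity
    scores -= scores.max(axis=1, keepdims=True)    # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # rows sum to 1 over words
    ctx = attn @ word_feats                        # (HW, C) context vectors
    return ctx.T.reshape(C, H, W)

C, H, W, T = 8, 4, 4, 3                            # channels, height, width, words
hidden = rng.normal(size=(C, H, W))                # hidden-layer image features
word_feats = rng.normal(size=(T, C))               # word features, projected to C dims
first_img = word_attention(hidden, word_feats)
```

This is what lets individual words, not only the global sentence, steer specific regions of the generated image.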
Fig. 3 is an architecture diagram of a decoding unit according to an embodiment of the present invention. As shown in fig. 3, the decoding unit architecture 300 provided in this embodiment includes an upsampling subunit 310, a convolutional layer 320, an attention model 330, and a channel attention mechanism subunit 340. The attention model performs attention computation between the hidden-layer image features and the word features to obtain the first image. The channel attention mechanism unit learns a weight for each channel of the first image and of the hidden-layer image, and weights the corresponding channel features to obtain the second image. Which channels receive larger or smaller weights is learned automatically according to a designed loss function. For example, the channel attention unit may decide the weights by judging whether a picture region is described by the text: if it is, the corresponding channel of the first image is given a larger weight; if not, the corresponding channel of the hidden-layer image is given a larger weight. The channel attention mechanism unit thus determines a weight for each channel of the first image based on the hidden-layer image features and weights the corresponding channel features to obtain the target image. In this way, regions of the original image irrelevant to the target text description keep their original features in the edited target image, while the edited image fully matches the target text.
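A toy version of such a channel gate, assuming the weights come from a global average pool of the hidden-layer features passed through a sigmoid; the blending form is an illustrative assumption, not the patent's exact unit.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(first_img, hidden, W):
    """Per-channel gate: pool the hidden-layer features, map the pooled
    vector to one weight per channel, and blend the two feature maps."""
    pooled = hidden.mean(axis=(1, 2))     # (C,) global average pool
    gate = sigmoid(W @ pooled)            # (C,) weight in (0, 1) per channel
    g = gate[:, None, None]
    # large gate: keep the text-described first image; small gate: keep hidden
    return g * first_img + (1.0 - g) * hidden

C, H, Wd = 8, 4, 4
first_img = rng.normal(size=(C, H, Wd))   # output of the attention model
hidden = rng.normal(size=(C, H, Wd))      # hidden-layer image features
W = rng.normal(size=(C, C))               # hypothetical gating weights
second_img = channel_attention(first_img, hidden, W)
```

In training, the gating weights W would be learned from the model's loss, which is how the unit decides which channels correspond to text-described regions.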
In this embodiment, adding a channel attention mechanism unit to the decoding unit strengthens the image editing model's ability to select among different features when generating an image.
Upsampling transforms a smaller matrix into a larger one, restoring the image features produced by the convolutional layers to the size of the original image. Channel compression reduces the number of channels of the image features. For example, in this embodiment, if the upsampled hidden-layer image features have 64 channels, channel compression may reduce them to 16. In this embodiment, upsampling the hidden-layer image features improves the resolution of the image.
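The two operations, nearest-neighbour upsampling and 1x1-convolution-style channel compression, can be sketched with the 64-to-16 channel example from the text; the projection matrix here is random purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def upsample_nearest(feat, factor=2):
    """Nearest-neighbour upsampling: each value becomes a factor x factor block."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def compress_channels(feat, W):
    """1x1-convolution-style channel compression: a per-pixel linear map."""
    c, h, w = feat.shape
    return (W @ feat.reshape(c, h * w)).reshape(-1, h, w)

feat = rng.normal(size=(64, 4, 4))        # 64-channel hidden-layer features
W = rng.normal(size=(16, 64))             # 64 -> 16 channels, as in the text
out = compress_channels(upsample_nearest(feat), W)
```

Because a 1x1 convolution acts independently at each spatial position, it reduces channels without mixing neighbouring pixels, which is why a single matrix multiply models it here.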
In this embodiment, optionally, the image editing model further includes an unconditional discriminator and a conditional discriminator; after the target image edited from the original image is obtained, the method further comprises: judging, through the unconditional discriminator, the retention effect of the target image on the non-editing region of the original image; and judging, through the conditional discriminator, the editing effect of the target image based on the global text features.
As shown in fig. 2, the image editing model further includes a discriminator comprising a conditional discriminator and an unconditional discriminator. The unconditional discriminator performs discrimination based on image data alone. The non-editing region is the region irrelevant to the target text description. The unconditional discriminator is trained on the original images of the training set and the edited images: the discriminator should judge a real image as true and a generated (fake) image as false, and the loss function computed from its errors further guides its training. Here a real image is an original image from the training set, and a fake image is an image generated by editing an original training image. The unconditional discriminator compares the target image with the original image and judges, from the comparison, how well the target image preserves the non-editing region of the original image. This retention effect can be expressed as a number between 0 and 1 output by the unconditional discriminator.
For example, when the target image is fed to the unconditional discriminator, the better the non-editing region of the original image is preserved, the closer the output is to 1, indicating a smaller deviation from the original image's non-editing region and a more authentic target image; the worse the preservation, the closer the output is to 0, indicating a larger deviation and a less authentic target image. In this embodiment, adding an unconditional discriminator to the image editing model's discriminator allows the retention of the non-editing region of the original image to be judged, improving the authenticity of the target image. In addition, the image editing model also uses the conditional discriminator to judge the editing effect of the target image based on the global text features, that is, whether the edited target image reflects the global text features of the target text, i.e., whether the original image was edited according to them.
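The behaviour described above, output near 1 when the non-editing region is well preserved and near 0 otherwise, can be mimicked by a toy scoring function. A real unconditional discriminator is a trained network, so this hand-written score is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unconditional_score(target, original, mask, k=1.0):
    """Score in (0, 1): nearer 1 the better the non-editing region
    of the target image matches the original image."""
    region = mask == 0                     # mask == 0 marks the non-editing region
    err = np.mean((target[region] - original[region]) ** 2)
    return sigmoid(k * (1.0 - err))        # small error -> score near 1

original = rng.normal(size=(8, 8))
mask = np.zeros((8, 8))
mask[:4, :] = 1                            # top half is the edited region
perfect = original.copy()                  # non-editing region fully preserved
noisy = original + rng.normal(scale=2.0, size=(8, 8))
s_good = unconditional_score(perfect, original, mask)
s_bad = unconditional_score(noisy, original, mask)
```

A perfectly preserved non-editing region scores higher than a corrupted one, matching the 0-to-1 interpretation in the text.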
In the technical solution provided by the embodiments of this application, an original image to be edited and a corresponding target text are acquired; a pre-trained image editing model comprising a convolutional layer, a text encoder, residual blocks, and a decoding unit is acquired; global image features of the original image are extracted through the convolutional layer, and global text features and word features of the target text through the text encoder; the global image features and global text features are fused through the residual blocks to obtain hidden-layer image features; and the hidden-layer image features and word features are integrated through the decoding unit to obtain the target image edited from the original image. This solution effectively addresses low image editing precision and incomplete matching between the output image and the input text description: the edited image fully matches the input target text, regions irrelevant to the target text description keep their original appearance, and the accuracy of image editing is greatly improved.
Example two
Fig. 4 is a block diagram of an image editing apparatus according to a second embodiment of the present invention, which is capable of executing an image editing method according to any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the execution method. As shown in fig. 4, the apparatus may include:
an original image and target text obtaining module 410, configured to obtain an original image to be edited and a target text corresponding to the original image; the target text is used for editing the original image;
an image editing model obtaining module 420, configured to obtain a pre-trained image editing model; the image editing model comprises a convolution layer, a text encoder, a residual block and a decoding unit;
a target image obtaining module 430, configured to input the original image into a convolutional layer in the image editing model, extract a global image feature of the original image through the convolutional layer, input the target text into the text encoder, and extract a global text feature and a word feature of the target text through the text encoder; fusing the global image features and the global text features through the residual blocks to obtain hidden layer image features; and integrating the hidden layer image characteristics and the word characteristics through the decoding unit to obtain a target image edited by the original image.
Optionally, the decoding unit includes an attention model and a channel attention mechanism unit, and the target image acquiring module includes:
the first image acquisition sub-module is used for carrying out attention calculation on the hidden layer image characteristics based on the word characteristics through the attention model to obtain a first image;
and the target image acquisition sub-module is used for determining a weight value corresponding to each channel of the first image based on the hidden layer image characteristics through the channel attention mechanism unit, performing weighting processing on the channel characteristics of the corresponding channels based on the weight values to obtain a second image, and taking the second image as a target image after the original image is edited.
Optionally, the decoding unit further includes an upsampling unit and a convolutional layer; the target image acquisition module further includes:
the upsampling sub-module is used for upsampling the hidden layer image features through the upsampling sub-unit before the attention model carries out attention calculation on the hidden layer image features based on the word features to obtain a first image;
and the channel compression submodule is used for carrying out channel compression on the up-sampled hidden layer image characteristics.
On the basis of the above technical solutions, optionally, the image editing model further includes an unconditional discriminator and a conditional discriminator, and the apparatus further comprises:
a retention effect judging module, configured to judge, through the unconditional discriminator, how well the target image preserves the non-editing region of the original image after the target image edited from the original image is obtained;
and an editing effect judging module, configured to judge, through the conditional discriminator, the editing effect of the target image based on the global text features.
The product can execute the image editing method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method.
EXAMPLE III
A third embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the image editing method provided in the embodiments of this application:
acquiring an original image to be edited and a target text corresponding to the original image; the target text is used for editing the original image;
acquiring a pre-trained image editing model; the image editing model comprises a convolutional layer, a text encoder, residual blocks and a decoding unit;
inputting the original image into the convolutional layer of the image editing model and extracting global image features of the original image through the convolutional layer, while inputting the target text into the text encoder and extracting global text features and word features of the target text through the text encoder; fusing the global image features and the global text features through the residual blocks to obtain hidden-layer image features; and integrating the hidden-layer image features and the word features through the decoding unit to obtain a target image edited from the original image.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Example four
The fourth embodiment of the present application provides an electronic device. Fig. 5 is a schematic structural diagram of an electronic device according to the fourth embodiment of the present application. As shown in fig. 5, the present embodiment provides an electronic device 500, which includes: one or more processors 520; and a storage device 510 configured to store one or more programs, wherein, when the one or more programs are executed by the one or more processors 520, the one or more processors 520 implement the image editing method provided in the embodiments of the present application, the method including:
acquiring an original image to be edited and a target text corresponding to the original image; the target text is used for editing the original image;
acquiring a pre-trained image editing model; the image editing model comprises a convolution layer, a text encoder, a residual block and a decoding unit;
inputting the original image into the convolution layer in the image editing model, and extracting global image features of the original image through the convolution layer; simultaneously inputting the target text into the text encoder, and extracting global text features and word features of the target text through the text encoder; fusing the global image features and the global text features through the residual block to obtain hidden layer image features; and integrating the hidden layer image features and the word features through the decoding unit to obtain a target image edited from the original image.
Of course, those skilled in the art will understand that the processor 520 also implements the technical solution of the image editing method provided in any embodiment of the present application.
The electronic device 500 shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the electronic device 500 includes a processor 520, a storage device 510, an input device 530, and an output device 540; the number of processors 520 in the electronic device may be one or more, and one processor 520 is taken as an example in fig. 5; the processor 520, the storage device 510, the input device 530, and the output device 540 in the electronic device may be connected by a bus or other means, a bus 550 being taken as an example in fig. 5.
The storage device 510 is a computer-readable storage medium, and can be used to store software programs, computer-executable programs, and module units, such as program instructions corresponding to the image editing method in the embodiment of the present application.
The storage device 510 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the storage 510 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 510 may further include memory located remotely from processor 520, which may be connected via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may be used to receive input numerals, character information, or voice information, and generate key signal inputs related to user settings and function control of the electronic apparatus. The output device 540 may include a display screen, speakers, etc. of electronic equipment.
The electronic device provided by the embodiment of the application can edit an original image according to a target text through the pre-trained image editing model, obtaining a target image that matches the text description while preserving the non-editing area of the original image.
The image editing apparatus, the medium, and the electronic device provided in the above embodiments may execute the image editing method provided in any embodiment of the present application, and have corresponding functional modules and beneficial effects for executing the method. For technical details that are not described in detail in the above embodiments, reference may be made to an image editing method provided in any embodiment of the present application.
It is to be noted that the foregoing is only a description of preferred embodiments of the present invention and of the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions may be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, the present invention is not limited to those embodiments, and may include other equivalent embodiments without departing from its spirit; the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. An image editing method, characterized in that the method comprises:
acquiring an original image to be edited and a target text corresponding to the original image; the target text is used for editing the original image;
acquiring a pre-trained image editing model; the image editing model comprises a convolution layer, a text encoder, a residual block and a decoding unit;
inputting the original image into a convolution layer in the image editing model, extracting global image features of the original image through the convolution layer, simultaneously inputting the target text into the text encoder, and extracting global text features and word features of the target text through the text encoder; fusing the global image features and the global text features through the residual block to obtain hidden layer image features; and integrating the hidden layer image features and the word features through the decoding unit to obtain a target image edited from the original image.
2. The method of claim 1, wherein the decoding unit comprises an attention model and a channel attention mechanism unit;
and wherein the integrating of the hidden layer image features and the word features through the decoding unit to obtain a target image edited from the original image comprises:
the attention model carries out attention calculation on the hidden layer image characteristics based on the word characteristics to obtain a first image;
the channel attention mechanism unit determines a weight value corresponding to each channel of the first image based on the hidden layer image characteristics, performs weighting processing on the channel characteristics of the corresponding channel based on the weight value to obtain a second image, and takes the second image as a target image after the original image is edited.
3. The method of claim 2, wherein the decoding unit further comprises an upsampling unit and a convolutional layer;
before the attention model performs attention calculation on the hidden layer image feature based on the word feature to obtain a first image, the method further includes:
the up-sampling unit up-samples the hidden layer image characteristics;
and performing channel compression on the up-sampled hidden layer image features through the convolution layer in the decoding unit.
4. The method of claim 1, wherein the image editing model further comprises an unconditional discriminator and a conditional discriminator;
after obtaining the edited target image of the original image, the method further comprises the following steps:
judging the retention effect of the target image on the non-editing area of the original image through the unconditional discriminator;
and judging the editing effect of the target image based on the global text features through the condition discriminator.
5. An image editing apparatus, characterized in that the apparatus comprises:
the system comprises an original image and target text acquisition module, a target text editing module and a target text editing module, wherein the original image and target text acquisition module is used for acquiring an original image to be edited and a target text corresponding to the original image; the target text is used for editing the original image;
the image editing model acquisition module is used for acquiring a pre-trained image editing model; the image editing model comprises a convolution layer, a text encoder, a residual block and a decoding unit;
the target image acquisition module is used for inputting the original image into a convolutional layer in the image editing model, extracting global image features of the original image through the convolutional layer, simultaneously inputting the target text into the text encoder, and extracting global text features and word features of the target text through the text encoder; fusing the global image features and the global text features through the residual block to obtain hidden layer image features; and integrating the hidden layer image features and the word features through the decoding unit to obtain a target image edited from the original image.
6. The apparatus of claim 5, wherein the decoding unit comprises an attention model and a channel attention mechanism unit; the target image acquisition module includes:
the first image acquisition sub-module is used for carrying out attention calculation on the hidden layer image characteristics based on the word characteristics through the attention model to obtain a first image;
and the target image acquisition sub-module is used for determining a weight value corresponding to each channel of the first image based on the hidden layer image characteristics through the channel attention mechanism unit, performing weighting processing on the channel characteristics of the corresponding channels based on the weight values to obtain a second image, and taking the second image as a target image after the original image is edited.
7. The apparatus of claim 6, wherein the decoding unit further comprises an upsampling unit and a convolutional layer; the target image acquisition module further includes:
the up-sampling sub-module is used for up-sampling the hidden layer image features before the attention model performs attention calculation on the hidden layer image features based on the word features to obtain a first image;
and the channel compression sub-module is used for performing channel compression on the up-sampled hidden layer image features.
8. The apparatus of claim 5, wherein the image editing model further comprises an unconditional discriminator and a conditional discriminator; the device further comprises:
the retention effect judging module is used for judging, through the unconditional discriminator, the retention effect of the target image on the non-editing area of the original image after the target image edited from the original image is obtained;
and the editing effect judging module is used for judging the editing effect of the target image based on the global text characteristics through the condition discriminator.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, implements the image editing method as claimed in any one of claims 1 to 4.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the image editing method as claimed in any one of claims 1 to 4 when executing the computer program.
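The per-channel weighting recited in claim 2 (a weight is determined for each channel of the first image from the hidden layer image features, and each channel is scaled by its weight) can be illustrated with a minimal plain-Python sketch. The sigmoid gating and all names below are illustrative assumptions, not the claimed implementation:

```python
import math

def channel_weights(hidden_feats):
    # One gate value per channel, squashed to (0, 1) with a sigmoid.
    return [1.0 / (1.0 + math.exp(-h)) for h in hidden_feats]

def apply_channel_attention(first_image, hidden_feats):
    # Scale every pixel of each channel of the first image by that
    # channel's weight to obtain the second (target) image.
    ws = channel_weights(hidden_feats)
    return [[w * px for px in ch] for w, ch in zip(ws, first_image)]

first_image = [[1.0, 2.0], [3.0, 4.0]]  # 2 channels, 2 "pixels" each
hidden = [0.0, 2.0]                     # per-channel hidden features
second_image = apply_channel_attention(first_image, hidden)
print(second_image[0])  # → [0.5, 1.0]: channel 0 scaled by sigmoid(0) = 0.5
```

The design intent this sketch captures is that channels whose hidden-layer evidence is stronger contribute more to the edited output, while weakly supported channels are attenuated.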
CN202210561795.8A 2022-05-23 2022-05-23 Image editing method, device, medium and equipment Pending CN115035213A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210561795.8A CN115035213A (en) 2022-05-23 2022-05-23 Image editing method, device, medium and equipment


Publications (1)

Publication Number Publication Date
CN115035213A true CN115035213A (en) 2022-09-09

Family

ID=83121487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210561795.8A Pending CN115035213A (en) 2022-05-23 2022-05-23 Image editing method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN115035213A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761222A (en) * 2022-09-27 2023-03-07 阿里巴巴(中国)有限公司 Image segmentation method, remote sensing image segmentation method and device
CN115761222B (en) * 2022-09-27 2023-11-03 阿里巴巴(中国)有限公司 Image segmentation method, remote sensing image segmentation method and device
CN116168119A (en) * 2023-02-28 2023-05-26 北京百度网讯科技有限公司 Image editing method, image editing device, electronic device, storage medium, and program product
CN116168119B (en) * 2023-02-28 2024-05-28 北京百度网讯科技有限公司 Image editing method, image editing device, electronic device, storage medium, and program product
CN116543074A (en) * 2023-03-31 2023-08-04 北京百度网讯科技有限公司 Image processing method, device, electronic equipment and storage medium
CN116543075A (en) * 2023-03-31 2023-08-04 北京百度网讯科技有限公司 Image generation method, device, electronic equipment and storage medium
CN116543075B (en) * 2023-03-31 2024-02-13 北京百度网讯科技有限公司 Image generation method, device, electronic equipment and storage medium
CN116543074B (en) * 2023-03-31 2024-05-17 北京百度网讯科技有限公司 Image processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN115035213A (en) Image editing method, device, medium and equipment
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN112465049A (en) Method and device for generating anomaly detection model and method and device for detecting anomaly event
CN111027291A (en) Method and device for adding punctuation marks in text and training model and electronic equipment
CN115129848B (en) Method, device, equipment and medium for processing visual question-answering task
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN111428025A (en) Text summarization method and device, electronic equipment and storage medium
CN114359775A (en) Key frame detection method, device, equipment, storage medium and program product
CN114723843B (en) Method, device, equipment and storage medium for generating virtual clothing through multi-mode fusion
CN116304042A (en) False news detection method based on multi-modal feature self-adaptive fusion
CN116452706A (en) Image generation method and device for presentation file
CN115481283A (en) Audio and video feature extraction method and device, electronic equipment and computer readable storage medium
CN114419527B (en) Data processing method, equipment and computer readable storage medium
CN117079299A (en) Data processing method, device, electronic equipment and storage medium
CN115115740A (en) Thinking guide graph recognition method, device, equipment, medium and program product
CN112418345B (en) Method and device for quickly identifying small targets with fine granularity
CN113609866A (en) Text marking method, device, equipment and storage medium
CN113409803A (en) Voice signal processing method, device, storage medium and equipment
CN117093864A (en) Text generation model training method and device
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
JP2023133274A (en) Training method for roi detection model, detection method, apparatus therefor, device therefor, and medium therefor
CN116959417A (en) Method, apparatus, device, medium, and program product for detecting dialog rounds
CN115116444A (en) Processing method, device and equipment for speech recognition text and storage medium
CN114564606A (en) Data processing method and device, electronic equipment and storage medium
CN113571063A (en) Voice signal recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination