CN113689328A - Image harmony system based on self-attention transformation - Google Patents
Image harmony system based on self-attention transformation
- Publication number
- CN113689328A (application CN202111067167.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T3/00 — Geometric image transformations in the plane of the image
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06T9/002 — Image coding using neural networks
Abstract
The invention relates to the technical field of image processing, and discloses two image harmonization systems based on self-attention transformation (Transformer) networks, one non-decoupled and one decoupled, both of which exploit the strong long-range context modeling capability of the self-attention transformation network. The non-decoupled image harmonization module uses the self-attention transformation network to fully mine the relationship between the foreground and the background in the feature space of the composite image, so as to guide its harmonization. Alternatively, the decoupled image harmonization module uses a self-attention transformation encoder and decoder to decouple a latent vector encoding of the background illumination, fuses this background-light latent code with the reflectivity feature image through another self-attention transformation decoder to generate an illumination intrinsic image, and finally multiplies the reflectivity intrinsic image by the illumination intrinsic image to obtain the harmonized image. The foreground illumination is thereby adjusted to be compatible with the background illumination while the semantics and structure of the composite image remain unchanged, solving the problem of disharmony between the foreground and background of the composite image.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to an image harmony system based on self-attention transformation.
Background
Combining arbitrary regions of different images into a visually plausible composite image is a basic task in many applications of computer vision and graphics, such as image synthesis, image stitching, image editing and scene synthesis, and image compositing is also a common operation in daily life. However, a composite image obtained by copying and pasting a partial region of one image (the foreground of the composite image) into another image (the background) inevitably suffers from disharmony between the foreground and the background, because the foreground region and the background region (the rest of the composite image outside the foreground) were captured under different imaging environments (such as day and night, sunny and cloudy weather, indoors and outdoors). Therefore, making the composite image look more realistic by simple and efficient means, i.e., harmonizing the image, is an important and challenging task.
Traditional image harmonization methods focus on better matching techniques, ensuring appearance consistency between foreground and background by migrating statistical information such as color and texture. Recently, deep harmonization models and large-scale datasets have been developed to address this challenging task and have achieved good results. Current deep learning models mainly adopt an encoder-decoder Convolutional Neural Network (CNN) architecture, which first uses the encoder to learn the color information of the background appearance near the foreground region, then captures the context of the composite image to adjust the appearance or illumination of the foreground region to be consistent with the background, and finally reconstructs the harmonized image with the decoder.
In fact, the commonly used encoder-decoder convolutional neural network architecture accomplishes the image harmonization task in two stages. The first stage mainly adjusts the color of the foreground region in a multi-layer feature space based on the color statistics of the background region of the composite image, making it compatible with the background color; the second stage mainly reconstructs the original structure and semantic information of the image, together with harmonized low-level visual features, from the high-dimensional feature space. However, the inductive bias arising from the locality of the CNN itself means that a convolutional neural network can only attend to locally limited information, so a shallow CNN can only capture the background context near the foreground and lacks global background context. Yet the overall consistency of harmonization across the image is a key element in evaluating the visual realism of a composite image, and a CNN may be unable to fully exploit global background information to adjust the foreground color to be consistent with the overall background color.
In addition, previous methods adopt a U-Net multi-layer CNN structure with skip connections. Although U-Net can enlarge the receptive field by stacking multiple CNN layers to capture the global context of the image, the skip connections from encoder to decoder may reintroduce the original disharmonious information of the composite image into the reconstructed image and degrade the performance of the image harmonization model.
Disclosure of Invention
The invention provides an image harmonization system based on self-attention transformation, which solves the following technical problem: during image harmonization, capturing both the background context near the foreground and the overall context of the image without reintroducing disharmonious information, so as to resolve the disharmony between the foreground and background of the composite image to the greatest extent.
In order to solve the above technical problem, the invention provides an image harmonization system based on self-attention transformation, which comprises either a non-decoupled image harmonization module or a decoupled image harmonization module;
the non-decoupled image harmonization module is used for performing a direct self-attention transformation on the input composite image and mask image by using a self-attention transformation network to generate the corresponding harmonized image;
the decoupled image harmonization module comprises a reflectivity image generation module, a background light decoupling module, an illumination transformation module and a synthesis module;
the reflectivity image generation module is used for performing a decoupled self-attention transformation on the input composite image and mask image to generate the reflectivity intrinsic image of the composite image;
the background light decoupling module is used for decoupling the background light from the background image of the composite image by using a self-attention transformation network, so that the background light can be cast onto the reflectivity intrinsic image;
the illumination transformation module is used for further generating an illumination intrinsic image, using a self-attention transformation network, from the reflectivity intrinsic image onto which the background light has been cast;
the synthesis module is used for performing a point-wise multiplication of the reflectivity intrinsic image and the illumination intrinsic image to generate the harmonized image of the composite image.
Specifically, the non-decoupled image harmonization module includes a first encoder, a first serialization transformation module, a first self-attention transformation module, a first serialization inverse transformation module, and a first decoder;
the first encoder is used for encoding the input composite image and the mask image into a feature space to obtain a feature image, which is input into the first serialization transformation module;
the first serialization transformation module performs a serialization transformation on the input feature image to generate the input tokens of the first self-attention transformation module;
the first self-attention transformation module is used for performing a direct self-attention transformation on the input tokens generated by the first serialization transformation module to obtain output tokens, which are input into the first serialization inverse transformation module;
the first serialization inverse transformation module is used for performing an inverse serialization transformation on the output tokens to generate a harmonized feature image;
the first decoder is configured to decode the harmonized feature image into the harmonized image corresponding to the composite image.
Specifically, the reflectivity image generation module includes a second encoder, a second serialization transformation module, a second self-attention transformation module, a second serialization inverse transformation module, and a second decoder;
the second encoder is used for encoding the input composite image and the mask image into a feature space to obtain a feature image, which is input into the second serialization transformation module;
the second serialization transformation module performs a serialization transformation on the input feature image to generate the input tokens of the second self-attention transformation module;
the second self-attention transformation module is used for performing a decoupled self-attention transformation on the input tokens generated by the second serialization transformation module to obtain reflectivity image output tokens, which are input into the second serialization inverse transformation module and the illumination transformation module;
the second serialization inverse transformation module is used for performing an inverse serialization transformation on the output tokens to generate a reflectivity intrinsic feature image;
the second decoder is configured to decode the reflectivity intrinsic feature image into the reflectivity intrinsic image corresponding to the composite image.
Specifically, the background light decoupling module includes a linear transformation module, a third self-attention transformation module, and a fourth self-attention transformation module;
the linear transformation module is used for dividing an input background image into an image block sequence, then flattening each image block to serve as a token and encoding the token into a feature space through linear mapping to generate an input token of the third self-attention transformation module;
the third self-attention transformation module is used for performing self-attention transformation coding on the input token of the third self-attention transformation module to generate the input token of the fourth self-attention transformation module;
the fourth self-attention transformation module is used for performing self-attention transformation decoding on the input token of the fourth self-attention transformation module, generating a background light hidden vector coding token and inputting the background light hidden vector coding token into the illumination transformation module.
Specifically, the illumination transformation module includes a fifth self-attention transformation module, a third serialization inverse transformation module, and a third decoder;
the fifth self-attention transformation module is used for performing self-attention transformation on the background light hidden vector encoding token and the reflectivity image output token to generate a corresponding illumination intrinsic image output token;
the third serialization inverse transformation module is used for carrying out serialization inverse transformation on the illumination intrinsic image output token to generate an illumination intrinsic characteristic image corresponding to the synthesized image;
and the third decoder is used for decoding the illumination intrinsic characteristic image and outputting an illumination intrinsic image corresponding to the synthesized image.
Specifically, in the training process, both the non-decoupled image harmonization module and the decoupled image harmonization module adopt a single loss function that encourages the harmonized image of the composite image to approximate its real image.
Specifically, the first encoder and the second encoder both employ the encoder of a CNN network, and the first decoder, the second decoder and the third decoder all employ the decoder of a CNN network.
Specifically, the first self-attention transformation module, the second self-attention transformation module, and the third self-attention transformation module all employ the encoder TRE of a self-attention transformation network, and the fourth self-attention transformation module and the fifth self-attention transformation module both employ the decoder TRD of the self-attention transformation network;
TRE consists of a stack of structurally identical layers, each of which contains a multi-head self-attention sublayer and a feed-forward network sublayer; TRE is intended to output a self-attention map based on modeling the dependencies between the input tokens (image blocks);
TRD is likewise made up of a stack of structurally identical layers, where each layer, in addition to the two sublayers identical to TRE, has a third encoder-decoder cross-attention sublayer that performs a multi-head attention operation between the TRE output and the TRD itself; TRD aims to learn a mapping from a source domain to a target domain, generating a feature matrix related to the task.
Specifically, the first self-attention transforming module, the second self-attention transforming module, the third self-attention transforming module, the fourth self-attention transforming module and the fifth self-attention transforming module all use 2 attention heads and 9 attention layers.
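A minimal NumPy rendering of one multi-head self-attention sublayer (the core of each TRE/TRD layer), shown with the 2 heads stated above. This is a simplified sketch: the square weight shapes, and the omission of residual connections, layer normalization and the feed-forward sublayer, are assumptions for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, heads=2):
    """One multi-head self-attention sublayer over a token sequence X of shape (n, d).
    All weight matrices are (d, d); each head attends within its own d/heads subspace."""
    n, d = X.shape
    dh = d // heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outs = []
    for h in range(heads):
        q, k, v = (M[:, h * dh:(h + 1) * dh] for M in (Q, K, V))
        A = softmax(q @ k.T / np.sqrt(dh))   # (n, n) attention map over tokens
        outs.append(A @ v)                   # weighted mixture of token values
    return np.concatenate(outs, axis=1) @ Wo # merge heads and project back to d
```

With identical input tokens the attention map is uniform and the sublayer reduces to the value projection, which gives a simple sanity check.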
The image harmonization system based on self-attention transformation provided by the embodiments of the invention exploits the strong long-range context modeling capability of the self-attention transformation network. It can directly generate harmonized images in the non-decoupled mode, which has a simple structure and a good harmonization effect. For an even better effect, the decoupled mode can be adopted: the reflectivity intrinsic image of the composite image is obtained through the reflectivity image generation module, the illumination intrinsic image is obtained through the background light decoupling module and the illumination transformation module, and the harmonized image is obtained by combining the reflectivity intrinsic image and the illumination intrinsic image. This adjusts the foreground illumination to be compatible with the background illumination while keeping the semantics and structure of the composite image unchanged, solving the disharmony between the foreground and background of the composite image. Experimental results show that the system achieves state-of-the-art performance on the image harmonization task.
Drawings
FIG. 1 is a block diagram of an input mode framework for performing image vision tasks using a Transformer according to an embodiment of the present invention;
FIG. 2 is a block diagram of a non-decoupled image harmonization module (HT model) according to an embodiment of the present invention;
FIG. 3 is a detailed block diagram of a non-decoupled image harmonization module (HT model) according to an embodiment of the present invention;
FIG. 4 is a block diagram of the decoupled image harmonization module (D-HT model) provided by an embodiment of the present invention;
FIG. 5 is a refined framework diagram of the model shown in FIG. 4 provided by an embodiment of the present invention;
FIG. 6 is a detailed block diagram of the model shown in FIG. 5 provided by an embodiment of the present invention;
FIG. 7 is an illustration of the visual effect of various image harmonization methods provided by embodiments of the present invention on the four sub-datasets and the full iHarmony4 dataset;
FIG. 8 is a graph showing the image and harmonious visual effects provided by an embodiment of the present invention compared using a normal mask (middle row) and a reverse mask (bottom row);
FIG. 9 is a diagram of image visual effect presentation with different outputs under different lighting conditions provided by an embodiment of the present invention;
FIG. 10 is a diagram illustrating the visual effect of output images with different illumination obtained by modifying the latent vector encoding (Lt) of a target image according to an embodiment of the present invention;
FIG. 11 is a diagram illustrating the visual effect of the images and the harmonizing method on the real synthesized image according to the embodiment of the present invention;
FIG. 12 is a diagram illustrating the visual effect of the image completion method on the Paris street View data set according to the embodiment of the present invention;
FIG. 13 is a diagram illustrating the visual effect of the image enhancement method on the MIT-Adobe-5K-UPE data set according to the embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings, which are given solely for the purpose of illustration and are not to be construed as limitations of the invention, since many variations thereof are possible without departing from the spirit and scope of the invention.
Self-attention transformation networks (Transformers) benefit from an elaborate self-attention mechanism designed to capture long-range context, and the Transformer, as a new neural network structure, is rapidly gaining wide attention in both research and industry. The Transformer was first applied to Natural Language Processing (NLP) tasks in place of RNNs and LSTMs, and achieved remarkable results in many NLP tasks. Nowadays, benefiting from the powerful feature representation capability of Transformers, researchers are applying them to various computer vision tasks, such as object detection, image recognition and image processing.
Self-attention transformation networks (Transformers) were first applied to sequential data processing tasks such as machine translation in natural language; they do not rely on recurrence but on a self-attention mechanism to describe global dependencies between inputs and outputs. Thus, to use a Transformer for computer vision tasks, a 2D image must be represented as 1D sequence data whose elements or encodings are treated as tokens (like words in NLP), and this serialized data is taken as the Transformer input. In practice, image blocks can be used as tokens to avoid the extremely long sequences that would result from using individual pixels as tokens. In this work, this embodiment preliminarily analyzes the impact of different token numbers and different embedding types on the Transformer for image harmonization. For the number of tokens, different step sizes are used when splitting the image into image blocks. For the encoding mode, two projection modes are used: linear (FC or CONV) and nonlinear (an MLP or CNN containing a nonlinear activation function). Experiments show that the Transformer is more sensitive to the number of tokens and insensitive to the encoding type. The image input method is shown in Fig. 1.
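The token-count trade-off discussed above can be made concrete with a small helper. The image, patch, and stride values in the check below are illustrative only and are not taken from the patent's experiments.

```python
def num_tokens(H, W, patch, stride):
    """Number of tokens produced when an H x W image is split into patch x patch
    blocks sampled every `stride` pixels (stride == patch gives non-overlapping blocks;
    stride < patch gives overlapping blocks and hence more tokens)."""
    return ((H - patch) // stride + 1) * ((W - patch) // stride + 1)
```

For example, a 256 x 256 image split into 4 x 4 blocks with stride 4 yields 64 x 64 = 4096 tokens, while halving the stride roughly quadruples the token count, which is the sensitivity the experiments probe.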
The self-attention transformation network (Transformer) comprises an encoder TRE(·) for capturing token relations and a decoder TRD(·) for generating the task output. TRE consists of a stack of structurally identical layers, where each layer contains one multi-head self-attention sublayer and one feed-forward network sublayer. TRD is likewise a stack of structurally identical layers, where each layer has, in addition to the two sublayers of TRE, a third encoder-decoder cross-attention sublayer that performs a multi-head attention operation between the TRE output and the TRD itself. Thus TRE uses self-attention to explore the relationships among its input vectors, while TRD performs cross-attention to find the correlations between its own input and the TRE output. For visual tasks with images as input, TRE aims to generate a task-related feature matrix by modeling the dependencies between input tokens (image blocks) and then outputting a self-attention map, while TRD aims to learn the mapping from a source domain (TRE input) to a target domain (TRD input/output). This work explores the capability of TRE and TRD on the image harmonization task and the influence of different numbers of attention heads and layers on Transformer performance, with the goal of solving the image harmonization problem by exploiting the Transformer's powerful long-range context modeling capability so as to make full use of global background context. To solve the disharmony of color appearance caused by different illumination conditions between the foreground and background of a composite image, this example provides an image harmonization system based on self-attention transformation. It first designs a simple non-decoupled self-attention transformation image harmonization framework (HT), i.e., the non-decoupled image harmonization module, which introduces a self-attention transformation network (Transformer) between a very basic convolutional (CNN) encoder and decoder and performs a direct self-attention transformation on the input composite image and mask image to generate the corresponding harmonized image.
As shown in Figs. 2 and 3, the non-decoupled image harmonization module includes a first encoder E (the encoder of a CNN network), a first serialization transformation module R, a first self-attention transformation module TRE (the encoder of a self-attention transformation network), a first serialization inverse transformation module R', and a first decoder D (the decoder of a CNN network).
The first encoder E encodes the input composite image Ĩ and the mask image M into a feature space to obtain a feature image, which is input into the first serialization transformation module R. The first serialization transformation module R serializes the input feature image to generate the input tokens of the first self-attention transformation module TRE. The first self-attention transformation module TRE performs a direct self-attention transformation on the input tokens generated by R to obtain output tokens, which are input into the first serialization inverse transformation module R'. The module R' performs an inverse serialization transformation on the output tokens to generate a harmonized feature image. The first decoder D decodes the harmonized feature image into the harmonized image Ĥ corresponding to the composite image Ĩ.
The CNN encoder E aims to encode the composite image into a compact feature space, with the pixels of the feature map serving as the Transformer input, while the CNN decoder D reconstructs the Transformer output into a harmonized image corresponding to the input image. This design actually embeds the Transformer between the CNN codec under the basic encoder-decoder architecture, and is relatively fair in comparison with current mainstream image harmonization methods. Furthermore, for a low-level visual task in which much information (semantics, structure, etc.) is unchanged between the input image and the output image, the cross-attention and self-attention modules in TRD can be regarded as playing similar roles, so this example uses only TRE in the HT framework.
For the image harmonization task, given a composite image Ĩ with a corresponding foreground mask image M, the goal is to generate a harmonized image Ĥ with compatible foreground and background as output; i.e., Ĥ should be as close as possible to the real image H. Specifically, the CNN encoder E(·) generates a lower-resolution feature image F ∈ R^(h×w×c), with c = 256, where H and W denote the height and width of the composite image. The pixels of the feature image F (each corresponding to an image block of the input image) are then serialized into F' ∈ R^(hw×c), which serves as the input tokens of TRE; each input token encodes the feature value of every channel at one pixel. In addition, similar to the use of the original Transformer in NLP tasks, this example obtains the position code E of each token from the actual coordinates of each pixel in the feature image F according to the fixed sine-cosine positional encoding, and uses it as the token position input of TRE. Finally, the sequence data output by TRE is inversely transformed, according to the original position coordinates, into a feature image of the same size as F, and input to the CNN decoder D(·) to generate the harmonized image Ĥ.
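The fixed sine-cosine positional code referred to above can be sketched as in the original Transformer formulation. Treating each feature-map pixel's flattened index as its position, and the exact 10000-base frequency schedule, are assumptions of this sketch rather than details confirmed by the patent.

```python
import numpy as np

def positional_encoding(positions, c):
    """Fixed sine-cosine positional codes for integer token positions.
    Returns an array of shape (len(positions), c): even channels carry sines,
    odd channels carry cosines, at geometrically spaced frequencies."""
    pos = np.asarray(positions, dtype=float)[:, None]      # (n, 1)
    i = np.arange(c // 2, dtype=float)[None, :]            # (1, c/2) frequency index
    angles = pos / (10000.0 ** (2 * i / c))
    pe = np.empty((len(pos), c))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```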
This example formulates the non-decoupled self-attention transformation image harmonization model as:

Ĥ = D(Φ'(TRE(Φ(E(Ĩ, M)))))

where Φ and Φ' denote the serialization transformation and the inverse serialization transformation, respectively.
Adjusting the foreground illumination according to the background illumination is the key to solving the image disharmony problem. In addition, the diffuse reflection model based on intrinsic images and Retinex theory assumes that the light intensity value of an image actually encodes all the features of the corresponding scene point. Therefore, this embodiment also uses the Transformer to capture the background light and cast it onto the reflectivity intrinsic image, achieving harmonization of the illumination intrinsic image by decomposing the composite image into a reflectivity intrinsic image and an illumination intrinsic image. As shown in Fig. 4, the image harmonization system based on self-attention transformation according to this embodiment includes a reflectivity image generation module, a background light decoupling module, an illumination transformation module, and a synthesis module. Wherein:
the reflectivity image generation module is used for synthesizing the input imageSelf-injection decoupled from mask image MPerforming a semantic transformation to generate a composite imageReflectivity eigen image of
The background-light decoupling module is used for decoupling the background light l_bg from the background image of the composite image (obtained by removing the foreground region from the composite image) using a self-attention transformation network, so that l_bg can be cast onto the reflectivity intrinsic image. The illumination transformation module is used for further generating the illumination intrinsic image, using a self-attention transformation network, from the reflectivity intrinsic image irradiated with the background light l_bg. The synthesis module is used for performing a dot-product operation on the reflectivity intrinsic image and the illumination intrinsic image to generate the harmonized image of the composite image; this process can be formulated as:
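The synthesis step described above can be sketched as an element-wise product of the two intrinsic images; the toy values below are illustrative assumptions, and per the later implementation note both images are taken as normalized to [0, 1].

```python
import numpy as np

# Minimal sketch of the synthesis module: the harmonized image is the
# element-wise (dot) product of the reflectivity intrinsic image R and
# the illumination intrinsic image L, both normalized to [0, 1].

def synthesize(R, L):
    assert R.shape == L.shape
    return R * L

R = np.full((256, 256, 3), 0.5)   # toy reflectivity intrinsic image
L = np.full((256, 256, 3), 0.8)   # toy illumination intrinsic image
H_bar = synthesize(R, L)          # harmonized image
```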
Based on this, specifically, as shown in fig. 5 and 6, the reflectivity image generation module includes a second encoder E_R (an encoder using a CNN network), a second serialization transformation module R_1, a second self-attention transformation module TRE_R (an encoder using a self-attention transformation network), a second inverse serialization transformation module R′_1, and a second decoder D_R (a decoder using a CNN network). Wherein:
The second encoder E_R is used for encoding the input composite image and the mask image M into a feature space to obtain a feature image, which is input into the second serialization transformation module R_1; this process can be formulated as:
Specifically, the CNN encoder E_R generates a lower-resolution feature image F ∈ R^(h×w×c) with c = 256, where H and W denote the height and width of the composite image.
The second serialization transformation module R_1 is used for serializing the feature image F ∈ R^(h×w×c) into a number of tokens and position-encoding them to obtain the input tokens; this process can be formulated as:
Specifically, the pixels of the feature image F (each corresponding to an image block of the input image) are serialized into F′ ∈ R^(hw×c), which serves as the input tokens of the second self-attention transformation module TRE_R; each input token encodes the feature value of every channel at one pixel. In addition, the position code E_r of each token is obtained from the actual coordinates of each pixel in the feature image F via fixed sine/cosine positional encoding and used as the token position input of TRE_R.
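The fixed sine/cosine position code described above can be sketched as follows, in the style of the original Transformer. A 1-D code over flattened pixel positions is an illustrative assumption; a practical 2-D variant might split the channels between row and column coordinates.

```python
import numpy as np

# Sketch of the fixed sine/cosine positional code E_r derived from each
# pixel's flattened position in the feature image F. Even channels carry
# sin, odd channels carry cos, with geometrically spaced frequencies.

def sincos_position_code(num_tokens, dim):
    pos = np.arange(num_tokens)[:, None]          # token positions (hw, 1)
    i = np.arange(0, dim, 2)[None, :]             # channel-pair indices
    angle = pos / np.power(10000.0, i / dim)
    E = np.zeros((num_tokens, dim))
    E[:, 0::2] = np.sin(angle)
    E[:, 1::2] = np.cos(angle)
    return E

E_r = sincos_position_code(num_tokens=64 * 64, dim=256)  # assumed 64x64 feature map
```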
The second self-attention transformation module TRE_R is used for performing self-attention decoupling transform encoding on the input tokens to generate the corresponding reflectivity-image output tokens; this process can be formulated as:
The second inverse serialization transformation module R′_1 is used for applying the inverse of the serialization transformation R_1 to the reflectivity-image output tokens to obtain a reflectivity intrinsic feature image of the same size as the feature image F; this process can be formulated as:
The second decoder D_R is used for decoding the reflectivity intrinsic feature image and outputting a reflectivity intrinsic image of the same size as the composite image; this process can be formulated as:
therefore, the whole process of the reflectivity image generation module can be formulated as:
Specifically, as shown in fig. 5 and 6, the background-light decoupling module includes a linear transformation module LP, a third self-attention transformation module TRE_L (an encoder using a self-attention transformation network), and a fourth self-attention transformation module TRD_L (a decoder using a self-attention transformation network). Wherein:
the linear transformation module LP is used to transform the background image(the number of channels C is 3, H, Y represents the height and width of the image, respectively, and the background image is as wide as and as high as the synthesized image) into a sequence of image blocks(number of image blocks)Size P of tiles 8), then each tile is flattened as a token and coded into C' -256 dimensional feature space by linear mapping LP (·), and fixed position coding Ep(the position coordinates of the image block in the original image are obtained by sine and cosine coding) is added into the token code to obtain a third self-attention transformation module TRELThe process can be formulated as:
The third self-attention transformation module TRE_L is used for performing self-attention transform encoding on its input tokens to generate the input tokens of the fourth self-attention transformation module; this process can be formulated as:
The fourth self-attention transformation module TRD_L is used for performing self-attention transform decoding on its input tokens to generate the corresponding background-light latent vector encoding token, which is input into the illumination transformation module; this process can be formulated as:
Here the light-encoding token sequence (of dimension d_l = 27, the spherical-harmonic coefficients) is the initial input of TRD_L, E_l denotes a learnable light position code, and the initial value of the light-encoding token is zero.
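How a zero-initialized light token plus a learnable position code can pool a latent light code from the encoded background tokens may be sketched as a single cross-attention step; the single head, single layer, and random stand-in weights are all illustrative assumptions.

```python
import numpy as np

# Sketch of TRD_L pooling the background-light code: a zero light token
# plus learnable position code E_l attends over the TRE_L output, and a
# final linear head maps to d_l = 27 spherical-harmonic coefficients.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, d_l = 256, 27
mem = np.random.rand(1024, d)            # TRE_L output (background tokens)
t_light = np.zeros((1, d))               # light token starts at zero
E_l = np.random.randn(1, d) * 0.02       # learnable light position code (stand-in)
q = t_light + E_l
attn = softmax(q @ mem.T / np.sqrt(d))   # 1 x 1024 cross-attention weights
pooled = attn @ mem                      # 1 x d pooled light feature
W_head = np.random.randn(d, d_l) * 0.02  # stand-in output head
l_bg = pooled @ W_head                   # 1 x 27 latent light code
```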
As shown in fig. 5 and 6, the illumination transformation module includes a fifth self-attention transformation module TRD_I (a decoder using a self-attention transformation network), a third inverse serialization transformation module R′_2, and a third decoder D_I (a decoder using a CNN network).
The fifth self-attention transformation module TRD_I is used for performing self-attention transform decoding on the background-light latent vector encoding token and the reflectivity-image output tokens to generate the corresponding illumination intrinsic-image output tokens; this process can be formulated as TRD_I(t_l + E_l, t_r + E_r), where t_l and E_l denote the learned light-encoding token sequence and the light position code (output by the fourth self-attention transformation module TRD_L), respectively, while t_r denotes the reflectivity intrinsic-image token sequence with position code E_r.
The third inverse serialization transformation module R′_2 is used for applying the inverse of the serialization transformation R_1 to the illumination intrinsic-image output tokens TRD_I(t_l + E_l, t_r + E_r) to obtain an illumination intrinsic feature image of the same size as the feature image; this process can be formulated as: φ′(TRD_I(t_l + E_l, t_r + E_r)).
The third decoder D_I is used for decoding the illumination intrinsic feature image and outputting an illumination intrinsic image of the same size as the composite image; this process can be formulated as:
It should also be noted that during training of the system, the same single norm loss function is likewise used to encourage the harmonized image of the composite image to approximate its real image H:
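The norm symbol was stripped from this page; assuming it denotes the commonly used L1 norm, the reconstruction loss can be sketched as:

```python
import numpy as np

# Sketch of the reconstruction loss (assumed L1): the mean absolute
# difference between the harmonized image H_bar and the real image H.

def l1_loss(H_bar, H):
    return np.mean(np.abs(H_bar - H))

H = np.full((256, 256, 3), 0.5)       # toy real image
H_bar = np.full((256, 256, 3), 0.6)   # toy harmonized output
loss = l1_loss(H_bar, H)
```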
In general, the decoupled image harmonization module shown in this example uses two Transformer encoders and two Transformer decoders: the encoder TRE_R takes the CNN-encoded tokens of the image blocks as input and generates the reflectivity intrinsic image; the encoder TRE_L takes the FC-encoded tokens of the image blocks as input and, together with the decoder TRD_L, generates the latent vector encoding of the background light; and the decoder TRD_I re-renders the reflectivity intrinsic image under the decoupled background light to generate the illumination intrinsic image. Finally, the reflectivity and illumination intrinsic images are dot-multiplied to generate the harmonized image.
The image harmonization system based on self-attention transformation provided by this embodiment of the invention exploits the strong long-range context modeling capability of the self-attention transformation network; it can directly generate harmonized images in a non-decoupled manner (figures 2 and 3), with a simple structure and a good harmonization effect. To achieve a better harmonization effect, a decoupled manner can be adopted: the reflectivity intrinsic image of the composite image is obtained through the reflectivity image generation module, and the illumination intrinsic image of the composite image is obtained through the background-light decoupling module and the illumination image generation module, so that the harmonized image is obtained by synthesizing the reflectivity intrinsic image and the illumination intrinsic image. This adjusts the foreground illumination to be compatible with the background illumination while keeping the semantics and structure of the composite image unchanged, solving the problem of disharmony between the foreground and background of the composite image.
The effect of the system provided in this example is verified experimentally.
The experiments use the synthetic iHarmony4 dataset and a real composite-image dataset. Experiments on the public synthetic iHarmony4 dataset analyze and evaluate the performance of the self-attention image harmonization models. The iHarmony4 dataset contains 4 sub-datasets, HCOCO, HAdobe5k, HFlickr, and Hday2night, in which each composite image is paired with a real image; this example follows the same experimental setup as DoveNet. As in the DoveNet evaluation, this example also evaluates the system's performance on a dataset of 99 real composite images.
Only the norm loss function is used as the reconstruction constraint for reflectivity and illumination, with the Adam optimizer (parameters β1 = 0.5, β2 = 0.999). The total number of training iterations is 60; the initial learning rate of the model is set to 1e-4 and decays to 1e-5 after 40 iterations. The last layers of the reflectivity decoder D_R and the illumination decoder D_I in the decoupled self-attention image harmonization model use the tanh activation function. The input image is resized to 256 × 256 for training and testing, and the model generates harmonized images of the same size.
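The training schedule above can be restated as a small sketch; the step function is a plain transcription of the stated hyperparameters, not code from the patent.

```python
# Sketch of the training schedule: Adam with beta1 = 0.5, beta2 = 0.999,
# 60 iterations total, learning rate 1e-4 decayed to 1e-5 after 40.

def learning_rate(iteration):
    return 1e-4 if iteration < 40 else 1e-5

adam_params = {"betas": (0.5, 0.999), "lr": learning_rate(0)}
schedule = [learning_rate(i) for i in range(60)]
```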
In particular, before the harmonized image is restored, the reflectivity and illumination images need to be normalized to the [0, 1] interval.
For experimental comparison, this example first constructs two classic network models for the image-to-image translation task as baselines: an encoder-decoder U-Net (E-D U-Net) and an encoder-decoder residual convolutional neural network model (E-D CNN, structured as Encoder-Resblocks-Decoder). Table 1 shows the quantitative evaluation results on the four sub-datasets and the overall iHarmony4 dataset, comparing HT, D-HC, and D-HT with the baseline models E-D U-Net and E-D CNN and the current best methods DIH, S²AM, and DoveNet; an upward arrow indicates higher is better, and a downward arrow indicates lower is better. HT is the non-decoupled self-attention image harmonization model shown in FIG. 3 (TRE with 2 attention heads and 9 attention layers), and D-HC and D-HT denote the decoupled model using CNNs and the decoupled model using Transformers, respectively (TRE and TRD with 2 heads and 9 layers, as shown in FIG. 6). The D-HC model is constructed by replacing the self-attention encoder TRE_R in D-HT with a Resblock encoder, replacing TRE_L and TRD_L with an MLP to decouple the background scene light, and replacing TRD_I with AdaIN to re-render the illumination intrinsic image. In addition, Table 1 also provides the evaluation results of the composite images against the real images as a reference (the "Composite" column).
TABLE 1
From the experimental results in Table 1, E-D (CNN) performs better than E-D (U-Net) on the HCOCO and HAdobe5k datasets and worse on the HFlickr and Hday2night datasets, probably because U-Net has a global receptive field that captures global context, but its skip connections may introduce disharmony factors into the reconstructed image, while the CNN has a limited receptive field due to its inductive bias. Overall, E-D (CNN) achieves a lower fMSE than E-D (U-Net) on the whole iHarmony4 dataset; the non-decoupled self-attention image harmonization model (HT), however, is superior not only to the two baselines E-D (U-Net) and E-D (CNN) but also to the other image harmonization methods, indicating that the Transformer's long-range context capability is very effective for the image harmonization task.
The quantitative comparisons in Table 1 show that the D-HC model achieves competitive or superior results compared to current state-of-the-art methods, demonstrating that separating and harmonizing the reflectivity and illumination intrinsic images does contribute to image harmonization. Likewise, the D-HT model has a very low fMSE score (320.78, versus 537.23 for S²AM and 541.53 for DoveNet), proving the accuracy and effectiveness of the D-HT design. In addition, D-HC performs better than HT on the Hday2night dataset, probably because of D-HC's better decoupling ability, while HT suffers from the small Hday2night training set (only 311 training images).
Fig. 7 shows the visual results of each image harmonization method (the boxed region in each composite image marks the disharmonious foreground; one example per dataset, HCOCO, HAdobe5k, HFlickr, and Hday2night from top to bottom); comparing the visual results, the harmonized images produced by the D-HT model are closest to the real images.
To analyze the impact of the number of input tokens and the encoding type on Transformer performance, this example uses a 1-head, 3-layer encoder for TRE with CNN reconstruction, and adjusts the number of tokens T via the stride S. The data in Table 2 show that, for both linear and non-linear encodings, the Transformer's performance increases steadily with the number of tokens. Furthermore, for a fixed number of tokens (e.g., 4N), the Transformer performs similarly regardless of which encoding scheme is chosen (linear FC or CONV, or non-linear MLP or CNN). It can thus be inferred that, for image harmonization, Transformer performance is sensitive to the number of tokens but insensitive to the way the tokens are encoded. Therefore, this example provides a long sequence with more tokens; even if there is redundancy between tokens, the Transformer can mine richer context, and different encoding methods all provide effective information for the image blocks.
TABLE 2
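The token-count ablation above can be sketched as a simple stride calculation; the 64 × 64 feature size is an illustrative assumption. Halving the serialization stride quadruples the token count, which is how configurations like "4N" arise.

```python
# Sketch of the ablation setup: the number of tokens T fed to the 1-head,
# 3-layer TRE is controlled by the serialization stride S over the
# h x w feature image.

def num_tokens(h, w, stride):
    return (h // stride) * (w // stride)

h, w = 64, 64
N = num_tokens(h, w, stride=2)    # baseline token count
N4 = num_tokens(h, w, stride=1)   # the "4N" configuration
```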
Experiments were further designed to verify the effect of the number of encoder (E) and decoder (D) layers of the HT-architecture self-attention transformation network on image harmonization. The fMSE↓ comparison in Table 3 shows that the Transformer performs similarly on the image harmonization task when the number of encoder layers equals the total number of encoder and decoder layers, even though the decoder has additional cross-attention layers. Therefore, this example adopts only the Transformer encoder TRE in the HT model design (as is also done for generating the reflectivity intrinsic image).
TABLE 3
To analyze the performance impact of different numbers of attention heads and layers on HT image harmonization, this example designed a further set of experiments. The quantitative comparisons in Table 4 show that more attention layers and heads help improve Transformer performance, but beyond 9 attention layers the room for further improvement is limited.
TABLE 4
An ablation study was performed on the Transformer parts of the D-HT model by replacing the Transformers of the reflectivity path and the illumination path, respectively, with the CNN structures used in the D-HC model; the quantitative comparisons in Table 5 demonstrate the superiority of the Transformer for the image harmonization task.
TABLE 5
In addition, this example performed another experiment via a foreground-mask inversion operation, i.e., exchanging the foreground and background regions of the composite image to generate an inverted mask, so that the D-HT model adjusts the background according to the foreground of the composite image to harmonize it. FIG. 8 compares the image harmonization results using a normal mask (middle row) and an inverted mask (bottom row), indicating that D-HT can produce meaningful harmonization results with any foreground mask.
This example further investigated the latent vector space of light to explore whether the Transformer can learn an illumination representation of an image. Given an image, this example uses the decoupled self-attention image harmonization (D-HT) model to obtain the latent vector encoding of its light, and arbitrarily alters that encoding so that the subsequent network produces different images. FIG. 9 shows the different output images under different lighting conditions, indicating that the background scene light learned with the Transformer encoder and decoder is accurate.
Further, this example also designed a set of combination experiments to verify scene-light learning and migration. As shown in fig. 10, two images (Source1 and Source2) are used as scene-light reference images and one image as the target image (Target) for light migration. First, the scene-light latent vector codes L_s1 and L_s2 of the two reference images are learned; then, using the formula L_t = αL_s1 + (1 − α)L_s2, different target scene-light latent codes L_t are obtained by adjusting the variable α; finally, the target code L_t is rendered onto the reflectivity feature image of the target image through the illumination migration model to generate images under different illumination. The qualitative results show that the scene-light learning and migration design of this example is effective and can also be applied to related tasks of generating images of different modalities.
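The interpolation formula above can be sketched directly; the toy 27-dimensional codes (matching the stated spherical-harmonic dimensionality) are illustrative assumptions, and the rendering step is stubbed out.

```python
import numpy as np

# Sketch of light migration: L_t = alpha * L_s1 + (1 - alpha) * L_s2
# interpolates between the latent light codes of two reference images.

def interpolate_light(L_s1, L_s2, alpha):
    return alpha * L_s1 + (1.0 - alpha) * L_s2

L_s1 = np.zeros(27)   # toy 27-dim spherical-harmonic light code
L_s2 = np.ones(27)
L_t = interpolate_light(L_s1, L_s2, alpha=0.25)
```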
Compared with the current state of the art, the B-T score is used to evaluate the image harmonization ability of the D-HT model on real composite images. The statistics in Table 6 and the visual results in fig. 11 show that the method of this example obtains the best B-T score and the best visual effect.
TABLE 6
In this embodiment, the non-decoupled HT image harmonization model is applied to the image completion task with random missing regions on the Paris StreetView dataset, verifying the practicality and extensibility of the designed HT model. The purpose of image completion is to fill the missing regions of an image by synthesizing visually realistic and semantically reasonable pixels consistent with the pixels of the known regions. Table 7 and FIG. 12 show the quantitative and visual results of the HT model versus the latest RFR-Net; by fully exploiting the Transformer's advantage in long-range modeling, the HT model demonstrates superior performance on the image completion task.
TABLE 7
This example also applies the decoupled self-attention image harmonization D-HT model to the image enhancement task on the MIT-Adobe-5K-UPE dataset, in comparison with the latest method DeepLPF. Poor lighting conditions during imaging degrade image quality, especially for underexposed images. Therefore, this example uses the D-HT model with a reconstruction loss function to decompose a low-light image into reflectivity and illumination images, and treats the reflectivity image as the enhanced image.
The quantitative comparison results in Table 8 show that D-HT is superior to DeepLPF method in PSNR, SSIM and LPIPS evaluation criteria. FIG. 13 further verifies that the D-HT model of the present example can recover the contrast, natural color, and sharp details of the image through a decoupled self-attention transform network.
TABLE 8
In summary of the experiments, this example proposes a new image harmonization method using a self-attention transformation network, aiming to eliminate the disharmony factors of composite images by exploiting the Transformer's long-range context modeling capability. This example not only establishes the non-decoupled and decoupled self-attention image harmonization frameworks (HT and D-HT), but also designs comprehensive experiments to explore and analyze the usage patterns and potential of the Transformer for image harmonization. In addition, the non-decoupled and decoupled self-attention image harmonization models are further applied to two classic computer-vision tasks, image completion and image enhancement, further illustrating the effectiveness and superiority of the designed method (D-HT model).
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (9)
1. An image harmonization system based on self-attention transform, comprising: the image harmonization module comprises a non-decoupling image harmonization module or a decoupling image harmonization module;
the non-decoupling image harmony module is used for performing direct self-attention transformation on the input synthetic image and the mask image by using a self-attention transformation network to generate a corresponding harmony image;
the decoupling image and harmonization module comprises a reflectivity image generation module, a background light decoupling module, an illumination image generation module and a synthesis module;
the reflectivity image generation module is used for carrying out decoupling self-attention transformation on an input composite image and a mask image to generate a reflectivity intrinsic image of the composite image;
the background light decoupling module is used for decoupling background light from a background image of the synthetic image by using a self-attention transformation network so as to irradiate the background light on the reflectivity intrinsic image;
the illumination conversion module is used for further generating an illumination intrinsic image for the reflectivity intrinsic image irradiated with the background light by using a self-attention conversion network;
the synthesis module is used for carrying out point multiplication operation on the reflectivity intrinsic image and the illumination intrinsic image to generate a harmonious image of the synthetic image.
2. The self-attention transform-based image harmony system of claim 1, wherein: the non-decoupling image harmonization module comprises a first encoder, a first serialization transformation module, a first self-attention transformation module, a first serialization inverse transformation module and a first decoder;
the first encoder is used for encoding the input composite image and the mask image into a feature space to obtain a feature image and inputting the feature image into the first serialization transformation module;
the first serialization transformation module carries out serialization transformation on the input characteristic image to generate an input token of the first self-attention transformation module;
the first self-attention transformation module is used for performing direct self-attention transformation on the input token generated by the first serialization transformation module to obtain an output token which is input into the first serialization inverse transformation module;
the first serialization inverse transformation module is used for carrying out serialization inverse transformation on the input output token to generate a harmony characteristic image;
the first decoder is configured to decode the harmonious feature image into a harmonious image corresponding to the composite image.
3. The self-attention transform-based image harmony system of claim 2, wherein: the reflectivity image generation module comprises a second encoder, a second serialization transformation module, a second attention transformation module, a second serialization inverse transformation module and a second decoder;
the second encoder is used for encoding the input composite image and the mask image into a feature space to obtain a feature image and inputting the feature image into the second serialization transformation module;
the second serialization transformation module carries out serialization transformation on the input characteristic image to generate an input token of the second self-attention transformation module;
the second self-attention transformation module is used for performing decoupled self-attention transformation on the input token generated by the second serialization transformation module to obtain a reflectivity image output token and inputting the reflectivity image output token into the second serialization inverse transformation module and the illumination transformation module;
the second serialization inverse transformation module is used for carrying out serialization inverse transformation on the input output token to generate a reflectivity intrinsic characteristic image;
the second decoder is configured to decode the reflectance intrinsic feature image into a reflectance intrinsic image corresponding to the composite image.
4. The self-attention transform-based image harmony system of claim 3, wherein: the background light decoupling module comprises a linear transformation module, a third self-attention transformation module and a fourth self-attention transformation module;
the linear transformation module is used for dividing an input background image into an image block sequence, then flattening each image block to serve as a token and encoding the token into a feature space through linear mapping to generate an input token of the third self-attention transformation module;
the third self-attention transformation module is used for performing self-attention transformation coding on the input token of the third self-attention transformation module to generate the input token of the fourth self-attention transformation module;
the fourth self-attention transformation module is used for performing self-attention transformation decoding on the input token of the fourth self-attention transformation module, generating a background light hidden vector coding token and inputting the background light hidden vector coding token into the illumination transformation module.
5. The self-attention transform-based image harmony system of claim 4, wherein: the illumination transformation module comprises a fifth self-attention transformation module, a third sequence inverse transformation module and a third decoder;
the fifth self-attention transformation module is used for performing self-attention transformation on the background light hidden vector encoding token and the reflectivity image output token to generate a corresponding illumination intrinsic image output token;
the third serialization inverse transformation module is used for carrying out serialization inverse transformation on the illumination intrinsic image output token to generate an illumination intrinsic characteristic image corresponding to the synthesized image;
and the third decoder is used for decoding the illumination intrinsic characteristic image and outputting an illumination intrinsic image corresponding to the synthesized image.
6. The self-attention transform-based image harmony system of claim 5, wherein: in the training process, both the non-decoupling image harmonization module and the decoupling image harmonization module adopt a single norm loss function to excite the harmonious image of the composite image to approximate its true image.
7. The self-attention transform-based image harmony system of claim 6, wherein:
the first encoder and the second encoder are both encoders of a CNN network, and the first decoder, the second decoder and the third decoder are all decoders of the CNN network.
8. The self-attention transform-based image harmony system of claim 7, wherein: the first self-attention transformation module, the second self-attention transformation module and the third self-attention transformation module all adopt encoders TREs of self-attention transformation networks, and the fourth self-attention transformation module and the fifth self-attention transformation module all adopt decoders TRDs of self-attention transformation networks;
TRE consists of a stack of structurally identical layers, each of which contains a sublayer with a multi-headed self-attention mechanism and a feed-forward network sublayer, TRE being intended to output a self-attention map based on modeling the dependencies between input tokens (image patches);
the TRD is also made up of a stack of multiple identically structured layers, where each layer, in addition to two sublayers identical to TRE, has a third encoder-decoder cross-attention sublayer that performs a multi-head attention operation on the TRE output and the TRD itself; the TRD is directed to learning a mapping from a source domain to a target domain, generating a feature matrix associated with a task.
9. The self-attention transform-based image harmony system of claim 8, wherein: the first self-attention transformation module, the second self-attention transformation module, the third self-attention transformation module, the fourth self-attention transformation module and the fifth self-attention transformation module all adopt 2 attention heads and 9 attention layers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111067167.6A CN113689328B (en) | 2021-09-13 | 2021-09-13 | Image harmony system based on self-attention transformation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111067167.6A CN113689328B (en) | 2021-09-13 | 2021-09-13 | Image harmony system based on self-attention transformation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113689328A true CN113689328A (en) | 2021-11-23 |
CN113689328B CN113689328B (en) | 2024-06-04 |
Family
ID=78586147
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111067167.6A Active CN113689328B (en) | 2021-09-13 | 2021-09-13 | Image harmony system based on self-attention transformation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113689328B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115713680A (en) * | 2022-11-18 | 2023-02-24 | 山东省人工智能研究院 | Semantic guidance-based face image identity synthesis method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111523534A (en) * | 2020-03-31 | 2020-08-11 | 华东师范大学 | Image description method |
CN113076809A (en) * | 2021-03-10 | 2021-07-06 | 青岛海纳云科技控股有限公司 | High-altitude falling object detection method based on visual Transformer |
CN113192055A (en) * | 2021-05-20 | 2021-07-30 | 中国海洋大学 | Harmonious method and model for synthesizing image |
CN113269792A (en) * | 2021-05-07 | 2021-08-17 | 上海交通大学 | Image post-harmony processing method, system and terminal |
CN113344807A (en) * | 2021-05-26 | 2021-09-03 | 商汤集团有限公司 | Image restoration method and device, electronic equipment and storage medium |
-
2021
- 2021-09-13 CN CN202111067167.6A patent/CN113689328B/en active Active
Non-Patent Citations (5)
Title |
---|
HANTING CHEN et al.: "Pre-Trained Image Processing Transformer", arXiv preprint, pages 1-15 *
LAN Hong; LIU Qinyi: "Scene graph-to-image generation model with graph attention networks", Journal of Image and Graphics, no. 08, 12 August 2020 *
WANG Junhao; LUO Yifeng: "Enriching image captions via fine-grained semantic features and Transformer", Journal of East China Normal University (Natural Science), no. 05, 25 September 2020 *
Similar Documents
Publication | Title |
---|---|
CN111539887B (en) | Channel attention mechanism and layered learning neural network image defogging method based on mixed convolution |
Chen et al. | DARGS: Image inpainting algorithm via deep attention residuals group and semantics |
Ning et al. | Accurate and lightweight image super-resolution with model-guided deep unfolding network |
CN109376830A (en) | Two-dimensional code generation method and device |
CN113934890B (en) | Method and system for automatically generating scene video by characters |
CN111915693A (en) | Sketch-based face image generation method and system |
Xin et al. | Residual attribute attention network for face image super-resolution |
CN113192055B (en) | Harmonious method and model for synthesizing image |
CN111986105B (en) | Video time sequence consistency enhancing method based on time domain denoising mask |
CN117597703A (en) | Multi-scale converter for image analysis |
KR20240065281A (en) | Vector-quantized image modeling |
CN115457043A (en) | Image segmentation network based on overlapped self-attention deformer framework U-shaped network |
CN113689328B (en) | Image harmony system based on self-attention transformation |
Esmaeilzehi et al. | SRNHARB: A deep light-weight image super resolution network using hybrid activation residual blocks |
CN113537246A (en) | Gray level image simultaneous coloring and hyper-parting method based on counterstudy |
CN117474800A (en) | Image defogging method of full convolution decoder based on channel converter |
CN112686830A (en) | Super-resolution method of single depth map based on image decomposition |
CN113781376B (en) | High-definition face attribute editing method based on divide-and-congress |
Yang et al. | Deep 3d modeling of human bodies from freehand sketching |
CN115660979A (en) | Attention mechanism-based double-discriminator image restoration method |
CN113780209A (en) | Human face attribute editing method based on attention mechanism |
Ni et al. | Natural Image Reconstruction from fMRI Based on Self-supervised Representation Learning and Latent Diffusion Model |
Wen et al. | Mrft: Multiscale recurrent fusion transformer based prior knowledge for bit-depth enhancement |
Chang et al. | 3D hand reconstruction with both shape and appearance from an RGB image |
Peng | Efficient Neural Light Fields (ENeLF) for Mobile Devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||