CN113689328A - Image harmonization system based on self-attention transformation - Google Patents

Image harmonization system based on self-attention transformation

Info

Publication number
CN113689328A
CN113689328A
Authority
CN
China
Prior art keywords
image
self
attention
module
transformation module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111067167.6A
Other languages
Chinese (zh)
Other versions
CN113689328B (en)
Inventor
郭宗辉
郑海永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China
Priority to CN202111067167.6A
Publication of CN113689328A
Application granted
Publication of CN113689328B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to the technical field of image processing, and in particular discloses two image harmonization systems based on self-attention transformation, one non-decoupled and one decoupled, both exploiting the strong long-range context modeling capability of a self-attention transformation network. The non-decoupled image harmonization module uses the self-attention transformation network to fully mine the relationship between the foreground and the background in the feature space of the composite image, thereby guiding the harmonization of the composite image. Alternatively, the decoupled image harmonization module uses a self-attention transformation encoder and decoder to decouple a latent vector encoding of the background light from the background image, fuses this background light encoding with the reflectivity feature image through another self-attention transformation decoder to generate an illumination intrinsic image, and finally multiplies the reflectivity intrinsic image by the illumination intrinsic image to obtain the harmonized image. In this way the foreground illumination is adjusted to be compatible with the background illumination while the semantics and structure of the composite image remain unchanged, which solves the problem of disharmony between the foreground and background of the composite image.

Description

Image harmonization system based on self-attention transformation
Technical Field
The invention relates to the technical field of image processing, and in particular to an image harmonization system based on self-attention transformation.
Background
Combining arbitrary regions of different images into a visually plausible composite image is a basic task in many applications of computer vision and graphics, such as image synthesis, image stitching, image editing and scene synthesis, and image composition is a common operation in everyday life. However, a composite image obtained by copying and pasting a partial region of one image (referred to as the foreground of the composite image) into another image (referred to as the background of the composite image) will inevitably suffer from disharmony between the foreground and the background, because the foreground region and the background region (the region of the composite image other than the foreground) were captured under different imaging conditions (such as day and night, sunny and cloudy weather, indoors and outdoors). Therefore, how to make the composite image look realistic by simple and efficient means, i.e., how to harmonize the image, is an important and challenging task.
Traditional image harmonization methods focus on better matching techniques, ensuring appearance consistency between foreground and background by transferring statistical information such as color and texture. Recently, deep harmonization models and large-scale datasets have been developed to address this challenging task and have achieved good results. Current deep learning models mainly adopt an encoder-decoder Convolutional Neural Network (CNN) architecture, which first uses the encoder to learn the color information of the background appearance near the foreground region, then captures the context of the composite image to adjust the appearance or illumination of the foreground region to be consistent with the background, and finally uses the decoder to reconstruct the harmonized image.
In fact, the commonly used encoder-decoder convolutional neural network architecture accomplishes the image harmonization task in two stages. The first stage mainly adjusts the color of the foreground region in a multi-layer feature space, based on the color statistics of the background region of the composite image, to make it compatible with the background color; the second stage mainly reconstructs the original structure and semantic information of the image, together with harmonized low-level visual features, from the high-dimensional feature space. However, the inductive bias arising from the local sensitivity of CNNs dictates that a convolutional neural network can only attend to locally limited information, so a shallow CNN can only capture the background context near the foreground and lacks the global background context. Yet the consistency of the overall harmony of the image is a key element in evaluating the visual realism of a composite image, and a CNN may not be able to fully exploit the global background information to adjust the foreground color so that it is consistent with the overall background color.
In addition, previous methods adopt a U-Net-style multi-layer CNN structure with successive encoding and decoding. Although U-Net can enlarge the receptive field by stacking multiple CNN layers and thereby capture the global context of the image, the skip connections from the encoder to the decoder may reintroduce the original disharmonious information of the composite image into the reconstructed image, degrading the performance of the image harmonization model.
Disclosure of Invention
The invention provides an image harmonization system based on self-attention transformation, which solves the following technical problem: in the image harmonization process, capturing both the background context near the foreground and the global context of the whole image without reintroducing disharmonious information, so as to resolve the disharmony between the foreground and the background of the composite image to the greatest extent.
In order to solve the above technical problem, the invention provides an image harmonization system based on self-attention transformation, which comprises a non-decoupled image harmonization module or a decoupled image harmonization module;
the non-decoupled image harmonization module is used for performing a direct self-attention transformation on the input composite image and mask image by using a self-attention transformation network to generate a corresponding harmonized image;
the decoupled image harmonization module comprises a reflectivity image generation module, a background light decoupling module, an illumination transformation module and a synthesis module;
the reflectivity image generation module is used for performing a decoupled self-attention transformation on the input composite image and mask image to generate a reflectivity intrinsic image of the composite image;
the background light decoupling module is used for decoupling the background light from the background image of the composite image by using a self-attention transformation network, so that the background light can be applied to the reflectivity intrinsic image;
the illumination transformation module is used for further generating an illumination intrinsic image, by using a self-attention transformation network, from the reflectivity intrinsic image illuminated with the background light;
the synthesis module is used for performing a point multiplication operation on the reflectivity intrinsic image and the illumination intrinsic image to generate a harmonized image of the composite image.
Specifically, the non-decoupled image harmonization module includes a first encoder, a first serialization transformation module, a first self-attention transformation module, a first serialization inverse transformation module, and a first decoder;
the first encoder is used for encoding the input composite image and the mask image into a feature space to obtain a feature image, which is input into the first serialization transformation module;
the first serialization transformation module performs a serialization transformation on the input feature image to generate the input tokens of the first self-attention transformation module;
the first self-attention transformation module is used for performing a direct self-attention transformation on the input tokens generated by the first serialization transformation module to obtain output tokens, which are input into the first serialization inverse transformation module;
the first serialization inverse transformation module is used for performing an inverse serialization transformation on the output tokens to generate a harmonized feature image;
the first decoder is configured to decode the harmonized feature image into the harmonized image corresponding to the composite image.
Specifically, the reflectivity image generation module includes a second encoder, a second serialization transformation module, a second self-attention transformation module, a second serialization inverse transformation module, and a second decoder;
the second encoder is used for encoding the input composite image and the mask image into a feature space to obtain a feature image and inputting the feature image into the second serialization transformation module;
the second serialization transformation module carries out serialization transformation on the input characteristic image to generate an input token of the second self-attention transformation module;
the second self-attention transformation module is used for performing decoupled self-attention transformation on the input token generated by the second serialization transformation module to obtain a reflectivity image output token and inputting the reflectivity image output token into the second serialization inverse transformation module and the illumination transformation module;
the second serialization inverse transformation module is used for carrying out serialization inverse transformation on the input output token to generate a reflectivity intrinsic characteristic image;
the second decoder is configured to decode the reflectance intrinsic feature image into a reflectance intrinsic image corresponding to the composite image.
Specifically, the background light decoupling module includes a linear transformation module, a third self-attention transformation module, and a fourth self-attention transformation module;
the linear transformation module is used for dividing an input background image into an image block sequence, then flattening each image block to serve as a token and encoding the token into a feature space through linear mapping to generate an input token of the third self-attention transformation module;
the third self-attention transformation module is used for performing self-attention transformation coding on the input token of the third self-attention transformation module to generate the input token of the fourth self-attention transformation module;
the fourth self-attention transformation module is used for performing self-attention transformation decoding on the input token of the fourth self-attention transformation module, generating a background light hidden vector coding token and inputting the background light hidden vector coding token into the illumination transformation module.
Specifically, the illumination transformation module includes a fifth self-attention transformation module, a third serialization inverse transformation module, and a third decoder;
the fifth self-attention transformation module is used for performing self-attention transformation on the background light hidden vector encoding token and the reflectivity image output token to generate a corresponding illumination intrinsic image output token;
the third serialization inverse transformation module is used for carrying out serialization inverse transformation on the illumination intrinsic image output token to generate an illumination intrinsic characteristic image corresponding to the synthesized image;
and the third decoder is used for decoding the illumination intrinsic characteristic image and outputting an illumination intrinsic image corresponding to the synthesized image.
Specifically, in the training process, both the non-decoupled image harmonization module and the decoupled image harmonization module use a single ℓ1 loss function to encourage the harmonized image of the composite image to approximate its real image.
Specifically, the first encoder and the second encoder both employ CNN encoders, and the first decoder, the second decoder and the third decoder all employ CNN decoders.
Specifically, the first self-attention transformation module, the second self-attention transformation module and the third self-attention transformation module all employ the encoder TRE of a self-attention transformation network, while the fourth self-attention transformation module and the fifth self-attention transformation module both employ the decoder TRD of the self-attention transformation network;
TRE consists of a stack of structurally identical layers, each containing a multi-head self-attention sublayer and a feed-forward network sublayer; TRE is intended to output a self-attention map based on modeling the dependencies between the input tokens (image blocks);
TRD is also made up of a stack of structurally identical layers, where each layer, in addition to the two sublayers of TRE, has a third encoder-decoder cross-attention sublayer that performs a multi-head attention operation between the TRE output and the TRD input; TRD is directed at learning a mapping from a source domain to a target domain and generating a task-related feature matrix.
Specifically, the first self-attention transformation module, the second self-attention transformation module, the third self-attention transformation module, the fourth self-attention transformation module and the fifth self-attention transformation module all use 2 attention heads and 9 attention layers.
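By way of illustration, the TRE and TRD structures with 2 attention heads and 9 attention layers described above can be sketched with standard Transformer layers; the following minimal sketch assumes PyTorch, and the feed-forward width and the example tensor sizes are illustrative assumptions rather than details specified by this embodiment.

# Minimal sketch of the TRE / TRD stacks (PyTorch, batch-first tensors).
# d_model = 256 follows the feature dimension used in the embodiments below;
# dim_feedforward and the example sequence lengths are assumptions.
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 256, 2, 9

# TRE: a stack of identical layers, each with multi-head self-attention + feed-forward.
tre = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True),
    num_layers=n_layers)

# TRD: the same two sublayers plus an encoder-decoder cross-attention sublayer.
trd = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, 4 * d_model, batch_first=True),
    num_layers=n_layers)

tokens = torch.randn(1, 64 * 64, d_model)   # e.g. serialized feature-image pixels
memory = tre(tokens)                         # self-attention over the input tokens
query = torch.randn(1, 1, d_model)           # e.g. a task-specific query token
out = trd(query, memory)                     # cross-attention between query and TRE output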
The image harmonization system based on self-attention transformation provided by the embodiments of the invention exploits the strong long-range context modeling capability of the self-attention transformation network. It can directly generate harmonized images in the non-decoupled mode, with a simple structure and a good harmonization effect. To achieve an even better harmonization effect, the decoupled mode can be adopted: the reflectivity intrinsic image of the composite image is obtained through the reflectivity image generation module, and the illumination intrinsic image of the composite image is obtained through the background light decoupling module and the illumination transformation module, so that the harmonized image is obtained by combining the reflectivity intrinsic image and the illumination intrinsic image. In this way the foreground illumination is adjusted to be compatible with the background illumination while the semantics and structure of the composite image remain unchanged, which solves the problem of disharmony between the foreground and background of the composite image. The experimental results show that the system achieves state-of-the-art performance on the image harmonization task.
Drawings
FIG. 1 is a block diagram of the input framework for performing image vision tasks with a Transformer according to an embodiment of the present invention;
FIG. 2 is a block diagram of the non-decoupled image harmonization module (HT model) according to an embodiment of the present invention;
FIG. 3 is a detailed block diagram of the non-decoupled image harmonization module (HT model) according to an embodiment of the present invention;
FIG. 4 is a block diagram of the decoupled image harmonization module (D-HT model) provided by an embodiment of the present invention;
FIG. 5 is a refined framework diagram of the model shown in FIG. 4 provided by an embodiment of the present invention;
FIG. 6 is a detailed block diagram of the model shown in FIG. 5 provided by an embodiment of the present invention;
FIG. 7 is an illustration of the visual effect of various image harmonization methods on the four sub-datasets and the full iHarmony4 dataset, provided by an embodiment of the present invention;
FIG. 8 is a comparison of image harmonization visual effects using a normal mask (middle row) and an inverted mask (bottom row), provided by an embodiment of the present invention;
FIG. 9 is an illustration of output images under different lighting conditions provided by an embodiment of the present invention;
FIG. 10 is an illustration of output images with different illumination obtained by modifying the light latent vector encoding (L_t) of a target image according to an embodiment of the present invention;
FIG. 11 is an illustration of the visual effect of various image harmonization methods on real composite images according to an embodiment of the present invention;
FIG. 12 is an illustration of the visual effect of the image completion method on the Paris StreetView dataset according to an embodiment of the present invention;
FIG. 13 is an illustration of the visual effect of the image enhancement method on the MIT-Adobe-5K-UPE dataset according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The embodiments and drawings are given solely for the purpose of illustration and are not to be construed as limitations of the invention, since many variations thereof are possible without departing from the spirit and scope of the invention.
Self-attention transformation networks (Transformers) benefit from an elaborate self-attention mechanism designed to capture long-range context, and the Transformer, as a new neural network structure, is rapidly gaining wide attention in both research and industry. The Transformer was first applied to Natural Language Processing (NLP) tasks in place of RNNs and LSTMs, and achieved remarkable results on many NLP tasks. Nowadays, benefiting from the powerful feature representation capability of Transformers, researchers are applying them to various computer vision tasks, such as object detection, image recognition and image processing.
Self-attention transformation networks (Transformers) were first applied to sequential data processing tasks in natural language, such as machine translation; they do not rely on recurrence but on a self-attention mechanism to describe global dependencies between inputs and outputs. Thus, to use a Transformer for computer vision tasks, a 2D image must be represented as 1D sequence data whose elements or encodings are treated as tokens (like words in NLP), and this serialized data is taken as the Transformer input. In practice, image blocks can be used as tokens to avoid the extremely long sequences that would result from using pixels as tokens. In this work, this example first analyzes the impact of different token numbers and different embedding types on Transformer performance for image harmonization. For the number of tokens, different step sizes are used when splitting the image into image blocks. For the encoding mode, two projection modes are used: linear (FC or CONV) and non-linear (MLP or CNN containing a non-linear activation function). Experiments show that the Transformer is sensitive to the number of tokens and insensitive to the encoding type. The image input method is shown in FIG. 1.
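For illustration of the tokenization choices discussed above (a step size controlling the token number, linear versus non-linear embedding), a minimal sketch is given below; it assumes PyTorch, and the patch size, stride values and embedding dimension are illustrative, not values prescribed by this embodiment.

# Sketch: turning a 2D image into a 1D token sequence for a Transformer.
# The stride (step size) controls the number of tokens; the projection can be
# linear (a single FC) or non-linear (an MLP). All sizes here are illustrative.
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    def __init__(self, patch=8, stride=8, dim=256, nonlinear=False):
        super().__init__()
        flat = patch * patch * 3
        self.unfold = nn.Unfold(kernel_size=patch, stride=stride)
        self.embed = (nn.Sequential(nn.Linear(flat, dim), nn.ReLU(), nn.Linear(dim, dim))
                      if nonlinear else nn.Linear(flat, dim))

    def forward(self, img):                          # img: (B, 3, H, W)
        patches = self.unfold(img).transpose(1, 2)   # (B, N_tokens, patch*patch*3)
        return self.embed(patches)                   # (B, N_tokens, dim)

img = torch.randn(2, 3, 256, 256)
print(PatchTokenizer(stride=8)(img).shape)    # smaller step size, more tokens
print(PatchTokenizer(stride=16)(img).shape)   # larger step size, fewer tokens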
The self-attention transformation network (Transformer) comprises an encoder TRE(·) for capturing token relationships and a decoder TRD(·) for generating the task output. TRE consists of a stack of structurally identical layers, where each layer contains one multi-head self-attention sublayer and one feed-forward network sublayer. TRD is also made up of a stack of structurally identical layers, where each layer has, in addition to the two sublayers of TRE, a third encoder-decoder cross-attention sublayer that performs a multi-head attention operation between the TRE output and the TRD input. Thus, TRE exploits the self-attention mechanism to explore the relationships among its input vectors, while TRD performs cross-attention to find the correlation between its own input and the TRE output. For visual tasks with images as input, TRE aims to output a self-attention map by modeling the dependencies between input tokens (image blocks), while TRD aims to learn the mapping from the source domain (TRE input) to the target domain (TRD input/output) and to generate a task-related feature matrix. This work explores the capability of TRE and TRD on the image harmonization task, as well as the influence of different numbers of attention heads and layers on Transformer performance, aiming to exploit the powerful long-range context modeling capability of the Transformer so that the global background context can be fully used for image harmonization.

To solve the problem of inconsistent color appearance caused by different illumination conditions between the foreground and background of a composite image, this example provides an image harmonization system based on self-attention transformation. It first designs a simple non-decoupled self-attention-transformation image harmonization framework (HT), i.e., the non-decoupled image harmonization module, which introduces a self-attention transformation network (Transformer) between a very basic convolutional (CNN) encoder and decoder and performs a direct self-attention transformation on the input composite image and mask image to generate the corresponding harmonized image.
As shown in FIG. 2 and FIG. 3, the non-decoupled image harmonization module includes a first encoder E (a CNN encoder), a first serialization transformation module R, a first self-attention transformation module TRE (the encoder of a self-attention transformation network), a first serialization inverse transformation module R', and a first decoder D (a CNN decoder).

The first encoder E encodes the input composite image Ĩ and the mask image M into a feature space to obtain a feature image, which is input into the first serialization transformation module R. The first serialization transformation module R performs a serialization transformation on the input feature image to generate the input tokens of the first self-attention transformation module TRE. The first self-attention transformation module TRE performs a direct self-attention transformation on the input tokens generated by the first serialization transformation module R to obtain output tokens, which are input into the first serialization inverse transformation module R'. The first serialization inverse transformation module R' performs an inverse serialization transformation on the output tokens to generate a harmonized feature image. The first decoder D decodes the harmonized feature image into the harmonized image Ĥ corresponding to the composite image Ĩ.
The CNN encoder E aims to encode the composite image into a compact feature space, with the pixels of the feature map serving as the Transformer input, while the CNN decoder D aims to reconstruct the Transformer output into a harmonized image corresponding to the input image. This design in fact inserts the Transformer between the CNN encoder and decoder under a basic encoder-decoder architecture, and is therefore a relatively fair comparison with current mainstream image harmonization methods. Furthermore, for a low-level visual task in which much information (semantics, structure, etc.) is unchanged between the input and output images, the cross-attention module and the self-attention module in TRD can be regarded as playing similar roles, so this example uses only TRE in the HT framework.
For the image harmonization task, given a composite image Ĩ with a corresponding foreground mask image M, the goal is to generate, as output, a harmonized image Ĥ with compatible foreground and background; that is, Ĥ should be as close as possible to the real image H. In particular, the CNN encoder E(·) generates a lower-resolution feature image F ∈ R^(h×w×c), where h and w are the reduced height and width of the feature image, c = 256, and H, W denote the height and width of the composite image, respectively. The pixels of the feature image F (corresponding to image blocks of the input image) are then serialized into F' ∈ R^(hw×c), which serves as the input tokens of TRE; each input token encoding is the feature value of each channel at each pixel. In addition, similarly to the way the original Transformer is used in NLP tasks, this example obtains the position code E of each token from the actual coordinates of each pixel in the feature image F using fixed sine-cosine position coding, and uses this position code E as the token position input of TRE. Further, the sequence data output by TRE is inversely transformed, according to the original position coordinates, into a feature image of the same size as F and input to the CNN decoder D(·) to finally generate the harmonized image Ĥ. This example formulates this non-decoupled self-attention-transformation image harmonization model as:

Ĥ = D(φ'(TRE(φ(E(Ĩ, M)))))

where φ and φ' denote the serialization transform and inverse serialization transform operations, respectively.
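A minimal PyTorch sketch of this non-decoupled HT data flow is given below; the convolutional encoder E and decoder D are single-layer placeholders and the sine-cosine position code is omitted, so the sketch only illustrates the formulation above rather than the exact networks of this embodiment.

# Sketch of the non-decoupled HT pipeline: CNN encoder E -> serialization phi ->
# TRE -> inverse serialization phi' -> CNN decoder D. The conv stacks are
# placeholders; the fixed sine-cosine position code described above is omitted.
import torch
import torch.nn as nn

class HT(nn.Module):
    def __init__(self, dim=256, heads=2, layers=9):
        super().__init__()
        self.E = nn.Sequential(nn.Conv2d(4, dim, 4, stride=4), nn.ReLU())   # composite + mask -> features
        self.tre = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True), layers)
        self.D = nn.Sequential(nn.ConvTranspose2d(dim, 3, 4, stride=4), nn.Tanh())

    def forward(self, comp, mask):
        f = self.E(torch.cat([comp, mask], dim=1))       # (B, dim, h, w)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)            # phi: (B, h*w, dim)
        tokens = self.tre(tokens)                        # self-attention over all feature pixels
        f = tokens.transpose(1, 2).reshape(b, c, h, w)   # phi': back to a feature image
        return self.D(f)                                 # harmonized image

comp, mask = torch.randn(1, 3, 256, 256), torch.ones(1, 1, 256, 256)
print(HT()(comp, mask).shape)                            # torch.Size([1, 3, 256, 256])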
Importantly, this example uses only a single ℓ1 loss function, ‖Ĥ − H‖1, to encourage the harmonized image Ĥ to approximate the real image H.
Adjusting the foreground illumination according to the background illumination is the key to solving the image disharmony problem. In addition, the diffuse reflection model based on intrinsic images and Retinex theory assumes that the light intensity values of an image actually encode all the features of the corresponding scene points. Therefore, this embodiment also decomposes the composite image into a reflectivity intrinsic image and an illumination intrinsic image, and uses the Transformer to capture the background light and apply it to the reflectivity intrinsic image so as to harmonize the illumination intrinsic image. The decoupled image harmonization module of the image harmonization system based on self-attention transformation according to this embodiment, as shown in FIG. 4, includes a reflectivity image generation module, a background light decoupling module, an illumination transformation module and a synthesis module. Wherein:
the reflectivity image generation module is used for synthesizing the input image
Figure BDA0003258832480000097
Self-injection decoupled from mask image MPerforming a semantic transformation to generate a composite image
Figure BDA0003258832480000098
Reflectivity eigen image of
Figure BDA0003258832480000099
The background light decoupling module is used for synthesizing images by utilizing a self-attention transformation network
Figure BDA00032588324800000910
Background image of
Figure BDA00032588324800000911
(from the composite image
Figure BDA00032588324800000912
Obtaining background image by removing foreground region) in the image processing systembgTo illuminate the reflectivity eigen image
Figure BDA00032588324800000913
The above. The illumination conversion module is used for irradiating background light lbgReflectivity eigen image of
Figure BDA00032588324800000914
Further generation of illumination intrinsic images using self-attention transform networks
Figure BDA00032588324800000915
The synthesis module is used for synthesizing the intrinsic image of the reflectivity
Figure BDA00032588324800000916
And illuminating the intrinsic image
Figure BDA0003258832480000101
Performing dot product operation to generate a composite image
Figure BDA0003258832480000102
To harmonize the image, this process can be formulated as:
Figure BDA0003258832480000103
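A minimal sketch of how these four modules compose, assuming the three branches are implemented as callables (detailed in the following paragraphs), is shown below; the tensor shapes of the stand-ins are illustrative assumptions.

# Composition sketch of the decoupled D-HT model: reflectivity branch, background
# light decoupling, illumination generation, then a pixel-wise product.
import torch

def harmonize_decoupled(comp, mask, reflectivity_branch, light_branch, illumination_branch):
    """comp: (B,3,H,W) composite image; mask: (B,1,H,W) foreground mask."""
    background = comp * (1.0 - mask)                       # remove the foreground region
    R_hat, r_tokens = reflectivity_branch(comp, mask)      # reflectivity intrinsic image + its tokens
    l_bg = light_branch(background)                        # background light latent encoding
    L_hat = illumination_branch(l_bg, r_tokens)            # illumination intrinsic image
    return R_hat * L_hat                                   # harmonized image = R_hat (.) L_hat

# Trivial stand-ins, only to show the data flow and shapes:
B, H, W = 1, 256, 256
stub_refl = lambda c, m: (torch.rand(B, 3, H, W), torch.rand(B, 64 * 64, 256))
stub_light = lambda bg: torch.rand(B, 1, 27)
stub_illum = lambda l, t: torch.rand(B, 3, H, W)
comp, mask = torch.rand(B, 3, H, W), torch.ones(B, 1, H, W)
print(harmonize_decoupled(comp, mask, stub_refl, stub_light, stub_illum).shape)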
based on this, in particular, as shown in fig. 5 and 6, the reflectance image generation module includes a second encoder ER(encoder using CNN network), second serialization transformation module R1A second self-attention transformation module TRER(encoder employing self-attention transform network), second sequence inverse transform module R'1A second decoder DR(encoder using CNN network). Wherein:
first encoder ERFor synthesizing images to be input
Figure BDA0003258832480000104
Coding the mask image M to a feature space to obtain a feature image, and inputting the feature image into a second serialization transformation module R1This process can be formulated as:
Figure BDA0003258832480000105
specifically, CNN encoder ERGenerating a lower resolution feature image F ∈ Rh×w×cWherein, in the step (A),
Figure BDA0003258832480000106
Figure BDA0003258832480000107
and c 256 and H, W denote composite images, respectively
Figure BDA0003258832480000108
High and wide.
Second serialized transformation module R1For feature images F e Rh×w×cThe process of carrying out serialization transformation to generate a plurality of tokens and carrying out position coding on the tokens to obtain an input token can be formulated as follows:
Figure BDA0003258832480000109
specifically, the pixels of the feature image F (corresponding to the input image) are combinedImage block) serialization as F' e Rhw×cIt is used as the second self-attention transformation module TRERAnd the input token code is a characteristic value of each channel in each pixel. In addition, the position code E of each token is obtained according to the actual coordinates of each pixel in the feature image F in a sine and cosine fixed position coding moderThis is taken as TRERToken location input.
Second self-attention transform coder TRERThe process for performing self-attention decoupling transform coding on the input token to generate the corresponding reflectivity image output token can be formulated as follows:
Figure BDA00032588324800001010
second sequenced inverse transformation Module R'1Module R for performing second serialization transformation on reflectivity image output token1And reversely transforming to obtain the intrinsic reflectivity characteristic image with the same size as the characteristic image F, wherein the process can be formulated as:
Figure BDA00032588324800001011
second decoder DRUsed for decoding, outputting and synthesizing the reflectivity intrinsic characteristic image
Figure BDA00032588324800001012
Equal magnitude reflectivity eigen images
Figure BDA00032588324800001013
This process can be formulated as:
Figure BDA00032588324800001014
Figure BDA00032588324800001015
therefore, the whole process of the reflectivity image generation module can be formulated as:
Figure BDA0003258832480000111
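The fixed sine-cosine position code E_r used above can be sketched as follows; the exact frequency schedule is not specified in this embodiment, so the standard schedule of the original Transformer is assumed.

# Sketch of a fixed sine/cosine position code for the serialized feature image,
# built from each pixel's (y, x) coordinates. The frequency schedule follows the
# original Transformer and is an assumption here.
import math
import torch

def pixel_position_encoding(h, w, dim=256):
    half = dim // 2                                    # half of the channels for y, half for x
    freqs = torch.exp(torch.arange(0, half, 2) * (-math.log(10000.0) / half))
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")

    def enc(coord):                                    # (h, w) coordinates -> (h*w, half) code
        angles = coord.reshape(-1, 1) * freqs
        return torch.cat([angles.sin(), angles.cos()], dim=1)

    return torch.cat([enc(ys), enc(xs)], dim=1)        # (h*w, dim), one code per serialized pixel

E_r = pixel_position_encoding(64, 64, dim=256)
print(E_r.shape)                                       # torch.Size([4096, 256])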
specifically, as shown in fig. 5 and 6, the backlight decoupling module includes a linear transformation module LP and a third self-attention transformation module TREL(encoder using a self-attention transform network), fourth self-attention transform module TRDL(decoder employing a self-attention transform network). Wherein:
the linear transformation module LP is used to transform the background image
Figure BDA0003258832480000112
(the number of channels C is 3, H, Y represents the height and width of the image, respectively, and the background image is as wide as and as high as the synthesized image) into a sequence of image blocks
Figure BDA0003258832480000113
(number of image blocks)
Figure BDA0003258832480000114
Size P of tiles 8), then each tile is flattened as a token and coded into C' -256 dimensional feature space by linear mapping LP (·), and fixed position coding Ep(the position coordinates of the image block in the original image are obtained by sine and cosine coding) is added into the token code to obtain a third self-attention transformation module TRELThe process can be formulated as:
Figure BDA0003258832480000115
third self-attention transform module TRELThe process of self-attention transformation coding the input token to generate the input token of the fourth self-attention transformation module can be formulated as follows:
Figure BDA0003258832480000116
fourth self-attention transform module TRDLFor performing self-attention transform decoding on its input token to generate corresponding backThe scene light hidden vector code token is input into the illumination transformation module, and the process can be formulated as follows:
Figure BDA0003258832480000117
Figure BDA0003258832480000118
optically encoded token sequence here
Figure BDA0003258832480000119
(dlSpherical harmonic coefficient of 27 dimensions) is TRDLInitial input of, ElIndicating a learnable light position code initial value, the initial value of the light code token being zero.
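A minimal PyTorch sketch of this background light decoupling branch is given below; the learnable (rather than fixed) patch position code and the final linear head that projects the decoded token to a 27-dimensional light code are assumptions made for brevity, not details stated by this embodiment.

# Sketch of the background light branch: flatten 8x8 patches of the background
# image, embed them linearly (LP), encode with TRE_L, then decode a single
# zero-initialized light token with TRD_L. The 256 -> 27 head is an assumption.
import torch
import torch.nn as nn

class BackgroundLight(nn.Module):
    def __init__(self, patch=8, dim=256, d_light=27, heads=2, layers=9, img_size=256):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size=patch, stride=patch)
        self.lp = nn.Linear(patch * patch * 3, dim)                             # LP(.)
        self.pos = nn.Parameter(torch.zeros(1, (img_size // patch) ** 2, dim))  # E_p (learnable here)
        self.tre_l = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True), layers)
        self.trd_l = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, 4 * dim, batch_first=True), layers)
        self.light_pos = nn.Parameter(torch.zeros(1, 1, dim))                   # learnable light position code E_l
        self.head = nn.Linear(dim, d_light)                                     # assumed projection to l_bg

    def forward(self, bg):                                             # bg: (B, 3, 256, 256)
        tokens = self.lp(self.unfold(bg).transpose(1, 2)) + self.pos   # patch tokens + E_p
        memory = self.tre_l(tokens)
        query = torch.zeros(bg.size(0), 1, memory.size(-1), device=bg.device) + self.light_pos
        return self.head(self.trd_l(query, memory))                    # l_bg: (B, 1, 27)

print(BackgroundLight()(torch.randn(1, 3, 256, 256)).shape)            # torch.Size([1, 1, 27])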
As shown in FIG. 5 and FIG. 6, the illumination transformation module includes a fifth self-attention transformation module TRD_I (the decoder of a self-attention transformation network), a third serialization inverse transformation module R'_2, and a third decoder D_I (a CNN decoder).

The fifth self-attention transformation module TRD_I performs self-attention transformation decoding on the background light latent vector encoding token and the reflectivity image output tokens to generate the corresponding illumination intrinsic image output tokens; this process can be formulated as TRD_I(t_l + E_l, t_r + E_r), where t_l and E_l respectively denote the learned light encoding token sequence (i.e., the background light latent vector encoding output by TRD_L) and the light position code, while t_r denotes the reflectivity intrinsic image token sequence with position code E_r.

The third serialization inverse transformation module R'_2 applies the inverse of the second serialization transformation module R_1 to the illumination intrinsic image output tokens TRD_I(t_l + E_l, t_r + E_r) to obtain an illumination intrinsic feature image of the same size as the feature image; this process can be formulated as φ'(TRD_I(t_l + E_l, t_r + E_r)).

The third decoder D_I decodes the illumination intrinsic feature image and outputs the illumination intrinsic image L̂, of the same size as the composite image Ĩ:

L̂ = D_I(φ'(TRD_I(t_l + E_l, t_r + E_r)))
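A minimal sketch of the illumination branch is given below; it assumes that the reflectivity tokens t_r serve as the decoder query sequence and that the light code, lifted to the token dimension by an assumed 27 to 256 linear projection, serves as the memory, so that the output keeps one token per spatial location.

# Sketch of the illumination branch: TRD_I cross-attends the reflectivity tokens
# t_r with the background light code l_bg, then the tokens are inverse-serialized
# and decoded by a CNN decoder D_I (tanh output, as in the implementation details).
import torch
import torch.nn as nn

class IlluminationBranch(nn.Module):
    def __init__(self, dim=256, d_light=27, heads=2, layers=9):
        super().__init__()
        self.light_proj = nn.Linear(d_light, dim)                 # assumed: lift l_bg to token dim
        self.trd_i = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, 4 * dim, batch_first=True), layers)
        self.D_I = nn.Sequential(nn.ConvTranspose2d(dim, 3, 4, stride=4), nn.Tanh())

    def forward(self, l_bg, r_tokens, h=64, w=64):                # r_tokens: (B, h*w, dim)
        memory = self.light_proj(l_bg)                            # (B, 1, dim)
        tokens = self.trd_i(r_tokens, memory)                     # cross-attention with the light code
        f = tokens.transpose(1, 2).reshape(-1, tokens.size(-1), h, w)  # phi'
        return self.D_I(f)                                        # illumination intrinsic image

L_hat = IlluminationBranch()(torch.randn(1, 1, 27), torch.randn(1, 64 * 64, 256))
print(L_hat.shape)                                                # torch.Size([1, 3, 256, 256])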
it should also be noted that during the training of the system, the same applies to the single one
Figure BDA0003258832480000124
Norm loss function to excite composite image
Figure BDA0003258832480000125
Of (2) harmonious images
Figure BDA0003258832480000126
Approximating its real image H:
Figure BDA0003258832480000127
in general, the decoupled image harmonization module shown in this example uses two transform encoders and two transform decoders, where the encoder TRERUsing the CNN-encoded tokens of the image block as input and generating a reflectivity eigen-image, encoder TRELTaking FC coded token of image block as input and combining with TRD (decoder TRD)LGenerating implicit vector coding of the background light, and a decoder TRDIAnd finally, performing dot multiplication on the reflectivity and the illumination intrinsic image to generate a harmonious image.
The image harmonization system based on self-attention transformation provided by the embodiments of the invention exploits the strong long-range context modeling capability of the self-attention transformation network and can directly generate harmonized images in the non-decoupled mode (FIG. 2 and FIG. 3), with a simple structure and a good harmonization effect. To achieve an even better harmonization effect, the decoupled mode can be adopted: the reflectivity intrinsic image of the composite image is obtained through the reflectivity image generation module, and the illumination intrinsic image is obtained through the background light decoupling module and the illumination transformation module, so that the harmonized image is obtained by combining the reflectivity intrinsic image and the illumination intrinsic image. In this way the foreground illumination is adjusted to be compatible with the background illumination while the semantics and structure of the composite image remain unchanged, which solves the problem of disharmony between the foreground and background of the composite image.
The effect of the system provided in this example is verified experimentally.
The experiments use the synthetic iHarmony4 dataset and a real composite image dataset. Experiments on the public synthetic iHarmony4 dataset are used to analyze and evaluate the performance of the self-attention-transformation image harmonization models. The iHarmony4 dataset contains 4 sub-datasets, HCOCO, HAdobe5k, HFlickr and Hday2night, each consisting of synthetic composite images; this example follows the same experimental setup as DoveNet. This example also evaluates the performance of the system on 99 real composite images, as in the DoveNet evaluation.
Only the ℓ1 loss function is used as the reconstruction constraint for the reflectivity and illumination images, together with the Adam optimizer (parameters β1 = 0.5, β2 = 0.999). The total number of training rounds is 60; the initial learning rate of the model is set to 1e-4 and is decreased to 1e-5 after 40 rounds. The last layers of the reflectivity decoder D_R and the illumination decoder D_I in the decoupled self-attention-transformation image harmonization model use the tanh activation function. The input images are resized to 256 × 256 for training and testing, and the model generates harmonized images of the same size. In particular, before the harmonized image Ĥ is restored, the reflectivity and illumination images need to be normalized to the [0, 1] interval.
For experimental comparison, this example first constructs two classical network models for the image-to-image translation task as baselines, namely an encoder-decoder U-Net (E-D U-Net) and an encoder-decoder residual convolutional neural network (E-D CNN, structured as Encoder-Resblocks-Decoder). Table 1 shows the quantitative evaluation results on the four sub-datasets and the full iHarmony4 dataset, comparing HT, D-HC, D-HT, the baseline models E-D U-Net and E-D CNN, and the current best methods DIH, S2AM and DoveNet on the 4 datasets; an upward arrow indicates that higher is better, and a downward arrow indicates that lower is better. HT is the non-decoupled self-attention-transformation image harmonization model shown in FIG. 3 (TRE with 2 attention heads and 9 attention layers), while D-HC and D-HT denote the decoupled model using CNN and the decoupled model using the Transformer, respectively (TRE and TRD with 2 heads and 9 layers, as shown in FIG. 6). The D-HC model is constructed by replacing the self-attention transformation encoder TRE_R in D-HT with Resblocks, replacing TRE_L and TRD_L with an Encoder and an MLP for decoupling the background scene light, and replacing TRD_I with AdaIN for re-rendering the illumination intrinsic image. In addition, Table 1 also provides the evaluation results of the composite images against the real images as a reference (Composite column).
TABLE 1
From the experimental evaluation results in Table 1, E-D (CNN) performs better than E-D (U-Net) on the HCOCO and HAdobe5k datasets and worse on the HFlickr and Hday2night datasets, probably because U-Net has a global receptive field that captures the global context but its skip connections may introduce disharmonious factors into the reconstructed image, while the CNN has a limited receptive field due to its inductive bias. Overall, E-D (CNN) achieves a lower fMSE than E-D (U-Net) on the entire iHarmony4 dataset. The non-decoupled self-attention-transformation image harmonization model (HT), however, is superior not only to the two baseline models E-D (U-Net) and E-D (CNN) but also to the other image harmonization methods, indicating that the long-range context capability of the Transformer is very effective for the image harmonization task.
The quantitative comparison in Table 1 shows that the D-HC model achieves competitive or superior results compared with current state-of-the-art methods, demonstrating that separating and harmonizing the reflectivity and illumination intrinsic images does contribute to image harmonization. Likewise, the D-HT model achieves a very low fMSE score (320.78, versus 537.23 and 541.53 for S2AM and DoveNet, respectively), which proves the accuracy and effectiveness of the D-HT design. In addition, D-HC performs better than HT on the Hday2night dataset, probably due to the better decoupling capability of D-HC, while HT, lacking inductive bias, suffers from the small Hday2night training set (only 311 training images).
FIG. 7 shows the visual effect of each image harmonization method (the boxed region in the composite image marks the disharmonious foreground region; one example is shown for each dataset, HCOCO, HAdobe5k, HFlickr and Hday2night from top to bottom). Comparing the visual effects, the harmonized images obtained by the D-HT model are the closest to the real images.
To analyze the impact of the number of input tokens and the encoding type on Transformer performance, this example uses a 1-head 3-layer encoder for TRE followed by CNN reconstruction, and adjusts the number of tokens T by the step size S used when splitting the image into blocks. The data in Table 2 show that, for both the linear and non-linear encoding modes, the performance of the Transformer increases continuously as the number of tokens grows. Furthermore, for a fixed number of tokens (e.g., 4N), the performance of the Transformer is similar regardless of which encoding scheme (linear FC or CONV, or non-linear MLP or CNN) is chosen. It can therefore be concluded that, for image harmonization, the performance of the Transformer is sensitive to the number of tokens and insensitive to the way the tokens are encoded. Accordingly, this example provides a long sequence with more tokens, even though there may be redundancy among them, so that the Transformer can mine richer context, and any of the different encoding methods can provide effective information for the image blocks.
TABLE 2
Experiments were further designed to verify the effect of the number of layers of the self-attention transformation network encoder (E) and decoder (D) on image harmonization under the HT architecture. The fMSE↓ quantitative comparison in Table 3 shows that the performance of the Transformer on the image harmonization task is similar when the number of encoder layers equals the total number of encoder and decoder layers, even though the decoder has additional cross-attention layers. Therefore, this example employs only the Transformer encoder TRE in the HT model design (and for generating the reflectivity intrinsic image).
TABLE 3
To analyze the performance impact of using Transformers with different numbers of attention heads and layers on HT-model image harmonization, this example further designed a set of experiments. The quantitative comparison in Table 4 shows that more attention layers and heads contribute to improved Transformer performance, but when the number of attention layers exceeds 9, the room for further improvement is limited.
TABLE 4
An ablation study was performed on the Transformer part of the D-HT model by replacing the Transformers of the reflectivity path and the illumination path, respectively, with the CNN structures used in the D-HC model; the quantitative comparison in Table 5 demonstrates the superiority of the Transformer in the image harmonization task.
TABLE 5
In addition, this example performed another experiment with a foreground mask inversion operation, i.e., exchanging the foreground and background regions of the composite image to generate an inverted mask, so that the D-HT model of this example adjusts the background according to the foreground of the composite image to harmonize it. FIG. 8 shows the image harmonization results compared using a normal mask (middle row) and an inverted mask (bottom row), indicating that D-HT can produce meaningful image harmonization results from an arbitrary foreground mask.
This example further investigated the latent vector space of the light to explore whether the Transformer can learn a light representation of the image. Given an image, this example uses the decoupled self-attention-transformation image harmonization (D-HT) model to obtain the latent vector encoding of its light and arbitrarily alters that encoding to produce different images through the subsequent network. FIG. 9 shows output images under different lighting conditions, indicating that the background scene light learned with the Transformer encoder and decoder in this example is accurate.
Further, this example also designed a set of combined experiments to verify scene light learning and migration. As shown in FIG. 10, two images (Source1 and Source2) are used as scene light reference images and one image is used as the target image (Target) for light migration. First, the scene light latent vector encodings L_s1 and L_s2 corresponding to the two reference images are learned; then the formula L_t = αL_s1 + (1 - α)L_s2 is used to obtain different target scene light latent vector encodings L_t by adjusting the variable α; finally, the target scene light latent vector encoding L_t is rendered onto the reflectivity feature image of the target image through the illumination migration model to generate images with different illumination. The qualitative results show that the scene light learning and migration design of this example is effective and can also be applied to related tasks of generating images of different modalities.
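The light interpolation used above, L_t = αL_s1 + (1 - α)L_s2, can be sketched in a few lines; the 27-dimensional shape of the light codes follows the earlier description and is otherwise an assumption.

# Sketch of the scene light interpolation: blend two learned light codes and pass
# the result to the illumination branch to re-render the target image.
import torch

def blend_light(L_s1, L_s2, alpha):
    """L_t = alpha * L_s1 + (1 - alpha) * L_s2, with 0 <= alpha <= 1."""
    return alpha * L_s1 + (1.0 - alpha) * L_s2

L_s1, L_s2 = torch.randn(1, 1, 27), torch.randn(1, 1, 27)
for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    L_t = blend_light(L_s1, L_s2, a)               # different target scene light codes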
Compared with the current state of the art, the B-T score is used to evaluate the image harmonization capability of the D-HT model on real composite images. The statistics in Table 6 and the visual effects in FIG. 11 show that the method of this example obtains the best B-T score and the best visual effect.
TABLE 6
The non-decoupled image harmonization HT model is also applied in this example to the image completion task with randomly missing regions on the Paris StreetView dataset, in order to verify the practicality and extensibility of the HT model designed in this example. The purpose of image completion is to fill in the missing regions of an image by synthesizing visually realistic and semantically reasonable pixels that are consistent with the pixels of the known regions. Table 7 and FIG. 12 show the quantitative and visual results of the HT model and the current state-of-the-art RFR-Net; the results show that the HT model, by fully exploiting the advantage of the Transformer in long-range modeling, achieves superior performance in the image completion task.
TABLE 7
This example also applies the decoupled self-attention-transformation image harmonization D-HT model to the image enhancement task on the MIT-Adobe-5K-UPE dataset, in comparison with the latest method DeepLPF. Poor lighting conditions during imaging lead to reduced image quality, especially under-exposed images. Therefore, this example uses the D-HT model with a reconstruction loss function to decompose the low-light image into reflectivity and illumination images, and treats the reflectivity image as the enhanced image.
The quantitative comparison in Table 8 shows that D-HT is superior to the DeepLPF method under the PSNR, SSIM and LPIPS evaluation criteria. FIG. 13 further verifies that the D-HT model of this example can recover the contrast, natural color and sharp details of the image through the decoupled self-attention transformation network.
TABLE 8
In conclusion, this example proposes a new image harmonization method using a self-attention transformation network, aiming to eliminate the disharmonious factors of the composite image by exploiting the long-range context modeling capability of the Transformer. This example not only establishes two self-attention-transformation image harmonization frameworks, non-decoupled and decoupled (HT and D-HT), but also designs comprehensive experiments to explore and analyze the usage patterns and potential of the Transformer for image harmonization. In addition, the non-decoupled and decoupled self-attention-transformation image harmonization models are further applied to two classical computer vision tasks, image completion and image enhancement, which further illustrates the effectiveness and superiority of the design of this method (the D-HT model).
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (9)

1. An image harmonization system based on self-attention transformation, comprising a non-decoupled image harmonization module or a decoupled image harmonization module;
the non-decoupled image harmonization module is used for performing a direct self-attention transformation on the input composite image and mask image by using a self-attention transformation network to generate a corresponding harmonized image;
the decoupled image harmonization module comprises a reflectivity image generation module, a background light decoupling module, an illumination transformation module and a synthesis module;
the reflectivity image generation module is used for performing a decoupled self-attention transformation on the input composite image and mask image to generate a reflectivity intrinsic image of the composite image;
the background light decoupling module is used for decoupling the background light from the background image of the composite image by using a self-attention transformation network, so that the background light can be applied to the reflectivity intrinsic image;
the illumination transformation module is used for further generating an illumination intrinsic image, by using a self-attention transformation network, from the reflectivity intrinsic image illuminated with the background light;
the synthesis module is used for performing a point multiplication operation on the reflectivity intrinsic image and the illumination intrinsic image to generate a harmonized image of the composite image.
2. The image harmonization system based on self-attention transformation of claim 1, wherein: the non-decoupled image harmonization module comprises a first encoder, a first serialization transformation module, a first self-attention transformation module, a first serialization inverse transformation module and a first decoder;
the first encoder is used for encoding the input composite image and the mask image into a feature space to obtain a feature image, which is input into the first serialization transformation module;
the first serialization transformation module performs a serialization transformation on the input feature image to generate the input tokens of the first self-attention transformation module;
the first self-attention transformation module is used for performing a direct self-attention transformation on the input tokens generated by the first serialization transformation module to obtain output tokens, which are input into the first serialization inverse transformation module;
the first serialization inverse transformation module is used for performing an inverse serialization transformation on the output tokens to generate a harmonized feature image;
the first decoder is configured to decode the harmonized feature image into the harmonized image corresponding to the composite image.
3. The image harmonization system based on self-attention transformation of claim 2, wherein: the reflectivity image generation module includes a second encoder, a second serialization transformation module, a second self-attention transformation module, a second serialization inverse transformation module and a second decoder;
the second encoder is used for encoding the input composite image and the mask image into a feature space to obtain a feature image and inputting the feature image into the second serialization transformation module;
the second serialization transformation module carries out serialization transformation on the input characteristic image to generate an input token of the second self-attention transformation module;
the second self-attention transformation module is used for performing decoupled self-attention transformation on the input token generated by the second serialization transformation module to obtain a reflectivity image output token and inputting the reflectivity image output token into the second serialization inverse transformation module and the illumination transformation module;
the second serialization inverse transformation module is used for carrying out serialization inverse transformation on the input output token to generate a reflectivity intrinsic characteristic image;
the second decoder is configured to decode the reflectance intrinsic feature image into a reflectance intrinsic image corresponding to the composite image.
4. The image harmonization system based on self-attention transformation of claim 3, wherein: the background-light decoupling module comprises a linear transformation module, a third self-attention transformation module, and a fourth self-attention transformation module;
the linear transformation module is configured to divide an input background image into a sequence of image patches, flatten each patch into a token, and encode the tokens into a feature space through a linear mapping, thereby generating the input tokens of the third self-attention transformation module;
the third self-attention transformation module is configured to perform self-attention transformation encoding on its input tokens to generate the input tokens of the fourth self-attention transformation module;
the fourth self-attention transformation module is configured to perform self-attention transformation decoding on its input tokens to generate a background-light latent-vector encoding token, which is input into the illumination transformation module.
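A sketch of the background-light decoupling module of claim 4, assuming a ViT-style patch embedding for the linear transformation module and a single learnable query decoded against the encoded patches to obtain the background-light latent token; the patch size, dimensions, and single-query design are assumptions, not claim language.

import torch
import torch.nn as nn

class BackgroundLightDecoupler(nn.Module):
    def __init__(self, patch=16, dim=256, heads=2, layers=9):
        super().__init__()
        # Linear transformation module: flatten each patch and map it to the feature space
        self.embed = nn.Linear(patch * patch * 3, dim)
        self.patch = patch
        self.light_query = nn.Parameter(torch.randn(1, 1, dim))  # query for the light code
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)  # third self-attention module (TRE)
        self.decoder = nn.TransformerDecoder(dec, layers)  # fourth self-attention module (TRD)

    def forward(self, background):
        b, c, h, w = background.shape
        p = self.patch
        # Divide into an image-patch sequence and flatten each patch into a token
        patches = background.unfold(2, p, p).unfold(3, p, p)        # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.embed(patches)                                # (B, N, dim)
        memory = self.encoder(tokens)
        query = self.light_query.expand(b, -1, -1)
        return self.decoder(query, memory)                          # (B, 1, dim) light token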
5. The image harmonization system based on self-attention transformation of claim 4, wherein: the illumination transformation module comprises a fifth self-attention transformation module, a third inverse serialization transformation module, and a third decoder;
the fifth self-attention transformation module is configured to perform a self-attention transformation on the background-light latent-vector encoding token and the reflectance-image output tokens to generate the corresponding illumination intrinsic-image output tokens;
the third inverse serialization transformation module is configured to perform an inverse serialization transformation on the illumination intrinsic-image output tokens to generate an illumination intrinsic feature image corresponding to the composite image;
the third decoder is configured to decode the illumination intrinsic feature image and output the illumination intrinsic image corresponding to the composite image.
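A sketch of the illumination transformation module of claim 5. The claim only states that the background-light token and the reflectance output tokens are fused by the decoder, so treating the reflectance tokens as the decoder target and the light token as the memory is an assumption, as are the layer sizes.

import torch
import torch.nn as nn

class IlluminationGenerator(nn.Module):
    def __init__(self, dim=256, heads=2, layers=9):
        super().__init__()
        dec = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.fuse = nn.TransformerDecoder(dec, layers)  # fifth self-attention module (TRD)
        self.decoder = nn.Sequential(                   # third decoder (CNN)
            nn.ConvTranspose2d(dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, reflectance_tokens, light_token, feat_hw):
        # Cross-attention: reflectance tokens attend to the background-light latent
        tokens = self.fuse(reflectance_tokens, light_token)              # (B, H*W, dim)
        h, w = feat_hw
        feat = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)  # inverse serialization
        return self.decoder(feat)                                        # illumination intrinsic image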
6. The image harmonization system based on self-attention transformation of claim 5, wherein: in the training process, both the non-decoupled image harmonization module and the decoupled image harmonization module adopt a single image harmonization loss function (given as formula FDA0003258832470000031 in the original filing) to encourage the harmonized image of the composite image to approach its real image.
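The exact loss formula is available only as an image in the original claim and is not reproduced here; the snippet below is a minimal sketch assuming a standard pixel-wise L1 reconstruction loss between the harmonized output and the real image.

import torch.nn.functional as F

def harmonization_loss(harmonized, real):
    # Encourages the harmonized image of the composite to approach its real image.
    return F.l1_loss(harmonized, real)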
7. The image harmonization system based on self-attention transformation of claim 6, wherein: the first encoder and the second encoder are both CNN encoders, and the first decoder, the second decoder, and the third decoder are all CNN decoders.
8. The image harmonization system based on self-attention transformation of claim 7, wherein: the first, second, and third self-attention transformation modules each adopt the encoder (TRE) of a self-attention transformation network, and the fourth and fifth self-attention transformation modules each adopt the decoder (TRD) of a self-attention transformation network;
the TRE consists of a stack of structurally identical layers, each containing a multi-head self-attention sublayer and a feed-forward network sublayer; the TRE is intended to output a self-attention map by modeling the dependencies between the input tokens (image patches);
the TRD likewise consists of a stack of structurally identical layers, each of which, in addition to the two sublayers of the TRE, has a third encoder-decoder cross-attention sublayer that performs a multi-head attention operation over the TRE output and the TRD input; the TRD is intended to learn a mapping from a source domain to a target domain and to generate a task-related feature matrix.
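The two layer types described in claim 8 map directly onto PyTorch primitives; the sketch below only illustrates the sublayer structure, with illustrative dimensions.

import torch.nn as nn

dim, heads = 256, 2
tre_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
trd_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
# tre_layer.self_attn and tre_layer.linear1/linear2 correspond to the two TRE sublayers;
# trd_layer.multihead_attn is the additional encoder-decoder cross-attention sublayer.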
9. The image harmonization system based on self-attention transformation of claim 8, wherein: the first, second, third, fourth, and fifth self-attention transformation modules each adopt 2 attention heads and 9 attention layers.
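Instantiating the five self-attention transformation modules with the configuration of claim 9 (2 attention heads, 9 attention layers); the model dimension of 256 is an assumption.

import torch.nn as nn

dim, heads, layers = 256, 2, 9

def make_tre():
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=layers)

def make_trd():
    return nn.TransformerDecoder(
        nn.TransformerDecoderLayer(dim, heads, batch_first=True), num_layers=layers)

first_sa, second_sa, third_sa = make_tre(), make_tre(), make_tre()  # TREs
fourth_sa, fifth_sa = make_trd(), make_trd()                        # TRDs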
CN202111067167.6A 2021-09-13 2021-09-13 Image harmony system based on self-attention transformation Active CN113689328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111067167.6A CN113689328B (en) 2021-09-13 2021-09-13 Image harmony system based on self-attention transformation

Publications (2)

Publication Number Publication Date
CN113689328A true CN113689328A (en) 2021-11-23
CN113689328B CN113689328B (en) 2024-06-04

Family

ID=78586147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111067167.6A Active CN113689328B (en) 2021-09-13 2021-09-13 Image harmony system based on self-attention transformation

Country Status (1)

Country Link
CN (1) CN113689328B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115713680A (en) * 2022-11-18 2023-02-24 山东省人工智能研究院 Semantic guidance-based face image identity synthesis method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523534A (en) * 2020-03-31 2020-08-11 华东师范大学 Image description method
CN113076809A (en) * 2021-03-10 2021-07-06 青岛海纳云科技控股有限公司 High-altitude falling object detection method based on visual Transformer
CN113269792A (en) * 2021-05-07 2021-08-17 上海交通大学 Image post-harmony processing method, system and terminal
CN113192055A (en) * 2021-05-20 2021-07-30 中国海洋大学 Harmonious method and model for synthesizing image
CN113344807A (en) * 2021-05-26 2021-09-03 商汤集团有限公司 Image restoration method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HANTING CHEN et al.: "Pre-Trained Image Processing Transformer", arXiv preprint, pages 1-15 *
LAN Hong; LIU Qinyi: "Scene-graph-to-image generation model with graph attention networks", Journal of Image and Graphics, no. 08, 12 August 2020 (2020-08-12) *
WANG Junhao; LUO Yifeng: "Enriching image captions with fine-grained semantic features and Transformer", Journal of East China Normal University (Natural Science), no. 05, 25 September 2020 (2020-09-25) *

Also Published As

Publication number Publication date
CN113689328B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
CN111539887B (en) Channel attention mechanism and layered learning neural network image defogging method based on mixed convolution
Chen et al. DARGS: Image inpainting algorithm via deep attention residuals group and semantics
Ning et al. Accurate and lightweight image super-resolution with model-guided deep unfolding network
CN109376830A (en) Two-dimensional code generation method and device
CN113934890B (en) Method and system for automatically generating scene video by characters
CN111915693A (en) Sketch-based face image generation method and system
Xin et al. Residual attribute attention network for face image super-resolution
CN113192055B (en) Harmonious method and model for synthesizing image
CN111986105B (en) Video time sequence consistency enhancing method based on time domain denoising mask
CN117597703A (en) Multi-scale converter for image analysis
KR20240065281A (en) Vector-quantized image modeling
CN115457043A (en) Image segmentation network based on overlapped self-attention deformer framework U-shaped network
CN113689328B (en) Image harmony system based on self-attention transformation
Esmaeilzehi et al. SRNHARB: A deep light-weight image super resolution network using hybrid activation residual blocks
CN113537246A (en) Gray level image simultaneous coloring and hyper-parting method based on counterstudy
CN117474800A (en) Image defogging method of full convolution decoder based on channel converter
CN112686830A (en) Super-resolution method of single depth map based on image decomposition
CN113781376B (en) High-definition face attribute editing method based on divide-and-congress
Yang et al. Deep 3d modeling of human bodies from freehand sketching
CN115660979A (en) Attention mechanism-based double-discriminator image restoration method
CN113780209A (en) Human face attribute editing method based on attention mechanism
Ni et al. Natural Image Reconstruction from fMRI Based on Self-supervised Representation Learning and Latent Diffusion Model
Wen et al. Mrft: Multiscale recurrent fusion transformer based prior knowledge for bit-depth enhancement
Chang et al. 3D hand reconstruction with both shape and appearance from an RGB image
Peng Efficient Neural Light Fields (ENeLF) for Mobile Devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant