CN113689328A - Image harmony system based on self-attention transformation - Google Patents
Image harmony system based on self-attention transformation
- Publication number
- CN113689328A (application CN202111067167.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T3/00 — Geometric image transformations in the plane of the image
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06T9/002 — Image coding using neural networks
Abstract
The invention relates to the technical field of image processing, and discloses two image harmonization systems based on self-attention transformation (Transformer) networks, one non-decoupled and one decoupled, both of which exploit the strong long-range context modeling capability of the self-attention transformation network. The non-decoupled image harmonization module uses the self-attention transformation network to fully mine the relationship between the foreground and the background in the feature space of the composite image, so as to guide its harmonization. Alternatively, the decoupled image harmonization module uses a self-attention transformation encoder and decoder to decouple a latent vector encoding of the background illumination, fuses this background-light latent code with the reflectivity feature image through another self-attention transformation decoder to generate an illumination intrinsic image, and finally multiplies the reflectivity intrinsic image by the illumination intrinsic image to obtain the harmonized image. The foreground illumination is thereby adjusted to be compatible with the background illumination while the semantics and structure of the composite image remain unchanged, solving the problem of disharmony between the foreground and background of the composite image.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to an image harmony system based on self-attention transformation.
Background
Combining arbitrary regions of different images into a visually plausible composite image is a basic task in many applications of computer vision and graphics, such as image synthesis, image stitching, image editing and scene synthesis, and image compositing is also a common operation in daily life. However, a composite image obtained by copying and pasting a partial region of one image (the foreground of the composite image) into another image (the background) inevitably suffers from disharmony between the foreground and the background, because the foreground region and the background region (the rest of the composite image outside the foreground) were captured under different imaging environments (such as day and night, sunny and cloudy weather, indoors and outdoors). Therefore, making the composite image look more realistic by simple and efficient means, i.e., harmonizing the image, is an important and challenging task.
Traditional image harmonization methods focus on better matching techniques, ensuring appearance consistency between foreground and background by migrating statistical information such as color and texture. Recently, deep harmonization models and large-scale datasets have been developed to address this challenging task and have achieved good results. Current deep learning models mainly adopt an encoder-decoder Convolutional Neural Network (CNN) architecture, which first uses the encoder to learn the color information of the background appearance near the foreground region, then captures the context of the composite image to adjust the appearance or illumination of the foreground region to be consistent with the background, and finally reconstructs the harmonized image with the decoder.
In fact, the commonly used encoder-decoder convolutional neural network architecture accomplishes the image harmonization task in two stages. The first stage mainly adjusts the color of the foreground region in a multi-layer feature space based on the color statistics of the background region of the composite image, making it compatible with the background color; the second stage mainly reconstructs the original structure and semantic information of the image, together with harmonized low-level visual features, from the high-dimensional feature space. However, the inductive bias arising from the locality of the CNN itself means that a convolutional neural network can only attend to locally limited information, so a shallow CNN can only capture the background context near the foreground and lacks global background context. Yet the overall consistency of harmonization across the image is a key element in evaluating the visual realism of a composite image, and a CNN may be unable to fully exploit global background information to adjust the foreground color to be consistent with the overall background color.
In addition, previous methods adopt a U-Net multi-layer CNN structure with skip connections. Although U-Net can enlarge the receptive field by stacking multiple CNN layers to capture the global context of the image, the skip connections from encoder to decoder may reintroduce the original disharmonious information of the composite image into the reconstructed image and degrade the performance of the image harmonization model.
Disclosure of Invention
The invention provides an image harmonization system based on self-attention transformation, which solves the following technical problem: during image harmonization, capturing both the background context near the foreground and the overall context of the image without reintroducing disharmonious information, so as to resolve the disharmony between the foreground and background of the composite image to the greatest extent.
In order to solve the above technical problem, the invention provides an image harmonization system based on self-attention transformation, which comprises either a non-decoupled image harmonization module or a decoupled image harmonization module;
the non-decoupled image harmonization module is used for performing a direct self-attention transformation on the input composite image and mask image by using a self-attention transformation network to generate the corresponding harmonized image;
the decoupled image harmonization module comprises a reflectivity image generation module, a background light decoupling module, an illumination transformation module and a synthesis module;
the reflectivity image generation module is used for performing a decoupled self-attention transformation on the input composite image and mask image to generate the reflectivity intrinsic image of the composite image;
the background light decoupling module is used for decoupling the background light from the background image of the composite image by using a self-attention transformation network, so that the background light can be cast onto the reflectivity intrinsic image;
the illumination transformation module is used for further generating an illumination intrinsic image, using a self-attention transformation network, from the reflectivity intrinsic image onto which the background light has been cast;
the synthesis module is used for performing a point-wise multiplication of the reflectivity intrinsic image and the illumination intrinsic image to generate the harmonized image of the composite image.
Specifically, the non-decoupled image harmonization module includes a first encoder, a first serialization transformation module, a first self-attention transformation module, a first serialization inverse transformation module, and a first decoder;
the first encoder is used for encoding the input composite image and the mask image into a feature space to obtain a feature image, which is input into the first serialization transformation module;
the first serialization transformation module performs a serialization transformation on the input feature image to generate the input tokens of the first self-attention transformation module;
the first self-attention transformation module is used for performing a direct self-attention transformation on the input tokens generated by the first serialization transformation module to obtain output tokens, which are input into the first serialization inverse transformation module;
the first serialization inverse transformation module is used for performing an inverse serialization transformation on the output tokens to generate a harmonized feature image;
the first decoder is configured to decode the harmonized feature image into the harmonized image corresponding to the composite image.
Specifically, the reflectivity image generation module includes a second encoder, a second serialization transformation module, a second self-attention transformation module, a second serialization inverse transformation module, and a second decoder;
the second encoder is used for encoding the input composite image and the mask image into a feature space to obtain a feature image, which is input into the second serialization transformation module;
the second serialization transformation module performs a serialization transformation on the input feature image to generate the input tokens of the second self-attention transformation module;
the second self-attention transformation module is used for performing a decoupled self-attention transformation on the input tokens generated by the second serialization transformation module to obtain reflectivity image output tokens, which are input into the second serialization inverse transformation module and the illumination transformation module;
the second serialization inverse transformation module is used for performing an inverse serialization transformation on the output tokens to generate a reflectivity intrinsic feature image;
the second decoder is configured to decode the reflectivity intrinsic feature image into the reflectivity intrinsic image corresponding to the composite image.
Specifically, the background light decoupling module includes a linear transformation module, a third self-attention transformation module, and a fourth self-attention transformation module;
the linear transformation module is used for dividing an input background image into an image block sequence, then flattening each image block to serve as a token and encoding the token into a feature space through linear mapping to generate an input token of the third self-attention transformation module;
the third self-attention transformation module is used for performing self-attention transformation coding on the input token of the third self-attention transformation module to generate the input token of the fourth self-attention transformation module;
the fourth self-attention transformation module is used for performing self-attention transformation decoding on the input token of the fourth self-attention transformation module, generating a background light hidden vector coding token and inputting the background light hidden vector coding token into the illumination transformation module.
Specifically, the illumination transformation module includes a fifth self-attention transformation module, a third serialization inverse transformation module, and a third decoder;
the fifth self-attention transformation module is used for performing self-attention transformation on the background light hidden vector encoding token and the reflectivity image output token to generate a corresponding illumination intrinsic image output token;
the third serialization inverse transformation module is used for carrying out serialization inverse transformation on the illumination intrinsic image output token to generate an illumination intrinsic characteristic image corresponding to the synthesized image;
and the third decoder is used for decoding the illumination intrinsic characteristic image and outputting an illumination intrinsic image corresponding to the synthesized image.
Specifically, in the training process, both the non-decoupled image harmonization module and the decoupled image harmonization module adopt a single loss function that encourages the harmonized image of the composite image to approximate its real image.
Specifically, the first encoder and the second encoder both employ the encoder of a CNN network, and the first decoder, the second decoder and the third decoder all employ the decoder of a CNN network.
Specifically, the first self-attention transformation module, the second self-attention transformation module, and the third self-attention transformation module all employ the encoder TRE of a self-attention transformation network, and the fourth self-attention transformation module and the fifth self-attention transformation module both employ the decoder TRD of the self-attention transformation network;
TRE consists of a stack of structurally identical layers, each of which contains a multi-head self-attention sublayer and a feed-forward network sublayer; TRE is intended to output a self-attention map based on modeling the dependencies between the input tokens (image blocks);
TRD is likewise made up of a stack of structurally identical layers, where each layer, in addition to the two sublayers identical to TRE, has a third encoder-decoder cross-attention sublayer that performs a multi-head attention operation between the TRE output and the TRD itself; TRD aims to learn a mapping from a source domain to a target domain, generating a feature matrix related to the task.
Specifically, the first self-attention transforming module, the second self-attention transforming module, the third self-attention transforming module, the fourth self-attention transforming module and the fifth self-attention transforming module all use 2 attention heads and 9 attention layers.
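A minimal NumPy rendering of one multi-head self-attention sublayer (the core of each TRE/TRD layer), shown with the 2 heads stated above. This is a simplified sketch: the square weight shapes, and the omission of residual connections, layer normalization and the feed-forward sublayer, are assumptions for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, heads=2):
    """One multi-head self-attention sublayer over a token sequence X of shape (n, d).
    All weight matrices are (d, d); each head attends within its own d/heads subspace."""
    n, d = X.shape
    dh = d // heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outs = []
    for h in range(heads):
        q, k, v = (M[:, h * dh:(h + 1) * dh] for M in (Q, K, V))
        A = softmax(q @ k.T / np.sqrt(dh))   # (n, n) attention map over tokens
        outs.append(A @ v)                   # weighted mixture of token values
    return np.concatenate(outs, axis=1) @ Wo # merge heads and project back to d
```

With identical input tokens the attention map is uniform and the sublayer reduces to the value projection, which gives a simple sanity check.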
The image harmonization system based on self-attention transformation provided by the embodiments of the invention exploits the strong long-range context modeling capability of the self-attention transformation network. It can directly generate harmonized images in the non-decoupled mode, which has a simple structure and a good harmonization effect. For an even better effect, the decoupled mode can be adopted: the reflectivity intrinsic image of the composite image is obtained through the reflectivity image generation module, the illumination intrinsic image is obtained through the background light decoupling module and the illumination transformation module, and the harmonized image is obtained by combining the reflectivity intrinsic image and the illumination intrinsic image. This adjusts the foreground illumination to be compatible with the background illumination while keeping the semantics and structure of the composite image unchanged, solving the disharmony between the foreground and background of the composite image. Experimental results show that the system achieves state-of-the-art performance on the image harmonization task.
Drawings
FIG. 1 is a block diagram of an input mode framework for performing image vision tasks using a Transformer according to an embodiment of the present invention;
FIG. 2 is a block diagram of a non-decoupled image harmonization module (HT model) according to an embodiment of the present invention;
FIG. 3 is a detailed block diagram of a non-decoupled image harmonization module (HT model) according to an embodiment of the present invention;
FIG. 4 is a block diagram of the decoupled image harmonization module (D-HT model) provided by an embodiment of the present invention;
FIG. 5 is a refined framework diagram of the model shown in FIG. 4 provided by an embodiment of the present invention;
FIG. 6 is a detailed block diagram of the model shown in FIG. 5 provided by an embodiment of the present invention;
FIG. 7 is an illustration of the visual effect of various image harmonization methods provided by embodiments of the present invention on the four sub-datasets and the full iHarmony4 dataset;
FIG. 8 is a graph showing the image and harmonious visual effects provided by an embodiment of the present invention compared using a normal mask (middle row) and a reverse mask (bottom row);
FIG. 9 is a diagram of image visual effect presentation with different outputs under different lighting conditions provided by an embodiment of the present invention;
FIG. 10 is a diagram illustrating the visual effect of output images with different illumination obtained by modifying the latent vector encoding (Lt) of a target image according to an embodiment of the present invention;
FIG. 11 is a diagram illustrating the visual effect of the images and the harmonizing method on the real synthesized image according to the embodiment of the present invention;
FIG. 12 is a diagram illustrating the visual effect of the image completion method on the Paris street View data set according to the embodiment of the present invention;
FIG. 13 is a diagram illustrating the visual effect of the image enhancement method on the MIT-Adobe-5K-UPE data set according to the embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings, which are given solely for the purpose of illustration and are not to be construed as limitations of the invention, since many variations thereof are possible without departing from the spirit and scope of the invention.
Self-attention transformation networks (Transformers) benefit from an elaborate self-attention mechanism designed to capture long-range context, and the Transformer, as a new neural network structure, is rapidly gaining wide attention in both research and industry. The Transformer was first applied to Natural Language Processing (NLP) tasks in place of RNNs and LSTMs, and achieved remarkable results in many NLP tasks. Nowadays, benefiting from the powerful feature representation capability of Transformers, researchers are applying them to various computer vision tasks, such as object detection, image recognition and image processing.
Self-attention transformation networks (Transformers) were first applied to sequential data processing tasks such as machine translation in natural language; they do not rely on recurrence but on a self-attention mechanism to describe global dependencies between inputs and outputs. Thus, to use a Transformer for computer vision tasks, a 2D image must be represented as 1D sequence data whose elements or encodings are treated as tokens (like words in NLP), and this serialized data is taken as the Transformer input. In practice, image blocks can be used as tokens to avoid the extremely long sequences that would result from using individual pixels as tokens. In this work, this embodiment preliminarily analyzes the impact of different token numbers and different embedding types on the Transformer for image harmonization. For the number of tokens, different step sizes are used when splitting the image into image blocks. For the encoding mode, two projection modes are used: linear (FC or CONV) and nonlinear (an MLP or CNN containing a nonlinear activation function). Experiments show that the Transformer is more sensitive to the number of tokens and insensitive to the encoding type. The image input method is shown in Fig. 1.
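The token-count trade-off discussed above can be made concrete with a small helper. The image, patch, and stride values in the check below are illustrative only and are not taken from the patent's experiments.

```python
def num_tokens(H, W, patch, stride):
    """Number of tokens produced when an H x W image is split into patch x patch
    blocks sampled every `stride` pixels (stride == patch gives non-overlapping blocks;
    stride < patch gives overlapping blocks and hence more tokens)."""
    return ((H - patch) // stride + 1) * ((W - patch) // stride + 1)
```

For example, a 256 x 256 image split into 4 x 4 blocks with stride 4 yields 64 x 64 = 4096 tokens, while halving the stride roughly quadruples the token count, which is the sensitivity the experiments probe.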
The self-attention transformation network (Transformer) comprises an encoder TRE(·) for capturing token relations and a decoder TRD(·) for generating the task output. TRE consists of a stack of structurally identical layers, where each layer contains one multi-head self-attention sublayer and one feed-forward network sublayer. TRD is likewise a stack of structurally identical layers, where each layer has, in addition to the two sublayers of TRE, a third encoder-decoder cross-attention sublayer that performs a multi-head attention operation between the TRE output and the TRD itself. Thus TRE uses self-attention to explore the relationships among its input vectors, while TRD performs cross-attention to find the correlations between its own input and the TRE output. For visual tasks with images as input, TRE aims to generate a task-related feature matrix by modeling the dependencies between input tokens (image blocks) and then outputting a self-attention map, while TRD aims to learn the mapping from a source domain (TRE input) to a target domain (TRD input/output). This work explores the capability of TRE and TRD on the image harmonization task and the influence of different numbers of attention heads and layers on Transformer performance, with the goal of solving the image harmonization problem by exploiting the Transformer's powerful long-range context modeling capability so as to make full use of global background context. To solve the disharmony of color appearance caused by different illumination conditions between the foreground and background of a composite image, this example provides an image harmonization system based on self-attention transformation. It first designs a simple non-decoupled self-attention transformation image harmonization framework (HT), i.e., the non-decoupled image harmonization module, which introduces a self-attention transformation network (Transformer) between a very basic convolutional (CNN) encoder and decoder and performs a direct self-attention transformation on the input composite image and mask image to generate the corresponding harmonized image.
As shown in Figs. 2 and 3, the non-decoupled image harmonization module includes a first encoder E (the encoder of a CNN network), a first serialization transformation module R, a first self-attention transformation module TRE (the encoder of a self-attention transformation network), a first serialization inverse transformation module R', and a first decoder D (the decoder of a CNN network).
The first encoder E encodes the input composite image Ĩ and the mask image M into a feature space to obtain a feature image, which is input into the first serialization transformation module R. The first serialization transformation module R serializes the input feature image to generate the input tokens of the first self-attention transformation module TRE. The first self-attention transformation module TRE performs a direct self-attention transformation on the input tokens generated by R to obtain output tokens, which are input into the first serialization inverse transformation module R'. The module R' performs an inverse serialization transformation on the output tokens to generate a harmonized feature image. The first decoder D decodes the harmonized feature image into the harmonized image Ĥ corresponding to the composite image Ĩ.
The CNN encoder E aims to encode the composite image into a compact feature space, with the pixels of the feature map serving as the Transformer input, while the CNN decoder D reconstructs the Transformer output into a harmonized image corresponding to the input image. This design actually embeds the Transformer between the CNN codec under the basic encoder-decoder architecture, and is relatively fair in comparison with current mainstream image harmonization methods. Furthermore, for a low-level visual task in which much information (semantics, structure, etc.) is unchanged between the input image and the output image, the cross-attention and self-attention modules in TRD can be regarded as playing similar roles, so this example uses only TRE in the HT framework.
For the image harmonization task, given a composite image Ĩ with a corresponding foreground mask image M, the goal is to generate a harmonized image Ĥ with compatible foreground and background as output; i.e., Ĥ should be as close as possible to the real image H. Specifically, the CNN encoder E(·) generates a lower-resolution feature image F ∈ R^(h×w×c), with c = 256, where H and W denote the height and width of the composite image. The pixels of the feature image F (each corresponding to an image block of the input image) are then serialized into F' ∈ R^(hw×c), which serves as the input tokens of TRE; each input token encodes the feature value of every channel at one pixel. In addition, similar to the use of the original Transformer in NLP tasks, this example obtains the position code E of each token from the actual coordinates of each pixel in the feature image F according to the fixed sine-cosine positional encoding, and uses it as the token position input of TRE. Finally, the sequence data output by TRE is inversely transformed, according to the original position coordinates, into a feature image of the same size as F, and input to the CNN decoder D(·) to generate the harmonized image Ĥ.
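The fixed sine-cosine positional code referred to above can be sketched as in the original Transformer formulation. Treating each feature-map pixel's flattened index as its position, and the exact 10000-base frequency schedule, are assumptions of this sketch rather than details confirmed by the patent.

```python
import numpy as np

def positional_encoding(positions, c):
    """Fixed sine-cosine positional codes for integer token positions.
    Returns an array of shape (len(positions), c): even channels carry sines,
    odd channels carry cosines, at geometrically spaced frequencies."""
    pos = np.asarray(positions, dtype=float)[:, None]      # (n, 1)
    i = np.arange(c // 2, dtype=float)[None, :]            # (1, c/2) frequency index
    angles = pos / (10000.0 ** (2 * i / c))
    pe = np.empty((len(pos), c))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```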
This example formulates the non-decoupled self-attention transformation image harmonization model as:

Ĥ = D(Φ'(TRE(Φ(E(Ĩ, M)))))

where Φ and Φ' denote the serialization transformation and the inverse serialization transformation, respectively.
Adjusting the foreground illumination according to the background illumination is the key to solving the image disharmony problem. In addition, the diffuse reflection model based on intrinsic images and Retinex theory assumes that the light intensity value of an image actually encodes all the features of the corresponding scene point. Therefore, this embodiment also uses the Transformer to capture the background light and cast it onto the reflectivity intrinsic image, achieving harmonization of the illumination intrinsic image by decomposing the composite image into a reflectivity intrinsic image and an illumination intrinsic image. As shown in Fig. 4, the image harmonization system based on self-attention transformation according to this embodiment includes a reflectivity image generation module, a background light decoupling module, an illumination transformation module, and a synthesis module. Wherein:
the reflectivity image generation module is used for synthesizing the input imageSelf-injection decoupled from mask image MPerforming a semantic transformation to generate a composite imageReflectivity eigen image of
The background-light decoupling module is used for decoupling the background light l_bg from the background image of the composite image (obtained by removing the foreground region from the composite image) using a self-attention transformation network, so that l_bg can be cast onto the reflectivity intrinsic image. The illumination transformation module is used for further generating the illumination intrinsic image, using a self-attention transformation network, from the reflectivity intrinsic image irradiated with the background light l_bg. The synthesis module is used for performing a dot-product operation on the reflectivity intrinsic image and the illumination intrinsic image to generate the harmonized image of the composite image; this process can be formulated as:
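The synthesis step described above can be sketched as an element-wise product of the two intrinsic images; the toy values below are illustrative assumptions, and per the later implementation note both images are taken as normalized to [0, 1].

```python
import numpy as np

# Minimal sketch of the synthesis module: the harmonized image is the
# element-wise (dot) product of the reflectivity intrinsic image R and
# the illumination intrinsic image L, both normalized to [0, 1].

def synthesize(R, L):
    assert R.shape == L.shape
    return R * L

R = np.full((256, 256, 3), 0.5)   # toy reflectivity intrinsic image
L = np.full((256, 256, 3), 0.8)   # toy illumination intrinsic image
H_bar = synthesize(R, L)          # harmonized image
```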
Based on this, specifically, as shown in fig. 5 and 6, the reflectivity image generation module includes a second encoder E_R (an encoder using a CNN network), a second serialization transformation module R_1, a second self-attention transformation module TRE_R (an encoder using a self-attention transformation network), a second inverse serialization transformation module R′_1, and a second decoder D_R (a decoder using a CNN network). Wherein:
The second encoder E_R is used for encoding the input composite image and the mask image M into a feature space to obtain a feature image, which is input into the second serialization transformation module R_1; this process can be formulated as:
Specifically, the CNN encoder E_R generates a lower-resolution feature image F ∈ R^(h×w×c) with c = 256, where H and W denote the height and width of the composite image.
The second serialization transformation module R_1 is used for serializing the feature image F ∈ R^(h×w×c) into a number of tokens and position-encoding them to obtain the input tokens; this process can be formulated as:
Specifically, the pixels of the feature image F (each corresponding to an image block of the input image) are serialized into F′ ∈ R^(hw×c), which serves as the input tokens of the second self-attention transformation module TRE_R; each input token encodes the feature value of every channel at one pixel. In addition, the position code E_r of each token is obtained from the actual coordinates of each pixel in the feature image F via fixed sine/cosine positional encoding and used as the token position input of TRE_R.
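The fixed sine/cosine position code described above can be sketched as follows, in the style of the original Transformer. A 1-D code over flattened pixel positions is an illustrative assumption; a practical 2-D variant might split the channels between row and column coordinates.

```python
import numpy as np

# Sketch of the fixed sine/cosine positional code E_r derived from each
# pixel's flattened position in the feature image F. Even channels carry
# sin, odd channels carry cos, with geometrically spaced frequencies.

def sincos_position_code(num_tokens, dim):
    pos = np.arange(num_tokens)[:, None]          # token positions (hw, 1)
    i = np.arange(0, dim, 2)[None, :]             # channel-pair indices
    angle = pos / np.power(10000.0, i / dim)
    E = np.zeros((num_tokens, dim))
    E[:, 0::2] = np.sin(angle)
    E[:, 1::2] = np.cos(angle)
    return E

E_r = sincos_position_code(num_tokens=64 * 64, dim=256)  # assumed 64x64 feature map
```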
The second self-attention transformation module TRE_R is used for performing self-attention decoupling transform encoding on the input tokens to generate the corresponding reflectivity-image output tokens; this process can be formulated as:
The second inverse serialization transformation module R′_1 is used for applying the inverse of the serialization transformation R_1 to the reflectivity-image output tokens to obtain a reflectivity intrinsic feature image of the same size as the feature image F; this process can be formulated as:
The second decoder D_R is used for decoding the reflectivity intrinsic feature image and outputting a reflectivity intrinsic image of the same size as the composite image; this process can be formulated as:
therefore, the whole process of the reflectivity image generation module can be formulated as:
Specifically, as shown in fig. 5 and 6, the background-light decoupling module includes a linear transformation module LP, a third self-attention transformation module TRE_L (an encoder using a self-attention transformation network), and a fourth self-attention transformation module TRD_L (a decoder using a self-attention transformation network). Wherein:
the linear transformation module LP is used to transform the background image(the number of channels C is 3, H, Y represents the height and width of the image, respectively, and the background image is as wide as and as high as the synthesized image) into a sequence of image blocks(number of image blocks)Size P of tiles 8), then each tile is flattened as a token and coded into C' -256 dimensional feature space by linear mapping LP (·), and fixed position coding Ep(the position coordinates of the image block in the original image are obtained by sine and cosine coding) is added into the token code to obtain a third self-attention transformation module TRELThe process can be formulated as:
The third self-attention transformation module TRE_L is used for performing self-attention transform encoding on its input tokens to generate the input tokens of the fourth self-attention transformation module; this process can be formulated as:
The fourth self-attention transformation module TRD_L is used for performing self-attention transform decoding on its input tokens to generate the corresponding background-light latent vector encoding token, which is input into the illumination transformation module; this process can be formulated as:
Here the light-encoding token sequence (of dimension d_l = 27, the spherical-harmonic coefficients) is the initial input of TRD_L, E_l denotes a learnable light position code, and the initial value of the light-encoding token is zero.
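How a zero-initialized light token plus a learnable position code can pool a latent light code from the encoded background tokens may be sketched as a single cross-attention step; the single head, single layer, and random stand-in weights are all illustrative assumptions.

```python
import numpy as np

# Sketch of TRD_L pooling the background-light code: a zero light token
# plus learnable position code E_l attends over the TRE_L output, and a
# final linear head maps to d_l = 27 spherical-harmonic coefficients.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, d_l = 256, 27
mem = np.random.rand(1024, d)            # TRE_L output (background tokens)
t_light = np.zeros((1, d))               # light token starts at zero
E_l = np.random.randn(1, d) * 0.02       # learnable light position code (stand-in)
q = t_light + E_l
attn = softmax(q @ mem.T / np.sqrt(d))   # 1 x 1024 cross-attention weights
pooled = attn @ mem                      # 1 x d pooled light feature
W_head = np.random.randn(d, d_l) * 0.02  # stand-in output head
l_bg = pooled @ W_head                   # 1 x 27 latent light code
```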
As shown in fig. 5 and 6, the illumination transformation module includes a fifth self-attention transformation module TRD_I (a decoder using a self-attention transformation network), a third inverse serialization transformation module R′_2, and a third decoder D_I (a decoder using a CNN network).
The fifth self-attention transformation module TRD_I is used for performing self-attention transform decoding on the background-light latent vector encoding token and the reflectivity-image output tokens to generate the corresponding illumination intrinsic-image output tokens; this process can be formulated as TRD_I(t_l + E_l, t_r + E_r), where t_l and E_l denote the learned light-encoding token sequence and the light position code (output by the fourth self-attention transformation module TRD_L), respectively, while t_r denotes the reflectivity intrinsic-image token sequence with position code E_r.
The third inverse serialization transformation module R′_2 is used for applying the inverse of the serialization transformation R_1 to the illumination intrinsic-image output tokens TRD_I(t_l + E_l, t_r + E_r) to obtain an illumination intrinsic feature image of the same size as the feature image; this process can be formulated as: φ′(TRD_I(t_l + E_l, t_r + E_r)).
The third decoder D_I is used for decoding the illumination intrinsic feature image and outputting an illumination intrinsic image of the same size as the composite image; this process can be formulated as:
It should also be noted that during training of the system, the same single norm loss function is likewise used to encourage the harmonized image of the composite image to approximate its real image H:
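The norm symbol was stripped from this page; assuming it denotes the commonly used L1 norm, the reconstruction loss can be sketched as:

```python
import numpy as np

# Sketch of the reconstruction loss (assumed L1): the mean absolute
# difference between the harmonized image H_bar and the real image H.

def l1_loss(H_bar, H):
    return np.mean(np.abs(H_bar - H))

H = np.full((256, 256, 3), 0.5)       # toy real image
H_bar = np.full((256, 256, 3), 0.6)   # toy harmonized output
loss = l1_loss(H_bar, H)
```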
In general, the decoupled image harmonization module shown in this example uses two Transformer encoders and two Transformer decoders: the encoder TRE_R takes the CNN-encoded tokens of the image blocks as input and generates the reflectivity intrinsic image; the encoder TRE_L takes the FC-encoded tokens of the image blocks as input and, together with the decoder TRD_L, generates the latent vector encoding of the background light; and the decoder TRD_I re-renders the reflectivity intrinsic image under the decoupled background light to generate the illumination intrinsic image. Finally, the reflectivity and illumination intrinsic images are dot-multiplied to generate the harmonized image.
The image harmonization system based on self-attention transformation provided by this embodiment of the invention exploits the strong long-range context modeling capability of the self-attention transformation network; it can directly generate harmonized images in a non-decoupled manner (figures 2 and 3), with a simple structure and a good harmonization effect. To achieve a better harmonization effect, a decoupled manner can be adopted: the reflectivity intrinsic image of the composite image is obtained through the reflectivity image generation module, and the illumination intrinsic image of the composite image is obtained through the background-light decoupling module and the illumination image generation module, so that the harmonized image is obtained by synthesizing the reflectivity intrinsic image and the illumination intrinsic image. This adjusts the foreground illumination to be compatible with the background illumination while keeping the semantics and structure of the composite image unchanged, solving the problem of disharmony between the foreground and background of the composite image.
The effect of the system provided in this example is verified experimentally.
The experiments use the synthetic iHarmony4 dataset and a real composite-image dataset. Experiments on the public synthetic iHarmony4 dataset analyze and evaluate the performance of the self-attention image harmonization models. The iHarmony4 dataset contains 4 sub-datasets, HCOCO, HAdobe5k, HFlickr, and Hday2night, in which each composite image is paired with a real image; this example follows the same experimental setup as DoveNet. As in the DoveNet evaluation, this example also evaluates the system's performance on a dataset of 99 real composite images.
Only the norm loss function is used as the reconstruction constraint for reflectivity and illumination, with the Adam optimizer (parameters β1 = 0.5, β2 = 0.999). The total number of training iterations is 60; the initial learning rate of the model is set to 1e-4 and decays to 1e-5 after 40 iterations. The last layers of the reflectivity decoder D_R and the illumination decoder D_I in the decoupled self-attention image harmonization model use the tanh activation function. The input image is resized to 256 × 256 for training and testing, and the model generates harmonized images of the same size.
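The training schedule above can be restated as a small sketch; the step function is a plain transcription of the stated hyperparameters, not code from the patent.

```python
# Sketch of the training schedule: Adam with beta1 = 0.5, beta2 = 0.999,
# 60 iterations total, learning rate 1e-4 decayed to 1e-5 after 40.

def learning_rate(iteration):
    return 1e-4 if iteration < 40 else 1e-5

adam_params = {"betas": (0.5, 0.999), "lr": learning_rate(0)}
schedule = [learning_rate(i) for i in range(60)]
```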
In particular, before the harmonized image is restored, the reflectivity and illumination images need to be normalized to the [0, 1] interval.
For experimental comparison, this example first constructs two classic network models for the image-to-image translation task as baselines: an encoder-decoder U-Net (E-D U-Net) and an encoder-decoder residual convolutional neural network model (E-D CNN, structured as Encoder-Resblocks-Decoder). Table 1 shows the quantitative evaluation results on the four sub-datasets and the overall iHarmony4 dataset, comparing HT, D-HC, and D-HT with the baseline models E-D U-Net and E-D CNN and the current best methods DIH, S²AM, and DoveNet; an upward arrow indicates higher is better, and a downward arrow indicates lower is better. HT is the non-decoupled self-attention image harmonization model shown in FIG. 3 (TRE with 2 attention heads and 9 attention layers), and D-HC and D-HT denote the decoupled model using CNNs and the decoupled model using Transformers, respectively (TRE and TRD with 2 heads and 9 layers, as shown in FIG. 6). The D-HC model is constructed by replacing the self-attention encoder TRE_R in D-HT with a Resblock encoder, replacing TRE_L and TRD_L with an MLP to decouple the background scene light, and replacing TRD_I with AdaIN to re-render the illumination intrinsic image. In addition, Table 1 also provides the evaluation results of the composite images against the real images as a reference (the "Composite" column).
TABLE 1
From the experimental results in Table 1, E-D (CNN) performs better than E-D (U-Net) on the HCOCO and HAdobe5k datasets and worse on the HFlickr and Hday2night datasets, probably because U-Net has a global receptive field that captures global context, but its skip connections may introduce disharmony factors into the reconstructed image, while the CNN has a limited receptive field due to its inductive bias. Overall, E-D (CNN) achieves a lower fMSE than E-D (U-Net) on the whole iHarmony4 dataset; the non-decoupled self-attention image harmonization model (HT), however, is superior not only to the two baselines E-D (U-Net) and E-D (CNN) but also to the other image harmonization methods, indicating that the Transformer's long-range context capability is very effective for the image harmonization task.
The quantitative comparisons in Table 1 show that the D-HC model achieves competitive or superior results compared to current state-of-the-art methods, demonstrating that separating and harmonizing the reflectivity and illumination intrinsic images does contribute to image harmonization. Likewise, the D-HT model has a very low fMSE score (320.78, versus 537.23 for S²AM and 541.53 for DoveNet), proving the accuracy and effectiveness of the D-HT design. In addition, D-HC performs better than HT on the Hday2night dataset, probably because of D-HC's better decoupling ability, while HT suffers from the small Hday2night training set (only 311 training images).
Fig. 7 shows the visual results of each image harmonization method (the boxed region in each composite image marks the disharmonious foreground; one example per dataset, HCOCO, HAdobe5k, HFlickr, and Hday2night from top to bottom); comparing the visual results, the harmonized images produced by the D-HT model are closest to the real images.
To analyze the impact of the number of input tokens and the encoding type on Transformer performance, this example uses a 1-head, 3-layer encoder for TRE with CNN reconstruction, and adjusts the number of tokens T via the stride S. The data in Table 2 show that, for both linear and non-linear encodings, the Transformer's performance increases steadily with the number of tokens. Furthermore, for a fixed number of tokens (e.g., 4N), the Transformer performs similarly regardless of which encoding scheme is chosen (linear FC or CONV, or non-linear MLP or CNN). It can thus be inferred that, for image harmonization, Transformer performance is sensitive to the number of tokens but insensitive to the way the tokens are encoded. Therefore, this example provides a long sequence with more tokens; even if there is redundancy between tokens, the Transformer can mine richer context, and different encoding methods all provide effective information for the image blocks.
TABLE 2
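The token-count ablation above can be sketched as a simple stride calculation; the 64 × 64 feature size is an illustrative assumption. Halving the serialization stride quadruples the token count, which is how configurations like "4N" arise.

```python
# Sketch of the ablation setup: the number of tokens T fed to the 1-head,
# 3-layer TRE is controlled by the serialization stride S over the
# h x w feature image.

def num_tokens(h, w, stride):
    return (h // stride) * (w // stride)

h, w = 64, 64
N = num_tokens(h, w, stride=2)    # baseline token count
N4 = num_tokens(h, w, stride=1)   # the "4N" configuration
```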
Experiments were further designed to verify the effect of the number of encoder (E) and decoder (D) layers of the HT-architecture self-attention transformation network on image harmonization. The fMSE↓ comparison in Table 3 shows that the Transformer performs similarly on the image harmonization task when the number of encoder layers equals the total number of encoder and decoder layers, even though the decoder has additional cross-attention layers. Therefore, this example adopts only the Transformer encoder TRE in the HT model design (as is also done for generating the reflectivity intrinsic image).
TABLE 3
To analyze the performance impact of different numbers of attention heads and layers on HT image harmonization, this example designed a further set of experiments. The quantitative comparisons in Table 4 show that more attention layers and heads help improve Transformer performance, but beyond 9 attention layers the room for further improvement is limited.
TABLE 4
An ablation study was performed on the Transformer parts of the D-HT model by replacing the Transformers of the reflectivity path and the illumination path, respectively, with the CNN structures used in the D-HC model; the quantitative comparisons in Table 5 demonstrate the superiority of the Transformer for the image harmonization task.
TABLE 5
In addition, this example performed another experiment via a foreground-mask inversion operation, i.e., exchanging the foreground and background regions of the composite image to generate an inverted mask, so that the D-HT model adjusts the background according to the foreground of the composite image to harmonize it. FIG. 8 compares the image harmonization results using a normal mask (middle row) and an inverted mask (bottom row), indicating that D-HT can produce meaningful harmonization results with any foreground mask.
This example further investigated the latent vector space of light to explore whether the Transformer can learn an illumination representation of an image. Given an image, this example uses the decoupled self-attention image harmonization (D-HT) model to obtain the latent vector encoding of its light, and arbitrarily alters that encoding so that the subsequent network produces different images. FIG. 9 shows the different output images under different lighting conditions, indicating that the background scene light learned with the Transformer encoder and decoder is accurate.
Further, this example also designed a set of combination experiments to verify scene-light learning and migration. As shown in fig. 10, two images (Source1 and Source2) are used as scene-light reference images and one image as the target image (Target) for light migration. First, the scene-light latent vector codes L_s1 and L_s2 of the two reference images are learned; then, using the formula L_t = αL_s1 + (1 − α)L_s2, different target scene-light latent codes L_t are obtained by adjusting the variable α; finally, the target code L_t is rendered onto the reflectivity feature image of the target image through the illumination migration model to generate images under different illumination. The qualitative results show that the scene-light learning and migration design of this example is effective and can also be applied to related tasks of generating images of different modalities.
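The interpolation formula above can be sketched directly; the toy 27-dimensional codes (matching the stated spherical-harmonic dimensionality) are illustrative assumptions, and the rendering step is stubbed out.

```python
import numpy as np

# Sketch of light migration: L_t = alpha * L_s1 + (1 - alpha) * L_s2
# interpolates between the latent light codes of two reference images.

def interpolate_light(L_s1, L_s2, alpha):
    return alpha * L_s1 + (1.0 - alpha) * L_s2

L_s1 = np.zeros(27)   # toy 27-dim spherical-harmonic light code
L_s2 = np.ones(27)
L_t = interpolate_light(L_s1, L_s2, alpha=0.25)
```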
Compared with the current state of the art, the B-T score is used to evaluate the image harmonization ability of the D-HT model on real composite images. The statistics in Table 6 and the visual results in fig. 11 show that the method of this example obtains the best B-T score and the best visual effect.
TABLE 6
In this embodiment, the non-decoupled HT image harmonization model is applied to the image completion task with random missing regions on the Paris StreetView dataset, verifying the practicality and extensibility of the designed HT model. The purpose of image completion is to fill the missing regions of an image by synthesizing visually realistic and semantically reasonable pixels consistent with the pixels of the known regions. Table 7 and FIG. 12 show the quantitative and visual results of the HT model versus the latest RFR-Net; by fully exploiting the Transformer's advantage in long-range modeling, the HT model demonstrates superior performance on the image completion task.
TABLE 7
This example also applies the decoupled self-attention image harmonization D-HT model to the image enhancement task on the MIT-Adobe-5K-UPE dataset, in comparison with the latest method DeepLPF. Poor lighting conditions during imaging degrade image quality, especially for underexposed images. Therefore, this example uses the D-HT model with a reconstruction loss function to decompose a low-light image into reflectivity and illumination images, and treats the reflectivity image as the enhanced image.
The quantitative comparison results in Table 8 show that D-HT is superior to DeepLPF method in PSNR, SSIM and LPIPS evaluation criteria. FIG. 13 further verifies that the D-HT model of the present example can recover the contrast, natural color, and sharp details of the image through a decoupled self-attention transform network.
TABLE 8
In summary of the experiments, this example proposes a new image harmonization method using a self-attention transformation network, aiming to eliminate the disharmony factors of composite images by exploiting the Transformer's long-range context modeling capability. This example not only establishes the non-decoupled and decoupled self-attention image harmonization frameworks (HT and D-HT), but also designs comprehensive experiments to explore and analyze the usage patterns and potential of the Transformer for image harmonization. In addition, the non-decoupled and decoupled self-attention image harmonization models are further applied to two classic computer-vision tasks, image completion and image enhancement, further illustrating the effectiveness and superiority of the designed method (D-HT model).
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (9)
1. An image harmonization system based on self-attention transform, comprising: the image harmonization module comprises a non-decoupling image harmonization module or a decoupling image harmonization module;
the non-decoupling image harmony module is used for performing direct self-attention transformation on the input synthetic image and the mask image by using a self-attention transformation network to generate a corresponding harmony image;
the decoupling image and harmonization module comprises a reflectivity image generation module, a background light decoupling module, an illumination image generation module and a synthesis module;
the reflectivity image generation module is used for carrying out decoupling self-attention transformation on an input composite image and a mask image to generate a reflectivity intrinsic image of the composite image;
the background light decoupling module is used for decoupling background light from a background image of the synthetic image by using a self-attention transformation network so as to irradiate the background light on the reflectivity intrinsic image;
the illumination conversion module is used for further generating an illumination intrinsic image for the reflectivity intrinsic image irradiated with the background light by using a self-attention conversion network;
the synthesis module is used for carrying out point multiplication operation on the reflectivity intrinsic image and the illumination intrinsic image to generate a harmonious image of the synthetic image.
2. The self-attention transform-based image harmony system of claim 1, wherein: the non-decoupling image harmonization module comprises a first encoder, a first serialization transformation module, a first self-attention transformation module, a first serialization inverse transformation module and a first decoder;
the first encoder is used for encoding the input composite image and the mask image into a feature space to obtain a feature image and inputting the feature image into the first serialization transformation module;
the first serialization transformation module carries out serialization transformation on the input characteristic image to generate an input token of the first self-attention transformation module;
the first self-attention transformation module is used for performing direct self-attention transformation on the input token generated by the first serialization transformation module to obtain an output token which is input into the first serialization inverse transformation module;
the first serialization inverse transformation module is used for carrying out serialization inverse transformation on the input output token to generate a harmony characteristic image;
the first decoder is configured to decode the harmonious feature image into a harmonious image corresponding to the composite image.
3. The self-attention transform-based image harmony system of claim 2, wherein: the reflectivity image generation module comprises a second encoder, a second serialization transformation module, a second attention transformation module, a second serialization inverse transformation module and a second decoder;
the second encoder is used for encoding the input composite image and the mask image into a feature space to obtain a feature image and inputting the feature image into the second serialization transformation module;
the second serialization transformation module carries out serialization transformation on the input characteristic image to generate an input token of the second self-attention transformation module;
the second self-attention transformation module is used for performing decoupled self-attention transformation on the input token generated by the second serialization transformation module to obtain a reflectivity image output token and inputting the reflectivity image output token into the second serialization inverse transformation module and the illumination transformation module;
the second serialization inverse transformation module is used for carrying out serialization inverse transformation on the input output token to generate a reflectivity intrinsic characteristic image;
the second decoder is configured to decode the reflectance intrinsic feature image into a reflectance intrinsic image corresponding to the composite image.
4. The self-attention transform-based image harmony system of claim 3, wherein: the background light decoupling module comprises a linear transformation module, a third self-attention transformation module and a fourth self-attention transformation module;
the linear transformation module is used for dividing an input background image into an image block sequence, then flattening each image block to serve as a token and encoding the token into a feature space through linear mapping to generate an input token of the third self-attention transformation module;
the third self-attention transformation module is used for performing self-attention transformation coding on the input token of the third self-attention transformation module to generate the input token of the fourth self-attention transformation module;
the fourth self-attention transformation module is used for performing self-attention transformation decoding on the input token of the fourth self-attention transformation module, generating a background light hidden vector coding token and inputting the background light hidden vector coding token into the illumination transformation module.
5. The self-attention transform-based image harmony system of claim 4, wherein: the illumination transformation module comprises a fifth self-attention transformation module, a third sequence inverse transformation module and a third decoder;
the fifth self-attention transformation module is used for performing self-attention transformation on the background light hidden vector encoding token and the reflectivity image output token to generate a corresponding illumination intrinsic image output token;
the third serialization inverse transformation module is used for carrying out serialization inverse transformation on the illumination intrinsic image output token to generate an illumination intrinsic characteristic image corresponding to the synthesized image;
and the third decoder is used for decoding the illumination intrinsic characteristic image and outputting an illumination intrinsic image corresponding to the synthesized image.
6. The self-attention transform-based image harmony system of claim 5, wherein: in the training process, both the non-decoupling image harmonization module and the decoupling image harmonization module adopt a single norm loss function to excite the harmonious image of the composite image to approximate its true image.
7. The self-attention transform-based image harmony system of claim 6, wherein:
the first encoder and the second encoder are both encoders of a CNN network, and the first decoder, the second decoder and the third decoder are all decoders of the CNN network.
8. The self-attention transform-based image harmony system of claim 7, wherein: the first self-attention transformation module, the second self-attention transformation module and the third self-attention transformation module all adopt encoders TREs of self-attention transformation networks, and the fourth self-attention transformation module and the fifth self-attention transformation module all adopt decoders TRDs of self-attention transformation networks;
TRE consists of a stack of structurally identical layers, each of which contains a sublayer with a multi-headed self-attention mechanism and a feed-forward network sublayer, TRE being intended to output a self-attention map based on modeling the dependencies between input tokens (image patches);
the TRD is also made up of a stack of multiple identically structured layers, where each layer, in addition to two sublayers identical to TRE, has a third encoder-decoder cross-attention sublayer that performs a multi-head attention operation on the TRE output and the TRD itself; the TRD is directed to learning a mapping from a source domain to a target domain, generating a feature matrix associated with a task.
9. The self-attention transform-based image harmony system of claim 8, wherein: the first self-attention transformation module, the second self-attention transformation module, the third self-attention transformation module, the fourth self-attention transformation module and the fifth self-attention transformation module all adopt 2 attention heads and 9 attention layers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111067167.6A CN113689328B (en) | 2021-09-13 | 2021-09-13 | Image harmony system based on self-attention transformation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111067167.6A CN113689328B (en) | 2021-09-13 | 2021-09-13 | Image harmony system based on self-attention transformation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113689328A true CN113689328A (en) | 2021-11-23 |
CN113689328B CN113689328B (en) | 2024-06-04 |
Family
ID=78586147
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111067167.6A Active CN113689328B (en) | 2021-09-13 | 2021-09-13 | Image harmony system based on self-attention transformation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113689328B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115713680A (en) * | 2022-11-18 | 2023-02-24 | 山东省人工智能研究院 | Semantic guidance-based face image identity synthesis method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111523534A (en) * | 2020-03-31 | 2020-08-11 | 华东师范大学 | Image description method |
CN113076809A (en) * | 2021-03-10 | 2021-07-06 | 青岛海纳云科技控股有限公司 | High-altitude falling object detection method based on visual Transformer |
CN113192055A (en) * | 2021-05-20 | 2021-07-30 | 中国海洋大学 | Harmonious method and model for synthesizing image |
CN113269792A (en) * | 2021-05-07 | 2021-08-17 | 上海交通大学 | Image post-harmony processing method, system and terminal |
CN113344807A (en) * | 2021-05-26 | 2021-09-03 | 商汤集团有限公司 | Image restoration method and device, electronic equipment and storage medium |
-
2021
- 2021-09-13 CN CN202111067167.6A patent/CN113689328B/en active Active
Non-Patent Citations (5)
Title |
---|
HANTING CHEN et al.: "Pre-Trained Image Processing Transformer", arXiv preprint, pages 1-15 *
LAN Hong; LIU Qinyi: "Scene graph-to-image generation model with graph attention networks", Journal of Image and Graphics, no. 08, 12 August 2020 *
WANG Junhao; LUO Yifeng: "Enriching image captions via fine-grained semantic features and Transformer", Journal of East China Normal University (Natural Science), no. 05, 25 September 2020 *
Similar Documents
Publication | Title |
---|---|
CN111539887B (en) | Channel attention mechanism and layered learning neural network image defogging method based on mixed convolution |
Chen et al. | DARGS: Image inpainting algorithm via deep attention residuals group and semantics |
Ning et al. | Accurate and lightweight image super-resolution with model-guided deep unfolding network |
CN109376830A (en) | Two-dimensional code generation method and device |
CN113934890B (en) | Method and system for automatically generating scene video by characters |
CN111915693A (en) | Sketch-based face image generation method and system |
Xin et al. | Residual attribute attention network for face image super-resolution |
CN113192055B (en) | Harmonious method and model for synthesizing image |
CN111986105B (en) | Video time sequence consistency enhancing method based on time domain denoising mask |
CN117597703A (en) | Multi-scale converter for image analysis |
KR20240065281A (en) | Vector-quantized image modeling |
CN115457043A (en) | Image segmentation network based on overlapped self-attention deformer framework U-shaped network |
CN113689328B (en) | Image harmony system based on self-attention transformation |
Esmaeilzehi et al. | SRNHARB: A deep light-weight image super resolution network using hybrid activation residual blocks |
CN113537246A (en) | Gray level image simultaneous coloring and hyper-parting method based on counterstudy |
CN117474800A (en) | Image defogging method of full convolution decoder based on channel converter |
CN112686830A (en) | Super-resolution method of single depth map based on image decomposition |
CN113781376B (en) | High-definition face attribute editing method based on divide-and-congress |
Yang et al. | Deep 3d modeling of human bodies from freehand sketching |
CN115660979A (en) | Attention mechanism-based double-discriminator image restoration method |
CN113780209A (en) | Human face attribute editing method based on attention mechanism |
Ni et al. | Natural Image Reconstruction from fMRI Based on Self-supervised Representation Learning and Latent Diffusion Model |
Wen et al. | Mrft: Multiscale recurrent fusion transformer based prior knowledge for bit-depth enhancement |
Chang et al. | 3D hand reconstruction with both shape and appearance from an RGB image |
Peng | Efficient Neural Light Fields (ENeLF) for Mobile Devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||