CN113160101A - Method for synthesizing high-simulation image - Google Patents

Method for synthesizing high-simulation image Download PDF

Info

Publication number
CN113160101A
CN113160101A (application CN202110401470.9A / CN202110401470A)
Authority
CN
China
Prior art keywords
image
data set
original image
intermediate image
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110401470.9A
Other languages
Chinese (zh)
Other versions
CN113160101B (en)
Inventor
金枝
张欢荣
齐银鹤
庞雨贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202110401470.9A priority Critical patent/CN113160101B/en
Publication of CN113160101A publication Critical patent/CN113160101A/en
Application granted granted Critical
Publication of CN113160101B publication Critical patent/CN113160101B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/04 - Context-preserving transformations, e.g. by using an importance map
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a method for synthesizing a high-simulation image, which comprises the following steps: constructing an original image data set and a target data set, and training on the original image data set and the target data set to obtain an unpaired unsupervised style conversion network model; acquiring an image to be processed in the original image data set, and generating a target image with the style of the target data set through the unpaired unsupervised style conversion network model. The model training step comprises: acquiring a first original image in the original image data set and converting it into a first intermediate image under the format of the target data set; and obtaining a second original image from the first intermediate image through information growth recovery or information reverse modification. The method saves a large amount of time and labor cost, makes the image data more usable, and can be widely applied in the technical field of image processing.

Description

Method for synthesizing high-simulation image
Technical Field
The invention relates to the technical field of image processing, in particular to a method for synthesizing a high-simulation image.
Background
In the prior art, supervised-learning-based convolutional neural networks (CNNs) have shown remarkable modeling capabilities in various high-level vision tasks, such as object detection and object segmentation. Most images in existing image data sets are clear, clean and bright, which can cause networks trained on these data sets to perform poorly under poor visual conditions, such as low scene brightness or rain and fog, because such conditions impair the visibility of the images and distort the structure, texture and color of objects in them. In order to improve the robustness of networks for object detection, recognition and similar tasks, researchers have begun to train networks on images captured under poor visual conditions, hoping to improve network performance by learning such scenes. In addition, attempts have also been made to enhance the images before training the network, for example by removing rain and fog, so as to improve the visibility of the images to be detected. Training a network on images under poor visual conditions requires a labeled image data set under poor visual conditions. Although collecting images under poor visual conditions is not difficult, labeling them is difficult and unreliable because of their poor visibility.
To address this, the related art proposes a way of manufacturing such a data set: a clear image is converted into an image under poor visual conditions, so that while the image under poor visual conditions is obtained, the original label of the clear image can be transferred to it. However, for this kind of image-to-image conversion learning, it is essential to obtain a sufficient number of paired images of the same scene under different visual conditions, and collecting enough image pairs, such as a high-brightness image and a low-brightness image of the same scene, or a rain-and-fog-free image and an image with rain and fog, takes a great deal of time and effort. In addition, since outdoor scenes change frequently over time, it is difficult to ensure that the content of photographs collected under different conditions, such as day and night, or with and without rain, is completely consistent, so acquiring paired images is very difficult and impractical.
Disclosure of Invention
In view of the above, to at least partially solve one of the above technical problems, an embodiment of the present invention provides a method for synthesizing a high-simulation image based on an unsupervised approach.
The technical scheme of the application provides a method for synthesizing a high-simulation image, which comprises the following steps: constructing an original image data set and a target data set, and training the original image data set and the target data set to obtain an unpaired unsupervised style conversion network model;
acquiring an image to be processed in an original image data set, and generating a target image with the style of the target data set through the unpaired unsupervised style conversion network model;
the step of training the original image data set and the target data set to obtain an unpaired unsupervised style conversion network model includes:
acquiring a first original image in the original image data set, and converting the first original image into a first intermediate image under the format of a target data set;
and obtaining a second original image by the first intermediate image through information growth recovery or information reverse modification.
In a possible embodiment of the present disclosure, the first intermediate image under the format of the target data set is a first intermediate image under poor visual conditions, and the step of converting the first original image into the first intermediate image under the format of the target data set includes removing the object texture information of the first original image through a UNet network and introducing visual interference information to generate the first intermediate image under poor visual conditions.
In a possible embodiment of the present disclosure, the step of obtaining the second original image by performing information growth restoration or information reverse modification on the first intermediate image includes extracting image mask information from the first original image, and obtaining the second original image by performing AttUNet network restoration on the basis of the image mask information and the first intermediate image.
In one possible embodiment of the solution of the present application, the UNet network comprises an encoder and a decoder;
the encoder is used for extracting the multi-scale features of the first original image and down-sampling the multi-scale features to obtain a feature map;
the decoder is used for performing up-sampling on the feature map and performing feature fusion on the up-sampled feature map to obtain the first intermediate image under the poor visual condition.
In one possible embodiment of the solution of the present application, the AttUNet network comprises an attention block and a residual block;
the attention block is used for generating an edge map according to the image mask information, generating a scale map and a displacement map through parallel inference according to the edge map, and carrying out scaling displacement according to the scale map and the displacement map to obtain a first feature map;
and the residual block is used for carrying out deconvolution according to the first feature mapping to obtain the second original image.
In a possible embodiment of the solution of the present application, the first intermediate image under the target dataset format comprises at least one of: a first intermediate image in poor vision conditions and a first intermediate image in non-poor vision conditions; the step of training the original image data set and the target data set to obtain an unpaired unsupervised style conversion network model comprises at least one of the following steps:
determining, by a discriminator, a probability that the first intermediate image is a first intermediate image under poor visual conditions;
or
Determining, by a discriminator, a probability that the first intermediate image is a first intermediate image in non-poor visual conditions.
In a possible embodiment of the scheme of the application, the adversarial loss, the cyclic pixel loss and the cyclic perceptual loss of the unpaired unsupervised style conversion network model are obtained;
and the unpaired unsupervised style conversion network model is updated according to the sum of the adversarial loss, the cyclic pixel loss and the cyclic perceptual loss.
In a possible embodiment of the solution of the present application, the adversarial loss is obtained by:
determining the adversarial loss based on a difference between the first intermediate image and the first original image.
In a possible embodiment of the solution of the present application, the cyclic pixel loss is obtained by:
and determining the cyclic pixel loss according to the difference of the first original image and the second original image in the spatial domain.
In a possible embodiment of the solution of the present application, the cyclic perceptual loss is obtained by:
and obtaining the similarity of the characteristic domains of the first original image and the second original image by pre-training a VGG network model to obtain the cycle perception loss.
Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:
according to the technical scheme, the image data sets are collected, the images under a part of specified conditions are collected as the target data sets, the two data sets are formed by being separated, the unpaired style conversion network model is trained on the two data sets, the trained model can be synthesized on any input image to obtain the image with a specific style, an image acquisition link and a manual labeling link which are required by a severe environment are avoided, a large amount of time cost and labor cost are saved, and the usability of the image data is higher.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart illustrating steps of a method for synthesizing a high-simulation image according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an unpaired style conversion network architecture according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a UNet network architecture according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an AttUNet network architecture in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a discriminator according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
First, the terms referred to in the technical section of the present application will be explained:
UNet networks are neural networks improved on the basis of Fully Convolutional Networks (FCNs).
The AttUNet network is a UNet network after an Attention (Attention) mechanism is introduced.
The VGG network is a deep convolutional neural network developed jointly by the computer vision group (Visual Geometry Group) of the University of Oxford and researchers from Google DeepMind.
The technical scheme of the application provides a method for synthesizing a high-simulation data set based on unsupervised means; target effects can be synthesized onto an image, such as synthesizing rain or fog in a scene, or switching a scene from day to night.
In a first aspect, as shown in fig. 1, the technical solution of the present application provides an embodiment of a method for synthesizing a high simulation image, wherein the method includes steps S100-S200:
s100, constructing an original image data set and a target data set, and training the original image data set and the target data set to obtain an unpaired unsupervised style conversion network model;
illustratively, the original image dataset collects a sorted clear image (clearimages) in advance as an image dataset, and the target dataset collects a part of images (targetimages) under a specified condition, which may be a relatively poor visual condition such as rainy, foggy, or nighttime, as a target dataset while collecting the image dataset. In particular, the collected image dataset and the image in the target dataset do not require a one-to-one correspondence, i.e. the two datasets may be collected separately.
S200, acquiring an image to be processed in an original image data set, and generating a target image with a target data set style through an unpaired unsupervised style conversion network model;
specifically, the unpaired style conversion network is trained on the two data sets collected in this step S100, and the trained network can synthesize a specified effect on an arbitrary input image. Taking the example of the conversion of the images in the day and at night, firstly, some images shot in the day of any scene are collected, and then some images shot at night are collected to train the network. The tagged daytime image dataset is input into the trained network, and the corresponding night image in the same scene can be obtained, so that a paired 'day-night' image dataset is obtained.
Referring to fig. 2, the unpaired conversion network is composed of two image-to-image conversion cycle networks: Cycle C-T-C and Cycle T-C-T. Both cycle networks contain a UNet network and an AttUNet network; the difference is that UNet comes first in one cycle network and last in the other.
In an embodiment, the first intermediate image under the target data set format may be the first intermediate image under poor visual conditions, and step S200 may further comprise the following refining step: S210, acquiring a first original image in the image data set, and converting the first original image into a first intermediate image under poor visual conditions; and obtaining a second original image from the first intermediate image through information growth recovery or information reverse modification.
Referring to fig. 2, the first original image is the original image c in the image data set, the first intermediate image is the high-simulation composite image, denoted t̂ below, obtained by passing the original image through the UNet network, and the second original image is the image ĉ restored from t̂ by the AttUNet network.
Illustratively, in Cycle C-T-C, the original image c is first converted by the UNet network into an image t̂ under poor visual conditions; the image t̂ and the mask are then restored into the original image ĉ by the AttUNet network. When a simulation-effect image is synthesized from an original image, for example a rainy, foggy or night effect, the original object texture information is lost and visual interference information is introduced, so that the visual condition of the image becomes worse; this is an information-loss process, whereas restoring the image is an information-growth process. Similarly to Cycle C-T-C, Cycle T-C-T first converts t through the AttUNet network to obtain a synthesized source-domain image; compared with t, this synthesized image has better visual conditions, and the target-domain image is then recovered from it through the UNet network. The difference is that the mask used in Cycle T-C-T is obtained from the image under poor visual conditions, so that image has to be preprocessed to obtain the mask; taking the synthesized night effect as an example, the brightness of the low-brightness image must first be increased before the edge map is extracted. Finally, the style of the extracted target-domain image includes, but is not limited to, image features such as the brightness, saturation, white balance and color temperature of the picture; according to this style, the image features extracted from the original image are adjusted or synthesized to obtain a composite image with the corresponding style; for example, an original daytime image is converted into a night image.
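A minimal sketch of how the two cycles could be wired together, assuming `unet` plays the role of G_C-T, `att_unet` the role of G_T-C, and `extract_mask`/`preprocess` stand in for the application-specific mask extraction and preprocessing (all interface details are assumptions):

```python
# Forward passes of the two conversion cycles described above.
def cycle_c_t_c(unet, att_unet, c, extract_mask):
    mask_c = extract_mask(c)            # mask taken from the clear source image
    t_hat = unet(c)                     # information-loss step: clear -> poor visual condition
    c_hat = att_unet(t_hat, mask_c)     # information-growth step: restore the clear image
    return t_hat, c_hat

def cycle_t_c_t(unet, att_unet, t, extract_mask, preprocess):
    # In Cycle T-C-T the mask comes from the poor-visual-condition image, which is
    # preprocessed first (e.g. brightened before edge extraction for the night effect).
    mask_t = extract_mask(preprocess(t))
    c_tilde = att_unet(t, mask_t)       # synthesized source-domain (clear) image
    t_tilde = unet(c_tilde)             # recovered target-domain image
    return c_tilde, t_tilde
```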
In some alternative embodiments, the UNet network in the cyclic network of image-to-image conversion is composed of an encoder and a decoder;
in particular, the cyclic network framework in an embodiment comprises two generators (G)C-TAnd GT-C). Wherein G isC-TAnd the method is responsible for converting a clear and bright image into an image under poor visual conditions by using UNet. UNet is 2015 "U-net: the network architecture proposed in the document "public networks for biological image segmentation" has a strong capability in semantic segmentation, image restoration and image enhancement applications, and the structure of the UNet is shown in fig. 3, and the UNet network structure is symmetrical and is called UNet because it looks like the english letter U. Wherein the box part represents a feature map; the dotted pattern arrows represent the 3x3 convolution and activation function ReLU for feature extraction; the slash pattern arrow represents the upsampling process for recovering the dimensionality; the grid arrows represent a 1 × 1 convolution for the output result.
More specifically, the UNet network in the embodiment is an Encoder-Decoder (Encoder-Decoder) structure; as shown in fig. 3, the left half of the illustrated network architecture of fig. 3 is an Encoder (Encoder) for extracting multi-scale features, which is composed of two 3 × 3 convolutional layers (ReLU) and 2 × 2 maximum pooling layers (stride 2) repeated four times, and the number of channels is doubled each time downsampling is performed; in the right half of the illustrated network architecture of fig. 3 is a Decoder (Decoder) for image synthesis, which is composed of a 2x2 upsampled convolutional layer (ReLU), a connection layer for cropping the output feature map of the corresponding Encoder layer and then adding the upsampled result of the Decoder layer, and two 3x3 convolutional layers (ReLU) repeated four times; the structure fuses the local features and the overall features, enlarges the influence of the overall style on the local, and adds the influence of global scenes, lighting conditions, texture information and the like into the local features. Then, an UNet expansion path is executed on the fused feature maps, that is, upsampling is continuously performed, the upsampled feature maps obtained in each level are fused with the feature maps of the corresponding compression path levels, and finally, a poor visual condition image with the same size as the original image is obtained through a 1 × 1 convolutional layer. It will be appreciated that the UNet network architecture in the two image-to-image conversion loop networks is the same, and the functions and steps performed by the UNet network architecture are the same.
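A compact PyTorch sketch of such an encoder-decoder UNet, with illustrative channel widths and bilinear upsampling (the embodiment does not fix these choices):

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class UNet(nn.Module):
    """Four downsampling stages (channels doubled each time), symmetric decoder with
    skip connections, final 1x1 convolution. Assumes input H and W divisible by 16."""
    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        widths = [base, base * 2, base * 4, base * 8]
        self.encoders = nn.ModuleList()
        ch = in_ch
        for w in widths:
            self.encoders.append(double_conv(ch, w))
            ch = w
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(widths[-1], widths[-1] * 2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.decoders = nn.ModuleList(
            [double_conv(w * 2 + w, w) for w in reversed(widths)])  # upsampled + skip channels
        self.head = nn.Conv2d(base, out_ch, 1)   # 1x1 conv producing the styled image

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = torch.cat([self.up(x), skip], dim=1)   # fuse local and global features
            x = dec(x)
        return torch.sigmoid(self.head(x))
```

In the terminology above, an instance of this module would serve as G_C-T inside Cycle C-T-C and as the recovery branch inside Cycle T-C-T.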
In some alternative embodiments, the AttUNet network in the image-to-image conversion cycle network is composed of several attention blocks and residual blocks;
Specifically, of the two generators (G_C-T and G_T-C) in the embodiment, the other generator G_T-C uses AttUNet to convert the image under poor visual conditions and the associated mask, taken as input, into a clear and bright image. The AttUNet architecture is modified from the UNet architecture by introducing an attention mechanism on the basis of the UNet network; its specific structure is shown in fig. 4. In the overall AttUNet architecture of the embodiment, the input is an image t under severe visual conditions together with the extracted mask, and the output is a clear and bright image c. AttUNet consists of a series of attention blocks (Att Block) and residual blocks (Rec Block), which are used to process features on different scales. Taking the n-th attention block and the n-th residual block as an example, f_n denotes the input of the n-th attention block or the n-th residual block, and mask_n denotes the additional input of the n-th attention block, i.e. the edge map. Through four parallel inference branches, the attention block generates two scale maps and two shift maps from mask_n; each branch consists of a 1×1 convolutional layer, an activation function (ReLU) and a 1×1 convolutional layer. Using the first scale map and the first shift map, the n-th attention block scales and shifts the input f_n to obtain a new set of feature maps; a 3×3 convolutional layer, a batch normalization layer and an activation function (ReLU) then generate from these feature maps an output feature map representing the new style. Similarly, the next network layer uses the second scale map and the second shift map to scale and shift this output, and passing the result through a 3×3 convolutional layer, a batch normalization layer and an activation function (ReLU) yields the final output of the n-th layer. A 4×4 convolutional layer with a stride of 2 is connected after each attention block to halve the size of the feature map. Each residual block simply consists of two units, each composed of a 3×3 convolutional layer, a batch normalization layer and an activation function (ReLU).
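An illustrative PyTorch sketch of one attention block and one residual block as described above; the channel widths, the assumption that mask_n has already been projected to the feature channel count, and the residual skip connection are assumptions of this sketch:

```python
import torch.nn as nn

def branch(ch):
    # One parallel inference branch: 1x1 conv -> ReLU -> 1x1 conv.
    return nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 1))

def conv_bn_relu(ch):
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

class AttentionBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.scale1, self.shift1 = branch(ch), branch(ch)   # first scale/shift pair
        self.scale2, self.shift2 = branch(ch), branch(ch)   # second scale/shift pair
        self.conv1, self.conv2 = conv_bn_relu(ch), conv_bn_relu(ch)
        self.down = nn.Conv2d(ch, ch, 4, stride=2, padding=1)  # 4x4, stride 2: halves resolution

    def forward(self, f_n, mask_n):
        # mask_n is assumed to have the same channel count and resolution as f_n.
        f = self.conv1(f_n * self.scale1(mask_n) + self.shift1(mask_n))
        f = self.conv2(f * self.scale2(mask_n) + self.shift2(mask_n))
        return self.down(f)

class ResidualBlock(nn.Module):
    """Two conv(3x3)-BN-ReLU units; the skip connection is an assumption of this sketch."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(conv_bn_relu(ch), conv_bn_relu(ch))

    def forward(self, x):
        return x + self.body(x)
```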
In some possible embodiments, the first intermediate image under the target data set format includes: at least one of the first intermediate image under poor visual condition and the first intermediate image under non-poor visual condition, so in step S200, the process of obtaining the unpaired style conversion network model through training the image data set and the target data set includes:
s220, determining the probability that the first intermediate image is the first intermediate image under the poor visual condition through a discriminator; or
Determining, by a discriminator, a probability that the first intermediate image is a first intermediate image under non-poor visual conditions;
illustratively, the recurrent network framework in an embodiment further comprises two discriminators (D)CAnd DT) Wherein D isCThe probability that the sample is a sharp bright image, D, is estimatedTThe probability that the sample is an image of poor visual conditions is estimated and the structure of the two discriminators is shown in fig. 5, where,
Figure BDA0003020500780000071
denotes the convolution layer k × k with a step size of s, NiRepresenting a normalization layer and a an activation function ReLU. The arbiter uses a CNN network that performs the generation of the picture type decision through 5 hierarchies.
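A minimal sketch of such a 5-level convolutional discriminator; the kernel sizes, strides, channel widths and the choice of instance normalization are illustrative assumptions, since the exact values are fixed only in the figure:

```python
import torch.nn as nn

def discriminator(in_ch=3, base=64):
    def level(cin, cout, stride, norm=True):
        layers = [nn.Conv2d(cin, cout, 4, stride=stride, padding=1)]
        if norm:
            layers.append(nn.InstanceNorm2d(cout))
        layers.append(nn.ReLU(inplace=True))
        return layers

    return nn.Sequential(
        *level(in_ch, base, 2, norm=False),   # level 1
        *level(base, base * 2, 2),            # level 2
        *level(base * 2, base * 4, 2),        # level 3
        *level(base * 4, base * 8, 1),        # level 4
        nn.Conv2d(base * 8, 1, 4, stride=1, padding=1))  # level 5: per-patch real/fake score map
```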
In some possible embodiments, the process of training the image dataset and the target dataset to obtain the unpaired style conversion network model in step S200 may further include steps S230-S240:
s230, acquiring the countermeasure loss, the cyclic pixel loss and the cyclic perception loss of the unpaired style conversion network model;
and S240, optimizing the unpaired style conversion network model according to the countermeasure loss, the cyclic pixel loss and the cyclic perception loss.
Specifically, in order to learn the proposed network stably, the embodiment employs multiple losses in the training process, including the adversarial loss, the cyclic pixel loss and the cyclic perceptual loss. The left network architecture Cycle C-T-C in fig. 2 is taken as the example below; Cycle T-C-T is treated in the same way. The total loss of the generators and discriminators takes into account the authenticity of the generated simulation image t̂ as well as the difference between the input clear bright image c and the reconstructed clear bright image ĉ in both the spatial domain and the feature domain, and can be written as a weighted sum of the loss components:

L_total(G, D) = α·L_adv(G_C-T, D_T) + β·L_adv(G_T-C, D_C) + γ·L_cycle_pix(G) + δ·L_cycle_per(G)

where G represents a generator and D represents a discriminator; L_total represents the total loss, L_adv the adversarial loss, L_cycle_pix the cyclic pixel loss and L_cycle_per the cyclic perceptual loss; α, β, γ and δ are the balance coefficients of the corresponding loss components.
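A one-line sketch of this weighted combination; the default weight values are placeholders, not values given in the embodiment:

```python
# Weighted total loss; the individual loss terms are computed elsewhere.
def total_loss(l_adv_t, l_adv_c, l_cycle_pix, l_cycle_per,
               alpha=1.0, beta=1.0, gamma=10.0, delta=1.0):
    # Default weights are illustrative placeholders only.
    return alpha * l_adv_t + beta * l_adv_c + gamma * l_cycle_pix + delta * l_cycle_per
```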
In some possible embodiments, the adversarial loss may be derived from the difference between the first intermediate image and the first original image;
Specifically, the discriminator D_T needs to learn to judge whether a sample is a simulated image; therefore, an adversarial loss based on D_T is adopted to reduce the difference between the generated simulation image t̂ and a real image t under poor visual conditions. The adversarial loss of the generator G_C-T and the discriminator D_T is as follows:

L_adv(G_C-T, D_T) = E_t[log D_T(t)] + E_c[log(1 - D_T(G_C-T(c)))]

where E denotes the expectation.
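A sketch of this adversarial loss in the binary cross-entropy form written above (a least-squares variant would be an equally plausible reading); D_T is assumed to output raw logits:

```python
import torch
import torch.nn.functional as F

def adversarial_loss_d(d_t, t_real, t_fake):
    # D_T should score real poor-visual-condition images as 1 and synthesized ones as 0.
    real_logits = d_t(t_real)
    fake_logits = d_t(t_fake.detach())   # detach: do not backpropagate into the generator here
    real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return real + fake

def adversarial_loss_g(d_t, t_fake):
    # G_C-T is rewarded when D_T mistakes its output t_hat for a real image.
    logits = d_t(t_fake)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```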
In some possible embodiments, the cyclic pixel loss may be determined based on the difference between the first original image and the second original image in the spatial domain;
Specifically, in Cycle C-T-C, the reconstructed bright image ĉ should be as close as possible to the input original clear bright image c. Therefore, the embodiment applies an L1 loss function to constrain the similarity of the two images in the spatial domain. Letting H, W and C denote the height, width and number of channels of c and ĉ, the cyclic pixel loss is as follows:

L_cycle_pix = (1 / (H·W·C)) · Σ_{h,w,c} | c_{h,w,c} - ĉ_{h,w,c} |

where c_{h,w,c} denotes the pixel intensity at the corresponding row, column and channel of image c, and likewise for ĉ_{h,w,c}.
In some possible embodiments, the cyclic perceptual loss is determined by measuring the similarity of the first original image and the second original image in the feature domain through a pre-trained VGG network model;
Specifically, the perceptual loss, also known as the feature loss, is used to constrain the similarity of the features of two images. In the network architecture provided by the embodiment, it is applied in Cycle C-T-C, i.e. to the images c and ĉ. Unlike the cyclic pixel loss, the cyclic perceptual loss constrains the similarity of c and ĉ in the feature domain obtained from a pre-trained VGG network. The cyclic perceptual loss function is defined as follows:

L_cycle_per = || φ_{i,j}(c) - φ_{i,j}(ĉ) ||_2^2

where φ(c) denotes a feature map extracted from c by a pre-trained VGG-19 network without batch normalization, and φ_{i,j}(c) denotes the feature map obtained from the i-th convolutional layer before the j-th max pooling layer.
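A sketch of the cyclic perceptual loss using a fixed, pre-trained VGG-19 from torchvision; the layer cut-off and the squared-L2 distance are illustrative choices of (i, j) and norm, and inputs are assumed to be normalized as the VGG network expects:

```python
import torch
import torch.nn as nn
from torchvision import models

class CyclePerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)  # older torchvision: pretrained=True
        self.phi = vgg.features[:36].eval()          # up to the last ReLU before the 5th max-pool (example choice)
        for p in self.phi.parameters():
            p.requires_grad_(False)                  # the VGG feature extractor stays fixed

    def forward(self, c: torch.Tensor, c_hat: torch.Tensor) -> torch.Tensor:
        # Mean squared difference between the two feature maps.
        return torch.mean((self.phi(c) - self.phi(c_hat)) ** 2)
```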
In a second aspect, the present disclosure may further provide a system for synthesizing a high simulation image, including at least one processor; at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor is caused to execute a method of synthesizing a high simulation image as in the first aspect.
An embodiment of the present invention further provides a storage medium storing a program, where the program is executed by a processor to implement the method in the first aspect.
From the above specific implementation process, it can be concluded that, compared with the prior art, the technical solution provided by the present invention has the following advantages:
1) According to the technical scheme, a high-simulation data set is synthesized by learning from a small data set, and during synthesis the labels carried by the real data set serving as the source domain can be transferred to the synthesized high-simulation data set of images under severe conditions, so that the step of collecting large-scale multi-scene image pairs and the step of manually labeling large-scale image data sets are both saved, which saves time, manpower and material resources;
2) While converting clear images into degraded images, the technical scheme of the application can also convert degraded images into clear images, which can be used to improve image quality.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for synthesizing a high-simulation image, characterized by comprising the following steps:
constructing an original image data set and a target data set, and training the original image data set and the target data set to obtain an unpaired unsupervised style conversion network model;
acquiring an image to be processed in an original image data set, and generating a target image with the style of the target data set through the unpaired unsupervised style conversion network model;
the step of training the original image data set and the target data set to obtain an unpaired unsupervised style conversion network model includes:
acquiring a first original image in the original image data set, and converting the first original image into a first intermediate image under the format of a target data set;
and obtaining a second original image by the first intermediate image through information growth recovery or information reverse modification.
2. The method for synthesizing a high-simulation image according to claim 1, wherein the first intermediate image under the format of the target data set is a first intermediate image under poor visual conditions, and the step of converting the first original image into the first intermediate image under the format of the target data set comprises:
and removing the object texture information of the first original image through a UNet network and introducing visual interference information to generate a first intermediate image under the poor visual condition.
3. The method for synthesizing a high-simulation image according to claim 1, wherein the step of obtaining the second original image from the first intermediate image through information growth recovery or information reverse modification comprises:
and extracting image mask information from the first original image, and recovering the first intermediate image through an AttUNet network according to the image mask information and the first intermediate image to obtain the second original image.
4. The method for synthesizing a high-simulation image according to claim 2, wherein the UNet network comprises an encoder and a decoder;
the encoder is used for extracting the multi-scale features of the first original image and down-sampling the multi-scale features to obtain a feature map;
the decoder is used for performing up-sampling on the feature map and performing feature fusion on the up-sampled feature map to obtain the first intermediate image under the poor visual condition.
5. The method for synthesizing a high-simulation image according to claim 3, wherein the AttUNet network comprises an attention block and a residual block;
the attention block is used for generating an edge map according to the image mask information, generating a scale map and a displacement map through parallel inference according to the edge map, and carrying out scaling displacement according to the scale map and the displacement map to obtain a first feature map;
and the residual block is used for carrying out deconvolution according to the first feature mapping to obtain the second original image.
6. The method for synthesizing a high-simulation image according to any one of claims 1-5, wherein the first intermediate image under the target data set format comprises at least one of: a first intermediate image under poor visual conditions and a first intermediate image under non-poor visual conditions;
the step of training the original image data set and the target data set to obtain an unpaired unsupervised style conversion network model comprises at least one of the following steps:
determining, by a discriminator, a probability that the first intermediate image is a first intermediate image under poor visual conditions;
or
Determining, by a discriminator, a probability that the first intermediate image is a first intermediate image in non-poor visual conditions.
7. The method for synthesizing a high-simulation image according to claim 6, wherein the step of training the original image data set and the target data set to obtain an unpaired unsupervised style conversion network model further comprises:
acquiring the adversarial loss, the cyclic pixel loss and the cyclic perceptual loss of the unpaired unsupervised style conversion network model; and updating the unpaired unsupervised style conversion network model according to the sum of the adversarial loss, the cyclic pixel loss and the cyclic perceptual loss.
8. The method for synthesizing a high-simulation image according to claim 7, wherein the adversarial loss is obtained by the following steps:
determining the adversarial loss based on a difference between the first intermediate image and the first original image.
9. The method for synthesizing a high-simulation image according to claim 7, wherein the cyclic pixel loss is obtained by the following steps:
and determining the cyclic pixel loss according to the difference of the first original image and the second original image in the spatial domain.
10. The method for synthesizing a high-simulation image according to claim 7, wherein the cyclic perceptual loss is obtained by the following steps:
determining the similarity of the first original image and the second original image in the feature domain through a pre-trained VGG network model to obtain the cyclic perceptual loss.
CN202110401470.9A 2021-04-14 2021-04-14 Method for synthesizing high-simulation image Active CN113160101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110401470.9A CN113160101B (en) 2021-04-14 2021-04-14 Method for synthesizing high-simulation image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110401470.9A CN113160101B (en) 2021-04-14 2021-04-14 Method for synthesizing high-simulation image

Publications (2)

Publication Number Publication Date
CN113160101A true CN113160101A (en) 2021-07-23
CN113160101B CN113160101B (en) 2023-08-01

Family

ID=76890421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110401470.9A Active CN113160101B (en) 2021-04-14 2021-04-14 Method for synthesizing high-simulation image

Country Status (1)

Country Link
CN (1) CN113160101B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
US20190147582A1 (en) * 2017-11-15 2019-05-16 Toyota Research Institute, Inc. Adversarial learning of photorealistic post-processing of simulation with privileged information
CN112270648A (en) * 2020-09-24 2021-01-26 清华大学 Unsupervised image transformation method and unsupervised image transformation device based on loop countermeasure network
CN112365556A (en) * 2020-11-10 2021-02-12 成都信息工程大学 Image extension method based on perception loss and style loss

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
US20190147582A1 (en) * 2017-11-15 2019-05-16 Toyota Research Institute, Inc. Adversarial learning of photorealistic post-processing of simulation with privileged information
CN112270648A (en) * 2020-09-24 2021-01-26 清华大学 Unsupervised image transformation method and unsupervised image transformation device based on loop countermeasure network
CN112365556A (en) * 2020-11-10 2021-02-12 成都信息工程大学 Image extension method based on perception loss and style loss

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Junyi; Yao Xuejuan; Li Hailin: "Research on image style transfer method based on perceptual adversarial networks", Journal of Hefei University of Technology (Natural Science), no. 05, pages 54-58

Also Published As

Publication number Publication date
CN113160101B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
CN113902915B (en) Semantic segmentation method and system based on low-light complex road scene
CN113343789A (en) High-resolution remote sensing image land cover classification method based on local detail enhancement and edge constraint
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN113486897A (en) Semantic segmentation method for convolution attention mechanism up-sampling decoding
CN116665176B (en) Multi-task network road target detection method for vehicle automatic driving
CN110443763B (en) Convolutional neural network-based image shadow removing method
CN110363068B (en) High-resolution pedestrian image generation method based on multiscale circulation generation type countermeasure network
CN112084859B (en) Building segmentation method based on dense boundary blocks and attention mechanism
CN111275638B (en) Face repairing method for generating confrontation network based on multichannel attention selection
CN116645592B (en) Crack detection method based on image processing and storage medium
CN114373094A (en) Gate control characteristic attention equal-variation segmentation method based on weak supervised learning
CN115205672A (en) Remote sensing building semantic segmentation method and system based on multi-scale regional attention
CN115908772A (en) Target detection method and system based on Transformer and fusion attention mechanism
CN114782298A (en) Infrared and visible light image fusion method with regional attention
CN115272438A (en) High-precision monocular depth estimation system and method for three-dimensional scene reconstruction
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN112767277B (en) Depth feature sequencing deblurring method based on reference image
CN114529832A (en) Method and device for training preset remote sensing image overlapping shadow segmentation model
CN114331931A (en) High dynamic range multi-exposure image fusion model and method based on attention mechanism
CN117952883A (en) Backlight image enhancement method based on bilateral grid and significance guidance
CN114155165A (en) Image defogging method based on semi-supervision
CN117058392A (en) Multi-scale Transformer image semantic segmentation method based on convolution local enhancement
Wang et al. A multi-scale attentive recurrent network for image dehazing
CN115035402B (en) Multistage feature aggregation system and method for land cover classification problem
CN113160101B (en) Method for synthesizing high-simulation image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant