CN114529622A - Method and device for generating high-quality images with a generative adversarial network by introducing self-supervised composite-task training - Google Patents

Method and device for generating high-quality images with a generative adversarial network by introducing self-supervised composite-task training

Info

Publication number
CN114529622A
CN114529622A
Authority
CN
China
Prior art keywords
image
training
task
branch
network
Prior art date
Legal status
Pending
Application number
CN202210033454.3A
Other languages
Chinese (zh)
Inventor
魏莹
张见威
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202210033454.3A
Publication of CN114529622A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06N 3/045: Combinations of networks
    • G06N 3/088: Non-supervised learning, e.g. competitive learning
    • G06T 3/4038: Image mosaicing, e.g. composing plane images from plane sub-images
    • G06T 3/60: Rotation of whole images or parts thereof
    • G06T 2200/32: Indexing scheme for image data processing or generation, in general, involving image mosaicing
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]


Abstract

The invention discloses a method and a device for generating high-quality images with a generative adversarial network (GAN) by introducing self-supervised composite-task training, comprising the following steps: (1) prepare an original image data set and a stitched image data set; (2) design a composite task that implements self-supervised learning for the GAN; (3) build the model, constructing an adversarial training branch and a self-supervised composite-task branch; (4) train the model and save the network parameters; (5) generate images with the trained generator network. The method exploits both intra-image and inter-image information: it constructs a composite task comprising three subtasks that guides the network to learn more stable and more general features of the images, and it constructs a local discriminator to strengthen the network's ability to extract local image information, thereby markedly improving the training of the network and the quality of the finally generated images.

Description

Method and device for generating high-quality images with a generative adversarial network by introducing self-supervised composite-task training
Technical Field
The invention belongs to the technical field of computer image generation, and particularly relates to a method and a device for generating high-quality images with a generative adversarial network by introducing self-supervised composite-task training.
Background
Real image data sets are an indispensable resource for training networks in the field of computer vision: a large number of real images helps a network learn useful feature representations so that it performs well on downstream tasks. However, when building a real image data set, differences between image acquisition devices mean that the collected raw images must be aligned through operations such as resizing and unifying the resolution, which usually requires considerable labor and is an important reason why network training is expensive. A generated image is an image, produced by a well-trained generative model, that resembles the real images in the training set; the generative model can map random noise directly to images. Expanding an existing data set with a large number of such generated images is one way to address this problem: by learning from the real images in an existing data set, the generative model produces realistic and diverse images, greatly reducing the data-set construction cost incurred by manually collecting and processing data.
The generative adversarial network (GAN) is a generative model that has attracted intense interest in recent years, proposed by Goodfellow et al. in "Generative Adversarial Networks" (NeurIPS, 2014). In a GAN, the generator receives random noise and produces an image, while the discriminator receives real image samples and samples produced by the generator and judges whether each received sample is a real image. During training, the generator and the discriminator are continuously optimized in mutual confrontation. However, GANs suffer from "catastrophic forgetting" in the discriminator and from an unstable training process, which can even lead to mode collapse. One current solution is to perform self-supervised learning for the GAN by introducing additional auxiliary tasks, so that the discriminator learns more general and stable features and the training process becomes more stable. Existing auxiliary tasks, however, are usually single tasks, which easily gives the learned features a pronounced task bias. For example, the rotation task proposed by Gidaris et al. in "Unsupervised Representation Learning by Predicting Image Rotations" (ICLR, 2018) applies a random rotation to the input image and requires the network to determine the rotation angle of the received image. This effectively helps the network learn the structural features of an image, but color, texture, and other features are easily ignored by the network, because they contribute little to the task of judging the rotation angle.
Existing self-supervised auxiliary tasks proposed for GANs thus suffer from being single tasks that cover an incomplete set of features, and they impose a strong bias on the feature-learning process of the network. This hinders the network from learning general and stable features and degrades the quality of the subsequently generated images.
Disclosure of Invention
The main purpose of the invention is to overcome the defects of the prior art and provide a method and a device for generating high-quality images with a GAN by introducing self-supervised composite-task training. By designing and introducing a well-constructed composite task, the network is guided to learn more stable and more general features of the images, so that the training of the network improves and the quality of the finally generated images increases.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a method for generating a confrontation network to generate a high-quality image by introducing self-supervision compound task training, which comprises the following steps:
preparing a training data set comprising original image data and stitched image data; the original image data are used in the training of the adversarial training branch, and the stitched image data are used in the training of the self-supervised composite-task branch;
designing three subtasks that form a composite task, which is used to construct the self-supervised composite-task branch and to provide supervision information for model training; the three subtasks are a rotation prediction task, a position prediction task, and a common feature extraction task, where the rotation prediction task requires correctly judging the rotation label of each image block contained in a stitched image, the position prediction task requires correctly judging the position label of each image block contained in a stitched image, and the common feature extraction task requires first correctly judging which original image each image block belongs to and then extracting the common features shared by homologous image blocks;
building the model by constructing an adversarial training branch and a self-supervised composite-task branch, where the adversarial training branch comprises a local discriminator and a generator, and the self-supervised composite-task branch comprises a classifier with three output heads;
training the built model to obtain a trained generator network; the training specifically comprises: using the original data set as input of the adversarial training branch and the stitched image data as input of the self-supervised composite-task branch, training the networks in the two branches, with the self-supervised composite-task branch providing supervision information for the local discriminator in the adversarial training branch during training;
and inputting random noise into the trained generator network to generate images.
As a preferred technical solution, the stitched image data are prepared as follows:
for the original images in a batch, cut 4 image blocks with a given overlap ratio ω from the upper-left, upper-right, lower-left, and lower-right regions of each image, where the overlap ratio ω equals the ratio of the side length of the overlapping part between adjacent image blocks to the side length of the original image;
randomly shuffle the obtained batch of image blocks and apply one rotation transformation to each, with the rotation angle selected at random from the set R = {0°, 90°, 180°, 270°};
resize the obtained batch of image blocks with bilinear interpolation so that the side length of each block equals half the side length of the original image;
and stitch the obtained image blocks in groups of 4 to obtain a batch of stitched images of the same size as the original images, completing the preparation of the stitched image data.
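The four preparation steps above can be sketched in NumPy. This is an illustrative sketch, not the patented implementation: the function name and batch layout are assumptions, and nearest-neighbor index selection stands in for the bilinear interpolation named in the text.

```python
import numpy as np

def make_stitched_batch(images, omega=0.25, rng=None):
    """Build one batch of stitched images from original images.

    images: (n, H, H, C) array; omega: overlap ratio between adjacent patches.
    Returns (n, H, H, C) stitched images plus rotation, position and
    source-image pseudo-labels for every patch.
    """
    rng = np.random.default_rng(rng)
    n, H, _, C = images.shape
    s = int(round(H * (1 + omega) / 2))      # patch side length for overlap ratio omega
    corners = [(0, 0), (0, H - s), (H - s, 0), (H - s, H - s)]  # UL, UR, LL, LR

    patches, rot, pos, src = [], [], [], []
    for i, img in enumerate(images):
        for p, (r0, c0) in enumerate(corners):
            patch = img[r0:r0 + s, c0:c0 + s]
            k = rng.integers(4)              # rotation drawn from {0°, 90°, 180°, 270°}
            patch = np.rot90(patch, k)
            # nearest-neighbor resize to H/2 (stand-in for bilinear interpolation)
            idx = np.arange(H // 2) * s // (H // 2)
            patches.append(patch[np.ix_(idx, idx)])
            rot.append(k); pos.append(p); src.append(i)

    order = rng.permutation(4 * n)           # randomly shuffle all patches
    patches = np.stack(patches)[order]
    rot, pos, src = (np.asarray(a)[order] for a in (rot, pos, src))

    # stitch groups of 4 patches back into H x H images
    h = H // 2
    stitched = np.zeros_like(images)
    for g in range(n):
        a, b, c, d = patches[4 * g: 4 * g + 4]
        stitched[g, :h, :h], stitched[g, :h, h:] = a, b
        stitched[g, h:, :h], stitched[g, h:, h:] = c, d
    return stitched, rot, pos, src
```

The three returned label arrays correspond to the pseudo-labels of the rotation, position, and common feature extraction subtasks defined next.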
As a preferred technical solution, the three subtasks are designed as follows:
in the rotation prediction task, for a batch of stitched images, each image contains 4 image blocks, each with its own rotation angle, and each image block is assigned a pseudo-label l_r according to that angle, l_r ∈ {0°, 90°, 180°, 270°};
in the position prediction task, for a batch of stitched images, the 4 image blocks contained in each image each correspond to a fixed region of the original image they belong to, and each image block is assigned a pseudo-label l_l according to this position information, l_l ∈ {upper-left, upper-right, lower-left, lower-right};
in the common feature extraction task, for a batch of stitched images, the 4 image blocks contained in each image each correspond to one original image; the 4 image blocks belonging to the same original image are defined as homologous image blocks, and features with high similarity across homologous image blocks are defined as common features.
As a preferred technical solution, the concrete steps of constructing the model are as follows:
in the adversarial training branch, a local discriminator local-D and a generator G are constructed; the network structure of the local discriminator is divided into two parts, with a feature blocking module added between them. The first part receives an original image as input and extracts image features; the feature blocking module processes the image features output by the first part into image-block features; the second part receives the image-block features as input and produces the final output of the local discriminator. The task of local-D is to judge correctly whether each block feature comes from a real image or a generated image. The loss function of this branch, denoted L_adv, is consistent with the adversarial loss proposed for the original GAN, and is expressed as:

min_G max_D L_adv(G, D) = E_{x~P_data(x)}[log D(x)] + E_{z~P_z(z)}[log(1 − D(G(z)))]

where x is a real image sampled from the original data set, P_data(x) is the real data distribution, z is random noise sampled from the prior distribution, P_z(z) is the prior distribution, D is the local discriminator, and G is the generator;
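To make the expression concrete, the following sketch evaluates L_adv from discriminator outputs. It is a hedged illustration: the function name `adversarial_loss` is assumed, and the toy logit values in the usage stand in for the outputs of local-D.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def adversarial_loss(d_real_logits, d_fake_logits):
    """Minimax GAN loss L_adv = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real_logits: discriminator logits on real samples x ~ P_data.
    d_fake_logits: discriminator logits on generated samples G(z), z ~ P_z.
    The discriminator ascends this value; the generator descends it.
    """
    eps = 1e-12  # numerical floor to keep the logarithms finite
    d_real = sigmoid(np.asarray(d_real_logits))
    d_fake = sigmoid(np.asarray(d_fake_logits))
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))
```

A confident, correct discriminator (D(x) → 1, D(G(z)) → 0) drives this value toward 0 from below; at D ≡ 0.5 it equals −2 log 2.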
in the self-supervised composite-task branch, a classifier C is built with a network architecture consistent with the local discriminator in the adversarial training branch. The classifier network is likewise divided into two parts: the first part has the same architecture as the first part of local-D and shares its network weights; the second part comprises the three output heads of the classifier, where the first two heads each consist of a fully connected layer and output the results of the rotation prediction task and the position prediction task respectively, and the third head consists of a multi-layer perceptron with one hidden layer and outputs the result of the common feature extraction task. The total loss function of this branch is denoted L_CT.
As a preferred technical solution, the self-supervised composite-task branch is jointly optimized with several loss functions, and its total loss L_CT is defined as:
L_CT = L_rot + L_loc + L_CFE
where L_rot, L_loc, and L_CFE denote the rotation prediction loss, the position prediction loss, and the common feature extraction loss, respectively;
denote the true rotation label of each image block in a stitched image by l_r_gt and its true position label by l_l_gt, the rotation label predicted by the classifier by l_r and the predicted position label by l_l, and the vectors output by the multi-layer perceptron in the common feature extraction task by v_1, v_2, …, v_{4n}, where n is the training batch size. The rotation prediction loss and the position prediction loss are computed with cross-entropy, and the similarity between different block features is computed with cosine similarity; the losses of the three tasks are:
L_rot = CrossEntropy(l_r, l_r_gt)
L_loc = CrossEntropy(l_l, l_l_gt)
sim(v_i, v_j) = (v_i · v_j) / (‖v_i‖ ‖v_j‖)
L_CFE = −(1/(4n)) Σ_{i=1}^{4n} (1/|C_i|) Σ_{j∈C_i} log( exp(sim(v_i, v_j)/τ) / Σ_{k=1}^{4n} I[k ≠ i] · exp(sim(v_i, v_k)/τ) )
where τ is a temperature coefficient, n is the training batch size, C_i denotes the set of indices of the image blocks homologous to the i-th image block, and I is an indicator function whose value is 1 when the condition holds and 0 otherwise.
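A NumPy sketch of the three losses above; the function names, the softmax form of the cross-entropy, and the toy shapes are assumptions for illustration.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of patch logits (softmax form)."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def cfe_loss(v, src, tau=0.5):
    """Common-feature-extraction loss: pull homologous patch features together.

    v:   (4n, d) feature vectors from the MLP head.
    src: (4n,) index of the original image each patch came from; patches with
         equal src are homologous (the sets C_i).
    """
    v = v / np.linalg.norm(v, axis=1, keepdims=True)   # cosine similarity = dot product
    sim = v @ v.T / tau
    m = len(v)
    mask_self = np.eye(m, dtype=bool)
    exp_sim = np.exp(sim) * ~mask_self                 # exclude the k == i terms
    denom = exp_sim.sum(axis=1, keepdims=True)
    pos = (src[:, None] == src[None, :]) & ~mask_self  # pairs (i, j) with j in C_i
    log_prob = sim - np.log(denom)
    return -(log_prob * pos).sum() / pos.sum()         # average over positive pairs

def composite_task_loss(rot_logits, rot_gt, loc_logits, loc_gt, v, src, tau=0.5):
    """L_CT = L_rot + L_loc + L_CFE."""
    return (cross_entropy(rot_logits, rot_gt)
            + cross_entropy(loc_logits, loc_gt)
            + cfe_loss(v, src, tau))
```

Since every |C_i| = 3, averaging over positive pairs in `cfe_loss` matches the nested-sum normalization in the formula above.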
As a preferred technical solution, the concrete steps of model training are as follows:
in the adversarial training branch, the generator G and the local discriminator local-D are trained by alternating iterations; the input of the local discriminator is a batch of images sampled from the original image data set, and its training target is to judge correctly the authenticity of regions of the input images, while the input of the generator is random noise and its training target is to output generated images realistic enough to fool the local discriminator;
in the self-supervised composite-task branch, the input of the classifier is a batch of stitched images, the three output heads output the results of the three subtasks respectively, and the branch trains the network through the total loss function L_CT;
the adversarial training branch and the self-supervised composite-task branch are trained simultaneously and are connected during training by sharing the network weights of the first parts of local-D and the three-head classifier C; the total loss function of the model is defined as:

min_{G,C} max_D L_adv(G, D) + L_CT(C)

where L_adv(G, D) is the adversarial training loss and L_CT(C) is the self-supervised composite-task loss; during model training the local discriminator local-D and the generator G are updated alternately, and the three-head classifier C is updated at the same time.
As a preferred technical solution, generating images with the trained generator network specifically comprises:
inputting random noise into the trained generator network; a single forward pass then yields a high-quality generated image similar to the images of the training set.
Another aspect of the invention provides a system for generating high-quality images with a GAN by introducing self-supervised composite-task training, applying the method described above and comprising a data set module, a composite task module, a model building module, a model training module, and an image generation module;
the data set module is used for preparing a training data set comprising original image data and stitched image data; the original image data are used in the training of the adversarial training branch, and the stitched image data are used in the training of the self-supervised composite-task branch;
the composite task module is used for designing the three subtasks that form the composite task, which constructs the self-supervised composite-task branch and provides supervision information for model training; the three subtasks are a rotation prediction task, a position prediction task, and a common feature extraction task, where the rotation prediction task requires correctly judging the rotation label of each image block contained in a stitched image, the position prediction task requires correctly judging the position label of each image block contained in a stitched image, and the common feature extraction task requires first correctly judging which original image each image block belongs to and then extracting the common features shared by homologous image blocks;
the model building module is used for constructing the adversarial training branch and the self-supervised composite-task branch, where the adversarial training branch comprises a local discriminator and a generator, and the self-supervised composite-task branch comprises a classifier with three output heads;
the model training module is used for training the built model to obtain a trained generator network, specifically: the original data set is used as input of the adversarial training branch and the stitched image data as input of the self-supervised composite-task branch, the networks in the two branches are trained, and during training the self-supervised composite-task branch provides supervision information for the local discriminator in the adversarial training branch;
and the image generation module is used for inputting random noise into the trained generator network to generate images.
Yet another aspect of the invention provides an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to cause the at least one processor to perform the method for generating high-quality images with a GAN by introducing self-supervised composite-task training.
Yet another aspect of the invention provides a computer-readable storage medium storing a program which, when executed by a processor, implements the method for generating high-quality images with a GAN by introducing self-supervised composite-task training.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) For the self-supervised learning of GANs, the invention provides a composite auxiliary task based on multi-level information that uses both intra-image and inter-image information as supervision during network training. This stabilizes the training process, improves the generality of the features extracted by the network, and improves the quality of the generated images.
(2) The invention provides a local discriminator local-D, so that the network attends to more local feature information during feature extraction; combined with the self-supervised composite-task branch, this improves the self-supervised learning effect of the whole model.
(3) The common feature extraction task introduces the idea of contrastive learning into the self-supervised learning of GANs, using its strengths to improve the quality of feature extraction while keeping the training batch size small, thereby obtaining a large gain in network performance at a small training cost.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is an overall flowchart of the method for generating high-quality images with a GAN by introducing self-supervised composite-task training according to an embodiment of the present invention.
FIG. 2 is a flow chart of the image processing module according to an embodiment of the present invention, showing three states of the image data: state 1 is the initial state of a batch of image data, state 2 is the state after each image has been cut into 4 image blocks, and state 3 is the state after all image blocks have been randomly shuffled, randomly rotated, and stitched to obtain a batch of stitched image data.
Fig. 3 is an overall structure diagram of a network model according to an embodiment of the present invention.
Fig. 4 is a block diagram of a generator network according to an embodiment of the present invention.
Fig. 5 is a block diagram of a deconvolution module in a generator network according to an embodiment of the present invention.
Fig. 6 is a structural diagram of the local discriminator network according to an embodiment of the present invention.
Fig. 7 is a structural diagram of a convolution module in the local discriminator according to an embodiment of the present invention.
FIG. 8 is a block diagram of the feature blocking module according to an embodiment of the present invention.
FIG. 9 is a block diagram of the system for generating high-quality images with a GAN by introducing self-supervised composite-task training according to an embodiment of the present invention.
Fig. 10 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Referring to fig. 1, this embodiment provides a method for generating high-quality images with a GAN by introducing self-supervised composite-task training, comprising the following steps:
s1, preparing a training data set comprising an original image data set and a spliced image data set;
furthermore, the original image data set is used as an input of a countertraining branch, and public data sets of Cifar-10, CelebA, ImageNet 32X 32, STL-10 and the like are obtained as the original image data set of the method through a network search engine.
Further, the stitched image data set is used as an input of an auto-supervised compound task branch, and is obtained by performing a certain processing on the original image data set, and the processing flow refers to fig. 2. Firstly, cutting 4 image blocks with a certain side length overlapping rate omega of 0.4 from the upper left area, the upper right area, the lower left area and the lower right area of each image for an original image in a batch, then randomly disordering the image blocks, carrying out one-time rotation transformation, randomly selecting a rotation angle from a set R (0 degrees, 90 degrees, 180 degrees and 270 degrees), adjusting the size of the image blocks by using a bilinear interpolation method to enable the side length of the image blocks to be equal to half of the side length of the original image, and finally splicing the image blocks by taking 4 image blocks as a group to obtain a batch of spliced images with the same size as the original image, namely obtaining a required spliced image data set.
S2, designing the composite task, which constructs a strong supervision signal for the training process of the network;
further, to make full use of the image information, this embodiment designs a composite task comprising three subtasks: a rotation prediction task, a position prediction task, and a common feature extraction task, where the first two are designed from intra-image information and the third from inter-image information.
Further, the rotation prediction task uses the rotation information of each image block to construct a pseudo-label l_r, l_r ∈ {0°, 90°, 180°, 270°}, and requires the network to judge correctly the rotation labels of the 4 image blocks contained in a stitched image. The network can succeed at this task only if it understands the structural information in the image blocks well.
Further, the position prediction task uses the position of each image block within its original image to construct a pseudo-label l_l, l_l ∈ {upper-left, upper-right, lower-left, lower-right}, and requires the network to judge correctly the original position labels of the 4 image blocks contained in a stitched image. Specifically, when the stitched image data set is prepared, the image blocks are cut from the four regions of the original image, namely upper-left, upper-right, lower-left, and lower-right, so each image block corresponds to a position label in the original image, and the network can predict these labels correctly only by understanding well the structural characteristics of the image blocks and of the original image.
Further, the common feature extraction task uses information across different images and requires the network to extract the common features of homologous image blocks. The 4 image blocks cut from the same original image are defined as homologous image blocks, and features with high similarity across them are the common features. This task builds on two observations: the 4 homologous image blocks jointly form an original image with complete semantics, and adjacent homologous blocks share a certain overlapping region, so common features exist among them. These common features help the network distinguish homologous from non-homologous image blocks correctly, so that while completing this task the network continuously strengthens its ability to extract common features and learns representative features of the images.
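The sets C_i used by this task follow directly from the source-image index of each shuffled patch. A small hedged helper (the name `homologous_sets` is illustrative, not from the patent):

```python
import numpy as np

def homologous_sets(src):
    """C_i: for each patch index i, the indices of its homologous patches
    (patches cut from the same original image), excluding i itself."""
    src = np.asarray(src)
    return [np.flatnonzero((src == src[i]) & (np.arange(len(src)) != i))
            for i in range(len(src))]
```

With 4 patches per original image, every C_i contains exactly 3 indices regardless of how the batch was shuffled.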
S3, building a network model;
referring to fig. 3, building the whole network model comprises four parts: implementing the generator with a ResNet-50 network, implementing the local discriminator with a ResNet-50 network, constructing the three-head classifier, and designing the loss function of each branch together with the total loss function of the network.
Further, the concrete steps of building the network model are as follows:
S31, constructing the generator network:
referring to fig. 4 for the structure of the constructed generator network: the input random noise first passes through a fully connected layer with 4096 output channels, and the resulting vector is reshaped to 4 × 4 × 256; it then passes through three consecutive deconvolution modules; after batch normalization, ReLU activation and a 3 × 3 convolution (3 output channels, stride 1), a Sigmoid activation produces the final output of size 32 × 32 × 3.
Referring to fig. 5, the deconvolution module performs the following operations: the input X undergoes batch normalization, ReLU activation, upsampling, a 3 × 3 convolution, ReLU activation and a 3 × 3 convolution to obtain X1; the input X also undergoes upsampling and a 1 × 1 convolution to obtain X2; the final output of the deconvolution module is X1 + X2. All convolutions in the deconvolution module have 256 output channels and stride 1.
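The generator of fig. 4 and the deconvolution module of fig. 5 can be sketched in PyTorch as follows. This is a hedged reconstruction of the embodiment's description: the noise dimension (128 here) and the padding choices are assumptions, since they are not stated explicitly.

```python
import torch
import torch.nn as nn

class DeconvBlock(nn.Module):
    # Residual upsampling block of fig. 5: all convolutions use 256
    # channels and stride 1; upsampling doubles the spatial size.
    def __init__(self, ch=256):
        super().__init__()
        self.main = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, 1, 1))
        self.skip = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(ch, ch, 1, 1, 0))

    def forward(self, x):
        return self.main(x) + self.skip(x)   # X1 + X2

class Generator(nn.Module):
    def __init__(self, z_dim=128):           # noise dimension assumed
        super().__init__()
        self.fc = nn.Linear(z_dim, 4096)     # 4096 = 4 x 4 x 256
        self.blocks = nn.Sequential(*[DeconvBlock() for _ in range(3)])  # 4 -> 8 -> 16 -> 32
        self.head = nn.Sequential(
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.Conv2d(256, 3, 3, 1, 1), nn.Sigmoid())

    def forward(self, z):
        x = self.fc(z).view(-1, 256, 4, 4)
        return self.head(self.blocks(x))     # (N, 3, 32, 32)
```

The three upsampling blocks take the 4 × 4 feature map to 32 × 32, consistent with the 32 × 32 × 3 output stated above.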
S32, constructing a local discriminator network:
please refer to fig. 6 for the structure of the local discriminator network. The network is divided into two parts: the first part consists of several convolution modules and a ReLU activation and is responsible for outputting the intermediate-layer features; the second part consists of a feature blocking module and a fully connected layer and is responsible for processing the intermediate-layer features into the final output. The input image, of initial size 32 × 32 × 3, first passes through a first convolution module comprising the following steps: the input X undergoes a 3 × 3 convolution, ReLU activation, a 3 × 3 convolution and a 1 × 1 convolution (stride 2) to obtain X1; the input X also undergoes a 1 × 1 convolution (stride 2) to obtain X2; the output is X1 + X2. The result then passes through three consecutive convolution modules; after one ReLU activation of the last module's output, the resulting intermediate-layer features are fed to the feature blocking module, which outputs block features; finally, the block features are fed to a fully connected layer with 1 output channel to obtain the final output.
Further, referring to fig. 7, each of the three consecutive convolution modules performs the following steps: the input X undergoes (ReLU activation, 3 × 3 convolution) twice and one 1 × 1 convolution (stride 2) to obtain X1; the input X undergoes a 1 × 1 convolution (stride 1) followed by a 1 × 1 convolution (stride 2) to obtain X2; the final output of the module is X1 + X2. All convolutions in these modules have 128 output channels, and all 3 × 3 convolutions use stride 1. Note that in the last two convolution modules, the 1 × 1 convolutions with stride 2 are omitted, so the intermediate-layer features obtained after the final ReLU activation have size 8 × 8 × 128.
As shown in fig. 8, the input intermediate-layer features F are first average-pooled (kernel size 2 × 2, stride 2), and the result is then split into 2 × 2 blocks along the height and width dimensions, yielding block features Fp of size 2 × 2 × 128, whose number is four times that of the input intermediate-layer features F.
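A minimal NumPy sketch of the feature blocking module of fig. 8; the channel-last (N, H, W, C) layout and the function name are illustrative assumptions.

```python
import numpy as np

def feature_blocking(f):
    """f: (N, H, W, C) intermediate-layer features.  Average-pool with a
    2x2 kernel at stride 2, then split the result into 2x2 spatial blocks,
    so the output batch is four times the input batch: (N*4, H/4, W/4, C)."""
    n, h, w, c = f.shape
    # 2x2 average pooling, stride 2
    pooled = f.reshape(n, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))
    ph, pw = pooled.shape[1], pooled.shape[2]
    # split into a 2x2 grid of spatial blocks and fold them into the batch
    blocks = pooled.reshape(n, 2, ph // 2, 2, pw // 2, c)
    blocks = blocks.transpose(0, 1, 3, 2, 4, 5).reshape(n * 4, ph // 2, pw // 2, c)
    return blocks
```

For the 8 × 8 × 128 intermediate features above, pooling gives 4 × 4 × 128 and blocking gives four 2 × 2 × 128 block features per input, matching the stated sizes.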
S33, constructing a three-head classifier network:
the three-head classifier network is likewise divided into two parts. The first part has the same structure as the first part of the local discriminator, shares its weights and updates parameters synchronously; it is mainly responsible for extracting intermediate-layer features from the input stitched images. The second part contains the feature blocking module and three output heads. The feature blocking module (see fig. 8 for its structure) processes the intermediate-layer features into block features and feeds them to the three heads. Two of the heads each consist of a single fully connected layer with 4 output channels and are responsible, respectively, for predicting the rotation angle and the original position label of each image block; the third head is a multilayer perceptron with one hidden layer and is responsible for completing the common feature extraction task: its hidden layer has as many output channels as the block feature has input channels and uses a ReLU activation, and its output layer has 64 output channels. The method of applying an additional nonlinear transformation to features with a multilayer perceptron was first proposed by Chen et al. in "A Simple Framework for Contrastive Learning of Visual Representations" (ICML, 2020), and it helps the network learn higher-quality feature representations.
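The three output heads can be sketched as follows, assuming each block feature is flattened to 2 × 2 × 128 = 512 channels before the heads; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class ThreeHeads(nn.Module):
    # Output heads of the three-head classifier.  `feat_dim` is the
    # flattened size of one block feature (2 x 2 x 128 = 512 here).
    def __init__(self, feat_dim=512, proj_dim=64):
        super().__init__()
        self.rot_head = nn.Linear(feat_dim, 4)    # rotation label: 4 classes
        self.loc_head = nn.Linear(feat_dim, 4)    # position label: 4 classes
        self.cfe_head = nn.Sequential(            # projection MLP, one hidden layer
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, proj_dim))        # 64-dim output vectors

    def forward(self, block_feat):
        f = block_feat.flatten(1)                 # (N*4, 512)
        return self.rot_head(f), self.loc_head(f), self.cfe_head(f)
```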
S4, designing a loss function:
total loss function L_final of the model proposed in this embodiment consists of the loss functions of its two branches, namely the loss function L_adv of the adversarial training branch and the loss function L_CT of the self-supervised composite task branch. The total loss function of the model is expressed as follows:

L_final = L_adv + L_CT
where L_adv is consistent with the loss function of the classical generative adversarial network, first proposed by Goodfellow et al. in the article "Generative Adversarial Networks". The specific expression of L_adv is as follows:
L_adv = min_G max_D { E_{x∼Pdata(x)}[log D(x)] + E_{z∼Pz(z)}[log(1 − D(G(z)))] }
where x is a real image sampled from the original data set, Pdata(x) is the real data distribution, z is random noise sampled from the prior distribution, Pz(z) is the prior distribution, D is the local discriminator, and G is the generator.
L_CT consists of the losses of the three subtasks, namely the rotation prediction task loss L_rot, the position prediction task loss L_loc, and the common feature extraction task loss L_CFE. It is defined as:
L_CT = L_rot + L_loc + L_CFE
L_rot = CrossEntropy(l_r, l_r_gt)
L_loc = CrossEntropy(l_l, l_l_gt)
sim(v_i, v_j) = (v_i · v_j) / (‖v_i‖ · ‖v_j‖)

L_CFE = −(1 / (n × 4)) Σ_{i=1}^{n×4} (1 / |C_i|) Σ_{j∈C_i} log [ exp(sim(v_i, v_j) / τ) / Σ_{k=1}^{n×4} I(k ≠ i) exp(sim(v_i, v_k) / τ) ]
where l_r_gt and l_l_gt respectively denote the true rotation label and true position label of each image block in the stitched image, l_r and l_l respectively denote the rotation label and position label predicted by the classifier for an image block, CrossEntropy(·) is the cross-entropy function, v_1, v_2, …, v_{n×4} denote the group of vectors output by the multilayer perceptron, sim(·) is the cosine similarity function, I is the indicator function whose value is 1 when the judgment condition is met and 0 otherwise, τ is the temperature coefficient (default value 0.3), n is the training batch size (default value 64), and C_i represents the set of indices of the homologous image blocks of the i-th image block.
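A NumPy sketch of the common feature extraction loss L_CFE under these definitions. The exact normalization of the patent's formula is not fully recoverable from the text, so this follows the standard NT-Xent-style form with homologous blocks as positives; the function and argument names are illustrative.

```python
import numpy as np

def cfe_loss(v, image_ids, tau=0.3):
    """v: (4n, d) vectors output by the projection head; image_ids: (4n,)
    index of the source image of each block.  Blocks sharing an image id
    are the homologous positives C_i; all other blocks act as negatives."""
    v = np.asarray(v, dtype=float)
    image_ids = np.asarray(image_ids)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)        # cosine similarity
    sim = v @ v.T / tau
    m = len(v)
    np.fill_diagonal(sim, -np.inf)                          # exclude k = i (I(k != i))
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (image_ids[:, None] == image_ids[None, :]) & ~np.eye(m, dtype=bool)
    # average the negative log-probability over all positive pairs
    return -np.where(pos, log_prob, 0.0).sum() / pos.sum()
```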
S4, training the constructed model;
the original image data set is input to the generator and the local discriminator for alternating training, and the stitched image data set is input to the three-head classifier for training; the losses are optimized with the Adam algorithm with an adaptive learning rate, the training batch size is 64, and training runs for 300,000 iterations in total.
Further, during training, for the adversarial training branch, random noise is input to the generator to obtain generated images. When a generated image is input to the discriminator, the adversarial training loss and back-propagated gradients are computed, and the generator adjusts its parameters so that it tends to produce images closer to real ones. When a batch of real images sampled from the original data set is input to the local discriminator, the adversarial training loss and back-propagated gradients are computed, and the discriminator adjusts its parameters so that its ability to distinguish generated images from real ones tends to improve. For the self-supervised composite task branch, when a batch of stitched images is input to the three-head classifier, the classifier performs the three subtasks of the composite task simultaneously, computes the total composite task loss and back-propagated gradients, and adjusts its parameters so that it tends to extract more comprehensive and more general image features. Because the classifier shares parameters with the local discriminator, the two are updated simultaneously during training, so the local discriminator also acquires the feature extraction ability brought by the composite task. Throughout training, the generator and the local discriminator form a mutual adversarial relationship: the generator aims to produce realistic images that can fool the discriminator, while the discriminator aims to keep improving its ability to distinguish real images from generated ones; when this adversarial game reaches equilibrium, both the generator's image generation ability and the discriminator's feature extraction ability reach a high level.
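One alternating training step of the adversarial branch can be sketched with tiny stand-in linear networks. The real model uses the generator, local discriminator and three-head classifier described above, and a full step would additionally back-propagate the composite task loss L_CT through the classifier, whose trunk shares weights with the discriminator; all names and hyperparameters here are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-ins: noise -> fake "image" vector, and a one-layer critic.
G = nn.Sequential(nn.Linear(16, 32), nn.Tanh())
D = nn.Linear(32, 1)
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(64, 32)                 # batch of "real" samples
z = torch.randn(64, 16)                    # batch of random noise

# Discriminator step: push real toward 1, generated (detached) toward 0.
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator output 1 on generated samples.
g_loss = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```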
When training is completed, the parameters of the entire model are saved.
S5, generating an image;
by removing the local discriminator of the adversarial training branch and the entire self-supervised composite task branch, image generation can be accomplished using only the generator network: random noise is input to the generator, and forward propagation yields a high-quality image similar to those in the original data set.
The method for generating high-quality images with a generative adversarial network by introducing self-supervised composite task training exploits both information within images and information between images: it constructs a composite task comprising three subtasks to guide the network to learn more stable and more general image features, and it constructs a local discriminator to improve the network's ability to extract local image information, thereby markedly improving the training effect of the network and the quality of the finally generated images.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.
Based on the same idea as the method of the above embodiment for generating high-quality images with a generative adversarial network by introducing self-supervised composite task training, the invention also provides a system for generating high-quality images with a generative adversarial network by introducing self-supervised composite task training, which can be used to execute the foregoing method. For illustrative purposes, the schematic structural diagram of this system shows only the portions relevant to the embodiments of the present invention; those skilled in the art will understand that the illustrated structure does not limit the apparatus, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
Referring to fig. 9, in another embodiment of the present application, there is provided a system 100 for generating a high-quality image of an antagonistic network by introducing an auto-supervised compound task training, the system including a data set module 101, a compound task module 102, a model building module 103, a model training module 104, and an image generating module 105;
the data set module 101 is configured to prepare a training data set, where the training data set includes original image data and stitched image data; the original image data is used for a training process of a confrontation training branch, and the spliced image data is used for a training process of an automatic supervision compound task branch;
the composite task module 102 is configured to design three subtasks that form a composite task, where the composite task is used to construct the self-supervised composite task branch and provide supervision information for model training; the three subtasks are a rotation prediction task, a position prediction task and a common feature extraction task, where the rotation prediction task is used to correctly determine the rotation labels corresponding to the image blocks contained in each stitched image, the position prediction task is used to correctly determine the original position labels corresponding to the image blocks contained in each stitched image, and the common feature extraction task is used to first correctly determine which original image each image block belongs to and then extract the common features between homologous image blocks;
the model building module 103 is used for respectively building an antagonistic training branch and an automatic supervision compound task branch, wherein the antagonistic training branch comprises a local discriminator and a generator, and the automatic supervision compound task branch comprises a classifier with three output heads;
the model training module 104 is configured to train the built model to obtain a trained generator network; the training specifically comprises the following steps: the method comprises the steps that an original data set is used as input of a confrontation training branch, spliced image data is used as input of an automatic supervision composite task branch, networks in the two branches are trained, and the automatic supervision composite task branch is responsible for providing supervision information for a local discriminator in the confrontation training branch in the training process;
the image generation module 105 is configured to input an image to be processed to a trained generator network for image generation.
It should be noted that the system for generating high-quality images with a generative adversarial network by introducing self-supervised composite task training corresponds one to one with the method of the same name; the technical features and advantages described in the above method embodiment apply equally to the system embodiment, and for their specific content reference may be made to the description of the method embodiment, which is not repeated here.
In addition, in the system implementation of the above embodiment, the logical division into program modules is only an example; in practical applications, the above functions may be allocated to different program modules as needed, for example to meet the configuration requirements of corresponding hardware or for the convenience of software implementation. That is, the internal structure of the system is divided into different program modules to perform all or part of the functions described above.
Referring to fig. 10, in an embodiment, an electronic device 200 for implementing a method for generating an anti-network-generated high-quality image by introducing an auto-supervised compound task training is provided, and the electronic device may include a first processor 201, a first memory 202 and a bus, and may further include a computer program stored in the first memory 202 and executable on the first processor 201, such as an auto-supervised compound task training anti-network-generated high-quality image program 203.
The first memory 202 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The first memory 202 may in some embodiments be an internal storage unit of the electronic device 200, such as a removable hard disk of the electronic device 200. The first memory 202 may also be an external storage device of the electronic device 200 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 200. Further, the first memory 202 may also include both an internal storage unit and an external storage device of the electronic device 200. The first memory 202 can be used for storing not only application software installed in the electronic device 200 and various types of data, such as codes for generating a network-opposing-network-generating high-quality image program 203 for the self-supervision multitasking training, but also temporarily storing data that has been output or will be output.
The first processor 201 may in some embodiments be composed of an integrated circuit, for example a single packaged integrated circuit, or of a plurality of integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors and combinations of various control chips. The first processor 201 is the control unit of the electronic device: it connects the components of the whole electronic device through various interfaces and lines, and executes the functions and processes the data of the electronic device 200 by running or executing the programs or modules stored in the first memory 202 (e.g., the program 203 for generating high-quality images) and calling the data stored in the first memory 202.
Fig. 10 shows only an electronic device having certain components, and those skilled in the art will appreciate that the structure shown in fig. 10 does not constitute a limitation of the electronic device 200, which may include fewer or more components than shown, combine some components, or arrange the components differently.
The program 203 for generating high-quality images with a generative adversarial network trained with the self-supervised composite task, stored in the first memory 202 of the electronic device 200, is a combination of multiple instructions which, when executed by the first processor 201, can realize:
preparing a training data set, wherein the training data set comprises original image data and spliced image data; the original image data is used for a training process of a confrontation training branch, and the spliced image data is used for a training process of an automatic supervision compound task branch;
designing three subtasks to form a composite task, wherein the composite task is used to construct the self-supervised composite task branch and provide supervision information for model training; the three subtasks are respectively a rotation prediction task, a position prediction task and a common feature extraction task, wherein the rotation prediction task is used to correctly determine the rotation labels corresponding to the image blocks contained in each stitched image, the position prediction task is used to correctly determine the original position labels corresponding to the image blocks contained in each stitched image, and the common feature extraction task is used to first correctly determine which original image each image block belongs to and then extract the common features between homologous image blocks;
building a model, and respectively building an antagonistic training branch and an automatic supervision compound task branch, wherein the antagonistic training branch comprises a local discriminator and a generator, and the automatic supervision compound task branch comprises a classifier with three output heads;
training the built model to obtain a trained generator network; the training specifically comprises the following steps: the method comprises the steps that an original data set is used as input of a confrontation training branch, spliced image data is used as input of an automatic supervision composite task branch, networks in the two branches are trained, and the automatic supervision composite task branch is responsible for providing supervision information for a local discriminator in the confrontation training branch in the training process;
and inputting the image to be processed into a trained generator network for image generation.
Further, the integrated modules/units of the electronic device 200 may be stored in a non-volatile computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. The method for generating the confrontation network to generate the high-quality image by introducing the self-supervision compound task training is characterized by comprising the following steps:
preparing a training data set, wherein the training data set comprises original image data and spliced image data; the original image data is used for a training process of a confrontation training branch, and the spliced image data is used for a training process of an automatic supervision compound task branch;
designing three subtasks to form a composite task, wherein the composite task is used to construct the self-supervised composite task branch and provide supervision information for model training; the three subtasks are respectively a rotation prediction task, a position prediction task and a common feature extraction task, wherein the rotation prediction task is used to correctly determine the rotation labels corresponding to the image blocks contained in each stitched image, the position prediction task is used to correctly determine the original position labels corresponding to the image blocks contained in each stitched image, and the common feature extraction task is used to first correctly determine which original image each image block belongs to and then extract the common features between homologous image blocks;
building a model, and respectively building an antagonistic training branch and an automatic supervision compound task branch, wherein the antagonistic training branch comprises a local discriminator and a generator, and the automatic supervision compound task branch comprises a classifier with three output heads;
training the built model to obtain a trained generator network; the training specifically comprises the following steps: the method comprises the steps that an original data set is used as input of a confrontation training branch, spliced image data is used as input of an automatic supervision composite task branch, networks in the two branches are trained, and the automatic supervision composite task branch is responsible for providing supervision information for a local discriminator in the confrontation training branch in the training process;
and inputting the image to be processed into a trained generator network for image generation.
2. The method for generating the confrontation network to generate the high-quality image by introducing the self-supervision compound task training as claimed in claim 1, wherein the preparation of the stitched image data is specifically as follows:
for original images in a batch, cutting out 4 image blocks with a certain overlap ratio omega from the upper left area, the upper right area, the lower left area and the lower right area of each image, wherein the overlap ratio omega is equal to the ratio of the side length of the overlapped part between the adjacent image blocks to the side length of the original image;
randomly shuffling the obtained batch of image blocks and applying one rotation transformation to each, the rotation angle being randomly selected from the set R = {0°, 90°, 180°, 270°};
adjusting the size of the obtained image blocks of one batch by using a bilinear interpolation method to enable the side length of the image blocks to be equal to half of the side length of the original image;
and splicing the obtained image blocks by taking 4 image blocks as a group to obtain a batch of spliced images with the same size as the original image, thereby finishing the production of spliced image data.
3. The method for generating the confrontation network generation high-quality image by introducing the self-supervision compound task training as claimed in claim 1, wherein the three subtasks are specifically designed as follows:
in the rotation prediction task, for a batch of obtained stitched images, each image contains 4 image blocks each corresponding to a rotation angle, and each image block is given a pseudo label l_r via its rotation angle, l_r ∈ {0°, 90°, 180°, 270°};
In the position prediction task, for a batch of obtained stitched images, the 4 image blocks contained in each image respectively correspond to a fixed region position in the original image to which they belong, and each image block is given a pseudo label l_l via this position information, l_l ∈ {top-left, top-right, bottom-left, bottom-right};
in the common feature extraction task, for a batch of obtained spliced images, 4 image blocks contained in each image respectively correspond to one original image, the 4 image blocks belonging to the same original image are defined as homologous image blocks, and features with higher similarity between the homologous image blocks are defined as common features.
4. The method for generating the confrontation network to generate the high-quality image by introducing the self-supervision compound task training according to the claim 1, characterized in that the specific steps of constructing the model are as follows:
in the confrontation training branch, a local discriminator local-D and a generator G are constructed; the network structure of the local discriminator is divided into two parts, with a feature blocking module added between them. The first part receives an original image as input and extracts image features; the feature blocking module is responsible for processing the image features output by the first part into image block features; the second part receives the image block features as input and produces the final output of the local discriminator. The task of the local discriminator local-D is to correctly judge whether the block features come from a real image or a generated image. The loss function of this branch, recorded as L_adv, is consistent with the adversarial loss proposed in the original generative adversarial network; its specific expression is as follows:
L_adv = min_G max_D { E_{x∼Pdata(x)}[log D(x)] + E_{z∼Pz(z)}[log(1 − D(G(z)))] }
where x is a real image sampled from the original data set, Pdata(x) is the real data distribution, z is random noise sampled from the prior distribution, Pz(z) is the prior distribution, D is the local discriminator, and G is the generator;
in the self-supervision compound task branch, a classifier C is built with a network architecture consistent with that of the local discriminator in the confrontation training branch; the classifier network is likewise divided into two parts, of which the first part has the same architecture as the first part of local-D and shares its network weights, while the second part contains the three output heads of the classifier: two heads each consist of a fully connected layer and are respectively responsible for outputting the results of the rotation prediction task and the position prediction task, and the third head consists of a multilayer perceptron containing one hidden layer and is responsible for outputting the result of the common feature extraction task; the total loss function of this branch is recorded as L_CT.
5. The method for generating the confrontation network to generate the high-quality image by introducing the self-supervision compound task training as claimed in claim 4, wherein a plurality of loss functions are used to jointly optimize the self-supervision compound task branch, whose total loss function L_CT is defined as:
L_CT = L_rot + L_loc + L_CFE
wherein L_rot, L_loc, and L_CFE denote the rotation prediction task loss, the position prediction task loss, and the common feature extraction task loss, respectively;
the true rotation label of each image block in a stitched image is denoted l_r_gt and its true position label l_l_gt; the rotation label predicted by the classifier for the block is l_r and the predicted position label is l_l; the set of vectors output by the multilayer perceptron in the common feature extraction task is denoted v_1, v_2, …, v_k with k = n×4. The rotation and position prediction losses are computed with cross entropy, and the similarity between different block features is measured with cosine similarity. The three task losses are computed as follows:
L_rot = CrossEntropy(l_r, l_r_gt)
L_loc = CrossEntropy(l_l, l_l_gt)
sim(v_i, v_j) = (v_i · v_j) / (‖v_i‖ ‖v_j‖)

L_CFE = −(1/k) Σ_{i=1}^{k} (1/|C_i|) Σ_{j∈C_i} log [ exp(sim(v_i, v_j)/τ) / Σ_{m=1}^{k} I(m ≠ i) exp(sim(v_i, v_m)/τ) ]
wherein τ is a temperature coefficient, n is the training batch size, C_i is the set of indices of the image blocks homologous to the i-th block, and I is the indicator function, whose value is 1 when its condition holds and 0 otherwise.
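The three task losses can be sketched numerically. The cross-entropy part is standard; the contrastive form of L_CFE below is an assumption consistent with the symbols defined above (τ, the homologous sets C_i, and the indicator I), since the patent gives the formula only as an image:

```python
import numpy as np

def cross_entropy(logits, labels):
    """Softmax cross-entropy averaged over the batch (L_rot and L_loc)."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def cfe_loss(v, source_ids, tau=0.5):
    """Contrastive reading of L_CFE (an assumption about the exact form):
    pull projections of homologous blocks together under cosine similarity
    with temperature tau.  v is (k, d); source_ids marks which original
    image each of the k = n*4 blocks came from (defining the sets C_i)."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)  # unit vectors: dot = cosine
    sim = (v @ v.T) / tau
    k = len(v)
    total = 0.0
    for i in range(k):
        others = [m for m in range(k) if m != i]            # I(m != i)
        log_denom = np.log(np.exp(sim[i, others]).sum())
        pos = [j for j in others if source_ids[j] == source_ids[i]]  # C_i
        total += -np.mean([sim[i, j] - log_denom for j in pos])
    return total / k
```

With this form, projections that cluster by source image yield a lower L_CFE than projections that mix sources, which is the behavior the common feature extraction task rewards.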
6. The method for generating high-quality images with a generative adversarial network by introducing self-supervised composite task training according to claim 1, wherein the specific steps of model training are as follows:
in the adversarial training branch, the generator G and the local discriminator local-D are trained by alternating iterations; the input of the local discriminator is a batch of images sampled from the original image data set, and its training objective is to correctly judge the authenticity of regions within the input images; the input of the generator is random noise, and its training objective is to output generated images realistic enough to fool the local discriminator;
in the self-supervised composite task branch, the input of the classifier is a batch of stitched images, and the three output heads output the results of the three subtasks respectively; this branch trains the network through the total loss function L_CT;
the adversarial training branch and the self-supervised composite task branch are trained simultaneously; during training the two branches are coupled by sharing the network weights of the first parts of the local discriminator local-D and the three-head classifier C, and the total loss function of the model is defined as:
min_{G,C} max_D L = L_adv(G, D) + L_CT(C)
wherein L_adv(G, D) is the adversarial training loss and L_CT(C) is the self-supervised composite task loss; during model training the local discriminator local-D and the generator G are updated alternately, and the three-head classifier C is updated alongside them.
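One plausible realization of this update schedule can be sketched as follows. The claim fixes only that local-D and G alternate while C is updated alongside them; the exact interleaving below is an assumption:

```python
def train(steps, update_d, update_g, update_c):
    """Alternating schedule for claim 6: each iteration performs a
    discriminator step with a classifier step beside it (they share the
    first-part weights), then a generator step against the updated
    local discriminator.  Returns the trace of updates performed."""
    trace = []
    for _ in range(steps):
        update_d(); trace.append("D")   # local-D step on a real/fake batch
        update_c(); trace.append("C")   # classifier C step on stitched images
        update_g(); trace.append("G")   # generator step against frozen local-D
    return trace
```

Because C and local-D share their first part, the C step indirectly supplies the supervision information from the composite tasks to the discriminator's feature extractor.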
7. The method for generating high-quality images with a generative adversarial network by introducing self-supervised composite task training according to claim 1, wherein inputting the image to be processed into the trained generator network for image generation specifically comprises:
inputting random noise into the trained generator network, whereby a high-quality generated image resembling the training set images is obtained through forward propagation.
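The inference step of claim 7 reduces to sampling from the prior and one forward pass. A minimal sketch, with a toy linear map standing in for the trained generator network (the function name and dimensions are illustrative):

```python
import numpy as np

def generate_images(generator, batch, z_dim, seed=0):
    """Sample a batch of noise vectors from a standard-normal prior and
    run one forward pass through the (already trained) generator."""
    z = np.random.default_rng(seed).standard_normal((batch, z_dim))
    return generator(z)

# toy stand-in generator: z -> flat 8x8 "image" with values in (-1, 1)
W = np.random.default_rng(1).standard_normal((16, 64)) * 0.1
imgs = generate_images(lambda z: np.tanh(z @ W), batch=4, z_dim=16)
```

In the patent's setting the callable would be the trained generator G; only G is needed at inference time, since local-D and C serve training alone.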
8. A system for generating high-quality images with a generative adversarial network by introducing self-supervised composite task training, characterized in that it is applied to the method of any one of claims 1 to 7 and comprises a data set module, a composite task module, a model building module, a model training module, and an image generation module;
the data set module is used for preparing the training data set, which comprises original image data and stitched image data; the original image data are used in the training process of the adversarial training branch, and the stitched image data in the training process of the self-supervised composite task branch;
the composite task module is used for designing three subtasks that form the composite task, which constructs the self-supervised composite task branch and supplies supervision information for model training; the three subtasks are a rotation prediction task, a position prediction task, and a common feature extraction task: the rotation prediction task correctly judges the rotation label of each image block contained in a stitched image, the position prediction task correctly judges the position label of each image block, and the common feature extraction task first correctly judges which original image each image block belongs to and then extracts the common features among homologous image blocks;
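The stitched-image data that drive these three subtasks can be sketched as follows. The 2×2 grid, the 90-degree rotation set, and the label encoding are assumptions consistent with the four blocks per image (k = n×4) implied by claim 5:

```python
import numpy as np

def make_stitched(image, rng):
    """Hypothetical construction of one stitched training sample: cut a
    square image into a 2x2 grid of blocks, rotate each block by a random
    multiple of 90 degrees, and shuffle the block positions.  Returns the
    stitched image plus the true rotation labels (l_r_gt) and position
    labels (l_l_gt) the classifier must recover."""
    h, w = image.shape
    assert h == w and h % 2 == 0
    s = h // 2
    blocks = [image[:s, :s], image[:s, s:], image[s:, :s], image[s:, s:]]
    rot = rng.integers(0, 4, size=4)    # 0 / 90 / 180 / 270 degrees
    pos = rng.permutation(4)            # pos[j]: target grid cell of block j
    cells = [None] * 4
    for j in range(4):
        cells[pos[j]] = np.rot90(blocks[j], rot[j])
    top = np.concatenate(cells[:2], axis=1)
    bottom = np.concatenate(cells[2:], axis=1)
    return np.concatenate([top, bottom], axis=0), rot, pos
```

Stitching permutes and rotates pixels but never invents or discards them, so every pixel of the source image appears exactly once in the output; the common feature extraction task additionally records which source image each block came from.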
the model building module is used for building the adversarial training branch and the self-supervised composite task branch respectively; the adversarial training branch comprises a local discriminator and a generator, and the self-supervised composite task branch comprises a classifier with three output heads;
the model training module is used for training the built model to obtain a trained generator network; the training specifically comprises: using the original data set as the input of the adversarial training branch and the stitched image data as the input of the self-supervised composite task branch, training the networks in both branches, with the self-supervised composite task branch supplying supervision information to the local discriminator in the adversarial training branch during training;
and the image generation module is used for inputting the image to be processed into the trained generator network for image generation.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to cause the at least one processor to perform the method of generating high-quality images with a generative adversarial network by introducing self-supervised composite task training according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a program which, when executed by a processor, implements the method of generating high-quality images with a generative adversarial network by introducing self-supervised composite task training according to any one of claims 1 to 7.
CN202210033454.3A (filed 2022-01-12) Method and device for generating high-quality images with a generative adversarial network by introducing self-supervised composite task training — Pending — CN114529622A


Publication: CN114529622A (published 2022-05-24)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination