CN117541883B - Image generation model training, image generation method, system and electronic equipment


Info

Publication number
CN117541883B
Authority
CN
China
Prior art keywords
image
data
model
trained
image generation
Prior art date
Legal status
Active
Application number
CN202410027223.0A
Other languages
Chinese (zh)
Other versions
CN117541883A (en)
Inventor
He Jinlong
Gao Min
Kou Yong
Peng Linchun
Tao Li
Xu Xinyue
Current Assignee
Sichuan Jianshan Technology Co ltd
Original Assignee
Sichuan Jianshan Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Sichuan Jianshan Technology Co ltd
Priority to CN202410352373.9A, published as CN118298219A
Priority to CN202410027223.0A, published as CN117541883B
Publication of CN117541883A
Application granted
Publication of CN117541883B
Legal status: Active

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING › G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements using pattern recognition or machine learning › G06V10/764 using classification, e.g. of video objects
    • G06V10/40 Extraction of image or video features
    • G06V10/70 Arrangements using pattern recognition or machine learning › G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation › G06V10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting


Abstract

The application provides an image generation model training method, an image generation method and system, and an electronic device. A training data set is acquired; training data are input into an image generation model to be trained, and a model-generated image output by the model is obtained; a loss function value is determined based on optimal transport theory, the model-generated image and the actually acquired image; and internal parameters of the model are optimized according to the loss function value to obtain a trained image generation model. Because the loss function value is determined from optimal transport theory, the model-generated image and the actually acquired image, it aligns more closely with human visual perception; by optimizing the internal parameters of the image generation model to be trained with this loss function value, the resulting trained model can generate images that appear more realistic to the human eye. The method provided by the application can serve application fields such as the metaverse, digital twins, and intelligent planning and design.

Description

Image generation model training, image generation method, system and electronic equipment
Technical Field
The present application relates to the field of image generation technologies, and in particular, to an image generation model training method, an image generation method and system, and an electronic device.
Background
Image information is more intuitive and easier to understand than mere semantic information (e.g., text information), and images play a vital role in current information delivery.
Semantic image generation refers to generating a realistic image from semantic segmentation results. A few methods can currently generate realistic images from semantic images, e.g., the SMIS, pix2pixHD and SPADE models. However, images generated by existing image generation models are still not sufficiently realistic, and a certain gap remains between them and actually acquired images.
Disclosure of Invention
In view of the foregoing, an object of the embodiments of the present application is to provide an image generation model training method, an image generation method and system, and an electronic device, so as to solve the technical problem that images generated by existing image generation models are not sufficiently realistic.
In a first aspect, an embodiment of the present application provides an image generation model training method, including:
acquiring a training data set, wherein the training data set comprises a plurality of semantic tag images and actually acquired images corresponding to the semantic tag images;
inputting the semantic tag image and the actually acquired image into an image generation model to be trained, and obtaining a model-generated image output by the image generation model to be trained;
determining a loss function value based on optimal transport theory, the model-generated image and the actually acquired image;
and optimizing internal parameters of the image generation model to be trained according to the loss function value to obtain a trained image generation model.
In this implementation, the image generation model training method comprises acquiring a training data set comprising a plurality of semantic tag images and actually acquired images corresponding to the semantic tag images; inputting the semantic tag image and the actually acquired image into an image generation model to be trained, and obtaining a model-generated image output by the image generation model to be trained; determining a loss function value based on optimal transport theory, the model-generated image and the actually acquired image; and optimizing internal parameters of the image generation model to be trained according to the loss function value to obtain the trained image generation model. Because the loss function value is determined from optimal transport theory, the model-generated image and the actually acquired image, it aligns more closely with human visual perception; by optimizing the internal parameters of the image generation model to be trained with this loss function value, the resulting trained model generates images that appear more realistic to the human eye. This solves the technical problem that images generated by existing image generation models are not sufficiently realistic.
Optionally, in an embodiment of the present application, the loss function value includes a transport loss value and a penalty loss value, and determining the loss function value based on optimal transport theory, the model-generated image and the actually acquired image comprises: determining the transport loss value based on optimal transport theory, the model-generated image and the actually acquired image; and determining the penalty loss value based on the Lipschitz constraint, the model-generated image and the actually acquired image.
In this implementation, the loss function value comprises a transport loss value determined from optimal transport theory, the model-generated image and the actually acquired image, together with a penalty loss value determined from the Lipschitz constraint, the model-generated image and the actually acquired image. This mitigates mis-convergence while optimizing the internal parameters of the image generation model to be trained, so that the trained image generation model can generate more realistic images, closer to human visual perception.
Optionally, in an embodiment of the present application, the method for generating the image generation model to be trained includes generating it based on a generative adversarial network, wherein the generative adversarial network includes a generator and a discriminator.
In this implementation, because the image generation model to be trained is built on a generative adversarial network, no repeated sampling with a Markov chain is needed and no complex variational lower bound is involved, unlike traditional training methods; probability-computation difficulties are avoided, clearer model-generated images can be produced, the quality of the model-generated image improves, and the training difficulty of the model is reduced.
Optionally, in an embodiment of the present application, the image generation model to be trained specifically includes: an image encoder configured to obtain image feature data from the actually acquired image; a maximum pooling layer configured to perform a maximum pooling operation on the image feature data to obtain first dimension-reduced feature data; an average pooling layer configured to perform an average pooling operation on the image feature data to obtain second dimension-reduced feature data; a minimum pooling layer configured to perform a minimum pooling operation on the image feature data to obtain third dimension-reduced feature data; a shared multi-layer perceptron configured to perform feature sharing on the first, second and third dimension-reduced feature data and output first, second and third shared feature data; a first activation function layer configured to perform a weighted summation of the first, second and third shared feature data to obtain first summed feature data; a first generator configured to determine the model-generated image from the first summed feature data and the semantic tag image; and a discriminator configured to discriminate between the model-generated image and the actually acquired image.
In this implementation, a maximum pooling layer, an average pooling layer, a minimum pooling layer, a shared multi-layer perceptron and a first activation function layer are added between the image encoder and the first generator. The image feature data are aggregated by the several pooling layers, and the resulting first, second and third dimension-reduced feature data are then feature-shared by the shared multi-layer perceptron to obtain richer shared feature data, from which a more realistic model-generated image, closer to human visual perception, can be produced.
Optionally, in an embodiment of the present application, the image generation model to be trained specifically includes: an image encoder configured to obtain image feature data from the actually acquired image; a second generator comprising a reshaping module, an attention module and an up-sampling module, wherein the reshaping module is configured to reshape the image feature data and the semantic tag image to obtain a plurality of initial image data, the attention module includes a plurality of convolution layers with different convolution kernel sizes and a second activation function layer, the convolution layers with different kernel sizes being configured to perform feature extraction at different sizes on each initial image datum to obtain a plurality of feature extraction data of different sizes and the second activation function layer being configured to perform a weighted summation of the feature extraction data of different sizes to obtain second summed feature data, and the up-sampling module is configured to up-sample the plurality of second summed feature data to obtain the model-generated image; and a discriminator configured to discriminate between the model-generated image and the actually acquired image.
In this implementation, the second generator comprises a reshaping module, an attention module and an up-sampling module, and the attention module includes a plurality of convolution layers with different convolution kernel sizes and a second activation function layer. The convolution layers with different kernel sizes yield feature extraction data at several sizes, and the second activation function layer performs a weighted summation of them to obtain second summed feature data carrying more spatial information, so that a more realistic model-generated image, closer to human visual perception, is produced.
Optionally, in an embodiment of the present application, the image generation model to be trained specifically includes: an image encoder configured to obtain image feature data from the actually acquired image; a maximum pooling layer, an average pooling layer and a minimum pooling layer configured to perform maximum, average and minimum pooling operations on the image feature data to obtain first, second and third dimension-reduced feature data, respectively; a shared multi-layer perceptron configured to perform feature sharing on the three dimension-reduced feature data and output first, second and third shared feature data; a first activation function layer configured to perform a weighted summation of the shared feature data to obtain first summed feature data; a second generator comprising a reshaping module, an attention module and an up-sampling module, wherein the reshaping module is configured to reshape the first summed feature data and the semantic tag image to obtain a plurality of initial image data, the attention module includes a plurality of convolution layers with different convolution kernel sizes, configured to perform feature extraction at different sizes on each initial image datum, and a second activation function layer configured to perform a weighted summation of the resulting feature extraction data of different sizes to obtain second summed feature data, and the up-sampling module is configured to up-sample the plurality of second summed feature data to obtain the model-generated image; and a discriminator configured to discriminate between the model-generated image and the actually acquired image.
In this implementation, a maximum pooling layer, an average pooling layer, a minimum pooling layer, a shared multi-layer perceptron and a first activation function layer are added between the image encoder and the second generator. The image feature data are aggregated by the several pooling layers, and the resulting dimension-reduced feature data are feature-shared by the shared multi-layer perceptron to obtain richer shared feature data. In addition, the second generator comprises a reshaping module, an attention module and an up-sampling module, the attention module including a plurality of convolution layers with different convolution kernel sizes and a second activation function layer; the convolution layers with different kernel sizes yield feature extraction data at several sizes, which the second activation function layer weights and sums into second summed feature data carrying more spatial information. Based on the richer shared feature data and the second summed feature data with more spatial information, a more realistic model-generated image, closer to human visual perception, can be produced.
Optionally, in an embodiment of the present application, the semantic tag image includes at least one or more of a background semantic tag, a building semantic tag, a road semantic tag, a water area semantic tag, a wasteland semantic tag, a forest semantic tag and an agricultural semantic tag; correspondingly, the actually acquired image includes at least one or more of an actual background acquired image, an actual building acquired image, an actual road acquired image, an actual water area acquired image, an actual wasteland acquired image, an actual forest acquired image and an actual agriculture acquired image corresponding to the semantic tag image.
In this implementation, because the semantic tag images include one or more of these tag types and the actually acquired images include the corresponding acquired image types, the trained image generation model obtained in this way can be used to generate remote sensing images.
In a second aspect, an embodiment of the present application provides an image generating method, including:
acquiring a semantic tag image to be input;
inputting the semantic tag image into a trained image generation model, and obtaining an image generation result output by the trained image generation model; wherein the trained image generation model is determined based on the image generation model training method as described in any of the first aspects.
In a third aspect, embodiments of the present application provide an image generation model training system, including:
a training data set acquisition module, used for acquiring a training data set, the training data set comprising a plurality of semantic tag images and actually acquired images corresponding to the semantic tag images;
a generated image acquisition module, used for inputting the semantic tag image and the actually acquired image into an image generation model to be trained to obtain a model-generated image output by the image generation model to be trained;
a loss function determining module, used for determining a loss function value based on optimal transport theory, the model-generated image and the actually acquired image;
and a parameter optimization module, used for optimizing internal parameters of the image generation model to be trained according to the loss function value to obtain a trained image generation model.
Optionally, in an embodiment of the present application, the loss function value in the image generation model training system includes a transport loss value and a penalty loss value; the loss function determining module is specifically configured to determine the transport loss value based on optimal transport theory, the model-generated image and the actually acquired image, and to determine the penalty loss value based on the Lipschitz constraint, the model-generated image and the actually acquired image.
Optionally, in an embodiment of the present application, the image generation model training system further includes a model generation module, used for generating the image generation model to be trained based on a generative adversarial network, wherein the generative adversarial network includes a generator and a discriminator.
Optionally, in an embodiment of the present application, the image generation model to be trained may specifically include: an image encoder configured to obtain image feature data from the actually acquired image; a maximum pooling layer configured to perform a maximum pooling operation on the image feature data to obtain first dimension-reduced feature data; an average pooling layer configured to perform an average pooling operation on the image feature data to obtain second dimension-reduced feature data; a minimum pooling layer configured to perform a minimum pooling operation on the image feature data to obtain third dimension-reduced feature data; a shared multi-layer perceptron configured to perform feature sharing on the first, second and third dimension-reduced feature data and output first, second and third shared feature data; a first activation function layer configured to perform a weighted summation of the first, second and third shared feature data to obtain first summed feature data; a first generator configured to determine the model-generated image from the first summed feature data and the semantic tag image; and a discriminator configured to discriminate between the model-generated image and the actually acquired image.
Optionally, in an embodiment of the present application, the image generation model to be trained may specifically include: an image encoder configured to obtain image feature data from the actually acquired image; a second generator comprising a reshaping module, an attention module and an up-sampling module, wherein the reshaping module is configured to reshape the image feature data and the semantic tag image to obtain a plurality of initial image data, the attention module includes a plurality of convolution layers with different convolution kernel sizes and a second activation function layer, the convolution layers with different kernel sizes being configured to perform feature extraction at different sizes on each initial image datum to obtain feature extraction data of different sizes and the second activation function layer being configured to perform a weighted summation of the feature extraction data of different sizes to obtain second summed feature data, and the up-sampling module is configured to up-sample the plurality of second summed feature data to obtain the model-generated image; and a discriminator configured to discriminate between the model-generated image and the actually acquired image.
Optionally, in an embodiment of the present application, the image generation model to be trained may specifically include: an image encoder configured to obtain image feature data from the actually acquired image; a maximum pooling layer, an average pooling layer and a minimum pooling layer configured to perform maximum, average and minimum pooling operations on the image feature data to obtain first, second and third dimension-reduced feature data, respectively; a shared multi-layer perceptron configured to perform feature sharing on the three dimension-reduced feature data and output first, second and third shared feature data; a first activation function layer configured to perform a weighted summation of the shared feature data to obtain first summed feature data; a second generator comprising a reshaping module, an attention module and an up-sampling module, wherein the reshaping module is configured to reshape the first summed feature data and the semantic tag image to obtain a plurality of initial image data, the attention module includes a plurality of convolution layers with different convolution kernel sizes, configured to perform feature extraction at different sizes on each initial image datum, and a second activation function layer configured to perform a weighted summation of the resulting feature extraction data of different sizes to obtain second summed feature data, and the up-sampling module is configured to up-sample the plurality of second summed feature data to obtain the model-generated image; and a discriminator configured to discriminate between the model-generated image and the actually acquired image.
Optionally, in an embodiment of the present application, the semantic tag image in the image generation model training system includes at least one or more of a background semantic tag, a building semantic tag, a road semantic tag, a water area semantic tag, a wasteland semantic tag, a forest semantic tag and an agricultural semantic tag; correspondingly, the actually acquired image includes at least one or more of an actual background acquired image, an actual building acquired image, an actual road acquired image, an actual water area acquired image, an actual wasteland acquired image, an actual forest acquired image and an actual agriculture acquired image corresponding to the semantic tag image.
In a fourth aspect, embodiments of the present application provide an image generation system, including:
the semantic image acquisition module is used for acquiring a semantic tag image to be input;
the generation result acquisition module is used for inputting the semantic tag image into a trained image generation model and acquiring an image generation result output by the trained image generation model; wherein the trained image generation model is determined based on the image generation model training method as described in any of the first aspects.
In a fifth aspect, embodiments of the present application further provide an electronic device, including: a memory and a processor, the memory storing a computer program executable by the processor, the computer program, when executed by the processor, performing the image generation model training method as described in the first aspect above or the image generation method as described in the second aspect.
In a sixth aspect, embodiments of the present application also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, perform an image generation model training method as described in the first aspect or an image generation method as described in the second aspect above.
The beneficial effects of this application are as follows: a training data set comprising a plurality of semantic tag images and corresponding actually acquired images is acquired; the semantic tag image and the actually acquired image are input into an image generation model to be trained, and a model-generated image output by that model is obtained; a loss function value is determined based on optimal transport theory, the model-generated image and the actually acquired image; and internal parameters of the image generation model to be trained are optimized according to the loss function value to obtain a trained image generation model. Because the loss function value is determined from optimal transport theory, the model-generated image and the actually acquired image, it aligns more closely with human visual perception; by optimizing the internal parameters with this loss function value, the resulting trained model generates images that appear more realistic to the human eye, which solves the technical problem that images generated by existing image generation models are not sufficiently realistic.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should not be considered as limiting its scope; a person skilled in the art may obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of an image generation model training method provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of a first image generation model to be trained according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a second image generation model to be trained according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an attention module according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a third image generation model to be trained according to an embodiment of the present application;
fig. 6 is a flowchart of an image generating method according to an embodiment of the present application;
fig. 7 is a comparison chart of images generated by different image generation methods according to an embodiment of the present application;
fig. 8 is a schematic diagram of a semantic tag image according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an image generation model training system according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an image generating system according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the technical solutions of the present application will be described in detail below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical solutions of the present application, and thus are only examples, and are not intended to limit the scope of protection of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In the description of the embodiments of the present application, the technical terms "first," "second," etc. are used merely to distinguish between different objects and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated, a particular order or a primary or secondary relationship. In the description of the embodiments of the present application, the meaning of "plurality" is two or more unless explicitly defined otherwise.
Please refer to fig. 1, which illustrates a flowchart of an image generation model training method provided in an embodiment of the present application. The image generation model training method can comprise the following steps:
step 101, acquiring a training data set, the training data set comprising a plurality of semantic tag images and actually acquired images corresponding to the semantic tag images;
step 102, inputting the semantic tag image and the actually acquired image into an image generation model to be trained, and obtaining a model-generated image output by the image generation model to be trained;
step 103, determining a loss function value based on optimal transport theory, the model-generated image and the actually acquired image;
step 104, optimizing internal parameters of the image generation model to be trained according to the loss function value to obtain a trained image generation model.
In step 101, the semantic tag image may specifically include one or more of road semantic tags, forest semantic tags, building semantic tags, water area semantic tags and the like; a road semantic tag is a tag image containing road semantic information, a forest semantic tag is a tag image containing forest semantic information, a building semantic tag is a tag image containing building semantic information, and a water area semantic tag is a tag image containing water area semantic information. Accordingly, the actually acquired image may be one or more of an actually acquired road image, forest image, building image, water area image, and the like. The semantic tag image may be a set of single-color images (each containing only one color) or a single image containing several colors, with different colors corresponding to different semantic information: for instance, a tag image containing road semantic information may be represented by a predominantly pink image, one containing forest semantic information by a predominantly green image, one containing water area semantic information by a predominantly blue image, and one containing building semantic information by a predominantly yellow image. Semantic tag images containing different semantic information may also be represented by images with different gray values. It can be appreciated that the specific implementation of the semantic tag image can be adjusted to the actual application, which is not specifically limited in this application.
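For illustration only, a color-coded semantic tag image of the kind described above can be decoded into per-class masks before being fed to the model. The following minimal NumPy/PyTorch sketch assumes a hypothetical RGB palette; the specific color values and the helper name label_image_to_one_hot are illustrative assumptions, not part of this application:

```python
import numpy as np
import torch

# Hypothetical palette: RGB color -> semantic class index (illustrative values only)
PALETTE = {
    (255, 192, 203): 0,  # predominantly pink   -> road semantic information
    (0, 128, 0):     1,  # predominantly green  -> forest semantic information
    (0, 0, 255):     2,  # predominantly blue   -> water area semantic information
    (255, 255, 0):   3,  # predominantly yellow -> building semantic information
}

def label_image_to_one_hot(label_rgb: np.ndarray) -> torch.Tensor:
    """Convert an (H, W, 3) color-coded semantic tag image into a
    (C, H, W) one-hot label tensor, one channel per semantic class."""
    h, w, _ = label_rgb.shape
    one_hot = torch.zeros(len(PALETTE), h, w)
    for color, cls in PALETTE.items():
        mask = np.all(label_rgb == np.array(color), axis=-1)  # pixels of this class
        one_hot[cls][torch.from_numpy(mask)] = 1.0
    return one_hot
```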
In step 102, the image generation model to be trained may be generated based on a diffusion model (e.g., a denoising diffusion probabilistic model, a noise-conditional score network, a stochastic differential equation-based model, etc.) or on a generative adversarial network. Taking the generative adversarial network case as an example, the image generation model to be trained may include a generator and a discriminator, and may further include an image encoder.
In step 103, again taking the case where the image generation model to be trained includes a generator and a discriminator, the discriminator evaluates the model-generated image and the actually acquired image, and the loss function value is then determined from the discriminated model-generated image and actually acquired image according to optimal transport theory.
In step 104, the internal parameters of the image generation model to be trained may be optimized according to the loss function value until the loss function value between the model-generated image and the actually acquired image meets a preset loss threshold, at which point the trained image generation model is obtained.
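As a concrete illustration of steps 101 to 104, the following is a minimal PyTorch training-loop sketch. The names generator, discriminator and compute_ot_loss are hypothetical stand-ins for the modules described in this application (one possible compute_ot_loss is sketched in the loss-function discussion below), and the optimizer settings, generator input signature and threshold are assumptions rather than values disclosed here:

```python
import torch

def train(generator, discriminator, loader, compute_ot_loss,
          loss_threshold=0.05, lr=2e-4, max_epochs=200):
    # Hypothetical hyperparameters; not values disclosed in this application.
    g_opt = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.0, 0.9))
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(0.0, 0.9))
    for _ in range(max_epochs):
        for label_img, real_img in loader:              # step 101: training pairs
            fake_img = generator(label_img, real_img)   # step 102: model-generated image
            # step 103: OT-based loss between generated and actually acquired images
            d_loss = compute_ot_loss(discriminator, real_img, fake_img.detach())
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()
            # generator update: make the (fixed) discriminator score fakes higher
            g_loss = -discriminator(generator(label_img, real_img)).mean()
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()
        if d_loss.item() < loss_threshold:              # step 104: preset loss threshold
            break
    return generator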
Therefore, the image generation model training method provided by this embodiment acquires a training data set comprising a plurality of semantic tag images and corresponding actually acquired images; inputs the semantic tag image and the actually acquired image into an image generation model to be trained and obtains the model-generated image it outputs; determines a loss function value based on optimal transport theory, the model-generated image and the actually acquired image; and optimizes the internal parameters of the image generation model to be trained according to the loss function value to obtain a trained image generation model. Because the loss function value is determined from optimal transport theory, the model-generated image and the actually acquired image, it aligns more closely with human visual perception; optimizing the internal parameters with this loss function value yields a trained model that generates images appearing more realistic to the human eye, solving the technical problem that images generated by existing image generation models are not sufficiently realistic.
In some alternative embodiments, the loss function value includes a transport loss value and a penalty loss value, and step 103 of determining the loss function value based on optimal transport theory, the model-generated image and the actually acquired image includes: determining the transport loss value based on optimal transport theory, the model-generated image and the actually acquired image; and determining the penalty loss value based on the Lipschitz constraint, the model-generated image and the actually acquired image.
The model-generated images can be mapped to a common distribution such as a uniform or Gaussian distribution and represented as a discrete set in the feature space, and the loss function used during model training can then be built with a semi-discrete optimal transport method. In particular, the loss function value may be calculated as (reconstructed here from the symbol definitions given in the original; the transport term is written in the Kantorovich dual form commonly used for such losses):

$$\mathcal{L} \;=\; \underbrace{\mathbb{E}_{x \sim \mathbb{P}_r}\big[D(x)\big] - \mathbb{E}_{x \sim \mathbb{P}_g}\big[D(x)\big]}_{\mathcal{L}_{OT}} \;+\; \underbrace{\lambda_1\,\mathbb{E}_{x \sim \mathbb{P}_g}\Big[\big(\lVert \nabla D(x) \rVert_2 - 1\big)^2\Big] + \lambda_2\,\mathbb{E}_{x \sim \mathbb{P}_r}\Big[\big(\lVert \nabla D(x) \rVert_2 - 1\big)^2\Big]}_{\mathcal{L}_p}$$

where $\mathcal{L}$ is the loss function value, $\mathcal{L}_{OT}$ the transport loss value and $\mathcal{L}_p$ the penalty loss value; $x$ is a discrete value of the model-generated image in the feature space; $\mathbb{P}_g$ is the probability space of the generator and $\mathbb{P}_r$ that of the discriminator; $D$ is the discriminator function; $\lambda_1$ and $\lambda_2$ are the weight parameters of the two penalty terms; $\lVert \cdot \rVert_2$ is the second norm; and $\nabla$ is the gradient operator, so that $\nabla D$ denotes the gradient of the discriminator function. $\lambda_1$ may be 3, 5 or another reasonable value, and so may $\lambda_2$; their values may be equal or different and can be adjusted to the actual application to control the preference of the loss function.
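For illustration only, a minimal PyTorch sketch of this loss computation follows. It assumes the Kantorovich-dual transport term written above and gradient penalties on both real and generated samples; the names compute_ot_loss and _grad_penalty are hypothetical, not part of the application:

```python
import torch

def _grad_penalty(D, x):
    """Lipschitz penalty: push the gradient norm of D towards 1."""
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(D(x).sum(), x, create_graph=True)[0]
    return ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

def compute_ot_loss(D, real, fake, lam1=5.0, lam2=5.0):
    # Transport term in the Kantorovich dual form (an assumption about
    # the exact form used in the application), sign chosen so that the
    # discriminator minimizes this objective.
    transport = D(fake).mean() - D(real).mean()
    # Penalty terms weighted by lambda_1 and lambda_2 (5 is one of the
    # values suggested in the text).
    penalty = lam1 * _grad_penalty(D, fake) + lam2 * _grad_penalty(D, real)
    return transport + penalty
```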
In some optional embodiments, the method for generating the image generation model to be trained includes generating it based on a generative adversarial network, wherein the generative adversarial network includes a generator and a discriminator.
A generative adversarial network (GAN) may include a generator and a discriminator. The generator learns the distribution of the training data (the semantic tag images in the training data set) and, according to that distribution, determines the model-generated image corresponding to a semantic tag image input into the image generation model (whether the model to be trained or the trained model); the discriminator determines whether the data input to it is a model-generated image from the generator or an actually acquired image. The image generation model to be trained may further include an image encoder, a feature amplification module, or the like.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a first image generation model to be trained according to an embodiment of the present application.
In some optional embodiments, the image generation model to be trained may specifically include: an image encoder configured to obtain image feature data from the actually acquired image; a maximum pooling layer configured to perform a maximum pooling operation on the image feature data to obtain first dimension-reduced feature data; an average pooling layer configured to perform an average pooling operation on the image feature data to obtain second dimension-reduced feature data; a minimum pooling layer configured to perform a minimum pooling operation on the image feature data to obtain third dimension-reduced feature data; a shared multi-layer perceptron configured to perform feature sharing on the first, second and third dimension-reduced feature data and output first, second and third shared feature data; a first activation function layer configured to perform a weighted summation of the first, second and third shared feature data to obtain first summed feature data; a first generator configured to determine the model-generated image from the first summed feature data and the semantic tag image; and a discriminator configured to discriminate between the model-generated image and the actually acquired image.
The maximum pooling layer, average pooling layer and minimum pooling layer aggregate multiple pieces of information in the image feature data into several feature descriptors (the first, second and third dimension-reduced feature data) that further characterize attribute information of the image; the shared multi-layer perceptron then shares these descriptors to create richer shared feature data, and the first activation function layer performs a weighted summation of the shared feature data to obtain the first summed feature data from which the generator determines the model-generated image. The image encoder may be implemented by several convolution layers (whose kernel sizes may differ). The maximum pooling layer, average pooling layer, minimum pooling layer, shared multi-layer perceptron and first activation function layer may be placed between the image encoder and the first generator, or inside the image encoder after its convolution layers. The first activation function layer may be implemented by a Sigmoid or Tanh activation function. The image encoder, the first generator and the discriminator may all be implemented by existing image encoders, generators or discriminators.
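One common reading of this max/avg/min pooling plus shared-MLP branch is a CBAM-style channel attention with an extra minimum-pooling path. The following PyTorch sketch is such an interpretation only; the class name, reduction ratio and the equal weighting of the three branches are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TriplePoolChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.shared_mlp = nn.Sequential(          # shared multi-layer perceptron
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.act = nn.Sigmoid()                   # first activation function layer

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = feat.shape
        flat = feat.flatten(2)                    # (B, C, H*W)
        pooled = [flat.max(dim=2).values,         # first dimension-reduced descriptor
                  flat.mean(dim=2),               # second (average pooling)
                  flat.min(dim=2).values]         # third (minimum pooling)
        shared = sum(self.shared_mlp(p) for p in pooled)  # summed shared features
        attn = self.act(shared).view(b, c, 1, 1)
        return feat * attn  # first summed feature data, as attention-weighted features
```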
Referring to fig. 3, fig. 3 is a schematic structural diagram of a second image generation model to be trained according to an embodiment of the present application.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an attention module according to an embodiment of the present application.
In some optional embodiments, the image generation model to be trained may specifically include: an image encoder configured to obtain image feature data from the actually acquired image; a second generator comprising a reshaping module, an attention module and an up-sampling module, wherein the reshaping module is configured to reshape the image feature data and the semantic tag image to obtain a plurality of initial image data, the attention module includes a plurality of convolution layers with different convolution kernel sizes and a second activation function layer, the convolution layers with different kernel sizes being configured to perform feature extraction at different sizes on each initial image datum to obtain feature extraction data of different sizes and the second activation function layer being configured to perform a weighted summation of the feature extraction data of different sizes to obtain second summed feature data, and the up-sampling module is configured to up-sample the plurality of second summed feature data to obtain the model-generated image; and a discriminator configured to discriminate between the model-generated image and the actually acquired image.
The number of convolution layers may be 3, 5 or another reasonable value, and the convolution kernel sizes may specifically be 3×3, 5×5, 12×12, 24×24 or 32×32. The second activation function layer may be implemented by a Sigmoid or Tanh activation function. It will be appreciated that the number of convolution layers and their kernel sizes may be adjusted to the practical application; fig. 4 merely illustrates the exemplary case in which the attention module contains convolution layers of all of the 3×3, 5×5, 12×12, 24×24 and 32×32 kernel sizes at once.
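A hedged PyTorch sketch of this attention module follows; the learnable branch weights stand in for the "weighted summation", and 'same' padding (stride 1) is assumed so all branch outputs align spatially. None of these names or choices is taken from the application:

```python
import torch
import torch.nn as nn

class MultiScaleAttention(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 12, 24, 32)):
        super().__init__()
        # Parallel convolutions with the different kernel sizes named above.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, stride=1, padding='same')
            for k in kernel_sizes
        )
        # Assumed learnable fusion weights, one per branch.
        self.weights = nn.Parameter(torch.ones(len(kernel_sizes)))
        self.act = nn.Sigmoid()                   # second activation function layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]   # multi-size feature extraction
        fused = sum(w * f for w, f in zip(self.weights, feats))
        return self.act(fused)                    # second summed feature data
```

The up-sampling module can then enlarge the second summed feature data, for instance with nn.Upsample(scale_factor=2), before the model-generated image is produced.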
Referring to fig. 5, fig. 5 is a schematic structural diagram of a third image generation model to be trained according to an embodiment of the present application.
In some optional embodiments, the image generation model to be trained may specifically include: an image encoder configured to obtain image feature data from the actually acquired image; a maximum pooling layer, an average pooling layer and a minimum pooling layer configured to perform maximum, average and minimum pooling operations on the image feature data to obtain first, second and third dimension-reduced feature data, respectively; a shared multi-layer perceptron configured to perform feature sharing on the three dimension-reduced feature data and output first, second and third shared feature data; a first activation function layer configured to perform a weighted summation of the shared feature data to obtain first summed feature data; a second generator comprising a reshaping module, an attention module and an up-sampling module, wherein the reshaping module is configured to reshape the first summed feature data and the semantic tag image to obtain a plurality of initial image data, the attention module includes a plurality of convolution layers with different convolution kernel sizes, configured to perform feature extraction at different sizes on each initial image datum, and a second activation function layer configured to perform a weighted summation of the resulting feature extraction data of different sizes to obtain second summed feature data, and the up-sampling module is configured to up-sample the plurality of second summed feature data to obtain the model-generated image; and a discriminator configured to discriminate between the model-generated image and the actually acquired image.
The specific implementation of this embodiment can refer to the implementations of the two image generation models to be trained described above. The second generator may also be implemented on the basis of SPADE ResBlk by integrating the above-mentioned attention module into each of its up-sampling layers. When the image generation model to be trained comprises the image encoder, maximum pooling layer, average pooling layer, minimum pooling layer, shared multi-layer perceptron, first activation function layer, second generator and discriminator, the weight $\lambda_1$ in the loss function value may be 5 and $\lambda_2$ may also be 5, so as to obtain a more realistic model-generated image.
In some alternative embodiments, the semantic tag image may include one or more of a background semantic tag, a building semantic tag, a road semantic tag, a water area semantic tag, a wasteland semantic tag, a forest semantic tag and an agricultural semantic tag; correspondingly, the actually acquired image may include one or more of an actual background acquired image, an actual building acquired image, an actual road acquired image, an actual water area acquired image, an actual wasteland acquired image, an actual forest acquired image and an actual agriculture acquired image.
The image generation model training method described above can be used to determine a remote sensing image generation model. Specifically, by collecting semantic tag images such as background, building, road, water area, wasteland, forest and agricultural semantic tags, together with the corresponding actually acquired background, building, road, water area, wasteland, forest and agricultural images, an image generation model can be obtained that generates remote sensing images for different geographic environments. The semantic tag images are not limited to the above categories; more detailed model-generated images can be produced by defining finer tag categories. For example, a water area tag image may be refined by the pollution level of the corresponding water area, a forest tag image by the tree species and the season, an agricultural tag image by the planted crop types and the season, and so on.
Referring to fig. 6, fig. 6 is a flowchart of an image generating method according to an embodiment of the present application. The image generation method comprises the following steps:
step 201, acquiring a semantic tag image to be input;
step 202, inputting the semantic tag image into a trained image generation model, and obtaining an image generation result output by the trained image generation model; wherein the trained image generation model is determined based on the image generation model training method as described in any of the first aspects.
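For illustration, inference with the trained model may look like the following PyTorch sketch; the file path, the assumed [-1, 1] output range and the model's single-input signature are assumptions, not details disclosed in this application:

```python
import torch
from PIL import Image
import torchvision.transforms.functional as TF

@torch.no_grad()
def generate(trained_generator, label_path: str) -> Image.Image:
    # Step 201: acquire the semantic tag image to be input
    label = TF.to_tensor(Image.open(label_path).convert('RGB')).unsqueeze(0)
    # Step 202: obtain the image generation result from the trained model
    fake = trained_generator(label)
    fake = (fake.squeeze(0).clamp(-1, 1) + 1) / 2  # assumed [-1, 1] output range
    return TF.to_pil_image(fake)
```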
The semantic tag image may include only one type of tag image corresponding to a semantic tag, or may be a combination of tag images corresponding to a plurality of different semantic tags. The specific implementation manner of the semantic tag image can refer to the specific implementation manner of the semantic tag image in the image generation model training method. Referring to fig. 7, fig. 7 is a comparison chart of images generated by different image generating methods according to an embodiment of the present application; the leftmost column in fig. 7 is a plurality of semantic tags in the training process of the image generation model, the second column is a plurality of semantic tag images to be input, and the third column is an actual image corresponding to each semantic tag image; the fourth column is an image generation result obtained by performing image generation on the semantic tag image based on the image generation method provided by the application. It should be noted that the semantic tag image in the image generation model training process may also be in the form of the semantic tag image shown in the second column in fig. 7.
Referring to fig. 8, fig. 8 is a schematic diagram of a semantic tag image according to an embodiment of the present application. Fig. 8 specifically illustrates a schematic labeling diagram of the "semantic tag" corresponding to the third semantic tag image in the second column in fig. 7.
It can be seen that, based on the image generation method provided by the application, a model-generated image very close to the actual scene can be obtained.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an image generation model training system according to an embodiment of the present application. The image generation model training system comprises:
a training data set acquisition module 301, configured to acquire a training data set, the training data set comprising a plurality of semantic tag images and actually acquired images corresponding to the semantic tag images;
a generated image obtaining module 302, configured to input the semantic tag image and the actually acquired image into an image generation model to be trained, and obtain a model-generated image output by the image generation model to be trained;
a loss function determining module 303, configured to determine a loss function value based on optimal transport theory, the model-generated image and the actually acquired image;
and a parameter optimization module 304, configured to optimize internal parameters of the image generation model to be trained according to the loss function value, so as to obtain a trained image generation model.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an image generating system according to an embodiment of the present application. The image generation system includes:
a semantic image acquisition module 401, configured to acquire a semantic tag image to be input;
a generation result obtaining module 402, configured to input the semantic tag image into a trained image generation model, and obtain an image generation result output by the trained image generation model; wherein the trained image generation model is determined based on the image generation model training method as described in any of the first aspects.
It should be understood that the image generation model training system/image generation system corresponds to the method embodiments described above and can perform the steps involved therein; for the specific functions of these systems, reference may be made to the foregoing description, which is not repeated here to avoid redundancy. Each system comprises at least one software functional module that may be stored in memory in the form of software or firmware, or embedded in the operating system (OS) of the device.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application. An electronic device 500 provided in an embodiment of the present application includes a processor 501 and a memory 502, which are interconnected and communicate with each other through a communication bus 503 and/or another form of connection mechanism (not shown). The memory 502 stores a computer program executable by the processor 501; when executed by the processor 501, the program performs the image generation model training method described in the first aspect or the image generation method described in the second aspect.
Embodiments of the present application also provide a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, perform the image generation model training method described in the first aspect or the image generation method described in the second aspect.
The storage medium may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/system and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code comprising one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved. It will also be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The foregoing description is merely an optional implementation of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art may easily think about changes or substitutions within the technical scope of the embodiments of the present application, and the changes or substitutions should be covered in the scope of the embodiments of the present application.

Claims (9)

1. A method of training an image generation model, the method comprising:
acquiring a training data set; the training data set comprises a plurality of semantic tag images and actual acquired images corresponding to the semantic tag images;
inputting the semantic tag image and the actual acquired image into an image generation model to be trained, and obtaining a model-generated image output by the image generation model to be trained;
determining a loss function value based on optimal transport theory, the model-generated image, and the actual acquired image;
optimizing the internal parameters of the image generation model to be trained according to the loss function value to obtain a trained image generation model;
the method for generating the image generation model to be trained comprises the following steps: generating the image generation model to be trained based on a generation countermeasure network; the generation of the countermeasure network includes a generator and an authenticator;
the image generation model to be trained specifically comprises: an image encoder, a second generator, and a discriminator;
the image encoder is configured to obtain image feature data from the actual acquired image;
the second generator comprises: a reshaping module, an attention module, and an upsampling module; the reshaping module is configured to reshape the image feature data and the semantic tag image into a plurality of pieces of initial image data; the attention module comprises: a plurality of convolution layers with different convolution kernel sizes and a second activation function layer; the convolution layers with different kernel sizes are configured to extract features at different scales from each piece of initial image data to obtain feature extraction data of different sizes; the second activation function layer is configured to perform a weighted summation of the differently sized feature extraction data to obtain second summed feature data; the upsampling module is configured to upsample the plurality of second summed feature data to obtain the model-generated image;
the discriminator is configured to discriminate between the model-generated image and the actual acquired image.
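The attention module of the second generator in claim 1 can be read as a multi-scale convolutional attention block. The following PyTorch sketch is one plausible reading; the kernel sizes, channel counts, and softmax weighting are assumptions for illustration, not the patent's fixed design.

    import torch
    import torch.nn as nn

    class MultiScaleAttention(nn.Module):
        def __init__(self, channels: int, kernel_sizes=(1, 3, 5)):
            super().__init__()
            # Convolution layers with different kernel sizes: feature extraction
            # at different scales from each piece of initial image data.
            self.branches = nn.ModuleList(
                nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes
            )
            # One learnable logit per branch, normalized by the activation layer.
            self.branch_logits = nn.Parameter(torch.zeros(len(kernel_sizes)))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            feats = torch.stack([b(x) for b in self.branches])  # (K, B, C, H, W)
            # Second activation function layer: weighted summation of the
            # differently sized feature extractions.
            weights = torch.softmax(self.branch_logits, dim=0)
            return (weights.view(-1, 1, 1, 1, 1) * feats).sum(dim=0)

    # The upsampling module could then be, e.g., nn.Upsample(scale_factor=2)
    # applied to the summed feature data to reach the output resolution.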
2. The method of claim 1, wherein the loss function value comprises: a transport loss value and a penalty loss value;
and determining the loss function value based on optimal transport theory, the model-generated image, and the actual acquired image comprises:
determining the transport loss value based on optimal transport theory, the model-generated image, and the actual acquired image; and
determining the penalty loss value based on the Lipschitz constraint, the model-generated image, and the actual acquired image.
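The two loss terms of claim 2 match the familiar Wasserstein GAN with gradient penalty recipe: a transport loss estimating the optimal transport (Wasserstein) distance, plus a penalty enforcing the Lipschitz constraint on the discriminator. The sketch below is that standard recipe under assumed interpolation and weighting choices, not a verbatim transcription of the patent.

    import torch

    def transport_and_penalty_loss(critic, real, fake, lambda_gp: float = 10.0):
        # Transport loss value: critic score gap, a dual estimate of the
        # Wasserstein-1 distance from optimal transport theory.
        transport_loss = critic(fake).mean() - critic(real).mean()

        # Penalty loss value: drive the critic's gradient norm toward 1 on
        # random interpolates, enforcing the 1-Lipschitz constraint.
        eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
        interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
        grad = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
        penalty_loss = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

        return transport_loss + lambda_gp * penalty_loss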
3. The method according to claim 1, wherein the image generation model to be trained specifically further comprises:
a maximum pooling layer, configured to perform a maximum pooling operation on the image feature data to obtain first dimension-reduced feature data;
an average pooling layer, configured to perform an average pooling operation on the image feature data to obtain second dimension-reduced feature data;
a minimum pooling layer, configured to perform a minimum pooling operation on the image feature data to obtain third dimension-reduced feature data;
a shared multi-layer perceptron, configured to perform feature sharing on the first, second, and third dimension-reduced feature data and output first, second, and third shared feature data;
a first activation function layer, configured to perform a weighted summation of the first, second, and third shared feature data to obtain first summed feature data; and
a first generator, configured to determine the model-generated image from the first summed feature data and the semantic tag image.
4. The method according to claim 1, wherein the image generation model to be trained specifically further comprises:
a maximum pooling layer, configured to perform a maximum pooling operation on the image feature data to obtain first dimension-reduced feature data;
an average pooling layer, configured to perform an average pooling operation on the image feature data to obtain second dimension-reduced feature data;
a minimum pooling layer, configured to perform a minimum pooling operation on the image feature data to obtain third dimension-reduced feature data;
a shared multi-layer perceptron, configured to perform feature sharing on the first, second, and third dimension-reduced feature data and output first, second, and third shared feature data; and
a first activation function layer, configured to perform a weighted summation of the first, second, and third shared feature data to obtain first summed feature data;
wherein the reshaping module is specifically configured to reshape the first summed feature data and the semantic tag image into the plurality of pieces of initial image data.
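Claims 3 and 4 describe a triple-pooling channel attention branch: max, average, and minimum pooling compress the image feature data, a shared multi-layer perceptron processes each pooled vector, and an activation layer fuses the three results into the first summed feature data. The following is a minimal PyTorch sketch under assumed channel sizes and an assumed sigmoid fusion.

    import torch
    import torch.nn as nn

    class TriplePoolChannelAttention(nn.Module):
        def __init__(self, channels: int, reduction: int = 8):
            super().__init__()
            # Shared multi-layer perceptron applied to all three pooled vectors.
            self.shared_mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )

        def forward(self, feat: torch.Tensor) -> torch.Tensor:
            b, c, _, _ = feat.shape
            flat = feat.flatten(2)                    # (B, C, H*W)
            pooled = [flat.max(dim=2).values,         # first dimension-reduced data
                      flat.mean(dim=2),               # second dimension-reduced data
                      flat.min(dim=2).values]         # third dimension-reduced data
            shared = [self.shared_mlp(p) for p in pooled]
            # First activation function layer: weighted summation of the three
            # shared feature vectors (equal weights assumed), then a sigmoid gate.
            summed = torch.sigmoid(sum(shared) / 3.0).view(b, c, 1, 1)
            return feat * summed                      # first summed feature data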
5. The method according to claim 1 or 2, wherein the semantic tag image comprises one or more of: a background semantic tag, a building semantic tag, a road semantic tag, a water area semantic tag, a wasteland semantic tag, a forest semantic tag, and an agriculture semantic tag;
and the actual acquired image comprises one or more of: an actual background acquired image, an actual building acquired image, an actual road acquired image, an actual water area acquired image, an actual wasteland acquired image, an actual forest acquired image, and an actual agriculture acquired image, each corresponding to the respective semantic tag image.
6. An image generation method, the method comprising:
acquiring a semantic tag image to be input;
inputting the semantic tag image into a trained image generation model, and obtaining an image generation result output by the trained image generation model; wherein the trained image generation model is obtained by the image generation model training method according to any one of claims 1-5.
7. An image generation model training system, the system comprising:
a training data set acquisition module, configured to acquire a training data set; the training data set comprises a plurality of semantic tag images and actual acquired images corresponding to the semantic tag images;
a generated image obtaining module, configured to input the semantic tag image and the actual acquired image into an image generation model to be trained and obtain a model-generated image output by the image generation model to be trained;
a loss function determining module, configured to determine a loss function value based on optimal transport theory, the model-generated image, and the actual acquired image; and
a parameter optimization module, configured to optimize the internal parameters of the image generation model to be trained according to the loss function value to obtain a trained image generation model;
wherein the system further comprises a model generation module, configured to generate the image generation model to be trained based on a generative adversarial network; the generative adversarial network comprises a generator and a discriminator;
the image generation model to be trained specifically comprises: an image encoder, a second generator, and a discriminator;
the image encoder is configured to obtain image feature data from the actual acquired image;
the second generator comprises: a reshaping module, an attention module, and an upsampling module; the reshaping module is configured to reshape the image feature data and the semantic tag image into a plurality of pieces of initial image data; the attention module comprises: a plurality of convolution layers with different convolution kernel sizes and a second activation function layer; the convolution layers with different kernel sizes are configured to extract features at different scales from each piece of initial image data to obtain feature extraction data of different sizes; the second activation function layer is configured to perform a weighted summation of the differently sized feature extraction data to obtain second summed feature data; the upsampling module is configured to upsample the plurality of second summed feature data to obtain the model-generated image;
the discriminator is configured to discriminate between the model-generated image and the actual acquired image.
8. An image generation system, the system comprising:
the semantic image acquisition module is used for acquiring a semantic tag image to be input;
a generation result obtaining module, configured to input the semantic tag image into a trained image generation model and obtain an image generation result output by the trained image generation model; wherein the trained image generation model is obtained by the image generation model training method according to any one of claims 1-5.
9. An electronic device, the electronic device comprising:
a memory;
a processor;
wherein the memory stores a computer program executable by the processor, and the computer program, when executed by the processor, performs the method of any one of claims 1-6.
CN202410027223.0A 2024-01-09 2024-01-09 Image generation model training, image generation method, system and electronic equipment Active CN117541883B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202410352373.9A CN118298219A (en) 2024-01-09 2024-01-09 Image generation model training, image generation method, system and electronic equipment
CN202410027223.0A CN117541883B (en) 2024-01-09 2024-01-09 Image generation model training, image generation method, system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410027223.0A CN117541883B (en) 2024-01-09 2024-01-09 Image generation model training, image generation method, system and electronic equipment

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202410352373.9A Division CN118298219A (en) 2024-01-09 2024-01-09 Image generation model training, image generation method, system and electronic equipment

Publications (2)

Publication Number Publication Date
CN117541883A CN117541883A (en) 2024-02-09
CN117541883B true CN117541883B (en) 2024-04-09

Family

ID=89782695

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202410027223.0A Active CN117541883B (en) 2024-01-09 2024-01-09 Image generation model training, image generation method, system and electronic equipment
CN202410352373.9A Pending CN118298219A (en) 2024-01-09 2024-01-09 Image generation model training, image generation method, system and electronic equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202410352373.9A Pending CN118298219A (en) 2024-01-09 2024-01-09 Image generation model training, image generation method, system and electronic equipment

Country Status (1)

Country Link
CN (2) CN117541883B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242216A (en) * 2020-01-13 2020-06-05 北京工业大学 Image generation method for generating anti-convolution neural network based on conditions
CN111429355A (en) * 2020-03-30 2020-07-17 新疆大学 Image super-resolution reconstruction method based on generation countermeasure network
CN111476294A (en) * 2020-04-07 2020-07-31 南昌航空大学 Zero sample image identification method and system based on generation countermeasure network
CN111767962A (en) * 2020-07-03 2020-10-13 中国科学院自动化研究所 One-stage target detection method, system and device based on generation countermeasure network
CN114283158A (en) * 2021-12-08 2022-04-05 重庆邮电大学 Retinal blood vessel image segmentation method and device and computer equipment
CN114359659A (en) * 2021-12-17 2022-04-15 华南理工大学 Image automatic labeling method, system and medium based on attention disturbance
CN116664713A (en) * 2023-07-18 2023-08-29 脉得智能科技(无锡)有限公司 Training method of ultrasound contrast image generation model and image generation method
CN116840931A (en) * 2023-06-09 2023-10-03 南京简智仪器设备有限公司 Infusion soft bag lamp inspection method based on hyperspectral imaging technology

Also Published As

Publication number Publication date
CN117541883A (en) 2024-02-09
CN118298219A (en) 2024-07-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant