CN112614199A - Semantic segmentation image conversion method and device, computer equipment and storage medium - Google Patents

Semantic segmentation image conversion method and device, computer equipment and storage medium Download PDF

Info

Publication number
CN112614199A
Authority
CN
China
Prior art keywords
live
image
action
initial
countermeasure network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011321375.XA
Other languages
Chinese (zh)
Inventor
孟云龙 (Meng Yunlong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd
Priority to CN202011321375.XA
Publication of CN112614199A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/001: Texturing; Colouring; Generation of texture or colour
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a semantic segmentation image conversion method and apparatus, a computer device, and a storage medium. The method comprises the following steps: receiving an initial live-action image captured by a capture device; performing semantic segmentation on the initial live-action image to generate a corresponding semantic segmentation image; and inputting the semantic segmentation image into a multi-modal conditional generative adversarial network (GAN) and outputting a plurality of live-action images corresponding to the semantic segmentation image, the live-action images differing from one another in content modality. The multi-modal conditional GAN is trained with a difference index determined by a preset regulation function, and the preset regulation function is determined from the live-action images of the different content modalities and the generation parameters corresponding to those modalities. The method improves the diversity of the generated live-action images.

Description

Semantic segmentation image conversion method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of image generation, and in particular, to a semantic segmentation image conversion method, apparatus, computer device, and storage medium.
Background
Before a neural network model can be applied to business processing, it must be trained on a large set of training images. When such images are difficult to collect, image conversion can be applied to the initial images that were collected, yielding a much larger set of converted training images.
Conventionally, this image conversion is performed with a conditional generative adversarial network (conditional GAN).
However, a conditional GAN suffers from mode collapse: it can only produce one-to-one mappings, that is, only a single output image for each input image, so the output images are relatively uniform and lack diversity.
Disclosure of Invention
In view of the above, it is necessary to provide a semantic segmentation image conversion method, apparatus, computer device, and storage medium that can improve the diversity of the live-action images converted from a semantic segmentation image.
A semantic segmentation image conversion method, the method comprising:
receiving an initial live-action image captured by a capture device;
performing semantic segmentation on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image;
inputting the semantic segmentation image into a multi-modal conditional generative adversarial network (GAN) and outputting a plurality of live-action images corresponding to the semantic segmentation image, wherein the live-action images differ from one another in content modality, the multi-modal conditional GAN is trained with a difference index determined by a preset regulation function, and the preset regulation function is determined from the live-action images of the different content modalities and the generation parameters corresponding to those modalities.
In one embodiment, before the semantic segmentation is performed on the initial live-action image to generate the semantic segmentation image, the method further includes:
performing image size normalization on the initial live-action image to obtain a size-normalized initial live-action image;
and the performing semantic segmentation on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image includes:
performing semantic segmentation on the size-normalized initial live-action image to generate a semantic segmentation image corresponding to the size-normalized initial live-action image.
In one embodiment, inputting the semantic segmentation image into the multi-modal conditional GAN and outputting the plurality of live-action images corresponding to the semantic segmentation image includes:
performing feature extraction on the semantic segmentation image with an encoder in the multi-modal conditional GAN to generate a feature map corresponding to the semantic segmentation image;
and decoding and converting the feature map with a generator in the multi-modal conditional GAN to obtain the plurality of live-action images corresponding to the semantic segmentation image.
In one embodiment, decoding and converting the feature map with the generator in the multi-modal conditional GAN to obtain the plurality of live-action images corresponding to the semantic segmentation image includes:
configuring a plurality of different generation parameters in the generator, and generating the plurality of live-action images corresponding to the semantic segmentation image from the generation parameters and the feature map.
In one embodiment, training the multi-modal conditional GAN includes:
acquiring training set images;
inputting the training set images into a constructed initial multi-modal conditional GAN, and generating corresponding predicted live-action images based on each of the determined generation parameters;
generating, from each predicted live-action image and the training set images, an image set corresponding to that predicted live-action image;
inputting each image set into a discriminator for authenticity discrimination, and outputting a corresponding discrimination result;
and adjusting the network parameters of the initial multi-modal conditional GAN based on the discrimination results, and iteratively training the adjusted initial multi-modal conditional GAN for a preset number of iterations to obtain the trained multi-modal conditional GAN.
In one embodiment, after iteratively training the adjusted initial multi-modal conditional GAN for the preset number of iterations, the method further includes:
storing the initial multi-modal conditional GAN of each training iteration together with its training index value;
and obtaining the trained multi-modal conditional GAN includes:
determining the initial multi-modal conditional GAN corresponding to the highest training index value as the trained multi-modal conditional GAN.
In one embodiment, adjusting the network parameters of the initial multi-modal conditional GAN based on the discrimination results and iteratively training the adjusted network for the preset number of iterations to obtain the trained multi-modal conditional GAN includes:
calculating a loss value of the initial multi-modal conditional GAN from the discrimination results, and making a first adjustment to the network parameters of the initial multi-modal conditional GAN based on the loss value to obtain a first-adjusted initial multi-modal conditional GAN;
determining, from the predicted live-action images whose discrimination results are true and their corresponding generation parameters, a difference index between the predicted live-action images based on the preset regulation function, and making a second adjustment to the network parameters of the initial multi-modal conditional GAN based on the difference index to obtain a second-adjusted initial multi-modal conditional GAN;
and iteratively training the first-adjusted and second-adjusted initial multi-modal conditional GAN for the preset number of iterations to obtain the trained multi-modal conditional GAN.
A semantic segmentation image conversion apparatus, the apparatus comprising:
an initial live-action image receiving module, configured to receive an initial live-action image captured by a capture device;
a semantic segmentation processing module, configured to perform semantic segmentation on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image;
and a multi-modal live-action image generation module, configured to input the semantic segmentation image into a multi-modal conditional GAN and output a plurality of live-action images corresponding to the semantic segmentation image, wherein the live-action images differ from one another in content modality, the multi-modal conditional GAN is trained with a difference index determined by a preset regulation function, and the preset regulation function is determined from the live-action images of the different content modalities and the generation parameters corresponding to those modalities.
A computer device comprising a memory storing a computer program and a processor that implements the steps of the method of any of the above embodiments when executing the computer program.
A computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the method of any of the above embodiments.
With the above semantic segmentation image conversion method and apparatus, computer device, and storage medium, a semantic segmentation image is obtained and input into a multi-modal conditional GAN, and a plurality of live-action images corresponding to the semantic segmentation image are output, the live-action images differing from one another in content modality. Because the multi-modal conditional GAN is trained with a difference index determined by a preset regulation function, and the preset regulation function is determined from the live-action images of the different content modalities and the generation parameters corresponding to those modalities, the diversity of the live-action images generated by the multi-modal conditional GAN is improved.
Drawings
FIG. 1 is a diagram of an application scenario of a semantic segmentation image transformation method in one embodiment;
FIG. 2 is a flow diagram illustrating a method for semantic segmentation image transformation in one embodiment;
FIG. 3 is a flow chart illustrating the semantic segmentation image conversion step in another embodiment;
FIG. 4 is a flow diagram illustrating the steps of generating a plurality of live-action images in one embodiment;
FIG. 5 is a flowchart illustrating the iterative training step of the network in one embodiment;
FIG. 6 is a block diagram of a semantic segmentation image conversion apparatus according to an embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The semantic segmentation image conversion method provided by the application can be applied in the environment shown in fig. 1, in which a capture device 102 communicates with a server 104 over a network. The capture device 102 captures an initial live-action image and sends it to the server 104 over the network. After receiving the initial live-action image, the server 104 performs semantic segmentation on it to generate a corresponding semantic segmentation image. The server 104 then inputs the semantic segmentation image into a multi-modal conditional GAN and outputs a plurality of live-action images corresponding to the semantic segmentation image. The multi-modal conditional GAN is trained with a difference index determined by a preset regulation function, and the preset regulation function is determined from the live-action images of different content modalities and the generation parameters corresponding to those modalities. The capture device 102 may be, but is not limited to, any device with image capture and transmission capability, such as a camera, video camera, or video recorder, and the server 104 may be implemented as an independent server or as a cluster of servers.
In one embodiment, as shown in fig. 2, a semantic segmentation image conversion method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step S202, receiving an initial live-action image collected by a collecting device.
The initial live-action image refers to an image generated by acquiring a real life scene through an acquisition device, and may be a road live-action image or the like, for example.
In this embodiment, due to the limitation of the live-action scene, when the live-action image is collected by the collecting device, the live-action image meeting the preset requirement cannot be collected, or a sufficient number of live-action images cannot be collected. The method and the device are used for generating a large batch of live-action images based on a small number of collected initial live-action images.
In this embodiment, after the capturing device captures the initial live-action image, the initial live-action image may be transmitted to the server through the network, so as to perform further processing through the server.
Step S204: performing semantic segmentation on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image.
A semantic segmentation image is an image in which every pixel of the original image has been classified and is rendered according to its category; an example is the condition image input shown in fig. 3.
On acquiring the initial live-action image, the server performs semantic segmentation on it to obtain the corresponding semantic segmentation image. For example, the server may apply region-based semantic segmentation, fully convolutional network (FCN) semantic segmentation, weakly supervised semantic segmentation, or the like to obtain the semantic segmentation image corresponding to the initial live-action image.
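As an illustrative sketch (not part of the patented method), the segmentation step might look as follows in Python, using a pretrained torchvision FCN purely as a stand-in for whichever segmentation network an implementation actually adopts:

```python
# A minimal sketch of the semantic segmentation step, assuming a pretrained
# torchvision FCN stands in for the segmentation network used in practice.
import torch
from torchvision import transforms
from torchvision.models.segmentation import fcn_resnet50
from PIL import Image

model = fcn_resnet50(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def semantic_segmentation(image: Image.Image) -> torch.Tensor:
    """Return a per-pixel class-index map for the initial live-action image."""
    x = preprocess(image).unsqueeze(0)          # (1, 3, H, W)
    with torch.no_grad():
        logits = model(x)["out"]                # (1, C, H, W) class scores
    return logits.argmax(dim=1).squeeze(0)      # (H, W) category per pixel
```

The returned class-index map can then be rendered in a category color style to form the condition image of fig. 3.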
Step S206: inputting the semantic segmentation image into a multi-modal conditional GAN and outputting a plurality of live-action images corresponding to the semantic segmentation image, wherein the live-action images differ from one another in content modality, the multi-modal conditional GAN is trained with a difference index determined by a preset regulation function, and the preset regulation function is determined from the live-action images of the different content modalities and the generation parameters corresponding to those modalities.
After acquiring the semantic segmentation image, the server inputs it into a pre-trained multi-modal conditional GAN, which generates a plurality of live-action images of different content modalities.
Continuing with fig. 3, the live-action images generated by the multi-modal conditional GAN, shown at 301 and 302, have different content modalities: they differ in color, in pixel values, and so on.
In this embodiment, the multi-modal conditional GAN is trained with the difference index determined by the preset regulation function, which may take the form shown in formula (1):
\[
d(z_1, z_2) = \frac{\lVert G(E(x), z_1) - G(E(x), z_2) \rVert_1}{\lVert z_1 - z_2 \rVert_1} \tag{1}
\]
where G(E(x), z_1) is the live-action image output for generation parameter z_1 and G(E(x), z_2) is the live-action image output for generation parameter z_2. When the server trains the multi-modal conditional GAN, the difference index determined by the regulation function is made as large as possible, that is, the output live-action images G(E(x), z_1) and G(E(x), z_2) are made as different as possible, so that the trained multi-modal conditional GAN can generate a plurality of widely differing live-action images.
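A minimal PyTorch sketch of formula (1), assuming L1 distances; the two images are the generator outputs G(E(x), z_1) and G(E(x), z_2):

```python
import torch

def difference_index(img1, img2, z1, z2):
    """Difference index of formula (1): output distance over latent distance.

    img1 and img2 are G(E(x), z1) and G(E(x), z2); a larger value means the
    two generated live-action images differ more for a given latent separation.
    """
    num = torch.mean(torch.abs(img1 - img2))       # L1 distance of images
    den = torch.mean(torch.abs(z1 - z2)) + 1e-8    # L1 distance of latents
    return num / den
```

During training this index is driven as large as possible, for example by adding its reciprocal (or negative) to the generator loss.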
With the above semantic segmentation image conversion method, a semantic segmentation image is obtained and input into a multi-modal conditional GAN, and a plurality of live-action images of different content modalities are output. Because the multi-modal conditional GAN is trained with a difference index determined by a preset regulation function, and the preset regulation function is determined from the live-action images of the different content modalities and the generation parameters corresponding to those modalities, the diversity of the generated live-action images is improved.
In one embodiment, before the semantic segmentation is performed on the initial live-action image to generate the semantic segmentation image, the method may further include: performing image size normalization on the initial live-action image to obtain a size-normalized initial live-action image.
The initial live-action images captured by different terminals may have different sizes, so the initial live-action images received by the server are not of uniform size.
After receiving an initial live-action image, the server may therefore preprocess it, for example by normalizing its size, adjusting the image to a preset size.
Specifically, to normalize the image size, the server may check whether the aspect ratio of the initial live-action image matches the aspect ratio of the preset size. If it does not, the server pads the image with preset pixels until the aspect ratios match and then scales it to the preset size; if it does, the server simply scales the image to the preset size.
For example, if the preset size is 256 × 256 and the initial live-action image obtained from the terminal is 255 × 256, the server may pad the length of the image to 256 with preset pixels, for example zero-value pixels; similarly, if the initial live-action image is 256 × 255, the server may pad its width to 256 with zero-value pixels.
Likewise, an initial live-action image of 512 × 510 may be padded with zero-value pixels to 512 × 512, so that its aspect ratio matches that of the preset size, and then reduced to 256 × 256; an initial live-action image of 128 × 123 may be padded with zero-value pixels to 128 × 128 and then enlarged to 256 × 256.
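A short sketch of this pad-then-scale normalization, assuming zero-value padding, a square 256 × 256 preset size, and Pillow for the image handling (all names are illustrative):

```python
from PIL import Image

def normalize_size(image: Image.Image, preset: int = 256) -> Image.Image:
    """Pad with zero-value pixels until square, then scale to preset x preset."""
    w, h = image.size
    side = max(w, h)
    canvas = Image.new(image.mode, (side, side), 0)   # zero-pixel fill
    canvas.paste(image, (0, 0))                       # keep original content
    return canvas.resize((preset, preset), Image.BILINEAR)

# e.g. normalize_size(Image.open("road.png")) -> a 256 x 256 image
```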
In other embodiments, the server may also adjust the image brightness, apply image rectification, and so on, to obtain a preprocessed initial live-action image.
In this embodiment, performing semantic segmentation on the initial live-action image to generate the corresponding semantic segmentation image may include: performing semantic segmentation on the size-normalized initial live-action image to generate a semantic segmentation image corresponding to the size-normalized initial live-action image.
Specifically, after obtaining the size-normalized initial live-action image, the server performs semantic segmentation on it, for example region-based semantic segmentation, fully convolutional network semantic segmentation, or weakly supervised semantic segmentation, to obtain the corresponding semantic segmentation image.
Those skilled in the art will understand that the image size normalization may instead be performed after the semantic segmentation: the initial live-action image is first segmented to obtain a semantic segmentation image, and the size of that image is then normalized to meet the input requirements of the multi-modal conditional GAN.
In the above embodiment, normalizing the size of the initial live-action image before the semantic segmentation ensures that the output semantic segmentation image meets the input requirements of the multi-modal conditional GAN, which improves the accuracy of the live-action images it generates.
In one embodiment, referring to fig. 4, the server inputting the semantic segmentation image into the multi-modal conditional GAN and outputting the plurality of live-action images corresponding to it may include:
Step S402: performing feature extraction on the semantic segmentation image with an encoder in the multi-modal conditional GAN to generate a feature map corresponding to the semantic segmentation image.
The multi-modal conditional GAN may include an encoder and a corresponding generator (or decoder). After the server inputs the image into the network, the encoder performs feature extraction on the semantic segmentation image to obtain its feature map.
The encoder may include several convolutional layers of decreasing spatial size and generally increasing channel count, for example eight convolutional layers with 64, 128, 512, 512, 512, 512, 512, and 512 channels respectively.
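A minimal PyTorch sketch of such an encoder; only the channel counts come from the text above, while the kernel size, stride, and activation are assumptions chosen so that each layer halves the spatial size:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Eight downsampling conv layers: 64, 128, 512, 512, 512, 512, 512, 512."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        channels = [64, 128, 512, 512, 512, 512, 512, 512]
        layers, prev = [], in_channels
        for ch in channels:
            layers += [
                nn.Conv2d(prev, ch, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
            ]
            prev = ch
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)   # feature map E(x) passed on to the generator
```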
Step S404: decoding and converting the feature map with a generator in the multi-modal conditional GAN to obtain the plurality of live-action images corresponding to the semantic segmentation image.
The generator may have a multi-layer network structure; for example, it may be a generator with skip connections, or a generic generator.
In one embodiment, the generator with skip connections may include several convolutional layers, Dropout layers, and a Tanh layer; specifically, a first 512-channel convolutional layer, followed in order by a Dropout layer, a 1024-channel convolutional layer, a 512-channel convolutional layer, a 125-channel convolutional layer, a 3-channel convolutional layer, and a Tanh layer.
In another embodiment, a generic generator may include several convolutional layers, for example a 64-channel convolutional layer, a 128-channel convolutional layer, a 256-channel convolutional layer, and several 512-channel convolutional layers.
The server inputs the feature map produced by the encoder into the generator, which decodes and converts it into the live-action images corresponding to the semantic segmentation image.
Those skilled in the art will understand that the encoder and generator structures described above are merely examples; other structures are possible in other embodiments, and the application is not limited in this respect.
In one embodiment, decoding and converting the feature map with the generator to obtain the plurality of live-action images may include: configuring a plurality of different generation parameters in the generator, and generating the plurality of live-action images corresponding to the semantic segmentation image from the generation parameters and the feature map.
Specifically, with continued reference to fig. 3, after generating the feature map the server draws two unequal generation parameters, latent vectors z_1 and z_2, from the normal distribution N(0, 1) and feeds them into the latent space between the encoder and the generator.
The generator then produces one live-action image per generation parameter from the feature map; that is, it generates G(E(x), z_1) from z_1 and G(E(x), z_2) from z_2.
In the above embodiment, different generation parameters are configured and several live-action images are produced from the same feature map, so the generated live-action images differ from one another; see the sketch below.
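A brief sketch of this sampling step, reusing the illustrative Encoder above; the latent dimension and the way z is injected (broadcast and concatenated onto the feature map) are assumptions, since the patent does not fix them:

```python
import torch

def generate_variants(encoder, generator, seg_image, num_variants=2, z_dim=8):
    """Produce several live-action images from one semantic segmentation image."""
    features = encoder(seg_image)                       # E(x)
    variants = []
    for _ in range(num_variants):
        z = torch.randn(seg_image.size(0), z_dim)       # latent vector ~ N(0, 1)
        z_map = z[:, :, None, None].expand(
            -1, -1, features.size(2), features.size(3))
        variants.append(generator(torch.cat([features, z_map], dim=1)))
    return variants                                     # [G(E(x), z_i), ...]
```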
In one embodiment, training the multi-modal conditional GAN may include: acquiring training set images; inputting the training set images into a constructed initial multi-modal conditional GAN and generating corresponding predicted live-action images based on each of the determined generation parameters; generating, from each predicted live-action image and the training set images, an image set corresponding to that predicted image; inputting each image set into a discriminator for authenticity discrimination and outputting a corresponding discrimination result; and adjusting the network parameters of the initial multi-modal conditional GAN based on the discrimination results and iteratively training the adjusted network for a preset number of iterations to obtain the trained multi-modal conditional GAN.
In this embodiment, the server obtains a batch of paired two-dimensional plane maps and satellite map images and uses the two-dimensional plane maps as the training set images.
The server then randomly draws two unequal generation parameters, latent vectors z_1 and z_2, from the normal distribution N(0, 1) and feeds them into the latent space between the encoder and the generator of the initial multi-modal conditional GAN.
The network extracts features from the training set images and generates corresponding test map images from the generation parameters and the extracted feature maps.
The server then assembles image sets from the test map images and the training set images, inputs them into the discriminator for authenticity discrimination, and obtains the corresponding discrimination results.
When the discrimination result output by the discriminator is false, the server calculates a loss value from the generated test map image and the satellite map image paired with the two-dimensional plane map, and updates the network parameters of the initial multi-modal conditional GAN based on that loss value.
The server then iteratively trains the adjusted initial multi-modal conditional GAN for the preset number of iterations, until the discrimination result output by the discriminator is true, obtaining the trained multi-modal conditional GAN.
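One training iteration might be sketched as follows; the binary cross-entropy adversarial loss, the two-argument conditional discriminator, and the latent injection scheme are all assumptions carried over from the earlier sketches, not details fixed by the patent:

```python
import torch
import torch.nn.functional as F

def train_step(encoder, generator, discriminator, g_opt, d_opt,
               plane_map, satellite, z_dim=8):
    """One alternating update: discriminator first, then encoder/generator."""
    feats = encoder(plane_map)
    z = torch.randn(plane_map.size(0), z_dim)
    z_map = z[:, :, None, None].expand(-1, -1, feats.size(2), feats.size(3))
    fake = generator(torch.cat([feats, z_map], dim=1))   # test map image

    # Discriminator: real pairs -> 1 (true), generated pairs -> 0 (false).
    d_real = discriminator(plane_map, satellite)
    d_fake = discriminator(plane_map, fake.detach())
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Encoder/generator: update until the discriminator judges the output true.
    d_out = discriminator(plane_map, fake)
    g_loss = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```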
In this embodiment, the authenticity of each generated image is discriminated, the corresponding discrimination results are output, the network parameters of the initial multi-modal conditional GAN are adjusted based on those results, and the adjusted network is iteratively trained for the preset number of iterations. This makes the resulting multi-modal conditional GAN more accurate, improving the accuracy of the generated live-action images.
In one embodiment, after iteratively training the adjusted initial multi-modal conditional GAN for the preset number of iterations, the method may further include: storing the initial multi-modal conditional GAN of each training iteration together with its training index value.
The server may store, during model training or after the iterative training, the initial multi-modal conditional GAN of each iteration and the training index value corresponding to it.
The training index value is an index measuring how well the initial multi-modal conditional GAN has trained; it may be an index level or a score.
In other embodiments, the server may instead store only the network parameters of the initial multi-modal conditional GAN of each iteration together with the training index value; the application is not limited in this respect.
In this embodiment, obtaining the trained multi-modal conditional GAN may include: determining the initial multi-modal conditional GAN corresponding to the highest training index value as the trained multi-modal conditional GAN.
Specifically, after the preset number of training iterations is completed, the server ranks the stored training index values, takes the initial multi-modal conditional GAN with the highest value as the trained multi-modal conditional GAN, and uses it for subsequent testing and application.
In the above embodiment, storing the network of every training iteration together with its training index value and, once training completes, selecting the network with the highest index value ensures that the final multi-modal conditional GAN is the one that trained best, improving the accuracy of its predictions.
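The checkpoint bookkeeping this describes might be sketched as follows; the paths, checkpoint format, and the evaluate callback that produces the training index value are all illustrative assumptions:

```python
import os
import torch

def train_and_select(model, train_one_iteration, evaluate, n_iters,
                     ckpt_dir="checkpoints"):
    """Save each iteration's network with its training index value,
    then reload the one that scored highest."""
    os.makedirs(ckpt_dir, exist_ok=True)
    scored = []
    for i in range(n_iters):
        train_one_iteration(model)
        score = evaluate(model)                  # training index value
        path = os.path.join(ckpt_dir, f"iter_{i:04d}.pt")
        torch.save({"state_dict": model.state_dict(), "score": score}, path)
        scored.append((score, path))
    _, best_path = max(scored)                   # highest index value wins
    model.load_state_dict(torch.load(best_path)["state_dict"])
    return model
```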
In one embodiment, referring to fig. 5, the server adjusting the network parameters of the initial multi-modal conditional GAN based on the discrimination results and iteratively training the adjusted network for a preset number of iterations to obtain the trained multi-modal conditional GAN may include:
Step S502: calculating a loss value of the initial multi-modal conditional GAN from the discrimination results, and making a first adjustment to its network parameters based on the loss value to obtain a first-adjusted initial multi-modal conditional GAN.
Specifically, after obtaining a discrimination result, the server checks whether it is true or false. When the discriminator outputs false, the server calculates a loss value from the generated test map image and the satellite map image paired with the two-dimensional plane map, and makes the first adjustment to the network parameters of the initial multi-modal conditional GAN based on that loss value.
The loss value may be computed with different loss functions, such as a cross-entropy loss, a binary classification loss, or a multi-class classification loss; the application is not limited in this respect.
Step S504: determining, from the predicted live-action images whose discrimination results are true and their corresponding generation parameters, a difference index between the predicted live-action images based on the preset regulation function, and making a second adjustment to the network parameters of the initial multi-modal conditional GAN based on the difference index, to obtain a second-adjusted initial multi-modal conditional GAN.
The server may compute the difference index between the predicted live-action images generated with different generation parameters, using the preset regulation function and the corresponding generation parameters.
The regulation function can take the form of formula (1) above and is not described again here.
To make the live-action images output by the trained multi-modal conditional GAN differ from one another as much as possible, the server makes the second adjustment to the network parameters of the initial multi-modal conditional GAN according to the computed difference index, obtaining the second-adjusted network.
Step S506: iteratively training the first-adjusted and second-adjusted initial multi-modal conditional GAN for the preset number of iterations to obtain the trained multi-modal conditional GAN.
The server performs the preset number of training iterations on the first-adjusted and second-adjusted initial multi-modal conditional GAN, obtaining the trained multi-modal conditional GAN.
In the above embodiment, computing the loss value and the difference index separately, and adjusting and iterating the initial multi-modal conditional GAN with each, improves the training accuracy and hence the prediction accuracy of the final multi-modal conditional GAN.
In this embodiment, the multi-modal conditional GAN and the discriminator can be trained by cross-iterative training, which proceeds as follows. A first-generation multi-modal conditional GAN generates poor images, and a first-generation discriminator can accurately separate those generated images from real images; in short, the discriminator is a binary classifier that outputs 0 for images generated by the multi-modal conditional GAN and 1 for real images. Based on the discriminator's results, the server then trains a second-generation multi-modal conditional GAN that generates slightly better images, good enough that the first-generation discriminator takes them for real. The server next trains a second-generation discriminator that again accurately separates real images from the second generation's output. This continues through third, fourth, ... n-th generation networks and discriminators, until the final discriminator can no longer distinguish images generated by the multi-modal conditional GAN from real images, at which point training is complete.
It should be understood that although the steps in the flowcharts of fig. 2, 4, and 5 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, there is no strict ordering of these steps, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2, 4, and 5 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially; they may be performed in turn or in alternation with other steps or with sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, a semantic segmentation image conversion apparatus is provided, comprising an initial live-action image receiving module 100, a semantic segmentation processing module 200, and a multi-modal live-action image generation module 300, wherein:
the initial live-action image receiving module 100 is configured to receive an initial live-action image captured by a capture device;
the semantic segmentation processing module 200 is configured to perform semantic segmentation on the initial live-action image to generate a semantic segmentation image corresponding to it;
and the multi-modal live-action image generation module 300 is configured to input the semantic segmentation image into a multi-modal conditional GAN and output a plurality of live-action images corresponding to it, wherein the live-action images differ from one another in content modality, the multi-modal conditional GAN is trained with a difference index determined by a preset regulation function, and the preset regulation function is determined from the live-action images of the different content modalities and the generation parameters corresponding to those modalities.
In one embodiment, the apparatus may further include:
a normalization processing module, configured to perform image size normalization on the initial live-action image before the semantic segmentation processing module 200 performs the semantic segmentation, to obtain a size-normalized initial live-action image.
In this embodiment, the semantic segmentation processing module 200 is configured to perform semantic segmentation on the size-normalized initial live-action image and generate a semantic segmentation image corresponding to it.
In one embodiment, the multi-modal live-action image generation module 300 may include:
a feature extraction submodule, configured to perform feature extraction on the semantic segmentation image with an encoder in the multi-modal conditional GAN to generate a corresponding feature map;
and a decoding conversion submodule, configured to decode and convert the feature map with a generator in the multi-modal conditional GAN to obtain the plurality of live-action images corresponding to the semantic segmentation image.
In one embodiment, the decoding conversion submodule is configured to configure a plurality of different generation parameters in the generator and to generate the plurality of live-action images from the generation parameters and the feature map.
In one embodiment, the apparatus may further include:
a training module, configured to train the multi-modal conditional GAN.
In this embodiment, the training module may include:
a training set image acquisition submodule, configured to acquire training set images;
a predicted image generation submodule, configured to input the training set images into a constructed initial multi-modal conditional GAN and generate corresponding predicted live-action images based on each of the determined generation parameters;
an image set generation submodule, configured to generate, from each predicted live-action image and the training set images, an image set corresponding to that predicted image;
a discrimination submodule, configured to input each image set into a discriminator for authenticity discrimination and output the corresponding discrimination results;
and an iterative training submodule, configured to adjust the network parameters of the initial multi-modal conditional GAN based on the discrimination results and to iteratively train the adjusted network for a preset number of iterations, obtaining the trained multi-modal conditional GAN.
In one embodiment, the apparatus may further include:
a storage module, configured to store, after the iterative training submodule has iteratively trained the adjusted initial multi-modal conditional GAN for the preset number of iterations, the initial multi-modal conditional GAN of each training iteration together with its training index value.
In this embodiment, the iterative training submodule may determine the initial multi-modal conditional GAN corresponding to the highest training index value as the trained multi-modal conditional GAN.
In one embodiment, the iterative training submodule may include:
a first adjustment unit, configured to calculate a loss value of the initial multi-modal conditional GAN from the discrimination results and to make a first adjustment to its network parameters based on the loss value, obtaining a first-adjusted initial multi-modal conditional GAN;
a second adjustment unit, configured to determine, from the predicted live-action images whose discrimination results are true and their corresponding generation parameters, a difference index between the predicted live-action images based on the preset regulation function, and to make a second adjustment to the network parameters based on the difference index, obtaining a second-adjusted initial multi-modal conditional GAN;
and an iteration unit, configured to iteratively train the first-adjusted and second-adjusted initial multi-modal conditional GAN for the preset number of iterations, obtaining the trained multi-modal conditional GAN.
For specific limitations of the semantic segmentation image conversion apparatus, reference may be made to the limitations of the semantic segmentation image conversion method above, which are not repeated here. Each module in the apparatus may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server whose internal structure is shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor provides computing and control capability. The memory includes a non-volatile storage medium and internal memory; the non-volatile storage medium stores an operating system, a computer program, and a database, while the internal memory provides an environment for running the operating system and the computer program. The database stores data such as semantic segmentation images and generated live-action images. The network interface communicates with external terminals over a network. The computer program, when executed by the processor, implements the semantic segmentation image conversion method.
Those skilled in the art will appreciate that the structure shown in fig. 7 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In one embodiment, there is provided a computer device comprising a memory storing a computer program and a processor implementing the following steps when the processor executes the computer program: receiving an initial live-action image acquired by acquisition equipment; performing semantic segmentation processing on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image; the semantic segmentation images are input into a multi-mode condition generation countermeasure network, a plurality of live-action images corresponding to the semantic segmentation images are output, content modalities of the live-action images are different, the multi-mode condition generation countermeasure network is generated by training of difference indexes determined based on preset regulation and control functions, and the preset regulation and control functions are determined according to the live-action images of different content modalities and generation parameters of the live-action images corresponding to different content modalities.
In one embodiment, the processor, when executing the computer program, implements semantic segmentation processing on the initial live-action image, and before generating the semantic segmented image, may further implement the following steps: and carrying out image size normalization processing on the initial live-action image to obtain the initial live-action image after the image size normalization processing.
In this embodiment, the performing, by the processor, the semantic segmentation processing on the initial live-action image when the computer program is executed to generate the semantic segmented image corresponding to the initial live-action image may include: and performing semantic segmentation processing on the initial live-action image after the image size normalization processing to generate a semantic segmentation image corresponding to the initial live-action image after the image size normalization processing.
In one embodiment, inputting the semantic segmentation image into the multi-modal conditional generation countermeasure network and outputting the plurality of live-action images corresponding to the semantic segmentation image when the processor executes the computer program may include: performing feature extraction on the semantic segmentation image through an encoder in the multi-modal conditional generation countermeasure network to generate a feature map corresponding to the semantic segmentation image; and decoding and converting the feature map through a generator in the multi-modal conditional generation countermeasure network to obtain the plurality of live-action images corresponding to the semantic segmentation image.
In one embodiment, decoding and converting the feature map through the generator in the multi-modal conditional generation countermeasure network to obtain the plurality of live-action images corresponding to the semantic segmentation image when the processor executes the computer program may include: configuring a plurality of different generation parameters through the generator, and generating the plurality of live-action images corresponding to the semantic segmentation image according to the generation parameters and the feature map.
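One plausible realization, sketched below, broadcasts each generation parameter over the encoded feature map so that the same segmentation can be decoded into several content modalities; the layer counts and channel widths are assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Downsamples the semantic segmentation image into a feature map.
    def __init__(self, in_ch=1, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch * 2, 4, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, seg):
        return self.net(seg)

class Generator(nn.Module):
    # Decodes the feature map back to a live-action image; the generation
    # parameter z is broadcast and concatenated channel-wise, so different
    # z values yield different content modalities from the same feature map.
    def __init__(self, feat_ch=128, z_dim=8, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(feat_ch + z_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, feat, z):
        zmap = z.view(z.size(0), -1, 1, 1).expand(-1, -1, feat.size(2), feat.size(3))
        return self.net(torch.cat([feat, zmap], dim=1))
```

Calling the generator repeatedly with different sampled `z` vectors then yields the plurality of live-action images described above.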
In one embodiment, generating the multi-modal conditional generation countermeasure network when the processor executes the computer program may include: acquiring a training set image; inputting the training set image into a constructed initial multi-modal conditional generation countermeasure network, and generating corresponding predicted live-action images based on the determined generation parameters; generating image sets corresponding to the predicted live-action images according to the predicted live-action images and the training set image; inputting each image set into a discriminator for authenticity discrimination, and outputting corresponding discrimination results; and adjusting the network parameters of the initial multi-modal conditional generation countermeasure network based on the discrimination results, and iteratively training the parameter-adjusted network according to the preset number of iterations to obtain the trained multi-modal conditional generation countermeasure network.
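A single training step consistent with this procedure could be sketched as follows; a standard non-saturating GAN loss and a conditional discriminator interface `disc(seg, img)` are assumptions, since the embodiment specifies neither.

```python
import torch
import torch.nn.functional as F

def train_step(enc, gen, disc, opt_g, opt_d, seg, real, num_params=2, z_dim=8):
    # One predicted live-action image per determined generation parameter.
    zs = [torch.randn(seg.size(0), z_dim) for _ in range(num_params)]
    fakes = [gen(enc(seg), z) for z in zs]

    # Discriminator: authenticity discrimination over each image set, i.e. the
    # condition paired with the training set image (real) or a prediction (fake).
    real_logit = disc(seg, real)
    d_loss = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
    for f in fakes:
        fake_logit = disc(seg, f.detach())
        d_loss = d_loss + F.binary_cross_entropy_with_logits(
            fake_logit, torch.zeros_like(fake_logit))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator and encoder: adjust network parameters so the predictions are
    # discriminated as real (opt_g is assumed to cover both enc and gen).
    g_loss = torch.zeros(())
    for f in fakes:
        logit = disc(seg, f)
        g_loss = g_loss + F.binary_cross_entropy_with_logits(logit, torch.ones_like(logit))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```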
In one embodiment, after the processor executes the computer program to iteratively train the parameter-adjusted initial multi-modal conditional generation countermeasure network according to the preset number of iterations, the following step may further be implemented: storing, for each iteration of training, the initial multi-modal conditional generation countermeasure network together with its corresponding training index value.
In this embodiment, obtaining the trained multi-modal conditional generation countermeasure network when the processor executes the computer program may include: determining the initial multi-modal conditional generation countermeasure network corresponding to the highest training index value as the trained multi-modal conditional generation countermeasure network.
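In code, this checkpoint selection can be as simple as the sketch below; the training index itself (for example a validation realism score) is not pinned down by the embodiment, so `score_fn` is a placeholder.

```python
import copy

def train_and_select(model, run_iteration, score_fn, num_iters):
    # Store each iteration's network parameters together with the training
    # index value, then keep the network whose index value is highest.
    checkpoints = []
    for _ in range(num_iters):  # the preset number of iterations
        run_iteration(model)
        checkpoints.append((score_fn(model), copy.deepcopy(model.state_dict())))
    best_score, best_state = max(checkpoints, key=lambda c: c[0])
    model.load_state_dict(best_state)
    return model, best_score
```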
In one embodiment, adjusting the network parameters of the initial multi-modal conditional generation countermeasure network based on the discrimination results, and iteratively training the parameter-adjusted network according to the preset number of iterations to obtain the trained multi-modal conditional generation countermeasure network, when the processor executes the computer program, may include: calculating a loss value of the initial multi-modal conditional generation countermeasure network from the discrimination results, and performing a first adjustment of the network parameters based on the loss value to obtain a first-adjusted initial multi-modal conditional generation countermeasure network; determining, from the predicted live-action images whose discrimination results are true and their corresponding generation parameters, a difference index between the predicted live-action images based on the preset regulation function, and performing a second adjustment of the network parameters based on the difference index to obtain a second-adjusted initial multi-modal conditional generation countermeasure network; and iteratively training the first- and second-adjusted initial multi-modal conditional generation countermeasure network according to the preset number of iterations to obtain the trained multi-modal conditional generation countermeasure network.
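The second adjustment reads like a diversity regularizer in the spirit of mode-seeking GANs (Mao et al., 2019), where the difference index grows as two predictions made from different generation parameters diverge. The exact preset regulation function is not disclosed, so the ratio form below is an assumption.

```python
import torch

def difference_index(img_a, img_b, z_a, z_b, eps=1e-5):
    # Assumed regulation function: distance between two predicted live-action
    # images divided by the distance between their generation parameters.
    num = torch.mean(torch.abs(img_a - img_b))
    den = torch.mean(torch.abs(z_a - z_b))
    return num / (den + eps)

# Second adjustment: for predictions whose discrimination results are true,
# minimizing -difference_index(...) pushes distinct generation parameters
# toward visibly distinct content modalities.
```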
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the following steps: receiving an initial live-action image acquired by an acquisition device; performing semantic segmentation processing on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image; and inputting the semantic segmentation image into the multi-modal conditional generation countermeasure network and outputting a plurality of live-action images corresponding to the semantic segmentation image, wherein the content modalities of the live-action images differ from one another, the multi-modal conditional generation countermeasure network is trained with a difference index determined based on a preset regulation function, and the preset regulation function is determined according to the live-action images of different content modalities and the generation parameters of the live-action images corresponding to the different content modalities.
In one embodiment, before performing semantic segmentation processing on the initial live-action image to generate the semantic segmentation image, the computer program, when executed by the processor, may further implement the following step: performing image size normalization processing on the initial live-action image to obtain a size-normalized initial live-action image.

In this embodiment, performing semantic segmentation processing on the initial live-action image to generate the semantic segmentation image corresponding to the initial live-action image when the computer program is executed by the processor may include: performing semantic segmentation processing on the size-normalized initial live-action image to generate a semantic segmentation image corresponding to the size-normalized initial live-action image.

In one embodiment, inputting the semantic segmentation image into the multi-modal conditional generation countermeasure network and outputting the plurality of live-action images corresponding to the semantic segmentation image when the computer program is executed by the processor may include: performing feature extraction on the semantic segmentation image through an encoder in the multi-modal conditional generation countermeasure network to generate a feature map corresponding to the semantic segmentation image; and decoding and converting the feature map through a generator in the multi-modal conditional generation countermeasure network to obtain the plurality of live-action images corresponding to the semantic segmentation image.

In one embodiment, decoding and converting the feature map through the generator in the multi-modal conditional generation countermeasure network to obtain the plurality of live-action images corresponding to the semantic segmentation image when the computer program is executed by the processor may include: configuring a plurality of different generation parameters through the generator, and generating the plurality of live-action images corresponding to the semantic segmentation image according to the generation parameters and the feature map.

In one embodiment, generating the multi-modal conditional generation countermeasure network when the computer program is executed by the processor may include: acquiring a training set image; inputting the training set image into a constructed initial multi-modal conditional generation countermeasure network, and generating corresponding predicted live-action images based on the determined generation parameters; generating image sets corresponding to the predicted live-action images according to the predicted live-action images and the training set image; inputting each image set into a discriminator for authenticity discrimination, and outputting corresponding discrimination results; and adjusting the network parameters of the initial multi-modal conditional generation countermeasure network based on the discrimination results, and iteratively training the parameter-adjusted network according to the preset number of iterations to obtain the trained multi-modal conditional generation countermeasure network.

In one embodiment, after the computer program is executed by the processor to iteratively train the parameter-adjusted initial multi-modal conditional generation countermeasure network according to the preset number of iterations, the following step may further be implemented: storing, for each iteration of training, the initial multi-modal conditional generation countermeasure network together with its corresponding training index value.

In this embodiment, obtaining the trained multi-modal conditional generation countermeasure network when the computer program is executed by the processor may include: determining the initial multi-modal conditional generation countermeasure network corresponding to the highest training index value as the trained multi-modal conditional generation countermeasure network.

In one embodiment, adjusting the network parameters of the initial multi-modal conditional generation countermeasure network based on the discrimination results, and iteratively training the parameter-adjusted network according to the preset number of iterations to obtain the trained multi-modal conditional generation countermeasure network, when the computer program is executed by the processor, may include: calculating a loss value of the initial multi-modal conditional generation countermeasure network from the discrimination results, and performing a first adjustment of the network parameters based on the loss value to obtain a first-adjusted initial multi-modal conditional generation countermeasure network; determining, from the predicted live-action images whose discrimination results are true and their corresponding generation parameters, a difference index between the predicted live-action images based on the preset regulation function, and performing a second adjustment of the network parameters based on the difference index to obtain a second-adjusted initial multi-modal conditional generation countermeasure network; and iteratively training the first- and second-adjusted initial multi-modal conditional generation countermeasure network according to the preset number of iterations to obtain the trained multi-modal conditional generation countermeasure network.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware; the program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any combination that contains no contradiction should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application, and while their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A semantic segmentation image conversion method, characterized in that the method comprises:
receiving an initial live-action image acquired by an acquisition device;
performing semantic segmentation processing on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image;
inputting the semantic segmentation image into a multi-modal conditional generation countermeasure network, and outputting a plurality of live-action images corresponding to the semantic segmentation image, wherein the content modalities of the live-action images in the plurality of live-action images differ from one another, the multi-modal conditional generation countermeasure network is trained with a difference index determined based on a preset regulation function, and the preset regulation function is determined according to the live-action images of different content modalities and the generation parameters of the live-action images corresponding to the different content modalities.
2. The method according to claim 1, wherein before performing semantic segmentation processing on the initial live-action image to generate the semantic segmentation image, the method further comprises:
performing image size normalization processing on the initial live-action image to obtain a size-normalized initial live-action image;
and wherein performing semantic segmentation processing on the initial live-action image to generate the semantic segmentation image corresponding to the initial live-action image comprises:
performing semantic segmentation processing on the size-normalized initial live-action image to generate a semantic segmentation image corresponding to the size-normalized initial live-action image.
3. The method of claim 1, wherein inputting the semantic segmentation image into the multi-modal conditional generation countermeasure network and outputting the plurality of live-action images corresponding to the semantic segmentation image comprises:
performing feature extraction on the semantic segmentation image through an encoder in a multi-modal conditional generation countermeasure network to generate a feature map corresponding to the semantic segmentation image;
and decoding and converting the feature map through a generator in the multi-modal conditional generation countermeasure network to obtain the plurality of live-action images corresponding to the semantic segmentation image.
4. The method according to claim 3, wherein decoding and converting the feature map through the generator in the multi-modal conditional generation countermeasure network to obtain the plurality of live-action images corresponding to the semantic segmentation image comprises:
configuring a plurality of different generation parameters through the generator, and generating the plurality of live-action images corresponding to the semantic segmentation image according to the generation parameters and the feature map.
5. The method of claim 1, wherein generating the multi-modal conditional generation countermeasure network comprises:
acquiring a training set image;
inputting the training set image into a constructed initial multi-modal conditional generation countermeasure network, and generating corresponding predicted live-action images based on the determined generation parameters;
generating image sets corresponding to the predicted live-action images according to the predicted live-action images and the training set image;
inputting each image set into a discriminator for authenticity discrimination, and outputting corresponding discrimination results;
and adjusting the network parameters of the initial multi-modal conditional generation countermeasure network based on each discrimination result, and iteratively training the parameter-adjusted initial multi-modal conditional generation countermeasure network according to a preset number of iterations to obtain a trained multi-modal conditional generation countermeasure network.
6. The method of claim 5, wherein after iteratively training the parameter-adjusted initial multi-modal conditional generation countermeasure network according to the preset number of iterations, the method further comprises:
storing, for each iteration of training, the initial multi-modal conditional generation countermeasure network and its corresponding training index value;
and wherein obtaining the trained multi-modal conditional generation countermeasure network comprises:
determining the initial multi-modal conditional generation countermeasure network corresponding to the highest training index value as the trained multi-modal conditional generation countermeasure network.
7. The method of claim 5, wherein adjusting the network parameters of the initial multi-modal conditional generation countermeasure network based on each discrimination result, and iteratively training the parameter-adjusted initial multi-modal conditional generation countermeasure network according to the preset number of iterations to obtain the trained multi-modal conditional generation countermeasure network, comprises:
calculating a loss value of the initial multi-modal conditional generation countermeasure network according to each discrimination result, and performing a first adjustment of the network parameters of the initial multi-modal conditional generation countermeasure network based on the loss value to obtain a first-adjusted initial multi-modal conditional generation countermeasure network;
determining a difference index between the predicted live-action images based on the preset regulation function, according to the predicted live-action images whose discrimination results are true and their corresponding generation parameters, and performing a second adjustment of the network parameters of the initial multi-modal conditional generation countermeasure network based on the difference index to obtain a second-adjusted initial multi-modal conditional generation countermeasure network;
and iteratively training the first- and second-adjusted initial multi-modal conditional generation countermeasure network according to the preset number of iterations to obtain the trained multi-modal conditional generation countermeasure network.
8. A semantic segmentation image conversion apparatus, characterized in that the apparatus comprises:
the initial live-action image receiving module is used for receiving an initial live-action image acquired by an acquisition device;
the semantic segmentation processing module is used for performing semantic segmentation processing on the initial live-action image to generate a semantic segmentation image corresponding to the initial live-action image;
the multi-modal live-action image generation module is used for inputting the semantic segmentation image into a multi-modal conditional generation countermeasure network and outputting a plurality of live-action images corresponding to the semantic segmentation image, wherein the content modalities of the live-action images in the plurality of live-action images differ from one another, the multi-modal conditional generation countermeasure network is trained with a difference index determined based on a preset regulation function, and the preset regulation function is determined according to the live-action images of different content modalities and the generation parameters of the live-action images corresponding to the different content modalities.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011321375.XA 2020-11-23 2020-11-23 Semantic segmentation image conversion method and device, computer equipment and storage medium Pending CN112614199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011321375.XA CN112614199A (en) 2020-11-23 2020-11-23 Semantic segmentation image conversion method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112614199A true CN112614199A (en) 2021-04-06


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113706645A (en) * 2021-06-30 2021-11-26 酷栈(宁波)创意科技有限公司 Information processing method for landscape painting



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination