WO2023030182A1 - Image generation method and apparatus - Google Patents

Image generation method and apparatus

Info

Publication number
WO2023030182A1
Authority
WO
WIPO (PCT)
Prior art keywords: image, semantic, target image, generated, target
Application number
PCT/CN2022/115028
Other languages: French (fr), Chinese (zh)
Inventors: 蒋敏 (Jiang Min), 蒋子平 (Jiang Ziping), 王云鹏 (Wang Yunpeng)
Original Assignees: 华为技术有限公司 (Huawei Technologies Co., Ltd.), 兰卡斯特大学 (Lancaster University)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.) and 兰卡斯特大学 (Lancaster University)
Publication of WO2023030182A1

Classifications

    • G06T5/70
    • G06N3/045 Combinations of networks (neural network architectures, e.g. interconnection topology)
    • G06N3/08 Learning methods (neural networks)
    • G06T11/001 2D [Two Dimensional] image generation: Texturing; Colouring; Generation of texture or colour
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Definitions

  • The present application relates to the field of computer technology, and in particular to an image generation method and apparatus.
  • Image generation technology is increasingly used in computer vision and finds wide application in industrial automation, biomedicine, the automotive field and elsewhere, for example in autonomous driving, intelligent detection and video surveillance.
  • image data can be processed effectively through the powerful computing capability of the deep neural network model.
  • A discriminative model can predict unknown properties of a given known input, that is, identify the category of a given picture and detect all semantic objects present in the picture.
  • A generative model can model the distribution of the data, describing the observable data set so that new data with the same distribution can be generated. Combining current discriminative and generative models makes it possible to convert between images of different categories.
  • The present application provides an image generation method and device, which can automatically generate a target image set with stable quality and diversity from the multi-semantic-target images contained in a small number of sample images.
  • In a first aspect, an image generation method is provided, comprising: determining a first semantic target image of a source image, where the first semantic target image is at least one of the semantic target images contained in the source image; and determining a target image according to the first semantic target image and a first prior distribution.
  • The first prior distribution is a prior distribution of noise obtained from a first background image and the semantic target information to be generated.
  • The target image is an image that includes the semantic target information to be generated; the semantic target information to be generated includes a label of the semantic target image to be generated and indication information for that image, and the first background image is the background image of the source image.
  • Because the prior distribution is generated from the background image of the source image and the information of the semantic target image to be generated, random sampling from this prior distribution can generate the semantic target image to be generated within the first background image, so that the target images are diverse in distribution and consistent with the texture of the whole image.
  • In one implementation, determining the target image according to the first semantic target image and the first prior distribution includes: determining the first background image according to the first semantic target image; generating the first prior distribution according to the first background image and the semantic target information to be generated; and generating the target image according to the first prior distribution and the first background image.
  • The generation module determines the region of the first background image according to the first semantic target image and then determines the features of the first background image; the first prior distribution is generated from those features and the semantic target to be generated. This ensures that the generated image conforms to the whole-image texture and avoids image distortion, and basing the prior distribution of the semantic target to be generated on the features of the first background image makes subsequent sampling diverse, thereby ensuring the diversity of the target images.
  • In one implementation, generating the target image according to the first prior distribution and the first background image includes: generating the noise of the semantic target image to be generated according to the first prior distribution, and generating the target image according to that noise and the first background image.
  • Noise is sampled from the first prior distribution, and the target image is then formed by combining it with the first background image.
  • the noise sampling is random, thereby ensuring the diversity of the target image.
  • In one implementation, determining the first background image according to the first semantic target image includes: smoothing the first semantic target image, and determining the first background image according to the smoothed first semantic target image.
  • The semantic target image to be replaced is smoothed to ensure that the area of the semantic target image to be generated can completely cover the replaced one, so that the generated target image conforms to the whole-image texture and semantic target distortion is avoided.
  • the method further includes: removing the smoothed first semantic target image from the source image.
  • deleting the first semantic target image from the source image can enable the semantic target image to be generated to be better integrated with the first background image, which is beneficial to the consistency of the texture of the whole image.
  • In one implementation, the method further includes: using the target image as an input image and using an image discriminator to be trained to identify its authenticity; adjusting, according to the output of the image discriminator and the input image, the network parameter values of the image generator that is used to generate the target image; and using the target image generated by the adjusted image generator as the input image, repeating the identification step until the training process converges.
  • The generator and the discriminator are trained in this way so that the images produced by the generator tend toward the real distribution and the generated target images are more realistic and natural.
  • In one implementation, using the image discriminator to be trained to identify the authenticity of the target image includes: the image discriminator to be trained discriminating the images in the different regions of the target image, and, when calculating the loss function, weighting those regions according to the semantic target information to be generated.
  • The discriminator performs region-wise recognition on the target image, using a convolutional network to obtain a two-dimensional region discrimination result that represents the probability that the image in each region is real.
  • The present application weights the region discrimination results to obtain the loss function, so that the model pays more attention to the generated region, i.e. focuses discrimination on the region of the semantic target image to be generated. This improves the discriminative ability of the discriminator, so that the target images produced by the generator are more realistic and natural.
  • In one implementation, the method further includes: the image discriminator to be trained dividing the target image into regions.
  • In one implementation, the method further includes: the image discriminator to be trained smoothing the semantic target image to be generated that is contained in the target image, and then judging the authenticity of the target image containing the smoothed semantic target image.
  • The discriminator smooths the replaced semantic target image to ensure that the area of the semantic target image can completely cover the replaced one, so that the generated target image conforms to the whole-image texture and semantic target distortion is avoided.
  • the method further includes: marking the semantic target of the target image according to the semantic target information to be generated and the information of the first semantic target image.
  • the generated target image can be automatically marked based on the semantic target label in the semantic target information to be generated, which avoids the workload of manual labeling and effectively improves the efficiency of training data set preparation.
  • In one implementation, the image discriminator to be trained includes an image detector, and the image detector is used to perform feature extraction on the first semantic target image contained in the source image.
  • The detector is coupled with the discriminator.
  • Feature extraction can thus be shared between the discriminator and the detector, which greatly reduces the computational load of the model and effectively improves image generation efficiency.
  • In a second aspect, an image generation device is provided, comprising units for executing the method of the first aspect or any of its implementations.
  • In a third aspect, an image generation device is provided, including a processor and a memory; the memory is used to store a computer program, and the processor is used to call and run the computer program from the memory, so that the device executes the image generation method of the first aspect or any of its possible implementations.
  • There may be one or more processors and one or more memories.
  • The memory may be integrated with the processor, or set separately from the processor.
  • A computer-readable storage medium is also provided, which stores program code for execution by a device, the program code including instructions for executing the method of the first or second aspect.
  • A computer program product including instructions is also provided; when the computer program product runs on a computer, the computer executes the method in any one of the implementations of the foregoing aspects.
  • A chip is also provided, including a processor and a data interface; the processor reads, through the data interface, instructions stored in a memory, and executes the method in any one of the foregoing aspects.
  • Optionally, the chip may further include a memory storing instructions; the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor executes the method in any one of the implementations of the foregoing aspects.
  • the aforementioned chip may specifically be a field-programmable gate array (field-programmable gate array, FPGA) or an application-specific integrated circuit (application-specific integrated circuit, ASIC).
  • FIG. 1 shows a schematic structural diagram of a system architecture provided by an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a convolutional neural network provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a product realization form provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of an image generation method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a discriminator structure and a detector structure for shared feature extraction provided by an embodiment of the present application
  • Fig. 6 is a schematic block diagram of an image generating device provided by an embodiment of the present application.
  • A neural network can be composed of neural units. A neural unit can be an operation unit that takes inputs $x_s$ and an intercept of 1, and whose output can be:

    $$h_{W,b}(x) = f\Big(\sum_{s} W_s x_s + b\Big)$$

  • where $W_s$ is the weight of $x_s$ and $b$ is the bias of the neural unit.
  • $f$ is the activation function of the neural unit, used to apply a nonlinear transformation to the features in the neural network, converting the input signal of the neural unit into an output signal.
  • The output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting multiple above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
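  • As an illustration only (not part of the patent text), a single neural unit with a sigmoid activation can be sketched in Python as follows; the helper name `neural_unit` and the sample values are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b):
    # Output of one neural unit: f(sum_s W_s * x_s + b),
    # with f chosen here as the sigmoid activation.
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x_s
W = np.array([0.1, 0.4, -0.2])   # weights W_s
print(neural_unit(x, W, b=0.3))  # scalar output of the unit
```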
  • A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • According to the positions of the different layers, the layers of a DNN can be divided into three categories: input layer, hidden layers and output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in the middle are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although a DNN looks complicated, the work of each layer is actually not complicated. Each layer simply computes the expression

    $$\vec{y} = \alpha(W\vec{x} + \vec{b})$$

    where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset (bias) vector, $W$ is the weight matrix (also called the coefficients), and $\alpha(\cdot)$ is the activation function.
  • Each layer just performs this simple operation on its input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the number of coefficient matrices $W$ and offset vectors $\vec{b}$ is correspondingly large.
  • These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example: in a three-layer DNN, the linear coefficient from the fourth neuron of the second layer to the second neuron of the third layer is defined as $W^3_{24}$, where the superscript 3 is the layer number of the coefficient $W$ and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
  • In general, the coefficient from the $k$-th neuron of layer $L-1$ to the $j$-th neuron of layer $L$ is defined as $W^L_{jk}$.
  • the input layer has no W parameter.
  • More hidden layers give the network a greater ability to describe complex real-world situations. Theoretically, a model with more parameters has higher complexity and greater "capacity", which means it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
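  • A minimal sketch of the layer-by-layer computation $\vec{y} = \alpha(W\vec{x} + \vec{b})$ described above (illustrative only; the layer sizes and tanh activation are arbitrary choices, not from the patent):

```python
import numpy as np

def layer(x, W, b, act=np.tanh):
    # One DNN layer: y = act(W @ x + b).
    return act(W @ x + b)

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]  # input layer, two hidden layers, output layer
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal(sizes[0])
for W, b in params:   # forward pass; W[j, k] plays the role of W^L_jk
    x = layer(x, W, b)
print(x)
```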
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a subsampling layer, which can be regarded as a filter.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • In the convolutional layer, a neuron may be connected to only some of the neurons of the adjacent layers.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as a way to extract image information that is independent of location.
  • the convolution kernel can be formalized as a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
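  • The weight sharing described above can be illustrated with a minimal NumPy convolution (a sketch only; the `conv2d` helper and the edge kernel are illustrative, not from the patent):

```python
import numpy as np

def conv2d(image, kernel):
    # The same kernel (the shared weights) is applied at every image
    # location, so the same feature is extracted regardless of position.
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_kernel = np.array([[1., 0., -1.],
                        [2., 0., -2.],
                        [1., 0., -1.]])  # e.g. extracts vertical edge information
feature_map = conv2d(np.random.rand(8, 8), edge_kernel)
print(feature_map.shape)  # (6, 6)
```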
  • Image features can be understood as numerical features transformed from raw data through feature extraction operations, which are convenient for algorithm understanding and processing.
  • the image features specifically refer to the image features extracted using the backbone network model.
  • the discriminative class model can predict its unknown attributes for a given known input, that is, identify the category to which a given picture belongs, and detect all semantic objects existing in the picture.
  • the discriminative model is, for example, an object classification model, an object detection model.
  • Among discriminative models, the first thing to realize is the classification of a single semantic target image, that is, the target classification model.
  • the model uses a convolutional layer to extract the features of the local area of the image, and inputs the obtained features to the fully connected layer for classification.
  • the target detection model uses a similar backbone network to extract features from images containing multiple semantic targets, and then locates and recognizes the semantic targets in the original image according to the feature map output by the backbone network.
  • the target detection model can be divided into a single-step model and a two-step model.
  • The single-step methods, represented by the YOLO neural network (you only look once, YOLO) and the SSD detector (single shot multibox detector, SSD), take the overall feature map as input and use regression to compute, for each grid point, the probability that it contains a semantic target.
  • The two-step methods, represented by Faster Region-Based Convolutional Neural Networks (Faster RCNN) and Mask Region-Based Convolutional Neural Networks (Mask RCNN), introduce a region proposal network: first, the feature information is used to identify whether preset candidate boxes contain a semantic target, and then the regions with a high probability of containing one are input to the classifier for target detection.
  • Take the generative adversarial model as an example of a generative model.
  • The model consists of a generator and a discriminator, where the generator aims to learn the mapping from a noise distribution to the target data distribution, and the discriminator is used to judge whether the data it receives is real.
  • Through adversarial training, the discriminator's ability to distinguish real from fake improves, the pictures generated by the generator come closer to the real distribution, and the two finally reach a balance.
  • The training of a generative adversarial network can be understood as the following optimization problem:

    $$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
  • Generative adversarial models can perform well in generating small single-category images, but they have difficulty handling high-definition images, and a single such model cannot generate images of different categories.
  • the conditional generative adversarial model introduces control conditions on the basis of the original model to achieve a controllable effect on the generation of semantic target categories, so that one model can be used to generate pictures of different categories.
  • the model represents the category of the semantic target to be generated as a one-dimensional vector as a condition vector.
  • In the generator, an embedding layer is used to encode the noise and the condition before they are sent into the generation network; in the discriminator, the features obtained by the backbone network are encoded together with the condition to judge whether the input picture is a real picture of the target category.
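  • The adversarial training described above, with the category condition added in the cGAN manner, might look as follows in PyTorch (a toy sketch under assumed shapes and hyperparameters; the class names `G` and `D` and all sizes are hypothetical, not the patent's model):

```python
import torch
import torch.nn as nn

Z, NUM_CLASSES, IMG = 64, 10, 28 * 28  # assumed noise/label/image sizes

class G(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(NUM_CLASSES, Z)   # condition embedding
        self.net = nn.Sequential(nn.Linear(2 * Z, 256), nn.ReLU(),
                                 nn.Linear(256, IMG), nn.Tanh())
    def forward(self, z, y):
        return self.net(torch.cat([z, self.emb(y)], dim=1))

class D(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(NUM_CLASSES, IMG)
        self.net = nn.Sequential(nn.Linear(2 * IMG, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1), nn.Sigmoid())
    def forward(self, x, y):
        return self.net(torch.cat([x, self.emb(y)], dim=1))

g, d = G(), D()
opt_g = torch.optim.Adam(g.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(d.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real, y):
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)
    # Discriminator: real images labeled 1, generated ones labeled 0.
    fake = g(torch.randn(b, Z), y).detach()
    loss_d = bce(d(real, y), ones) + bce(d(fake, y), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: try to make the discriminator output 1 for fakes.
    fake = g(torch.randn(b, Z), y)
    loss_g = bce(d(fake, y), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```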
  • the system architecture 100 includes a data set module 110 and a model training module 120 .
  • the data set module 110 is one of the foundations of model training, and a high-quality large-scale data set is the key to training a high-quality model.
  • The data acquisition device 111 is used to collect raw data; the data preprocessing device 112 can be used to screen, filter and label the raw data, where the labeling may be manual or automatic, which is not limited in this embodiment of the present application.
  • the data generation device 113 is used to generate new data according to the marked data. Among them, data generation is an important means to solve the problem of insufficient dataset size.
  • a data storage library 114 is also included for storing automatically generated data sets. This data set can be used in the training module for model training.
  • the model training module 120 includes a training device 121 , and the training device 121 can train the target model 122 based on the training data maintained in the database 130 .
  • How the training device 121 obtains the target model 122 based on the training data is described below.
  • The training device 121 processes the input data and compares the output data with the original input data until the difference between the output data and the original input data is smaller than a certain threshold, thus completing the training of the target model 122.
  • the above-mentioned target model 122 can be used to implement the image generation method of the embodiment of the present application, that is, the picture set automatically generated by the data generation device 113 is input into the target model 122, and the authenticity of the pictures included in the picture set can be judged.
  • the target model 122 in the embodiment of the present application may specifically be a cGAN network.
  • the training data maintained in the database 114 may not all be collected by the data collection device 111 , but may also be received from other devices.
  • The training device 121 does not necessarily perform target model training based entirely on the training data maintained in the database 114; it may also obtain training data from the cloud or elsewhere for model training, which is not limited in this embodiment of the present application.
  • The target model 122 trained by the training device 121 can be applied to different systems or devices, for example an execution device, which may be a terminal such as a mobile phone, a tablet computer, a notebook computer, an augmented reality (AR) or virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server, the cloud, or the like.
  • The above-mentioned training device 121 can generate corresponding target models 122 based on different training data for different goals or tasks, and the corresponding target models 122 can be used to achieve those goals or complete those tasks, thus providing the user with the desired result.
  • FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • CNN is a very common neural network. A convolutional neural network is a deep neural network with a convolutional structure and a deep learning architecture.
  • A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • As a deep learning architecture, CNN is a feed-forward artificial neural network in which the individual neurons can respond to the images input into it.
  • A convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230.
  • the convolutional neural network in Figure 2 can be applied to the image classification model structure.
  • the internal layer structure of the CNN 200 in FIG. 2 will be described in detail below.
  • The convolutional layer/pooling layer 220 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 a pooling layer, 223 a convolutional layer, 224 a pooling layer, 225 a convolutional layer and 226 a pooling layer; in another implementation, 221 and 222 are convolutional layers, 223 a pooling layer, 224 and 225 convolutional layers, and 226 a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • The convolutional layer 221 may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • A convolution operator can essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is typically moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to extract a specific feature from the image.
  • The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends across the entire depth of the input image.
  • Convolution with a single weight matrix therefore produces a convolutional output with a single depth dimension, but in most cases, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), i.e. multiple matrices of the same shape, are applied.
  • The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where this dimension can be understood as being determined by the "multiple" mentioned above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to filter unwanted noise in the image.
  • Because the multiple weight matrices have the same size (rows × columns), the feature maps extracted by them also have the same size; the extracted feature maps of the same size are then combined to form the output of the convolution operation.
  • In practical applications, the weight values in these weight matrices need to be obtained through extensive training; each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.
  • The initial convolutional layers (such as 221) often extract more general features, which may also be called low-level features; as the depth of the convolutional neural network 200 increases, the features extracted by the later convolutional layers (such as 226) become more and more complex, for example high-level semantic features, and features with higher semantics are more suitable for the problem to be solved.
  • A pooling layer may follow a convolutional layer: it may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the sole purpose of pooling layers is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling an input image to obtain an image of a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • The maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling. In addition, just as the size of the weight matrix in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size.
  • the size of the image output after being processed by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
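  • For illustration, non-overlapping maximum pooling can be written in a few lines of NumPy (a sketch; average pooling would replace `max` with `mean`):

```python
import numpy as np

def max_pool2d(img, k=2):
    # Each output pixel is the maximum of the corresponding k x k
    # sub-region of the input, so the spatial size shrinks by k.
    H, W = img.shape
    img = img[:H - H % k, :W - W % k]          # crop to a multiple of k
    return img.reshape(H // k, k, W // k, k).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))  # 2x2 result; use .mean(axis=(1, 3)) for average pooling
```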
  • After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 uses the fully connected layer 230 to generate one output or a group of outputs of the required number of classes. The fully connected layer 230 may therefore include multiple hidden layers (231, 232 to 23n as shown in FIG. 2) and an output layer 240; the parameters contained in the hidden layers may be pre-trained on training data related to the specific task type, where the task type may include, for example, image recognition, image classification and image super-resolution reconstruction.
  • the output layer 240 has a loss function similar to the classification cross entropy, and is specifically used to calculate the prediction error.
  • After the forward propagation of the whole convolutional neural network 200 is completed, backpropagation (propagation in the direction from 240 to 210 in FIG. 2) starts to update the weights and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, i.e. the error between the result output by the network through the output layer and the ideal result.
  • the convolutional neural network shown in FIG. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
  • FIG. 3 is a schematic diagram of a product realization form provided by an embodiment of the present application.
  • a product implementation form of the embodiment of the present application may include program codes that are included in the data set system and deployed on server hardware.
  • the program code of the present application may exist in the data generation module of the dataset system, such as the data generation device 113 in the system architecture 100 . This part of the code runs on the host storage (memory or disk) and is used to implement the innovative automatic data generation method.
  • FIG. 3 shows a product realization form of the present application. The product comprises a server 310, software 320 and hardware 330, where the hardware 330 comprises host memory or disk and is used to run the program code of the present application;
  • The software 320 comprises an image recognition device 321 and an image generation device 322. The image recognition device 321 is used to identify the authenticity of an input image; the image generation device 322 is used to generate diverse images of the target semantic information required by the user and to store the generated pictures to the host memory or disk, from which they are input to the image recognition device 321 for image recognition.
  • The image generation method and device provided in the embodiments of the present application can also be applied to other similar image editing and replacement tasks, for example replacing vehicles or other target objects under different road conditions in autonomous driving, replacing hairstyles, hair colors and headgear in video surveillance scenes, and replacing different product shapes, colors and layouts in industrial automation quality monitoring scenarios, and so on.
  • the image generation method of the embodiment of the present application can be applied in the scene of picture editing and replacement, and the embodiment of the present application takes the automatic driving scene as an example to introduce the image generation method of the embodiment of the present application in detail.
  • the execution subject of the method for generating a sample image provided in the embodiment of the present disclosure is generally a computer device with certain computing capabilities.
  • the computer equipment includes, for example: terminal equipment or server or other processing equipment, and the terminal equipment can be user equipment (User Equipment, UE), mobile equipment, user terminal, terminal, cellular phone, cordless phone, personal digital assistant (Personal Digital Assistant, PDA), handheld devices, computing devices, vehicle-mounted devices, wearable devices, etc.
  • the method for generating a sample image may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • FIG. 4 shows a schematic flowchart of an image generation method 400 provided by an embodiment of the present application.
  • the method can be applied in a dataset system, and the dataset system includes a detection module and a generation module.
  • the generation module includes a discriminator and a generator
  • the generator includes an encoding structure and a decoding structure.
  • The detection model and the generation model can be trained based on a generative adversarial network.
  • the method can be executed by the above execution device.
  • the method 400 may be processed by the CPU, or jointly processed by the CPU and the GPU, or other processors suitable for neural network calculation may be used instead of the GPU, which is not limited in this application.
  • the detection model first uses the labeled source image to train the picture recognizer.
  • A labeled source image is a source image in which the semantic targets included in the picture have been pre-labeled; the labeling may be manual or automatic, without limitation.
  • the purpose of training the image recognizer is to enable the image recognizer to recognize the semantic target corresponding to the annotation through the training.
  • the training method may adopt the training method in the prior art, or any method that can achieve the purpose of training the picture recognizer, which is not limited in this embodiment of the present application.
  • the trained detection model performs semantic target recognition on the source image, and then the generation model edits the identified semantic target according to predetermined conditions, realizing end-to-end automatic image generation and automatic labeling.
  • the source image is an example of an input original picture
  • the source image includes a plurality of input pictures
  • the first semantic target image is an example of a semantic target
  • The first background image is the background image obtained by processing the source image according to the semantic target information to be generated, which is defined according to the user's requirements.
  • a target image is an example of a picture generated by the generator.
  • The source image may also be called an original image, an input image, an original picture or other similar terms, and the target image may also be called a generated image, a generated picture or other similar terms, which is not limited in this embodiment of the present application.
  • The method 400 includes steps S410 to S440.
  • S410: the detection model determines a first semantic target image of the source image.
  • S420: the generation model determines a first background image according to the first semantic target image.
  • S430: the generation model generates a first prior distribution according to the first background image and the semantic target information to be generated.
  • S440: the generation model generates the target image according to the first prior distribution and the first background image.
  • Steps S410 to S440 will be described in detail below.
  • S410: the detection model determines a first semantic target image of the source image.
  • the source image is used as the input image of the dataset system, and the detection model of the dataset system includes a picture recognizer, which can recognize the first semantic target image of the source image.
  • The image generation method can be used in an autonomous driving scenario, so the source image can be an image collected by an autonomous vehicle, and the first semantic target image can be a vehicle image in the source image, for example a bus, a car or a bicycle.
  • The above-mentioned recognizer has been trained, so it can recognize the first semantic target image according to the annotation. The source image is the original input image and includes one or more images; each image includes one or more semantic targets, at least one of which is the first semantic target image.
  • The detection model labels the identified semantic target image, which specifically includes labeling the bounding box of the semantic target and the category of the semantic target; it records the position of the detected bounding box and enlarges it appropriately so that it includes part of the surrounding boundary region, so that the generated semantic target can fuse with the surrounding environment.
  • the source image is an image of an automatic driving scene
  • the source image contains multiple semantic objects, such as buses, roads, sky, plants, pedestrians, etc.
  • The first semantic target image can be a vehicle, for example a bus.
  • the detection model records the position of the first semantic target image and enlarges the bounding box, and at the same time identifies the category of the semantic target as a vehicle.
  • The bounding box of the first semantic target image can have any shape. "Any shape" can be understood as appropriately enlarging the detected area containing the first semantic target image to acquire part of the surrounding environment image, so that the region includes the complete first semantic target image and part of its surroundings; the shape may be, for example, a rectangle, circle or trapezoid, which is not limited in this embodiment of the present application.
  • S420: the generation model determines a first background image according to the first semantic target image.
  • The generation model takes the first semantic target image detected by the detection model, determines its position and annotation information, and edits the source image according to that annotation information to determine the first background image; feature extraction is then performed on the first background image to obtain the features of the first background image.
  • Smoothing is performed on the first semantic target image to enlarge its area, so that the semantic target to be generated can completely cover the area of the first semantic target image and the texture of the whole image remains consistent.
  • Alternatively, the area of the first semantic target image may be directly covered by the semantic target image to be generated.
  • the smoothed first semantic target image is removed from the source image to obtain a first background image.
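  • A minimal sketch of this smoothing-and-removal step (assuming binary dilation as the smoothing operator, which the text does not mandate; `remove_target` is a hypothetical helper):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def remove_target(source_img, target_mask, grow=7):
    # Enlarge (smooth) the detected target mask so the region to be
    # regenerated fully covers the old target, then blank it out to
    # obtain the first background image.
    smoothed = binary_dilation(target_mask, iterations=grow)
    background = source_img.copy()
    background[smoothed] = 0
    return background, smoothed

img = np.random.rand(64, 64, 3)
mask = np.zeros((64, 64), dtype=bool); mask[20:40, 25:45] = True
bg, m = remove_target(img, mask)
```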
  • The semantic target information to be generated is given semantic target information; for example, it can be replacing a bus with a truck, or removing a bus. Similarly, any requirement defined according to the user's needs can serve as the semantic target information.
  • the semantic target information to be generated may also be system-defined, which is not limited in this embodiment of the present application.
  • The semantic target information to be generated may include the label of the semantic target image to be generated and the indication information of the semantic target image to be generated. The semantic target here can be understood as the semantic target to be generated, which has a volume and a semantic category similar to those of the semantic target in the source image.
  • For example, if the semantic target label included in the semantic target information is "vehicle" and the semantic target volume is described as large, the label of the semantic target to be generated can be truck, bus, fire engine and so on.
  • If the semantic target category included in the semantic target information is "vehicle" and the semantic target volume is described as small, the semantic target to be generated may be a motorcycle, a tricycle and the like.
  • Processing the source image according to the semantic target information to be generated and the annotation information of the first semantic target image includes smoothing the first semantic target image. For example, if the indication information in the semantic target information to be generated is to replace a bus with a truck, the generation model determines the position and bounding box of the first semantic target in the source image according to its annotation information, and deletes or overwrites the image within the bounding box, thus obtaining the first background image. As another example, if the indication information is to remove the bus, the generation model determines the position and bounding box of the first semantic target, deletes the image within the bounding box, and fills it with surrounding environment color blocks; as an illustration, the bounding box may be filled with road-surface color blocks, so as to obtain the first background image.
  • S430: the generation model generates a first prior distribution according to the first background image and the semantic target information to be generated.
  • The generator includes two parts, an encoding structure and a decoding structure, and the encoder includes a prior condition encoding module.
  • The encoder down-samples the image through convolutional and pooling layers to extract the features of the first background image. The extracted features of the first background image and the semantic target information to be generated are then encoded and merged and input into the prior condition encoding module, which obtains the prior condition of the current source image and, according to that prior condition, generates the prior distribution of the noise of the semantic target to be generated.
  • The semantic target information to be generated is determined by user definition or system definition; as an illustration, the semantic target to be generated can be object B, object C, object D, object E and so on.
  • The first semantic target image may be object A, and the semantic target information to be generated may be understood as replacing object A in the source image with object B, or with object C or object D, and so on. Taking the replacement of object A with object B as an example, object A in the source image is identified first.
  • After the source image is input to the generator, the generator first smooths the area of the identified object A to obtain the first background image; the encoder then performs feature extraction on the first background image by sampling, encodes and merges the features of the first background image with the information of object B, and inputs the result into the prior condition encoding module to obtain the prior distribution for object B. It can be understood that if object A is instead to be replaced with object C, a prior distribution for object C needs to be generated.
  • The first background image is used as the input of the encoder; through the convolutional and pooling layers, the encoder obtains a vector and a matrix of fixed length, which are used as the prior conditions of a Gaussian distribution, and sampling is performed from this Gaussian distribution.
  • Because each sampling differs, different images, such as A1, A2 and A3, will be generated, so that diverse images of the object can be obtained.
  • the size of the distribution parameter depends on the size of the replaced image area, that is, the area size of the first semantic target image.
  • the noise obtained by Gaussian distribution sampling and the features of the first background image are combined and decoded.
  • For example, when the first background image is in a well-lit environment, the target images obtained by combined decoding may be A1, A2, A3, etc., i.e. images of the object under sufficient light; when the first background image is in a rainy-weather environment, the target images obtained by combined decoding may be A1', A2', A3', etc.
  • The environment information of the source image can be obtained by using the first background image as a prior condition. This environment information can be understood as the conditions of the surrounding environment other than the semantic target to be replaced; it allows the environment and texture of the generated area to be controlled so that they are consistent with the overall image.
  • Using the semantic target information to be generated as a condition, including the label of the semantic target to be generated, allows the semantic target type of the generated image to be controlled so that it is similar to or consistent with that of the source image.
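  • A toy version of the prior condition encoding module (a sketch only; the reparameterized Gaussian sampling shown is one common way to realize "sample from a Gaussian whose parameters come from the encoder", and all layer sizes and names are assumptions):

```python
import torch
import torch.nn as nn

class PriorEncoder(nn.Module):
    """Maps background features plus the condition (label embedding)
    to the mean/variance of a Gaussian prior, then samples from it."""
    def __init__(self, feat_dim=128, num_labels=10, z_dim=32):
        super().__init__()
        self.emb = nn.Embedding(num_labels, feat_dim)
        self.to_mu = nn.Linear(2 * feat_dim, z_dim)
        self.to_logvar = nn.Linear(2 * feat_dim, z_dim)

    def forward(self, bg_feat, label):
        h = torch.cat([bg_feat, self.emb(label)], dim=1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterized sample: every draw yields different noise,
        # which is what makes the generated targets diverse.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar

enc = PriorEncoder()
z, mu, logvar = enc(torch.randn(4, 128), torch.tensor([2, 2, 5, 7]))
```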
  • S440: the generation model generates the target image according to the first prior distribution and the first background image.
  • the generative model samples the noise from the above-mentioned first prior distribution, combines it with the first background image features extracted by the encoder, and inputs the result into the decoder to generate the target image.
  • An optional understanding is that the generation model samples noise from the above prior distribution. This noise can be understood as the features of the variation, based on the Gaussian distribution, of the position distribution of the semantic target image to be generated within the first background image. Encoding these variation features together with the features of the first background image yields a number of different features, and decoding these different features with the decoder yields different images.
  • the different images generated above are generated based on sampling noise, and the noise obtained by each sampling is different, so as to realize the generation of the distribution change images of different positions of the semantic target to be generated in the source image.
  • The above-mentioned distribution of the semantic target to be generated over different positions in the source image can be understood as changes of angle, position, layout and so on; the color of the semantic target to be generated can also be changed according to user definition.
  • the position of the semantic target image to be generated in the above-mentioned first background image has a more reasonable and diverse distribution, so that the generated sample data is more reasonable and abundant.
  • the generation model generates an annotation of the target image according to the semantic target information to be generated and the information of the first semantic target image, so as to realize automatic annotation of the semantic target.
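  • Putting steps S410-S440 together, the following toy Python sketch shows the data flow (all networks are replaced by trivial placeholders; every function name here is hypothetical and the "decode" step is a stand-in, not the patent's generator):

```python
import numpy as np

rng = np.random.default_rng(0)

def box_to_mask(shape, box):
    # Binary mask covering the (already enlarged) bounding box region.
    m = np.zeros(shape[:2])
    y0, x0, y1, x1 = box
    m[y0:y1, x0:x1] = 1.0
    return m

def generate_target_image(source_img, box, cond_vec):
    # S420: remove the detected target region -> first background image.
    mask = box_to_mask(source_img.shape, box)
    background = source_img * (1.0 - mask[..., None])
    # S430: "encode" the background and the condition into Gaussian prior
    # parameters (a real system would use the encoder network here).
    feat = background.reshape(-1)[:16]              # toy feature vector
    mu = feat[:8] + cond_vec
    sigma = np.abs(feat[8:16]) + 1e-3
    # S440: sample noise from the prior and "decode" a target image
    # (the decode step here is a trivial placeholder).
    z = mu + sigma * rng.standard_normal(8)
    target = background + mask[..., None] * z.mean()
    annotation = {"bbox": box, "label": "to-be-generated target"}
    return target, annotation

img = rng.random((64, 64, 3))
out, ann = generate_target_image(img, box=(16, 16, 40, 40), cond_vec=np.ones(8))
```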
  • FIG. 5 provides a schematic block diagram of the modules included according to an embodiment of the present application. As shown in FIG. 5, these include a detection module and a generation module.
  • the detection module includes a recognizer; the generation module includes a generator and a discriminator, and the generator includes an encoding structure and a decoding structure.
  • The detection model and the generation model can be trained based on a generative adversarial network.
  • A source image is used as input; the source image is pre-labeled, and the recognizer is trained using the labeled source image so that it can detect semantic targets.
  • the training method may adopt the training method in the prior art, or any method that can achieve the purpose of training the picture recognizer, which is not limited in this embodiment of the present application.
  • the trained recognizer can determine the first semantic target image of the source image, as described in step S410 of the method 400, and details are not repeated here.
  • The model strengthens its ability to recognize categories with a large amount of data and weakens its ability to recognize categories with a small amount of data; therefore, the selected bounding boxes have a higher probability of belonging to categories with a larger sample size.
  • The selected original semantic target will be replaced by another semantic target, while the semantic targets that could not be identified, together with their labels, are retained in the original image.
  • The source image can be processed according to the semantic target information to be generated to obtain the first background image, and the first background image can be input into the encoder to obtain its features; the source image and the target semantic category are also input into the encoder to obtain the prior conditions, from which the prior distribution of the noise is further generated. Noise features are then sampled from the prior distribution, encoded and merged with the features of the first background image, and the resulting features are input to the decoder for decoding to obtain a synthesized image.
  • The generator and the discriminator are connected in series with the parameters of the discriminator fixed, and the parameters of the generator are optimized with the goal of having the generated picture judged "real", so that the pictures generated by the generator can "deceive" the discriminator; that is, the pictures generated by the generator are as close as possible to real pictures.
  • The image synthesized by the generator is input to the discriminator together with a real picture, i.e. the source image, and the discriminator judges whether the input image is real based on the source image. The network parameter values of the generator are adjusted according to the discrimination result; the picture generated by the adjusted generator is then used as the input of the discriminator again, and these steps are repeated until the output is judged to be real, at which point the training end condition of the image generator to be trained and that of the image discriminator to be trained reach a balance.
  • the discriminator performs smoothing processing on the target semantic information included in the target image, and then performs authenticity discrimination on the target image including the smoothed target semantic information.
  • The discriminator judges the generated image against the real image. The generated image differs from the real image only within the bounding box of the first semantic target, with no change to the other semantic targets; therefore, the main discriminative region of the discriminator is the region within the bounding box of the first semantic target.
  • the discriminator divides the target image into regions, and performs authenticity discrimination for different regions.
  • The image discriminator to be trained weights the different regions according to the semantic target information to be generated when calculating the loss function.
  • When the discriminator recognizes a region with a high overlap rate with the semantic target image to be generated, it increases that region's weight when calculating the loss function, so that the discriminative model pays more attention to the generated semantic target image, thereby improving the effect of the model, that is, improving the authenticity discrimination in this area.
  • the discriminator to be trained enlarges the bounding box including target semantic information in the image, and performs authenticity discrimination on the image in the bounding box area.
  • The discrimination area of the discriminator is expanded beyond the contour range of the semantic target; the discriminator therefore also discriminates the environment area of the semantic target within the bounding box, which in turn constrains the output of the generative model, so that the local fusion in the generated target image is smooth and natural and the texture of the whole image is highly consistent.
• the features are embedded with a category information map based on the original image annotation, and a convolutional network is then used to obtain two-dimensional region discrimination results, which represent the probability that the image in each region is real.
• the present application weights the results of the region discrimination to obtain the loss function, so that the model pays more attention to the results of the generated region.
  • the discriminator of the present application can adopt the PatchGAN structure.
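• For reference, a compact PatchGAN-style discriminator is sketched below; it is fully convolutional and returns a grid of logits, one per image region, rather than a single scalar. The layer widths are illustrative and not taken from the present application.

```python
import torch.nn as nn

# PatchGAN-style discriminator: a fully convolutional network whose output is
# a 2-D map of logits, one per receptive-field patch of the input image.
patch_discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(256, 1, 4, stride=1, padding=1),  # B x 1 x h x w region logits
)
```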
  • the trained recognizer can directly generate the target image.
• the annotation of the target image is generated according to the semantic target information and the information of the first semantic target image, thereby realizing automatic labeling of the semantic target, avoiding the workload of manual labeling, and effectively improving the preparation efficiency of the training data set.
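• A sketch of this automatic labeling step is given below, assuming the semantic target information to be generated carries the class label and the first semantic target image supplies its bounding box; the dictionary layout is an illustrative assumption.

```python
def make_annotation(image_id, target_label, first_target_bbox):
    """Derive the target image's annotation from the semantic target
    information to be generated (label) and the first semantic target
    image's location (bounding box), with no manual labeling."""
    x, y, w, h = first_target_bbox  # the generated target reuses this region
    return {
        "image_id": image_id,
        "category": target_label,   # label of the generated semantic target
        "bbox": [x, y, w, h],       # location inherited from the replaced target
    }
```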
• the detection module in Figure 5 can be coupled with the generation module, that is, the recognizer and the discriminator are coupled.
• the discriminator and the detector can share feature extraction, thereby greatly reducing the computational load of the model and effectively improving the efficiency of image generation.
• a recognizer is introduced to automatically identify the semantic target to be replaced in the original picture and preprocess the semantic target; the features of the processed picture are then obtained and, combined with the semantic target information to be generated, the conditional distribution of the noise is generated and merged with the image features; the target is then automatically replaced with the semantic target expected to be generated, and finally a picture that conforms to the texture information of the image and satisfies diversity requirements is synthesized and automatically labeled, ensuring the diversity of the generated data.
  • Fig. 6 is a structural block diagram of an image generating device provided by an embodiment of the present application.
  • the image generation device 600 includes: a detection unit 610 , a generation unit 620 and a discrimination unit 630 .
  • the detection unit 610 is configured to determine a first semantic target image of the source image, where the first semantic target image is at least one of at least one semantic target image included in the source image.
• the generating unit 620 is configured to determine a first background image according to the first semantic target image; generate a prior distribution of noise according to the first background image and the semantic target information to be generated; and generate the target image according to the prior distribution and the first background image.
  • the generating unit 620 is configured to extract noise from the prior distribution, and generate the target image according to the noise and the first background image.
  • the generating unit 620 is configured to perform smoothing processing on the first semantic target image, and determine the first background image according to the smoothed first semantic target image.
  • the generating unit 620 is configured to remove the smoothed first semantic target image from the source image to obtain a first background image.
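• The smoothing-and-removal step can be pictured as mask dilation followed by masking, as in the following sketch (NumPy/SciPy; the mask convention and the number of dilation iterations are assumptions).

```python
import numpy as np
from scipy.ndimage import binary_dilation

def first_background(source, target_mask, smooth_iters=5):
    """source: H x W x 3 image array; target_mask: H x W boolean mask of the
    first semantic target image. Smoothing is modeled as dilating the mask so
    the region to be regenerated fully covers the replaced target."""
    smoothed = binary_dilation(target_mask, iterations=smooth_iters)
    background = source.copy()
    background[smoothed] = 0  # remove the smoothed target from the source image
    return background, smoothed
```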
  • the generation unit 620 is further configured to complete the labeling of the semantic target of the target image according to the semantic target information.
• the discrimination unit 630 is used to train the images generated by the generating unit to be realistic.
• the target image is used as an input image, and an image discriminator to be trained is used to identify the authenticity of the target image; the network parameter values of the image generator are adjusted according to the output result of the image discriminator and the input image, where the image generator is used to generate the target image; the target image generated by the image generator after the network parameter values are adjusted is used as an input image, and the identification action of the image discriminator to be trained is repeated until the training process converges.
  • the image discriminator to be trained discriminates images in different regions included in the target image.
  • the image discriminator to be trained performs region division on the target image.
  • the image discriminator to be trained is also used to perform smoothing processing on the target image including the target semantic information, and perform authenticity discrimination on the target image including the smoothed target semantic information.
• the smoothing process, for example, enlarges the bounding box of the target semantic image, and the image discriminator to be trained performs authenticity discrimination on the image in the bounding box area.
• the detection unit 610 and the generation unit 620 may share an image detector, where the image detector is used to perform feature extraction on the first semantic target image included in the source image.
  • image generation efficiency can be improved.
• the processor in the embodiments of the present application may be a central processing unit (CPU), and the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
• the non-volatile memory can be read-only memory (ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
• by way of example but not limitation, many forms of RAM are available, for example: static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
  • the above-described embodiments may be implemented in whole or in part in the form of computer program products.
  • the computer program product comprises one or more computer instructions or computer programs.
• when the computer instructions or computer programs are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
• the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wired or wireless (such as infrared, radio, microwave, etc.) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
  • the semiconductor medium may be a solid state drive.
• "At least one" means one or more, and "multiple" means two or more.
• "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single items or plural items.
• for example, at least one item (piece) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may be single or multiple.
• the sequence numbers of the above-mentioned processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
• multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
• if the functions described above are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
• the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
• the aforementioned storage media include: USB flash disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disc, and other media that can store program code.

Abstract

The present application discloses an image generation method. The image generation method comprises: determining a first semantic target image of a source image, the first semantic target image being at least one semantic target image included in the source image; and determining a target image according to the first semantic target image and a first prior distribution, the first prior distribution being a prior distribution of noise of a first background image and semantic target information to be generated, the target image comprising a plurality of images that comprise said semantic target information, said semantic target information comprising a label of a semantic target image to be generated and indication information of said semantic target image, and the first background image being a background image of the source image. Therefore, a target picture set having stable quality and diversity can be automatically generated for multiple semantic target images included in a specific sample picture set.

Description

Image generation method and apparatus
This application claims the priority of the Chinese patent application with application number 202111006421.1 and application title "Image Generation Method and Apparatus", filed with the China Patent Office on August 30, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular, to an image generation method and apparatus.
Background
At present, image generation technology is increasingly widely used in the field of computer vision, and can be applied in industrial automation, biomedicine, automobiles and other fields, for example, automatic driving, intelligent detection and video surveillance.
In the prior art, image data can be processed effectively through the powerful computing capability of deep neural network models. For example, a discriminative model can predict unknown properties of a given known input, that is, identify the type of a given picture and detect all semantic targets present in the picture; a generative model can model the distribution of the data to describe an observable data set and thus generate data with the same distribution. Combining current discriminative models and generative models makes it possible to convert images of different categories into one another.
Current generative models, for example, conditional generative adversarial nets (cGAN), have greatly inspired research on generating pictures of large, complex scenes. However, on the one hand, the labor cost required before current generative models can be applied cannot be ignored: for example, a sufficient number of original picture sets and target picture sets need to be prepared, and the semantic targets in them need to be manually annotated. On the other hand, in complex scenes only pictures for a single semantic target can be generated, and the quality of the generated pictures is poor; for example, semantic target details and image texture are missing, resulting in severely distorted pictures and even partially missing semantic targets.
How to automatically generate target picture sets of stable quality and diversity for the multi-semantic-target pictures contained in a small number of sample picture sets has become a problem that urgently needs to be solved in the industry.
Summary
The present application provides an image generation method and apparatus, which can automatically generate target picture sets of stable quality and diversity for the multi-semantic-target images contained in a small number of sample picture sets.
In a first aspect, an image generation method is provided, the method including: determining a first semantic target image of a source image, where the first semantic target image is at least one of at least one semantic target image contained in the source image; and determining a target image according to the first semantic target image and a first prior distribution, where the first prior distribution is a prior distribution of noise obtained from a first background image and semantic target information to be generated, the target image includes multiple images containing the semantic target information to be generated, the semantic target information to be generated includes a label of the semantic target image to be generated and indication information of the semantic target image to be generated, and the first background image is the background image of the source image.
According to the above scheme provided by the present application, a prior distribution is generated based on the background image of the source image and the information of the semantic target image to be generated; random sampling from this prior distribution can generate target images in which the semantic target image to be generated is diversely distributed over the first background image and conforms to the texture of the whole image.
With reference to the first aspect, in some implementations of the first aspect, determining the target image according to the first semantic target image and the first prior distribution includes: determining the first background image according to the first semantic target image; generating the first prior distribution according to the first background image and the semantic target information to be generated; and generating the target image according to the first prior distribution and the first background image.
According to the above technical solution, the generation module determines the region of the first background image according to the first semantic target image, further determines the features of the first background image, and generates the first prior distribution according to the first background image features and the semantic target to be generated, thereby ensuring that the generated image conforms to the texture of the whole image and avoiding image distortion; moreover, the prior distribution of the semantic target to be generated, built from the first background image features, allows subsequent sampling to be diverse, thereby guaranteeing the diversity of the target image.
With reference to the first aspect, in some implementations of the first aspect, generating the target image according to the first prior distribution and the first background image includes: generating the noise of the semantic target image to be generated according to the first prior distribution; and generating the target image according to the noise of the semantic target image to be generated and the first background image.
According to the above technical solution, noise is sampled from the first prior distribution and further combined with the first background image to generate the target image; the noise sampling is random, which guarantees the diversity of the target image.
With reference to the first aspect, in some implementations of the first aspect, determining the first background image according to the first semantic target image includes: performing smoothing processing on the first semantic target image; and determining the first background image according to the smoothed first semantic target image.
According to the above technical solution, the semantic target image to be replaced is smoothed to ensure that the region of the semantic target image to be generated can completely cover the replaced semantic target image, thereby ensuring that the generated target image conforms to the texture of the whole image and avoiding distortion of the semantic target image.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: removing the smoothed first semantic target image from the source image.
According to the above technical solution, deleting the first semantic target image from the source image allows the semantic target image to be generated to blend better into the first background image, which is beneficial to the consistency of the texture of the whole image.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: using the target image as an input image, and using an image discriminator to be trained to identify the authenticity of the target image;
adjusting the network parameter values of an image generator according to the output result of the image discriminator to be trained and the input image, where the image generator is used to generate the target image; and
using the target image generated by the image generator after the network parameter values are adjusted as the input image, and repeating the identification action of the image discriminator to be trained until the training process converges.
According to this technical solution, the generator and the discriminator are trained so that the images generated by the generator tend to be real, ensuring that the generated target images are more realistic and natural.
With reference to the first aspect, in some implementations of the first aspect, using the image discriminator to be trained to identify the authenticity of the target image includes:
the image discriminator to be trained discriminating images in different regions included in the target image, including:
the image discriminator to be trained, in combination with the semantic target information to be generated, weighting the different regions when calculating the loss function.
According to this technical solution, the discriminator performs region recognition on the target image and uses a convolutional network to obtain two-dimensional region discrimination results, which represent the probability that the image in each region is real. In particular, the present application weights the region discrimination results to obtain the loss function, so that the model pays more attention to the results of the generated region; that is, the semantic target image region to be generated is discriminated with emphasis, which improves the discriminative ability of the discriminator and makes the target images generated by the generator more realistic and natural.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: the image discriminator to be trained performing region division on the target image.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: the image discriminator to be trained smoothing the semantic target image to be generated included in the target image; and the image discriminator to be trained performing authenticity discrimination on the target image including the smoothed semantic target image to be generated.
According to this technical solution, the discriminator smooths the replaced semantic target image to ensure that the region of this semantic target image can completely cover the replaced semantic target image, thereby ensuring that the generated target image conforms to the texture of the whole image and avoiding distortion of the semantic target image.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: completing the annotation of the semantic target of the target image according to the semantic target information to be generated and the information of the first semantic target image.
According to this technical solution, the generated target image can be automatically annotated based on the semantic target label in the semantic target information to be generated, which avoids the workload of manual annotation and effectively improves the preparation efficiency of the training data set.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: the image discriminator to be trained including an image detector, where the image detector is used to perform feature extraction on the first semantic target image included in the source image.
According to this technical solution, the recognizer is coupled with the discriminator; in this case, the discriminator and the detector can share feature extraction, which greatly reduces the computational load of the model and effectively improves image generation efficiency.
In a second aspect, an image generation apparatus is provided, the image generation apparatus including units for executing the method in the first aspect or its various implementations.
In a third aspect, an image generation apparatus is provided, including a processor and a memory, where the memory is used to store a computer program, and the processor is used to call and run the computer program from the memory, so that the apparatus executes the image generation method in the first aspect and its various possible implementations.
Optionally, there are one or more processors, and one or more memories.
Optionally, the memory may be integrated with the processor, or the memory may be provided separately from the processor.
In a fourth aspect, a computer-readable storage medium is provided, where the computer-readable medium stores program code for execution by a device, and the program code includes instructions for executing the method in the first aspect or the second aspect.
In a fifth aspect, a computer program product including instructions is provided; when the computer program product runs on a computer, the computer is caused to execute the method in any one of the implementations of the above aspects.
In a sixth aspect, a chip is provided, the chip including a processor and a data interface, where the processor reads, through the data interface, instructions stored in a memory, and executes the method in any one of the implementations of the above aspects.
Optionally, as an implementation, the chip may further include a memory in which instructions are stored; the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method in any one of the implementations of the above aspects.
The above chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
Description of Drawings
Fig. 1 is a schematic structural diagram of a system architecture provided by an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of a product implementation form provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of an image generation method provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of a discriminator structure and a detector structure with shared feature extraction provided by an embodiment of the present application;
Fig. 6 is a schematic block diagram of an image generation apparatus provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application.
It should be understood that the names of all nodes and apparatuses in this application are only names set for convenience of description, and the names in actual applications may be different; it should not be understood that this application limits the names of the various nodes and apparatuses. On the contrary, any name with the same or similar function as a node or apparatus used in this application is regarded as the method of this application or an equivalent replacement, and falls within the protection scope of this application; this is not repeated below.
Since the embodiments of the present application involve extensive application of neural networks, for ease of understanding, related terms and concepts of neural networks that may be involved in the embodiments of the present application are introduced first below.
(1) Neural network
A neural network may be composed of neural units. A neural unit may refer to an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit may be: $h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$, where $s=1,2,\ldots,n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, used to apply a nonlinear transformation to features obtained in the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field, where the local receptive field may be a region composed of several neural units.
(2) Deep neural network
A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers. Dividing the DNN by the positions of different layers, the neural network inside the DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer.
Although a DNN looks complicated, the work of each layer is actually not complicated. Simply put, each layer is the following linear relationship expression: $\vec{y}=\alpha(W\vec{x}+\vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has many layers, there are also many coefficients $W$ and offset vectors $\vec{b}$. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example. Assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$: the superscript 3 represents the layer in which the coefficient $W$ is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4. In summary, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as $W^{L}_{jk}$.
It should be noted that the input layer has no parameter $W$. In a deep neural network, more hidden layers enable the network to better characterize complex situations in the real world. In theory, a model with more parameters has higher complexity and larger "capacity", which means it can complete more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its ultimate purpose is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors $W$ of many layers).
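The per-layer computation described above can be illustrated numerically. In the following sketch (NumPy; the layer sizes are arbitrary), W[j, k] plays the role of the coefficient $W^{L}_{jk}$ from neuron k of layer L-1 to neuron j of layer L:

```python
import numpy as np

def layer_forward(W, b, x):
    """One DNN layer: y = alpha(W @ x + b), with a sigmoid activation."""
    z = W @ x + b
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # 4 neurons in layer L-1
W = rng.normal(size=(3, 4))     # W[j, k] = coefficient W^L_{jk}
b = rng.normal(size=3)          # offset vector of layer L
y = layer_forward(W, b, x)      # 3 neurons in layer L
```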
(3) Convolutional neural network
A convolutional neural network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network contains a feature extractor composed of convolutional layers and sub-sampling layers, and the feature extractor can be regarded as a filter. The convolutional layer refers to the neuron layer in the convolutional neural network that performs convolution processing on the input signal. In a convolutional layer of a convolutional neural network, a neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of neural units arranged in a rectangle. Neural units of the same feature plane share weights, and the shared weights here are the convolution kernel. Sharing weights can be understood as the way of extracting image information being independent of position. The convolution kernel can be initialized in the form of a matrix of random size, and during the training of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, a direct benefit of sharing weights is reducing the connections between the layers of the convolutional neural network while also reducing the risk of overfitting.
(4) Loss function
In the process of training a deep neural network, because it is hoped that the output of the deep neural network is as close as possible to the value actually desired to be predicted, the predicted value of the current network can be compared with the actually desired target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is high, the weight vectors are adjusted to make it predict lower, and adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the loss function or objective function, which is an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so the training of the deep neural network becomes a process of reducing this loss as much as possible.
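As a concrete instance of a loss function, a squared-error loss can be computed as follows (illustrative values; real training would minimize this loss by adjusting the weight vectors):

```python
import numpy as np

def mse_loss(pred, target):
    """Squared-error loss: a higher output means the prediction is
    further from the truly desired target value."""
    return float(np.mean((pred - target) ** 2))

pred = np.array([0.8, 0.1])
target = np.array([1.0, 0.0])
loss = mse_loss(pred, target)   # training updates weights to shrink this value
```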
(5) Backbone network
In neural networks, especially in the field of computer vision (CV), features are generally first extracted from the image. This part is the foundation of the entire CV task, because all subsequent tasks depend on the extracted image features. Therefore, this part of the network structure is called the backbone network.
(6) U-Net
U-Net is one of the earlier algorithms using a fully convolutional network for semantic segmentation. Its network structure is divided into a down-sampling stage and an up-sampling stage; the network structure contains only convolutional layers and pooling layers, without fully connected layers. The shallower high-resolution layers in the network are used to solve the problem of pixel localization, and the deeper layers are used to solve the problem of pixel classification, so that segmentation at the image semantic level can be achieved. The structure of U-Net includes a contracting path that captures context information and a symmetric expanding path that allows precise localization. In the present application, this method can be understood as completing end-to-end training with very little data, that is, the input is an image and the output is also an image, while obtaining the best results.
(7) Image features
Image features can be understood as numerical features transformed from raw data through feature extraction operations, which are convenient for the algorithm to understand and process. In the embodiments of the present application, image features specifically refer to the image features extracted using the backbone network model.
(8) Discriminative model
In the embodiments of the present application, a discriminative model can predict unknown attributes for a given known input, that is, identify the category to which a given picture belongs and detect all semantic targets existing in the picture. The discriminative model is, for example, a target classification model or a target detection model.
Among discriminative models, the first to be realized was the classification of images of a single semantic target, that is, the target classification model. The model uses convolutional layers to extract features of local regions of the image and inputs the obtained features to a fully connected layer for classification. On this basis, target detection models use a similar backbone network to extract features from images containing multiple semantic targets, and then locate and recognize the semantic targets in the original image according to the feature map output by the backbone network. According to the model structure, target detection models can be further divided into single-step models and two-step models. The single-step methods, represented by the YOLO neural network (you only look once, YOLO) and the SSD detector (single shot multibox detector, SSD), use a regression approach that takes the overall feature map as input to compute the probability that each grid cell contains a semantic target. The two-step methods, represented by Faster Region Based Convolutional Neural Networks (Faster RCNN) and Mask Region Based Convolutional Neural Networks (Mask RCNN), introduce a region proposal network: they first identify, based on the feature information, whether the preset candidate boxes contain semantic targets, and then input the regions with a high probability of containing semantic targets into a classifier for target detection.
(9) Generative model
In the embodiments of the present application, the generative model takes the generative adversarial model as an example. The model consists of a generator and a discriminator, where the generator aims to learn the mapping from a noise distribution to the target data distribution, and the discriminator is used to judge whether the received data is real. During training, the discriminator is first trained using the fake data generated by the generator and the real data respectively; then the generator and the discriminator are connected in series, the parameters of the discriminator are fixed, and the generator is trained with the goal of "generating real data". During training, the discriminator's ability to distinguish real from fake improves, which in turn makes the pictures produced by the generator closer to the real distribution, finally reaching an equilibrium. The training of a generative adversarial network can be understood as the following optimization problem:
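The formula for this optimization problem appeared as an image in the original text and is not reproduced above; it presumably corresponds to the standard minimax objective of a generative adversarial network, which in conventional notation reads

$\min_G \max_D V(D,G)=\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}[\log(1-D(G(z)))]$

where $p_{data}$ is the distribution of the real data, $p_z$ is the noise distribution, $G$ is the generator, and $D$ is the discriminator.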
The generative adversarial model can perform well in generating small single-category pictures, but it has difficulty handling high-definition pictures and cannot generate pictures of different categories based on one model. On this basis, the conditional generative adversarial model (cGAN) introduces control conditions on top of the original model, achieving controllability over the category of the generated semantic target, so that one model can be used to generate pictures of different categories. Specifically, the model represents the category of the semantic target to be generated as a one-dimensional vector serving as the condition vector. In the generator, an embedding layer is used to encode the noise together with the condition, which is then passed into the network for generation; in the discriminator, the features obtained by the backbone network are encoded together with the noise to judge whether the input picture is a real picture of the target category.
To facilitate understanding of the embodiments of the present application, a schematic structural diagram of a system architecture 100 of an embodiment of the present application is first briefly described with reference to Fig. 1. As shown in Fig. 1, the system architecture 100 includes a data set module 110 and a model training module 120. The data set module 110 is one of the foundations of model training, and a high-quality, large-scale data set is the key to training a high-quality model.
In the data set module 110, as shown in Fig. 1, the data collection device 111 is used to collect raw data; the data preprocessing device 112 can be used to screen and filter the raw data and perform data annotation, where the annotation may be manual or automatic, which is not limited in the embodiments of the present application. The data generation device 113 is used to generate new data according to the annotated data. Data generation is an important means of solving the problem of insufficient data set scale.
The data set module 110 also includes a data repository 114 for storing automatically generated data sets. These data sets can be used by the training module for model training.
As shown in Fig. 1, the model training module 120 includes a training device 121, and the training device 121 can obtain the target model 122 by training based on the training data maintained in the database 130.
The process by which the training device 121 obtains the target model 122 based on the training data is described below: the training device 121 processes the input data and compares the output data with the original input data until the difference between the data output by the training device 121 and the original input data is less than a certain threshold, thereby completing the training of the target model 122.
The above target model 122 can be used to implement the image generation method of the embodiments of the present application, that is, the picture set automatically generated by the data generation device 113 is input into the target model 122, and the authenticity of the pictures included in the picture set can be judged. The target model 122 in the embodiments of the present application may specifically be a cGAN network.
It should be noted that, in practical applications, the training data maintained in the database 114 is not necessarily all collected by the data collection device 111, and may also be received from other devices. It should also be noted that the training device 121 does not necessarily train the target model entirely based on the training data maintained by the database 114, and may also obtain training data from the cloud or elsewhere for model training; the above description should not be taken as a limitation on the embodiments of the present application.
The target model 122 obtained by training by the training device 121 can be applied to different systems or devices, such as an execution device; the execution device may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, augmented reality (AR)/virtual reality (VR), or a vehicle-mounted terminal, and may also be a server or a cloud.
It is worth noting that the training device 121 can generate, for different goals or different tasks, corresponding target models 122 based on different training data, and the corresponding target models 122 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.
It is also worth noting that Fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationships among the devices, components, modules, etc. shown in the figure do not constitute any limitation.
Since CNN is a very common neural network, the structure of CNN is introduced in detail below with reference to Fig. 2. As described in the introduction to basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture; a deep learning architecture refers to an algorithm, updated through the neural network model, that performs multiple levels of learning at different levels of abstraction. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to the images input into it.
In Fig. 2, the convolutional neural network (CNN) 200 may include an input layer 210, convolutional layers/pooling layers 220 (where the pooling layers are optional), and a fully connected layer 230. The convolutional neural network in Fig. 2 is applicable to an image classification model structure. The internal layer structure of the CNN 200 in Fig. 2 is described in detail below.
Convolutional layers/pooling layers 220:
Convolutional layer:
As shown in Fig. 2, the convolutional layers/pooling layers 220 may include, as an example, layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The convolutional layer 221 is taken as an example below to introduce the internal working principle of one convolutional layer.
The convolutional layer 221 may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel after another (or two pixels after two pixels, depending on the value of the stride), thereby completing the work of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolutional image, where this dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features from the image; for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), so the convolutional feature maps extracted by them also have the same size, and the extracted convolutional feature maps of the same size are then combined to form the output of the convolution operation.
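The sliding-window computation described above can be made concrete with a small example (PyTorch; the kernel size, stride and channel counts are arbitrary):

```python
import torch
import torch.nn as nn

# One convolutional layer: 3 input channels, 8 kernels (weight matrices) of
# size 3 x 3, moved one pixel at a time (stride 1) across the input image.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)
image = torch.randn(1, 3, 32, 32)       # one 32 x 32 RGB input
features = conv(image)                  # 1 x 8 x 32 x 32: 8 stacked feature maps
# The kernel depth matches the input depth, as noted above:
assert conv.weight.shape == (8, 3, 3, 3)
```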
In practical applications, the weight values in these weight matrices need to be obtained through extensive training, and the weight matrices formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 makes correct predictions.
When the convolutional neural network 200 has multiple convolutional layers, the earlier convolutional layers (for example, layer 221) often extract more general features, which may also be called low-level features; as the depth of the convolutional neural network 200 increases, the features extracted by the later convolutional layers (for example, layer 226) become more and more complex, such as high-level semantic features, and features with higher semantics are more applicable to the problem to be solved.
Pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. In the layers 221 to 226 shown as 220 in FIG. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator, for sampling the input image to obtain an image of a smaller size. The average pooling operator computes the average of the pixel values within a specific range as the result of average pooling; the maximum pooling operator takes the pixel with the largest value within a specific range as the result of maximum pooling. In addition, just as the size of the weight matrix in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the output image represents the average or maximum value of the corresponding sub-region of the input image.
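As an illustrative sketch only (again assuming PyTorch; the sizes are arbitrary assumptions), the two pooling operators described above can be expressed as follows:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 224, 224)                 # feature maps from a convolutional layer
max_pool = nn.MaxPool2d(kernel_size=2, stride=2) # keeps the maximum of each 2x2 region
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2) # averages each 2x2 region
print(max_pool(x).shape, avg_pool(x).shape)      # both torch.Size([1, 64, 112, 112])
```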
Fully connected layer 230:
After the processing of the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet ready to output the required output information, because, as described above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate one output or a group of outputs whose quantity equals the number of required classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n shown in FIG. 2) and an output layer 240. The parameters contained in the multiple hidden layers may be obtained through pre-training based on training data related to a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the multiple hidden layers in the fully connected layer 230, the last layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to categorical cross-entropy and is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 (in FIG. 2, propagation in the direction from 210 to 240 is forward propagation) is completed, back propagation (in FIG. 2, propagation in the direction from 240 to 210 is back propagation) starts to update the weight values and biases of the aforementioned layers, to reduce the loss of the convolutional neural network 200, that is, the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
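The forward pass, cross-entropy-like loss at the output layer, and back propagation described above can be sketched as follows (PyTorch assumed; the toy layer sizes, class count, and batch size are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Hidden layers 231..23n and output layer 240, reduced to a toy size.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),   # hidden layer
    nn.Linear(256, 10))                      # output layer: 10 class logits
criterion = nn.CrossEntropyLoss()            # loss similar to categorical cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

features = torch.randn(8, 64, 8, 8)          # pooled feature maps for a batch of 8
labels = torch.randint(0, 10, (8,))

logits = model(features)                     # forward propagation (210 -> 240)
loss = criterion(logits, labels)             # prediction error at the output layer
optimizer.zero_grad()
loss.backward()                              # back propagation (240 -> 210)
optimizer.step()                             # update weight values and biases
```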
It should be noted that the convolutional neural network shown in FIG. 2 is merely an example of a convolutional neural network; in a specific application, the convolutional neural network may also exist in the form of another network model.
A product implementation form provided by an embodiment of this application is described below.
FIG. 3 is a schematic diagram of a product implementation form provided by an embodiment of this application.
A product implementation form of an embodiment of this application may be program code that is included in a dataset system and deployed on server hardware. The program code of this application may reside in the data generation module of the dataset system, for example, the data generation device 113 in the system architecture 100. This part of the code runs on host storage (memory or disk) and is used to execute the innovative automatic data generation method.
FIG. 3 shows a product implementation form of this application. The product includes a server 310, software 320, and hardware 330. The hardware 330 includes host memory or a disk and is used to run the program code of this application. The software 320 includes an image recognition apparatus 321 and an image generation apparatus 322. The image recognition apparatus 321 is used to identify the authenticity of an input image. The image generation apparatus 322 is used to generate diverse images carrying the target semantic information required by a user, and to store the generated pictures in the host memory or disk for input into the image recognition apparatus 321 for picture recognition.
The image generation method and apparatus provided in the embodiments of this application can be applied to other similar picture editing and replacement tasks, for example, replacement of vehicles or other target objects in different road-condition scenes in autonomous driving, replacement of hairstyles, hair colors, and headwear of persons in video surveillance scenes, and replacement of different product shapes, colors, and layouts in industrial automation quality monitoring scenes. Specifically, the image generation method of the embodiments of this application can be applied to picture editing and replacement scenarios; the embodiments of this application take an autonomous driving scenario as an example to describe the image generation method in detail.
To facilitate understanding of this embodiment, a sample image generation method disclosed in an embodiment of the present disclosure is first described in detail. The execution subject of the sample image generation method provided in the embodiments of the present disclosure is generally a computer device with certain computing capabilities. The computer device includes, for example, a terminal device, a server, or another processing device. The terminal device may be user equipment (user equipment, UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (personal digital assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the sample image generation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
The image generation method of an embodiment of this application is described below with reference to FIG. 4. FIG. 4 is a schematic flowchart of an image generation method 400 provided by an embodiment of this application. The method may be applied to a dataset system, and the dataset system includes a detection module and a generation module. The generation module includes a discriminator and a generator, and the generator includes an encoding structure and a decoding structure. The detection model and the generation model may be trained based on a generative adversarial network. The method may be executed by the foregoing execution device. Optionally, the method 400 may be processed by a CPU, jointly processed by a CPU and a GPU, or processed by another processor suitable for neural network computation instead of a GPU, which is not limited in this application.
During training, the detection model first trains a picture recognizer using annotated source images. An annotated source image is a source image in which the semantic targets included in the picture have been annotated in advance, either manually or automatically, which is not limited. The purpose of training the picture recognizer is to enable it to recognize the semantic targets corresponding to the annotations. The training may adopt a training method in the prior art or any method that can achieve the purpose of training the picture recognizer, which is not limited in the embodiments of this application.
In the embodiments of this application, the trained detection model performs semantic target recognition on the source image, and the generation model then edits the recognized semantic targets according to pre-given conditions, realizing end-to-end automatic image generation and automatic annotation.
In the embodiments of this application, the source image is an example of an input original picture, and the source image may include multiple input pictures; the first semantic target image is an example of a semantic target; the first background image is an example of an image obtained after the source image is processed according to the user's to-be-generated semantic target information; and the target image is an example of a picture generated by the generator.
It should be understood that the source image may also be called an original image, an input image, an original picture, or a similar term, and the target image may also be called a generated image, a generated picture, or a similar term. The embodiments of this application use the terms source image and target image as examples for description, which is not limited.
The method 400 includes steps S410 to S440, which are outlined below.
S410: The detection model determines a first semantic target image of a source image.
S420: The generation model determines a first background image according to the first semantic target image.
S430: The generation model generates a first prior distribution according to the first background image and to-be-generated semantic target information.
S440: The generation model generates the target image according to the first prior distribution and the first background image.
Steps S410 to S440 are described in detail below.
S410: The detection model determines a first semantic target image of the source image.
The source image serves as the input image of the dataset system. The detection model of the dataset system includes a picture recognizer, and the picture recognizer can recognize the first semantic target image of the source image.
In the embodiments of this application, the image generation method may be used in an autonomous driving scenario. Therefore, the source image may be an image collected by an autonomous vehicle, and the first semantic target image may be a vehicle image in the source image, for example, a bus, a car, or a bicycle.
Specifically, in a possible implementation, the foregoing recognizer is trained and can therefore recognize the first semantic target image according to the annotations. The source image is an original input image and includes one or more images, each image includes one or more semantic targets, and the first semantic target image is at least one of them.
In a possible implementation, the detection model annotates the recognized semantic target image, which specifically includes annotating the bounding box of the semantic target and the category of the semantic target. For a detected bounding box, its position is recorded and the box is appropriately enlarged so as to include part of the boundary region, to avoid insufficiently smooth fusion between the generated semantic target and the surrounding environment.
For example, the source image is an image of an autonomous driving scene and contains multiple semantic targets, for example, a bus, a road, the sky, plants, and pedestrians. The first semantic target image may be a vehicle, for example, a bus. The detection model records the position of the first semantic target image, enlarges its bounding box, and identifies the category of the semantic target as vehicle.
It should be understood that the bounding box of the first semantic target image may have any shape. The arbitrary shape can be understood as appropriately enlarging the detected region containing the first semantic target image to capture part of the surrounding environment image, so the arbitrary shape includes the complete image of the first semantic target and part of the surrounding environment image. The arbitrary shape is, for example, a rectangle, a circle, or a trapezoid, which is not limited in the embodiments of this application.
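The box enlargement described above can be sketched, for the rectangular case, with a hypothetical helper (the name, the 15% margin, and the coordinate convention are assumptions for illustration, not the claimed implementation):

```python
def enlarge_box(x1, y1, x2, y2, img_w, img_h, ratio=0.15):
    """Enlarge a detected bounding box by a margin so that the region also
    contains part of the surrounding environment (hypothetical helper)."""
    dw, dh = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (max(0, int(x1 - dw)), max(0, int(y1 - dh)),
            min(img_w, int(x2 + dw)), min(img_h, int(y2 + dh)))

print(enlarge_box(100, 80, 300, 240, 640, 480))  # (70, 56, 330, 264)
```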
It should be understood that the foregoing examples are merely illustrative and should not constitute any limitation on the embodiments of this application. This application takes image generation in an autonomous driving scenario as an example; in addition, the image generation method provided in the embodiments of this application can also be applied to any image generation scenario in which a semantic target needs to be replaced.
S420: The generation model determines a first background image according to the first semantic target image.
The generation model obtains the first semantic target image detected by the detection model, edits the source image according to the first semantic target image to determine the first background image, and performs feature extraction to determine the features of the first background image.
Specifically, the generation model obtains the first semantic target image detected by the detection model, determines the position and annotation information of the first semantic target image, and processes the source image according to the annotation information of the first semantic target to determine the first background image. Further, feature extraction is performed on the first background image to obtain the features of the first background image.
In a possible implementation, smoothing processing is performed on the first semantic target image, which is used to enlarge the region of the first semantic target image. It can be understood that, after the enlargement, the to-be-generated semantic target can completely cover the region of the first semantic target image, thereby achieving texture consistency across the whole image.
It should be understood that, in this application, after the first semantic target image is smoothed, the region of the first semantic target image may be directly covered by the to-be-generated target semantic image.
In a possible implementation, the smoothed first semantic target image is removed from the source image to obtain the first background image.
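A minimal sketch of this removal step, assuming PyTorch tensors and zero-filling as the erasure strategy (the helper name and fill value are illustrative assumptions):

```python
import torch

def make_background(source, box):
    """Erase the enlarged semantic-target region of the source image to
    obtain the first background image (hypothetical helper)."""
    x1, y1, x2, y2 = box
    background = source.clone()
    background[:, :, y1:y2, x1:x2] = 0.0   # region to be re-generated later
    return background

src = torch.randn(1, 3, 480, 640)          # (N, C, H, W) source image
bg = make_background(src, (70, 56, 330, 264))
```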
In a possible implementation, the to-be-generated semantic target information is given semantic target information. For example, the semantic target information may be "replace the bus with a truck" or "remove the bus"; similarly, any requirement defined according to user needs may serve as the semantic target information.
In a possible implementation, the to-be-generated semantic target information may also be defined by the system, which is not limited in the embodiments of this application.
Preferably, the to-be-generated semantic target information may include a label of the semantic target image and indication information of the to-be-generated semantic target image. The semantic target here can be understood as the semantic target to be generated, and the to-be-generated semantic target is close to the semantic target in the source image in volume and semantic category.
For example, when the first semantic target image detected in the source image is a large vehicle, the label of the to-be-generated semantic target may be truck, bus, fire engine, or the like; the semantic target label included in the semantic target information is vehicle, and the semantic target volume is described as large. For another example, when the first semantic target image is a bicycle, the to-be-generated semantic target may be a motorcycle, a tricycle, or the like; the semantic target category included in the semantic target information is vehicle, and the semantic target volume is described as small.
It should be understood that processing the source image according to the to-be-generated semantic target information and the annotation information of the first semantic target image includes smoothing the first semantic target image. For example, if the indication information in the to-be-generated semantic target information is to replace the bus with a truck, the generation model determines the position and bounding box of the first semantic target in the source image according to the annotation information of the first semantic target, and deletes or overwrites the image within the bounding box, thereby obtaining the first background image. For another example, if the indication information in the to-be-generated semantic target information is to remove the bus, the generation model determines the position and bounding box of the first semantic target, deletes the image within the bounding box, and fills it with surrounding-environment color blocks; as one understanding, for example, road-surface color blocks are used to fill the bounding box, thereby obtaining the first background image.
It should be understood that the foregoing examples are merely illustrative and should not constitute any limitation on the embodiments of this application.
S430: The generation model generates a first prior distribution according to the first background image and the to-be-generated semantic target information.
It should be understood that, in the embodiments of this application, the generator includes an encoding structure and a decoding structure, and the encoder includes a prior condition encoding module.
Specifically, with the first background image as input, the encoder downsamples the image through convolutional layers and pooling layers and extracts the features of the first background image. Then, the extracted features of the first background image and the to-be-generated semantic target information are encoded and merged, and input to the prior condition encoding module. The prior condition encoding module obtains the prior conditions of the current source image, and generates, according to the prior conditions, the prior distribution of the noise of the to-be-generated semantic target.
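The following sketch illustrates one plausible shape of such an encoder-plus-prior-condition module, assuming PyTorch and a Gaussian prior parameterized by a mean and log-variance; the class name, label embedding, and all sizes are assumptions for illustration rather than the claimed design:

```python
import torch
import torch.nn as nn

class PriorConditionEncoder(nn.Module):
    """Downsamples the first background image and, together with an
    embedding of the to-be-generated semantic label, predicts the mean and
    log-variance of the Gaussian prior of the noise."""
    def __init__(self, num_labels=10, z_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())      # -> 64-d feature
        self.label_emb = nn.Embedding(num_labels, 64)    # to-be-generated label
        self.prior_head = nn.Linear(64 + 64, 2 * z_dim)  # -> (mu, log_var)

    def forward(self, background, label):
        h = torch.cat([self.backbone(background), self.label_emb(label)], dim=1)
        mu, log_var = self.prior_head(h).chunk(2, dim=1)
        return mu, log_var

enc = PriorConditionEncoder()
mu, log_var = enc(torch.randn(1, 3, 256, 256), torch.tensor([3]))
```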
It should be understood that the first prior distribution is merely illustrative, which is not limited in the embodiments of this application.
It should be understood that the to-be-generated semantic target information is determined through user definition, system definition, or the like. As one understanding, the to-be-generated semantic target may be, for example, object B, object C, object D, or object E, and the first semantic target image may be object A. In this case, the to-be-generated semantic target information can be understood as replacing object A in the source image with object B, or with object C or object D, and so on. Taking "replace object A in the source image with object B" as an example: the recognizer first identifies object A in the source image; after the source image is input to the generator, the region of object A is first smoothed in the generator to obtain the first background image; the encoder then performs feature extraction on the first background image through sampling, encodes and merges the features of the first background image with the information of object B, and inputs the result to the prior condition encoding module to obtain the prior distribution of object B. It can be understood that, if object A is instead to be replaced with object C, the prior distribution of object C needs to be generated.
In the embodiments of this application, it should be understood that the first background image serves as the input of the encoder, and the encoder obtains a length vector and a matrix through the convolutional layers and pooling layers. The vector and matrix serve as the prior conditions of a Gaussian distribution, and sampling is performed from this Gaussian distribution. When a single object A is to be generated, depending on the random samples drawn from the Gaussian distribution and on the distribution parameters of the Gaussian distribution, different images of object A, such as A1, A2, and A3, are generated, so that diverse images of object A can be obtained. The magnitude of the distribution parameters depends on the size of the replaced image region, that is, the region size of the first semantic target image.
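A minimal sketch of such sampling, assuming the common reparameterized form of a Gaussian draw (the function name and placeholder parameters are assumptions):

```python
import torch

def sample_noise(mu, log_var):
    """Draw one noise sample z ~ N(mu, sigma^2) from the conditional prior."""
    std = torch.exp(0.5 * log_var)
    return mu + std * torch.randn_like(std)

mu, log_var = torch.zeros(1, 128), torch.zeros(1, 128)  # placeholder prior parameters
z1 = sample_noise(mu, log_var)  # each draw yields a different variant (A1, A2, ...)
z2 = sample_noise(mu, log_var)
```

Because every call draws fresh random noise, repeated sampling under the same prior conditions is what produces the diverse variants A1, A2, A3 described above.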
Further, the noise sampled from the Gaussian distribution is merged with the features of the first background image and decoded. For example, when the first background image represents a well-lit environment, the target images obtained by merging and decoding may be images A1, A2, A3, and so on of object A under sufficient light; for another example, when the first background image represents a rainy environment, the target images obtained by merging and decoding may be images A1', A2', A3', and so on of object A in rainy weather.
It should be understood that using the first background image as a prior condition makes it possible to obtain the environment information of the source image. The environment information can be understood as the condition of the surrounding environment excluding the semantic target to be replaced, so that the environment and texture of the generated region can be controlled to be consistent with the whole image.
It should be understood that conditioning on the to-be-generated semantic target information, which includes the label of the semantic target to be generated, makes it possible to control the semantic target output type of the generated picture to be similar to or consistent with the source image.
S440: The generation model generates the target image according to the first prior distribution and the first background image.
Specifically, the generation model samples noise from the first prior distribution, merges it with the first background image features extracted by the encoder, and inputs the result to the decoder to generate the target image.
As an optional understanding, the noise sampled by the generation model from the prior distribution can be understood as a Gaussian-distribution-based feature of the variation in the position distribution of the to-be-generated semantic target image within the first background image. Encoding this variation feature together with the first background image features yields multiple different features, and decoding these different features with the decoder yields different images.
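One plausible way to merge the sampled noise with the background features and decode them is sketched below (PyTorch assumed; the broadcast-addition fusion, transposed-convolution decoder, and all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

z_dim, feat_ch = 128, 64
fuse = nn.Linear(z_dim, feat_ch)                    # project noise to feature channels
decoder = nn.Sequential(
    nn.ConvTranspose2d(feat_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())

bg_feat = torch.randn(1, feat_ch, 64, 64)           # background features from the encoder
z = torch.randn(1, z_dim)                           # noise drawn from the prior
merged = bg_feat + fuse(z)[:, :, None, None]        # broadcast-merge noise with features
target = decoder(merged)                            # generated image, (1, 3, 256, 256)
```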
It should be understood that the foregoing prior distribution based on the Gaussian distribution is merely illustrative; the prior distribution may also take other forms, which is not limited in the embodiments of this application.
It should be understood that the different images generated above are generated based on sampled noise, and the noise obtained in each sampling is different, thereby realizing the generation of images in which the position distribution of the to-be-generated semantic target varies within the source image.
It can be understood that the variation in the position distribution of the to-be-generated semantic target within the source image covers changes in angle, position, layout, and the like; moreover, the color of the to-be-generated semantic target can also be changed according to user definition.
Under the guidance of the prior distribution, the position of the to-be-generated semantic target image within the first background image has a more reasonable and diverse distribution, so that the generated sample data is more reasonable and abundant.
In a possible implementation, the generation model generates the annotation of the target image according to the to-be-generated semantic target information and the information of the first semantic target image, realizing automatic annotation of the semantic target.
It should be understood that the foregoing generator and discriminator are both trained; a specific training method is described below with reference to FIG. 5.
To facilitate understanding by those skilled in the art, the following description is provided with reference to the example in FIG. 5.
FIG. 5 is a schematic block diagram of the modules included in an embodiment of this application. As shown in FIG. 5, they include a detection module and a generation module.
The detection module includes a recognizer; the generation module includes a generator and a discriminator, and the generator includes an encoding structure and a decoding structure. The detection model and the generation model may be trained based on a generative adversarial network.
For the detection module, specifically, the source image serves as input, the source image is annotated in advance, and the annotated source image is used to train the recognizer, so that the recognizer can detect semantic targets. The training may adopt a training method in the prior art or any method that can achieve the purpose of training the picture recognizer, which is not limited in the embodiments of this application.
It should be understood that the trained recognizer can determine the first semantic target image of the source image, as described in step S410 of the method 400; details are not repeated here.
It should be understood that when the data samples are unbalanced, the model's recognition capability is strengthened for categories with abundant data and weakened for categories with less data. Therefore, the selected bounding box has a higher probability of selecting a category with a larger sample size. In the semantic target editing module, the selected original semantic target is replaced with another semantic target, while semantic targets that could not be recognized, together with their labels, are retained in the original image.
For the generation module, specifically, in the semantic target editing module, the source image may be processed according to the to-be-generated semantic target information to obtain the first background image; the first background image is input to the encoder to obtain the first background image features; the source image and the target semantic category are then input to the encoder to obtain the prior conditions, from which the prior distribution of the noise is generated; noise features are then sampled from the prior distribution, the noise features and the first background image features are input to the encoder for editing and merging, and the resulting features are input to the decoder for decoding to obtain a synthesized image.
It should be understood that the image synthesized by the generator needs to be input to the discriminator to further judge whether it is real.
It can be understood that the generator and the discriminator are connected in series with the discriminator parameters fixed, and the parameters of the generator are optimized with "this picture is real" as the target, so that the pictures generated by the generator can "fool" the discriminator; that is, the pictures generated by the generator are as close as possible to real pictures.
Specifically, the image synthesized by the generator is input to the discriminator. The discriminator needs to obtain a real picture, that is, the source image, and judges, based on the source image, whether the input image is real. The network parameter values of the generator are adjusted according to the discrimination result of the discriminator. Further, the picture generated by the adjusted generator is used as the input of the discriminator and the foregoing steps are repeated until the discriminator judges the output to be real, and the training end condition of the image generator to be trained and the training end condition of the image discriminator to be trained reach a balance.
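The alternating optimization described above can be sketched as a single adversarial step (PyTorch assumed; generator, discriminator, their optimizers, and the inputs are assumed to exist as sketched earlier, and the binary real/fake loss is an illustrative choice):

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, g_opt, d_opt, source, bg, z):
    # 1) Update the discriminator: the source image is "real", the generated image "fake".
    fake = generator(bg, z).detach()
    real_logits, fake_logits = discriminator(source), discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) With the discriminator held fixed, optimize the generator toward the target
    #    "this picture is real", so that its output can fool the discriminator.
    fake_logits = discriminator(generator(bg, z))
    g_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```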
In a possible implementation, the discriminator performs smoothing processing on the target semantic information included in the target image, and then performs authenticity discrimination on the target image including the smoothed target semantic information. As an optional understanding, the discriminator discriminates the image generated by the generator against the real image; the generated image differs from the real image only within the bounding box of the first semantic target, and the other semantic targets are unchanged. Therefore, the main discrimination region of the discriminator is the region within the bounding box of the first semantic target.
In a possible implementation, the discriminator divides the target image into regions and performs authenticity discrimination on the different regions.
Specifically, the image discriminator to be trained, in combination with the to-be-generated semantic target information, weights the different regions when computing the loss function.
It should be understood that when the discriminator recognizes a region with a high overlap rate with the to-be-generated semantic target image, it increases the weight of the loss function for that region when computing the loss, so that the discrimination model pays more attention to the generated semantic target image, thereby improving the effect of the model; that is, the authenticity of the discrimination in this region can be improved.
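One way such a region-weighted loss could look is sketched below (PyTorch assumed; the weight value, mask convention, and shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def weighted_region_loss(region_logits, region_targets, gen_mask, w=5.0):
    """Per-region BCE in which regions that overlap the generated semantic
    target (gen_mask == 1) receive a larger weight; shapes are (N, 1, H', W')."""
    bce = F.binary_cross_entropy_with_logits(region_logits, region_targets,
                                             reduction='none')
    weights = 1.0 + (w - 1.0) * gen_mask   # weight w inside the generated region
    return (weights * bce).mean()
```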
In a possible implementation, the discriminator to be trained enlarges the bounding box that includes the target semantic information in the image, and performs authenticity discrimination on the image within the bounding box region.
It should be understood that, in the embodiments of this application, the discrimination region of the discriminator is expanded beyond the contour of the semantic target. Therefore, the discriminator also performs discrimination on the environment region of the semantic target within the bounding box, thereby controlling the output of the generation model, so that the target image generated by the trained generator is locally fused smoothly and naturally, and the texture of the whole image is highly consistent.
In a possible implementation, after the backbone network of the discriminator extracts image features, the features are embedded with a category information map based on the original image annotations, and a convolutional network is then used to obtain a two-dimensional region discrimination result, which indicates the probability that the image in each region is real. In particular, this application weights the region discrimination results to obtain the loss function, so that the model pays more attention to the results of the generated region.
Preferably, the discriminator of this application may adopt a PatchGAN structure.
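A minimal PatchGAN-style sketch, producing the two-dimensional map of per-region logits described above (PyTorch assumed; the layer sizes are illustrative assumptions and omit the category-information embedding for brevity):

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Small convolutional network whose output is a 2-D map of logits,
    one per image region, each indicating how likely that region is real."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1))  # 2-D region logits

    def forward(self, x):
        return self.net(x)

d = PatchDiscriminator()
logits = d(torch.randn(1, 3, 256, 256))   # shape (1, 1, 63, 63): one logit per region
```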
It can be understood that, after training, the discriminator's ability to recognize real pictures and forged pictures increases; when the parameters of the generator are then optimized, the generator's ability to generate realistic fake pictures is correspondingly improved. Through continuous iterative optimization, the discriminator and the generator gradually reach a balance, thereby completing the training of the model.
It should be understood that the trained generator can directly generate the target image.
The annotation of the target image is generated according to the target semantic information and the information of the first semantic target image, realizing automatic annotation of the semantic target. This avoids the workload of manual annotation and effectively improves the efficiency of training dataset preparation.
In a possible implementation, the detection module in FIG. 5 may be coupled with the generation module, that is, the recognizer is coupled with the discriminator. In this case, the discriminator and the detector can share feature extraction, which greatly reduces the computational load of the model and effectively improves image generation efficiency.
According to the image generation method of the embodiments of this application, based on the preset to-be-generated semantic target information, a recognizer is introduced to automatically recognize the semantic target to be replaced in the original picture and preprocess it; the processed picture features are then obtained and, in combination with the to-be-generated semantic target information, the conditional distribution of the noise is generated and merged with the picture features; the semantic target is then automatically replaced with the expected generated semantic target; finally, pictures that conform to the picture texture information and satisfy diversity requirements are synthesized and automatically annotated, ensuring the diversity of the generated data.
FIG. 6 is a structural block diagram of an image generation apparatus provided by an embodiment of this application. The image generation apparatus 600 includes a detection unit 610, a generation unit 620, and a training unit 630.
The detection unit 610 is configured to determine a first semantic target image of a source image, where the first semantic target image is at least one of the at least one semantic target image included in the source image.
The generation unit 620 is configured to determine a first background image according to the first semantic target image; generate a prior distribution of noise according to the first background image and to-be-generated semantic target information; and generate the target image according to the prior distribution and the first background image.
Specifically, the generation unit 620 is configured to extract noise from the prior distribution, and generate the target image according to the noise and the first background image.
Specifically, the generation unit 620 is configured to perform smoothing processing on the first semantic target image, and determine the first background image according to the smoothed first semantic target image.
In a possible implementation, the generation unit 620 is configured to remove the smoothed first semantic target image from the source image to obtain the first background image.
Specifically, the generation unit 620 is further configured to complete the annotation of the semantic target of the target image according to the semantic target information.
The training unit 630 is configured to train the images generated by the generation unit so that they tend to be realistic.
Specifically, the target image is used as an input image, and an image discriminator to be trained is used to discriminate the authenticity of the target image; according to the output result of the image discriminator and the input image, the network parameter values of an image generator are adjusted, where the image generator is used to generate the target image; the target image generated by the image generator with adjusted network parameter values is then used as the input image, and the discrimination action of the image discriminator to be trained is repeated until the training process converges.
In a possible implementation, the image discriminator to be trained discriminates the images in the different regions included in the target image.
Optionally, the image discriminator to be trained divides the target image into regions.
Specifically, the image discriminator to be trained is further configured to perform smoothing processing on the target semantic information included in the target image, and perform authenticity discrimination on the target image including the smoothed target semantic information.
It should be understood that the smoothing processing is, for example, enlarging the bounding box of the target semantic image; the image discriminator to be trained performs authenticity discrimination on the image within the bounding box region.
It should be understood that the discriminators in the detection unit 610 and the generation unit 620 may share one image detector, where the image detector is used to perform feature extraction on the first semantic target image included in the source image, thereby improving image generation efficiency.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
It should be understood that the processor in the embodiments of this application may be a central processing unit (central processing unit, CPU), or may be another general-purpose processor, a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field programmable gate array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be understood that the memory in the embodiments of this application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories. The non-volatile memory may be a read-only memory (read-only memory, ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (random access memory, RAM), which is used as an external cache. By way of example but not limitation, many forms of random access memory (RAM) are available, for example, a static random access memory (static RAM, SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), a synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and a direct rambus random access memory (direct rambus RAM, DR RAM).
The foregoing embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented by software, the foregoing embodiments may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions according to the embodiments of this application are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to the computer, or a data storage device such as a server or a data center that includes one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate the following three cases: only A exists, both A and B exist, and only B exists, where A and B may be singular or plural. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects, but may also indicate an "and/or" relationship; reference may be made to the context for understanding.
In this application, "at least one" means one or more, and "multiple" means two or more. "At least one of the following items" or a similar expression means any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may indicate a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be singular or plural.
It should be understood that, in the various embodiments of this application, the magnitude of the sequence numbers of the foregoing processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation processes of the embodiments of this application.
Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components can be combined or May be integrated into another system, or some features may be ignored, or not implemented. In another point, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。If the functions described above are realized in the form of software function units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application is essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disc and other media that can store program codes. .
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any changes or substitutions that readily occur to those skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (25)

  1. An image generation method, comprising:
    determining a first semantic target image of a source image, the first semantic target image being at least one of at least one semantic target image contained in the source image; and
    determining a target image according to the first semantic target image and a first prior distribution, wherein the first prior distribution is a prior distribution of noise obtained according to a first background image and semantic target information to be generated, the target image comprises a plurality of images containing the semantic target information to be generated, the semantic target information to be generated comprises a label of a semantic target image to be generated and indication information of the semantic target image to be generated, and the first background image is a background image of the source image.
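For illustration only, the following Python sketch shows one plausible shape of the flow recited in claim 1: detect a semantic target, derive a background-conditioned noise prior, and sample several diverse target images from it. Every name here (detector, NoisePriorNet, generator) is a hypothetical placeholder introduced for this sketch, not a component disclosed in this application, and the reparameterised Gaussian prior is an assumption.

```python
import torch
import torch.nn as nn

class NoisePriorNet(nn.Module):
    """Maps the first background image plus the to-be-generated target info
    (per-pixel label map and location mask) to a Gaussian noise prior."""
    def __init__(self, num_classes, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 + 1 + num_classes, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)

    def forward(self, background, target_mask, label_map):
        h = self.encoder(torch.cat([background, target_mask, label_map], dim=1))
        return self.to_mu(h), self.to_logvar(h)

def generate_target_images(source, detector, prior_net, generator, n_samples=4):
    # Claim 1, step 1: determine the first semantic target image of the source
    # (detector is an assumed callable returning a mask and a label map).
    target_mask, label_map = detector(source)
    background = source * (1.0 - target_mask)   # first background image
    # Claim 1, step 2: first prior distribution from background + target info.
    mu, logvar = prior_net(background, target_mask, label_map)
    samples = []
    for _ in range(n_samples):
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # z ~ N(mu, sigma^2)
        samples.append(generator(z, background))              # assumed interface
    return samples  # a plurality of target images containing the new target
```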
  2. The method according to claim 1, wherein the determining a target image according to the first semantic target image and a first prior distribution comprises:
    determining the first background image according to the first semantic target image;
    generating the first prior distribution according to the first background image and the semantic target information to be generated; and
    generating the target image according to the first prior distribution and the first background image.
  3. The method according to claim 2, wherein the generating the target image according to the first prior distribution and the first background image comprises:
    generating noise of the semantic target image to be generated according to the first prior distribution; and
    generating the target image according to the noise of the semantic target image to be generated and the first background image.
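Claim 3 splits generation into synthesising noise for the target to be generated and combining it with the first background image. A minimal sketch of a generator with that split, assuming the object is rendered with an alpha matte and composited over the background (the compositing scheme is my assumption, not a detail recited in the claim):

```python
import torch
import torch.nn as nn

class ObjectGenerator(nn.Module):
    """Renders the to-be-generated semantic target from prior noise and
    composites it over the first background image (sketch only; not the
    network disclosed in this application)."""
    def __init__(self, latent_dim=128, size=64):
        super().__init__()
        self.render = nn.Sequential(
            nn.Linear(latent_dim, 64 * size * size),
            nn.Unflatten(1, (64, size, size)),
            nn.Conv2d(64, 4, 3, padding=1),   # 3 RGB channels + 1 alpha matte
        )

    def forward(self, z, background):
        out = self.render(z)
        rgb = torch.sigmoid(out[:, :3])       # target appearance, in [0, 1]
        alpha = torch.sigmoid(out[:, 3:4])    # where the target sits
        # Target image = f(noise-rendered target, first background image).
        return alpha * rgb + (1.0 - alpha) * background
```

Here z would be drawn from the first prior distribution of claim 2, so different samples yield diverse target images over the same background.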
  4. The method according to any one of claims 1 to 3, wherein determining the first background image according to the first semantic target image comprises:
    smoothing the first semantic target image; and
    determining the first background image according to the smoothed first semantic target image.
  5. The method according to claim 4, further comprising:
    removing the smoothed first semantic target image from the source image.
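Claims 4 and 5 derive the background by smoothing the detected target and removing it from the source. A sketch of one such reading, with a Gaussian blur standing in for the unspecified smoothing operation (both the blur and the mask-based removal are assumptions):

```python
import torch
import torchvision.transforms.functional as TF

def background_from_source(source, target_mask, kernel_size=21, sigma=7.0):
    """source: (B,3,H,W) in [0,1]; target_mask: (B,1,H,W) binary mask of the
    first semantic target image detected in the source."""
    # Claim 4: smooth the first semantic target image (Gaussian blur assumed).
    smoothed = TF.gaussian_blur(source, [kernel_size, kernel_size], [sigma, sigma])
    # Claim 5: remove the smoothed target from the source; the vacated region
    # is filled with blurred content so no hard object edges remain.
    return source * (1.0 - target_mask) + smoothed * target_mask
```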
  6. The method according to any one of claims 1 to 5, further comprising:
    using the target image as an input image, and identifying authenticity of the target image with an image discriminator to be trained;
    adjusting network parameter values of an image generator according to an output result of the image discriminator to be trained and the input image, wherein the image generator is configured to generate the target image; and
    using a target image generated by the image generator with the adjusted network parameter values as the input image, and repeating the discrimination performed by the image discriminator to be trained, until the training process converges.
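Claim 6 describes a standard adversarial training loop: the discriminator judges the authenticity of generated target images, the generator's parameters are updated from the discriminator's output, and the cycle repeats until convergence. A bare-bones sketch; the non-saturating BCE losses and per-step optimisers are my assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, g_opt, d_opt, real_images, background, z):
    # Discriminator step: judge authenticity of real vs. generated targets.
    fake = generator(z, background).detach()
    real_logits = discriminator(real_images)
    fake_logits = discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: adjust the generator's network parameters according to
    # the discriminator's output on freshly generated target images.
    gen_logits = discriminator(generator(z, background))
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    # The caller repeats train_step, feeding the updated generator's outputs
    # back in as input images, until training converges (claim 6).
    return d_loss.item(), g_loss.item()
```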
  7. The method according to claim 6, wherein the identifying authenticity of the target image with the image discriminator to be trained comprises:
    discriminating, by the image discriminator to be trained, images in different regions of the target image, comprising:
    weighting, by the image discriminator to be trained in combination with the semantic target information to be generated, the different regions when computing a loss function.
  8. The method according to claim 7, further comprising:
    dividing, by the image discriminator to be trained, the target image into regions.
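Claims 7 and 8 have the discriminator divide the target image into regions and weight those regions in the loss using the semantic target information to be generated. One patch-discriminator reading, in which regions indicated for the new target are up-weighted (the specific weighting scheme is an assumption):

```python
import torch
import torch.nn.functional as F

def region_weighted_d_loss(patch_logits, is_real, target_mask, target_weight=4.0):
    """patch_logits: (B,1,h,w) per-region scores from a patch discriminator
    (claim 8: the discriminator divides the target image into regions);
    target_mask: (B,1,H,W) indication of the semantic target to be generated."""
    labels = torch.ones_like(patch_logits) if is_real else torch.zeros_like(patch_logits)
    per_region = F.binary_cross_entropy_with_logits(patch_logits, labels, reduction="none")
    # Claim 7: weight the regions using the to-be-generated target information,
    # here by up-weighting regions covered by the new target's mask.
    weights = 1.0 + (target_weight - 1.0) * F.interpolate(
        target_mask, size=patch_logits.shape[-2:], mode="nearest")
    return (weights * per_region).sum() / weights.sum()
```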
  9. The method according to any one of claims 6 to 8, further comprising:
    smoothing, by the image discriminator to be trained, the semantic target image to be generated that is included in the target image; and
    performing, by the image discriminator to be trained, authenticity discrimination on the target image including the smoothed semantic target image to be generated.
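Claim 9 smooths the generated target region before the authenticity check, which plausibly keeps the discriminator from latching onto high-frequency synthesis artifacts. A sketch, reusing the Gaussian-blur assumption from the claim 4 example:

```python
import torch
import torchvision.transforms.functional as TF

def discriminate_smoothed(discriminator, target_image, gen_target_mask):
    """target_image: (B,3,H,W); gen_target_mask: (B,1,H,W) marking the
    to-be-generated semantic target within the target image."""
    # Blur only the generated target region before authenticity discrimination.
    blurred = TF.gaussian_blur(target_image, [11, 11], [3.0, 3.0])
    smoothed = target_image * (1.0 - gen_target_mask) + blurred * gen_target_mask
    return discriminator(smoothed)
```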
  10. The method according to any one of claims 1 to 9, further comprising:
    completing annotation of a semantic target of the target image according to the semantic target information to be generated and information of the first semantic target image.
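Claim 10 observes that annotations for the generated image come essentially for free: the label and placement of the new target were inputs to generation, and the source target's annotation is already known. A sketch of assembling such an annotation (the dictionary layout is an assumed convention, not a format defined in this application):

```python
def annotate_generated(gen_label, gen_box, source_target_info):
    """gen_label / gen_box: label and (x1, y1, x2, y2) placement used as the
    to-be-generated semantic target information; source_target_info: the
    annotation of the first semantic target image detected in the source."""
    # No manual labelling pass: the annotation is composed directly from the
    # inputs that drove generation plus the known source-target information.
    return {
        "objects": [
            {"label": gen_label, "bbox": list(gen_box), "origin": "generated"},
            {"label": source_target_info["label"],
             "bbox": source_target_info["bbox"], "origin": "source"},
        ]
    }
```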
  11. The method according to any one of claims 1 to 10, wherein:
    the image discriminator to be trained comprises an image detector, and the image detector is configured to perform feature extraction on the first semantic target image included in the source image.
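Claim 11 folds an image detector into the discriminator to extract features of the source's semantic target. Sketched below with an off-the-shelf torchvision detection backbone standing in for the unspecified detector; the model choice, the direct backbone call, and the ImageNet normalisation are all assumptions:

```python
import torch
import torchvision
import torchvision.transforms.functional as TF

def target_features(source, box):
    """Extract features of the first semantic target image with a pre-trained
    detection backbone; source: (B,3,H,W) in [0,1], box: (x1, y1, x2, y2)."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()
    x1, y1, x2, y2 = box
    crop = TF.normalize(source[:, :, y1:y2, x1:x2],
                        mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    with torch.no_grad():
        # The FPN backbone returns a dict of multi-scale feature maps for the
        # cropped target region.
        return model.backbone(crop)
```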
  12. An image generation apparatus, comprising:
    a detection unit, configured to determine a first semantic target image of a source image, the first semantic target image being at least one of at least one semantic target image contained in the source image; and
    a generation unit, configured to determine a target image according to the first semantic target image and a first prior distribution, wherein the first prior distribution is a prior distribution of noise obtained according to a first background image and semantic target information to be generated, the target image comprises a plurality of images containing the semantic target information to be generated, the semantic target information to be generated comprises a label of a semantic target image to be generated and indication information of the semantic target image to be generated, and the first background image is a background image of the source image.
  13. The image generation apparatus according to claim 12, wherein the generation unit is specifically configured to:
    determine the first background image according to the first semantic target image;
    generate the first prior distribution according to the first background image and the semantic target information to be generated; and
    generate the target image according to the first prior distribution and the first background image.
  14. The image generation apparatus according to claim 13, wherein the generation unit is further configured to:
    generate noise of the semantic target image to be generated according to the first prior distribution; and
    generate the target image according to the noise of the semantic target image to be generated and the first background image.
  15. The image generation apparatus according to any one of claims 12 to 14, wherein the generation unit is further configured to:
    smooth the first semantic target image; and
    determine the first background image according to the smoothed first semantic target image.
  16. The image generation apparatus according to claim 15, wherein the generation unit is further configured to:
    remove the smoothed first semantic target image from the source image.
  17. The image generation apparatus according to any one of claims 12 to 16, further comprising a training unit configured to:
    use the target image as an input image, and identify authenticity of the target image with an image discriminator to be trained;
    adjust network parameter values of an image generator according to an output result of the image discriminator to be trained and the input image, wherein the image generator is configured to generate the target image; and
    use a target image generated by the image generator with the adjusted network parameter values as the input image, and repeat the discrimination performed by the image discriminator to be trained, until the training process converges.
  18. The image generation apparatus according to claim 17, wherein the image discriminator to be trained discriminates images in different regions of the target image, comprising:
    weighting, by the image discriminator to be trained in combination with the semantic target information to be generated, the different regions when computing a loss function.
  19. The image generation apparatus according to claim 18, wherein the image discriminator to be trained divides the target image into regions.
  20. The image generation apparatus according to any one of claims 17 to 19, wherein the image discriminator to be trained smooths the semantic target image to be generated that is included in the target image; and
    the image discriminator to be trained performs authenticity discrimination on the target image including the smoothed semantic target image to be generated.
  21. The image generation apparatus according to any one of claims 12 to 20, wherein the generation unit is further configured to:
    complete annotation of a semantic target of the target image according to the semantic target information to be generated and information of the first semantic target image.
  22. The image generation apparatus according to any one of claims 12 to 21, wherein the image discriminator to be trained comprises an image detector, and the image detector is configured to perform feature extraction on the first semantic target image included in the source image.
  23. An electronic device, comprising:
    a processor and a memory, wherein the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to execute the method according to any one of claims 1 to 11.
  24. A computer-readable storage medium, wherein the computer-readable storage medium stores program code for execution by a device, the program code comprising instructions for executing the method according to any one of claims 1 to 11.
  25. A chip, comprising a processor and a data interface, wherein the processor reads, through the data interface, instructions stored in a memory, to execute the method according to any one of claims 1 to 11.
PCT/CN2022/115028 2021-08-30 2022-08-26 Image generation method and apparatus WO2023030182A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111006421.1 2021-08-30
CN202111006421.1A CN114037640A (en) 2021-08-30 2021-08-30 Image generation method and device

Publications (1)

Publication Number Publication Date
WO2023030182A1 (en)

Family

ID=80140000

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/115028 WO2023030182A1 (en) 2021-08-30 2022-08-26 Image generation method and apparatus

Country Status (2)

Country Link
CN (1) CN114037640A (en)
WO (1) WO2023030182A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114037640A (en) * 2021-08-30 2022-02-11 华为技术有限公司 Image generation method and device
CN116563673B (en) * 2023-07-10 2023-12-12 浙江华诺康科技有限公司 Smoke training data generation method and device and computer equipment
CN117475262B (en) * 2023-12-26 2024-03-19 苏州镁伽科技有限公司 Image generation method and device, storage medium and electronic equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150170006A1 (en) * 2013-12-16 2015-06-18 Adobe Systems Incorporated Semantic object proposal generation and validation
CN107610126A (en) * 2017-08-31 2018-01-19 浙江工业大学 A kind of interactive image segmentation method based on local prior distribution
CN112200889A (en) * 2020-10-30 2021-01-08 上海商汤智能科技有限公司 Sample image generation method, sample image processing method, intelligent driving control method and device
CN114037640A (en) * 2021-08-30 2022-02-11 华为技术有限公司 Image generation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUAN QIU; HUA MIN; HU HAI-GEN: "A modified grabcut approach for image segmentation based on local prior distribution", 2017 INTERNATIONAL CONFERENCE ON WAVELET ANALYSIS AND PATTERN RECOGNITION (ICWAPR), IEEE, 9 July 2017 (2017-07-09), pages 122 - 126, XP033234503, ISBN: 978-1-5386-0410-6, DOI: 10.1109/ICWAPR.2017.8076675 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091873A (en) * 2023-04-10 2023-05-09 宁德时代新能源科技股份有限公司 Image generation method, device, electronic equipment and storage medium
CN116091873B (en) * 2023-04-10 2023-11-28 宁德时代新能源科技股份有限公司 Image generation method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114037640A (en) 2022-02-11

Similar Documents

Publication Publication Date Title
WO2023030182A1 (en) Image generation method and apparatus
CN111612008B (en) Image segmentation method based on convolution network
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
Li et al. A survey on semantic segmentation
EP4099220A1 (en) Processing apparatus, method and storage medium
CN111696110B (en) Scene segmentation method and system
CN113870160B (en) Point cloud data processing method based on transformer neural network
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
WO2023125628A1 (en) Neural network model optimization method and apparatus, and computing device
Hu et al. A video streaming vehicle detection algorithm based on YOLOv4
Jemilda et al. Moving object detection and tracking using genetic algorithm enabled extreme learning machine
Ahmad et al. 3D capsule networks for object classification from 3D model data
CN113724286A (en) Method and device for detecting saliency target and computer-readable storage medium
Yang et al. C-RPNs: Promoting object detection in real world via a cascade structure of Region Proposal Networks
CN110287798B (en) Vector network pedestrian detection method based on feature modularization and context fusion
Zhang et al. A small target detection method based on deep learning with considerate feature and effectively expanded sample size
Mahaur et al. An improved lightweight small object detection framework applied to real-time autonomous driving
Zhou et al. A novel object detection method in city aerial image based on deformable convolutional networks
Wu et al. Vehicle detection based on adaptive multi-modal feature fusion and cross-modal vehicle index using RGB-T images
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
Chacon-Murguia et al. Moving object detection in video sequences based on a two-frame temporal information CNN

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22863320

Country of ref document: EP

Kind code of ref document: A1