WO2023030182A1 - Image generation method and apparatus - Google Patents

Image generation method and apparatus

Info

Publication number
WO2023030182A1
Authority
WO
WIPO (PCT)
Prior art keywords: image, semantic, target image, generated, target
Application number
PCT/CN2022/115028
Other languages: French (fr), Chinese (zh)
Inventors: 蒋敏 (Jiang Min), 蒋子平 (Jiang Ziping), 王云鹏 (Wang Yunpeng)
Original Assignees: 华为技术有限公司 (Huawei Technologies Co., Ltd.), 兰卡斯特大学 (Lancaster University)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.) and 兰卡斯特大学 (Lancaster University)
Publication of WO2023030182A1

Classifications

    • G06T5/70
    • G06N3/045 Combinations of networks (neural network architectures, e.g. interconnection topology)
    • G06N3/08 Learning methods (neural networks)
    • G06T11/001 2D [Two Dimensional] image generation: Texturing; Colouring; Generation of texture or colour
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Definitions

  • The present application relates to the field of computer technology, and in particular to an image generation method and apparatus.
  • Image generation technology is increasingly used in computer vision and finds wide application in industrial automation, biomedicine, the automotive field and elsewhere, for example in autonomous driving, intelligent detection and video surveillance.
  • image data can be processed effectively through the powerful computing capability of the deep neural network model.
  • A discriminative model can predict unknown properties of a given known input, that is, identify the category of a given picture and detect all semantic objects present in the picture.
  • A generative model can model the distribution of the data, describing the observable data set so that new data with the same distribution can be generated. Combining current discriminative and generative models makes it possible to convert between images of different categories.
  • The present application provides an image generation method and device, which can automatically generate a target image set with stable quality and diversity from the multi-semantic-target images contained in a small number of sample images.
  • In a first aspect, an image generation method is provided, comprising: determining a first semantic target image of a source image, where the first semantic target image is at least one of the semantic target images contained in the source image; and determining a target image according to the first semantic target image and a first prior distribution.
  • The first prior distribution is a prior distribution of noise obtained from a first background image and the semantic target information to be generated.
  • The target image is an image that includes the semantic target information to be generated; the semantic target information to be generated includes a label of the semantic target image to be generated and indication information for that image, and the first background image is the background image of the source image.
  • Because the prior distribution is generated from the background image of the source image and the information of the semantic target image to be generated, random sampling from this prior distribution can generate the semantic target image to be generated within the first background image, so that the target images are diverse in distribution and consistent with the texture of the whole image.
  • In one implementation, determining the target image according to the first semantic target image and the first prior distribution includes: determining the first background image according to the first semantic target image; generating the first prior distribution according to the first background image and the semantic target information to be generated; and generating the target image according to the first prior distribution and the first background image.
  • The generation module determines the region of the first background image according to the first semantic target image and then determines the features of the first background image; the first prior distribution is generated from those features and the semantic target to be generated. This ensures that the generated image conforms to the whole-image texture and avoids image distortion, and basing the prior distribution of the semantic target to be generated on the features of the first background image makes subsequent sampling diverse, thereby ensuring the diversity of the target images.
  • In one implementation, generating the target image according to the first prior distribution and the first background image includes: generating the noise of the semantic target image to be generated according to the first prior distribution, and generating the target image according to that noise and the first background image.
  • Noise is sampled from the first prior distribution, and the target image is then formed by combining it with the first background image.
  • the noise sampling is random, thereby ensuring the diversity of the target image.
  • In one implementation, determining the first background image according to the first semantic target image includes: smoothing the first semantic target image, and determining the first background image according to the smoothed first semantic target image.
  • The semantic target image to be replaced is smoothed to ensure that the area of the semantic target image to be generated can completely cover the replaced one, so that the generated target image conforms to the whole-image texture and semantic target distortion is avoided.
  • the method further includes: removing the smoothed first semantic target image from the source image.
  • deleting the first semantic target image from the source image can enable the semantic target image to be generated to be better integrated with the first background image, which is beneficial to the consistency of the texture of the whole image.
  • In one implementation, the method further includes: using the target image as an input image and using an image discriminator to be trained to identify its authenticity; adjusting, according to the output of the image discriminator and the input image, the network parameter values of the image generator that is used to generate the target image; and using the target image generated by the adjusted image generator as the input image, repeating the identification step until the training process converges.
  • The generator and the discriminator are trained in this way so that the images produced by the generator tend toward the real distribution and the generated target images are more realistic and natural.
  • In one implementation, using the image discriminator to be trained to identify the authenticity of the target image includes: the image discriminator to be trained discriminating the images in the different regions of the target image, and, when calculating the loss function, weighting those regions according to the semantic target information to be generated.
  • The discriminator performs region-wise recognition on the target image, using a convolutional network to obtain a two-dimensional region discrimination result that represents the probability that the image in each region is real.
  • The present application weights the region discrimination results to obtain the loss function, so that the model pays more attention to the generated region, i.e. focuses discrimination on the region of the semantic target image to be generated. This improves the discriminative ability of the discriminator, so that the target images produced by the generator are more realistic and natural.
  • In one implementation, the method further includes: the image discriminator to be trained dividing the target image into regions.
  • In one implementation, the method further includes: the image discriminator to be trained smoothing the semantic target image to be generated that is contained in the target image, and then judging the authenticity of the target image containing the smoothed semantic target image.
  • The discriminator smooths the replaced semantic target image to ensure that the area of the semantic target image can completely cover the replaced one, so that the generated target image conforms to the whole-image texture and semantic target distortion is avoided.
  • the method further includes: marking the semantic target of the target image according to the semantic target information to be generated and the information of the first semantic target image.
  • the generated target image can be automatically marked based on the semantic target label in the semantic target information to be generated, which avoids the workload of manual labeling and effectively improves the efficiency of training data set preparation.
  • In one implementation, the image discriminator to be trained includes an image detector, and the image detector is used to perform feature extraction on the first semantic target image contained in the source image.
  • The detector is coupled with the discriminator.
  • Feature extraction can thus be shared between the discriminator and the detector, which greatly reduces the computational load of the model and effectively improves image generation efficiency.
  • In a second aspect, an image generation device is provided, comprising units for executing the method of the first aspect or any of its implementations.
  • In a third aspect, an image generation device is provided, including a processor and a memory; the memory is used to store a computer program, and the processor is used to call and run the computer program from the memory, so that the device executes the image generation method of the first aspect or any of its possible implementations.
  • There may be one or more processors and one or more memories.
  • The memory may be integrated with the processor, or set separately from the processor.
  • A computer-readable storage medium is also provided, which stores program code for execution by a device, the program code including instructions for executing the method of the first or second aspect.
  • A computer program product including instructions is also provided; when the computer program product runs on a computer, the computer executes the method in any one of the implementations of the foregoing aspects.
  • A chip is also provided, including a processor and a data interface; the processor reads, through the data interface, instructions stored in a memory, and executes the method in any one of the foregoing aspects.
  • Optionally, the chip may further include a memory storing instructions; the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor executes the method in any one of the implementations of the foregoing aspects.
  • the aforementioned chip may specifically be a field-programmable gate array (field-programmable gate array, FPGA) or an application-specific integrated circuit (application-specific integrated circuit, ASIC).
  • FIG. 1 shows a schematic structural diagram of a system architecture provided by an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a convolutional neural network provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a product realization form provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of an image generation method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a discriminator structure and a detector structure for shared feature extraction provided by an embodiment of the present application
  • Fig. 6 is a schematic block diagram of an image generating device provided by an embodiment of the present application.
  • A neural network can be composed of neural units. A neural unit can be an operation unit that takes inputs $x_s$ and an intercept of 1, and whose output can be:

    $$h_{W,b}(x) = f\Big(\sum_{s} W_s x_s + b\Big)$$

  • where $W_s$ is the weight of $x_s$ and $b$ is the bias of the neural unit.
  • $f$ is the activation function of the neural unit, used to apply a nonlinear transformation to the features in the neural network, converting the input signal of the neural unit into an output signal.
  • The output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting multiple above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
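  • As an illustration only (not part of the patent text), a single neural unit with a sigmoid activation can be sketched in Python as follows; the helper name `neural_unit` and the sample values are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b):
    # Output of one neural unit: f(sum_s W_s * x_s + b),
    # with f chosen here as the sigmoid activation.
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x_s
W = np.array([0.1, 0.4, -0.2])   # weights W_s
print(neural_unit(x, W, b=0.3))  # scalar output of the unit
```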
  • A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • According to the positions of the different layers, the layers of a DNN can be divided into three categories: input layer, hidden layers and output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in the middle are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although a DNN looks complicated, the work of each layer is actually not complicated. Each layer simply computes the expression

    $$\vec{y} = \alpha(W\vec{x} + \vec{b})$$

    where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset (bias) vector, $W$ is the weight matrix (also called the coefficients), and $\alpha(\cdot)$ is the activation function.
  • Each layer just performs this simple operation on its input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the number of coefficient matrices $W$ and offset vectors $\vec{b}$ is correspondingly large.
  • These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example: in a three-layer DNN, the linear coefficient from the fourth neuron of the second layer to the second neuron of the third layer is defined as $W^3_{24}$, where the superscript 3 is the layer number of the coefficient $W$ and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
  • In general, the coefficient from the $k$-th neuron of layer $L-1$ to the $j$-th neuron of layer $L$ is defined as $W^L_{jk}$.
  • the input layer has no W parameter.
  • More hidden layers give the network a greater ability to describe complex real-world situations. Theoretically, a model with more parameters has higher complexity and greater "capacity", which means it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
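  • A minimal sketch of the layer-by-layer computation $\vec{y} = \alpha(W\vec{x} + \vec{b})$ described above (illustrative only; the layer sizes and tanh activation are arbitrary choices, not from the patent):

```python
import numpy as np

def layer(x, W, b, act=np.tanh):
    # One DNN layer: y = act(W @ x + b).
    return act(W @ x + b)

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]  # input layer, two hidden layers, output layer
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal(sizes[0])
for W, b in params:   # forward pass; W[j, k] plays the role of W^L_jk
    x = layer(x, W, b)
print(x)
```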
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a subsampling layer, which can be regarded as a filter.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • In the convolutional layer, a neuron may be connected to only some of the neurons of the adjacent layers.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as a way to extract image information that is independent of location.
  • the convolution kernel can be formalized as a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
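  • The weight sharing described above can be illustrated with a minimal NumPy convolution (a sketch only; the `conv2d` helper and the edge kernel are illustrative, not from the patent):

```python
import numpy as np

def conv2d(image, kernel):
    # The same kernel (the shared weights) is applied at every image
    # location, so the same feature is extracted regardless of position.
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_kernel = np.array([[1., 0., -1.],
                        [2., 0., -2.],
                        [1., 0., -1.]])  # e.g. extracts vertical edge information
feature_map = conv2d(np.random.rand(8, 8), edge_kernel)
print(feature_map.shape)  # (6, 6)
```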
  • Image features can be understood as numerical features transformed from raw data through feature extraction operations, which are convenient for algorithm understanding and processing.
  • the image features specifically refer to the image features extracted using the backbone network model.
  • the discriminative class model can predict its unknown attributes for a given known input, that is, identify the category to which a given picture belongs, and detect all semantic objects existing in the picture.
  • the discriminative model is, for example, an object classification model, an object detection model.
  • Among discriminative models, the first thing to realize is the classification of a single semantic target image, that is, the target classification model.
  • the model uses a convolutional layer to extract the features of the local area of the image, and inputs the obtained features to the fully connected layer for classification.
  • the target detection model uses a similar backbone network to extract features from images containing multiple semantic targets, and then locates and recognizes the semantic targets in the original image according to the feature map output by the backbone network.
  • the target detection model can be divided into a single-step model and a two-step model.
  • The single-step methods, represented by the YOLO neural network (you only look once, YOLO) and the SSD detector (single shot multibox detector, SSD), take the overall feature map as input and use regression to compute, for each grid point, the probability that it contains a semantic target.
  • The two-step methods, represented by Faster Region-Based Convolutional Neural Networks (Faster RCNN) and Mask Region-Based Convolutional Neural Networks (Mask RCNN), introduce a region proposal network: first, the feature information is used to identify whether preset candidate boxes contain a semantic target, and then the regions with a high probability of containing one are input to the classifier for target detection.
  • Take the generative adversarial model as an example of a generative model.
  • The model consists of a generator and a discriminator, where the generator aims to learn the mapping from a noise distribution to the target data distribution, and the discriminator is used to judge whether the data it receives is real.
  • Through adversarial training, the discriminator's ability to distinguish real from fake improves, the pictures generated by the generator come closer to the real distribution, and the two finally reach a balance.
  • The training of a generative adversarial network can be understood as the following optimization problem:

    $$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
  • Generative adversarial models can perform well in generating small single-category images, but they have difficulty handling high-definition images, and a single such model cannot generate images of different categories.
  • the conditional generative adversarial model introduces control conditions on the basis of the original model to achieve a controllable effect on the generation of semantic target categories, so that one model can be used to generate pictures of different categories.
  • the model represents the category of the semantic target to be generated as a one-dimensional vector as a condition vector.
  • In the generator, an embedding layer is used to encode the noise and the condition before they are sent into the generation network; in the discriminator, the features obtained by the backbone network are encoded together with the condition to judge whether the input picture is a real picture of the target category.
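  • The adversarial training described above, with the category condition added in the cGAN manner, might look as follows in PyTorch (a toy sketch under assumed shapes and hyperparameters; the class names `G` and `D` and all sizes are hypothetical, not the patent's model):

```python
import torch
import torch.nn as nn

Z, NUM_CLASSES, IMG = 64, 10, 28 * 28  # assumed noise/label/image sizes

class G(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(NUM_CLASSES, Z)   # condition embedding
        self.net = nn.Sequential(nn.Linear(2 * Z, 256), nn.ReLU(),
                                 nn.Linear(256, IMG), nn.Tanh())
    def forward(self, z, y):
        return self.net(torch.cat([z, self.emb(y)], dim=1))

class D(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(NUM_CLASSES, IMG)
        self.net = nn.Sequential(nn.Linear(2 * IMG, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1), nn.Sigmoid())
    def forward(self, x, y):
        return self.net(torch.cat([x, self.emb(y)], dim=1))

g, d = G(), D()
opt_g = torch.optim.Adam(g.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(d.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real, y):
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)
    # Discriminator: real images labeled 1, generated ones labeled 0.
    fake = g(torch.randn(b, Z), y).detach()
    loss_d = bce(d(real, y), ones) + bce(d(fake, y), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: try to make the discriminator output 1 for fakes.
    fake = g(torch.randn(b, Z), y)
    loss_g = bce(d(fake, y), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```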
  • the system architecture 100 includes a data set module 110 and a model training module 120 .
  • the data set module 110 is one of the foundations of model training, and a high-quality large-scale data set is the key to training a high-quality model.
  • The data acquisition device 111 is used to collect raw data; the data preprocessing device 112 can be used to screen, filter and label the raw data, where the labeling may be manual or automatic, which is not limited in this embodiment of the present application.
  • the data generation device 113 is used to generate new data according to the marked data. Among them, data generation is an important means to solve the problem of insufficient dataset size.
  • a data storage library 114 is also included for storing automatically generated data sets. This data set can be used in the training module for model training.
  • the model training module 120 includes a training device 121 , and the training device 121 can train the target model 122 based on the training data maintained in the database 130 .
  • How the training device 121 obtains the target model 122 based on the training data is described below.
  • The training device 121 processes the input data and compares the output data with the original input data until the difference between the output data and the original input data is smaller than a certain threshold, thus completing the training of the target model 122.
  • the above-mentioned target model 122 can be used to implement the image generation method of the embodiment of the present application, that is, the picture set automatically generated by the data generation device 113 is input into the target model 122, and the authenticity of the pictures included in the picture set can be judged.
  • the target model 122 in the embodiment of the present application may specifically be a cGAN network.
  • the training data maintained in the database 114 may not all be collected by the data collection device 111 , but may also be received from other devices.
  • The training device 121 does not necessarily perform target model training based entirely on the training data maintained in the database 114; it may also obtain training data from the cloud or elsewhere for model training, which is not limited in this embodiment of the present application.
  • The target model 122 trained by the training device 121 can be applied to different systems or devices, for example an execution device, which may be a terminal such as a mobile phone, a tablet computer, a notebook computer, an augmented reality (AR) or virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server, the cloud, or the like.
  • The above-mentioned training device 121 can generate corresponding target models 122 based on different training data for different goals or tasks, and the corresponding target models 122 can be used to achieve those goals or complete those tasks, thus providing the user with the desired result.
  • FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • CNN is a very common neural network. A convolutional neural network is a deep neural network with a convolutional structure and a deep learning architecture.
  • A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • As a deep learning architecture, CNN is a feed-forward artificial neural network in which the individual neurons can respond to the images input into it.
  • A convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230.
  • the convolutional neural network in Figure 2 can be applied to the image classification model structure.
  • the internal layer structure of the CNN 200 in FIG. 2 will be described in detail below.
  • The convolutional layer/pooling layer 220 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 a pooling layer, 223 a convolutional layer, 224 a pooling layer, 225 a convolutional layer and 226 a pooling layer; in another implementation, 221 and 222 are convolutional layers, 223 a pooling layer, 224 and 225 convolutional layers, and 226 a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • The convolutional layer 221 may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • A convolution operator can essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is typically moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to extract a specific feature from the image.
  • The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends across the entire depth of the input image.
  • Convolution with a single weight matrix therefore produces a convolutional output with a single depth dimension, but in most cases, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), i.e. multiple matrices of the same shape, are applied.
  • The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where this dimension can be understood as being determined by the "multiple" mentioned above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to filter unwanted noise in the image.
  • Because the multiple weight matrices have the same size (rows × columns), the feature maps extracted by them also have the same size; the extracted feature maps of the same size are then combined to form the output of the convolution operation.
  • In practical applications, the weight values in these weight matrices need to be obtained through extensive training; each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.
  • The initial convolutional layers (such as 221) often extract more general features, which may also be called low-level features; as the depth of the convolutional neural network 200 increases, the features extracted by the later convolutional layers (such as 226) become more and more complex, for example high-level semantic features, and features with higher semantics are more suitable for the problem to be solved.
  • A pooling layer may follow a convolutional layer: it may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the sole purpose of pooling layers is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling an input image to obtain an image of a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • The maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling. In addition, just as the size of the weight matrix in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size.
  • the size of the image output after being processed by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
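  • For illustration, non-overlapping maximum pooling can be written in a few lines of NumPy (a sketch; average pooling would replace `max` with `mean`):

```python
import numpy as np

def max_pool2d(img, k=2):
    # Each output pixel is the maximum of the corresponding k x k
    # sub-region of the input, so the spatial size shrinks by k.
    H, W = img.shape
    img = img[:H - H % k, :W - W % k]          # crop to a multiple of k
    return img.reshape(H // k, k, W // k, k).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))  # 2x2 result; use .mean(axis=(1, 3)) for average pooling
```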
  • After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 uses the fully connected layer 230 to generate one output or a group of outputs of the required number of classes. The fully connected layer 230 may therefore include multiple hidden layers (231, 232 to 23n as shown in FIG. 2) and an output layer 240; the parameters contained in the hidden layers may be pre-trained on training data related to the specific task type, where the task type may include, for example, image recognition, image classification and image super-resolution reconstruction.
  • the output layer 240 has a loss function similar to the classification cross entropy, and is specifically used to calculate the prediction error.
  • After the forward propagation of the whole convolutional neural network 200 is completed, backpropagation (propagation in the direction from 240 to 210 in FIG. 2) starts to update the weights and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, i.e. the error between the result output by the network through the output layer and the ideal result.
  • the convolutional neural network shown in FIG. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
  • FIG. 3 is a schematic diagram of a product realization form provided by an embodiment of the present application.
  • a product implementation form of the embodiment of the present application may include program codes that are included in the data set system and deployed on server hardware.
  • the program code of the present application may exist in the data generation module of the dataset system, such as the data generation device 113 in the system architecture 100 . This part of the code runs on the host storage (memory or disk) and is used to implement the innovative automatic data generation method.
  • FIG. 3 shows a product realization form of the present application. The product comprises a server 310, software 320 and hardware 330, where the hardware 330 comprises host memory or disk and is used to run the program code of the present application;
  • The software 320 comprises an image recognition device 321 and an image generation device 322. The image recognition device 321 is used to identify the authenticity of an input image; the image generation device 322 is used to generate diverse images of the target semantic information required by the user and to store the generated pictures to the host memory or disk, from which they are input to the image recognition device 321 for image recognition.
  • The image generation method and device provided in the embodiments of the present application can also be applied to other similar image editing and replacement tasks, for example replacing vehicles or other target objects under different road conditions in autonomous driving, replacing hairstyles, hair colors and headgear in video surveillance scenes, and replacing different product shapes, colors and layouts in industrial automation quality monitoring scenarios, and so on.
  • the image generation method of the embodiment of the present application can be applied in the scene of picture editing and replacement, and the embodiment of the present application takes the automatic driving scene as an example to introduce the image generation method of the embodiment of the present application in detail.
  • the execution subject of the method for generating a sample image provided in the embodiment of the present disclosure is generally a computer device with certain computing capabilities.
  • the computer equipment includes, for example: terminal equipment or server or other processing equipment, and the terminal equipment can be user equipment (User Equipment, UE), mobile equipment, user terminal, terminal, cellular phone, cordless phone, personal digital assistant (Personal Digital Assistant, PDA), handheld devices, computing devices, vehicle-mounted devices, wearable devices, etc.
  • the method for generating a sample image may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • FIG. 4 shows a schematic flowchart of an image generation method 400 provided by an embodiment of the present application.
  • the method can be applied in a dataset system, and the dataset system includes a detection module and a generation module.
  • the generation module includes a discriminator and a generator
  • the generator includes an encoding structure and a decoding structure.
  • The detection model and the generation model can be trained based on a generative adversarial network.
  • the method can be executed by the above execution device.
  • the method 400 may be processed by the CPU, or jointly processed by the CPU and the GPU, or other processors suitable for neural network calculation may be used instead of the GPU, which is not limited in this application.
  • the detection model first uses the labeled source image to train the picture recognizer.
  • A labeled source image is a source image in which the semantic targets included in the picture have been pre-labeled; the labeling may be manual or automatic, without limitation.
  • the purpose of training the image recognizer is to enable the image recognizer to recognize the semantic target corresponding to the annotation through the training.
  • the training method may adopt the training method in the prior art, or any method that can achieve the purpose of training the picture recognizer, which is not limited in this embodiment of the present application.
  • the trained detection model performs semantic target recognition on the source image, and then the generation model edits the identified semantic target according to predetermined conditions, realizing end-to-end automatic image generation and automatic labeling.
  • the source image is an example of an input original picture
  • the source image includes a plurality of input pictures
  • the first semantic target image is an example of a semantic target
  • The first background image is the background image obtained by processing the source image according to the semantic target information to be generated, which is defined according to the user's requirements.
  • a target image is an example of a picture generated by the generator.
  • The source image may also be called an original image, an input image, an original picture or other similar terms, and the target image may also be called a generated image, a generated picture or other similar terms, which is not limited in this embodiment of the present application.
  • The method 400 includes steps S410 to S440.
  • S410: the detection model determines a first semantic target image of the source image.
  • S420: the generation model determines a first background image according to the first semantic target image.
  • S430: the generation model generates a first prior distribution according to the first background image and the semantic target information to be generated.
  • S440: the generation model generates the target image according to the first prior distribution and the first background image.
  • Steps S410 to S440 will be described in detail below.
  • S410: the detection model determines a first semantic target image of the source image.
  • the source image is used as the input image of the dataset system, and the detection model of the dataset system includes a picture recognizer, which can recognize the first semantic target image of the source image.
  • The image generation method can be used in an autonomous driving scenario, so the source image can be an image collected by an autonomous vehicle, and the first semantic target image can be a vehicle image in the source image, for example a bus, a car or a bicycle.
  • The above-mentioned recognizer has been trained, so it can recognize the first semantic target image according to the annotation. The source image is the original input image and includes one or more images; each image includes one or more semantic targets, at least one of which is the first semantic target image.
  • The detection model labels the identified semantic target image, which specifically includes labeling the bounding box of the semantic target and the category of the semantic target; it records the position of the detected bounding box and enlarges it appropriately so that it includes part of the surrounding boundary region, so that the generated semantic target can fuse with the surrounding environment.
  • the source image is an image of an automatic driving scene
  • the source image contains multiple semantic objects, such as buses, roads, sky, plants, pedestrians, etc.
  • The first semantic target image can be a vehicle, for example a bus.
  • the detection model records the position of the first semantic target image and enlarges the bounding box, and at the same time identifies the category of the semantic target as a vehicle.
  • The bounding box of the first semantic target image can have any shape. "Any shape" can be understood as appropriately enlarging the detected area containing the first semantic target image to acquire part of the surrounding environment image, so that the region includes the complete first semantic target image and part of its surroundings; the shape may be, for example, a rectangle, circle or trapezoid, which is not limited in this embodiment of the present application.
  • S420: the generation model determines a first background image according to the first semantic target image.
  • The generation model takes the first semantic target image detected by the detection model, determines its position and annotation information, and edits the source image according to that annotation information to determine the first background image; feature extraction is then performed on the first background image to obtain the features of the first background image.
  • Smoothing is performed on the first semantic target image to enlarge its area, so that the semantic target to be generated can completely cover the area of the first semantic target image and the texture of the whole image remains consistent.
  • Alternatively, the area of the first semantic target image may be directly covered by the semantic target image to be generated.
  • the smoothed first semantic target image is removed from the source image to obtain a first background image.
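  • A minimal sketch of this smoothing-and-removal step (assuming binary dilation as the smoothing operator, which the text does not mandate; `remove_target` is a hypothetical helper):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def remove_target(source_img, target_mask, grow=7):
    # Enlarge (smooth) the detected target mask so the region to be
    # regenerated fully covers the old target, then blank it out to
    # obtain the first background image.
    smoothed = binary_dilation(target_mask, iterations=grow)
    background = source_img.copy()
    background[smoothed] = 0
    return background, smoothed

img = np.random.rand(64, 64, 3)
mask = np.zeros((64, 64), dtype=bool); mask[20:40, 25:45] = True
bg, m = remove_target(img, mask)
```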
  • The semantic target information to be generated is given semantic target information; for example, it can be replacing a bus with a truck, or removing a bus. Similarly, any requirement defined according to the user's needs can serve as the semantic target information.
  • the semantic target information to be generated may also be system-defined, which is not limited in this embodiment of the present application.
  • The semantic target information to be generated may include the label of the semantic target image to be generated and the indication information of the semantic target image to be generated. The semantic target here can be understood as the semantic target to be generated, which has a volume and a semantic category similar to those of the semantic target in the source image.
  • For example, if the semantic target label included in the semantic target information is "vehicle" and the semantic target volume is described as large, the label of the semantic target to be generated can be truck, bus, fire engine and so on.
  • If the semantic target category included in the semantic target information is "vehicle" and the semantic target volume is described as small, the semantic target to be generated may be a motorcycle, a tricycle and the like.
  • Processing the source image according to the semantic target information to be generated and the annotation information of the first semantic target image includes smoothing the first semantic target image. For example, if the indication information in the semantic target information to be generated is to replace a bus with a truck, the generation model determines the position and bounding box of the first semantic target in the source image according to its annotation information, and deletes or overwrites the image within the bounding box, thus obtaining the first background image. As another example, if the indication information is to remove the bus, the generation model determines the position and bounding box of the first semantic target, deletes the image within the bounding box, and fills it with surrounding environment color blocks; as an illustration, the bounding box may be filled with road-surface color blocks, so as to obtain the first background image.
  • S430: the generation model generates a first prior distribution according to the first background image and the semantic target information to be generated.
  • The generator includes two parts, an encoding structure and a decoding structure, and the encoder includes a prior condition encoding module.
  • The encoder down-samples the image through convolutional and pooling layers to extract the features of the first background image. The extracted features of the first background image and the semantic target information to be generated are then encoded and merged and input into the prior condition encoding module, which obtains the prior condition of the current source image and, according to that prior condition, generates the prior distribution of the noise of the semantic target to be generated.
  • The semantic target information to be generated is determined by user definition or system definition; as an illustration, the semantic target to be generated can be object B, object C, object D, object E and so on.
  • The first semantic target image may be object A, and the semantic target information to be generated may be understood as replacing object A in the source image with object B, or with object C or object D, and so on. Taking the replacement of object A with object B as an example, object A in the source image is identified first.
  • After the source image is input to the generator, the generator first smooths the area of the identified object A to obtain the first background image; the encoder then performs feature extraction on the first background image by sampling, encodes and merges the features of the first background image with the information of object B, and inputs the result into the prior condition encoding module to obtain the prior distribution for object B. It can be understood that if object A is instead to be replaced with object C, a prior distribution for object C needs to be generated.
  • The first background image is used as the input of the encoder; through the convolutional and pooling layers, the encoder obtains a vector and a matrix of fixed length, which are used as the prior conditions of a Gaussian distribution, and sampling is performed from this Gaussian distribution.
  • Because each sampling differs, different images, such as A1, A2 and A3, will be generated, so that diverse images of the object can be obtained.
  • the size of the distribution parameter depends on the size of the replaced image area, that is, the area size of the first semantic target image.
  • the noise obtained by Gaussian distribution sampling and the features of the first background image are combined and decoded.
  • For example, when the first background image is in a well-lit environment, the target images obtained by combined decoding may be A1, A2, A3, etc., i.e. images of the object under sufficient light; when the first background image is in a rainy-weather environment, the target images obtained by combined decoding may be A1', A2', A3', etc.
  • The environment information of the source image can be obtained by using the first background image as a prior condition. This environment information can be understood as the conditions of the surrounding environment other than the semantic target to be replaced; it allows the environment and texture of the generated area to be controlled so that they are consistent with the overall image.
  • Using the semantic target information to be generated as a condition, including the label of the semantic target to be generated, allows the semantic target type of the generated image to be controlled so that it is similar to or consistent with that of the source image.
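  • A toy version of the prior condition encoding module (a sketch only; the reparameterized Gaussian sampling shown is one common way to realize "sample from a Gaussian whose parameters come from the encoder", and all layer sizes and names are assumptions):

```python
import torch
import torch.nn as nn

class PriorEncoder(nn.Module):
    """Maps background features plus the condition (label embedding)
    to the mean/variance of a Gaussian prior, then samples from it."""
    def __init__(self, feat_dim=128, num_labels=10, z_dim=32):
        super().__init__()
        self.emb = nn.Embedding(num_labels, feat_dim)
        self.to_mu = nn.Linear(2 * feat_dim, z_dim)
        self.to_logvar = nn.Linear(2 * feat_dim, z_dim)

    def forward(self, bg_feat, label):
        h = torch.cat([bg_feat, self.emb(label)], dim=1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterized sample: every draw yields different noise,
        # which is what makes the generated targets diverse.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar

enc = PriorEncoder()
z, mu, logvar = enc(torch.randn(4, 128), torch.tensor([2, 2, 5, 7]))
```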
  • S440: the generation model generates the target image according to the first prior distribution and the first background image.
  • the generative model samples the noise from the above-mentioned first prior distribution, combines it with the first background image features extracted by the encoder, and inputs the result into the decoder to generate the target image.
  • An optional understanding is that the generation model samples noise from the above prior distribution. This noise can be understood as the features of the variation, based on the Gaussian distribution, of the position distribution of the semantic target image to be generated within the first background image. Encoding these variation features together with the features of the first background image yields a number of different features, and decoding these different features with the decoder yields different images.
  • the different images generated above are generated based on sampling noise, and the noise obtained by each sampling is different, so as to realize the generation of the distribution change images of different positions of the semantic target to be generated in the source image.
  • The above-mentioned distribution of the semantic target to be generated over different positions in the source image can be understood as changes of angle, position, layout and so on; the color of the semantic target to be generated can also be changed according to user definition.
  • the position of the semantic target image to be generated in the above-mentioned first background image has a more reasonable and diverse distribution, so that the generated sample data is more reasonable and abundant.
  • the generation model generates an annotation of the target image according to the semantic target information to be generated and the information of the first semantic target image, so as to realize automatic annotation of the semantic target.
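  • Putting steps S410-S440 together, the following toy Python sketch shows the data flow (all networks are replaced by trivial placeholders; every function name here is hypothetical and the "decode" step is a stand-in, not the patent's generator):

```python
import numpy as np

rng = np.random.default_rng(0)

def box_to_mask(shape, box):
    # Binary mask covering the (already enlarged) bounding box region.
    m = np.zeros(shape[:2])
    y0, x0, y1, x1 = box
    m[y0:y1, x0:x1] = 1.0
    return m

def generate_target_image(source_img, box, cond_vec):
    # S420: remove the detected target region -> first background image.
    mask = box_to_mask(source_img.shape, box)
    background = source_img * (1.0 - mask[..., None])
    # S430: "encode" the background and the condition into Gaussian prior
    # parameters (a real system would use the encoder network here).
    feat = background.reshape(-1)[:16]              # toy feature vector
    mu = feat[:8] + cond_vec
    sigma = np.abs(feat[8:16]) + 1e-3
    # S440: sample noise from the prior and "decode" a target image
    # (the decode step here is a trivial placeholder).
    z = mu + sigma * rng.standard_normal(8)
    target = background + mask[..., None] * z.mean()
    annotation = {"bbox": box, "label": "to-be-generated target"}
    return target, annotation

img = rng.random((64, 64, 3))
out, ann = generate_target_image(img, box=(16, 16, 40, 40), cond_vec=np.ones(8))
```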
  • FIG. 5 provides a schematic block diagram of the modules included according to an embodiment of the present application. As shown in FIG. 5, these include a detection module and a generation module.
  • the detection module includes a recognizer; the generation module includes a generator and a discriminator, and the generator includes an encoding structure and a decoding structure.
  • The detection model and the generation model can be trained based on a generative adversarial network.
  • A source image is used as input; the source image is pre-labeled, and the recognizer is trained using the labeled source image so that it can detect semantic targets.
  • the training method may adopt the training method in the prior art, or any method that can achieve the purpose of training the picture recognizer, which is not limited in this embodiment of the present application.
  • the trained recognizer can determine the first semantic target image of the source image, as described in step S410 of the method 400, and details are not repeated here.
  • The model strengthens its ability to recognize categories with a large amount of data and weakens its ability to recognize categories with a small amount of data; therefore, the selected bounding boxes have a higher probability of belonging to categories with a larger sample size.
  • The selected original semantic target will be replaced by another semantic target, while the semantic targets that could not be identified, together with their labels, are retained in the original image.
  • The source image can be processed according to the semantic target information to be generated to obtain the first background image, and the first background image can be input into the encoder to obtain its features; the source image and the target semantic category are also input into the encoder to obtain the prior conditions, from which the prior distribution of the noise is further generated. Noise features are then sampled from the prior distribution, encoded and merged with the features of the first background image, and the resulting features are input to the decoder for decoding to obtain a synthesized image.
  • The generator and the discriminator are connected in series with the parameters of the discriminator fixed, and the parameters of the generator are optimized with the goal of having the generated picture judged "real", so that the pictures generated by the generator can "deceive" the discriminator; that is, the pictures generated by the generator are as close as possible to real pictures.
  • The image synthesized by the generator is input to the discriminator together with a real picture, i.e. the source image, and the discriminator judges whether the input image is real based on the source image. The network parameter values of the generator are adjusted according to the discrimination result; the picture generated by the adjusted generator is then used as the input of the discriminator again, and these steps are repeated until the output is judged to be real, at which point the training end condition of the image generator to be trained and that of the image discriminator to be trained reach a balance.
  • the discriminator performs smoothing processing on the target semantic information included in the target image, and then performs authenticity discrimination on the target image including the smoothed target semantic information.
  • The discriminator judges the generated image against the real image. The generated image differs from the real image only within the bounding box of the first semantic target, with no change to the other semantic targets; therefore, the main discriminative region of the discriminator is the region within the bounding box of the first semantic target.
  • the discriminator divides the target image into regions, and performs authenticity discrimination for different regions.
  • The image discriminator to be trained weights the different regions according to the semantic target information to be generated when calculating the loss function.
  • When the discriminator recognizes a region with a high overlap rate with the semantic target image to be generated, it increases that region's weight when calculating the loss function, so that the discriminative model pays more attention to the generated semantic target image, thereby improving the effect of the model, that is, improving the authenticity discrimination in this area.
  • the discriminator to be trained enlarges the bounding box including target semantic information in the image, and performs authenticity discrimination on the image in the bounding box area.
  • The discrimination area of the discriminator is expanded beyond the contour range of the semantic target; the discriminator therefore also discriminates the environment area of the semantic target within the bounding box, which in turn constrains the output of the generative model, so that the local fusion in the generated target image is smooth and natural and the texture of the whole image is highly consistent.
• the features are embedded with a category information map based on the original image annotation, and a convolutional network is then used to obtain two-dimensional region discrimination results, which represent the probability that the image in each region is real.
• the present application weights the results of the region discrimination to obtain the loss function, so that the model pays more attention to the results of the generated region.
  • the discriminator of the present application can adopt the PatchGAN structure.
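• For reference, a compact PatchGAN-style discriminator is sketched below; it is fully convolutional and returns a grid of logits, one per image region, rather than a single scalar. The layer widths are illustrative and not taken from the present application.

```python
import torch.nn as nn

# PatchGAN-style discriminator: a fully convolutional network whose output is
# a 2-D map of logits, one per receptive-field patch of the input image.
patch_discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(256, 1, 4, stride=1, padding=1),  # B x 1 x h x w region logits
)
```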
  • the trained recognizer can directly generate the target image.
• the annotation of the target image is generated according to the semantic target information and the information of the first semantic target image, thereby realizing automatic labeling of the semantic target, avoiding the workload of manual labeling, and effectively improving the preparation efficiency of the training data set.
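• A sketch of this automatic labeling step is given below, assuming the semantic target information to be generated carries the class label and the first semantic target image supplies its bounding box; the dictionary layout is an illustrative assumption.

```python
def make_annotation(image_id, target_label, first_target_bbox):
    """Derive the target image's annotation from the semantic target
    information to be generated (label) and the first semantic target
    image's location (bounding box), with no manual labeling."""
    x, y, w, h = first_target_bbox  # the generated target reuses this region
    return {
        "image_id": image_id,
        "category": target_label,   # label of the generated semantic target
        "bbox": [x, y, w, h],       # location inherited from the replaced target
    }
```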
• the detection module in Figure 5 can be coupled with the generation module, that is, the recognizer and the discriminator are coupled.
• the discriminator and the detector can share feature extraction, thereby greatly reducing the computational load of the model and effectively improving the efficiency of image generation.
• a recognizer is introduced to automatically identify the semantic target to be replaced in the original picture and preprocess the semantic target; the features of the processed picture are then obtained and, combined with the semantic target information to be generated, the conditional distribution of the noise is generated and merged with the image features; the target is then automatically replaced with the semantic target expected to be generated, and finally a picture that conforms to the texture information of the image and satisfies diversity requirements is synthesized and automatically labeled, ensuring the diversity of the generated data.
  • Fig. 6 is a structural block diagram of an image generating device provided by an embodiment of the present application.
  • the image generation device 600 includes: a detection unit 610 , a generation unit 620 and a discrimination unit 630 .
  • the detection unit 610 is configured to determine a first semantic target image of the source image, where the first semantic target image is at least one of at least one semantic target image included in the source image.
• the generating unit 620 is configured to determine a first background image according to the first semantic target image; generate a prior distribution of noise according to the first background image and the semantic target information to be generated; and generate the target image according to the prior distribution and the first background image.
  • the generating unit 620 is configured to extract noise from the prior distribution, and generate the target image according to the noise and the first background image.
  • the generating unit 620 is configured to perform smoothing processing on the first semantic target image, and determine the first background image according to the smoothed first semantic target image.
  • the generating unit 620 is configured to remove the smoothed first semantic target image from the source image to obtain a first background image.
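• The smoothing-and-removal step can be pictured as mask dilation followed by masking, as in the following sketch (NumPy/SciPy; the mask convention and the number of dilation iterations are assumptions).

```python
import numpy as np
from scipy.ndimage import binary_dilation

def first_background(source, target_mask, smooth_iters=5):
    """source: H x W x 3 image array; target_mask: H x W boolean mask of the
    first semantic target image. Smoothing is modeled as dilating the mask so
    the region to be regenerated fully covers the replaced target."""
    smoothed = binary_dilation(target_mask, iterations=smooth_iters)
    background = source.copy()
    background[smoothed] = 0  # remove the smoothed target from the source image
    return background, smoothed
```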
  • the generation unit 620 is further configured to complete the labeling of the semantic target of the target image according to the semantic target information.
• the discrimination unit 630 is used to train the images generated by the generating unit to be realistic.
• the target image is used as an input image, and an image discriminator to be trained is used to identify the authenticity of the target image; the network parameter values of the image generator are adjusted according to the output result of the image discriminator and the input image, where the image generator is used to generate the target image; the target image generated by the image generator after the network parameter values are adjusted is used as an input image, and the identification action of the image discriminator to be trained is repeated until the training process converges.
  • the image discriminator to be trained discriminates images in different regions included in the target image.
  • the image discriminator to be trained performs region division on the target image.
  • the image discriminator to be trained is also used to perform smoothing processing on the target image including the target semantic information, and perform authenticity discrimination on the target image including the smoothed target semantic information.
• the smoothing process, for example, enlarges the bounding box of the target semantic image, and the image discriminator to be trained performs authenticity discrimination on the image in the bounding box area.
• the detection unit 610 and the generation unit 620 may share an image detector, where the image detector is used to perform feature extraction on the first semantic target image included in the source image.
  • image generation efficiency can be improved.
• the processor in the embodiments of the present application may be a central processing unit (CPU), and the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
• the non-volatile memory can be read-only memory (ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
• by way of example but not limitation, many forms of RAM are available, for example: static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
  • the above-described embodiments may be implemented in whole or in part in the form of computer program products.
  • the computer program product comprises one or more computer instructions or computer programs.
• when the computer instructions or computer programs are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
• the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wired or wireless (such as infrared, radio, microwave, etc.) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
  • the semiconductor medium may be a solid state drive.
• "At least one" means one or more, and "multiple" means two or more.
• "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single items or plural items.
• for example, at least one item (piece) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may be single or multiple.
• the sequence numbers of the above-mentioned processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
• multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
• if the functions described above are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
• the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
• the aforementioned storage media include: USB flash disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disc, and other media that can store program code.

Abstract

The present application discloses an image generation method. The image generation method comprises: determining a first semantic target image of a source image, the first semantic target image being at least one semantic target image included in the source image; and determining a target image according to the first semantic target image and a first prior distribution, the first prior distribution being a prior distribution of noise of a first background image and semantic target information to be generated, the target image comprising a plurality of images that comprise said semantic target information, said semantic target information comprising a label of a semantic target image to be generated and indication information of said semantic target image, and the first background image being a background image of the source image. Therefore, a target picture set having stable quality and diversity can be automatically generated for multiple semantic target images included in a specific sample picture set.

Description

Image generation method and apparatus
This application claims the priority of the Chinese patent application with application number 202111006421.1 and application title "Image Generation Method and Apparatus", filed with the China Patent Office on August 30, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular, to an image generation method and apparatus.
Background
At present, image generation technology is increasingly widely used in the field of computer vision, and can be applied in industrial automation, biomedicine, automobiles and other fields, for example, automatic driving, intelligent detection and video surveillance.
In the prior art, image data can be processed effectively through the powerful computing capability of deep neural network models. For example, a discriminative model can predict unknown properties of a given known input, that is, identify the type of a given picture and detect all semantic targets present in the picture; a generative model can model the distribution of the data to describe an observable data set and thus generate data with the same distribution. Combining current discriminative models and generative models makes it possible to convert images of different categories into one another.
Current generative models, for example, conditional generative adversarial nets (cGAN), have greatly inspired research on generating pictures of large, complex scenes. However, on the one hand, the labor cost required before current generative models can be applied cannot be ignored: for example, a sufficient number of original picture sets and target picture sets need to be prepared, and the semantic targets in them need to be manually annotated. On the other hand, in complex scenes only pictures for a single semantic target can be generated, and the quality of the generated pictures is poor; for example, semantic target details and image texture are missing, resulting in severely distorted pictures and even partially missing semantic targets.
How to automatically generate target picture sets of stable quality and diversity for the multi-semantic-target pictures contained in a small number of sample picture sets has become a problem that urgently needs to be solved in the industry.
Summary
The present application provides an image generation method and apparatus, which can automatically generate target picture sets of stable quality and diversity for the multi-semantic-target images contained in a small number of sample picture sets.
In a first aspect, an image generation method is provided, the method including: determining a first semantic target image of a source image, where the first semantic target image is at least one of at least one semantic target image contained in the source image; and determining a target image according to the first semantic target image and a first prior distribution, where the first prior distribution is a prior distribution of noise obtained from a first background image and semantic target information to be generated, the target image includes multiple images containing the semantic target information to be generated, the semantic target information to be generated includes a label of the semantic target image to be generated and indication information of the semantic target image to be generated, and the first background image is the background image of the source image.
According to the above scheme provided by the present application, a prior distribution is generated based on the background image of the source image and the information of the semantic target image to be generated; random sampling from this prior distribution can generate target images in which the semantic target image to be generated is diversely distributed over the first background image and conforms to the texture of the whole image.
With reference to the first aspect, in some implementations of the first aspect, determining the target image according to the first semantic target image and the first prior distribution includes: determining the first background image according to the first semantic target image; generating the first prior distribution according to the first background image and the semantic target information to be generated; and generating the target image according to the first prior distribution and the first background image.
According to the above technical solution, the generation module determines the region of the first background image according to the first semantic target image, further determines the features of the first background image, and generates the first prior distribution according to the first background image features and the semantic target to be generated, thereby ensuring that the generated image conforms to the texture of the whole image and avoiding image distortion; moreover, the prior distribution of the semantic target to be generated, built from the first background image features, allows subsequent sampling to be diverse, thereby guaranteeing the diversity of the target image.
With reference to the first aspect, in some implementations of the first aspect, generating the target image according to the first prior distribution and the first background image includes: generating the noise of the semantic target image to be generated according to the first prior distribution; and generating the target image according to the noise of the semantic target image to be generated and the first background image.
According to the above technical solution, noise is sampled from the first prior distribution and further combined with the first background image to generate the target image; the noise sampling is random, which guarantees the diversity of the target image.
With reference to the first aspect, in some implementations of the first aspect, determining the first background image according to the first semantic target image includes: performing smoothing processing on the first semantic target image; and determining the first background image according to the smoothed first semantic target image.
According to the above technical solution, the semantic target image to be replaced is smoothed to ensure that the region of the semantic target image to be generated can completely cover the replaced semantic target image, thereby ensuring that the generated target image conforms to the texture of the whole image and avoiding distortion of the semantic target image.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: removing the smoothed first semantic target image from the source image.
According to the above technical solution, deleting the first semantic target image from the source image allows the semantic target image to be generated to blend better into the first background image, which is beneficial to the consistency of the texture of the whole image.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: using the target image as an input image, and using an image discriminator to be trained to identify the authenticity of the target image;
adjusting the network parameter values of an image generator according to the output result of the image discriminator to be trained and the input image, where the image generator is used to generate the target image; and
using the target image generated by the image generator after the network parameter values are adjusted as the input image, and repeating the identification action of the image discriminator to be trained until the training process converges.
According to this technical solution, the generator and the discriminator are trained so that the images generated by the generator tend to be real, ensuring that the generated target images are more realistic and natural.
With reference to the first aspect, in some implementations of the first aspect, using the image discriminator to be trained to identify the authenticity of the target image includes:
the image discriminator to be trained discriminating images in different regions included in the target image, including:
the image discriminator to be trained, in combination with the semantic target information to be generated, weighting the different regions when calculating the loss function.
According to this technical solution, the discriminator performs region recognition on the target image and uses a convolutional network to obtain two-dimensional region discrimination results, which represent the probability that the image in each region is real. In particular, the present application weights the region discrimination results to obtain the loss function, so that the model pays more attention to the results of the generated region; that is, the semantic target image region to be generated is discriminated with emphasis, which improves the discriminative ability of the discriminator and makes the target images generated by the generator more realistic and natural.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: the image discriminator to be trained performing region division on the target image.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: the image discriminator to be trained smoothing the semantic target image to be generated included in the target image; and the image discriminator to be trained performing authenticity discrimination on the target image including the smoothed semantic target image to be generated.
According to this technical solution, the discriminator smooths the replaced semantic target image to ensure that the region of this semantic target image can completely cover the replaced semantic target image, thereby ensuring that the generated target image conforms to the texture of the whole image and avoiding distortion of the semantic target image.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: completing the annotation of the semantic target of the target image according to the semantic target information to be generated and the information of the first semantic target image.
According to this technical solution, the generated target image can be automatically annotated based on the semantic target label in the semantic target information to be generated, which avoids the workload of manual annotation and effectively improves the preparation efficiency of the training data set.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: the image discriminator to be trained including an image detector, where the image detector is used to perform feature extraction on the first semantic target image included in the source image.
According to this technical solution, the recognizer is coupled with the discriminator; in this case, the discriminator and the detector can share feature extraction, which greatly reduces the computational load of the model and effectively improves image generation efficiency.
In a second aspect, an image generation apparatus is provided, the image generation apparatus including units for executing the method in the first aspect or its various implementations.
In a third aspect, an image generation apparatus is provided, including a processor and a memory, where the memory is used to store a computer program, and the processor is used to call and run the computer program from the memory, so that the apparatus executes the image generation method in the first aspect and its various possible implementations.
Optionally, there are one or more processors, and one or more memories.
Optionally, the memory may be integrated with the processor, or the memory may be provided separately from the processor.
In a fourth aspect, a computer-readable storage medium is provided, where the computer-readable medium stores program code for execution by a device, and the program code includes instructions for executing the method in the first aspect or the second aspect.
In a fifth aspect, a computer program product including instructions is provided; when the computer program product runs on a computer, the computer is caused to execute the method in any one of the implementations of the above aspects.
In a sixth aspect, a chip is provided, the chip including a processor and a data interface, where the processor reads, through the data interface, instructions stored in a memory, and executes the method in any one of the implementations of the above aspects.
Optionally, as an implementation, the chip may further include a memory in which instructions are stored; the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method in any one of the implementations of the above aspects.
The above chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
Description of Drawings
Fig. 1 is a schematic structural diagram of a system architecture provided by an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of a product implementation form provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of an image generation method provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of a discriminator structure and a detector structure with shared feature extraction provided by an embodiment of the present application;
Fig. 6 is a schematic block diagram of an image generation apparatus provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application.
It should be understood that the names of all nodes and apparatuses in this application are only names set for convenience of description, and the names in actual applications may be different; it should not be understood that this application limits the names of the various nodes and apparatuses. On the contrary, any name with the same or similar function as a node or apparatus used in this application is regarded as the method of this application or an equivalent replacement, and falls within the protection scope of this application; this is not repeated below.
Since the embodiments of the present application involve extensive application of neural networks, for ease of understanding, related terms and concepts of neural networks that may be involved in the embodiments of the present application are introduced first below.
(1) Neural network
A neural network may be composed of neural units. A neural unit may refer to an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit may be: $h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$, where $s=1,2,\ldots,n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, used to apply a nonlinear transformation to features obtained in the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field, where the local receptive field may be a region composed of several neural units.
(2) Deep neural network
A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers. Dividing the DNN by the positions of different layers, the neural network inside the DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer.
Although a DNN looks complicated, the work of each layer is actually not complicated. Simply put, each layer is the following linear relationship expression: $\vec{y}=\alpha(W\vec{x}+\vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has many layers, there are also many coefficients $W$ and offset vectors $\vec{b}$. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example. Assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$: the superscript 3 represents the layer in which the coefficient $W$ is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4. In summary, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as $W^{L}_{jk}$.
It should be noted that the input layer has no parameter $W$. In a deep neural network, more hidden layers enable the network to better characterize complex situations in the real world. In theory, a model with more parameters has higher complexity and larger "capacity", which means it can complete more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its ultimate purpose is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors $W$ of many layers).
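The per-layer computation described above can be illustrated numerically. In the following sketch (NumPy; the layer sizes are arbitrary), W[j, k] plays the role of the coefficient $W^{L}_{jk}$ from neuron k of layer L-1 to neuron j of layer L:

```python
import numpy as np

def layer_forward(W, b, x):
    """One DNN layer: y = alpha(W @ x + b), with a sigmoid activation."""
    z = W @ x + b
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # 4 neurons in layer L-1
W = rng.normal(size=(3, 4))     # W[j, k] = coefficient W^L_{jk}
b = rng.normal(size=3)          # offset vector of layer L
y = layer_forward(W, b, x)      # 3 neurons in layer L
```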
(3) Convolutional neural network
A convolutional neural network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network contains a feature extractor composed of convolutional layers and sub-sampling layers, and the feature extractor can be regarded as a filter. The convolutional layer refers to the neuron layer in the convolutional neural network that performs convolution processing on the input signal. In a convolutional layer of a convolutional neural network, a neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of neural units arranged in a rectangle. Neural units of the same feature plane share weights, and the shared weights here are the convolution kernel. Sharing weights can be understood as the way of extracting image information being independent of position. The convolution kernel can be initialized in the form of a matrix of random size, and during the training of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, a direct benefit of sharing weights is reducing the connections between the layers of the convolutional neural network while also reducing the risk of overfitting.
(4) Loss function
In the process of training a deep neural network, because it is hoped that the output of the deep neural network is as close as possible to the value actually desired to be predicted, the predicted value of the current network can be compared with the actually desired target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is high, the weight vectors are adjusted to make it predict lower, and adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the loss function or objective function, which is an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so the training of the deep neural network becomes a process of reducing this loss as much as possible.
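As a concrete instance of a loss function, a squared-error loss can be computed as follows (illustrative values; real training would minimize this loss by adjusting the weight vectors):

```python
import numpy as np

def mse_loss(pred, target):
    """Squared-error loss: a higher output means the prediction is
    further from the truly desired target value."""
    return float(np.mean((pred - target) ** 2))

pred = np.array([0.8, 0.1])
target = np.array([1.0, 0.0])
loss = mse_loss(pred, target)   # training updates weights to shrink this value
```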
(5) Backbone network
In neural networks, especially in the field of computer vision (CV), features are generally first extracted from the image. This part is the foundation of the entire CV task, because all subsequent tasks depend on the extracted image features. Therefore, this part of the network structure is called the backbone network.
(6) U-Net
U-Net is one of the earlier algorithms using a fully convolutional network for semantic segmentation. Its network structure is divided into a down-sampling stage and an up-sampling stage; the network structure contains only convolutional layers and pooling layers, without fully connected layers. The shallower high-resolution layers in the network are used to solve the problem of pixel localization, and the deeper layers are used to solve the problem of pixel classification, so that segmentation at the image semantic level can be achieved. The structure of U-Net includes a contracting path that captures context information and a symmetric expanding path that allows precise localization. In the present application, this method can be understood as completing end-to-end training with very little data, that is, the input is an image and the output is also an image, while obtaining the best results.
(7) Image features
Image features can be understood as numerical features transformed from raw data through feature extraction operations, which are convenient for the algorithm to understand and process. In the embodiments of the present application, image features specifically refer to the image features extracted using the backbone network model.
(8) Discriminative model
In the embodiments of the present application, a discriminative model can predict unknown attributes for a given known input, that is, identify the category to which a given picture belongs and detect all semantic targets existing in the picture. The discriminative model is, for example, a target classification model or a target detection model.
Among discriminative models, the first to be realized was the classification of images of a single semantic target, that is, the target classification model. The model uses convolutional layers to extract features of local regions of the image and inputs the obtained features to a fully connected layer for classification. On this basis, target detection models use a similar backbone network to extract features from images containing multiple semantic targets, and then locate and recognize the semantic targets in the original image according to the feature map output by the backbone network. According to the model structure, target detection models can be further divided into single-step models and two-step models. The single-step methods, represented by the YOLO neural network (you only look once, YOLO) and the SSD detector (single shot multibox detector, SSD), use a regression approach that takes the overall feature map as input to compute the probability that each grid cell contains a semantic target. The two-step methods, represented by Faster Region Based Convolutional Neural Networks (Faster RCNN) and Mask Region Based Convolutional Neural Networks (Mask RCNN), introduce a region proposal network: they first identify, based on the feature information, whether the preset candidate boxes contain semantic targets, and then input the regions with a high probability of containing semantic targets into a classifier for target detection.
(9) Generative model
In the embodiments of the present application, the generative model takes the generative adversarial model as an example. The model consists of a generator and a discriminator, where the generator aims to learn the mapping from a noise distribution to the target data distribution, and the discriminator is used to judge whether the received data is real. During training, the discriminator is first trained using the fake data generated by the generator and the real data respectively; then the generator and the discriminator are connected in series, the parameters of the discriminator are fixed, and the generator is trained with the goal of "generating real data". During training, the discriminator's ability to distinguish real from fake improves, which in turn makes the pictures produced by the generator closer to the real distribution, finally reaching an equilibrium. The training of a generative adversarial network can be understood as the following optimization problem:
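The formula for this optimization problem appeared as an image in the original text and is not reproduced above; it presumably corresponds to the standard minimax objective of a generative adversarial network, which in conventional notation reads

$\min_G \max_D V(D,G)=\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}[\log(1-D(G(z)))]$

where $p_{data}$ is the distribution of the real data, $p_z$ is the noise distribution, $G$ is the generator, and $D$ is the discriminator.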
The generative adversarial model can perform well in generating small single-category pictures, but it has difficulty handling high-definition pictures and cannot generate pictures of different categories based on one model. On this basis, the conditional generative adversarial model (cGAN) introduces control conditions on top of the original model, achieving controllability over the category of the generated semantic target, so that one model can be used to generate pictures of different categories. Specifically, the model represents the category of the semantic target to be generated as a one-dimensional vector serving as the condition vector. In the generator, an embedding layer is used to encode the noise together with the condition, which is then passed into the network for generation; in the discriminator, the features obtained by the backbone network are encoded together with the noise to judge whether the input picture is a real picture of the target category.
To facilitate understanding of the embodiments of the present application, a schematic structural diagram of a system architecture 100 of an embodiment of the present application is first briefly described with reference to Fig. 1. As shown in Fig. 1, the system architecture 100 includes a data set module 110 and a model training module 120. The data set module 110 is one of the foundations of model training, and a high-quality, large-scale data set is the key to training a high-quality model.
In the data set module 110, as shown in Fig. 1, the data collection device 111 is used to collect raw data; the data preprocessing device 112 can be used to screen and filter the raw data and perform data annotation, where the annotation may be manual or automatic, which is not limited in the embodiments of the present application. The data generation device 113 is used to generate new data according to the annotated data. Data generation is an important means of solving the problem of insufficient data set scale.
The data set module 110 also includes a data repository 114 for storing automatically generated data sets. These data sets can be used by the training module for model training.
As shown in Fig. 1, the model training module 120 includes a training device 121, and the training device 121 can obtain the target model 122 by training based on the training data maintained in the database 130.
The process by which the training device 121 obtains the target model 122 based on the training data is described below: the training device 121 processes the input data and compares the output data with the original input data until the difference between the data output by the training device 121 and the original input data is less than a certain threshold, thereby completing the training of the target model 122.
The above target model 122 can be used to implement the image generation method of the embodiments of the present application, that is, the picture set automatically generated by the data generation device 113 is input into the target model 122, and the authenticity of the pictures included in the picture set can be judged. The target model 122 in the embodiments of the present application may specifically be a cGAN network.
It should be noted that, in practical applications, the training data maintained in the database 114 is not necessarily all collected by the data collection device 111, and may also be received from other devices. It should also be noted that the training device 121 does not necessarily train the target model entirely based on the training data maintained by the database 114, and may also obtain training data from the cloud or elsewhere for model training; the above description should not be taken as a limitation on the embodiments of the present application.
The target model 122 obtained by training by the training device 121 can be applied to different systems or devices, such as an execution device; the execution device may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, augmented reality (AR)/virtual reality (VR), or a vehicle-mounted terminal, and may also be a server or a cloud.
It is worth noting that the training device 121 can generate, for different goals or different tasks, corresponding target models 122 based on different training data, and the corresponding target models 122 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.
It is also worth noting that Fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationships among the devices, components, modules, etc. shown in the figure do not constitute any limitation.
Since CNN is a very common neural network, the structure of CNN is introduced in detail below with reference to Fig. 2. As described in the introduction to basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture; a deep learning architecture refers to an algorithm, updated through the neural network model, that performs multiple levels of learning at different levels of abstraction. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to the images input into it.
In Fig. 2, the convolutional neural network (CNN) 200 may include an input layer 210, convolutional layers/pooling layers 220 (where the pooling layers are optional), and a fully connected layer 230. The convolutional neural network in Fig. 2 is applicable to an image classification model structure. The internal layer structure of the CNN 200 in Fig. 2 is described in detail below.
Convolutional layers/pooling layers 220:
Convolutional layer:
As shown in Fig. 2, the convolutional layers/pooling layers 220 may include, as an example, layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The convolutional layer 221 is taken as an example below to introduce the internal working principle of one convolutional layer.
The convolutional layer 221 may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel after another (or two pixels after two pixels, depending on the value of the stride), thereby completing the work of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolutional image, where this dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features from the image; for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), so the convolutional feature maps extracted by them also have the same size, and the extracted convolutional feature maps of the same size are then combined to form the output of the convolution operation.
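The sliding-window computation described above can be made concrete with a small example (PyTorch; the kernel size, stride and channel counts are arbitrary):

```python
import torch
import torch.nn as nn

# One convolutional layer: 3 input channels, 8 kernels (weight matrices) of
# size 3 x 3, moved one pixel at a time (stride 1) across the input image.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)
image = torch.randn(1, 3, 32, 32)       # one 32 x 32 RGB input
features = conv(image)                  # 1 x 8 x 32 x 32: 8 stacked feature maps
# The kernel depth matches the input depth, as noted above:
assert conv.weight.shape == (8, 3, 3, 3)
```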
In practical applications, the weight values in these weight matrices need to be obtained through extensive training, and the weight matrices formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 makes correct predictions.
When the convolutional neural network 200 has multiple convolutional layers, the earlier convolutional layers (for example, layer 221) often extract more general features, which may also be called low-level features; as the depth of the convolutional neural network 200 increases, the features extracted by the later convolutional layers (for example, layer 226) become more and more complex, such as high-level semantic features, and features with higher semantics are more applicable to the problem to be solved.
Pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. In the layers 221 to 226 shown as 220 in FIG. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator, for sampling the input image to obtain an image of a smaller size. The average pooling operator computes the average of the pixel values within a specific range as the result of average pooling; the maximum pooling operator takes the pixel with the largest value within a specific range as the result of maximum pooling. In addition, just as the size of the weight matrix in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the output image represents the average or maximum value of the corresponding sub-region of the input image.
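As an illustrative sketch only (again assuming PyTorch; the sizes are arbitrary assumptions), the two pooling operators described above can be expressed as follows:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 224, 224)                 # feature maps from a convolutional layer
max_pool = nn.MaxPool2d(kernel_size=2, stride=2) # keeps the maximum of each 2x2 region
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2) # averages each 2x2 region
print(max_pool(x).shape, avg_pool(x).shape)      # both torch.Size([1, 64, 112, 112])
```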
Fully connected layer 230:
After the processing of the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet ready to output the required output information, because, as described above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate one output or a group of outputs whose quantity equals the number of required classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n shown in FIG. 2) and an output layer 240. The parameters contained in the multiple hidden layers may be obtained through pre-training based on training data related to a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the multiple hidden layers in the fully connected layer 230, the last layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to categorical cross-entropy and is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 (in FIG. 2, propagation in the direction from 210 to 240 is forward propagation) is completed, back propagation (in FIG. 2, propagation in the direction from 240 to 210 is back propagation) starts to update the weight values and biases of the aforementioned layers, to reduce the loss of the convolutional neural network 200, that is, the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
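The forward pass, cross-entropy-like loss at the output layer, and back propagation described above can be sketched as follows (PyTorch assumed; the toy layer sizes, class count, and batch size are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Hidden layers 231..23n and output layer 240, reduced to a toy size.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),   # hidden layer
    nn.Linear(256, 10))                      # output layer: 10 class logits
criterion = nn.CrossEntropyLoss()            # loss similar to categorical cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

features = torch.randn(8, 64, 8, 8)          # pooled feature maps for a batch of 8
labels = torch.randint(0, 10, (8,))

logits = model(features)                     # forward propagation (210 -> 240)
loss = criterion(logits, labels)             # prediction error at the output layer
optimizer.zero_grad()
loss.backward()                              # back propagation (240 -> 210)
optimizer.step()                             # update weight values and biases
```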
It should be noted that the convolutional neural network shown in FIG. 2 is merely an example of a convolutional neural network; in a specific application, the convolutional neural network may also exist in the form of another network model.
A product implementation form provided by an embodiment of this application is described below.
FIG. 3 is a schematic diagram of a product implementation form provided by an embodiment of this application.
A product implementation form of an embodiment of this application may be program code that is included in a dataset system and deployed on server hardware. The program code of this application may reside in the data generation module of the dataset system, for example, the data generation device 113 in the system architecture 100. This part of the code runs on host storage (memory or disk) and is used to execute the innovative automatic data generation method.
FIG. 3 shows a product implementation form of this application. The product includes a server 310, software 320, and hardware 330. The hardware 330 includes host memory or a disk and is used to run the program code of this application. The software 320 includes an image recognition apparatus 321 and an image generation apparatus 322. The image recognition apparatus 321 is used to identify the authenticity of an input image. The image generation apparatus 322 is used to generate diverse images carrying the target semantic information required by a user, and to store the generated pictures in the host memory or disk for input into the image recognition apparatus 321 for picture recognition.
The image generation method and apparatus provided in the embodiments of this application can be applied to other similar picture editing and replacement tasks, for example, replacement of vehicles or other target objects in different road-condition scenes in autonomous driving, replacement of hairstyles, hair colors, and headwear of persons in video surveillance scenes, and replacement of different product shapes, colors, and layouts in industrial automation quality monitoring scenes. Specifically, the image generation method of the embodiments of this application can be applied to picture editing and replacement scenarios; the embodiments of this application take an autonomous driving scenario as an example to describe the image generation method in detail.
To facilitate understanding of this embodiment, a sample image generation method disclosed in an embodiment of the present disclosure is first described in detail. The execution subject of the sample image generation method provided in the embodiments of the present disclosure is generally a computer device with certain computing capabilities. The computer device includes, for example, a terminal device, a server, or another processing device. The terminal device may be user equipment (user equipment, UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (personal digital assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the sample image generation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
The image generation method of an embodiment of this application is described below with reference to FIG. 4. FIG. 4 is a schematic flowchart of an image generation method 400 provided by an embodiment of this application. The method may be applied to a dataset system, and the dataset system includes a detection module and a generation module. The generation module includes a discriminator and a generator, and the generator includes an encoding structure and a decoding structure. The detection model and the generation model may be trained based on a generative adversarial network. The method may be executed by the foregoing execution device. Optionally, the method 400 may be processed by a CPU, jointly processed by a CPU and a GPU, or processed by another processor suitable for neural network computation instead of a GPU, which is not limited in this application.
During training, the detection model first trains a picture recognizer using annotated source images. An annotated source image is a source image in which the semantic targets included in the picture have been annotated in advance, either manually or automatically, which is not limited. The purpose of training the picture recognizer is to enable it to recognize the semantic targets corresponding to the annotations. The training may adopt a training method in the prior art or any method that can achieve the purpose of training the picture recognizer, which is not limited in the embodiments of this application.
In the embodiments of this application, the trained detection model performs semantic target recognition on the source image, and the generation model then edits the recognized semantic targets according to pre-given conditions, realizing end-to-end automatic image generation and automatic annotation.
In the embodiments of this application, the source image is an example of an input original picture, and the source image may include multiple input pictures; the first semantic target image is an example of a semantic target; the first background image is an example of an image obtained after the source image is processed according to the user's to-be-generated semantic target information; and the target image is an example of a picture generated by the generator.
It should be understood that the source image may also be called an original image, an input image, an original picture, or a similar term, and the target image may also be called a generated image, a generated picture, or a similar term. The embodiments of this application use the terms source image and target image as examples for description, which is not limited.
The method 400 includes steps S410 to S440, which are outlined below.
S410: The detection model determines a first semantic target image of a source image.
S420: The generation model determines a first background image according to the first semantic target image.
S430: The generation model generates a first prior distribution according to the first background image and to-be-generated semantic target information.
S440: The generation model generates the target image according to the first prior distribution and the first background image.
Steps S410 to S440 are described in detail below.
S410: The detection model determines a first semantic target image of the source image.
The source image serves as the input image of the dataset system. The detection model of the dataset system includes a picture recognizer, and the picture recognizer can recognize the first semantic target image of the source image.
In the embodiments of this application, the image generation method may be used in an autonomous driving scenario. Therefore, the source image may be an image collected by an autonomous vehicle, and the first semantic target image may be a vehicle image in the source image, for example, a bus, a car, or a bicycle.
Specifically, in a possible implementation, the foregoing recognizer is trained and can therefore recognize the first semantic target image according to the annotations. The source image is an original input image and includes one or more images, each image includes one or more semantic targets, and the first semantic target image is at least one of them.
In a possible implementation, the detection model annotates the recognized semantic target image, which specifically includes annotating the bounding box of the semantic target and the category of the semantic target. For a detected bounding box, its position is recorded and the box is appropriately enlarged so as to include part of the boundary region, to avoid insufficiently smooth fusion between the generated semantic target and the surrounding environment.
For example, the source image is an image of an autonomous driving scene and contains multiple semantic targets, for example, a bus, a road, the sky, plants, and pedestrians. The first semantic target image may be a vehicle, for example, a bus. The detection model records the position of the first semantic target image, enlarges its bounding box, and identifies the category of the semantic target as vehicle.
It should be understood that the bounding box of the first semantic target image may have any shape. The arbitrary shape can be understood as appropriately enlarging the detected region containing the first semantic target image to capture part of the surrounding environment image, so the arbitrary shape includes the complete image of the first semantic target and part of the surrounding environment image. The arbitrary shape is, for example, a rectangle, a circle, or a trapezoid, which is not limited in the embodiments of this application.
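The box enlargement described above can be sketched, for the rectangular case, with a hypothetical helper (the name, the 15% margin, and the coordinate convention are assumptions for illustration, not the claimed implementation):

```python
def enlarge_box(x1, y1, x2, y2, img_w, img_h, ratio=0.15):
    """Enlarge a detected bounding box by a margin so that the region also
    contains part of the surrounding environment (hypothetical helper)."""
    dw, dh = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (max(0, int(x1 - dw)), max(0, int(y1 - dh)),
            min(img_w, int(x2 + dw)), min(img_h, int(y2 + dh)))

print(enlarge_box(100, 80, 300, 240, 640, 480))  # (70, 56, 330, 264)
```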
It should be understood that the foregoing examples are merely illustrative and should not constitute any limitation on the embodiments of this application. This application takes image generation in an autonomous driving scenario as an example; in addition, the image generation method provided in the embodiments of this application can also be applied to any image generation scenario in which a semantic target needs to be replaced.
S420: The generation model determines a first background image according to the first semantic target image.
The generation model obtains the first semantic target image detected by the detection model, edits the source image according to the first semantic target image to determine the first background image, and performs feature extraction to determine the features of the first background image.
Specifically, the generation model obtains the first semantic target image detected by the detection model, determines the position and annotation information of the first semantic target image, and processes the source image according to the annotation information of the first semantic target to determine the first background image. Further, feature extraction is performed on the first background image to obtain the features of the first background image.
In a possible implementation, smoothing processing is performed on the first semantic target image, which is used to enlarge the region of the first semantic target image. It can be understood that, after the enlargement, the to-be-generated semantic target can completely cover the region of the first semantic target image, thereby achieving texture consistency across the whole image.
It should be understood that, in this application, after the first semantic target image is smoothed, the region of the first semantic target image may be directly covered by the to-be-generated target semantic image.
In a possible implementation, the smoothed first semantic target image is removed from the source image to obtain the first background image.
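A minimal sketch of this removal step, assuming PyTorch tensors and zero-filling as the erasure strategy (the helper name and fill value are illustrative assumptions):

```python
import torch

def make_background(source, box):
    """Erase the enlarged semantic-target region of the source image to
    obtain the first background image (hypothetical helper)."""
    x1, y1, x2, y2 = box
    background = source.clone()
    background[:, :, y1:y2, x1:x2] = 0.0   # region to be re-generated later
    return background

src = torch.randn(1, 3, 480, 640)          # (N, C, H, W) source image
bg = make_background(src, (70, 56, 330, 264))
```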
In a possible implementation, the to-be-generated semantic target information is given semantic target information. For example, the semantic target information may be "replace the bus with a truck" or "remove the bus"; similarly, any requirement defined according to user needs may serve as the semantic target information.
In a possible implementation, the to-be-generated semantic target information may also be defined by the system, which is not limited in the embodiments of this application.
Preferably, the to-be-generated semantic target information may include a label of the semantic target image and indication information of the to-be-generated semantic target image. The semantic target here can be understood as the semantic target to be generated, and the to-be-generated semantic target is close to the semantic target in the source image in volume and semantic category.
For example, when the first semantic target image detected in the source image is a large vehicle, the label of the to-be-generated semantic target may be truck, bus, fire engine, or the like; the semantic target label included in the semantic target information is vehicle, and the semantic target volume is described as large. For another example, when the first semantic target image is a bicycle, the to-be-generated semantic target may be a motorcycle, a tricycle, or the like; the semantic target category included in the semantic target information is vehicle, and the semantic target volume is described as small.
It should be understood that processing the source image according to the to-be-generated semantic target information and the annotation information of the first semantic target image includes smoothing the first semantic target image. For example, if the indication information in the to-be-generated semantic target information is to replace the bus with a truck, the generation model determines the position and bounding box of the first semantic target in the source image according to the annotation information of the first semantic target, and deletes or overwrites the image within the bounding box, thereby obtaining the first background image. For another example, if the indication information in the to-be-generated semantic target information is to remove the bus, the generation model determines the position and bounding box of the first semantic target, deletes the image within the bounding box, and fills it with surrounding-environment color blocks; as one understanding, for example, road-surface color blocks are used to fill the bounding box, thereby obtaining the first background image.
It should be understood that the foregoing examples are merely illustrative and should not constitute any limitation on the embodiments of this application.
S430: The generation model generates a first prior distribution according to the first background image and the to-be-generated semantic target information.
It should be understood that, in the embodiments of this application, the generator includes an encoding structure and a decoding structure, and the encoder includes a prior condition encoding module.
Specifically, with the first background image as input, the encoder downsamples the image through convolutional layers and pooling layers and extracts the features of the first background image. Then, the extracted features of the first background image and the to-be-generated semantic target information are encoded and merged, and input to the prior condition encoding module. The prior condition encoding module obtains the prior conditions of the current source image, and generates, according to the prior conditions, the prior distribution of the noise of the to-be-generated semantic target.
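The following sketch illustrates one plausible shape of such an encoder-plus-prior-condition module, assuming PyTorch and a Gaussian prior parameterized by a mean and log-variance; the class name, label embedding, and all sizes are assumptions for illustration rather than the claimed design:

```python
import torch
import torch.nn as nn

class PriorConditionEncoder(nn.Module):
    """Downsamples the first background image and, together with an
    embedding of the to-be-generated semantic label, predicts the mean and
    log-variance of the Gaussian prior of the noise."""
    def __init__(self, num_labels=10, z_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())      # -> 64-d feature
        self.label_emb = nn.Embedding(num_labels, 64)    # to-be-generated label
        self.prior_head = nn.Linear(64 + 64, 2 * z_dim)  # -> (mu, log_var)

    def forward(self, background, label):
        h = torch.cat([self.backbone(background), self.label_emb(label)], dim=1)
        mu, log_var = self.prior_head(h).chunk(2, dim=1)
        return mu, log_var

enc = PriorConditionEncoder()
mu, log_var = enc(torch.randn(1, 3, 256, 256), torch.tensor([3]))
```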
It should be understood that the first prior distribution is merely illustrative, which is not limited in the embodiments of this application.
It should be understood that the to-be-generated semantic target information is determined through user definition, system definition, or the like. As one understanding, the to-be-generated semantic target may be, for example, object B, object C, object D, or object E, and the first semantic target image may be object A. In this case, the to-be-generated semantic target information can be understood as replacing object A in the source image with object B, or with object C or object D, and so on. Taking "replace object A in the source image with object B" as an example: the recognizer first identifies object A in the source image; after the source image is input to the generator, the region of object A is first smoothed in the generator to obtain the first background image; the encoder then performs feature extraction on the first background image through sampling, encodes and merges the features of the first background image with the information of object B, and inputs the result to the prior condition encoding module to obtain the prior distribution of object B. It can be understood that, if object A is instead to be replaced with object C, the prior distribution of object C needs to be generated.
In the embodiments of this application, it should be understood that the first background image serves as the input of the encoder, and the encoder obtains a length vector and a matrix through the convolutional layers and pooling layers. The vector and matrix serve as the prior conditions of a Gaussian distribution, and sampling is performed from this Gaussian distribution. When a single object A is to be generated, depending on the random samples drawn from the Gaussian distribution and on the distribution parameters of the Gaussian distribution, different images of object A, such as A1, A2, and A3, are generated, so that diverse images of object A can be obtained. The magnitude of the distribution parameters depends on the size of the replaced image region, that is, the region size of the first semantic target image.
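A minimal sketch of such sampling, assuming the common reparameterized form of a Gaussian draw (the function name and placeholder parameters are assumptions):

```python
import torch

def sample_noise(mu, log_var):
    """Draw one noise sample z ~ N(mu, sigma^2) from the conditional prior."""
    std = torch.exp(0.5 * log_var)
    return mu + std * torch.randn_like(std)

mu, log_var = torch.zeros(1, 128), torch.zeros(1, 128)  # placeholder prior parameters
z1 = sample_noise(mu, log_var)  # each draw yields a different variant (A1, A2, ...)
z2 = sample_noise(mu, log_var)
```

Because every call draws fresh random noise, repeated sampling under the same prior conditions is what produces the diverse variants A1, A2, A3 described above.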
Further, the noise sampled from the Gaussian distribution is merged with the features of the first background image and decoded. For example, when the first background image represents a well-lit environment, the target images obtained by merging and decoding may be images A1, A2, A3, and so on of object A under sufficient light; for another example, when the first background image represents a rainy environment, the target images obtained by merging and decoding may be images A1', A2', A3', and so on of object A in rainy weather.
It should be understood that using the first background image as a prior condition makes it possible to obtain the environment information of the source image. The environment information can be understood as the condition of the surrounding environment excluding the semantic target to be replaced, so that the environment and texture of the generated region can be controlled to be consistent with the whole image.
It should be understood that conditioning on the to-be-generated semantic target information, which includes the label of the semantic target to be generated, makes it possible to control the semantic target output type of the generated picture to be similar to or consistent with the source image.
S440: The generation model generates the target image according to the first prior distribution and the first background image.
Specifically, the generation model samples noise from the first prior distribution, merges it with the first background image features extracted by the encoder, and inputs the result to the decoder to generate the target image.
As an optional understanding, the noise sampled by the generation model from the prior distribution can be understood as a Gaussian-distribution-based feature of the variation in the position distribution of the to-be-generated semantic target image within the first background image. Encoding this variation feature together with the first background image features yields multiple different features, and decoding these different features with the decoder yields different images.
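One plausible way to merge the sampled noise with the background features and decode them is sketched below (PyTorch assumed; the broadcast-addition fusion, transposed-convolution decoder, and all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

z_dim, feat_ch = 128, 64
fuse = nn.Linear(z_dim, feat_ch)                    # project noise to feature channels
decoder = nn.Sequential(
    nn.ConvTranspose2d(feat_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())

bg_feat = torch.randn(1, feat_ch, 64, 64)           # background features from the encoder
z = torch.randn(1, z_dim)                           # noise drawn from the prior
merged = bg_feat + fuse(z)[:, :, None, None]        # broadcast-merge noise with features
target = decoder(merged)                            # generated image, (1, 3, 256, 256)
```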
It should be understood that the foregoing prior distribution based on the Gaussian distribution is merely illustrative; the prior distribution may also take other forms, which is not limited in the embodiments of this application.
It should be understood that the different images generated above are generated based on sampled noise, and the noise obtained in each sampling is different, thereby realizing the generation of images in which the position distribution of the to-be-generated semantic target varies within the source image.
It can be understood that the variation in the position distribution of the to-be-generated semantic target within the source image covers changes in angle, position, layout, and the like; moreover, the color of the to-be-generated semantic target can also be changed according to user definition.
Under the guidance of the prior distribution, the position of the to-be-generated semantic target image within the first background image has a more reasonable and diverse distribution, so that the generated sample data is more reasonable and abundant.
In a possible implementation, the generation model generates the annotation of the target image according to the to-be-generated semantic target information and the information of the first semantic target image, realizing automatic annotation of the semantic target.
It should be understood that the foregoing generator and discriminator are both trained; a specific training method is described below with reference to FIG. 5.
To facilitate understanding by those skilled in the art, the following description is provided with reference to the example in FIG. 5.
FIG. 5 is a schematic block diagram of the modules included in an embodiment of this application. As shown in FIG. 5, they include a detection module and a generation module.
The detection module includes a recognizer; the generation module includes a generator and a discriminator, and the generator includes an encoding structure and a decoding structure. The detection model and the generation model may be trained based on a generative adversarial network.
For the detection module, specifically, the source image serves as input, the source image is annotated in advance, and the annotated source image is used to train the recognizer, so that the recognizer can detect semantic targets. The training may adopt a training method in the prior art or any method that can achieve the purpose of training the picture recognizer, which is not limited in the embodiments of this application.
It should be understood that the trained recognizer can determine the first semantic target image of the source image, as described in step S410 of the method 400; details are not repeated here.
It should be understood that when the data samples are unbalanced, the model's recognition capability is strengthened for categories with abundant data and weakened for categories with less data. Therefore, the selected bounding box has a higher probability of selecting a category with a larger sample size. In the semantic target editing module, the selected original semantic target is replaced with another semantic target, while semantic targets that could not be recognized, together with their labels, are retained in the original image.
For the generation module, specifically, in the semantic target editing module, the source image may be processed according to the to-be-generated semantic target information to obtain the first background image; the first background image is input to the encoder to obtain the first background image features; the source image and the target semantic category are then input to the encoder to obtain the prior conditions, from which the prior distribution of the noise is generated; noise features are then sampled from the prior distribution, the noise features and the first background image features are input to the encoder for editing and merging, and the resulting features are input to the decoder for decoding to obtain a synthesized image.
It should be understood that the image synthesized by the generator needs to be input to the discriminator to further judge whether it is real.
It can be understood that the generator and the discriminator are connected in series with the discriminator parameters fixed, and the parameters of the generator are optimized with "this picture is real" as the target, so that the pictures generated by the generator can "fool" the discriminator; that is, the pictures generated by the generator are as close as possible to real pictures.
Specifically, the image synthesized by the generator is input to the discriminator. The discriminator needs to obtain a real picture, that is, the source image, and judges, based on the source image, whether the input image is real. The network parameter values of the generator are adjusted according to the discrimination result of the discriminator. Further, the picture generated by the adjusted generator is used as the input of the discriminator and the foregoing steps are repeated until the discriminator judges the output to be real, and the training end condition of the image generator to be trained and the training end condition of the image discriminator to be trained reach a balance.
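The alternating optimization described above can be sketched as a single adversarial step (PyTorch assumed; generator, discriminator, their optimizers, and the inputs are assumed to exist as sketched earlier, and the binary real/fake loss is an illustrative choice):

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, g_opt, d_opt, source, bg, z):
    # 1) Update the discriminator: the source image is "real", the generated image "fake".
    fake = generator(bg, z).detach()
    real_logits, fake_logits = discriminator(source), discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) With the discriminator held fixed, optimize the generator toward the target
    #    "this picture is real", so that its output can fool the discriminator.
    fake_logits = discriminator(generator(bg, z))
    g_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```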
In a possible implementation, the discriminator performs smoothing processing on the target semantic information included in the target image, and then performs authenticity discrimination on the target image including the smoothed target semantic information. As an optional understanding, the discriminator discriminates the image generated by the generator against the real image; the generated image differs from the real image only within the bounding box of the first semantic target, and the other semantic targets are unchanged. Therefore, the main discrimination region of the discriminator is the region within the bounding box of the first semantic target.
In a possible implementation, the discriminator divides the target image into regions and performs authenticity discrimination on the different regions.
Specifically, the image discriminator to be trained, in combination with the to-be-generated semantic target information, weights the different regions when computing the loss function.
It should be understood that when the discriminator recognizes a region with a high overlap rate with the to-be-generated semantic target image, it increases the weight of the loss function for that region when computing the loss, so that the discrimination model pays more attention to the generated semantic target image, thereby improving the effect of the model; that is, the authenticity of the discrimination in this region can be improved.
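One way such a region-weighted loss could look is sketched below (PyTorch assumed; the weight value, mask convention, and shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def weighted_region_loss(region_logits, region_targets, gen_mask, w=5.0):
    """Per-region BCE in which regions that overlap the generated semantic
    target (gen_mask == 1) receive a larger weight; shapes are (N, 1, H', W')."""
    bce = F.binary_cross_entropy_with_logits(region_logits, region_targets,
                                             reduction='none')
    weights = 1.0 + (w - 1.0) * gen_mask   # weight w inside the generated region
    return (weights * bce).mean()
```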
In a possible implementation, the discriminator to be trained enlarges the bounding box that includes the target semantic information in the image, and performs authenticity discrimination on the image within the bounding box region.
It should be understood that, in the embodiments of this application, the discrimination region of the discriminator is expanded beyond the contour of the semantic target. Therefore, the discriminator also performs discrimination on the environment region of the semantic target within the bounding box, thereby controlling the output of the generation model, so that the target image generated by the trained generator is locally fused smoothly and naturally, and the texture of the whole image is highly consistent.
In a possible implementation, after the backbone network of the discriminator extracts image features, the features are embedded with a category information map based on the original image annotations, and a convolutional network is then used to obtain a two-dimensional region discrimination result, which indicates the probability that the image in each region is real. In particular, this application weights the region discrimination results to obtain the loss function, so that the model pays more attention to the results of the generated region.
Preferably, the discriminator of this application may adopt a PatchGAN structure.
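A minimal PatchGAN-style sketch, producing the two-dimensional map of per-region logits described above (PyTorch assumed; the layer sizes are illustrative assumptions and omit the category-information embedding for brevity):

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Small convolutional network whose output is a 2-D map of logits,
    one per image region, each indicating how likely that region is real."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1))  # 2-D region logits

    def forward(self, x):
        return self.net(x)

d = PatchDiscriminator()
logits = d(torch.randn(1, 3, 256, 256))   # shape (1, 1, 63, 63): one logit per region
```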
It can be understood that, after training, the discriminator's ability to recognize real pictures and forged pictures increases; when the parameters of the generator are then optimized, the generator's ability to generate realistic fake pictures is correspondingly improved. Through continuous iterative optimization, the discriminator and the generator gradually reach a balance, thereby completing the training of the model.
It should be understood that the trained generator can directly generate the target image.
The annotation of the target image is generated according to the target semantic information and the information of the first semantic target image, realizing automatic annotation of the semantic target. This avoids the workload of manual annotation and effectively improves the efficiency of training dataset preparation.
In a possible implementation, the detection module in FIG. 5 may be coupled with the generation module, that is, the recognizer is coupled with the discriminator. In this case, the discriminator and the detector can share feature extraction, which greatly reduces the computational load of the model and effectively improves image generation efficiency.
According to the image generation method of the embodiments of this application, based on the preset to-be-generated semantic target information, a recognizer is introduced to automatically recognize the semantic target to be replaced in the original picture and preprocess it; the processed picture features are then obtained and, in combination with the to-be-generated semantic target information, the conditional distribution of the noise is generated and merged with the picture features; the semantic target is then automatically replaced with the expected generated semantic target; finally, pictures that conform to the picture texture information and satisfy diversity requirements are synthesized and automatically annotated, ensuring the diversity of the generated data.
FIG. 6 is a structural block diagram of an image generation apparatus provided by an embodiment of this application. The image generation apparatus 600 includes a detection unit 610, a generation unit 620, and a training unit 630.
The detection unit 610 is configured to determine a first semantic target image of a source image, where the first semantic target image is at least one of the at least one semantic target image included in the source image.
The generation unit 620 is configured to determine a first background image according to the first semantic target image; generate a prior distribution of noise according to the first background image and to-be-generated semantic target information; and generate the target image according to the prior distribution and the first background image.
Specifically, the generation unit 620 is configured to extract noise from the prior distribution, and generate the target image according to the noise and the first background image.
Specifically, the generation unit 620 is configured to perform smoothing processing on the first semantic target image, and determine the first background image according to the smoothed first semantic target image.
In a possible implementation, the generation unit 620 is configured to remove the smoothed first semantic target image from the source image to obtain the first background image.
Specifically, the generation unit 620 is further configured to complete the annotation of the semantic target of the target image according to the semantic target information.
The training unit 630 is configured to train the images generated by the generation unit so that they tend to be realistic.
Specifically, the target image is used as an input image, and an image discriminator to be trained is used to discriminate the authenticity of the target image; according to the output result of the image discriminator and the input image, the network parameter values of an image generator are adjusted, where the image generator is used to generate the target image; the target image generated by the image generator with adjusted network parameter values is then used as the input image, and the discrimination action of the image discriminator to be trained is repeated until the training process converges.
In a possible implementation, the image discriminator to be trained discriminates the images in the different regions included in the target image.
Optionally, the image discriminator to be trained divides the target image into regions.
Specifically, the image discriminator to be trained is further configured to perform smoothing processing on the target semantic information included in the target image, and perform authenticity discrimination on the target image including the smoothed target semantic information.
It should be understood that the smoothing processing is, for example, enlarging the bounding box of the target semantic image; the image discriminator to be trained performs authenticity discrimination on the image within the bounding box region.
It should be understood that the discriminators in the detection unit 610 and the generation unit 620 may share one image detector, where the image detector is used to perform feature extraction on the first semantic target image included in the source image, thereby improving image generation efficiency.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
It should be understood that the processor in the embodiments of this application may be a central processing unit (central processing unit, CPU), or may be another general-purpose processor, a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field programmable gate array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be understood that the memory in the embodiments of this application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories. The non-volatile memory may be a read-only memory (read-only memory, ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (random access memory, RAM), which is used as an external cache. By way of example but not limitation, many forms of random access memory (RAM) are available, for example, a static random access memory (static RAM, SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), a synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and a direct rambus random access memory (direct rambus RAM, DR RAM).
The foregoing embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented by software, the foregoing embodiments may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions according to the embodiments of this application are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to the computer, or a data storage device such as a server or a data center that includes one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate the following three cases: only A exists, both A and B exist, and only B exists, where A and B may be singular or plural. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects, but may also indicate an "and/or" relationship; reference may be made to the context for understanding.
In this application, "at least one" means one or more, and "multiple" means two or more. "At least one of the following items" or a similar expression means any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may indicate a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be singular or plural.
It should be understood that, in the various embodiments of this application, the magnitude of the sequence numbers of the foregoing processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation processes of the embodiments of this application.
Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components can be combined or May be integrated into another system, or some features may be ignored, or not implemented. In another point, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。If the functions described above are realized in the form of software function units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application is essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disc and other media that can store program codes. .
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any changes or substitutions that readily occur to those skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (25)

  1. An image generation method, comprising:
    determining a first semantic target image of a source image, the first semantic target image being at least one of at least one semantic target image contained in the source image; and
    determining a target image according to the first semantic target image and a first prior distribution, wherein the first prior distribution is a prior distribution of noise obtained according to a first background image and semantic target information to be generated, the target image comprises a plurality of images containing the semantic target information to be generated, the semantic target information to be generated comprises a label of a semantic target image to be generated and indication information of the semantic target image to be generated, and the first background image is a background image of the source image.
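For illustration only, the following Python sketch shows one plausible shape of the flow recited in claim 1: detect a semantic target, derive a background-conditioned noise prior, and sample several diverse target images from it. Every name here (detector, NoisePriorNet, generator) is a hypothetical placeholder introduced for this sketch, not a component disclosed in this application, and the reparameterised Gaussian prior is an assumption.

```python
import torch
import torch.nn as nn

class NoisePriorNet(nn.Module):
    """Maps the first background image plus the to-be-generated target info
    (per-pixel label map and location mask) to a Gaussian noise prior."""
    def __init__(self, num_classes, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 + 1 + num_classes, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)

    def forward(self, background, target_mask, label_map):
        h = self.encoder(torch.cat([background, target_mask, label_map], dim=1))
        return self.to_mu(h), self.to_logvar(h)

def generate_target_images(source, detector, prior_net, generator, n_samples=4):
    # Claim 1, step 1: determine the first semantic target image of the source
    # (detector is an assumed callable returning a mask and a label map).
    target_mask, label_map = detector(source)
    background = source * (1.0 - target_mask)   # first background image
    # Claim 1, step 2: first prior distribution from background + target info.
    mu, logvar = prior_net(background, target_mask, label_map)
    samples = []
    for _ in range(n_samples):
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # z ~ N(mu, sigma^2)
        samples.append(generator(z, background))              # assumed interface
    return samples  # a plurality of target images containing the new target
```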
  2. The method according to claim 1, wherein the determining a target image according to the first semantic target image and a first prior distribution comprises:
    determining the first background image according to the first semantic target image;
    generating the first prior distribution according to the first background image and the semantic target information to be generated; and
    generating the target image according to the first prior distribution and the first background image.
  3. The method according to claim 2, wherein the generating the target image according to the first prior distribution and the first background image comprises:
    generating noise of the semantic target image to be generated according to the first prior distribution; and
    generating the target image according to the noise of the semantic target image to be generated and the first background image.
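Claim 3 splits generation into synthesising noise for the target to be generated and combining it with the first background image. A minimal sketch of a generator with that split, assuming the object is rendered with an alpha matte and composited over the background (the compositing scheme is my assumption, not a detail recited in the claim):

```python
import torch
import torch.nn as nn

class ObjectGenerator(nn.Module):
    """Renders the to-be-generated semantic target from prior noise and
    composites it over the first background image (sketch only; not the
    network disclosed in this application)."""
    def __init__(self, latent_dim=128, size=64):
        super().__init__()
        self.render = nn.Sequential(
            nn.Linear(latent_dim, 64 * size * size),
            nn.Unflatten(1, (64, size, size)),
            nn.Conv2d(64, 4, 3, padding=1),   # 3 RGB channels + 1 alpha matte
        )

    def forward(self, z, background):
        out = self.render(z)
        rgb = torch.sigmoid(out[:, :3])       # target appearance, in [0, 1]
        alpha = torch.sigmoid(out[:, 3:4])    # where the target sits
        # Target image = f(noise-rendered target, first background image).
        return alpha * rgb + (1.0 - alpha) * background
```

Here z would be drawn from the first prior distribution of claim 2, so different samples yield diverse target images over the same background.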
  4. The method according to any one of claims 1 to 3, wherein determining the first background image according to the first semantic target image comprises:
    smoothing the first semantic target image; and
    determining the first background image according to the smoothed first semantic target image.
  5. The method according to claim 4, further comprising:
    removing the smoothed first semantic target image from the source image.
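Claims 4 and 5 derive the background by smoothing the detected target and removing it from the source. A sketch of one such reading, with a Gaussian blur standing in for the unspecified smoothing operation (both the blur and the mask-based removal are assumptions):

```python
import torch
import torchvision.transforms.functional as TF

def background_from_source(source, target_mask, kernel_size=21, sigma=7.0):
    """source: (B,3,H,W) in [0,1]; target_mask: (B,1,H,W) binary mask of the
    first semantic target image detected in the source."""
    # Claim 4: smooth the first semantic target image (Gaussian blur assumed).
    smoothed = TF.gaussian_blur(source, [kernel_size, kernel_size], [sigma, sigma])
    # Claim 5: remove the smoothed target from the source; the vacated region
    # is filled with blurred content so no hard object edges remain.
    return source * (1.0 - target_mask) + smoothed * target_mask
```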
  6. The method according to any one of claims 1 to 5, further comprising:
    using the target image as an input image, and identifying authenticity of the target image with an image discriminator to be trained;
    adjusting network parameter values of an image generator according to an output result of the image discriminator to be trained and the input image, wherein the image generator is configured to generate the target image; and
    using a target image generated by the image generator with the adjusted network parameter values as the input image, and repeating the discrimination performed by the image discriminator to be trained, until the training process converges.
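Claim 6 describes a standard adversarial training loop: the discriminator judges the authenticity of generated target images, the generator's parameters are updated from the discriminator's output, and the cycle repeats until convergence. A bare-bones sketch; the non-saturating BCE losses and per-step optimisers are my assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, g_opt, d_opt, real_images, background, z):
    # Discriminator step: judge authenticity of real vs. generated targets.
    fake = generator(z, background).detach()
    real_logits = discriminator(real_images)
    fake_logits = discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: adjust the generator's network parameters according to
    # the discriminator's output on freshly generated target images.
    gen_logits = discriminator(generator(z, background))
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    # The caller repeats train_step, feeding the updated generator's outputs
    # back in as input images, until training converges (claim 6).
    return d_loss.item(), g_loss.item()
```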
  7. The method according to claim 6, wherein the identifying authenticity of the target image with the image discriminator to be trained comprises:
    discriminating, by the image discriminator to be trained, images in different regions of the target image, comprising:
    weighting, by the image discriminator to be trained in combination with the semantic target information to be generated, the different regions when computing a loss function.
  8. The method according to claim 7, further comprising:
    dividing, by the image discriminator to be trained, the target image into regions.
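Claims 7 and 8 have the discriminator divide the target image into regions and weight those regions in the loss using the semantic target information to be generated. One patch-discriminator reading, in which regions indicated for the new target are up-weighted (the specific weighting scheme is an assumption):

```python
import torch
import torch.nn.functional as F

def region_weighted_d_loss(patch_logits, is_real, target_mask, target_weight=4.0):
    """patch_logits: (B,1,h,w) per-region scores from a patch discriminator
    (claim 8: the discriminator divides the target image into regions);
    target_mask: (B,1,H,W) indication of the semantic target to be generated."""
    labels = torch.ones_like(patch_logits) if is_real else torch.zeros_like(patch_logits)
    per_region = F.binary_cross_entropy_with_logits(patch_logits, labels, reduction="none")
    # Claim 7: weight the regions using the to-be-generated target information,
    # here by up-weighting regions covered by the new target's mask.
    weights = 1.0 + (target_weight - 1.0) * F.interpolate(
        target_mask, size=patch_logits.shape[-2:], mode="nearest")
    return (weights * per_region).sum() / weights.sum()
```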
  9. The method according to any one of claims 6 to 8, further comprising:
    smoothing, by the image discriminator to be trained, the semantic target image to be generated that is included in the target image; and
    performing, by the image discriminator to be trained, authenticity discrimination on the target image including the smoothed semantic target image to be generated.
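Claim 9 smooths the generated target region before the authenticity check, which plausibly keeps the discriminator from latching onto high-frequency synthesis artifacts. A sketch, reusing the Gaussian-blur assumption from the claim 4 example:

```python
import torch
import torchvision.transforms.functional as TF

def discriminate_smoothed(discriminator, target_image, gen_target_mask):
    """target_image: (B,3,H,W); gen_target_mask: (B,1,H,W) marking the
    to-be-generated semantic target within the target image."""
    # Blur only the generated target region before authenticity discrimination.
    blurred = TF.gaussian_blur(target_image, [11, 11], [3.0, 3.0])
    smoothed = target_image * (1.0 - gen_target_mask) + blurred * gen_target_mask
    return discriminator(smoothed)
```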
  10. The method according to any one of claims 1 to 9, further comprising:
    completing annotation of a semantic target of the target image according to the semantic target information to be generated and information of the first semantic target image.
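Claim 10 observes that annotations for the generated image come essentially for free: the label and placement of the new target were inputs to generation, and the source target's annotation is already known. A sketch of assembling such an annotation (the dictionary layout is an assumed convention, not a format defined in this application):

```python
def annotate_generated(gen_label, gen_box, source_target_info):
    """gen_label / gen_box: label and (x1, y1, x2, y2) placement used as the
    to-be-generated semantic target information; source_target_info: the
    annotation of the first semantic target image detected in the source."""
    # No manual labelling pass: the annotation is composed directly from the
    # inputs that drove generation plus the known source-target information.
    return {
        "objects": [
            {"label": gen_label, "bbox": list(gen_box), "origin": "generated"},
            {"label": source_target_info["label"],
             "bbox": source_target_info["bbox"], "origin": "source"},
        ]
    }
```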
  11. The method according to any one of claims 1 to 10, wherein:
    the image discriminator to be trained comprises an image detector, and the image detector is configured to perform feature extraction on the first semantic target image included in the source image.
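Claim 11 folds an image detector into the discriminator to extract features of the source's semantic target. Sketched below with an off-the-shelf torchvision detection backbone standing in for the unspecified detector; the model choice, the direct backbone call, and the ImageNet normalisation are all assumptions:

```python
import torch
import torchvision
import torchvision.transforms.functional as TF

def target_features(source, box):
    """Extract features of the first semantic target image with a pre-trained
    detection backbone; source: (B,3,H,W) in [0,1], box: (x1, y1, x2, y2)."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()
    x1, y1, x2, y2 = box
    crop = TF.normalize(source[:, :, y1:y2, x1:x2],
                        mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    with torch.no_grad():
        # The FPN backbone returns a dict of multi-scale feature maps for the
        # cropped target region.
        return model.backbone(crop)
```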
  12. An image generation apparatus, comprising:
    a detection unit, configured to determine a first semantic target image of a source image, the first semantic target image being at least one of at least one semantic target image contained in the source image; and
    a generation unit, configured to determine a target image according to the first semantic target image and a first prior distribution, wherein the first prior distribution is a prior distribution of noise obtained according to a first background image and semantic target information to be generated, the target image comprises a plurality of images containing the semantic target information to be generated, the semantic target information to be generated comprises a label of a semantic target image to be generated and indication information of the semantic target image to be generated, and the first background image is a background image of the source image.
  13. The image generation apparatus according to claim 12, wherein the generation unit is specifically configured to:
    determine the first background image according to the first semantic target image;
    generate the first prior distribution according to the first background image and the semantic target information to be generated; and
    generate the target image according to the first prior distribution and the first background image.
  14. The image generation apparatus according to claim 13, wherein the generation unit is further configured to:
    generate noise of the semantic target image to be generated according to the first prior distribution; and
    generate the target image according to the noise of the semantic target image to be generated and the first background image.
  15. The image generation apparatus according to any one of claims 12 to 14, wherein the generation unit is further configured to:
    smooth the first semantic target image; and
    determine the first background image according to the smoothed first semantic target image.
  16. The image generation apparatus according to claim 15, wherein the generation unit is further configured to:
    remove the smoothed first semantic target image from the source image.
  17. The image generation apparatus according to any one of claims 12 to 16, further comprising a training unit configured to:
    use the target image as an input image, and identify authenticity of the target image with an image discriminator to be trained;
    adjust network parameter values of an image generator according to an output result of the image discriminator to be trained and the input image, wherein the image generator is configured to generate the target image; and
    use a target image generated by the image generator with the adjusted network parameter values as the input image, and repeat the discrimination performed by the image discriminator to be trained, until the training process converges.
  18. The image generation apparatus according to claim 17, wherein the image discriminator to be trained discriminates images in different regions of the target image, comprising:
    weighting, by the image discriminator to be trained in combination with the semantic target information to be generated, the different regions when computing a loss function.
  19. The image generation apparatus according to claim 18, wherein the image discriminator to be trained divides the target image into regions.
  20. The image generation apparatus according to any one of claims 17 to 19, wherein the image discriminator to be trained smooths the semantic target image to be generated that is included in the target image; and
    the image discriminator to be trained performs authenticity discrimination on the target image including the smoothed semantic target image to be generated.
  21. The image generation apparatus according to any one of claims 12 to 20, wherein the generation unit is further configured to:
    complete annotation of a semantic target of the target image according to the semantic target information to be generated and information of the first semantic target image.
  22. The image generation apparatus according to any one of claims 12 to 21, wherein the image discriminator to be trained comprises an image detector, and the image detector is configured to perform feature extraction on the first semantic target image included in the source image.
  23. An electronic device, comprising:
    a processor and a memory, wherein the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to execute the method according to any one of claims 1 to 11.
  24. A computer-readable storage medium, wherein the computer-readable storage medium stores program code for execution by a device, the program code comprising instructions for executing the method according to any one of claims 1 to 11.
  25. A chip, comprising a processor and a data interface, wherein the processor reads, through the data interface, instructions stored in a memory, to execute the method according to any one of claims 1 to 11.
PCT/CN2022/115028 2021-08-30 2022-08-26 Image generation method and apparatus WO2023030182A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111006421.1 2021-08-30
CN202111006421.1A CN114037640A (en) 2021-08-30 2021-08-30 Image generation method and device

Publications (1)

Publication Number Publication Date
WO2023030182A1 (en)

Family

ID=80140000

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/115028 WO2023030182A1 (en) 2021-08-30 2022-08-26 Image generation method and apparatus

Country Status (2)

Country Link
CN (1) CN114037640A (en)
WO (1) WO2023030182A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114037640A (en) * 2021-08-30 2022-02-11 华为技术有限公司 Image generation method and device
CN116563673B (en) * 2023-07-10 2023-12-12 浙江华诺康科技有限公司 Smoke training data generation method and device and computer equipment
CN117475262B (en) * 2023-12-26 2024-03-19 苏州镁伽科技有限公司 Image generation method and device, storage medium and electronic equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150170006A1 (en) * 2013-12-16 2015-06-18 Adobe Systems Incorporated Semantic object proposal generation and validation
CN107610126A (en) * 2017-08-31 2018-01-19 浙江工业大学 A kind of interactive image segmentation method based on local prior distribution
CN112200889A (en) * 2020-10-30 2021-01-08 上海商汤智能科技有限公司 Sample image generation method, sample image processing method, intelligent driving control method and device
CN114037640A (en) * 2021-08-30 2022-02-11 华为技术有限公司 Image generation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUAN QIU; HUA MIN; HU HAI-GEN: "A modified grabcut approach for image segmentation based on local prior distribution", 2017 INTERNATIONAL CONFERENCE ON WAVELET ANALYSIS AND PATTERN RECOGNITION (ICWAPR), IEEE, 9 July 2017 (2017-07-09), pages 122 - 126, XP033234503, ISBN: 978-1-5386-0410-6, DOI: 10.1109/ICWAPR.2017.8076675 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091873A (en) * 2023-04-10 2023-05-09 宁德时代新能源科技股份有限公司 Image generation method, device, electronic equipment and storage medium
CN116091873B (en) * 2023-04-10 2023-11-28 宁德时代新能源科技股份有限公司 Image generation method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114037640A (en) 2022-02-11

Similar Documents

Publication Publication Date Title
WO2023030182A1 (en) Image generation method and apparatus
CN111612008B (en) Image segmentation method based on convolution network
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
Li et al. A survey on semantic segmentation
EP4099220A1 (en) Processing apparatus, method and storage medium
CN111696110B (en) Scene segmentation method and system
CN113870160B (en) Point cloud data processing method based on transformer neural network
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
WO2023125628A1 (en) Neural network model optimization method and apparatus, and computing device
Hu et al. A video streaming vehicle detection algorithm based on YOLOv4
Jemilda et al. Moving object detection and tracking using genetic algorithm enabled extreme learning machine
Ahmad et al. 3D capsule networks for object classification from 3D model data
CN113724286A (en) Method and device for detecting saliency target and computer-readable storage medium
Yang et al. C-RPNs: Promoting object detection in real world via a cascade structure of Region Proposal Networks
CN110287798B (en) Vector network pedestrian detection method based on feature modularization and context fusion
Zhang et al. A small target detection method based on deep learning with considerate feature and effectively expanded sample size
Mahaur et al. An improved lightweight small object detection framework applied to real-time autonomous driving
Zhou et al. A novel object detection method in city aerial image based on deformable convolutional networks
Wu et al. Vehicle detection based on adaptive multi-modal feature fusion and cross-modal vehicle index using RGB-T images
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
Chacon-Murguia et al. Moving object detection in video sequences based on a two-frame temporal information CNN

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22863320

Country of ref document: EP

Kind code of ref document: A1