WO2020258902A1

WO2020258902A1 - Image generating and neural network training method, apparatus, device, and medium

Info

Publication number: WO2020258902A1
Application number: PCT/CN2020/076835
Authority: WO
Inventors: 黄明杨; 张昶旭; 刘春晓; 石建萍
Original assignee: 商汤集团有限公司
Priority date: 2019-06-24
Filing date: 2020-02-26
Publication date: 2020-12-30
Also published as: JP2022512340A; CN112132167A; CN112132167B; KR20210088656A

Abstract

An image generating method, a neural network training method, an apparatus, an electronic device, and a computer storage medium. The image generating method comprises: extracting content features of a content image by utilizing multiple first network unit blocks sequentially connected in a first neural network to obtain the content features outputted by the first network unit blocks (101); extracting a style feature of a style image (102); correspondingly feedforward inputting the content features respectively outputted by the first network unit blocks into multiple second network unit blocks sequentially connected in a second neural network, feedforward inputting the style feature from the first second network unit block in the multiple second network unit blocks, and, when the second network unit blocks process the features inputted by each, obtaining a generated image outputted by the second neural network, where the multiple first network unit blocks correspond to the multiple second network unit blocks (103).

Description

Image generation and neural network training method, device, equipment and medium

Cross references to related applications

This disclosure is filed based on a Chinese patent application with an application number of 201910551145.3 and an application date of June 24, 2019, and claims the priority of the Chinese patent application. The entire content of the Chinese patent application is hereby incorporated into this disclosure by way of introduction.

Technical field

The present disclosure relates to the field of image processing, and in particular to an image generation method and neural network training method, device, electronic equipment, and computer storage medium.

Background technique

The method of image generation can be to generate from one real image to another, and then subjectively judge whether the generated image is more realistic through human vision. With the application of neural networks, neural network-based image generation methods have emerged in related technologies. The neural network can usually be trained based on paired data, and then the content image can be styled through the trained neural network. Here, paired data is represented by The content image and style image that have the same content characteristics for training, and the style image and the content image have different style characteristics. However, in actual scenarios, the above-mentioned paired data rarely occurs, so this method is not easy to implement.

Summary of the invention

The embodiments of the present disclosure are expected to provide a technical solution for image generation.

In a first aspect, an embodiment of the present disclosure provides an image generation method, the method includes: extracting content features of a content image by using sequentially connected multi-layer first network unit blocks in a first neural network to obtain first The content features respectively output by the network unit blocks; the style features of the style images are extracted; the content features respectively output by the first network unit blocks of each layer are correspondingly fed forward into the second neural network connected to the multilayer second network unit sequentially Block, and feed-forward the style features from the first-layer second network unit block in the multi-layer second network unit block, and obtain the input feature after each second network unit block processes The generated image output by the second neural network, wherein the multi-layer first network unit block corresponds to the multi-layer second network unit block.

In the second aspect, the embodiments of the present disclosure also propose a neural network training method. The method further includes: extracting the content features of the content image by using the sequentially connected multi-layer first network unit blocks in the first neural network to obtain each The content features respectively output by the first network unit block of the layer; extract the style features of the style image; the content features respectively output by the first network unit blocks of each layer are fed forward into the sequentially connected multi-layer first neural network in the second neural network. Two network unit blocks, and feed forward the style features from the first-layer second network unit block in the multi-layer second network unit block, and after each second network unit block processes the respective input features Obtain the generated image output by the second neural network, where the multi-layer first network unit block corresponds to the multi-layer second network unit block; the generated image is identified to obtain the authentication result; Adjusting the network parameters of the first neural network and/or the second neural network according to the content image, the style image, the generated image, and the identification result.

In a third aspect, an embodiment of the present disclosure also provides an image generation device. The device includes a first extraction module, a second extraction module, and a first processing module. The first extraction module is used to use the first neural network. The first network unit blocks of multiple layers connected in sequence extract the content features of the content image to obtain the content features respectively output by the first network unit blocks of each layer; the second extraction module is used to extract the style features of the style image; the first processing Module, used to feed forward the content features respectively outputted by the first network unit blocks of each layer into the second neural network connected in sequence in the second neural network, and transfer the style features from the multiple The first layer second network unit block in the second layer of the network unit block feeds forward input, and the generated image output by the second neural network is obtained after each second network unit block processes the characteristics of each input, wherein The multi-layer first network unit block corresponds to the multi-layer second network unit block.

In a fourth aspect, the embodiments of the present disclosure also provide a neural network training device, which includes a third extraction module, a fourth extraction module, a second processing module, and an adjustment module; wherein, the third extraction module is used to use In the first neural network, the sequentially connected multi-layer first network unit blocks extract the content features of the content image, and obtain the content features respectively output by the first network unit blocks of each layer; the fourth extraction module is used to extract the style features of the style image The second processing module is used to feed the content features respectively output by the first network unit blocks of each layer into the second neural network sequentially connected multi-layer second network unit blocks corresponding to the feedforward input, and the style features Feed forward input from the first-layer second network unit block in the multi-layer second network unit block, and obtain the generated image output by the second neural network after each of the second network unit blocks process the features of their respective inputs Identify the generated image to obtain an identification result; wherein, the multi-layer first network unit block corresponds to the multi-layer second network unit block; an adjustment module is used for the content image, the Adjusting the network parameters of the first neural network and/or the second neural network for the style image, the generated image and the identification result.

In a fifth aspect, the embodiments of the present disclosure also propose an electronic device, including a processor and a memory for storing a computer program that can run on the processor; wherein, when the processor is used to run the computer program, execute Any one of the above image generation methods or any one of the above neural network training methods.

In a sixth aspect, the embodiments of the present disclosure also propose a computer storage medium on which a computer program is stored, and when the computer program is executed by a processor, any one of the foregoing image generation methods or any of the foregoing neural network training methods is implemented.

In the image generation method and neural network training method, device, electronic equipment, and computer storage medium proposed in the embodiments of the present disclosure, the content features of the content image are extracted by using the sequentially connected multi-layer first network unit blocks in the first neural network to obtain The content features respectively output by the first network unit blocks of each layer; the style features of the style images are extracted; the content features respectively output by the first network unit blocks of each layer are correspondingly fed forward into the sequentially connected multilayers in the second neural network The second network unit block, and the style feature is fed forward from the first-layer second network unit block in the multi-layer second network unit block, and the respective input features are processed by each of the second network unit blocks Then, the generated image output by the second neural network is obtained, wherein the multi-layer first network unit block corresponds to the multi-layer second network unit block. In the embodiments of the present disclosure, both the content image and the style image can be determined in actual need, and the content image and the style image do not need to be a pair of images, which is easy to implement; in addition, in the process of image generation, the first neural network can be used The first network unit block of each layer extracts the content features of the content image multiple times, thereby retaining more semantic information of the content image, so that the generated image retains more semantic information compared with the content image. Therefore, The generated image is more realistic.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory, rather than limiting the present disclosure.

Description of the drawings

The drawings herein are incorporated into the specification and constitute a part of the specification. These drawings illustrate embodiments that conform to the disclosure and are used together with the specification to explain the technical solutions of the disclosure.

FIG. 1 is a flowchart of an image generation method according to an embodiment of the disclosure;

2 is a schematic diagram of the structure of a neural network pre-trained in an embodiment of the disclosure;

FIG. 3 is an exemplary structural diagram of a content encoder according to an embodiment of the disclosure;

FIG. 4 is a schematic diagram of an exemplary structure of a CRB in an embodiment of the disclosure;

FIG. 5 is an exemplary structural diagram of a generator of an embodiment of the disclosure;

Fig. 6 shows several exemplary sets of content images, style images, and generated images in the embodiments of the disclosure;

Fig. 7 is a flowchart of a neural network training method according to an embodiment of the disclosure;

8 is a schematic structural diagram of the framework of the image generation method proposed by the application embodiment of the disclosure;

Fig. 9a is a schematic structural diagram of a residual block of a content encoder in an application embodiment of the present disclosure;

Fig. 9b is a schematic structural diagram of a residual block of the generator in an application embodiment of the present disclosure;

FIG. 9c is a schematic structural diagram of the FADE module of the application embodiment of the disclosure;

FIG. 10 is a schematic diagram of the composition structure of an image generating device according to an embodiment of the disclosure;

FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure;

FIG. 12 is a schematic diagram of the composition structure of a neural network training device according to an embodiment of the disclosure.

Detailed ways

The embodiments of the present disclosure will be described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments provided herein are only used to explain the embodiments of the present disclosure, and are not used to limit the embodiments of the present disclosure. In addition, the embodiments provided below are part of the embodiments for implementing the present disclosure, rather than providing all the embodiments for implementing the present disclosure. In the case of no conflict, the technical solutions described in the embodiments of the present disclosure can be combined in any manner. Implement.

It should be noted that in the embodiments of the present disclosure, the terms "including", "including" or any other variations thereof are intended to cover non-exclusive inclusion, so that a method or device including a series of elements not only includes what is clearly stated Elements, but also include other elements not explicitly listed, or elements inherent to the implementation of the method or device. Without more restrictions, the element defined by the sentence "including a..." does not exclude the existence of other related elements (such as steps or steps in the method) in the method or device that includes the element. The unit in the device, for example, the unit may be part of a circuit, part of a processor, part of a program or software, etc.).

For example, the image generation method and neural network training method provided by the embodiments of the present disclosure include a series of steps, but the image generation method and neural network training method provided by the embodiments of the present disclosure are not limited to the recorded steps. Similarly, the present disclosure The image generation device and neural network training device provided in the embodiments include a series of modules, but the devices provided in the embodiments of the present disclosure are not limited to include the explicitly recorded modules, and may also include information for obtaining relevant information or processing based on information. The module that needs to be set.

The term "and/or" in this article is only an association relationship describing associated objects, which means that there can be three relationships, for example, A and/or B, which can mean: A alone exists, A and B exist at the same time, exist alone B these three situations. In addition, the term "at least one" in this document means any one or any combination of at least two of the multiple, for example, including at least one of A, B, and C, may mean including A, Any one or more elements selected in the set formed by B and C.

The embodiments of the present disclosure can be applied to a computer system composed of a terminal and a server, and can operate with many other general-purpose or special-purpose computing system environments or configurations. Here, the terminal can be a thin client, a thick client, a handheld or laptop device, a microprocessor-based system, a set-top box, a programmable consumer electronics product, a network personal computer, a vehicle-mounted device, a small computer system, etc. The server can It is server computer system, small computer system, large computer system and distributed cloud computing technology environment including any of the above systems, etc.

Electronic devices such as terminals and servers can be described in the general context of computer system executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures, etc., which perform specific tasks or implement specific abstract data types. The computer system/server can be implemented in a distributed cloud computing environment. In the distributed cloud computing environment, tasks are executed by remote processing equipment linked through a communication network. In a distributed cloud computing environment, program modules may be located on a storage medium of a local or remote computing system including a storage device.

Based on the content recorded above, in some embodiments of the present disclosure, an image generation method is proposed. The applicable scenarios of the embodiments of the present disclosure include but are not limited to automatic driving, image generation, image synthesis, computer vision, deep learning, Machine learning, etc.

FIG. 1 is a flowchart of an image generation method according to an embodiment of the disclosure. As shown in FIG. 1, the method may include:

Step 101: Extract the content features of the content image by using the sequentially connected multi-layer first network unit blocks in the first neural network to obtain the content features respectively output by the first network unit blocks of each layer.

Here, the content image may be an image that requires style conversion; for example, the content image may be obtained from a local storage area or the content image may be obtained from the network. For example, the content image may be an image taken by a mobile terminal or a camera. The format of the content image can be Joint Photographic Experts GROUP (JPEG), Bitmap (BMP), Portable Network Graphics (PNG) or other formats; it should be noted that this is only for The format and source of the content image are exemplified, and the embodiment of the present disclosure does not limit the format and source of the content image.

For an image, content features and style features can be extracted. Among them, the content feature is used to characterize the content information of the image, for example, the content feature represents the object position, object shape, object size, etc. in the image; the style feature is used to represent the style information of the content image, for example, the style feature is used to represent weather, Style information such as day, night, and conversation style.

In the embodiment of the present disclosure, the style conversion may refer to the conversion of the style feature of the content image into another style feature. Illustratively, the conversion of the style feature of the content image may be the conversion from day to night, and from night to day. , Conversion between different weather styles, conversion between different painting styles, conversion from real images to computer-graphic images (CG) images, conversion from CG images to real images; different weather The conversion between styles can be from sunny to rainy, rainy to sunny, sunny to cloudy, cloudy to sunny, cloudy to rainy, rainy to cloudy, sunny to cloudy Snow conversion, snow to sunny conversion, cloudy to snow conversion, snow to cloudy conversion, snow to rain conversion or rain to snow conversion, etc.; the conversion of different painting styles can be oil painting Conversion to ink painting, ink painting to oil painting, oil painting to sketch painting, sketch painting to oil painting, sketch painting to ink painting or ink painting to sketch painting, etc.

Here, the first neural network is a network for extracting content features of content images, and the embodiment of the present disclosure does not limit the type of the first neural network. The first neural network includes sequentially connected multi-layer first network unit blocks. In the multi-layer first network unit block of the first neural network, the content characteristics of the content image can be changed from the first network unit block of the multi-layer first network. Feed forward input of the first network unit block of the layer. Among them, the data processing direction corresponding to the feedforward input represents the data processing direction from the input end of the neural network to the output end, corresponding to the forward propagation or forward propagation process; for the feedforward input process, the upper layer of the neural network unit block The output result is used as the input result of the next layer of network unit block.

For the first neural network, the first network unit block of each layer of the first neural network can extract content features for the input data, that is, the output result of the first network unit block of each layer of the first neural network is the corresponding layer first network The content characteristics of the unit blocks and the content characteristics output by different first network unit blocks in the first neural network are different.

Optionally, the representation mode of the content feature of the content image may be a content feature map or other representation mode, which is not limited in the embodiment of the present disclosure.

It is understandable that by successively extracting content features by the first network unit blocks of each layer of the first neural network, the semantic information of the content image from the low level to the high level can be obtained. Optionally, each layer of the first network unit block in the first neural network is a plurality of neural network layers organized in a residual structure, so that it can be based on multiple layers of the first network unit block in each layer organized in a residual structure The neural network layer extracts the content features of the content image.

Step 102: Extract the style features of the style image.

Here, the style image is an image with the target style feature, the target style feature represents the style feature to which the content image needs to be converted, and the style image can be set as needed. In the embodiment of the present disclosure, after acquiring the content image, the target style feature to be converted can be determined, and then the style image can be selected according to the demand.

In practical applications, the style image can be obtained from the local storage area or the network. For example, the style image can be an image taken through a mobile terminal or camera; the format of the style image can be JPEG, BMP, PNG or other formats; Yes, here is only an example of the format and source of the style image, and the embodiment of the present disclosure does not limit the format and source of the style image.

In the embodiment of the present disclosure, the style feature of the content image is different from the style feature of the style image, and the purpose of performing style conversion on the content image may be: to make the generated image obtained after the style conversion have the content feature and style of the content image The stylistic characteristics of the image.

For example, you can convert a day-style content image to a night-style generated image, or convert a sunny-style content image to a rainy-style generated image, or convert an ink painting style content image to an oil painting style generated image, or , Convert CG style images into real image style generated images, etc.

For the implementation of this step, for example, the extracting the style features of the style image includes: extracting the features of the style image distribution; sampling the features of the style image distribution to obtain the style features, and the style features include the style image distribution The mean and standard deviation of the features. Here, by sampling the characteristics of the style image distribution, the style characteristics of the style image can be accurately extracted, which is conducive to accurate style conversion of the content image. In practical applications, at least one layer of convolution operation may be performed on the style image to obtain the characteristics of the style image distribution.

Step 103: Feed forward the content features respectively output by the first network unit blocks of each layer into the sequentially connected multi-layer second network unit blocks in the second neural network, and remove the style features from the multi-layer second network unit blocks The first layer of the second network unit block feed forward input, and the generated image of the second neural network output is obtained after each second network unit block processes the respective input characteristics. Among them, the multi-layer first network unit block and the multi-layer second The network unit block corresponds.

Here, the second neural network includes successively connected multi-layer second network unit blocks, and the output result of the previous network unit block in the second neural network is the input result of the next network unit block; optionally, The second network unit blocks of each layer in the second neural network are multiple neural network layers organized in a residual structure. In this way, it can be based on the multiple neural network layer pairs organized in a residual structure in each second network unit block. The input features are processed.

In practical applications, step 101 to step 103 can be implemented by a processor in an electronic device. The processor can be an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), Digital signal processing device (Digital Signal Processing Device, DSPD), programmable logic device (Programmable Logic Device, PLD), FPGA, central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor At least one.

It can be seen that in the embodiments of the present disclosure, both the content image and the style image can be determined according to actual needs, and the content image and the style image do not need to be a pair of images, which is easy to implement; in addition, in the process of image generation, The first network unit block of each layer of the first neural network is used to extract the content features of the content image multiple times, thereby retaining more semantic information of the content image, so that the generated image retains more than the content image. Semantic information, therefore, the generated image is more realistic.

In addition, in the process of image generation based on the neural network structure in the embodiment of the present disclosure, the style of the style image can be determined according to actual needs, and does not limit the style characteristics of the style image and the style characteristics of the style image used when training the neural network. That is to say, the training image of the dark night style is used when the neural network is trained, but when the image is generated based on the trained neural network, you can choose the content image and the snowy style, the rainy style or other styles Images, which generate images that meet the actual needs of the style, not only the dark night style images, and improve the generalization and universality of the image generation method.

Further, a variety of style images with different style characteristics can be set according to user needs, and then generated images with different style characteristics can be obtained for one content image. For example, when performing image generation based on a trained neural network, for the same content image, you can separately input a dark night style image, a cloudy day style image, and a rainy day style image to the trained neural network, so that the style of the content image can be separated. Converting to dark night style, cloudy style and rainy style, that is, based on the same content image, multiple styles of generated images can be obtained, not only one style of image can be generated, and the applicability of the image generation method is improved.

In the embodiment of the present disclosure, the number of layers of the first network unit block of the first neural network and the number of layers of the second network unit block of the second neural network may be the same, and the first network unit block of each layer of the first neural network It forms a one-to-one correspondence with the second network unit blocks of each layer of the second neural network.

As an implementation manner, the corresponding feedforward input of the content features respectively output by the first network unit blocks of each layer to the sequentially connected multi-layer second network unit blocks in the second neural network includes: sequentially responding to i In the case of 1 to T, the content features output by the first network unit block of the i-th layer are fed forward into the second network unit block of the T-i+1th layer, i is a positive integer, and T represents the first nerve The number of layers of the first network unit block of the network and the second network unit block of the second neural network; that is, the content characteristics output by the first network unit block of the first layer are input into the second network unit block of the last layer, The content features output by the first network unit block of the last layer are input to the second network unit block of the first layer.

In the embodiment of the present disclosure, the received content feature of the second network unit block of each layer in the second neural network is the output feature of the first network unit block of each layer of the first neural network, and the second network unit of each layer in the second neural network The received content characteristics of the block vary with different positions in the second neural network. It can be seen that the second neural network uses style features as input. As the style features deepen from the lower-level second network unit block of the second neural network to the high-level second network unit block, more content features can be integrated, which can be based on The style feature gradually merges the semantic information of each layer of the content image, so that the resulting image can retain the multi-layer voice information and style feature information of the content image.

As an implementation manner, the feature processing of the first-level second network unit block in each of the second network unit blocks includes: the content feature and the style feature from the last-level first network unit block can be multiplied , Obtain the intermediate feature of the first-level second network unit block; add the content feature from the first-level first network unit block of the last layer and the intermediate feature of the first-level second network unit block to obtain the first-level second network unit block The output characteristics of the first layer of the second network unit block input the output characteristics of the second layer of the second network unit block. It can be seen that, by performing the above-mentioned multiplication and addition operations, it is convenient to realize the fusion of the style feature and the content feature of the first network unit block of the last layer.

Optionally, before multiplying the content feature and style feature from the first network unit block of the last layer, a convolution operation may be performed on the content feature from the first network unit block of the last layer. That is, it is possible to first perform a convolution operation on the content features from the first network unit block of the last layer, and then perform a multiplication operation on the result of the convolution operation and the style feature.

As an implementation manner, the input feature processing of the middle layer second network unit block in each second network unit block includes: the input content feature and the output feature of the upper layer second network unit block can be multiplied , Get the intermediate feature of the second network unit block of the middle layer; add the input content feature and the intermediate feature of the second network unit block of the middle layer to obtain the output feature of the second network unit block of the middle layer; The output feature of the network unit block is input to the second network unit block of the next layer. It can be seen that by performing the above-mentioned multiplication operation and addition operation, it is convenient to realize the fusion of the output characteristics of the second network unit block of the upper layer and the corresponding content characteristics.

It should be noted that the second network unit block in the middle layer is the second network unit block in the second neural network except the first layer second network unit block and the last layer second network unit block. In the second neural network , There can be one intermediate second network unit block, or there can be multiple second network unit blocks; the above-mentioned content is only an intermediate second network unit block as an example, the data of the intermediate second network unit block The processing procedure is explained.

Optionally, the intermediate layer second network unit block performs a convolution operation on the received content feature before multiplying the input content feature and the output feature of the upper layer second network unit block.

As an implementation, the input feature processing of the second network unit block of the last layer in each second network unit block includes: the content characteristics from the first network unit block of the first layer can be combined with the second network unit of the upper layer. The output feature of the block is multiplied to obtain the intermediate feature of the second network unit block of the last layer; the content feature from the first network unit block of the first layer and the intermediate feature of the second network unit block of the last layer are added to obtain the generated image .

It can be seen that by performing the above-mentioned multiplication and addition operations, it is convenient to realize the fusion of the output characteristics of the second network unit block of the upper layer and the content characteristics of the first network unit block of the first layer, and then pass the second network unit block of each layer. Data processing can make the generated image merge the style characteristics and the content characteristics of the first network unit block of each layer.

Optionally, the second network unit block of the last layer performs a multiplication operation on the content feature from the first network unit block of the first layer and the output feature of the second network unit block of the upper layer, and performs the multiplication operation on the first network unit block from the first layer. The content feature of the unit block is subjected to convolution operation.

In practical applications, the neural network used for image generation can be pre-trained; the pre-trained neural network will be exemplified below with reference to the accompanying drawings. Figure 2 is a schematic structural diagram of a neural network pre-trained in an embodiment of the disclosure. As shown in Figure 2, the pre-trained neural network includes a content encoder, a style encoder, and a generator; wherein the content encoder is used to utilize the first neural network described above. The network extracts the content features of the content image, the style encoder is used to extract the style features of the style image, and the generator is used to use the second neural network to realize the fusion of the style features and the content features output by the first network unit blocks of each layer.

In actual implementation, the first neural network can be used as the content encoder, the second neural network can be used as the generator, and the neural network used for style feature extraction on the style image can be used as the style encoder. 2, the image to be processed (ie, content image) can be input to the content encoder. In the content encoder, the multi-layer first network unit block of the first neural network can be used for processing, and each layer of the first network unit block The content feature can be output; the style image can also be input into the style encoder, and the style feature of the style image can be extracted from the style encoder. Exemplarily, the first network unit block is a residual block (Residual Block, RB), and the content feature output by the first network unit block of each layer is a content feature map.

Fig. 3 is a schematic diagram of an exemplary structure of the content encoder according to the embodiment of the disclosure. As shown in Fig. 3, the residual block of the content encoder can be marked as CRB, and the content encoder includes seven layers of CRB, the CRB( In A, B), A represents the number of input channels and B represents the number of output channels; in Figure 3, the input of CRB(3,64) is the content image, and the first CRB to the seventh CRB are arranged from bottom to top. CRB(3,64), CRB(64,128), CRB(128,256), CRB(256,512), CRB(512,1024), CRB(1024,1024), CRB(1024,1024) and CRB(1024,1024) , The first layer CRB to the seventh layer CRB can output seven content feature maps respectively.

FIG. 4 is an exemplary structural diagram of the CRB of an embodiment of the disclosure. In FIG. 4, sync BN represents a synchronous BN layer, a rectified linear unit (ReLu) represents a ReLu layer, and Conv represents a convolutional layer.

Represents a summation operation; the structure of CRB shown in Figure 4 is the structure of a standard residual block.

3 and 4, in the embodiments of the present disclosure, a standard residual network structure can be used to extract content features, which facilitates the extraction of content features of content images and reduces semantic information loss. In the generator, the multi-layer second network unit block of the second neural network can be used for processing; for example, the second network unit block is RB.

Figure 5 is an exemplary structural diagram of the generator of the embodiment of the disclosure. As shown in Figure 5, the residual block in the generator can be denoted as GB, and the generator can include seven layers of GB, and the input of each layer of GB is The output of one layer of CRB of the content encoder; in the generator, the first layer GB to the seventh layer GB are GB ResBlk (1024), GB ResBlk (1024), GB ResBlk (1024), arranged from top to bottom, respectively GB ResBlk (512), GB ResBlk (256), GB ResBlk (128), and GB ResBlk (64); in GB ResBlk (C) in Figure 5, C represents the number of channels; the first layer of GB is used to receive style features, The first layer GB to the seventh layer GB are used to receive the content feature maps output from the seventh layer CRB to the first layer CRB; after each layer GB processes the input features, the seventh layer GB output can be used to generate an image.

It can be seen that based on the multi-layer residual block of the content encoder, the structural information of the content image can be encoded to generate multiple content feature maps of different levels; the content encoder can extract more abstract features in the deep layer, and in the surface layer A lot of structural information is retained.

The image generation method of the embodiment of the present disclosure can be applied to various image generation scenarios, for example, can be applied to scenarios such as image entertainment data generation, automatic driving model training test data generation, and the like.

The effect of the image generation method of the embodiment of the present disclosure will be described below with reference to the accompanying drawings. Figure 6 shows several exemplary sets of content images, style images, and generated images in the embodiments of the present disclosure. As shown in Figure 6, the first column represents content images, the second column represents style images, and the third column represents implementations based on the present disclosure. The generated image obtained by the image generation method of the example, the image in the same row represents a group of content images, style images and generated images; the style conversion from the first row to the last row is from day to night, night to day, sunny to rainy. , Rainy to sunny, sunny to cloudy, cloudy to sunny, sunny to snow, and snowy to sunny style conversion, as can be seen from FIG. 6, the generated image based on the image generation method of the embodiment of the present disclosure can be The content information of the content image and the style information of the style image are retained.

In the training process of the neural network in the embodiments of the present disclosure, not only the forward propagation process from input to output is involved, but also the back propagation process from output to input; the training process of the neural network of the present disclosure can be used before Use the reverse process to generate images and use the reverse process to adjust the network parameters of the neural network. The following describes the neural network training method involved in the embodiments of the present disclosure.

FIG. 7 is a flowchart of a neural network training method according to an embodiment of the disclosure. As shown in FIG. 7, the process may include:

Step 701: Extract the content features of the content image by using the sequentially connected multi-layer first network unit blocks in the first neural network to obtain the content features respectively output by the first network unit blocks of each layer.

Step 702: Extract the style features of the style image.

Step 703: Feed forward the content features respectively output by the first network unit blocks of each layer into the sequentially connected multi-layer second network unit blocks in the second neural network, and remove the style features from the multi-layer second network unit blocks The first layer of the second network unit block feed forward input, and the generated image of the second neural network output is obtained after each second network unit block processes the respective input characteristics. Among them, the multi-layer first network unit block and the multi-layer second The network unit block corresponds.

The implementation of steps 701 to 703 in this embodiment is the same as the implementation of steps 101 to 103, and will not be repeated here.

Step 704: Discriminate the generated image, and obtain an identification result.

In the embodiments of the present disclosure, unlike the neural network testing method (ie, the method of image generation based on the trained neural network), during the neural network training process, the output image generated by the generator needs to be identified. Here, the purpose of discriminating the generated image is to determine the probability that the generated image is a real image; in practical applications, this step can be implemented using a discriminator or the like.

Step 705: Adjust the network parameters of the first neural network and/or the second neural network according to the content image, the style image, the generated image, and the identification result.

In practical applications, the network parameters of the first neural network and/or the second neural network can be adjusted based on the reverse process according to the content image, style image, generated image, and identification result, and then the forward process can be used to obtain the generated image again And the identification result, so, through the above-mentioned forward process and reverse process repeatedly, the network iterative optimization of the neural network is performed until the predetermined training completion conditions are met, and the trained neural network for image generation can be obtained. .

In practical applications, steps 701 to 705 can be implemented by a processor in an electronic device. The aforementioned processor can be at least ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, microcontroller, and microprocessor. One kind.

In the embodiment of the present disclosure, both the content image and the style image can be determined according to actual needs, and the content image and the style image do not need to be a pair of images, which is easy to implement. In the image generation process of the neural network training process, the first network unit blocks of each layer of the first neural network can be used to extract the content features of the content image multiple times, thereby retaining more semantic information of the content image, so that Compared with the content image, the generated image retains more semantic information; in turn, the trained neural network can better maintain the semantic information of the content image.

Regarding the implementation of adjusting the network parameters of the second neural network, for example, the parameters of the above-mentioned multiplication operation and/or addition operation used in the second network unit block of each layer can be adjusted.

As an implementation manner, the adjusting the network parameters of the first neural network and/or the second neural network according to the content image, the style image, the generated image, and the identification result includes : Determine the Generative Adversarial Net (GAN) loss according to the content image, style image, generated image, and identification results; wherein the Generative Adversarial Net (GAN) loss is used to characterize the difference in content characteristics between the generated image and the content image. And the difference in style characteristics between the generated image and the style image; in one example, the generated confrontation network includes a generator and a discriminator; in response to the loss of the generated confrontation network that does not meet the first predetermined condition, adjust the first Network parameters of the neural network and/or the second neural network. In practical applications, the network parameters of the first neural network and/or the second neural network can be adjusted based on the generation of countermeasures against the network loss, and a minimax strategy can be adopted.

Here, the first predetermined condition may represent a predetermined training completion condition; it is understandable that according to the meaning of generating a confrontation network loss, training the neural network based on the loss of the generation confrontation network can make the generated image obtained based on the trained neural network, It has a high performance of maintaining the content characteristics of the content image and the style characteristics of the style image.

Optionally, the adjusting the network parameters of the first neural network and/or the second neural network according to the content image, the style image, the generated image, and the identification result, further includes: According to the generated image and the style image, determine the style loss; in response to the situation that the style loss does not meet the second predetermined condition, adjust the network parameters of the first neural network and/or the second neural network according to the style loss; wherein, the style loss It is used to characterize the difference between the style characteristics of the generated image and the style image.

Optionally, the adjusting the network parameters of the first neural network and/or the second neural network according to the content image, the style image, the generated image, and the identification result, further includes: Determine the content loss according to the generated image and the content image; in response to the content loss not meeting the third predetermined condition, adjust the network parameters of the first neural network and/or the second neural network according to the content loss; wherein the content loss is used for Characterize the difference in content characteristics between the generated image and the content image.

Optionally, the adjusting the network parameters of the first neural network and/or the second neural network according to the content image, the style image, the generated image, and the identification result, further includes: Determine the feature matching loss according to the output features of each intermediate layer second network unit block in each second network unit block and the style image; in response to the feature matching loss not satisfying the fourth predetermined condition, adjust according to the feature matching loss The network parameters of the first neural network and/or the second neural network; wherein, the feature matching loss is used to characterize the difference between the output feature of the second network unit block of each intermediate layer and the style feature of the style image.

The aforementioned second predetermined condition, third predetermined condition, and fourth predetermined condition may represent predetermined training completion conditions; it is understandable that according to the meaning of style loss, content loss or feature matching loss, it can be known that based on style loss, content loss or feature The matching loss training neural network can make the generated image based on the trained neural network have a higher performance of maintaining the content characteristics of the content image.

In the embodiments of the present disclosure, a neural network can be trained based on the foregoing one loss or multiple losses. When the neural network is trained based on one loss, the trained neural network can be obtained when the loss meets the corresponding predetermined condition; When training a neural network with multiple losses, it is necessary to obtain a trained neural network when the aforementioned multiple losses meet the corresponding predetermined conditions. When training a neural network based on multiple losses, since the loss of the neural network can be comprehensively considered from all aspects of neural network training, the accuracy of the style conversion of the trained neural network is higher.

In the embodiment of the present disclosure, the generation of countermeasure network loss, style loss, content loss, or feature matching loss can be represented by a loss function.

The following further describes the embodiments of the present disclosure through a specific application embodiment.

In this application embodiment, the training process of the neural network method can be implemented based on the content encoder, style encoder, generator and discriminator, etc., and the process of image generation based on the completion of the training of the neural network method can be based on the content encoder, style encoding Implements such as generators and generators.

FIG. 8 is a schematic structural diagram of the framework of the image generation method proposed by the application embodiment of the disclosure. As shown in FIG. 8, the input of the content encoder is the image to be processed (that is, the content image), which is used to extract the content characteristics of the content image; The encoder is responsible for extracting the style features of the style image; the generator combines the content features and style features of the first network unit blocks of different layers to generate high-quality images. It should be noted that the discriminator used in the neural network training process is not shown in FIG. 8.

Specifically, referring to FIG. 8, the content encoder includes multiple layers of residual blocks, CRB-1, CRB-2...CRB-T respectively represent the layer 1 residual block to the T layer residual block of the content encoder; generate The generator includes multiple layers of residual blocks, GB-1...GB-T-1, GB-T respectively represent the first layer to the T-th residual block of the generator. In response to the situation where i is between 1 and T, the output result of the i-th layer residual block of the content encoder is input to the T-i+1th layer residual block of the generator; the input of the style encoder is style The image is used to extract the style feature of the style image, and the style feature is input into the first layer residual block of the generator. The output image is obtained based on the output result of the T-th layer residual block GB-T of the generator.

In the application embodiment of the present disclosure, f ^{i is} defined as the content feature map output from the i-th layer residual block of the content encoder, using

Represents the output characteristics of the i-th residual block of the generator. Here, the i-th residual block of the content encoder corresponds to the T-i+1-th layer residual block of the generator;

F ⁱ with the same number of channels, N denotes the size of the batch, C ⁱ represents the number of the channel; H ⁱ and W ⁱ indicates the height and width, respectively. The activation value (n∈[1,N], c∈[1,C ⁱ ], h∈[1,H ⁱ ], ω∈[1,W ⁱ ]) can be expressed as formula (1).

among them,

with

Both correspond to the i-th residual block of the generator, and respectively represent the mean and standard deviation of the features output by the residual block of the previous layer (that is, the residual block of the second neural network),

with

It can be calculated according to formula (2).

with

Is the parameter of the i-th residual block of the generator,

with

It can be obtained by single-layer convolution of f ⁱ ; the image generation method of the application embodiment of the present disclosure is feature-adaptive, that is, the modulation parameter can be calculated directly based on the content characteristics of the content image; and in the related image generation method, the modulation The parameters are unchanged.

In the application embodiment of the present disclosure, the content encoder is represented as E _c and the style encoder is represented as E _s ; the potential distribution x _{s of the} style image is encoded by E _s , for example, z=E _s (x _s ).

Use χ _c and χ _{s to} represent the content image domain and style image domain, respectively, and the training samples (x _c , x _s ) are distributed from the edge in an unsupervised learning environment

with

Extracted from.

Figure 9a is a schematic structural diagram of the residual block of the content encoder in the application embodiment of the disclosure. As shown in Figure 9a, BN represents the BN layer, ReLu represents the ReLu layer, and Conv represents the convolutional layer.

Represents the summation operation; the structure of each residual block CRB of the content encoder is the structure of the standard residual block, and each residual block of the content encoder includes three convolutional layers, one of which is used to skip the connection (skip connection).

In the application embodiment of the present disclosure, the number of layers of the residual block of the generator and the content encoder is the same; Fig. 9b is a schematic structural diagram of the residual block of the generator in the application embodiment of the present disclosure, as shown in Fig. 9b, in the standard residual block On the basis of the difference block, the FADE module is used to replace the BN layer to obtain the structure of the residual block GB of each layer of the generator; in Figure 9b, F1, F2 and F3 represent the first FADE module, the second FADE module and the third FADE module, respectively. FADE module; in each residual block of the generator, the input of each FADE module includes the corresponding content feature map output by the content encoder, refer to Figure 9b, in each residual block of the generator, in the generator Among the three FADE modules of each residual block, the input of F1 and F2 also includes the output characteristics of the previous residual block of the second neural network, and the input of F3 also includes the F1, ReLu layer and convolutional layer in turn Features obtained after processing.

Fig. 9c is a schematic diagram of the structure of the FADE module of the application embodiment of the present disclosure.

Represents the multiplication operation,

Means addition; Conv means convolutional layer, BN means BN layer; Υ and β represent the modulation parameters of each residual block of the generator. It can be seen that FADE takes the content feature map as input, which can be obtained from the convolutional features Derive denormalization parameters.

In the application embodiment of the present disclosure, through the fine design of the connection structure of the content encoder and generator, the trained neural network is made to adaptively transform the content image under the control of the style image.

As an implementation method, the style encoder is proposed based on the Variational Adaptive Encoder (VAE). The output of the style encoder is a mean vector

And standard deviation vector

Latent code (latent code) z is derived from the resampling of style images after encoding

Since the sampling operation is not differentiable, here, the reparameterization trick can be used to convert the sampling into a differentiable operation. Let η be a uniformly distributed random vector with the same size as z; here, η～Ν(η|0,1), then z can be re-parameterized as

Through this operation, we can train a style encoder with backward propagation and train the entire network as an end-to-end model.

In the application embodiments of the present disclosure, various parts of the entire neural network can be jointly trained. For the training of the neural network, the loss function of the entire first neural network can be calculated by referring to formula (3) based on the optimization of the minimax strategy, and then the training of the first neural network can be realized.

Among them, G represents the generator, D represents the discriminator, L _VAE (E _s , G) represents the style loss. Illustratively, the style loss can be the loss of Kullback-Leibler divergence; L _VAE (E _s , G) can be calculated according to formula (4).

L _VAE (E _s ,G)=λ ₀ KL(q(z|x _s )||p _η (z)) (4)

Among them, KL(·) represents the KL divergence, and λ ₀ represents the hyperparameter in L _VAE (E _s ,G).

L _GAN (E _s ,E _c ,G,D) represents the loss of the generated adversarial network, which is used in the adversarial training of the generator and discriminator; L _GAN (E _s ,E _c ,G,D) can be based on the formula ( 5) Perform calculations.

among them,

with

Denotes mathematical expectation, D(·) denotes discriminator, G(·) denotes generator, E _c (x _c ) denotes encoder, λ ₁ denotes hyperparameters in L _GAN (E _s ,E _c ,G,D) .

L _VGG (E _s , E _c , G) represents content loss. Illustratively, the content loss may be a VGG (Visual Geometry Group) loss. L _VGG (E _s , E _c , G) can be calculated according to formula (6).

among them,

Represents the activation map of the mth layer selected from the total M layers,

Means

The number of elements, λ ₂ and

Is the corresponding hyperparameter in L _VGG (E _s ,E _c ,G),

Represents the output image obtained by the generator,

||·|| ₁ means 1-norm.

L _FM (E _s , E _c , G) represents the feature matching loss; L _FM (E _s , E _c , G) can be calculated according to formula (7).

among them,

Represents the k-th scale discriminator i-th layer (k multiscale discriminator having different scales), N _i represents the total number of elements in a discriminator layer i, Q represents the number of layers; λ _* loss in all of the above functions , Are all corresponding weights. The VGG loss has different weights in different layers.

In the application embodiment of the present disclosure, the first neural network is trained based on multi-scale discriminators, and each discriminator on different scales has exactly the same structure; the discriminator with the roughest scale has the largest receptive field; The discriminator can distinguish higher-resolution images.

Those skilled in the art can understand that in the above methods of the specific implementation, the writing order of the steps does not mean a strict execution order but constitutes any limitation on the implementation process. The specific execution order of each step should be based on its function and possibility. The inner logic is determined.

Based on the image generation method proposed in the foregoing embodiment, an embodiment of the present disclosure proposes an image generation device. FIG. 10 is a schematic diagram of the composition structure of an image generation device according to an embodiment of the disclosure. As shown in FIG. 10, the device includes: a first extraction module 1001, a second extraction module 1002, and a first processing module 1003, wherein:

The first extraction module 1001 is configured to extract the content features of the content image by using the sequentially connected multi-layer first network unit blocks in the first neural network to obtain the content features respectively output by the first network unit blocks of each layer;

The second extraction module 1002 is used to extract style features of style images;

The first processing module 1003 is configured to feed forward the content features respectively output by the first network unit blocks of each layer into the sequentially connected multi-layer second network unit blocks in the second neural network, and combine the The style features are fed forward from the first layer second network unit block in the multi-layer second network unit block, and the second neural network output is obtained after each second network unit block processes the respective input features An image is generated, wherein the multi-layer first network unit block corresponds to the multi-layer second network unit block.

Optionally, the first processing module 1003 is configured to feed forward the content features output by the first network unit block of the i-th layer to the T-i+1th layer in response to the situation that i takes 1 to T in sequence. In the second network unit block, i is a positive integer, and T represents the number of layers of the first network unit block of the first neural network and the second network unit block of the second neural network.

Optionally, the first-level second network unit block in each of the second network unit blocks is used to multiply the content feature from the last-level first network unit block and the style feature to obtain the Intermediate features of the first-level second network unit block; add the content features from the first-level first-level network unit block and the intermediate features of the first-level second-level network unit block to obtain the first-level second network The output characteristics of the unit block; input the output characteristics of the first layer second network unit block into the second layer second network unit block.

Optionally, the first-layer second network unit block is also used to perform multiplication operations on the first network unit from the last layer before the content feature from the first network unit block at the last layer and the style feature are multiplied. The content feature of the block is subjected to convolution operation.

Optionally, the middle layer second network unit block in each of the second network unit blocks is used to multiply the input content feature and the output feature of the second network unit block of the upper layer to obtain the Intermediate features of the second network unit block of the middle layer; add the content features of the input and the intermediate features of the second network unit block of the middle layer to obtain the output features of the second network unit block of the middle layer; The output characteristics of the second network unit block of the middle layer are input to the second network unit block of the next layer.

Optionally, the middle layer second network unit block is further configured to perform multiplication on the received content feature before multiplying the input content feature and the output feature of the upper layer second network unit block Convolution operation.

Optionally, the last-level second network unit block in each of the second network unit blocks is used to combine the content characteristics from the first-level first network unit block and the output characteristics of the upper-level second network unit block Perform a multiplication operation to obtain the intermediate feature of the second network unit block of the last layer; add the content feature from the first network unit block of the first layer and the intermediate feature of the last second network unit block to obtain The generated image.

Optionally, the second network unit block of the last layer is used to perform multiplication operations on the content feature from the first network unit block of the first layer and the output feature of the second network unit block of the upper layer. The content features from the first network unit block of the first layer are convolved.

Optionally, the second extraction module 1002 is configured to extract features of the style image distribution; sampling the features of the style image distribution to obtain the style feature, the style feature includes the style image distribution The mean and standard deviation of the features.

Optionally, the first network unit block is configured to extract content features of content images based on multiple neural network layers organized in a residual structure in the first network unit block; and/or, the second network The unit block is used to process the features input to the second network unit based on multiple neural network layers organized in a residual structure in the second network unit block.

In practical applications, the first extraction module 1001, the second extraction module 1002, and the first processing module 1003 can all be implemented by processors. The aforementioned processors can be ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, micro At least one of a controller and a microprocessor.

In addition, the functional modules in this embodiment can be integrated into one processing unit, or each unit can exist alone physically, or two or more units can be integrated into one unit. The above-mentioned integrated unit can be realized in the form of hardware or software function module.

If the integrated unit is implemented in the form of a software function module and is not sold or used as an independent product, it can be stored in a computer readable storage medium. Based on this understanding, the technical solution of this embodiment is essentially or It is said that the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to enable a computer device (which can A personal computer, server, or network device, etc.) or a processor (processor) executes all or part of the steps of the method described in this embodiment. The aforementioned storage media include: U disk, mobile hard disk, read only memory (Read Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other media that can store program codes.

Specifically, the computer program instructions corresponding to an image generation method or neural network training method in this embodiment can be stored on storage media such as optical disks, hard disks, and USB flash drives. When the storage medium is related to an image generation method Or when the computer program instructions corresponding to the neural network training method are read or executed by an electronic device, any one of the image generation methods or any one of the neural network training methods of the foregoing embodiments is implemented.

Based on the same technical concept of the foregoing embodiment, refer to FIG. 11, which shows an electronic device 11 provided by an embodiment of the present disclosure. The electronic device 11 includes: a memory 111 and a processor 112; wherein, the memory 111 is used for A computer program is stored; the processor 112 is configured to execute the computer program stored in the memory to implement any image generation method or any neural network training method in the foregoing embodiments.

The various components in the electronic device 11 may be coupled together through a bus system. It can be understood that the bus system is used to realize the connection and communication between these components. In addition to the data bus, the bus system also includes a power bus, a control bus, and a status signal bus. However, for the sake of clear description, various buses are marked as bus systems in FIG. 11.

In practical applications, the aforementioned memory 111 may be a volatile memory (volatile memory), such as RAM; or a non-volatile memory (non-volatile memory), such as ROM, flash memory, or hard disk (Hard Disk). Drive, HDD) or Solid-State Drive (SSD); or a combination of the foregoing types of memories, and provide instructions and data to the processor 112.

The aforementioned processor 112 may be at least one of ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, microcontroller, and microprocessor. It can be understood that, for different devices, the electronic devices used to implement the above-mentioned processor functions may also be other, which is not specifically limited in the embodiment of the present disclosure.

FIG. 12 is a schematic diagram of the composition structure of a neural network training device according to an embodiment of the disclosure. As shown in FIG. 12, the device includes: a third extraction module 1201, a fourth extraction module 1202, a second processing module 1203, and an adjustment module 1204; among them,

The third extraction module 1201 is configured to extract the content features of the content image by using the sequentially connected multi-layer first network unit blocks in the first neural network to obtain the content features respectively output by the first network unit blocks of each layer;

The fourth extraction module 1202 is used to extract style features of the style image;

The second processing module 1203 is configured to feed forward the content characteristics respectively output by the first network unit blocks of each layer into the sequentially connected multi-layer second network unit blocks in the second neural network, and combine the The style feature is fed forward from the first layer second network unit block in the multi-layer second network unit block, and the output of the second neural network is obtained after each second network unit block processes the respective input features Generate an image; identify the generated image to obtain an authentication result; wherein the multi-layer first network unit block corresponds to the multi-layer second network unit block;

The adjustment module 1204 is configured to adjust the network parameters of the first neural network and/or the second neural network according to the content image, the style image, the generated image, and the identification result.

Optionally, the second processing module 1203 is configured to feed forward the content features output by the first network unit block of the i-th layer to the T-i+1th layer in response to the situation where i takes 1 to T in sequence. In the second network unit block, i is a positive integer, and T represents the number of layers of the first network unit block of the first neural network and the second network unit block of the second neural network.

Optionally, the last-level second network unit block is also used to perform multiplication operations on the content feature from the first network unit block of the first layer and the output feature of the second network unit block of the upper layer. The content feature from the first network unit block of the first layer is subjected to a convolution operation.

Optionally, the adjustment module 1204 is configured to adjust the multiplication operation parameter and/or the addition operation parameter.

Optionally, the adjustment module 1204 is configured to determine, according to the content image, the style image, the generated image, and the identification result, to generate a countermeasure network loss; in response to the generation countermeasure network loss that does not meet the first Under a predetermined condition, adjust the network parameters of the first neural network and/or the second neural network according to the loss of the generated confrontation network; wherein the loss of the generated confrontation network is used to characterize the difference between the generated image and The content feature difference of the content image, and the style feature difference between the generated image and the style image.

Optionally, the adjustment module 1204 is further configured to determine a style loss according to the generated image and the style image; in response to the situation that the style loss does not meet a second predetermined condition, adjust according to the style loss The network parameters of the first neural network and/or the second neural network; wherein the style loss is used to characterize the difference between the style characteristics of the generated image and the style image.

Optionally, the adjustment module 1204 is further configured to determine the content loss according to the generated image and the content image; in response to the content loss not satisfying a third predetermined condition, adjust the content loss according to the content loss The network parameters of the first neural network and/or the second neural network; wherein the content loss is used to characterize the content feature difference between the generated image and the content image.

Optionally, the adjustment module 1204 is further configured to determine the feature matching loss according to the output feature of each intermediate layer second network unit block in each second network unit block and the style image; in response to the feature If the matching loss does not meet the fourth predetermined condition, adjust the network parameters of the first neural network and/or the second neural network according to the feature matching loss; wherein, the feature matching loss is used to characterize the The difference between the output feature of the second network unit block of each middle layer and the style feature of the style image.

Optionally, the fourth extraction module 1202 is configured to extract features of the style image distribution; sampling the features of the style image distribution to obtain the style feature, the style feature includes the style image distribution The mean and standard deviation of the features.

In practical applications, the third extraction module 1201, the fourth extraction module 1202, the second processing module 1203, and the adjustment module 1204 can all be implemented by a processor, and the processor can be ASIC, DSP, DSPD, PLD, FPGA, CPU , At least one of a controller, a microcontroller, and a microprocessor.

In some embodiments, the functions or modules contained in the device provided in the embodiments of the present disclosure can be used to execute the methods described in the above method embodiments. For specific implementation, refer to the description of the above method embodiments. For brevity, here No longer.

In an exemplary embodiment, the embodiment of the present disclosure further provides a computer storage medium, such as the memory 111 including a computer program, which can be executed by the processor 112 of the electronic device 11 to complete the steps described in the foregoing method. The computer-readable storage medium can be FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface memory, optical disk, or CD-ROM, etc.; it can also be a variety of devices including one or any combination of the foregoing memories, such as Mobile phones, computers, tablet devices, personal digital assistants, etc.

The embodiments of the present disclosure provide a computer storage medium on which a computer program is stored, and when the computer program is executed by a processor, any image generation method or any neural network training method in the foregoing embodiments is implemented.

The foregoing description of the various embodiments tends to emphasize the differences between the various embodiments, and the same or similarities can be referred to each other, and for the sake of brevity, details are not repeated herein.

The features disclosed in each method or product embodiment provided in the embodiments of the present disclosure can be combined arbitrarily without conflict to obtain a new method embodiment or product embodiment.

Through the description of the above embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by means of software plus the necessary general hardware platform. Of course, it can also be implemented by hardware, but in many cases the former is better.的实施方式。 Based on this understanding, the technical solution of the present invention essentially or the part that contributes to the existing technology can be embodied in the form of a software product, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, The optical disc) includes several instructions to enable a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the method described in each embodiment of the present invention.

The embodiments of the present invention are described above with reference to the accompanying drawings, but the embodiments of the present disclosure are not limited to the above-mentioned specific embodiments. The above-mentioned specific embodiments are only illustrative, not restrictive, and are common in the art. Under the enlightenment of the present invention, persons can make many forms without departing from the purpose of the embodiments of the present disclosure and the scope of protection of the claims, and these are all protected by the embodiments of the present disclosure.

Claims

An image generation method, the method includes:

Extract the content features of the content image by using the sequentially connected multi-layer first network unit blocks in the first neural network to obtain the content features respectively output by the first network unit blocks of each layer;

Extract style features of style images;

The content features respectively output by the first network unit blocks of each layer correspond to the feedforward input to the sequentially connected multi-layer second network unit blocks in the second neural network, and the style features are removed from the multi-layer second network The first layer of the second network unit block in the unit block feeds forward input, and the generated image output by the second neural network is obtained after each of the second network unit blocks processes the respective input characteristics, wherein the multi-layer first A network unit block corresponds to the multi-layer second network unit block.
The method according to claim 1, wherein the content features respectively output from the first network unit blocks of each layer correspond to the feedforward input into the second neural network of the sequentially connected multi-layer second network unit blocks, comprising :

In response to the situation where i takes 1 to T in turn, feed forward the content features output by the first network unit block of the i-th layer into the second network unit block of the T-i+1th layer, where i is a positive integer, and T represents all The number of layers of the first network unit block of the first neural network and the second network unit block of the second neural network.
The method according to claim 1 or 2, wherein the feature processing of the first-level second network unit block in each of the second network unit blocks on the input includes:

Multiply the content feature from the first network unit block of the last layer and the style feature to obtain the intermediate feature of the second network unit block of the first layer; combine the content feature from the first network unit block of the last layer with The intermediate features of the first-level second network unit block are added to obtain the output features of the first-level second network unit block; the output features of the first-level second network unit block are input into the second-level second Network unit block.
The method according to claim 3, wherein the method further comprises: before multiplying the content feature from the first network unit block of the last layer and the style feature, performing the multiplication operation on the first network unit from the last layer The content feature of the block is subjected to convolution operation.
The method according to any one of claims 1 to 4, wherein the processing of input characteristics by the second network unit block of the middle layer in each of the second network unit blocks comprises:

Multiply the input content feature and the output feature of the upper layer second network unit block to obtain the intermediate feature of the middle layer second network unit block; compare the input content feature with the middle layer second network The intermediate characteristics of the unit blocks are added to obtain the output characteristics of the second network unit block of the intermediate layer; the output characteristics of the second network unit block of the intermediate layer are input into the input of the second network unit block of the next layer.
The method according to claim 5, wherein the method further comprises: performing multiplication on the received content feature before multiplying the input content feature and the output feature of the second network unit block of the upper layer Convolution operation.
The method according to any one of claims 1 to 6, wherein the characteristic processing of the input by the last-layer second network unit block in each of the second network unit blocks comprises:

Multiply the content feature from the first network unit block of the first layer and the output feature of the second network unit block of the upper layer to obtain the intermediate feature of the second network unit block of the last layer; The content feature of the network unit block and the intermediate feature of the second network unit block of the last layer are added to obtain the generated image.
The method according to claim 7, wherein the method further comprises: before multiplying the content feature from the first network unit block of the first layer and the output feature of the second network unit block of the previous layer, The content feature from the first network unit block of the first layer is subjected to a convolution operation.
8. The method according to any one of claims 1 to 8, wherein said extracting the style features of the style image comprises: extracting the features of the style image distribution;

The characteristics of the style image distribution are sampled to obtain the style characteristics, and the style characteristics include the mean value and the standard deviation of the characteristics of the style image distribution.
The method according to any one of claims 1 to 9, wherein the first network unit block extracting content features of the content image comprises: based on a plurality of nerves organized in a residual structure in the first network unit block The network layer extracts the content features of the content image; and/or,

Processing the input features of the second network unit block includes: processing the features input to the second network unit based on a plurality of neural network layers organized in a residual structure in the second network unit block deal with.
A neural network training method, the method further includes:

Extract the content features of the content image by using the sequentially connected multi-layer first network unit blocks in the first neural network to obtain the content features respectively output by the first network unit blocks of each layer;

Extract style features of style images;

The content features respectively output by the first network unit blocks of each layer correspond to the feedforward input to the sequentially connected multi-layer second network unit blocks in the second neural network, and the style features are removed from the multi-layer second network The first layer of the second network unit block in the unit block feeds forward input, and the generated image output by the second neural network is obtained after each of the second network unit blocks processes the respective input characteristics, wherein the multi-layer first A network unit block corresponds to the multi-layer second network unit block;

Discriminate the generated image to obtain an identification result;

Adjust the network parameters of the first neural network and/or the second neural network according to the content image, the style image, the generated image, and the identification result.
The method according to claim 11, wherein said inputting the content features respectively outputted by the first network unit blocks of each layer into the sequentially connected multi-layer second network unit blocks in the second neural network corresponding to the feedforward input comprises:

In response to the situation where i takes 1 to T in turn, feed forward the content features output by the first network unit block of the i-th layer into the second network unit block of the T-i+1th layer, where i is a positive integer, and T represents all The number of layers of the first network unit block of the first neural network and the second network unit block of the second neural network.
The method according to claim 11 or 12, wherein the feature processing of the first-level second network unit block in each of the second network unit blocks on the input includes:

Multiply the content feature from the first network unit block of the last layer and the style feature to obtain the intermediate feature of the second network unit block of the first layer; combine the content feature from the first network unit block of the last layer with Add the intermediate features of the first-level second network unit block to obtain the output feature of the first-level second network unit block; input the output feature of the first-level second network unit block into the second-level second network Unit block.
The method according to claim 13, wherein the method further comprises:

Before multiplying the content feature from the first network unit block of the last layer with the style feature, perform a convolution operation on the content feature from the first network unit block of the last layer.
The method according to any one of claims 11 to 14, wherein the processing of input characteristics by the second network unit block of the middle layer in each of the second network unit blocks comprises:

Multiply the input content feature and the output feature of the upper layer second network unit block to obtain the intermediate feature of the middle layer second network unit block; compare the input content feature with the middle layer second network The intermediate characteristics of the unit blocks are added to obtain the output characteristics of the second network unit block of the intermediate layer; the output characteristics of the second network unit block of the intermediate layer are input to the second network unit block of the next layer.
15. The method according to claim 15, wherein the method further comprises: performing multiplication on the received content feature before multiplying the input content feature with the output feature of the second network unit block of the upper layer Convolution operation.
The method according to any one of claims 11 to 16, wherein the characteristic processing of the input by the last second network unit block in each of the second network unit blocks comprises:

Multiply the content feature from the first network unit block of the first layer and the output feature of the second network unit block of the upper layer to obtain the intermediate feature of the second network unit block of the last layer; The content feature of the network unit block and the intermediate feature of the second network unit block of the last layer are added to obtain the generated image.
The method according to claim 17, wherein the method further comprises: before multiplying the content feature from the first network unit block of the first layer and the output feature of the second network unit block of the previous layer, The content feature from the first network unit block of the first layer is subjected to a convolution operation.
The method according to any one of claims 13 to 18, wherein adjusting the network parameters of the second neural network comprises: adjusting the multiplication operation parameters and/or addition operation parameters.
The method according to any one of claims 11 to 19, wherein the first neural network and/or the first neural network is adjusted according to the content image, the style image, the generated image, and the identification result. The network parameters of the second neural network include:

According to the content image, the style image, the generated image, and the identification result, determine to generate a counter-network loss;

In response to the situation that the loss of the generative confrontation network does not meet the first predetermined condition, adjust the network parameters of the first neural network and/or the second neural network according to the loss of the generative confrontation network; wherein, the generation The anti-network loss is used to characterize the content feature difference between the generated image and the content image, and the style feature difference between the generated image and the style image.
22. The method of claim 20, wherein the first neural network and/or the second neural network are adjusted according to the content image, the style image, the generated image, and the identification result The network parameters further include: determining a style loss according to the generated image and the style image;

In response to the situation that the style loss does not meet the second predetermined condition, adjust the network parameters of the first neural network and/or the second neural network according to the style loss; wherein the style loss is used for characterization The difference between the style characteristics of the generated image and the style image.
The method according to claim 20 or 21, wherein the first neural network and/or the second neural network are adjusted according to the content image, the style image, the generated image, and the identification result. The network parameters of the neural network also include:

Determine content loss according to the generated image and the content image;

In response to the situation that the content loss does not meet the third predetermined condition, adjust the network parameters of the first neural network and/or the second neural network according to the content loss; wherein the content loss is used to characterize The content feature difference between the generated image and the content image.
The method according to any one of claims 20-22, wherein the first neural network and/or the first neural network is adjusted according to the content image, the style image, the generated image, and the identification result. The network parameters of the second neural network also include:

Determine the feature matching loss according to the output feature of each intermediate layer second network unit block in each second network unit block and the style image;

In response to the case that the feature matching loss does not meet the fourth predetermined condition, adjust the network parameters of the first neural network and/or the second neural network according to the feature matching loss; wherein, the feature matching loss It is used to characterize the difference between the output feature of the second network unit block of each intermediate layer and the style feature of the style image.
The method according to any one of claims 11 to 23, wherein the extracting the style features of the style image comprises: extracting the features of the style image distribution;

The characteristics of the style image distribution are sampled to obtain the style characteristics, and the style characteristics include the mean value and the standard deviation of the characteristics of the style image distribution.
The method according to any one of claims 11 to 24, wherein the extraction of the content feature of the content image by the first network unit block comprises: based on a plurality of nerves organized in a residual structure in the first network unit block The network layer extracts the content features of the content image; and/or,

Processing the input features of the second network unit block includes: processing the features input to the second network unit based on a plurality of neural network layers organized in a residual structure in the second network unit block deal with.
An image generation device, the device includes a first extraction module, a second extraction module, and a first processing module, wherein:

The first extraction module is configured to extract the content features of the content image by using the sequentially connected multi-layer first network unit blocks in the first neural network to obtain the content features respectively output by the first network unit blocks of each layer;

The second extraction module is used to extract the style features of the style image;

The first processing module is configured to feed forward the content features respectively output by the first network unit blocks of each layer into the sequentially connected multi-layer second network unit blocks in the second neural network, and change the style features from Feed forward input of the first layer second network unit block in the multi-layer second network unit block, and obtain the generated image output by the second neural network after each of the second network unit blocks process the characteristics of the respective input, Wherein, the multi-layer first network unit block corresponds to the multi-layer second network unit block.
The apparatus according to claim 26, wherein the first processing module is configured to feed forward the content feature output by the first network unit block of the i-th layer to the T-th in response to the situation that i takes 1 to T in sequence. -i+1 layer of the second network unit block, i is a positive integer, and T represents the number of layers of the first network unit block of the first neural network and the second network unit block of the second neural network.
The device according to claim 26 or 27, wherein the first-level second network unit block in each of the second network unit blocks is used to combine the content characteristics from the last-level first network unit block with the The style feature is multiplied to obtain the intermediate feature of the first-level second network unit block; the content feature from the last-level first network unit block is added to the intermediate feature of the first-level second network unit block , Obtain the output characteristics of the first layer second network unit block; input the output characteristics of the first layer second network unit block into the second layer second network unit block.
The device according to claim 28, wherein the first-level second network unit block is further configured to perform multiplication on the content feature from the last-level first network unit block and the style feature. The content features of the first network unit block from the last layer are subjected to convolution operation.
The device according to any one of claims 26 to 29, wherein the middle layer second network unit block in each of the second network unit blocks is used to compare the input content characteristics and the upper layer second network The output feature of the unit block is multiplied to obtain the intermediate feature of the second network unit block of the intermediate layer; the content feature of the input and the intermediate feature of the second network unit block of the intermediate layer are added to obtain the Output characteristics of the second network unit block of the middle layer; input the output characteristics of the second network unit block of the middle layer into the second network unit block of the next layer.
The device according to claim 30, wherein the middle-level second network unit block is further configured to perform multiplication operations on the content feature of the input and the output feature of the upper-level second network unit block. The received content feature performs a convolution operation.
The apparatus according to any one of claims 26 to 31, wherein the last-layer second network unit block in each of the second network unit blocks is used to combine content characteristics from the first-layer first network unit block Multiply the output feature of the second network unit block of the upper layer to obtain the intermediate feature of the second network unit block of the last layer; combine the content feature from the first network unit block of the first layer with the first network unit block of the last layer. The intermediate features of the two network unit blocks are added to obtain the generated image.
The apparatus according to claim 32, wherein the second network unit block of the last layer is used to compare the content characteristics of the first network unit block from the first layer and the output characteristics of the second network unit block of the upper layer. Before performing the multiplication operation, perform a convolution operation on the content feature from the first network unit block of the first layer.
The apparatus according to any one of claims 26 to 33, wherein the second extraction module is configured to extract the characteristics of the style image distribution; sampling the characteristics of the style image distribution to obtain the style characteristics The style feature includes the mean value and standard deviation of the feature of the style image distribution.
The apparatus according to any one of claims 26 to 34, wherein the first network unit block is configured to extract content images based on multiple neural network layers organized in a residual structure in the first network unit block Content characteristics; and/or,

The second network unit block is used to process the features input to the second network unit based on multiple neural network layers organized in a residual structure in the second network unit block.
A neural network training device, which includes a third extraction module, a fourth extraction module, a second processing module, and an adjustment module; wherein,

The third extraction module is configured to extract the content features of the content image by using the sequentially connected multi-layer first network unit blocks in the first neural network to obtain the content features respectively output by the first network unit blocks of each layer;

The fourth extraction module is used to extract the style features of the style image;

The second processing module is configured to feed forward the content features respectively output by the first network unit blocks of each layer into the sequentially connected multi-layer second network unit blocks in the second neural network, and change the style features from Feed forward input of the first layer of the second network unit block in the multi-layer second network unit block, and obtain the generated image output by the second neural network after each of the second network unit blocks processes the characteristics of the respective input; Authenticating the generated image to obtain an authentication result; wherein the multi-layer first network unit block corresponds to the multi-layer second network unit block;

The adjustment module is configured to adjust the network parameters of the first neural network and/or the second neural network according to the content image, the style image, the generated image, and the identification result.
The device according to claim 36, wherein the second processing module is configured to feed forward the content feature output by the first network unit block of the i-th layer to the T-th in response to the situation that i takes 1 to T in sequence. -i+1 layer of the second network unit block, i is a positive integer, and T represents the number of layers of the first network unit block of the first neural network and the second network unit block of the second neural network.
The apparatus according to claim 36 or 37, wherein the first-level second network unit block in each of the second network unit blocks is used to combine the content characteristics from the last-level first network unit block with the The style feature is multiplied to obtain the intermediate feature of the first-level second network unit block; the content feature from the last-level first network unit block is added to the intermediate feature of the first-level second network unit block , Obtain the output characteristics of the first layer second network unit block; input the output characteristics of the first layer second network unit block into the second layer second network unit block.
The apparatus according to claim 38, wherein the first-level second network unit block is further configured to perform multiplication on the content feature from the last-level first network unit block and the style feature. The content features of the first network unit block from the last layer are subjected to convolution operation.
The device according to any one of claims 36 to 39, wherein the middle layer second network unit block in each of the second network unit blocks is used to compare the input content characteristics and the upper layer second network The output feature of the unit block is multiplied to obtain the intermediate feature of the second network unit block of the intermediate layer; the content feature of the input and the intermediate feature of the second network unit block of the intermediate layer are added to obtain the Output characteristics of the second network unit block of the middle layer; input the output characteristics of the second network unit block of the middle layer into the second network unit block of the next layer.
The device according to claim 40, wherein the second network unit block of the middle layer is further configured to perform multiplication operations on the content characteristics of the input and the output characteristics of the second network unit block of the upper layer. The received content feature performs a convolution operation.
The device according to any one of claims 36 to 41, wherein the last-layer second network unit block in each of the second network unit blocks is used to combine content characteristics from the first-layer first network unit block Multiply the output feature of the second network unit block of the upper layer to obtain the intermediate feature of the second network unit block of the last layer; combine the content feature from the first network unit block of the first layer with the first network unit block of the last layer. The intermediate features of the two network unit blocks are added to obtain the generated image.
The device according to claim 42, wherein the last-level second network unit block is further used to compare the content characteristics from the first-level first network unit block and the output of the upper-level second network unit block. Before the feature is multiplied, convolution is performed on the content feature from the first network unit block of the first layer.
The device according to any one of claims 38 to 43, wherein the adjustment module is configured to adjust the multiplication operation parameter and/or the addition operation parameter.
The device according to any one of claims 36 to 44, wherein the adjustment module is configured to determine to generate a counter network loss according to the content image, the style image, the generated image, and the identification result; In response to the situation that the loss of the generative confrontation network does not meet the first predetermined condition, adjust the network parameters of the first neural network and/or the second neural network according to the loss of the generative confrontation network; wherein, the generation The anti-network loss is used to characterize the difference in content characteristics between the generated image and the content image, and the difference in style characteristics between the generated image and the style image.
The device according to claim 45, wherein the adjustment module is further configured to determine a style loss according to the generated image and the style image; in response to the situation that the style loss does not meet a second predetermined condition, according to The style loss adjusts the network parameters of the first neural network and/or the second neural network; wherein the style loss is used to characterize the difference between the style characteristics of the generated image and the style image.
The device according to claim 45 or 46, wherein the adjustment module is further configured to determine a content loss based on the generated image and the content image; in response to a situation where the content loss does not meet a third predetermined condition Adjust the network parameters of the first neural network and/or the second neural network according to the content loss; wherein the content loss is used to characterize the difference in content characteristics between the generated image and the content image.
The apparatus according to any one of claims 45 to 47, wherein the adjustment module is further configured to perform according to the output characteristics of each intermediate layer second network unit block in each of the second network unit blocks and the style image , Determine the feature matching loss;

In response to the case that the feature matching loss does not meet the fourth predetermined condition, adjust the network parameters of the first neural network and/or the second neural network according to the feature matching loss; wherein, the feature matching loss It is used to characterize the difference between the output feature of the second network unit block of each intermediate layer and the style feature of the style image.
The apparatus according to any one of claims 36 to 48, wherein the fourth extraction module is configured to extract the characteristics of the style image distribution; sampling the characteristics of the style image distribution to obtain the style characteristics The style feature includes the mean value and standard deviation of the feature of the style image distribution.
The device according to any one of claims 36 to 49, wherein the first network unit block is configured to extract content images based on multiple neural network layers organized in a residual structure in the first network unit block Content characteristics; and/or,

The second network unit block is used to process the features input to the second network unit based on multiple neural network layers organized in a residual structure in the second network unit block.
An electronic device including a processor and a memory for storing a computer program that can run on the processor; wherein,

When the processor is used to run the computer program, it executes the image generation method according to any one of claims 1 to 10 or the neural network training method according to any one of claims 11 to 25.
A computer storage medium on which a computer program is stored, which when executed by a processor realizes the image generation method of any one of claims 1 to 10 or the neural network of any one of claims 11 to 25 Training method.