CN111539897A - Method and apparatus for generating image conversion model - Google Patents

Method and apparatus for generating image conversion model

Info

Publication number
CN111539897A
Authority
CN
China
Prior art keywords
image
domain
sample
network
loss value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010386593.5A
Other languages
Chinese (zh)
Inventor
杨少雄
赵晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010386593.5A
Publication of CN111539897A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T3/04
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The application discloses a method and apparatus for generating an image conversion model, and relates to the technical field of augmented reality. The specific implementation scheme is as follows: acquiring a preset sample set, wherein the sample set contains at least one sample and each sample comprises an image of a first domain and an image of a second domain; acquiring a pre-established generative adversarial network; selecting a sample from the sample set, and inputting the image of the first domain of the sample into the generation network to obtain a pseudo image of the second domain of the sample; extracting, through a perception module, the high-level features of the pseudo image of the second domain of the sample and the high-level features of the image of the second domain of the sample, respectively; inputting the extracted high-level features together into the discrimination network and calculating a loss value; and, if training is finished, using the generation network as the image conversion model. This embodiment speeds up model convergence, and the generated picture retains more high-frequency information.

Description

Method and apparatus for generating image conversion model
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to the technical field of augmented reality.
Background
In recent years, with the rapid development of computer technology, image processing technology has been applied in many areas, for example gender conversion of face images and style conversion of artwork.
In the related art, gender conversion of face images is mainly implemented with a GAN (Generative Adversarial Network).
With this approach, the generated picture has poor definition, is blurry, lacks high-frequency detail information, and is not rich enough. In addition, the model converges slowly and requires a long training time.
Disclosure of Invention
A method, apparatus, device, and storage medium for generating an image transformation model are provided.
According to a first aspect, there is provided a method for generating an image conversion model, comprising: acquiring a preset sample set, wherein the sample set contains at least one sample and each sample comprises an image of a first domain and an image of a second domain; acquiring a pre-established generative adversarial network, wherein the generative adversarial network comprises a generation network, a discrimination network, and a pre-trained perception module; selecting a sample from the sample set, and performing the following training steps: inputting the image of the first domain of the sample into the generation network to obtain a pseudo image of the second domain of the sample; extracting, through the perception module, the high-level features of the pseudo image of the second domain of the sample and the high-level features of the image of the second domain of the sample, respectively; inputting the extracted high-level features together into the discrimination network and calculating a loss value; and, if the generative adversarial network satisfies a training completion condition, using the generation network as the image conversion model.
According to a second aspect, there is provided a method for converting an image, comprising: acquiring an image to be converted; the image is input into the image conversion model generated by the method according to the first aspect, and the converted image is output.
According to a third aspect, there is provided an apparatus for generating an image conversion model, comprising: a sample acquisition unit configured to acquire a preset sample set, wherein the sample set contains at least one sample and each sample comprises an image of a first domain and an image of a second domain; a network acquisition unit configured to acquire a pre-established generative adversarial network, wherein the generative adversarial network comprises a generation network, a discrimination network, and a pre-trained perception module; a selection unit configured to select a sample from the sample set; a generation unit configured to input the image of the first domain of the sample into the generation network to obtain a pseudo image of the second domain of the sample; an extraction unit configured to extract, through the perception module, the high-level features of the pseudo image of the second domain of the sample and the high-level features of the image of the second domain of the sample, respectively; a calculation unit configured to input the extracted high-level features together into the discrimination network and calculate a loss value; and an output unit configured to use the generation network as the image conversion model if the generative adversarial network satisfies a training completion condition.
According to a fourth aspect, there is provided an apparatus for converting an image, comprising: an acquisition unit configured to acquire an image to be converted; a conversion unit configured to input the image into the image conversion model generated by the method according to the first aspect, and output the converted image.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first aspect.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of the first aspect.
According to the technology of the application, because what is propagated back during training is the distribution of high-level features rather than pixel-wise differences, model convergence is accelerated and the generated picture retains more high-frequency information. The definition and detail richness of the picture are greatly improved, so the generated image is clearer, more real, natural, and rich, with better quality and effect. The technology can be widely applied to tasks such as image translation and style conversion, and has strong application value.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating an image transformation model according to the present application;
FIG. 3 is a schematic diagram of a generative adversarial network for a method for generating an image transformation model according to the present application;
FIG. 4 is a flow diagram of yet another embodiment of a method for generating an image transformation model according to the present application;
FIGS. 5a-5d are schematic diagrams of application scenarios of a method for generating an image transformation model according to the present application;
FIG. 6 is a flow diagram of yet another embodiment of a method for converting an image according to the present application;
FIG. 7 is a schematic illustration of an application scenario of a method for converting an image according to the present application;
FIG. 8 is a schematic diagram illustrating the structure of one embodiment of an apparatus for generating an image transformation model according to the present application;
FIG. 9 is a schematic diagram illustrating the structure of one embodiment of an apparatus for converting an image according to the present application;
fig. 10 is a block diagram of an electronic device for a method of generating an image conversion model according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 illustrates an exemplary system architecture 100 to which the method for generating an image conversion model, the apparatus for generating an image conversion model, the method for converting an image, or the apparatus for converting an image of the embodiments of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminals 101, 102, a network 103, a database server 104, and a server 105. The network 103 serves as a medium for providing communication links between the terminals 101, 102, the database server 104 and the server 105. Network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user 110 may use the terminals 101, 102 to interact with the server 105 over the network 103 to receive or send messages or the like. The terminals 101 and 102 may have various client applications installed thereon, such as a model training application, an image conversion application, a shopping application, a payment application, a web browser, an instant messenger, and the like.
Here, the terminals 101 and 102 may be hardware or software. When the terminals 101 and 102 are hardware, they may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), laptop portable computers, desktop computers, and the like. When the terminals 101 and 102 are software, they can be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
When the terminals 101, 102 are hardware, an image capturing device may be further mounted thereon. The image acquisition device can be various devices capable of realizing the function of acquiring images, such as a camera, a sensor and the like. The user 110 may use an image capturing device on the terminal 101, 102 to capture the image to be converted.
Database server 104 may be a database server that provides various services. For example, a database server may store a sample set. The sample set contains a large number of samples, where each sample may comprise an image of a first domain and an image of a second domain. For example, the first domain is male and the second domain is female; or the first domain is a painting and the second domain is a photograph, and so on. In this way, the user 110 may also select samples from a sample set stored by the database server 104 via the terminals 101, 102.
The server 105 may also be a server providing various services, such as a background server providing support for various applications displayed on the terminals 101, 102. The background server may train the initial model using the samples in the sample set sent by the terminals 101 and 102, and may send the training result (e.g., the generated image transformation model) to the terminals 101 and 102. In this way, the user can apply the generated image conversion model for image conversion.
Here, the database server 104 and the server 105 may be hardware or software. When they are hardware, they can be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When they are software, they may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for generating the image conversion model or the method for converting the image provided by the embodiment of the present application is generally performed by the server 105. Accordingly, the means for generating the image conversion model or the means for converting the image is generally also provided in the server 105.
It is noted that database server 104 may not be provided in system architecture 100, as server 105 may perform the relevant functions of database server 104.
It should be understood that the number of terminals, networks, database servers, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, database servers, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating an image transformation model according to the present application is shown. The method for generating an image conversion model may comprise the steps of:
step 201, a preset sample set is obtained.
In the present embodiment, the execution subject (e.g., the server shown in fig. 1) of the method for generating an image conversion model may acquire a sample set in various ways. For example, the executing entity may obtain the existing sample set stored therein from a database server (e.g., database server 104 shown in fig. 1) via a wired connection or a wireless connection. As another example, a user may collect a sample via a terminal (e.g., terminals 101, 102 shown in FIG. 1). In this way, the executing entity may receive samples collected by the terminal and store the samples locally, thereby generating a sample set.
Here, the sample set may include at least one sample, where each sample may comprise an image of a first domain and an image of a second domain. A domain refers to a category of image, for example male, female, painting, photograph, and so on. A sample may, for instance, include a male image as the image of the first domain and a female image as the image of the second domain. The image types of the first domain and the second domain are not limited herein and may be any combination. For the specific implementation, refer to the sample selection in step 203.
Step 202, a pre-established generative adversarial network is obtained.
In the present embodiment, the network structure is as shown in fig. 3. The generative adversarial network (GAN) comprises a generation network, a discrimination network, and a pre-trained perception module. The generation network is used for converting the image of the first domain into an image of the second domain, and the discrimination network is used for determining whether an input image is an image output by the generation network. The perception module is used for extracting high-level features of the image. The perception module may be a pre-trained VGG model used to extract high-level semantic features; only the high-level features of the VGG network are used, and the final classification information is not used. The module may be further trained into a binary classifier on the basis of an existing VGG, for example a VGG model trained on another dataset (such as the ImageNet classification dataset) and then fine-tuned on a two-class dataset.
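A minimal sketch of such a perception module follows, assuming PyTorch and a recent torchvision; the VGG variant, the truncation point (a high-level layer such as relu4_4), and the optional binary head are illustrative assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptionModule(nn.Module):
    """VGG-based feature extractor: only high-level features are used;
    the original classification head of VGG is discarded."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        # Keep the convolutional stack up to a high-level layer (here up to relu4_4).
        self.features = nn.Sequential(*list(vgg.features.children())[:27])
        for p in self.features.parameters():
            p.requires_grad = False  # frozen while the GAN is trained
        # Hypothetical binary head, used only to fine-tune the backbone on a
        # two-class dataset (e.g. first domain vs. second domain) beforehand.
        self.binary_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 2))

    def forward(self, x):
        # During GAN training only the high-level feature maps are returned.
        return self.features(x)
```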
The generation network may be a convolutional neural network for image processing (for example, any convolutional neural network structure containing convolutional layers, pooling layers, unpooling layers, and deconvolution layers, which may perform down-sampling and then up-sampling in sequence); the discrimination network may be a convolutional neural network (e.g., any convolutional neural network structure containing a fully-connected layer, where the fully-connected layer performs the classification function). In addition, the discrimination network may be any other model structure that can implement a classification function, such as a Support Vector Machine (SVM).
In step 203, a sample is selected from the sample set.
In this embodiment, the executing subject may select a sample from the sample set obtained in step 201 and perform the training steps of step 203 to step 208. The selection manner and the number of samples are not limited in the present application. For example, at least one sample may be selected randomly, or a sample with better sharpness (i.e., higher resolution) may be selected. Each sample may be a pair of images, where the pair includes an image of the first domain and an image of the second domain. The images of the first domain and the second domain may be chosen according to actual requirements. For example, if image gender conversion is required and a male image is to be changed into a female image, the male image is chosen as the first-domain image and the female image as the second-domain image. Similarly, if the image is to be converted from female to male, the female image is chosen as the first-domain image and the male image as the second-domain image. If image style conversion is needed, for example converting photos into oil paintings, photos (such as male photos, female photos, landscape photos, and so on) are chosen as first-domain images, and the corresponding oil paintings (male oil paintings, female oil paintings, landscape oil paintings, and so on) are chosen as second-domain images. The style conversion may also be any of a variety of combinations, such as oil painting to photo, photo to Chinese painting, Chinese painting to photo, Monet style to Picasso style, and so on. Conversion between any two types of images can be realized simply by taking the image of the original type as the first-domain image and the image of the target type as the second-domain image.
Step 204, inputting the image of the first domain of the sample into a generation network to obtain a pseudo image of the second domain of the sample.
In this embodiment, the generation network may convert an input image of the first domain into a pseudo image of the second domain. For example, a male face image is input and a female face image is output. As shown in fig. 3, the image RealA of the first domain is input, and the pseudo image FakeB of the second domain is obtained. In this application, A represents the first domain and B represents the second domain; Real represents an original image and Fake represents a pseudo image.
And step 205, respectively extracting the high-level features of the pseudo image of the second domain of the sample and the high-level features of the image of the second domain of the sample through a pre-trained perception module.
In this embodiment, the perception module is not used for classification here, but for extracting high-level features of the images. As shown in fig. 3, the high-level features of RealB and FakeB are extracted respectively.
And step 206, inputting the extracted high-level features into a discrimination network together, and calculating a loss value.
In this embodiment, the discrimination network of the present application differs from a classical discrimination network in that it makes its decision on high-level features. The discrimination network may output 1 if it determines that the input high-level features are the high-level features of an image output by the generation network (i.e., from generated data), and may output 0 if it determines that they are not (i.e., they come from real data, namely the image of the second domain). The discrimination network may also output other preset values; it is not limited to 1 and 0. An L1 loss value or an L2 loss value may be calculated between the high-level features of the pseudo image of the second domain of the sample and the high-level features of the image of the second domain of the sample.
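The following sketch illustrates one way such a feature-level discrimination network and loss could look, assuming PyTorch; the layer sizes, the sigmoid output convention (1 for generated features, 0 for real features, as described above), and the helper name feature_losses are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDiscriminator(nn.Module):
    """Discriminates high-level feature maps instead of raw pixels."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 1), nn.Sigmoid())  # ~1: generated features, ~0: real features

    def forward(self, feat):
        return self.net(feat)

def feature_losses(disc, feat_fake, feat_real):
    # Generator-side adversarial term: push the discriminator toward the
    # "real" label (0 under the convention above) for generated features.
    adv = F.binary_cross_entropy(disc(feat_fake),
                                 torch.zeros(feat_fake.size(0), 1))
    # L1 distance between the fake and real high-level features (L2 also possible).
    l1 = F.l1_loss(feat_fake, feat_real)
    return adv, l1
```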
Step 207, if the generative adversarial network satisfies the training completion condition, the generation network is used as the image conversion model.
In this embodiment, the training completion condition includes at least one of: the number of training iterations reaches a preset iteration threshold, the loss value is smaller than a preset loss value threshold, or the discrimination accuracy of the discrimination network falls within a preset range. For example, the number of training iterations reaches 5 thousand, the loss value is less than 0.05, or the discrimination accuracy of the discrimination network reaches 50%. After training is finished, only the generation network is kept as the image conversion model. Setting the training completion condition can accelerate model convergence.
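A minimal sketch of such a completion check follows; the threshold values are the example numbers above, and the accuracy band around 50% is an assumed interpretation of "within a preset range".

```python
def training_done(iteration, loss, disc_accuracy,
                  max_iter=5_000, loss_thresh=0.05, acc_range=(0.45, 0.55)):
    # Any single condition ends training ("at least one of" above).
    return (iteration >= max_iter
            or loss < loss_thresh
            or acc_range[0] <= disc_accuracy <= acc_range[1])
```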
Step 208, if the generative adversarial network does not satisfy the training completion condition, the relevant parameters in the generative adversarial network are adjusted so that the loss value converges, and steps 203 to 208 continue to be executed based on the adjusted generative adversarial network.
In this embodiment, if training is not completed, the parameters of the generation network or the discrimination network are adjusted so that the loss value converges. The parameters of the discrimination network may first be kept fixed while steps 203 to 208 are repeated to adjust the parameters of the generation network so that the loss value gradually decreases until it is stable. Then, the parameters of the generation network are kept fixed while steps 203 to 208 are repeated to adjust the parameters of the discrimination network so that the loss value gradually increases until it is stable. The parameters of the generation network and the parameters of the discrimination network are trained alternately in this way until the loss value converges.
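An illustrative alternating-training loop is sketched below, assuming PyTorch and the frozen perception module from the earlier sketch; sample_batch, generator_loss, and discriminator_loss are hypothetical helpers, and the optimizer settings are illustrative.

```python
import torch

def train(generator, discriminator, perception, sample_set, steps=5_000):
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    for step in range(steps):
        real_a, real_b = sample_batch(sample_set)   # one sample: RealA, RealB
        # 1) Keep the discrimination network fixed, adjust the generation network.
        fake_b = generator(real_a)                  # pseudo image of the second domain
        feat_fake, feat_real = perception(fake_b), perception(real_b)
        g_loss = generator_loss(discriminator, feat_fake, feat_real)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        # 2) Keep the generation network fixed, adjust the discrimination network.
        feat_fake = perception(generator(real_a).detach())
        d_loss = discriminator_loss(discriminator, feat_fake, feat_real.detach())
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
```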
It can be seen that the discrimination network does not discriminate real and fake pictures directly, but discriminates the high-level features obtained through the perception module. With the perception module added, what is propagated back when computing gradients is the distribution of high-level features (which can be understood as the overall distribution of a picture's pixels, i.e., a higher level of characteristics) rather than a direct pixel-wise L2 distance, which is more general. As a result, model convergence is accelerated, the generated picture retains more high-frequency information, and picture definition and detail richness are better.
With continued reference to FIG. 4, a flow 400 of yet another embodiment of a method for generating an image transformation model according to the present application is shown. The method for generating an image conversion model may comprise the steps of:
step 401, a preset sample set is obtained.
Step 401 is substantially the same as step 201, and therefore is not described again.
Step 402, a pre-established cycle generative adversarial network is obtained.
In this embodiment, the network structure is shown in fig. 5a. The generation network includes a first generation network (abbreviated GA2B) for converting an image of the first domain into an image of the second domain, and a second generation network (abbreviated GB2A) for converting an image of the second domain into an image of the first domain. The discrimination network includes a first discrimination network (abbreviated DA) for determining whether an input image is an image output by the first generation network, and a second discrimination network (abbreviated DB) for determining whether an input image is an image output by the second generation network. For example, the first generation network converts a male face image into a female face image, and the second generation network converts a female face image into a male face image.
The cycle generative adversarial network also includes a perception module for extracting high-level features of the image. The perception module may be a pre-trained VGG model used to extract high-level semantic features; only the high-level features of the VGG network are used, and the final classification information is not used. The module may be further trained into a binary classifier on the basis of an existing VGG, for example a VGG model trained on another dataset (such as the ImageNet classification dataset) and then fine-tuned on a two-class dataset.
At step 403, a sample is selected from the sample set.
Step 403 is substantially the same as step 203 and thus will not be described again.
Step 404, generating a pseudo image, a reconstructed image and a mapping image of the first domain and the second domain respectively based on the image of the first domain and the image of the second domain of the sample.
In this embodiment, the first domain is denoted by A and the second domain by B. Real denotes a real image, Fake denotes a pseudo image, Recons denotes a reconstructed image, FakeA2A denotes the mapped image of the first domain, and FakeB2B denotes the mapped image of the second domain. As shown in fig. 5a, the image RealA of the first domain is converted into FakeB after passing through the first generation network. FakeB is converted into ReconsA after passing through the second generation network. RealA is converted into FakeA2A after passing through the second generation network. The image RealB of the second domain is converted into FakeA after passing through the second generation network. FakeA is converted into ReconsB after passing through the first generation network. RealB is converted into FakeB2B after passing through the first generation network.
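The passes described above can be summarized in a short sketch; the names follow the text, while the call signatures of the two generation networks are assumptions.

```python
def cycle_forward(g_a2b, g_b2a, real_a, real_b):
    fake_b   = g_a2b(real_a)   # RealA -> FakeB (pseudo image of the second domain)
    recons_a = g_b2a(fake_b)   # FakeB -> ReconsA (reconstructed image of the first domain)
    fake_a2a = g_b2a(real_a)   # RealA -> FakeA2A (mapped image of the first domain)
    fake_a   = g_b2a(real_b)   # RealB -> FakeA (pseudo image of the first domain)
    recons_b = g_a2b(fake_a)   # FakeA -> ReconsB (reconstructed image of the second domain)
    fake_b2b = g_a2b(real_b)   # RealB -> FakeB2B (mapped image of the second domain)
    return fake_b, recons_a, fake_a2a, fake_a, recons_b, fake_b2b
```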
Step 405, respectively extracting the high-level features of the image, the pseudo image, the reconstructed image, the mapping image of the first domain and the image, the pseudo image, the reconstructed image and the mapping image of the second domain of the sample.
In this embodiment, the high-level features of the image, the pseudo image, the reconstructed image, and the mapped image of the first domain of the sample, and of the image, the pseudo image, the reconstructed image, and the mapped image of the second domain of the sample, are each extracted by the perception module. The same perception module is used to extract all of these high-level features; several copies of the same perception module may extract them in parallel, or a single perception module may extract them serially.
At step 406, a loss value is calculated based on the extracted high-level features.
In this embodiment, the loss value is a weighted sum of the discrimination loss value, the reconstruction loss value, and the identity loss value. The weight of each loss value may be set according to experimental data; for example, the weight of the reconstruction loss value is set greater than the weight of the identity loss value. Because the loss value is calculated from high-level features, model convergence is accelerated, the generated picture retains more high-frequency information, and picture definition and detail richness are better. Calculating loss values from several perspectives also improves the accuracy of the model.
The calculation of the reconstruction loss value is shown in fig. 5b. The distance between the high-level features of the image of the first domain of the sample and the high-level features of the reconstructed image of the first domain of the sample is calculated as a first reconstruction loss value. The distance between the high-level features of the image of the second domain of the sample and the high-level features of the reconstructed image of the second domain of the sample is calculated as a second reconstruction loss value. The sum of the first reconstruction loss value and the second reconstruction loss value is determined as the reconstruction loss value. The distance may be calculated using an L1 loss function or an L2 loss function.
The identity loss value is calculated as shown in figure 5 c. A first identity loss value is calculated as a distance of a high-level feature of the mapped image of the first domain of the sample from a high-level feature of the image of the first domain of the sample. A distance between the high-level features of the mapped image of the second domain of the sample and the high-level features of the image of the second domain of the sample is calculated as a second identity loss value. And determining the sum of the first identity loss value and the second identity loss value as the identity loss value. The distance may be calculated using an L1 loss function or an L2 loss function.
The calculation process for the discrimination loss value is shown in fig. 5d. The high-level features of the pseudo image of the first domain of the sample and the high-level features of the image of the first domain of the sample are input into the first discrimination network, and a first discrimination loss value is calculated. The high-level features of the pseudo image of the second domain of the sample and the high-level features of the image of the second domain of the sample are input into the second discrimination network, and a second discrimination loss value is calculated. The sum of the first discrimination loss value and the second discrimination loss value is taken as the discrimination loss value. The distance may be calculated using an L1 loss function or an L2 loss function.
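A hedged sketch of the weighted total loss of steps 405-406 is given below, assuming PyTorch; the weights (reconstruction weighted higher than identity, as stated above), the use of L1 distances, and the way each discrimination network combines fake and real features are illustrative interpretations rather than details fixed by the patent.

```python
import torch
import torch.nn.functional as F

def total_loss(phi, d_a, d_b, real_a, real_b, fake_a, fake_b,
               recons_a, recons_b, fake_a2a, fake_b2b,
               w_rec=10.0, w_id=5.0):
    # High-level features of every image via the shared perception module phi.
    f = {name: phi(img) for name, img in [
        ('real_a', real_a), ('real_b', real_b), ('fake_a', fake_a),
        ('fake_b', fake_b), ('recons_a', recons_a), ('recons_b', recons_b),
        ('fake_a2a', fake_a2a), ('fake_b2b', fake_b2b)]}

    # Discrimination loss (generator side): each discrimination network judges
    # the fake features; 0 is the "real" label under the convention used above.
    zeros = lambda t: torch.zeros(t.size(0), 1)
    disc = (F.binary_cross_entropy(d_a(f['fake_a']), zeros(real_a)) +
            F.binary_cross_entropy(d_b(f['fake_b']), zeros(real_b)))

    # Reconstruction loss: RealA vs. ReconsA plus RealB vs. ReconsB.
    rec = F.l1_loss(f['real_a'], f['recons_a']) + F.l1_loss(f['real_b'], f['recons_b'])

    # Identity loss: RealA vs. FakeA2A plus RealB vs. FakeB2B.
    idt = F.l1_loss(f['real_a'], f['fake_a2a']) + F.l1_loss(f['real_b'], f['fake_b2b'])

    return disc + w_rec * rec + w_id * idt
```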
Step 407, if the cycle generative adversarial network satisfies the training completion condition, the first generation network and the second generation network are used as image conversion models.
In this embodiment, the training completion condition includes at least one of: the number of training iterations reaches a preset iteration threshold, the loss value is smaller than a preset loss value threshold, or the discrimination accuracy of the discrimination networks falls within a preset range. After the cycle generative adversarial network finishes training, two generation networks are obtained: the first generation network and the second generation network. The first generation network converts an image of the first domain into an image of the second domain, for example a male face image into a female face image. The second generation network converts an image of the second domain into an image of the first domain, for example a female face image into a male face image.
Step 408, if the cycle generative adversarial network does not satisfy the training completion condition, the relevant parameters in the cycle generative adversarial network are adjusted so that the loss value converges.
In this embodiment, if training is not completed, the parameters of the first generation network, the second generation network, the first discrimination network, and the second discrimination network are adjusted so that the loss value converges. The parameters of the first and second discrimination networks may first be kept fixed while steps 403 to 408 are repeated to adjust the parameters of the first and second generation networks so that the loss value gradually decreases until it is stable. Then, the parameters of the first and second generation networks are kept fixed while steps 403 to 408 are repeated to adjust the parameters of the first and second discrimination networks so that the loss value gradually increases until it is stable. The parameters of the generation networks and the parameters of the discrimination networks are trained alternately in this way until the loss value converges.
The flexibility of the cycle generative adversarial network is that it can be trained without paired samples mapping the source domain to the target domain. In addition, with the perception module added, the distribution propagated back when computing gradients is more general than a directly calculated pixel-wise L2 distance, so model convergence is accelerated, the generated picture retains more high-frequency information, and picture definition and detail richness are better.
Turning to fig. 6, a flowchart 600 of one embodiment of a method for converting an image provided herein is shown. The method for converting an image may include the steps of:
step 601, acquiring an image to be converted.
In the present embodiment, the execution subject of the method for converting an image (e.g., the server 105 shown in fig. 1) may acquire an image to be converted in various ways. For example, the execution subject may acquire the image to be converted stored therein from a database server (e.g., database server 104 shown in fig. 1) through a wired connection or a wireless connection. As another example, the executing entity may also receive an image acquired by a terminal (e.g., terminals 101 and 102 shown in fig. 1) or other device to be converted.
In the present embodiment, the image to be converted may be a color image and/or a grayscale image, and its format is not limited in this application.
Step 602, inputting the image into the image conversion model, and outputting the converted image.
In this embodiment, the executing subject may input the image acquired in step 601 into an image conversion model, thereby generating a converted image. Steps 201 to 208 train an image conversion model that converts an image from the first domain to the second domain, while steps 401 to 408 train two image conversion models that convert in both directions. As shown in fig. 7, the image conversion model generated by the training of steps 201 to 208 takes a male face image as input and converts it into a female face image. If the training of steps 401 to 408 is used to generate two image conversion models, the conversion can be performed bidirectionally, from male to female and from female to male.
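A purely illustrative inference call is sketched below, assuming PyTorch and torchvision; the file paths, preprocessing, and the assumption that the trained generation network was saved as a whole module are hypothetical.

```python
import torch
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([transforms.Resize((256, 256)), transforms.ToTensor()])

def convert(model_path, image_path):
    generator = torch.load(model_path, map_location='cpu')  # trained generation network
    generator.eval()
    x = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        y = generator(x)  # converted image, e.g. male face -> female face
    return transforms.ToPILImage()(y.squeeze(0).clamp(0, 1))
```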
In this embodiment, the image transformation model may be generated using the method described above in the embodiment of fig. 2. For a specific generation process, reference may be made to the related description of the embodiment in fig. 2, which is not described herein again.
It should be noted that the method for converting an image according to the present embodiment may be used to test the image conversion model generated by the above embodiments. And then the image conversion model can be continuously optimized according to the conversion result. The method may also be a practical application method of the image conversion model generated in the above embodiments. The image conversion model generated by the above embodiments is used for image conversion, which is helpful for improving the performance of image conversion.
With continuing reference to FIG. 8, the present application provides one embodiment of an apparatus for generating an image conversion model as an implementation of the method illustrated in FIG. 2 described above. The embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus can be applied to various electronic devices.
As shown in fig. 8, the apparatus 800 for generating an image conversion model of the present embodiment may include: a sample acquisition unit 801, a network acquisition unit 802, a selection unit 803, a generation unit 804, an extraction unit 805, a calculation unit 806, an output unit 807, and an adjustment unit 808. The sample acquisition unit 801 is configured to acquire a preset sample set, where the sample set contains at least one sample and each sample comprises an image of a first domain and an image of a second domain. The network acquisition unit 802 is configured to acquire a pre-established generative adversarial network, where the generative adversarial network includes a generation network, a discrimination network, and a pre-trained perception module. The selection unit 803 is configured to select a sample from the sample set. The generation unit 804 is configured to input the image of the first domain of the sample into the generation network to obtain a pseudo image of the second domain of the sample. The extraction unit 805 is configured to extract, through the perception module, the high-level features of the pseudo image of the second domain of the sample and the high-level features of the image of the second domain of the sample, respectively. The calculation unit 806 is configured to input the extracted high-level features together into the discrimination network and calculate a loss value. The output unit 807 is configured to use the generation network as the image conversion model if the generative adversarial network satisfies the training completion condition.
In some optional implementations of this embodiment, the apparatus 800 further includes an adjustment unit 808 configured to: if the generative adversarial network does not satisfy the training completion condition, adjust the relevant parameters in the generative adversarial network so that the loss value converges, and cause the selection unit, the generation unit, the extraction unit, the calculation unit, and the output unit to continue the training steps based on the adjusted generative adversarial network.
In some optional implementations of this embodiment, the generative adversarial network is a cycle generative adversarial network, the generation network includes a first generation network and a second generation network, and the discrimination network includes a first discrimination network and a second discrimination network; the first generation network is used for converting an image of the first domain into an image of the second domain, the second generation network is used for converting an image of the second domain into an image of the first domain, the first discrimination network is used for determining whether an input image is an image output by the first generation network, and the second discrimination network is used for determining whether an input image is an image output by the second generation network.
In some optional implementations of this embodiment, the generating unit 804 is further configured to: inputting the image of the first domain of the sample into a first generation network to obtain a pseudo image of the second domain of the sample; inputting the image of the second domain of the sample into a second generation network to obtain a pseudo image of the first domain of the sample; inputting the pseudo image of the second domain of the sample into a second generation network to obtain a reconstructed image of the first domain of the sample; inputting the pseudo image of the first domain of the sample into a first generation network to obtain a reconstructed image of the second domain of the sample; inputting the image of the first domain of the sample into a second generation network to obtain a mapping image of the first domain of the sample; and inputting the image of the second domain of the sample into the first generation network to obtain a mapping image of the second domain of the sample.
In some optional implementations of this embodiment, the extraction unit 805 is further configured to: extract, through the perception module, the high-level features of the image, the pseudo image, the reconstructed image, and the mapped image of the first domain of the sample, respectively; and extract, through the perception module, the high-level features of the image, the pseudo image, the reconstructed image, and the mapped image of the second domain of the sample, respectively.
In some optional implementations of this embodiment, the computing unit 806 is further configured to: calculating a discriminant loss value based on the pseudo-image of the first domain and the pseudo-image of the second domain of the sample; calculating a reconstruction loss value based on a reconstructed image of a first domain and a reconstructed image of a second domain of the sample; calculating an identity loss value based on the mapped image of the first domain and the mapped image of the second domain of the sample; and taking the weighted sum of the discriminant loss value, the reconstruction loss value and the identity loss value as the loss value.
In some optional implementations of this embodiment, the computing unit 806 is further configured to: input the high-level features of the pseudo image of the first domain of the sample and the high-level features of the image of the first domain of the sample into the first discrimination network and calculate a first discrimination loss value; input the high-level features of the pseudo image of the second domain of the sample and the high-level features of the image of the second domain of the sample into the second discrimination network and calculate a second discrimination loss value; and take the sum of the first discrimination loss value and the second discrimination loss value as the discrimination loss value.
In some optional implementations of this embodiment, the computing unit 806 is further configured to: calculating a distance between the high-level features of the image of the first domain of the sample and the high-level features of the reconstructed image of the first domain of the sample as a first reconstruction loss value; calculating a distance between the high-level feature of the image of the second domain of the sample and the high-level feature of the reconstructed image of the second domain of the sample as a second reconstruction loss value; determining a sum of the first reconstruction loss value and the second reconstruction loss value as a reconstruction loss value.
In some optional implementations of this embodiment, the computing unit 806 is further configured to: calculating the distance between the high-level features of the mapping image of the first domain of the sample and the high-level features of the image of the first domain of the sample to be used as a first identity loss value; calculating the distance between the high-level features of the mapping image of the second domain of the sample and the high-level features of the image of the second domain of the sample to be used as a second identity loss value; and determining the sum of the first identity loss value and the second identity loss value as the identity loss value.
In some optional implementations of this embodiment, the training completion condition includes at least one of: training iteration times reach a preset iteration threshold, the loss value is smaller than a preset loss value threshold, and the judgment accuracy of the judgment network is in a preset range.
It will be understood that the elements described in the apparatus 800 correspond to various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 800 and the units included therein, and are not described herein again.
With continuing reference to FIG. 9, the present application provides one embodiment of an apparatus for converting an image as an implementation of the method illustrated in FIG. 6 described above. The embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 6, and the apparatus can be applied to various electronic devices.
As shown in fig. 9, the apparatus 900 for converting an image of the present embodiment may include: an acquisition unit 901 configured to acquire an image to be converted. A conversion unit 902 configured to input the image into an image conversion model generated by the method as described in the embodiment of fig. 2 or fig. 4, and output the converted image.
It will be understood that the units described in the apparatus 900 correspond to the respective steps of the method described with reference to fig. 6. Thus, the operations, features, and advantages described above with respect to that method also apply to the apparatus 900 and the units included therein, and are not described here again.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 10, it is a block diagram of an electronic device for generating an image conversion model according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 10, the electronic device includes: one or more processors 1001, a memory 1002, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 10 illustrates an example with one processor 1001.
The memory 1002 is a non-transitory computer readable storage medium provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method for generating an image transformation model provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method for generating an image conversion model provided herein.
The memory 1002, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the method for generating an image conversion model in the embodiment of the present application (e.g., the sample acquisition unit 801, the network acquisition unit 802, the selection unit 803, the generation unit 804, the extraction unit 805, the calculation unit 806, the output unit 807, and the adjustment unit 808 shown in fig. 8). The processor 1001 executes various functional applications of the server and data processing, i.e., implements the method for generating an image conversion model in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 1002.
The memory 1002 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of an electronic device for generating an image conversion model, and the like. Further, the memory 1002 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 1002 may optionally include memory located remotely from the processor 1001, which may be connected via a network to an electronic device for generating the image transformation model. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the method of generating an image conversion model may further include: an input device 1003 and an output device 1004. The processor 1001, the memory 1002, the input device 1003, and the output device 1004 may be connected by a bus or other means, and the bus connection is exemplified in fig. 10.
The input device 1003 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus used to generate the image conversion model, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 1004 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor or OLED (organic electroluminescent display) display screen) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiments of the application, the provided perception module can be embedded as a plug-and-play module into picture-translation algorithms such as the cycle generative adversarial network. Because what is propagated back is the distribution of high-level features, the speed of model convergence is accelerated and the generated picture retains more high-frequency information; the definition and detail richness of the picture are greatly improved, so the generated picture is clearer, more real, natural and rich, with better quality and effect. The technology can be widely applied to tasks such as image translation and style conversion, and has strong application value.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order; this is not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (24)

1. A method for generating an image transformation model, comprising:
acquiring a preset sample set, wherein the sample set at least comprises one sample, and the sample comprises an image of a first domain and an image of a second domain;
acquiring a pre-established generative adversarial network, wherein the generative adversarial network comprises a generation network, a discrimination network and a perception module;
performing the following training steps: selecting a sample from the sample set; inputting the image of the first domain of the sample into the generation network to obtain a pseudo image of the second domain of the sample; respectively extracting, through the perception module, high-level features of the pseudo image of the second domain of the sample and high-level features of the image of the second domain of the sample; inputting the extracted high-level features together into the discrimination network, and calculating a loss value; and if the generative adversarial network satisfies a training completion condition, taking the generation network as an image conversion model.
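Purely as an illustration of the training step recited in claim 1, a single iteration might look like the sketch below. The tiny network architectures, the least-squares adversarial loss, and the optimizer settings are hypothetical; the claim itself does not fix any of them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal stand-ins for the networks named in claim 1 (architectures are assumptions).
generator = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))      # first domain -> pseudo second domain
perception = nn.Sequential(                                    # extracts high-level features
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
discriminator = nn.Sequential(nn.Conv2d(16, 1, 3, padding=1))  # scores the feature maps

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

real_a = torch.randn(4, 3, 64, 64)  # sample images of the first domain
real_b = torch.randn(4, 3, 64, 64)  # sample images of the second domain

# One training step, mirroring claim 1:
fake_b = generator(real_a)              # pseudo image of the second domain
feat_fake = perception(fake_b)          # high-level features of the pseudo image
feat_real = perception(real_b)          # high-level features of the real image
score_fake = discriminator(feat_fake)   # the discrimination network consumes the features
score_real = discriminator(feat_real)

# A least-squares GAN loss is used here only as an example loss value.
loss_g = F.mse_loss(score_fake, torch.ones_like(score_fake))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
print(float(loss_g))
```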
2. The method of claim 1, wherein the method further comprises:
if the generative adversarial network does not satisfy the training completion condition, adjusting relevant parameters in the generative adversarial network so that the loss value converges, and continuing to perform the training steps based on the adjusted generative adversarial network.
3. The method according to claim 1, wherein the generative adversarial network is a cycle generative adversarial network; the generation network comprises a first generation network for converting an image of the first domain into an image of the second domain and a second generation network for converting an image of the second domain into an image of the first domain; and the discrimination network comprises a first discrimination network for determining whether an input image is an image output by the first generation network and a second discrimination network for determining whether an input image is an image output by the second generation network.
4. The method of claim 3, wherein after inputting the image of the first domain of the sample into the generating network, resulting in the pseudo-image of the second domain of the sample, the method further comprises:
inputting the image of the second domain of the sample into the second generation network to obtain a pseudo image of the first domain of the sample;
inputting the pseudo image of the second domain of the sample into the second generation network to obtain a reconstructed image of the first domain of the sample;
inputting the pseudo image of the first domain of the sample into the first generation network to obtain a reconstructed image of the second domain of the sample;
inputting the image of the first domain of the sample into the second generation network to obtain a mapping image of the first domain of the sample;
and inputting the image of the second domain of the sample into the first generation network to obtain a mapping image of the second domain of the sample.
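To make the six image roles in claims 3 and 4 concrete, the following sketch produces the pseudo, reconstructed, and mapping images from one pair of sample images. The generator names G_ab and G_ba and the toy architectures are hypothetical placeholders for the first and second generation networks.

```python
import torch
import torch.nn as nn

# Toy generators; G_ab maps the first domain to the second, G_ba the reverse (assumed names).
G_ab = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Tanh())
G_ba = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Tanh())

real_a = torch.randn(1, 3, 64, 64)  # image of the first domain
real_b = torch.randn(1, 3, 64, 64)  # image of the second domain

fake_b = G_ab(real_a)  # pseudo image of the second domain
fake_a = G_ba(real_b)  # pseudo image of the first domain
rec_a = G_ba(fake_b)   # reconstructed image of the first domain
rec_b = G_ab(fake_a)   # reconstructed image of the second domain
idt_a = G_ba(real_a)   # mapping image of the first domain
idt_b = G_ab(real_b)   # mapping image of the second domain
```

These six tensors are exactly the inputs that the perception module processes in claim 5 and that the loss terms of claims 6 to 9 are computed from.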
5. The method of claim 4, wherein after extracting, by the perception module, high-level features of the pseudo-image of the second domain of the sample and high-level features of the image of the second domain of the sample, respectively, the method further comprises:
respectively extracting, through the perception module, high-level features of the image, the pseudo image, the reconstructed image and the mapping image of the first domain of the sample;
and respectively extracting, through the perception module, high-level features of the reconstructed image and the mapping image of the second domain of the sample.
6. The method of claim 5, wherein said inputting the extracted high-level features together into the discriminative network, calculating a loss value, comprises:
calculating a discriminant loss value based on the pseudo-image of the first domain and the pseudo-image of the second domain of the sample;
calculating a reconstruction loss value based on a reconstructed image of a first domain and a reconstructed image of a second domain of the sample;
calculating an identity loss value based on the mapped image of the first domain and the mapped image of the second domain of the sample;
and taking the weighted sum of the discrimination loss value, the reconstruction loss value and the identity loss value as a loss value.
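A weighted combination of the three loss terms in claim 6 could be expressed as below; the weight values are hypothetical hyperparameters and are not taken from the application.

```python
def total_loss(discrimination_loss, reconstruction_loss, identity_loss,
               w_adv: float = 1.0, w_rec: float = 10.0, w_idt: float = 5.0):
    """Weighted sum of the discrimination, reconstruction and identity loss values
    (claim 6); the weights shown here are assumptions."""
    return (w_adv * discrimination_loss
            + w_rec * reconstruction_loss
            + w_idt * identity_loss)
```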
7. The method of claim 6, wherein said calculating a discriminant loss value based on the pseudo-image of the first domain and the pseudo-image of the second domain of the sample comprises:
inputting the high-level features of the pseudo image of the first domain of the sample and the high-level features of the image of the first domain of the sample into a first discrimination network, and calculating a first discrimination loss value;
inputting the high-level features of the pseudo image of the second domain of the sample and the high-level features of the image of the second domain of the sample into the second discrimination network, and calculating a second discrimination loss value;
and taking the sum of the first discrimination loss value and the second discrimination loss value as the discrimination loss value.
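One hedged reading of claim 7 is sketched below: each discrimination network receives high-level features of a real image and of the corresponding pseudo image, and the two per-domain loss values are summed. The least-squares objective and the argument names are assumptions.

```python
import torch
import torch.nn.functional as F


def discrimination_loss(d_a, d_b, perception, real_a, fake_a, real_b, fake_b):
    """Sum of the first and second discrimination loss values (claim 7).
    d_a / d_b are the first / second discrimination networks; the LSGAN
    objective used here is only an example."""
    score_real_a = d_a(perception(real_a))
    score_fake_a = d_a(perception(fake_a))
    loss_a = (F.mse_loss(score_real_a, torch.ones_like(score_real_a))
              + F.mse_loss(score_fake_a, torch.zeros_like(score_fake_a)))

    score_real_b = d_b(perception(real_b))
    score_fake_b = d_b(perception(fake_b))
    loss_b = (F.mse_loss(score_real_b, torch.ones_like(score_real_b))
              + F.mse_loss(score_fake_b, torch.zeros_like(score_fake_b)))
    return loss_a + loss_b
```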
8. The method of claim 6, wherein said calculating a reconstruction loss value based on the reconstructed image of the first domain and the reconstructed image of the second domain of the sample comprises:
calculating a distance between the high-level features of the image of the first domain of the sample and the high-level features of the reconstructed image of the first domain of the sample as a first reconstruction loss value;
calculating a distance between the high-level feature of the image of the second domain of the sample and the high-level feature of the reconstructed image of the second domain of the sample as a second reconstruction loss value;
determining a sum of the first reconstruction loss value and the second reconstruction loss value as a reconstruction loss value.
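Claim 8 measures the reconstruction loss as a distance between high-level features; the sketch below uses an L1 distance purely as an example of such a distance.

```python
import torch.nn.functional as F


def reconstruction_loss(perception, real_a, rec_a, real_b, rec_b):
    """Feature-space reconstruction loss (claim 8); the L1 distance is an assumption."""
    loss_a = F.l1_loss(perception(rec_a), perception(real_a))  # first reconstruction loss value
    loss_b = F.l1_loss(perception(rec_b), perception(real_b))  # second reconstruction loss value
    return loss_a + loss_b
```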
9. The method of claim 6, wherein the calculating an identity loss value based on the mapping image of the first domain and the mapping image of the second domain of the sample comprises:
calculating the distance between the high-level features of the mapping image of the first domain of the sample and the high-level features of the image of the first domain of the sample to be used as a first identity loss value;
calculating the distance between the high-level features of the mapping image of the second domain of the sample and the high-level features of the image of the second domain of the sample to be used as a second identity loss value;
determining a sum of the first identity loss value and the second identity loss value as an identity loss value.
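The identity loss in claim 9 has the same shape as the reconstruction loss, only applied to the mapping images; again, an L1 feature distance is assumed here for illustration.

```python
import torch.nn.functional as F


def identity_loss(perception, real_a, idt_a, real_b, idt_b):
    """Feature-space identity loss (claim 9); the L1 distance is an assumption."""
    loss_a = F.l1_loss(perception(idt_a), perception(real_a))  # first identity loss value
    loss_b = F.l1_loss(perception(idt_b), perception(real_b))  # second identity loss value
    return loss_a + loss_b
```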
10. The method of any of claims 1-9, wherein the training completion condition comprises at least one of:
the number of training iterations reaches a preset iteration threshold, the loss value is smaller than a preset loss value threshold, or the discrimination accuracy of the discrimination network falls within a preset range.
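A training loop might check the alternative stopping conditions of claim 10 along the lines of the sketch below; the threshold values and the accuracy range are hypothetical.

```python
def training_complete(iteration: int, loss_value: float, disc_accuracy: float,
                      max_iters: int = 200_000, loss_threshold: float = 0.05,
                      acc_range: tuple = (0.45, 0.55)) -> bool:
    """Any one of the conditions in claim 10 may end training (thresholds assumed)."""
    return (iteration >= max_iters
            or loss_value < loss_threshold
            or acc_range[0] <= disc_accuracy <= acc_range[1])
```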
11. A method for converting an image, comprising:
acquiring an image to be converted;
inputting the image into an image conversion model generated by the method according to any one of claims 1-10, and outputting the converted image.
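At inference time only the trained generation network is kept as the image conversion model, so converting an image reduces to a single forward pass, roughly as below; the preprocessing and tensor layout are placeholders.

```python
import torch


def convert_image(generator: torch.nn.Module, image_tensor: torch.Tensor) -> torch.Tensor:
    """Apply the trained image conversion model (the generation network) to one image,
    given as a (C, H, W) tensor; returns the converted (C, H, W) tensor."""
    generator.eval()
    with torch.no_grad():
        return generator(image_tensor.unsqueeze(0)).squeeze(0)
```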
12. An apparatus for generating an image transformation model, comprising:
a sample acquiring unit configured to acquire a preset sample set, wherein the sample set contains at least one sample, and the sample comprises an image of a first domain and an image of a second domain;
the system comprises a network acquisition unit, a network acquisition unit and a control unit, wherein the network acquisition unit is configured to acquire a pre-established generative confrontation network, and the generative confrontation network comprises a generative network, a discrimination network and a pre-trained perception module;
a selecting unit configured to select a sample from the set of samples;
a generating unit configured to input an image of a first domain of the sample into the generating network, resulting in a pseudo-image of a second domain of the sample;
an extracting unit configured to extract, by the perception module, high-level features of the pseudo image of the second domain of the sample and high-level features of the image of the second domain of the sample, respectively;
a calculation unit configured to input the extracted high-level features together into the discrimination network, and calculate a loss value;
an output unit configured to take the generation network as an image conversion model if the generative adversarial network satisfies a training completion condition.
13. The apparatus of claim 12, wherein the apparatus further comprises an adjustment unit configured to:
if the generative adversarial network does not satisfy the training completion condition, adjusting relevant parameters in the generative adversarial network so that the loss value converges, and continuing, by the selecting unit, the generating unit, the extracting unit, the calculating unit and the output unit, to perform the training steps based on the adjusted generative adversarial network.
14. The apparatus of claim 12, wherein the generative adversarial network is a cycle generative adversarial network; the generation network comprises a first generation network for converting an image of the first domain into an image of the second domain and a second generation network for converting an image of the second domain into an image of the first domain; and the discrimination network comprises a first discrimination network for determining whether an input image is an image output by the first generation network and a second discrimination network for determining whether an input image is an image output by the second generation network.
15. The apparatus of claim 14, wherein the generating unit is further configured to:
inputting the image of the second domain of the sample into the second generation network to obtain a pseudo image of the first domain of the sample;
inputting the pseudo image of the second domain of the sample into the second generation network to obtain a reconstructed image of the first domain of the sample;
inputting the pseudo image of the first domain of the sample into the first generation network to obtain a reconstructed image of the second domain of the sample;
inputting the image of the first domain of the sample into the second generation network to obtain a mapping image of the first domain of the sample;
and inputting the image of the second domain of the sample into the first generation network to obtain a mapping image of the second domain of the sample.
16. The apparatus of claim 15, wherein the extraction unit is further configured to:
respectively extracting, through the perception module, high-level features of the image, the pseudo image, the reconstructed image and the mapping image of the first domain of the sample;
and respectively extracting, through the perception module, high-level features of the reconstructed image and the mapping image of the second domain of the sample.
17. The apparatus of claim 16, wherein the computing unit is further configured to:
calculating a discriminant loss value based on the pseudo-image of the first domain and the pseudo-image of the second domain of the sample;
calculating a reconstruction loss value based on a reconstructed image of a first domain and a reconstructed image of a second domain of the sample;
calculating an identity loss value based on the mapped image of the first domain and the mapped image of the second domain of the sample;
and taking the weighted sum of the discrimination loss value, the reconstruction loss value and the identity loss value as a loss value.
18. The apparatus of claim 17, wherein the computing unit is further configured to:
inputting the high-level features of the pseudo image of the first domain of the sample and the high-level features of the image of the first domain of the sample into a first discrimination network, and calculating a first discrimination loss value;
inputting the high-level features of the pseudo image of the second domain of the sample and the high-level features of the image of the second domain of the sample into the second discrimination network, and calculating a second discrimination loss value;
and taking the sum of the first discrimination loss value and the second discrimination loss value as the discrimination loss value.
19. The apparatus of claim 17, wherein the computing unit is further configured to:
calculating a distance between the high-level features of the image of the first domain of the sample and the high-level features of the reconstructed image of the first domain of the sample as a first reconstruction loss value;
calculating a distance between the high-level feature of the image of the second domain of the sample and the high-level feature of the reconstructed image of the second domain of the sample as a second reconstruction loss value;
determining a sum of the first reconstruction loss value and the second reconstruction loss value as a reconstruction loss value.
20. The apparatus of claim 17, wherein the computing unit is further configured to:
calculating the distance between the high-level features of the mapping image of the first domain of the sample and the high-level features of the image of the first domain of the sample to be used as a first identity loss value;
calculating the distance between the high-level features of the mapping image of the second domain of the sample and the high-level features of the image of the second domain of the sample to be used as a second identity loss value;
determining a sum of the first identity loss value and the second identity loss value as an identity loss value.
21. The apparatus according to any one of claims 12-20, wherein the training completion condition comprises at least one of:
the number of training iterations reaches a preset iteration threshold, the loss value is smaller than a preset loss value threshold, or the discrimination accuracy of the discrimination network falls within a preset range.
22. An apparatus for converting an image, comprising:
an acquisition unit configured to acquire an image to be converted;
a conversion unit configured to input the image into an image conversion model generated by the method according to any one of claims 1 to 10, and output a converted image.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-11.
CN202010386593.5A 2020-05-09 2020-05-09 Method and apparatus for generating image conversion model Pending CN111539897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010386593.5A CN111539897A (en) 2020-05-09 2020-05-09 Method and apparatus for generating image conversion model

Publications (1)

Publication Number Publication Date
CN111539897A true CN111539897A (en) 2020-08-14

Family

ID=71977847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010386593.5A Pending CN111539897A (en) 2020-05-09 2020-05-09 Method and apparatus for generating image conversion model

Country Status (1)

Country Link
CN (1) CN111539897A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
US20190295302A1 (en) * 2018-03-22 2019-09-26 Northeastern University Segmentation Guided Image Generation With Adversarial Networks
WO2020029356A1 (en) * 2018-08-08 2020-02-13 杰创智能科技股份有限公司 Method employing generative adversarial network for predicting face change
CN109558801A (en) * 2018-10-31 2019-04-02 厦门大学 Road network extraction method, medium, computer equipment and system
CN109815893A (en) * 2019-01-23 2019-05-28 中山大学 The normalized method in colorized face images illumination domain of confrontation network is generated based on circulation
CN109993698A (en) * 2019-03-29 2019-07-09 西安工程大学 A kind of single image super-resolution texture Enhancement Method based on generation confrontation network
CN110503626A (en) * 2019-07-09 2019-11-26 上海交通大学 Based on space-semantic significance constraint CT image modalities alignment schemes
CN110533580A (en) * 2019-08-08 2019-12-03 西安交通大学 A kind of image Style Transfer method generating neural network based on confrontation
CN110675334A (en) * 2019-08-28 2020-01-10 苏州千视通视觉科技股份有限公司 Image enhancement method and device

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BINGZHE WU等: "SRPGAN: Perceptual Generative Adversarial Network for Single Image Super Resolution" *
JUN-YAN ZHU等: "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks" *
TAKUHIRO KANEKO等: "CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion" *
刘遵雄; 蒋中慧; 任行乐: "Image super-resolution algorithm based on multi-scale generative adversarial networks" *
曹仰杰; 贾丽丽; 陈永霞; 林楠; 李学相: "A survey of generative adversarial networks and their applications in computer vision" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464009A (en) * 2020-11-17 2021-03-09 百度(中国)有限公司 Method and device for generating pairing image, electronic equipment and storage medium
CN112529058A (en) * 2020-12-03 2021-03-19 北京百度网讯科技有限公司 Image generation model training method and device and image generation method and device
CN112529154A (en) * 2020-12-07 2021-03-19 北京百度网讯科技有限公司 Image generation model training method and device and image generation method and device
CN113658285A (en) * 2021-06-28 2021-11-16 华南师范大学 Method for generating face photo to artistic sketch
CN113658036A (en) * 2021-08-23 2021-11-16 平安科技(深圳)有限公司 Data augmentation method, device, computer and medium based on countermeasure generation network
CN113762148A (en) * 2021-09-07 2021-12-07 京东科技信息技术有限公司 Image recognition model training method and device and image recognition method and device
CN113762148B (en) * 2021-09-07 2023-12-08 京东科技信息技术有限公司 Image recognition model training method and device, and image recognition method and device

Similar Documents

Publication Publication Date Title
US11854118B2 (en) Method for training generative network, method for generating near-infrared image and device
CN111539897A (en) Method and apparatus for generating image conversion model
JP7395686B2 (en) Image processing method, image processing model training method, device and storage medium
CN111783647B (en) Training method of face fusion model, face fusion method, device and equipment
CN111860167B (en) Face fusion model acquisition method, face fusion model acquisition device and storage medium
CN111259671B (en) Semantic description processing method, device and equipment for text entity
US20210241498A1 (en) Method and device for processing image, related electronic device and storage medium
CN111709470B (en) Image generation method, device, equipment and medium
CN111709875B (en) Image processing method, device, electronic equipment and storage medium
CN111860362A (en) Method and device for generating human face image correction model and correcting human face image
US20220189189A1 (en) Method of training cycle generative networks model, and method of building character library
US20220103782A1 (en) Method for video frame interpolation, and electronic device
JP2023531350A (en) A method for incrementing a sample image, a method for training an image detection model and a method for image detection
CN111862031A (en) Face synthetic image detection method and device, electronic equipment and storage medium
JP7267379B2 (en) Image processing method, pre-trained model training method, device and electronic equipment
US20220076470A1 (en) Methods and apparatuses for generating model and generating 3d animation, devices and storage mediums
CN114266937A (en) Model training method, image processing method, device, equipment and storage medium
CN111523467B (en) Face tracking method and device
CN112529154A (en) Image generation model training method and device and image generation method and device
CN112529180A (en) Method and apparatus for model distillation
CN112529058A (en) Image generation model training method and device and image generation method and device
CN116402914A (en) Method, device and product for determining stylized image generation model
JP7324891B2 (en) Backbone network generation method, apparatus, electronic equipment, storage medium and computer program
CN113240780B (en) Method and device for generating animation
CN112200169B (en) Method, apparatus, device and storage medium for training a model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination