WO2020129716A1 - Model learning device, model learning method, and program - Google Patents

Model learning device, model learning method, and program

Info

Publication number
WO2020129716A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
learning
generator
input
unit
Prior art date
Application number
PCT/JP2019/047940
Other languages
French (fr)
Japanese (ja)
Inventor
崇之 梅田
慎吾 安藤
淳 嵯峨田
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Publication of WO2020129716A1 publication Critical patent/WO2020129716A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis

Definitions

  • The present disclosure relates to a model learning device, a model learning method, and a program.
  • There is a technique that generates a realistic image, close to an actual photograph, from existing or arbitrarily created teacher data for simple semantic segmentation (hereinafter, a "semantic layout") (Non-Patent Document 1). According to this technique, the cost of preparing such teacher data can be reduced by generating various images from a single semantic layout.
  • In the prior art typified by Non-Patent Document 1, an image that looks good under subjective evaluation of the whole can be generated; in the details of the generated image, however, there is a problem that distortion may occur or object contours may become discontinuous.
  • The present disclosure has been made in view of the above points, and aims to provide a model learning device, a model learning method, and a program capable of learning an image generation model that generates an image that is accurate down to its details.
  • The model learning device according to a first aspect of the present disclosure is a model learning device for learning an image generation model that generates a realistic image from a semantic layout image, and includes: an input unit to which a pair of image data representing a learning image and image data representing a semantic layout image corresponding to the learning image is input as teacher data; a dividing unit that performs, L times, a division process that divides each of the learning image represented by the input image data and the semantic layout image represented by its image data into a plurality of regions, and outputs image data corresponding to the divided regions each time the division process is performed; a reduction unit that reduces the semantic layout image input to the input unit and the images of the plurality of regions to a predetermined size; and a learning unit that, for an image generation model comprising L+1 generators that are applied in order to an input image and generate upsampled images, the input size of the first generator being the predetermined size, trains the image generation model such that, when an image obtained by reducing to the predetermined size each of the plurality of regions obtained by dividing the semantic layout image by L-l division processes (0 ≤ l < L) is input to the first generator, the (l+1)-th generator generates an image corresponding to each of the plurality of regions obtained by dividing the learning image by L-l division processes, and such that, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, the (L+1)-th generator generates an image corresponding to the learning image.
  • A model learning device according to a second aspect of the present disclosure is the model learning device of the first aspect in which the learning unit includes a loss function derivation unit that derives a loss function including: the error between the image generated by the (l+1)-th generator when an image obtained by reducing to the predetermined size each of the plurality of regions obtained by dividing the semantic layout image by L-l division processes (0 ≤ l < L) is input to the first generator, and each of the plurality of regions obtained by dividing the learning image by L-l division processes; and the error between the image generated by the (L+1)-th generator when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, and the learning image; and in which the L+1 generators of the image generation model are trained so as to optimize the value of the loss function.
  • A model learning device according to a third aspect of the present disclosure is the model learning device of the first or second aspect in which the learning unit trains a discriminator, which discriminates whether or not an image is one generated by a generator, so as to identify the learning image as not being an image generated by the generator and to identify the image generated by the (L+1)-th generator, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, as an image generated by the generator; and trains the image generation model so that the discriminator identifies the image generated by the (L+1)-th generator, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, as not being an image generated by the generator.
  • A model learning device according to a fourth aspect of the present disclosure is the model learning device of any one of the first to third aspects in which, in one division process, the dividing unit divides an image into two equal parts in the vertical direction and two equal parts in the horizontal direction.
  • A model learning method according to a fifth aspect of the present disclosure is a model learning method for learning an image generation model that generates a realistic image from a semantic layout image, and includes: a step in which a pair of image data representing a learning image and image data representing a semantic layout image corresponding to the learning image is input to an input unit as teacher data; a step in which a dividing unit performs, L times, a division process that divides each of the learning image represented by the input image data and the semantic layout image represented by its image data into a plurality of regions, and outputs image data corresponding to the divided regions each time the division process is performed; a step in which a reduction unit reduces the semantic layout image input to the input unit and the images of the plurality of regions to a predetermined size; and a step in which, for an image generation model comprising L+1 generators that are applied in order to an input image and generate upsampled images, the input size of the first generator being the predetermined size, a learning unit trains the image generation model such that, when an image obtained by reducing to the predetermined size each of the plurality of regions obtained by dividing the semantic layout image by L-l division processes (0 ≤ l < L) is input to the first generator, the (l+1)-th generator generates an image corresponding to each of the plurality of regions obtained by dividing the learning image by L-l division processes, and such that, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, the (L+1)-th generator generates an image corresponding to the learning image.
  • A program according to a sixth aspect of the present disclosure is a program for causing a computer to function as each unit of the model learning device according to any one of the first to fourth aspects.
  • In this embodiment, as an example, the size of the image used for learning (hereinafter, the "learning image") and the size of the image showing a semantic layout (hereinafter, the "semantic layout image") are each 128 pixels vertically and horizontally. The size and aspect ratio of each image can be set to arbitrary values by changing the network configuration described later.
  • FIG. 1 is a block diagram showing the configuration of an example of a model learning device 10 of this exemplary embodiment.
  • the model learning device 10 of this embodiment includes an input unit 20, a dividing unit 22, an image data storage unit 24, a reducing unit 26, a learning unit 28, and an output unit 30.
  • As an example, the model learning device 10 can be configured as a computer including a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a program for executing the model learning processing routine described below and various data.
  • Specifically, the CPU executing the program functions as the input unit 20, the dividing unit 22, the reduction unit 26, the learning unit 28, and the output unit 30 of the model learning device 10 shown in FIG. 1.
  • The teacher data 1 is input to the input unit 20. In this embodiment, the teacher data 1 is prepared as a pair (combination) of a learning image I_0, which is a realistic image, and a semantic layout image S_0 representing the semantic layout obtained by performing semantic segmentation on the learning image I_0. For simplicity of explanation, the case of a single pair of teacher data 1, that is, only one learning image I_0 and one semantic layout image S_0, is described; the method can easily be applied when there are multiple pairs.
  • Several kinds of semantic layout images appear below, such as the semantic layout image S_0 that is the teacher data 1 and the semantic layout images S_1^1 to S_4^1 generated from it (described in detail later). When these are referred to collectively without distinction, the suffixes are omitted and they are simply called the "semantic layout image S". Likewise, when the learning image I_0 and the images derived from it are referred to collectively, they are simply called the "learning image I".
  • Specifically, image data representing the learning image I_0 and image data representing the semantic layout image S_0 are input to the input unit 20 as the teacher data 1, and are then output to the dividing unit 22.
  • The dividing unit 22 receives the image data representing the learning image I_0 and the image data representing the semantic layout image S_0, and pseudo-increases the teacher data by performing, L times (L is an integer of 1 or more), a division process that divides each of the learning image I_0 and the semantic layout image S_0 into a plurality of regions, equally in the vertical and horizontal directions.
  • As an example, the dividing unit 22 of this embodiment divides the learning image I_0 and the semantic layout image S_0 in a pyramid fashion, as in Reference 1. In this embodiment, L = 2, that is, the dividing unit 22 performs the division process twice. By changing the network configuration, the division process can be repeated until the size after division reaches 2 × 2 pixels; the more division processes are performed, the more accurate the details of the images generated by the image generation model 32 become.
  • First, the dividing unit 22 performs one division process on the 128 × 128-pixel semantic layout image S_0, generating four 64 × 64-pixel semantic layout images (S_1^1 to S_4^1, see FIG. 2). Image data representing each of the generated semantic layout images S_1^1 to S_4^1 is output from the dividing unit 22 and stored in the image data storage unit 24.
  • In the suffixes attached to "S", the superscript indicates the number of division processes, and the subscript indicates the index within the group of semantic layout images S obtained simultaneously by that division.
  • The dividing unit 22 of this embodiment then performs the division process on each of the semantic layout images S_1^1 to S_4^1 in the same way, generating sixteen 32 × 32-pixel semantic layout images (S_1^2 to S_16^2, see FIG. 2). Image data representing each of the generated semantic layout images S_1^2 to S_16^2 is output from the dividing unit 22 and stored in the image data storage unit 24.
  • The dividing unit 22 also performs the same division processes on the learning image I_0, generating the learning images I_1^1 to I_4^1 and I_1^2 to I_16^2 (see FIG. 2). Image data representing each of these learning images is output from the dividing unit 22 and stored in the image data storage unit 24. The dividing unit 22 further outputs the image data representing the learning image I_0 and the image data representing the semantic layout image S_0, which are likewise stored in the image data storage unit 24.
  • The image data of each of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2 is input to the reduction unit 26 from the image data storage unit 24, and the reduction unit 26 reduces each of them in size. The reduced size is arbitrary, but a guideline is half the height and width, in other words one quarter of the area, of the semantic layout images S obtained by the L division processes of the dividing unit 22. In this embodiment, since the 32 × 32-pixel semantic layout images S_1^2 to S_16^2 are the smallest, all semantic layout images S are reduced to 16 × 16 pixels.
  • The learning unit 28 receives from the reduction unit 26 the reduced image data of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2, and from the image data storage unit 24 the image data representing each of the learning images I_1^1 to I_4^1 and I_1^2 to I_16^2. The learning unit 28 includes an image generation model 32 and a loss function derivation unit 34, and trains the image generation model 32 to generate the corresponding learning image I_n^l from each semantic layout image S_n^l. In the suffixes of S and I, the superscript l indicates the number of division processes (0 to 2 in this embodiment) and the subscript n indicates the index of the divided region (1 to 16 in this embodiment).
  • The image generation model 32 comprises L+1 = 3 generators (G_2, G_1, G_0) that are applied in order to the input image. The first generator G_2 receives the image data of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2 reduced by the reduction unit 26, upsamples each image to twice its height and width (in other words, four times its size) while generating from the semantic layout image S an image corresponding to the learning image I, and outputs image data representing the generated image as the output G_2(S_n^l) to the second generator G_1. The second generator G_1 generates an image obtained by upsampling the image represented by each output G_2(S_n^l) to twice its height and width, and outputs image data representing the generated image as the output G_1(G_2(S_n^l)) to the third generator G_0. The third generator G_0 likewise generates an image obtained by upsampling each image represented by the outputs G_1(G_2(S_n^l)) to twice its height and width, and outputs image data representing the generated image as the output G_0(G_1(G_2(S_n^l))). The image represented by the final output G_0(G_1(G_2(S_n^l))) of the third generator G_0 has the same size as the learning image I_0 (128 × 128 pixels).
  • Each generator G (G_2, G_1, G_0 in FIG. 2) is a group of layers that performs arbitrary upsampling. For example, as shown in Non-Patent Document 1, it may be a 3 × 3 convolution layer, a normalization layer, an LReLU layer, and a bilinear upsampling layer, or the upsampling may be performed with a deconvolution layer.
  • The loss function derivation unit 34 derives the loss function of the image generation model 32. The loss function is as follows. For the output of each generator G, a reconstruction error with respect to the corresponding learning image I_n^l is used to drive the conversion from the semantic layout image to a realistic image. The reconstruction error computes the pixel-wise agreement between the two images, yielding a loss function whose value is high when the images do not agree, that is, when the image could not be generated accurately.
  • Specifically, a final loss function is derived that includes: a loss function for the first generator G_2, containing the reconstruction error between the image generated by the first generator G_2 when the images obtained by reducing to the predetermined size each of the plurality of regions produced by dividing the semantic layout image S by two division processes are input to it, and each of the plurality of regions produced by dividing the learning image I by two division processes; a loss function for the second generator G_1, containing the reconstruction error between the image generated by the second generator G_1 when the images obtained by reducing to the predetermined size each of the plurality of regions produced by dividing the semantic layout image S by one division process are input to the first generator G_2, and each of the plurality of regions produced by dividing the learning image I by one division process; and a loss function for the third generator G_0, containing the reconstruction error between the image generated by the third generator G_0 when the image obtained by reducing the semantic layout image S to the predetermined size is input to the first generator G_2, and the learning image I. The network parameters are thereby updated so as to generate images more accurately.
  • The final loss function Loss_generator of the image generation model 32 shown in FIG. 2 is defined by equation (4) below; Loss_generator is the loss function of the entire image generation model 32. In equation (4), α, β, and γ are the weights of the respective loss functions and can be set arbitrarily. For example, as shown in Reference 1, the weights may be set according to the number of division processes: if the maximum number of division processes is L, the weight of each loss function may be 1/2^(L-l). The loss function derivation unit 34 outputs the loss function Loss_generator derived based on equations (1) to (4).
  • The learning unit 28 trains the image generation model 32 so as to optimize the loss function Loss_generator output by the loss function derivation unit 34. As a result, the image generated by the image generation model from the semantic layout image S_0 approaches the learning image I_0.
  • A discriminator D as shown in FIG. 3 may be added, and adversarial training of the image generation model 32 and the discriminator D may be performed as shown in Reference 2. The image generation model 32 described so far simultaneously performs optimization for each divided region and optimization of the whole pre-division image; when the per-region optimization is excessive, unnaturalness may appear at the seams between regions. Adversarial training, which imposes the constraint that the whole image look natural, can reduce this unnaturalness.
  • The discriminator D receives the output G_0(G_1(G_2(S_0))) of the generator chain for the semantic layout image S_0 and the image data representing the learning image I_0 as inputs, and discriminates whether each of them is an image generated by the generator G.
  • The learning unit 28 trains the discriminator D so as to identify the learning image I_0 as not being an image generated by the generator G, and to identify the image generated by the third generator G_0, when the image obtained by reducing the semantic layout image S_0 by the reduction unit 26 is input to the first generator G_2, as an image generated by the generator G. The discriminator D is trained so as to optimize its loss function.
  • The loss function of the discriminator D is defined by equation (5). In turn, the learning unit 28 trains the image generation model 32 so that the discriminator D identifies the image generated by the third generator G_0, when the image obtained by reducing the semantic layout image S_0 by the reduction unit 26 is input to the first generator G_2, as not being an image generated by the generator G. The loss function of the generator G is defined by equation (6).
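  • Equations (5) and (6) appear only as images in the published document. Under the standard adversarial formulation that the surrounding text describes (D is trained to label I_0 as real and the generated image as fake, while G is trained so that D labels the generated image as real), they would take a form such as the following; this is a hedged reconstruction, and the patent's exact equations may differ:

    Loss_D = -log D(I_0) - log(1 - D(G_0(G_1(G_2(S_0)))))    ... (5)
    Loss_G_adv = -log D(G_0(G_1(G_2(S_0))))                  ... (6)

    where D(x) is the discriminator's estimated probability that x is a real (non-generated) image.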
  • The image generation model 32 trained by the learning unit 28 is output from the output unit 30 to the outside of the model learning device 10. When image data representing an image obtained by reducing a semantic layout image S is input to the trained image generation model 32, the model generates a realistic image and outputs image data representing it. For example, when the image data of the semantic layout image S_0 reduced to 16 × 16 pixels is input, image data representing an image equivalent to the learning image I_0 is output.
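  • As an illustration of this inference path, the following is a minimal sketch in PyTorch-style Python (hypothetical code: the module names g2, g1, g0 and the use of torch are assumptions, not part of the disclosure):

    import torch
    import torch.nn.functional as F

    def generate_image(semantic_layout, g2, g1, g0):
        """Generate a 128x128 realistic image from a (1, C, 128, 128) semantic layout."""
        # Reduction unit: shrink the layout to the fixed 16x16 input size of G_2.
        x = F.interpolate(semantic_layout, size=(16, 16), mode="nearest")
        x = g2(x)  # 16x16 -> 32x32
        x = g1(x)  # 32x32 -> 64x64
        x = g0(x)  # 64x64 -> 128x128, the same size as the learning image I_0
        return x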
  • FIG. 4 is a flowchart showing an example of the model learning processing routine executed by the model learning device 10 of this embodiment. The routine is run at an arbitrary timing, such as when the teacher data 1 is input to the input unit 20 or when an execution instruction for the routine is received from outside the model learning device 10.
  • In step S100 of FIG. 4, as described above, the dividing unit 22 pseudo-increases the teacher data by performing the division process that divides each of the learning image I_0 and the semantic layout image S_0 into a plurality of regions, equally in the vertical and horizontal directions. Image data representing the generated semantic layout images S_n^l and learning images I_n^l is stored in the image data storage unit 24 for each division process. Next, the reduction unit 26 receives the image data of each of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2 from the image data storage unit 24 as described above, and reduces the size of each of them.
  • Next, the learning unit 28 trains the image generation model 32 as described above. Specifically, the learning unit 28 inputs the image data representing each of the reduced semantic layout images S_n^l received from the reduction unit 26 to the generator G_2, and the outputs are applied in turn to the generator G_1 and the generator G_0. The loss function derivation unit 34 computes the loss function Loss_l for the images generated by each of the generators G_2, G_1, and G_0, and derives the loss function Loss_generator of the entire image generation model 32. The learning unit 28 then trains the image generation model 32 so as to optimize Loss_generator.
  • Finally, the trained image generation model 32 is output from the output unit 30 to the outside of the model learning device 10, and the model learning processing routine ends. A compact sketch of the whole routine in code follows this item.
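  • The following is a minimal end-to-end sketch of one iteration of this routine (hypothetical Python/PyTorch code, assuming L = 2, an L1 pixel-wise reconstruction error, image-like generator outputs with matching channels, and the 1/2^(L-l) weighting; none of these specifics are fixed by the disclosure):

    import torch
    import torch.nn.functional as F

    def split4(img):
        """One division process: the four equal quadrants of an (N, C, H, W) tensor."""
        h, w = img.shape[-2:]
        return [img[..., i:i + h // 2, j:j + w // 2]
                for i in (0, h // 2) for j in (0, w // 2)]

    def train_step(I0, S0, g2, g1, g0, optimizer):
        """One iteration: divide (step S100), reduce, generate, and optimize.
        I0 and S0 are (1, C, 128, 128) tensors."""
        S = {0: [S0], 1: split4(S0)}
        S[2] = [q for s in S[1] for q in split4(s)]
        I = {0: [I0], 1: split4(I0)}
        I[2] = [q for t in I[1] for q in split4(t)]
        # The (l+1)-th generator's output is compared with the regions from L-l divisions.
        chains = {2: lambda x: g2(x),
                  1: lambda x: g1(g2(x)),
                  0: lambda x: g0(g1(g2(x)))}
        loss = 0.0
        for l in (2, 1, 0):
            weight = 0.5 ** (2 - l)  # the 1/2^(L-l) weighting with L = 2
            for s, t in zip(S[l], I[l]):
                x = F.interpolate(s, size=(16, 16), mode="nearest")  # reduction unit
                loss = loss + weight * F.l1_loss(chains[l](x), t)    # reconstruction error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return float(loss)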
  • As described above, the model learning device 10 of this embodiment is a model learning device for learning the image generation model 32 that generates a realistic image from a semantic layout image S, and includes the input unit 20, the dividing unit 22, the reduction unit 26, and the learning unit 28. A pair of image data representing the learning image I_0 and image data representing the semantic layout image S_0 corresponding to the learning image I_0 is input to the input unit 20 as the teacher data 1.
  • The dividing unit 22 performs, L times, the division process that divides each of the learning image I_0 represented by the input image data and the semantic layout image S_0 represented by its image data into a plurality of regions, and outputs image data corresponding to the divided regions each time the division process is performed. The reduction unit 26 reduces the semantic layout image S_0 input to the input unit 20 and the images of the plurality of regions to a predetermined size.
  • For the image generation model 32, which comprises L+1 generators G that are applied in order to an input image and generate upsampled images, the input size of the first generator G_2 being the predetermined size, the learning unit 28 trains the model such that, when the images obtained by reducing to the predetermined size each of the plurality of regions produced by dividing the semantic layout image S_0 by L-l (0 ≤ l < L) division processes are input to the first generator G_2, the (l+1)-th generator G_{l+1} generates images corresponding to each of the plurality of regions produced by dividing the learning image I_0 by L-l division processes, and such that, when the image obtained by reducing the semantic layout image S_0 to the predetermined size is input to the first generator G_2, the (L+1)-th generator G_{L+1} generates an image corresponding to the learning image I_0.
  • In this way, the model learning device 10 of this embodiment changes how the image generated by the generators G is optimized according to the number L of division processes applied to each region of the input semantic layout image S. Therefore, the model learning device 10 of this embodiment can learn an image generation model 32 that generates images that are accurate down to their details.
  • In the embodiment above, the method of dividing the semantic layout image S_0 was described as pyramid-shaped, but the division method is not limited to that form. In pyramid-shaped division, a region is often set so that it straddles several objects, in which case the consistency of the objects is not guaranteed. Therefore, for example, a division method that takes objects into account may be added to the pyramid-shaped division; in this case, the above problem can be mitigated.
  • FIG. 5 is a schematic diagram of an image division method that pays attention to objects. First, candidate regions of the same sizes as those of the pyramid-shaped division are set at random (solid-line frames in the figure); in this example, a 64 × 64-pixel candidate region and a 32 × 32-pixel candidate region were set. The true region of each label is also set (dotted-line frames in the figure), and the IoU (Intersection over Union) between the candidate regions and the true regions is used to select object-aware regions, as sketched below.
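  • The following is a minimal sketch of the IoU computation referred to above (plain Python; the box representation is an assumption):

    def iou(box_a, box_b):
        """Intersection over Union of two axis-aligned boxes (x0, y0, x1, y1)."""
        ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
                 + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
        return inter / union if union > 0 else 0.0

    # A randomly placed 64x64 candidate region against a true label region:
    print(iou((0, 0, 64, 64), (32, 32, 96, 96)))  # 0.142857...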
  • In the above, the mode in which the program is pre-installed was described, but the program may also be stored in a computer-readable recording medium and provided, or provided via a network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present invention makes it possible to learn an image generation model which can generate an image that is accurate down to its details. A partitioning unit 22 outputs image data corresponding to a plurality of partitioned regions for each of L instances of partitioning processing performed on a learning image I0 and a semantic layout image S0. A reduction unit 26 reduces the semantic layout image S0 and the plurality of region images to a prescribed size. A learning unit 28 learns an image generation model 32 such that: upon the input to a first generator G2 of the images obtained by reducing each of the plurality of regions obtained by partitioning the semantic layout image S0 by L-l instances (0 ≤ l < L) of partitioning processing, an (l+1)-th generator G_{l+1} generates an image corresponding to each of the plurality of regions obtained by partitioning the learning image I0 by L-l instances of partitioning processing; and upon the input to the first generator G2 of an image obtained by reducing the semantic layout image S0, an (L+1)-th generator G_{L+1} generates an image corresponding to the learning image I0.

Description

Model learning device, model learning method, and program
The present disclosure relates to a model learning device, a model learning method, and a program.
In machine learning for image processing, such as object recognition, object detection, and semantic segmentation, an enormous number of image/teacher-data pairs is required for learning. In semantic segmentation in particular, a label to be recognized must be assigned to every pixel of an image, so preparing a sufficient amount of teacher data for learning is far more costly than for object recognition or object detection.
Therefore, there is a technique that generates a realistic image, close to an actual photograph, from existing or arbitrarily created simple semantic segmentation teacher data (hereinafter, a "semantic layout") (Non-Patent Document 1). According to this technique, the cost of the above-mentioned teacher data can be reduced by generating various images from a single semantic layout.
In the prior art typified by Non-Patent Document 1, an image that looks good under subjective evaluation of the whole can be generated; in the details of the generated image, however, there is a problem that distortion may occur or object contours may become discontinuous.
When an image generated in this way is used as learning data, the accuracy of the learned model decreases.
The present disclosure has been made in view of the above points, and aims to provide a model learning device, a model learning method, and a program capable of learning an image generation model that generates an image that is accurate down to its details.
In order to achieve the above object, the model learning device according to the first aspect of the present disclosure is a model learning device for learning an image generation model that generates a realistic image from a semantic layout image, and includes: an input unit to which a pair of image data representing a learning image and image data representing a semantic layout image corresponding to the learning image is input as teacher data; a dividing unit that performs, L times, a division process that divides each of the learning image represented by the input image data and the semantic layout image represented by its image data into a plurality of regions, and outputs image data corresponding to the divided regions each time the division process is performed; a reduction unit that reduces the semantic layout image input to the input unit and the images of the plurality of regions to a predetermined size; and a learning unit that, for an image generation model comprising L+1 generators that are applied in order to an input image and generate upsampled images, the input size of the first generator being the predetermined size, trains the image generation model such that, when an image obtained by reducing to the predetermined size each of the plurality of regions obtained by dividing the semantic layout image by L-l division processes (0 ≤ l < L) is input to the first generator, the (l+1)-th generator generates an image corresponding to each of the plurality of regions obtained by dividing the learning image by L-l division processes, and such that, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, the (L+1)-th generator generates an image corresponding to the learning image.
The model learning device according to the second aspect of the present disclosure is the model learning device of the first aspect in which the learning unit includes a loss function derivation unit that derives a loss function including: the error between the image generated by the (l+1)-th generator when an image obtained by reducing to the predetermined size each of the plurality of regions obtained by dividing the semantic layout image by L-l division processes (0 ≤ l < L) is input to the first generator, and each of the plurality of regions obtained by dividing the learning image by L-l division processes; and the error between the image generated by the (L+1)-th generator when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, and the learning image; and in which the L+1 generators of the image generation model are trained so as to optimize the value of the loss function.
The model learning device according to the third aspect of the present disclosure is the model learning device of the first or second aspect in which the learning unit trains a discriminator, which discriminates whether or not an image is one generated by a generator, so as to identify the learning image as not being an image generated by the generator and to identify the image generated by the (L+1)-th generator, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, as an image generated by the generator; and trains the image generation model so that the discriminator identifies the image generated by the (L+1)-th generator, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, as not being an image generated by the generator.
The model learning device according to the fourth aspect of the present disclosure is the model learning device of any one of the first to third aspects in which, in one division process, the dividing unit divides an image into two equal parts in the vertical direction and two equal parts in the horizontal direction.
In order to achieve the above object, the model learning method according to the fifth aspect of the present disclosure is a model learning method for learning an image generation model that generates a realistic image from a semantic layout image, and includes: a step in which a pair of image data representing a learning image and image data representing a semantic layout image corresponding to the learning image is input to an input unit as teacher data; a step in which a dividing unit performs, L times, a division process that divides each of the learning image represented by the input image data and the semantic layout image represented by its image data into a plurality of regions, and outputs image data corresponding to the divided regions each time the division process is performed; a step in which a reduction unit reduces the semantic layout image input to the input unit and the images of the plurality of regions to a predetermined size; and a step in which, for an image generation model comprising L+1 generators that are applied in order to an input image and generate upsampled images, the input size of the first generator being the predetermined size, a learning unit trains the image generation model such that, when an image obtained by reducing to the predetermined size each of the plurality of regions obtained by dividing the semantic layout image by L-l division processes (0 ≤ l < L) is input to the first generator, the (l+1)-th generator generates an image corresponding to each of the plurality of regions obtained by dividing the learning image by L-l division processes, and such that, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, the (L+1)-th generator generates an image corresponding to the learning image.
In order to achieve the above object, the program according to the sixth aspect of the present disclosure is a program for causing a computer to function as each unit of the model learning device according to any one of the first to fourth aspects.
According to the present disclosure, it is possible to learn an image generation model that generates an image that is accurate down to its details.
FIG. 1 is a block diagram showing the configuration of an example of the model learning device of the embodiment. FIG. 2 is a diagram showing an example of the image generation model of the embodiment. FIG. 3 is a diagram showing an example of an image generation model provided with a discriminator. FIG. 4 is a flowchart showing an example of the model learning processing routine in the model learning device of the embodiment. FIG. 5 is a diagram explaining an example of another image division method.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the drawings. As an example, in this embodiment the size of the image used for learning (hereinafter, the "learning image") and the size of the image showing a semantic layout (hereinafter, the "semantic layout image") are each 128 pixels vertically and horizontally. The size and aspect ratio of each image can be set to arbitrary values by changing the network configuration described later.
<Configuration of the model learning device of this embodiment>
FIG. 1 is a block diagram showing the configuration of an example of the model learning device 10 of this embodiment. As shown in FIG. 1, the model learning device 10 includes an input unit 20, a dividing unit 22, an image data storage unit 24, a reduction unit 26, a learning unit 28, and an output unit 30.
As an example, the model learning device 10 of this embodiment can be configured as a computer including a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a program for executing the model learning processing routine described below and various data. Specifically, the CPU executing the program functions as the input unit 20, the dividing unit 22, the reduction unit 26, the learning unit 28, and the output unit 30 of the model learning device 10 shown in FIG. 1.
As shown in FIG. 1, the teacher data 1 is input to the input unit 20. In this embodiment, the teacher data 1 is prepared as a pair (combination) of a learning image I_0, which is a realistic image, and a semantic layout image S_0 representing the semantic layout obtained by performing semantic segmentation on the learning image I_0. For simplicity of explanation, the case of a single pair of teacher data 1, that is, only one learning image I_0 and one semantic layout image S_0, is described; the method can easily be applied when there are multiple pairs. Several kinds of semantic layout images appear below, such as the semantic layout image S_0 that is the teacher data 1 and the semantic layout images S_1^1 to S_4^1 generated from it (described in detail later); when these are referred to collectively without distinction, the suffixes are omitted and they are simply called the "semantic layout image S". Likewise, the learning image I_0 and the images derived from it are collectively called the "learning image I".
Specifically, image data representing the learning image I_0 and image data representing the semantic layout image S_0 are input to the input unit 20 as the teacher data 1, and are then output to the dividing unit 22.
The dividing unit 22 receives the image data representing the learning image I_0 and the image data representing the semantic layout image S_0, and pseudo-increases the teacher data by performing, L times (L is an integer of 1 or more), a division process that divides each of the learning image I_0 and the semantic layout image S_0 into a plurality of regions, equally in the vertical and horizontal directions.
As an example, the dividing unit 22 of this embodiment divides the learning image I_0 and the semantic layout image S_0 in a pyramid fashion, as in Reference 1.
[Reference 1] Svetlana Lazebnik et al., "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", <URL: http://mplab.ucsd.edu/~marni/Igert/Lazebnik_06.pdf>
As an example, this embodiment sets L = 2, and the case where the dividing unit 22 performs the division process twice is described. By changing the network configuration, the division process can be repeated until the size after division reaches 2 × 2 pixels. The more division processes are performed, the more accurate the details of the images generated by the image generation model 32 become.
First, the dividing unit 22 performs one division process on the 128 × 128-pixel semantic layout image S_0, generating four 64 × 64-pixel semantic layout images (S_1^1 to S_4^1, see FIG. 2). Image data representing each of the generated semantic layout images S_1^1 to S_4^1 is output from the dividing unit 22 and stored in the image data storage unit 24.
In the suffixes attached to "S", the superscript indicates the number of division processes, and the subscript indicates the index within the group of semantic layout images S obtained simultaneously by that division.
The dividing unit 22 of this embodiment then performs the division process on each of the semantic layout images S_1^1 to S_4^1 in the same way, generating sixteen 32 × 32-pixel semantic layout images (S_1^2 to S_16^2, see FIG. 2), whose image data is likewise output from the dividing unit 22 and stored in the image data storage unit 24.
The dividing unit 22 also performs the same division processes on the learning image I_0, generating the learning images I_1^1 to I_4^1 and I_1^2 to I_16^2 (see FIG. 2); image data representing each of these is output from the dividing unit 22 and stored in the image data storage unit 24.
In the model learning device 10 of this embodiment, the dividing unit 22 further outputs the image data representing the learning image I_0 and the image data representing the semantic layout image S_0, which are likewise stored in the image data storage unit 24.
The image data of each of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2 is input to the reduction unit 26 from the image data storage unit 24, and the reduction unit 26 reduces each of them in size. The reduced size is arbitrary, but a guideline is half the height and width, in other words one quarter of the area, of the semantic layout images S obtained by the L division processes of the dividing unit 22. As described above, in this embodiment the 32 × 32-pixel semantic layout images S_1^2 to S_16^2 are the smallest, so all semantic layout images S are reduced to 16 × 16 pixels.
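The division and reduction steps can be illustrated with the following minimal sketch (hypothetical PyTorch code; the patent does not prescribe an implementation):

    import torch
    import torch.nn.functional as F

    def split4(img):
        """One division process: the four equal quadrants of a (C, H, W) tensor."""
        c, h, w = img.shape
        return [img[:, i:i + h // 2, j:j + w // 2]
                for i in (0, h // 2) for j in (0, w // 2)]

    S0 = torch.randn(3, 128, 128)                      # stand-in for the 128x128 layout S_0
    level1 = split4(S0)                                # four 64x64 images S_1^1 .. S_4^1
    level2 = [q for s in level1 for q in split4(s)]    # sixteen 32x32 images S_1^2 .. S_16^2
    # Reduction unit 26: every layout, whatever its size, is reduced to 16x16.
    reduced = [F.interpolate(s.unsqueeze(0), size=(16, 16), mode="nearest")
               for s in [S0] + level1 + level2]
    print(len(level1), len(level2), reduced[0].shape)  # 4 16 torch.Size([1, 3, 16, 16])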
The learning unit 28 receives from the reduction unit 26 the reduced image data of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2, and from the image data storage unit 24 the image data representing each of the learning images I_1^1 to I_4^1 and I_1^2 to I_16^2. The learning unit 28 includes an image generation model 32 and a loss function derivation unit 34.
The learning unit 28 trains the image generation model 32 to generate the corresponding learning image I_n^l from each semantic layout image S_n^l. In the suffixes of S and I, the superscript l indicates the number of division processes (0 to 2 in this embodiment) and the subscript n indicates the index of the divided region (1 to 16 in this embodiment).
The image generation model 32 comprises L+1 generators that are applied in order to the input image and generate upsampled images, the input size of the first generator being the predetermined size. For example, with L = 2, the image generation model 32 of this embodiment includes three generators G (G_2, G_1, G_0), as shown in FIG. 2. The first generator G_2 receives from the reduction unit 26 the image data of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2 reduced in size.
The first generator G_2 upsamples each reduced semantic layout image to twice its height and width, in other words four times its size, while generating from the semantic layout image S an image corresponding to the learning image I, and outputs image data representing the generated image as the output G_2(S_n^l) to the generator G_1.
The output G_2(S_n^l) of the first generator G_2 is input to the second generator G_1. The second generator G_1 generates an image obtained by upsampling the image represented by each output G_2(S_n^l) to twice its height and width, in other words four times its size, and outputs image data representing the generated image as the output G_1(G_2(S_n^l)) to the third generator G_0.
The output G_1(G_2(S_n^l)) of the second generator G_1 is input to the third generator G_0. The third generator G_0 generates an image obtained by upsampling the image represented by each output G_1(G_2(S_n^l)) to twice its height and width, in other words four times its size, and outputs image data representing the generated image as the output G_0(G_1(G_2(S_n^l))). The image represented by the final output G_0(G_1(G_2(S_n^l))) of the third generator G_0 has the same size as the learning image I_0 (128 × 128 pixels).
Each generator G (G_2, G_1, G_0 in FIG. 2) is a group of layers that performs arbitrary upsampling. For example, as shown in Non-Patent Document 1, it may be a 3 × 3 convolution layer, a normalization layer, an LReLU layer, and a bilinear upsampling layer, or the upsampling may be performed with a deconvolution layer.
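A minimal sketch of such a layer group follows (hypothetical PyTorch code; the channel sizes and the choice of InstanceNorm are assumptions, since the patent leaves the layers arbitrary):

    import torch
    import torch.nn as nn

    class UpBlock(nn.Module):
        """One generator G: 3x3 convolution -> normalization -> LReLU -> bilinear 2x upsampling."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.InstanceNorm2d(out_ch),
                nn.LeakyReLU(0.2),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            )

        def forward(self, x):
            return self.body(x)

    # Chain of L+1 = 3 generators: 16x16 -> 32x32 -> 64x64 -> 128x128.
    g2, g1, g0 = UpBlock(3, 64), UpBlock(64, 64), UpBlock(64, 3)
    y = g0(g1(g2(torch.randn(1, 3, 16, 16))))
    print(y.shape)  # torch.Size([1, 3, 128, 128])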
Meanwhile, the loss function derivation unit 34 derives the loss function of the image generation model 32. The loss function is as follows.
For the output of each generator G, a reconstruction error with respect to the corresponding learning image I_n^l is used to drive the conversion from the semantic layout image to a realistic image. The reconstruction error computes the pixel-wise agreement between the two images, yielding a loss function whose value is high when the images do not agree, that is, when the image could not be generated accurately.
Specifically, a final loss function is derived that includes: a loss function for the first generator G_2, containing the reconstruction error between the image generated by the first generator G_2 when the images obtained by reducing to the predetermined size each of the plurality of regions produced by dividing the semantic layout image S by two division processes are input to it, and each of the plurality of regions produced by dividing the learning image I by two division processes; a loss function for the second generator G_1, containing the reconstruction error between the image generated by the second generator G_1 when the images obtained by reducing to the predetermined size each of the plurality of regions produced by dividing the semantic layout image S by one division process are input to the first generator G_2, and each of the plurality of regions produced by dividing the learning image I by one division process; and a loss function for the third generator G_0, containing the reconstruction error between the image generated by the third generator G_0 when the image obtained by reducing the semantic layout image S to the predetermined size is input to the first generator G_2, and the learning image I. The network parameters are thereby updated so as to generate images more accurately.
For example, the loss function Loss_{l=2} for the first generator G_2 is defined by equation (1) below, where d(·, ·) denotes the pixel-wise reconstruction error described above:

$$\mathrm{Loss}_{l=2} = \sum_{n=1}^{N_2} d\left( G_2(S_n^2),\ I_n^2 \right) \qquad (1)$$
N_2 in equation (1) above is the number of images obtained as a result of division when the division process has been performed twice, that is, N_2 = 16 here. The loss function Loss_{l=2} is calculated only when image data representing a semantic layout image S_n^2 that has undergone two division processes is input; it is not calculated when image data representing a semantic layout image S with a different number of division processes is input.
Similarly, the loss function Loss_{l=1} for the second generator G_1 is defined by equation (2) below:

$$\mathrm{Loss}_{l=1} = \sum_{n=1}^{N_1} d\left( G_1(G_2(S_n^1)),\ I_n^1 \right) \qquad (2)$$
N_1 in equation (2) above is the number of images obtained as a result of division when the division process has been performed once, that is, N_1 = 4 here. The loss function Loss_{l=1} is calculated only when image data representing a semantic layout image S_n^1 that has undergone one division process is input; it is not calculated when image data representing a semantic layout image S with a different number of division processes is input.
The loss function Loss_{l=0} for the third generator G_0 is defined by equation (3) below:

$$\mathrm{Loss}_{l=0} = d\left( G_0(G_1(G_2(S_0))),\ I_0 \right) \qquad (3)$$
Accordingly, the final loss function Loss_generator of the image generation model 32 shown in Fig. 2 is defined by equation (4) below. The loss function Loss_generator is the loss function of the entire image generation model 32:

$$\mathrm{Loss}_{generator} = \alpha \, \mathrm{Loss}_{l=2} + \beta \, \mathrm{Loss}_{l=1} + \gamma \, \mathrm{Loss}_{l=0} \qquad (4)$$
In equation (4) above, α, β, and γ are the weights of the respective loss functions and can be set arbitrarily. For example, as shown in Reference 1 above, the weights may be set according to the number of division processes; if the maximum number of division processes is L, the weight of each loss function may be set to 1/2^{L−l}.
The loss function derivation unit 34 outputs the loss function Loss_generator derived on the basis of equations (1) to (4) above.
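As an illustration only, the following is a minimal sketch of how Loss_generator of equation (4) could be computed, assuming PyTorch, an L1 norm as the pixel-wise reconstruction error d(·, ·), and the weights 1/2^{L−l} with L = 2; the function and variable names are not from this embodiment.

```python
import torch
import torch.nn.functional as F

def loss_generator(pyramid):
    """Compute Loss_generator of equation (4) over the three scales l = 2, 1, 0.

    pyramid: list of (generated, target) pairs ordered l = 2, 1, 0, where the
             N_l regions of each scale are stacked along the batch dimension.
    """
    weights = (1.0, 0.5, 0.25)  # 1/2**(L-l) with L = 2: l=2 -> 1, l=1 -> 1/2, l=0 -> 1/4
    total = torch.zeros(())
    for (generated, target), w in zip(pyramid, weights):
        # Pixel-wise reconstruction error d(., .); L1 is one concrete choice.
        total = total + w * F.l1_loss(generated, target)
    return total

# Example with L = 2: 16 regions of 32x32, 4 regions of 64x64, 1 image of 128x128.
pyramid = [
    (torch.randn(16, 3, 32, 32), torch.randn(16, 3, 32, 32)),    # Loss_{l=2}
    (torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)),      # Loss_{l=1}
    (torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128)),  # Loss_{l=0}
]
loss = loss_generator(pyramid)
```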
The learning unit 28 performs learning of the image generation model 32 so as to optimize the loss function Loss_generator output by the loss function derivation unit 34. As a result, the image generated by the image generation model from the semantic layout image S_0 approaches the learning image I_0.
Note that a discriminator D as shown in Fig. 3 may be added, and adversarial learning between the image generation model 32 and the discriminator D may be performed as shown in Reference 2. The image generation model 32 described so far simultaneously performs optimization for each divided region and optimization of the whole image before division; however, when the per-region optimization is excessive, unnaturalness may appear at the seams between regions. Adversarial learning, which imposes the constraint that the whole image look natural, can therefore reduce this unnaturalness.
The discriminator D receives, as inputs, the output G_0(G_1(G_2(S_0))) of the generators for the semantic layout image S_0 and image data representing the learning image I_0, and discriminates whether or not each is an image generated by the generator G.
The learning unit 28 trains the discriminator D to discriminate that the learning image I_0 is not an image generated by the generator G, and that the image generated by the third generator G_0 when the image obtained by reducing the semantic layout image S_0 by the reduction unit 26 is input to the first generator G_2 is an image generated by the generator G. Specifically, the discriminator D is trained so as to optimize the loss function of the discriminator D.
For example, the loss function of the discriminator D is defined by equation (5) below, written here in the standard adversarial form of Reference 2, with the expectation taken over the teacher data:

$$\mathrm{Loss}_{discriminator} = -\mathbb{E}\left[ \log D(I_0) \right] - \mathbb{E}\left[ \log \left( 1 - D(G_0(G_1(G_2(S_0)))) \right) \right] \qquad (5)$$
Further, the learning unit 28 performs learning of the image generation model 32 such that the discriminator D discriminates that the image generated by the third generator G_0 when the image obtained by reducing the semantic layout image S_0 by the reduction unit 26 is input to the first generator G_2 is not an image generated by the generator G.
For example, the loss function of the generator G is defined by equation (6) below, again written in the standard form of Reference 2; this adversarial term complements the reconstruction loss of equation (4):

$$\mathrm{Loss}_{G} = -\mathbb{E}\left[ \log D(G_0(G_1(G_2(S_0)))) \right] \qquad (6)$$
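The following is a minimal sketch of one alternating update implementing equations (5) and (6), assuming PyTorch, a discriminator D that outputs a probability in [0, 1], and optimizers defined elsewhere; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def adversarial_step(G, D, opt_G, opt_D, s0_reduced, i0):
    """One alternating update: train D to separate the learning image from the
    generated image (equation (5)), then train G so that D judges its output
    to be a real image (equation (6))."""
    fake = G(s0_reduced)  # G_0(G_1(G_2(S_0))), same size as the learning image I_0

    # Discriminator update, equation (5)
    d_real, d_fake = D(i0), D(fake.detach())  # detach: no gradient into G here
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad()
    loss_d.backward()
    opt_D.step()

    # Generator update, equation (6)
    d_fake = D(fake)
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad()
    loss_g.backward()
    opt_G.step()
    return loss_d.item(), loss_g.item()
```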
[Reference 2] Ian J. Goodfellow et al., "Generative Adversarial Nets", Internet search <URL: http://datascienceassn.org/sites/default/files/Generative%20Adversarial%20Nets.pdf>
The image generation model 32 trained by the learning unit 28 is output from the output unit 30 to the outside of the model learning device 10. In an image generation device using the trained image generation model 32, when image data representing an image obtained by reducing a semantic layout image S is input, the trained image generation model 32 generates a realistic image and outputs image data representing that image. In the specific example shown in Fig. 2, when image data of an image in which the semantic layout image S_0 has been reduced to 16 pixels × 16 pixels is input, image data representing an image comparable to the learning image I_0 is output.
<Operation of the model learning device of the present embodiment>
Next, the operation of the model learning device 10 of the present embodiment will be described with reference to the drawings. Fig. 4 is a flowchart showing an example of a model learning processing routine executed in the model learning device 10 of the present embodiment.
The model learning processing routine shown in Fig. 4 is executed at an arbitrary timing, for example, when the teacher data 1 is input to the input unit 20 or when an instruction to execute the model learning processing routine is received from outside the model learning device 10.
In step S100 of Fig. 4, as described above, the division unit 22 pseudo-increases the teacher data by performing a division process that divides each of the image data representing the learning image I_0 and the image data representing the semantic layout image S_0 into a plurality of regions, equally in the vertical and horizontal directions. For each division process, image data representing the generated semantic layout images S_n^l and image data representing the learning images I_n^l are stored in the image data storage unit 24.
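As an illustration of this division process, the following is a minimal sketch assuming NumPy arrays in (height, width, channel) layout; the function names are illustrative.

```python
import numpy as np

def divide_once(image):
    """Split an image into four equal regions: two vertically by two horizontally."""
    h, w = image.shape[0] // 2, image.shape[1] // 2
    return [image[:h, :w], image[:h, w:], image[h:, :w], image[h:, w:]]

def divide_l_times(image, l):
    """Apply the division process l times, yielding the 4**l regions S_n^l (or I_n^l)."""
    regions = [image]
    for _ in range(l):
        regions = [part for region in regions for part in divide_once(region)]
    return regions

# Example: a 128x128 image divided twice gives 16 regions of 32x32 each.
s0 = np.zeros((128, 128, 3), dtype=np.uint8)
assert len(divide_l_times(s0, 2)) == 16
assert divide_l_times(s0, 2)[0].shape == (32, 32, 3)
```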
In the next step S102, as described above, the reduction unit 26 receives the image data of each of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2 from the image data storage unit 24, and reduces the size of each of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2.
In the next step S104, as described above, the learning unit 28 performs learning of the image generation model 32. The learning unit 28 inputs the image data representing each of the reduced semantic layout images S_n^l received from the reduction unit 26 to the generator G_2. The loss function derivation unit 34 then feeds the output sequentially to the generator G_1 and the generator G_0, and integrates the loss functions Loss_l obtained when images are generated by the generators G_2, G_1, and G_0 to derive the loss function Loss_generator of the entire image generation model 32. The learning unit 28 performs learning of the image generation model 32 so as to optimize the loss function Loss_generator. The trained image generation model 32 is output from the output unit 30 to the outside of the model learning device 10, and the model learning processing routine ends.
As described above, the model learning device 10 of the present embodiment is a model learning device for learning the image generation model 32 that generates a realistic image from a semantic layout image S. The model learning device 10 includes the input unit 20, the division unit 22, the reduction unit 26, and the learning unit 28. A pair of image data representing a learning image I_0 and image data representing a semantic layout image S_0 corresponding to the learning image I_0 is input to the input unit 20 as the teacher data 1. The division unit 22 performs, L times, a division process of dividing each of the learning image I_0 represented by the input image data and the semantic layout image S_0 represented by the input image data into a plurality of regions, and outputs, each time the division process is performed, image data corresponding to the divided regions. The reduction unit 26 reduces the sizes of the semantic layout image S_0 input to the input unit 20 and of the images of the plurality of regions to a predetermined size. The learning unit 28 performs learning of the image generation model 32, which includes L+1 generators G applied in order to an input image to generate upsampled images, the input size of the first generator G_2 being the predetermined size, such that, when an image obtained by reducing each of the plurality of regions obtained by dividing the semantic layout image S_0 by L−l division processes (0 ≤ l < L) to the predetermined size is input to the first generator G_2, the (l+1)th generator G_{l+1} generates an image corresponding to each of the plurality of regions obtained by dividing the learning image I_0 by the L−l division processes, and, when an image obtained by reducing the semantic layout image S_0 to the predetermined size is input to the first generator G_2, the (L+1)th generator G_{L+1} generates an image corresponding to the learning image I_0.
With the above configuration, the model learning device 10 of the present embodiment optimizes, for each region obtained by dividing the input semantic layout image S, the image generated by the generator G corresponding to the number L of division processes. Therefore, according to the model learning device 10 of the present embodiment, it is possible to learn an image generation model 32 capable of generating images that are accurate even in their details.
Note that although the present embodiment has described a pyramid-shaped division as the method of dividing the semantic layout image S_0, the division method is not limited to this form. In pyramid-shaped division, regions are often set so as to straddle objects, in which case the consistency of the objects is not guaranteed. Therefore, for example, an object-aware division method may be used in addition to the pyramid-shaped division, in which case the above problem can be mitigated.
Fig. 5 is a schematic diagram of an object-aware image division method.
For the semantic layout image S_0, candidate regions of the same sizes as in the pyramid-shaped division are set at random (solid-line frames in the figure). Following the earlier example, candidate regions of 64 pixels × 64 pixels and 32 pixels × 32 pixels are used. The true region of each label is also set (dotted-line frames in the figure).
Next, for each of these candidate regions, the IoU (Intersection over Union) is computed for each label, and by using for learning only the candidate regions whose IoU exceeds a threshold, it becomes possible to generate images while preserving object consistency. The IoU is defined as the number of pixels in the intersection of the true region and the candidate region divided by the number of pixels in the union of the true region and the candidate region.
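A minimal sketch of this selection criterion is shown below, assuming binary NumPy masks (one true-region mask per label) and an illustrative threshold of 0.5; the names are not from this embodiment.

```python
import numpy as np

def iou(true_mask: np.ndarray, candidate_mask: np.ndarray) -> float:
    """IoU: pixels in the intersection of the true and candidate regions,
    divided by pixels in their union."""
    intersection = np.logical_and(true_mask, candidate_mask).sum()
    union = np.logical_or(true_mask, candidate_mask).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

def keep_candidate(label_masks, y, x, size, threshold=0.5):
    """Keep a size x size candidate region at (y, x) if its IoU with the true
    region of at least one label exceeds the threshold."""
    any_mask = next(iter(label_masks.values()))
    candidate = np.zeros_like(any_mask, dtype=bool)
    candidate[y:y + size, x:x + size] = True
    return any(iou(mask, candidate) > threshold for mask in label_masks.values())
```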
Note that the present disclosure is not limited to the above embodiment, and various modifications and applications are possible without departing from the gist of the present disclosure.
Further, although the present embodiment has described a mode in which the above program is installed in advance, the program can also be provided stored in a computer-readable recording medium, or provided via a network.
1 Teacher data
10 Model learning device
20 Input unit
22 Division unit
26 Reduction unit
28 Learning unit
32 Image generation model
34 Loss function derivation unit
D Discriminator
G_0, G_1, G_2 Generators
I, I_0, I_n^l Learning images
S, S_0, S_n^l Semantic layout images

Claims (6)

  1.  A model learning device for learning an image generation model that generates a realistic image from a semantic layout image, the model learning device comprising:
     an input unit to which a pair of image data representing a learning image and image data representing a semantic layout image corresponding to the learning image is input as teacher data;
     a division unit that performs, L times, a division process of dividing each of the learning image represented by the input image data of the learning image and the semantic layout image represented by the image data of the semantic layout image into a plurality of regions, and that outputs, each time the division process is performed, image data corresponding to the divided regions;
     a reduction unit that reduces the sizes of the semantic layout image input to the input unit and of the images of the plurality of regions to a predetermined size; and
     a learning unit that performs learning of the image generation model, which includes L+1 generators that generate upsampled images and are applied in order to an input image, the input size of the first generator being the predetermined size, such that,
     when an image obtained by reducing each of a plurality of regions obtained by dividing the semantic layout image by L−l division processes (0 ≤ l < L) to the predetermined size is input to the first generator, the (l+1)th generator generates an image corresponding to each of a plurality of regions obtained by dividing the learning image by the L−l division processes, and such that,
     when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, the (L+1)th generator generates an image corresponding to the learning image.
  2.  The model learning device according to claim 1, wherein
     the learning unit includes a loss function derivation unit that derives a loss function including an error between the image generated by the (l+1)th generator when an image obtained by reducing each of the plurality of regions obtained by dividing the semantic layout image by L−l division processes (0 ≤ l < L) to the predetermined size is input to the first generator and each of the plurality of regions obtained by dividing the learning image by the L−l division processes, and an error between the image generated by the (L+1)th generator when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator and the learning image, and
     the learning unit learns the L+1 generators of the image generation model so as to optimize the value of the loss function.
  3.  The model learning device according to claim 1 or claim 2, wherein
     the learning unit trains a discriminator, which discriminates whether or not an image is an image generated by a generator, so as to discriminate that the learning image is not an image generated by the generators and that the image generated by the (L+1)th generator when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator is an image generated by the generators, and
     the learning unit performs learning of the image generation model such that the discriminator discriminates that the image generated by the (L+1)th generator when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator is not an image generated by the generators.
  4.  The model learning device according to any one of claims 1 to 3, wherein
     the division unit divides an image, in one division process, into two equal parts in the vertical direction and two equal parts in the horizontal direction.
  5.  A model learning method for learning an image generation model that generates a realistic image from a semantic layout image, the model learning method comprising:
     a step in which a pair of image data representing a learning image and image data representing a semantic layout image corresponding to the learning image is input to an input unit as teacher data;
     a step in which a division unit performs, L times, a division process of dividing each of the learning image represented by the input image data of the learning image and the semantic layout image represented by the image data of the semantic layout image into a plurality of regions, and outputs, each time the division process is performed, image data corresponding to the divided regions;
     a step in which a reduction unit reduces the sizes of the semantic layout image input to the input unit and of the images of the plurality of regions to a predetermined size; and
     a step in which a learning unit performs learning of the image generation model, which includes L+1 generators that generate upsampled images and are applied in order to an input image, the input size of the first generator being the predetermined size, such that,
     when an image obtained by reducing each of a plurality of regions obtained by dividing the semantic layout image by L−l division processes (0 ≤ l < L) to the predetermined size is input to the first generator, the (l+1)th generator generates an image corresponding to each of a plurality of regions obtained by dividing the learning image by the L−l division processes, and such that,
     when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, the (L+1)th generator generates an image corresponding to the learning image.
  6.  A program for causing a computer to function as each unit of the model learning device according to any one of claims 1 to 4.
PCT/JP2019/047940 2018-12-18 2019-12-06 Model learning device, model learning method, and program WO2020129716A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018236645A JP2020098490A (en) 2018-12-18 2018-12-18 Model learning device, model learning method, and program
JP2018-236645 2018-12-18

Publications (1)

Publication Number Publication Date
WO2020129716A1 true WO2020129716A1 (en) 2020-06-25

Family

ID=71102789

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/047940 WO2020129716A1 (en) 2018-12-18 2019-12-06 Model learning device, model learning method, and program

Country Status (2)

Country Link
JP (1) JP2020098490A (en)
WO (1) WO2020129716A1 (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN, QIFENG ET AL.: "Photographic Image Synthesis with Cascaded Refinement Networks", 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 25 December 2017 (2017-12-25), pages 1520 - 1529, XP033283011 *
WANG, TING-CHUN ET AL.: "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs", 2018 IEEE /CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 17 December 2018 (2018-12-17), pages 8798 - 8807, XP033473804 *

Also Published As

Publication number Publication date
JP2020098490A (en) 2020-06-25

Similar Documents

Publication Publication Date Title
JP6441980B2 (en) Method, computer and program for generating teacher images
CN109905624B (en) Video frame interpolation method, device and equipment
CN109191382B (en) Image processing method, device, electronic equipment and computer readable storage medium
KR102192850B1 (en) Method and device for generating feature maps by using feature upsampling networks
CN110879959B (en) Method and device for generating data set, and testing method and testing device using same
US11514694B2 (en) Teaching GAN (generative adversarial networks) to generate per-pixel annotation
US10311547B2 (en) Image upscaling system, training method thereof, and image upscaling method
CN113240580A (en) Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN111386536A (en) Semantically consistent image style conversion
RU2735148C1 (en) Training gan (generative adversarial networks) to create pixel-by-pixel annotation
CN109447897B (en) Real scene image synthesis method and system
CN111861886B (en) Image super-resolution reconstruction method based on multi-scale feedback network
WO2023050651A1 (en) Semantic image segmentation method and apparatus, and device and storage medium
CN112164008A (en) Training method of image data enhancement network, and training device, medium, and apparatus thereof
US20230162409A1 (en) System and method for generating images of the same style based on layout
US20230267686A1 (en) Subdividing a three-dimensional mesh utilizing a neural network
KR20210034462A (en) Method for training generative adversarial networks to generate per-pixel annotation
CN114648681B (en) Image generation method, device, equipment and medium
CN113763366B (en) Face changing method, device, equipment and storage medium
CN113129231B (en) Method and system for generating high-definition image based on countermeasure generation network
KR102228128B1 (en) Method and system for learning self-organizing generative networks
WO2020129716A1 (en) Model learning device, model learning method, and program
CN113554047A (en) Training method of image processing model, image processing method and corresponding device
CN116843901A (en) Medical image segmentation model training method and medical image segmentation method
JP6647475B2 (en) Language processing apparatus, language processing system, and language processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19898035

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19898035

Country of ref document: EP

Kind code of ref document: A1