WO2020129716A1 - Model learning device, model learning method, and program - Google Patents

Model learning device, model learning method, and program

Info

Publication number
WO2020129716A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
learning
generator
input
unit
Prior art date
Application number
PCT/JP2019/047940
Other languages
French (fr)
Japanese (ja)
Inventor
崇之 梅田
慎吾 安藤
淳 嵯峨田
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Publication of WO2020129716A1 publication Critical patent/WO2020129716A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis

Definitions

  • The present disclosure relates to a model learning device, a model learning method, and a program.
  • There is a technique that generates a realistic image, close to an actual photograph, from existing or arbitrarily created teacher data for simple semantic segmentation (hereinafter, a "semantic layout") (Non-Patent Document 1). According to this technique, the cost of preparing such teacher data can be reduced by generating various images from a single semantic layout.
  • In the prior art typified by Non-Patent Document 1, an image that looks good under subjective evaluation of the whole can be generated; in the details of the generated image, however, there is a problem that distortion may occur or object contours may become discontinuous.
  • The present disclosure has been made in view of the above points, and aims to provide a model learning device, a model learning method, and a program capable of learning an image generation model that generates an image that is accurate down to its details.
  • The model learning device according to a first aspect of the present disclosure is a model learning device for learning an image generation model that generates a realistic image from a semantic layout image, and includes: an input unit to which a pair of image data representing a learning image and image data representing a semantic layout image corresponding to the learning image is input as teacher data; a dividing unit that performs, L times, a division process that divides each of the learning image represented by the input image data and the semantic layout image represented by its image data into a plurality of regions, and outputs image data corresponding to the divided regions each time the division process is performed; a reduction unit that reduces the semantic layout image input to the input unit and the images of the plurality of regions to a predetermined size; and a learning unit that, for an image generation model comprising L+1 generators that are applied in order to an input image and generate upsampled images, the input size of the first generator being the predetermined size, trains the image generation model such that, when an image obtained by reducing to the predetermined size each of the plurality of regions obtained by dividing the semantic layout image by L-l division processes (0 ≤ l < L) is input to the first generator, the (l+1)-th generator generates an image corresponding to each of the plurality of regions obtained by dividing the learning image by L-l division processes, and such that, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, the (L+1)-th generator generates an image corresponding to the learning image.
  • A model learning device according to a second aspect of the present disclosure is the model learning device of the first aspect in which the learning unit includes a loss function derivation unit that derives a loss function including: the error between the image generated by the (l+1)-th generator when an image obtained by reducing to the predetermined size each of the plurality of regions obtained by dividing the semantic layout image by L-l division processes (0 ≤ l < L) is input to the first generator, and each of the plurality of regions obtained by dividing the learning image by L-l division processes; and the error between the image generated by the (L+1)-th generator when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, and the learning image; and in which the L+1 generators of the image generation model are trained so as to optimize the value of the loss function.
  • A model learning device according to a third aspect of the present disclosure is the model learning device of the first or second aspect in which the learning unit trains a discriminator, which discriminates whether or not an image is one generated by a generator, so as to identify the learning image as not being an image generated by the generator and to identify the image generated by the (L+1)-th generator, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, as an image generated by the generator; and trains the image generation model so that the discriminator identifies the image generated by the (L+1)-th generator, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, as not being an image generated by the generator.
  • A model learning device according to a fourth aspect of the present disclosure is the model learning device of any one of the first to third aspects in which, in one division process, the dividing unit divides an image into two equal parts in the vertical direction and two equal parts in the horizontal direction.
  • A model learning method according to a fifth aspect of the present disclosure is a model learning method for learning an image generation model that generates a realistic image from a semantic layout image, and includes: a step in which a pair of image data representing a learning image and image data representing a semantic layout image corresponding to the learning image is input to an input unit as teacher data; a step in which a dividing unit performs, L times, a division process that divides each of the learning image represented by the input image data and the semantic layout image represented by its image data into a plurality of regions, and outputs image data corresponding to the divided regions each time the division process is performed; a step in which a reduction unit reduces the semantic layout image input to the input unit and the images of the plurality of regions to a predetermined size; and a step in which, for an image generation model comprising L+1 generators that are applied in order to an input image and generate upsampled images, the input size of the first generator being the predetermined size, a learning unit trains the image generation model such that, when an image obtained by reducing to the predetermined size each of the plurality of regions obtained by dividing the semantic layout image by L-l division processes (0 ≤ l < L) is input to the first generator, the (l+1)-th generator generates an image corresponding to each of the plurality of regions obtained by dividing the learning image by L-l division processes, and such that, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, the (L+1)-th generator generates an image corresponding to the learning image.
  • A program according to a sixth aspect of the present disclosure is a program for causing a computer to function as each unit of the model learning device according to any one of the first to fourth aspects.
  • In this embodiment, as an example, the size of the image used for learning (hereinafter, the "learning image") and the size of the image showing a semantic layout (hereinafter, the "semantic layout image") are each 128 pixels vertically and horizontally. The size and aspect ratio of each image can be set to arbitrary values by changing the network configuration described later.
  • FIG. 1 is a block diagram showing the configuration of an example of a model learning device 10 of this exemplary embodiment.
  • the model learning device 10 of this embodiment includes an input unit 20, a dividing unit 22, an image data storage unit 24, a reducing unit 26, a learning unit 28, and an output unit 30.
  • As an example, the model learning device 10 can be configured as a computer including a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a program for executing the model learning processing routine described below and various data.
  • Specifically, the CPU executing the program functions as the input unit 20, the dividing unit 22, the reduction unit 26, the learning unit 28, and the output unit 30 of the model learning device 10 shown in FIG. 1.
  • The teacher data 1 is input to the input unit 20. In this embodiment, the teacher data 1 is prepared as a pair (combination) of a learning image I_0, which is a realistic image, and a semantic layout image S_0 representing the semantic layout obtained by performing semantic segmentation on the learning image I_0. For simplicity of explanation, the case of a single pair of teacher data 1, that is, only one learning image I_0 and one semantic layout image S_0, is described; the method can easily be applied when there are multiple pairs.
  • Several kinds of semantic layout images appear below, such as the semantic layout image S_0 that is the teacher data 1 and the semantic layout images S_1^1 to S_4^1 generated from it (described in detail later). When these are referred to collectively without distinction, the suffixes are omitted and they are simply called the "semantic layout image S". Likewise, when the learning image I_0 and the images derived from it are referred to collectively, they are simply called the "learning image I".
  • Specifically, image data representing the learning image I_0 and image data representing the semantic layout image S_0 are input to the input unit 20 as the teacher data 1, and are then output to the dividing unit 22.
  • The dividing unit 22 receives the image data representing the learning image I_0 and the image data representing the semantic layout image S_0, and pseudo-increases the teacher data by performing, L times (L is an integer of 1 or more), a division process that divides each of the learning image I_0 and the semantic layout image S_0 into a plurality of regions, equally in the vertical and horizontal directions.
  • As an example, the dividing unit 22 of this embodiment divides the learning image I_0 and the semantic layout image S_0 in a pyramid fashion, as in Reference 1. In this embodiment, L = 2, that is, the dividing unit 22 performs the division process twice. By changing the network configuration, the division process can be repeated until the size after division reaches 2 × 2 pixels; the more division processes are performed, the more accurate the details of the images generated by the image generation model 32 become.
  • First, the dividing unit 22 performs one division process on the 128 × 128-pixel semantic layout image S_0, generating four 64 × 64-pixel semantic layout images (S_1^1 to S_4^1, see FIG. 2). Image data representing each of the generated semantic layout images S_1^1 to S_4^1 is output from the dividing unit 22 and stored in the image data storage unit 24.
  • In the suffixes attached to "S", the superscript indicates the number of division processes, and the subscript indicates the index within the group of semantic layout images S obtained simultaneously by that division.
  • The dividing unit 22 of this embodiment then performs the division process on each of the semantic layout images S_1^1 to S_4^1 in the same way, generating sixteen 32 × 32-pixel semantic layout images (S_1^2 to S_16^2, see FIG. 2). Image data representing each of the generated semantic layout images S_1^2 to S_16^2 is output from the dividing unit 22 and stored in the image data storage unit 24.
  • The dividing unit 22 also performs the same division processes on the learning image I_0, generating the learning images I_1^1 to I_4^1 and I_1^2 to I_16^2 (see FIG. 2). Image data representing each of these learning images is output from the dividing unit 22 and stored in the image data storage unit 24. The dividing unit 22 further outputs the image data representing the learning image I_0 and the image data representing the semantic layout image S_0, which are likewise stored in the image data storage unit 24.
  • The image data of each of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2 is input to the reduction unit 26 from the image data storage unit 24, and the reduction unit 26 reduces each of them in size. The reduced size is arbitrary, but a guideline is half the height and width, in other words one quarter of the area, of the semantic layout images S obtained by the L division processes of the dividing unit 22. In this embodiment, since the 32 × 32-pixel semantic layout images S_1^2 to S_16^2 are the smallest, all semantic layout images S are reduced to 16 × 16 pixels.
  • The learning unit 28 receives from the reduction unit 26 the reduced image data of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2, and from the image data storage unit 24 the image data representing each of the learning images I_1^1 to I_4^1 and I_1^2 to I_16^2. The learning unit 28 includes an image generation model 32 and a loss function derivation unit 34, and trains the image generation model 32 to generate the corresponding learning image I_n^l from each semantic layout image S_n^l. In the suffixes of S and I, the superscript l indicates the number of division processes (0 to 2 in this embodiment) and the subscript n indicates the index of the divided region (1 to 16 in this embodiment).
  • The image generation model 32 comprises L+1 = 3 generators (G_2, G_1, G_0) that are applied in order to the input image. The first generator G_2 receives the image data of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2 reduced by the reduction unit 26, upsamples each image to twice its height and width (in other words, four times its size) while generating from the semantic layout image S an image corresponding to the learning image I, and outputs image data representing the generated image as the output G_2(S_n^l) to the second generator G_1. The second generator G_1 generates an image obtained by upsampling the image represented by each output G_2(S_n^l) to twice its height and width, and outputs image data representing the generated image as the output G_1(G_2(S_n^l)) to the third generator G_0. The third generator G_0 likewise generates an image obtained by upsampling each image represented by the outputs G_1(G_2(S_n^l)) to twice its height and width, and outputs image data representing the generated image as the output G_0(G_1(G_2(S_n^l))). The image represented by the final output G_0(G_1(G_2(S_n^l))) of the third generator G_0 has the same size as the learning image I_0 (128 × 128 pixels).
  • Each generator G (G_2, G_1, G_0 in FIG. 2) is a group of layers that performs arbitrary upsampling. For example, as shown in Non-Patent Document 1, it may be a 3 × 3 convolution layer, a normalization layer, an LReLU layer, and a bilinear upsampling layer, or the upsampling may be performed with a deconvolution layer.
  • The loss function derivation unit 34 derives the loss function of the image generation model 32. The loss function is as follows. For the output of each generator G, a reconstruction error with respect to the corresponding learning image I_n^l is used to drive the conversion from the semantic layout image to a realistic image. The reconstruction error computes the pixel-wise agreement between the two images, yielding a loss function whose value is high when the images do not agree, that is, when the image could not be generated accurately.
  • Specifically, a final loss function is derived that includes: a loss function for the first generator G_2, containing the reconstruction error between the image generated by the first generator G_2 when the images obtained by reducing to the predetermined size each of the plurality of regions produced by dividing the semantic layout image S by two division processes are input to it, and each of the plurality of regions produced by dividing the learning image I by two division processes; a loss function for the second generator G_1, containing the reconstruction error between the image generated by the second generator G_1 when the images obtained by reducing to the predetermined size each of the plurality of regions produced by dividing the semantic layout image S by one division process are input to the first generator G_2, and each of the plurality of regions produced by dividing the learning image I by one division process; and a loss function for the third generator G_0, containing the reconstruction error between the image generated by the third generator G_0 when the image obtained by reducing the semantic layout image S to the predetermined size is input to the first generator G_2, and the learning image I. The network parameters are thereby updated so as to generate images more accurately.
  • The final loss function Loss_generator of the image generation model 32 shown in FIG. 2 is defined by equation (4) below; Loss_generator is the loss function of the entire image generation model 32. In equation (4), α, β, and γ are the weights of the respective loss functions and can be set arbitrarily. For example, as shown in Reference 1, the weights may be set according to the number of division processes: if the maximum number of division processes is L, the weight of each loss function may be 1/2^(L-l). The loss function derivation unit 34 outputs the loss function Loss_generator derived based on equations (1) to (4).
  • The learning unit 28 trains the image generation model 32 so as to optimize the loss function Loss_generator output by the loss function derivation unit 34. As a result, the image generated by the image generation model from the semantic layout image S_0 approaches the learning image I_0.
  • A discriminator D as shown in FIG. 3 may be added, and adversarial training of the image generation model 32 and the discriminator D may be performed as shown in Reference 2. The image generation model 32 described so far simultaneously performs optimization for each divided region and optimization of the whole pre-division image; when the per-region optimization is excessive, unnaturalness may appear at the seams between regions. Adversarial training, which imposes the constraint that the whole image look natural, can reduce this unnaturalness.
  • The discriminator D receives the output G_0(G_1(G_2(S_0))) of the generator chain for the semantic layout image S_0 and the image data representing the learning image I_0 as inputs, and discriminates whether each of them is an image generated by the generator G.
  • The learning unit 28 trains the discriminator D so as to identify the learning image I_0 as not being an image generated by the generator G, and to identify the image generated by the third generator G_0, when the image obtained by reducing the semantic layout image S_0 by the reduction unit 26 is input to the first generator G_2, as an image generated by the generator G. The discriminator D is trained so as to optimize its loss function.
  • The loss function of the discriminator D is defined by equation (5). In turn, the learning unit 28 trains the image generation model 32 so that the discriminator D identifies the image generated by the third generator G_0, when the image obtained by reducing the semantic layout image S_0 by the reduction unit 26 is input to the first generator G_2, as not being an image generated by the generator G. The loss function of the generator G is defined by equation (6).
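  • Equations (5) and (6) appear only as images in the published document. Under the standard adversarial formulation that the surrounding text describes (D is trained to label I_0 as real and the generated image as fake, while G is trained so that D labels the generated image as real), they would take a form such as the following; this is a hedged reconstruction, and the patent's exact equations may differ:

    Loss_D = -log D(I_0) - log(1 - D(G_0(G_1(G_2(S_0)))))    ... (5)
    Loss_G_adv = -log D(G_0(G_1(G_2(S_0))))                  ... (6)

    where D(x) is the discriminator's estimated probability that x is a real (non-generated) image.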
  • The image generation model 32 trained by the learning unit 28 is output from the output unit 30 to the outside of the model learning device 10. When image data representing an image obtained by reducing a semantic layout image S is input to the trained image generation model 32, the model generates a realistic image and outputs image data representing it. For example, when the image data of the semantic layout image S_0 reduced to 16 × 16 pixels is input, image data representing an image equivalent to the learning image I_0 is output.
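  • As an illustration of this inference path, the following is a minimal sketch in PyTorch-style Python (hypothetical code: the module names g2, g1, g0 and the use of torch are assumptions, not part of the disclosure):

    import torch
    import torch.nn.functional as F

    def generate_image(semantic_layout, g2, g1, g0):
        """Generate a 128x128 realistic image from a (1, C, 128, 128) semantic layout."""
        # Reduction unit: shrink the layout to the fixed 16x16 input size of G_2.
        x = F.interpolate(semantic_layout, size=(16, 16), mode="nearest")
        x = g2(x)  # 16x16 -> 32x32
        x = g1(x)  # 32x32 -> 64x64
        x = g0(x)  # 64x64 -> 128x128, the same size as the learning image I_0
        return x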
  • FIG. 4 is a flowchart showing an example of the model learning processing routine executed by the model learning device 10 of this embodiment. The routine is run at an arbitrary timing, such as when the teacher data 1 is input to the input unit 20 or when an execution instruction for the routine is received from outside the model learning device 10.
  • In step S100 of FIG. 4, as described above, the dividing unit 22 pseudo-increases the teacher data by performing the division process that divides each of the learning image I_0 and the semantic layout image S_0 into a plurality of regions, equally in the vertical and horizontal directions. Image data representing the generated semantic layout images S_n^l and learning images I_n^l is stored in the image data storage unit 24 for each division process. Next, the reduction unit 26 receives the image data of each of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2 from the image data storage unit 24 as described above, and reduces the size of each of them.
  • Next, the learning unit 28 trains the image generation model 32 as described above. Specifically, the learning unit 28 inputs the image data representing each of the reduced semantic layout images S_n^l received from the reduction unit 26 to the generator G_2, and the outputs are applied in turn to the generator G_1 and the generator G_0. The loss function derivation unit 34 computes the loss function Loss_l for the images generated by each of the generators G_2, G_1, and G_0, and derives the loss function Loss_generator of the entire image generation model 32. The learning unit 28 then trains the image generation model 32 so as to optimize Loss_generator.
  • Finally, the trained image generation model 32 is output from the output unit 30 to the outside of the model learning device 10, and the model learning processing routine ends. A compact sketch of the whole routine in code follows this item.
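  • The following is a minimal end-to-end sketch of one iteration of this routine (hypothetical Python/PyTorch code, assuming L = 2, an L1 pixel-wise reconstruction error, image-like generator outputs with matching channels, and the 1/2^(L-l) weighting; none of these specifics are fixed by the disclosure):

    import torch
    import torch.nn.functional as F

    def split4(img):
        """One division process: the four equal quadrants of an (N, C, H, W) tensor."""
        h, w = img.shape[-2:]
        return [img[..., i:i + h // 2, j:j + w // 2]
                for i in (0, h // 2) for j in (0, w // 2)]

    def train_step(I0, S0, g2, g1, g0, optimizer):
        """One iteration: divide (step S100), reduce, generate, and optimize.
        I0 and S0 are (1, C, 128, 128) tensors."""
        S = {0: [S0], 1: split4(S0)}
        S[2] = [q for s in S[1] for q in split4(s)]
        I = {0: [I0], 1: split4(I0)}
        I[2] = [q for t in I[1] for q in split4(t)]
        # The (l+1)-th generator's output is compared with the regions from L-l divisions.
        chains = {2: lambda x: g2(x),
                  1: lambda x: g1(g2(x)),
                  0: lambda x: g0(g1(g2(x)))}
        loss = 0.0
        for l in (2, 1, 0):
            weight = 0.5 ** (2 - l)  # the 1/2^(L-l) weighting with L = 2
            for s, t in zip(S[l], I[l]):
                x = F.interpolate(s, size=(16, 16), mode="nearest")  # reduction unit
                loss = loss + weight * F.l1_loss(chains[l](x), t)    # reconstruction error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return float(loss)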
  • As described above, the model learning device 10 of this embodiment is a model learning device for learning the image generation model 32 that generates a realistic image from a semantic layout image S, and includes the input unit 20, the dividing unit 22, the reduction unit 26, and the learning unit 28. A pair of image data representing the learning image I_0 and image data representing the semantic layout image S_0 corresponding to the learning image I_0 is input to the input unit 20 as the teacher data 1.
  • The dividing unit 22 performs, L times, the division process that divides each of the learning image I_0 represented by the input image data and the semantic layout image S_0 represented by its image data into a plurality of regions, and outputs image data corresponding to the divided regions each time the division process is performed. The reduction unit 26 reduces the semantic layout image S_0 input to the input unit 20 and the images of the plurality of regions to a predetermined size.
  • For the image generation model 32, which comprises L+1 generators G that are applied in order to an input image and generate upsampled images, the input size of the first generator G_2 being the predetermined size, the learning unit 28 trains the model such that, when the images obtained by reducing to the predetermined size each of the plurality of regions produced by dividing the semantic layout image S_0 by L-l (0 ≤ l < L) division processes are input to the first generator G_2, the (l+1)-th generator G_{l+1} generates images corresponding to each of the plurality of regions produced by dividing the learning image I_0 by L-l division processes, and such that, when the image obtained by reducing the semantic layout image S_0 to the predetermined size is input to the first generator G_2, the (L+1)-th generator G_{L+1} generates an image corresponding to the learning image I_0.
  • In this way, the model learning device 10 of this embodiment changes how the image generated by the generators G is optimized according to the number L of division processes applied to each region of the input semantic layout image S. Therefore, the model learning device 10 of this embodiment can learn an image generation model 32 that generates images that are accurate down to their details.
  • In the embodiment above, the method of dividing the semantic layout image S_0 was described as pyramid-shaped, but the division method is not limited to that form. In pyramid-shaped division, a region is often set so that it straddles several objects, in which case the consistency of the objects is not guaranteed. Therefore, for example, a division method that takes objects into account may be added to the pyramid-shaped division; in this case, the above problem can be mitigated.
  • FIG. 5 is a schematic diagram of an image division method that pays attention to objects. First, candidate regions of the same sizes as those of the pyramid-shaped division are set at random (solid-line frames in the figure); in this example, a 64 × 64-pixel candidate region and a 32 × 32-pixel candidate region were set. The true region of each label is also set (dotted-line frames in the figure), and the IoU (Intersection over Union) between the candidate regions and the true regions is used to select object-aware regions, as sketched below.
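  • The following is a minimal sketch of the IoU computation referred to above (plain Python; the box representation is an assumption):

    def iou(box_a, box_b):
        """Intersection over Union of two axis-aligned boxes (x0, y0, x1, y1)."""
        ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
                 + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
        return inter / union if union > 0 else 0.0

    # A randomly placed 64x64 candidate region against a true label region:
    print(iou((0, 0, 64, 64), (32, 32, 96, 96)))  # 0.142857...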
  • In the above, the mode in which the program is pre-installed was described, but the program may also be stored in a computer-readable recording medium and provided, or provided via a network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present invention makes it possible to learn an image generation model which can generate an image that is accurate down to its details. A partitioning unit 22 outputs image data corresponding to a plurality of partitioned regions for each of L instances of partitioning processing performed on a learning image I0 and a semantic layout image S0. A reduction unit 26 reduces the semantic layout image S0 and the plurality of region images to a prescribed size. A learning unit 28 learns an image generation model 32 such that: upon the input to a first generator G2 of the images obtained by reducing each of the plurality of regions obtained by partitioning the semantic layout image S0 by L-l instances (0 ≤ l < L) of partitioning processing, an (l+1)-th generator G_{l+1} generates an image corresponding to each of the plurality of regions obtained by partitioning the learning image I0 by L-l instances of partitioning processing; and upon the input to the first generator G2 of an image obtained by reducing the semantic layout image S0, an (L+1)-th generator G_{L+1} generates an image corresponding to the learning image I0.

Description

Model learning device, model learning method, and program
The present disclosure relates to a model learning device, a model learning method, and a program.
In machine learning for image processing, such as object recognition, object detection, and semantic segmentation, an enormous number of image/teacher-data pairs is required for learning. In semantic segmentation in particular, a label to be recognized must be assigned to every pixel of an image, so preparing a sufficient amount of teacher data for learning is far more costly than for object recognition or object detection.
Therefore, there is a technique that generates a realistic image, close to an actual photograph, from existing or arbitrarily created simple semantic segmentation teacher data (hereinafter, a "semantic layout") (Non-Patent Document 1). According to this technique, the cost of the above-mentioned teacher data can be reduced by generating various images from a single semantic layout.
In the prior art typified by Non-Patent Document 1, an image that looks good under subjective evaluation of the whole can be generated; in the details of the generated image, however, there is a problem that distortion may occur or object contours may become discontinuous.
When an image generated in this way is used as learning data, the accuracy of the learned model decreases.
The present disclosure has been made in view of the above points, and aims to provide a model learning device, a model learning method, and a program capable of learning an image generation model that generates an image that is accurate down to its details.
In order to achieve the above object, the model learning device according to the first aspect of the present disclosure is a model learning device for learning an image generation model that generates a realistic image from a semantic layout image, and includes: an input unit to which a pair of image data representing a learning image and image data representing a semantic layout image corresponding to the learning image is input as teacher data; a dividing unit that performs, L times, a division process that divides each of the learning image represented by the input image data and the semantic layout image represented by its image data into a plurality of regions, and outputs image data corresponding to the divided regions each time the division process is performed; a reduction unit that reduces the semantic layout image input to the input unit and the images of the plurality of regions to a predetermined size; and a learning unit that, for an image generation model comprising L+1 generators that are applied in order to an input image and generate upsampled images, the input size of the first generator being the predetermined size, trains the image generation model such that, when an image obtained by reducing to the predetermined size each of the plurality of regions obtained by dividing the semantic layout image by L-l division processes (0 ≤ l < L) is input to the first generator, the (l+1)-th generator generates an image corresponding to each of the plurality of regions obtained by dividing the learning image by L-l division processes, and such that, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, the (L+1)-th generator generates an image corresponding to the learning image.
The model learning device according to the second aspect of the present disclosure is the model learning device of the first aspect in which the learning unit includes a loss function derivation unit that derives a loss function including: the error between the image generated by the (l+1)-th generator when an image obtained by reducing to the predetermined size each of the plurality of regions obtained by dividing the semantic layout image by L-l division processes (0 ≤ l < L) is input to the first generator, and each of the plurality of regions obtained by dividing the learning image by L-l division processes; and the error between the image generated by the (L+1)-th generator when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, and the learning image; and in which the L+1 generators of the image generation model are trained so as to optimize the value of the loss function.
The model learning device according to the third aspect of the present disclosure is the model learning device of the first or second aspect in which the learning unit trains a discriminator, which discriminates whether or not an image is one generated by a generator, so as to identify the learning image as not being an image generated by the generator and to identify the image generated by the (L+1)-th generator, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, as an image generated by the generator; and trains the image generation model so that the discriminator identifies the image generated by the (L+1)-th generator, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, as not being an image generated by the generator.
The model learning device according to the fourth aspect of the present disclosure is the model learning device of any one of the first to third aspects in which, in one division process, the dividing unit divides an image into two equal parts in the vertical direction and two equal parts in the horizontal direction.
In order to achieve the above object, the model learning method according to the fifth aspect of the present disclosure is a model learning method for learning an image generation model that generates a realistic image from a semantic layout image, and includes: a step in which a pair of image data representing a learning image and image data representing a semantic layout image corresponding to the learning image is input to an input unit as teacher data; a step in which a dividing unit performs, L times, a division process that divides each of the learning image represented by the input image data and the semantic layout image represented by its image data into a plurality of regions, and outputs image data corresponding to the divided regions each time the division process is performed; a step in which a reduction unit reduces the semantic layout image input to the input unit and the images of the plurality of regions to a predetermined size; and a step in which, for an image generation model comprising L+1 generators that are applied in order to an input image and generate upsampled images, the input size of the first generator being the predetermined size, a learning unit trains the image generation model such that, when an image obtained by reducing to the predetermined size each of the plurality of regions obtained by dividing the semantic layout image by L-l division processes (0 ≤ l < L) is input to the first generator, the (l+1)-th generator generates an image corresponding to each of the plurality of regions obtained by dividing the learning image by L-l division processes, and such that, when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, the (L+1)-th generator generates an image corresponding to the learning image.
In order to achieve the above object, the program according to the sixth aspect of the present disclosure is a program for causing a computer to function as each unit of the model learning device according to any one of the first to fourth aspects.
According to the present disclosure, it is possible to learn an image generation model that generates an image that is accurate down to its details.
FIG. 1 is a block diagram showing the configuration of an example of the model learning device of the embodiment. FIG. 2 is a diagram showing an example of the image generation model of the embodiment. FIG. 3 is a diagram showing an example of an image generation model provided with a discriminator. FIG. 4 is a flowchart showing an example of the model learning processing routine in the model learning device of the embodiment. FIG. 5 is a diagram explaining an example of another image division method.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the drawings. As an example, in this embodiment the size of the image used for learning (hereinafter, the "learning image") and the size of the image showing a semantic layout (hereinafter, the "semantic layout image") are each 128 pixels vertically and horizontally. The size and aspect ratio of each image can be set to arbitrary values by changing the network configuration described later.
<Configuration of the model learning device of this embodiment>
FIG. 1 is a block diagram showing the configuration of an example of the model learning device 10 of this embodiment. As shown in FIG. 1, the model learning device 10 includes an input unit 20, a dividing unit 22, an image data storage unit 24, a reduction unit 26, a learning unit 28, and an output unit 30.
As an example, the model learning device 10 of this embodiment can be configured as a computer including a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a program for executing the model learning processing routine described below and various data. Specifically, the CPU executing the program functions as the input unit 20, the dividing unit 22, the reduction unit 26, the learning unit 28, and the output unit 30 of the model learning device 10 shown in FIG. 1.
As shown in FIG. 1, the teacher data 1 is input to the input unit 20. In this embodiment, the teacher data 1 is prepared as a pair (combination) of a learning image I_0, which is a realistic image, and a semantic layout image S_0 representing the semantic layout obtained by performing semantic segmentation on the learning image I_0. For simplicity of explanation, the case of a single pair of teacher data 1, that is, only one learning image I_0 and one semantic layout image S_0, is described; the method can easily be applied when there are multiple pairs. Several kinds of semantic layout images appear below, such as the semantic layout image S_0 that is the teacher data 1 and the semantic layout images S_1^1 to S_4^1 generated from it (described in detail later); when these are referred to collectively without distinction, the suffixes are omitted and they are simply called the "semantic layout image S". Likewise, the learning image I_0 and the images derived from it are collectively called the "learning image I".
Specifically, image data representing the learning image I_0 and image data representing the semantic layout image S_0 are input to the input unit 20 as the teacher data 1, and are then output to the dividing unit 22.
The dividing unit 22 receives the image data representing the learning image I_0 and the image data representing the semantic layout image S_0, and pseudo-increases the teacher data by performing, L times (L is an integer of 1 or more), a division process that divides each of the learning image I_0 and the semantic layout image S_0 into a plurality of regions, equally in the vertical and horizontal directions.
As an example, the dividing unit 22 of this embodiment divides the learning image I_0 and the semantic layout image S_0 in a pyramid fashion, as in Reference 1.
[Reference 1] Svetlana Lazebnik et al., "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", <URL: http://mplab.ucsd.edu/~marni/Igert/Lazebnik_06.pdf>
As an example, this embodiment sets L = 2, and the case where the dividing unit 22 performs the division process twice is described. By changing the network configuration, the division process can be repeated until the size after division reaches 2 × 2 pixels. The more division processes are performed, the more accurate the details of the images generated by the image generation model 32 become.
First, the dividing unit 22 performs one division process on the 128 × 128-pixel semantic layout image S_0, generating four 64 × 64-pixel semantic layout images (S_1^1 to S_4^1, see FIG. 2). Image data representing each of the generated semantic layout images S_1^1 to S_4^1 is output from the dividing unit 22 and stored in the image data storage unit 24.
In the suffixes attached to "S", the superscript indicates the number of division processes, and the subscript indicates the index within the group of semantic layout images S obtained simultaneously by that division.
The dividing unit 22 of this embodiment then performs the division process on each of the semantic layout images S_1^1 to S_4^1 in the same way, generating sixteen 32 × 32-pixel semantic layout images (S_1^2 to S_16^2, see FIG. 2), whose image data is likewise output from the dividing unit 22 and stored in the image data storage unit 24.
The dividing unit 22 also performs the same division processes on the learning image I_0, generating the learning images I_1^1 to I_4^1 and I_1^2 to I_16^2 (see FIG. 2); image data representing each of these is output from the dividing unit 22 and stored in the image data storage unit 24.
In the model learning device 10 of this embodiment, the dividing unit 22 further outputs the image data representing the learning image I_0 and the image data representing the semantic layout image S_0, which are likewise stored in the image data storage unit 24.
The image data of each of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2 is input to the reduction unit 26 from the image data storage unit 24, and the reduction unit 26 reduces each of them in size. The reduced size is arbitrary, but a guideline is half the height and width, in other words one quarter of the area, of the semantic layout images S obtained by the L division processes of the dividing unit 22. As described above, in this embodiment the 32 × 32-pixel semantic layout images S_1^2 to S_16^2 are the smallest, so all semantic layout images S are reduced to 16 × 16 pixels.
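The division and reduction steps can be illustrated with the following minimal sketch (hypothetical PyTorch code; the patent does not prescribe an implementation):

    import torch
    import torch.nn.functional as F

    def split4(img):
        """One division process: the four equal quadrants of a (C, H, W) tensor."""
        c, h, w = img.shape
        return [img[:, i:i + h // 2, j:j + w // 2]
                for i in (0, h // 2) for j in (0, w // 2)]

    S0 = torch.randn(3, 128, 128)                      # stand-in for the 128x128 layout S_0
    level1 = split4(S0)                                # four 64x64 images S_1^1 .. S_4^1
    level2 = [q for s in level1 for q in split4(s)]    # sixteen 32x32 images S_1^2 .. S_16^2
    # Reduction unit 26: every layout, whatever its size, is reduced to 16x16.
    reduced = [F.interpolate(s.unsqueeze(0), size=(16, 16), mode="nearest")
               for s in [S0] + level1 + level2]
    print(len(level1), len(level2), reduced[0].shape)  # 4 16 torch.Size([1, 3, 16, 16])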
The learning unit 28 receives from the reduction unit 26 the reduced image data of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2, and from the image data storage unit 24 the image data representing each of the learning images I_1^1 to I_4^1 and I_1^2 to I_16^2. The learning unit 28 includes an image generation model 32 and a loss function derivation unit 34.
The learning unit 28 trains the image generation model 32 to generate the corresponding learning image I_n^l from each semantic layout image S_n^l. In the suffixes of S and I, the superscript l indicates the number of division processes (0 to 2 in this embodiment) and the subscript n indicates the index of the divided region (1 to 16 in this embodiment).
The image generation model 32 comprises L+1 generators that are applied in order to the input image and generate upsampled images, the input size of the first generator being the predetermined size. For example, with L = 2, the image generation model 32 of this embodiment includes three generators G (G_2, G_1, G_0), as shown in FIG. 2. The first generator G_2 receives from the reduction unit 26 the image data of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2 reduced in size.
The first generator G_2 upsamples each reduced semantic layout image to twice its height and width, in other words four times its size, while generating from the semantic layout image S an image corresponding to the learning image I, and outputs image data representing the generated image as the output G_2(S_n^l) to the generator G_1.
The output G_2(S_n^l) of the first generator G_2 is input to the second generator G_1. The second generator G_1 generates an image obtained by upsampling the image represented by each output G_2(S_n^l) to twice its height and width, in other words four times its size, and outputs image data representing the generated image as the output G_1(G_2(S_n^l)) to the third generator G_0.
The output G_1(G_2(S_n^l)) of the second generator G_1 is input to the third generator G_0. The third generator G_0 generates an image obtained by upsampling the image represented by each output G_1(G_2(S_n^l)) to twice its height and width, in other words four times its size, and outputs image data representing the generated image as the output G_0(G_1(G_2(S_n^l))). The image represented by the final output G_0(G_1(G_2(S_n^l))) of the third generator G_0 has the same size as the learning image I_0 (128 × 128 pixels).
Each generator G (G_2, G_1, G_0 in FIG. 2) is a group of layers that performs arbitrary upsampling. For example, as shown in Non-Patent Document 1, it may be a 3 × 3 convolution layer, a normalization layer, an LReLU layer, and a bilinear upsampling layer, or the upsampling may be performed with a deconvolution layer.
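A minimal sketch of such a layer group follows (hypothetical PyTorch code; the channel sizes and the choice of InstanceNorm are assumptions, since the patent leaves the layers arbitrary):

    import torch
    import torch.nn as nn

    class UpBlock(nn.Module):
        """One generator G: 3x3 convolution -> normalization -> LReLU -> bilinear 2x upsampling."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.InstanceNorm2d(out_ch),
                nn.LeakyReLU(0.2),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            )

        def forward(self, x):
            return self.body(x)

    # Chain of L+1 = 3 generators: 16x16 -> 32x32 -> 64x64 -> 128x128.
    g2, g1, g0 = UpBlock(3, 64), UpBlock(64, 64), UpBlock(64, 3)
    y = g0(g1(g2(torch.randn(1, 3, 16, 16))))
    print(y.shape)  # torch.Size([1, 3, 128, 128])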
Meanwhile, the loss function derivation unit 34 derives the loss function of the image generation model 32. The loss function is as follows.
For the output of each generator G, a reconstruction error with respect to the corresponding learning image I_n^l is used to drive the conversion from the semantic layout image to a realistic image. The reconstruction error computes the pixel-wise agreement between the two images, yielding a loss function whose value is high when the images do not agree, that is, when the image could not be generated accurately.
Specifically, a final loss function is derived that includes: a loss function for the first generator G_2, containing the reconstruction error between the image generated by the first generator G_2 when the images obtained by reducing to the predetermined size each of the plurality of regions produced by dividing the semantic layout image S by two division processes are input to it, and each of the plurality of regions produced by dividing the learning image I by two division processes; a loss function for the second generator G_1, containing the reconstruction error between the image generated by the second generator G_1 when the images obtained by reducing to the predetermined size each of the plurality of regions produced by dividing the semantic layout image S by one division process are input to the first generator G_2, and each of the plurality of regions produced by dividing the learning image I by one division process; and a loss function for the third generator G_0, containing the reconstruction error between the image generated by the third generator G_0 when the image obtained by reducing the semantic layout image S to the predetermined size is input to the first generator G_2, and the learning image I. The network parameters are thereby updated so as to generate images more accurately.
For example, the loss function Loss_{l=2} for the first generator G_2 is defined by equation (1) below, where d(·, ·) denotes the pixel-wise reconstruction error described above:

$$\mathrm{Loss}_{l=2} = \sum_{n=1}^{N_2} d\left( G_2(S_n^2),\ I_n^2 \right) \qquad (1)$$
N_2 in equation (1) above is the number of images obtained as a result of division when the division process has been performed twice, that is, N_2 = 16 here. The loss function Loss_{l=2} is calculated only when image data representing a semantic layout image S_n^2 that has undergone two division processes is input; it is not calculated when image data representing a semantic layout image S with a different number of division processes is input.
Similarly, the loss function Loss_{l=1} for the second generator G_1 is defined by equation (2) below:

$$\mathrm{Loss}_{l=1} = \sum_{n=1}^{N_1} d\left( G_1(G_2(S_n^1)),\ I_n^1 \right) \qquad (2)$$
N_1 in equation (2) above is the number of images obtained as a result of division when the division process has been performed once, that is, N_1 = 4 here. The loss function Loss_{l=1} is calculated only when image data representing a semantic layout image S_n^1 that has undergone one division process is input; it is not calculated when image data representing a semantic layout image S with a different number of division processes is input.
The loss function Loss_{l=0} for the third generator G_0 is defined by equation (3) below:

$$\mathrm{Loss}_{l=0} = d\left( G_0(G_1(G_2(S_0))),\ I_0 \right) \qquad (3)$$
Accordingly, the final loss function Loss_generator of the image generation model 32 shown in Fig. 2 is defined by equation (4) below. The loss function Loss_generator is the loss function of the entire image generation model 32:

$$\mathrm{Loss}_{generator} = \alpha \, \mathrm{Loss}_{l=2} + \beta \, \mathrm{Loss}_{l=1} + \gamma \, \mathrm{Loss}_{l=0} \qquad (4)$$
In equation (4) above, α, β, and γ are the weights of the respective loss functions and can be set arbitrarily. For example, as shown in Reference 1 above, the weights may be set according to the number of division processes; if the maximum number of division processes is L, the weight of each loss function may be set to 1/2^{L−l}.
The loss function derivation unit 34 outputs the loss function Loss_generator derived on the basis of equations (1) to (4) above.
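As an illustration only, the following is a minimal sketch of how Loss_generator of equation (4) could be computed, assuming PyTorch, an L1 norm as the pixel-wise reconstruction error d(·, ·), and the weights 1/2^{L−l} with L = 2; the function and variable names are not from this embodiment.

```python
import torch
import torch.nn.functional as F

def loss_generator(pyramid):
    """Compute Loss_generator of equation (4) over the three scales l = 2, 1, 0.

    pyramid: list of (generated, target) pairs ordered l = 2, 1, 0, where the
             N_l regions of each scale are stacked along the batch dimension.
    """
    weights = (1.0, 0.5, 0.25)  # 1/2**(L-l) with L = 2: l=2 -> 1, l=1 -> 1/2, l=0 -> 1/4
    total = torch.zeros(())
    for (generated, target), w in zip(pyramid, weights):
        # Pixel-wise reconstruction error d(., .); L1 is one concrete choice.
        total = total + w * F.l1_loss(generated, target)
    return total

# Example with L = 2: 16 regions of 32x32, 4 regions of 64x64, 1 image of 128x128.
pyramid = [
    (torch.randn(16, 3, 32, 32), torch.randn(16, 3, 32, 32)),    # Loss_{l=2}
    (torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)),      # Loss_{l=1}
    (torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128)),  # Loss_{l=0}
]
loss = loss_generator(pyramid)
```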
The learning unit 28 performs learning of the image generation model 32 so as to optimize the loss function Loss_generator output by the loss function derivation unit 34. As a result, the image generated by the image generation model from the semantic layout image S_0 approaches the learning image I_0.
Note that a discriminator D as shown in Fig. 3 may be added, and adversarial learning between the image generation model 32 and the discriminator D may be performed as shown in Reference 2. The image generation model 32 described so far simultaneously performs optimization for each divided region and optimization of the whole image before division; however, when the per-region optimization is excessive, unnaturalness may appear at the seams between regions. Adversarial learning, which imposes the constraint that the whole image look natural, can therefore reduce this unnaturalness.
The discriminator D receives, as inputs, the output G_0(G_1(G_2(S_0))) of the generators for the semantic layout image S_0 and image data representing the learning image I_0, and discriminates whether or not each is an image generated by the generator G.
The learning unit 28 trains the discriminator D to discriminate that the learning image I_0 is not an image generated by the generator G, and that the image generated by the third generator G_0 when the image obtained by reducing the semantic layout image S_0 by the reduction unit 26 is input to the first generator G_2 is an image generated by the generator G. Specifically, the discriminator D is trained so as to optimize the loss function of the discriminator D.
For example, the loss function of the discriminator D is defined by equation (5) below, written here in the standard adversarial form of Reference 2, with the expectation taken over the teacher data:

$$\mathrm{Loss}_{discriminator} = -\mathbb{E}\left[ \log D(I_0) \right] - \mathbb{E}\left[ \log \left( 1 - D(G_0(G_1(G_2(S_0)))) \right) \right] \qquad (5)$$
Further, the learning unit 28 performs learning of the image generation model 32 such that the discriminator D discriminates that the image generated by the third generator G_0 when the image obtained by reducing the semantic layout image S_0 by the reduction unit 26 is input to the first generator G_2 is not an image generated by the generator G.
For example, the loss function of the generator G is defined by equation (6) below, again written in the standard form of Reference 2; this adversarial term complements the reconstruction loss of equation (4):

$$\mathrm{Loss}_{G} = -\mathbb{E}\left[ \log D(G_0(G_1(G_2(S_0)))) \right] \qquad (6)$$
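The following is a minimal sketch of one alternating update implementing equations (5) and (6), assuming PyTorch, a discriminator D that outputs a probability in [0, 1], and optimizers defined elsewhere; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def adversarial_step(G, D, opt_G, opt_D, s0_reduced, i0):
    """One alternating update: train D to separate the learning image from the
    generated image (equation (5)), then train G so that D judges its output
    to be a real image (equation (6))."""
    fake = G(s0_reduced)  # G_0(G_1(G_2(S_0))), same size as the learning image I_0

    # Discriminator update, equation (5)
    d_real, d_fake = D(i0), D(fake.detach())  # detach: no gradient into G here
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad()
    loss_d.backward()
    opt_D.step()

    # Generator update, equation (6)
    d_fake = D(fake)
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad()
    loss_g.backward()
    opt_G.step()
    return loss_d.item(), loss_g.item()
```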
[Reference 2] Ian J. Goodfellow et al., "Generative Adversarial Nets", Internet search <URL: http://datascienceassn.org/sites/default/files/Generative%20Adversarial%20Nets.pdf>
The image generation model 32 trained by the learning unit 28 is output from the output unit 30 to the outside of the model learning device 10. In an image generation device using the trained image generation model 32, when image data representing an image obtained by reducing a semantic layout image S is input, the trained image generation model 32 generates a realistic image and outputs image data representing that image. In the specific example shown in Fig. 2, when image data of an image in which the semantic layout image S_0 has been reduced to 16 pixels × 16 pixels is input, image data representing an image comparable to the learning image I_0 is output.
<Operation of the model learning device of the present embodiment>
Next, the operation of the model learning device 10 of the present embodiment will be described with reference to the drawings. Fig. 4 is a flowchart showing an example of a model learning processing routine executed in the model learning device 10 of the present embodiment.
The model learning processing routine shown in Fig. 4 is executed at an arbitrary timing, for example, when the teacher data 1 is input to the input unit 20 or when an instruction to execute the model learning processing routine is received from outside the model learning device 10.
In step S100 of Fig. 4, as described above, the division unit 22 pseudo-increases the teacher data by performing a division process that divides each of the image data representing the learning image I_0 and the image data representing the semantic layout image S_0 into a plurality of regions, equally in the vertical and horizontal directions. For each division process, image data representing the generated semantic layout images S_n^l and image data representing the learning images I_n^l are stored in the image data storage unit 24.
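As an illustration of this division process, the following is a minimal sketch assuming NumPy arrays in (height, width, channel) layout; the function names are illustrative.

```python
import numpy as np

def divide_once(image):
    """Split an image into four equal regions: two vertically by two horizontally."""
    h, w = image.shape[0] // 2, image.shape[1] // 2
    return [image[:h, :w], image[:h, w:], image[h:, :w], image[h:, w:]]

def divide_l_times(image, l):
    """Apply the division process l times, yielding the 4**l regions S_n^l (or I_n^l)."""
    regions = [image]
    for _ in range(l):
        regions = [part for region in regions for part in divide_once(region)]
    return regions

# Example: a 128x128 image divided twice gives 16 regions of 32x32 each.
s0 = np.zeros((128, 128, 3), dtype=np.uint8)
assert len(divide_l_times(s0, 2)) == 16
assert divide_l_times(s0, 2)[0].shape == (32, 32, 3)
```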
In the next step S102, as described above, the reduction unit 26 receives the image data of each of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2 from the image data storage unit 24, and reduces the size of each of the semantic layout images S_0, S_1^1 to S_4^1, and S_1^2 to S_16^2.
In the next step S104, as described above, the learning unit 28 performs learning of the image generation model 32. The learning unit 28 inputs the image data representing each of the reduced semantic layout images S_n^l received from the reduction unit 26 to the generator G_2. The loss function derivation unit 34 then feeds the output sequentially to the generator G_1 and the generator G_0, and integrates the loss functions Loss_l obtained when images are generated by the generators G_2, G_1, and G_0 to derive the loss function Loss_generator of the entire image generation model 32. The learning unit 28 performs learning of the image generation model 32 so as to optimize the loss function Loss_generator. The trained image generation model 32 is output from the output unit 30 to the outside of the model learning device 10, and the model learning processing routine ends.
As described above, the model learning device 10 of the present embodiment is a model learning device for learning the image generation model 32 that generates a realistic image from a semantic layout image S. The model learning device 10 includes the input unit 20, the division unit 22, the reduction unit 26, and the learning unit 28. A pair of image data representing a learning image I_0 and image data representing a semantic layout image S_0 corresponding to the learning image I_0 is input to the input unit 20 as the teacher data 1. The division unit 22 performs, L times, a division process of dividing each of the learning image I_0 represented by the input image data and the semantic layout image S_0 represented by the input image data into a plurality of regions, and outputs, each time the division process is performed, image data corresponding to the divided regions. The reduction unit 26 reduces the sizes of the semantic layout image S_0 input to the input unit 20 and of the images of the plurality of regions to a predetermined size. The learning unit 28 performs learning of the image generation model 32, which includes L+1 generators G applied in order to an input image to generate upsampled images, the input size of the first generator G_2 being the predetermined size, such that, when an image obtained by reducing each of the plurality of regions obtained by dividing the semantic layout image S_0 by L−l division processes (0 ≤ l < L) to the predetermined size is input to the first generator G_2, the (l+1)th generator G_{l+1} generates an image corresponding to each of the plurality of regions obtained by dividing the learning image I_0 by the L−l division processes, and, when an image obtained by reducing the semantic layout image S_0 to the predetermined size is input to the first generator G_2, the (L+1)th generator G_{L+1} generates an image corresponding to the learning image I_0.
With the above configuration, the model learning device 10 of the present embodiment optimizes, for each region obtained by dividing the input semantic layout image S, the image generated by the generator G corresponding to the number L of division processes. Therefore, according to the model learning device 10 of the present embodiment, it is possible to learn an image generation model 32 capable of generating images that are accurate even in their details.
Note that although the present embodiment has described a pyramid-shaped division as the method of dividing the semantic layout image S_0, the division method is not limited to this form. In pyramid-shaped division, regions are often set so as to straddle objects, in which case the consistency of the objects is not guaranteed. Therefore, for example, an object-aware division method may be used in addition to the pyramid-shaped division, in which case the above problem can be mitigated.
Fig. 5 is a schematic diagram of an object-aware image division method.
For the semantic layout image S_0, candidate regions of the same sizes as in the pyramid-shaped division are set at random (solid-line frames in the figure). Following the earlier example, candidate regions of 64 pixels × 64 pixels and 32 pixels × 32 pixels are used. The true region of each label is also set (dotted-line frames in the figure).
Next, for each of these candidate regions, the IoU (Intersection over Union) is computed for each label, and by using for learning only the candidate regions whose IoU exceeds a threshold, it becomes possible to generate images while preserving object consistency. The IoU is defined as the number of pixels in the intersection of the true region and the candidate region divided by the number of pixels in the union of the true region and the candidate region.
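A minimal sketch of this selection criterion is shown below, assuming binary NumPy masks (one true-region mask per label) and an illustrative threshold of 0.5; the names are not from this embodiment.

```python
import numpy as np

def iou(true_mask: np.ndarray, candidate_mask: np.ndarray) -> float:
    """IoU: pixels in the intersection of the true and candidate regions,
    divided by pixels in their union."""
    intersection = np.logical_and(true_mask, candidate_mask).sum()
    union = np.logical_or(true_mask, candidate_mask).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

def keep_candidate(label_masks, y, x, size, threshold=0.5):
    """Keep a size x size candidate region at (y, x) if its IoU with the true
    region of at least one label exceeds the threshold."""
    any_mask = next(iter(label_masks.values()))
    candidate = np.zeros_like(any_mask, dtype=bool)
    candidate[y:y + size, x:x + size] = True
    return any(iou(mask, candidate) > threshold for mask in label_masks.values())
```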
Note that the present disclosure is not limited to the above embodiment, and various modifications and applications are possible without departing from the gist of the present disclosure.
Further, although the present embodiment has described a mode in which the above program is installed in advance, the program can also be provided stored in a computer-readable recording medium, or provided via a network.
1 Teacher data
10 Model learning device
20 Input unit
22 Division unit
26 Reduction unit
28 Learning unit
32 Image generation model
34 Loss function derivation unit
D Discriminator
G_0, G_1, G_2 Generators
I, I_0, I_n^l Learning images
S, S_0, S_n^l Semantic layout images

Claims (6)

  1.  A model learning device for learning an image generation model that generates a realistic image from a semantic layout image, the model learning device comprising:
     an input unit to which a pair of image data representing a learning image and image data representing a semantic layout image corresponding to the learning image is input as teacher data;
     a division unit that performs, L times, a division process of dividing each of the learning image represented by the input image data of the learning image and the semantic layout image represented by the image data of the semantic layout image into a plurality of regions, and that outputs, each time the division process is performed, image data corresponding to the divided regions;
     a reduction unit that reduces the sizes of the semantic layout image input to the input unit and of the images of the plurality of regions to a predetermined size; and
     a learning unit that performs learning of the image generation model, which includes L+1 generators that generate upsampled images and are applied in order to an input image, the input size of the first generator being the predetermined size, such that,
     when an image obtained by reducing each of a plurality of regions obtained by dividing the semantic layout image by L−l division processes (0 ≤ l < L) to the predetermined size is input to the first generator, the (l+1)th generator generates an image corresponding to each of a plurality of regions obtained by dividing the learning image by the L−l division processes, and such that,
     when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, the (L+1)th generator generates an image corresponding to the learning image.
  2.  The model learning device according to claim 1, wherein
     the learning unit includes a loss function derivation unit that derives a loss function including an error between the image generated by the (l+1)th generator when an image obtained by reducing each of the plurality of regions obtained by dividing the semantic layout image by L−l division processes (0 ≤ l < L) to the predetermined size is input to the first generator and each of the plurality of regions obtained by dividing the learning image by the L−l division processes, and an error between the image generated by the (L+1)th generator when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator and the learning image, and
     the learning unit learns the L+1 generators of the image generation model so as to optimize the value of the loss function.
  3.  The model learning device according to claim 1 or claim 2, wherein
     the learning unit trains a discriminator, which discriminates whether or not an image is an image generated by a generator, so as to discriminate that the learning image is not an image generated by the generators and that the image generated by the (L+1)th generator when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator is an image generated by the generators, and
     the learning unit performs learning of the image generation model such that the discriminator discriminates that the image generated by the (L+1)th generator when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator is not an image generated by the generators.
  4.  The model learning device according to any one of claims 1 to 3, wherein
     the division unit divides an image, in one division process, into two equal parts in the vertical direction and two equal parts in the horizontal direction.
  5.  A model learning method for learning an image generation model that generates a realistic image from a semantic layout image, the model learning method comprising:
     a step in which a pair of image data representing a learning image and image data representing a semantic layout image corresponding to the learning image is input to an input unit as teacher data;
     a step in which a division unit performs, L times, a division process of dividing each of the learning image represented by the input image data of the learning image and the semantic layout image represented by the image data of the semantic layout image into a plurality of regions, and outputs, each time the division process is performed, image data corresponding to the divided regions;
     a step in which a reduction unit reduces the sizes of the semantic layout image input to the input unit and of the images of the plurality of regions to a predetermined size; and
     a step in which a learning unit performs learning of the image generation model, which includes L+1 generators that generate upsampled images and are applied in order to an input image, the input size of the first generator being the predetermined size, such that,
     when an image obtained by reducing each of a plurality of regions obtained by dividing the semantic layout image by L−l division processes (0 ≤ l < L) to the predetermined size is input to the first generator, the (l+1)th generator generates an image corresponding to each of a plurality of regions obtained by dividing the learning image by the L−l division processes, and such that,
     when an image obtained by reducing the semantic layout image to the predetermined size is input to the first generator, the (L+1)th generator generates an image corresponding to the learning image.
  6.  A program for causing a computer to function as each unit of the model learning device according to any one of claims 1 to 4.
PCT/JP2019/047940 2018-12-18 2019-12-06 Model learning device, model learning method, and program WO2020129716A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018236645A JP2020098490A (en) 2018-12-18 2018-12-18 Model learning device, model learning method, and program
JP2018-236645 2018-12-18

Publications (1)

Publication Number Publication Date
WO2020129716A1 true WO2020129716A1 (en) 2020-06-25

Family

ID=71102789

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/047940 WO2020129716A1 (en) 2018-12-18 2019-12-06 Model learning device, model learning method, and program

Country Status (2)

Country Link
JP (1) JP2020098490A (en)
WO (1) WO2020129716A1 (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN, QIFENG ET AL.: "Photographic Image Synthesis with Cascaded Refinement Networks", 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 25 December 2017 (2017-12-25), pages 1520 - 1529, XP033283011 *
WANG, TING-CHUN ET AL.: "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs", 2018 IEEE /CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 17 December 2018 (2018-12-17), pages 8798 - 8807, XP033473804 *

Also Published As

Publication number Publication date
JP2020098490A (en) 2020-06-25

Similar Documents

Publication Publication Date Title
JP6441980B2 (en) Method, computer and program for generating teacher images
CN109905624B (en) Video frame interpolation method, device and equipment
CN109191382B (en) Image processing method, device, electronic equipment and computer readable storage medium
KR102192850B1 (en) Method and device for generating feature maps by using feature upsampling networks
CN110879959B (en) Method and device for generating data set, and testing method and testing device using same
US11514694B2 (en) Teaching GAN (generative adversarial networks) to generate per-pixel annotation
US10311547B2 (en) Image upscaling system, training method thereof, and image upscaling method
CN113240580A (en) Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN111386536A (en) Semantically consistent image style conversion
RU2735148C1 (en) Training gan (generative adversarial networks) to create pixel-by-pixel annotation
CN109447897B (en) Real scene image synthesis method and system
CN111861886B (en) Image super-resolution reconstruction method based on multi-scale feedback network
WO2023050651A1 (en) Semantic image segmentation method and apparatus, and device and storage medium
CN112164008A (en) Training method of image data enhancement network, and training device, medium, and apparatus thereof
US20230162409A1 (en) System and method for generating images of the same style based on layout
US20230267686A1 (en) Subdividing a three-dimensional mesh utilizing a neural network
KR20210034462A (en) Method for training generative adversarial networks to generate per-pixel annotation
CN114648681B (en) Image generation method, device, equipment and medium
CN113763366B (en) Face changing method, device, equipment and storage medium
CN113129231B (en) Method and system for generating high-definition image based on countermeasure generation network
KR102228128B1 (en) Method and system for learning self-organizing generative networks
WO2020129716A1 (en) Model learning device, model learning method, and program
CN113554047A (en) Training method of image processing model, image processing method and corresponding device
CN116843901A (en) Medical image segmentation model training method and medical image segmentation method
JP6647475B2 (en) Language processing apparatus, language processing system, and language processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19898035

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19898035

Country of ref document: EP

Kind code of ref document: A1