JP2019220133A

JP2019220133A - Image generation device, image generator, image discriminator, image generation program, and image generation method

Info

Publication number: JP2019220133A
Application number: JP2018227654A
Authority: JP
Inventors: Koichi Hamada (濱田晃一); Kentaro Tachibana (橘 健太郎)
Original assignee: DeNA Co Ltd
Current assignee: DeNA Co Ltd
Priority date: 2018-12-04
Filing date: 2018-12-04
Publication date: 2019-12-26

Abstract

PROBLEM TO BE SOLVED: To generate a high-resolution generated image including an object having a complex structure.
SOLUTION: An image generator 28 and an image discriminator 30 are trained stepwise through multiple stages, each corresponding to a different resolution, proceeding from the stage corresponding to the low resolution to the stage corresponding to the high resolution. In each stage, the image discriminator 30 is trained to discriminate which of a training image 22 and a generated image 50 is the generated image. It does so based on the generated image 50 and on structure information 26 and the training image 22, both converted to the resolution of that stage. The image generator 28 generates a generated image 50 at the resolution of that stage from latent vectors 24. It is trained so that the generated image 50 exhibits the characteristics of the training image 22, based on feedback (the discrimination result) from the image discriminator 30 and on the structure information 26 and the training image 22 converted to the resolution of that stage.
SELECTED DRAWING: Figure 3

Description

The present invention relates to an image generation device, an image generator, an image discriminator, an image generation program, and an image generation method.

In recent years, techniques have been proposed for automatically generating images using machine learning, for example deep neural networks.

Non-Patent Document 1 discloses a technique called GANs (Generative Adversarial Nets), which consists of a pair of networks. One is an image generator that generates an image from a latent vector representing image features. The other is an image discriminator that identifies whether an input image is a generated image produced by the image generator. In GANs, the image discriminator learns to identify ever more reliably whether an input image is a generated image. Meanwhile, the image generator learns to produce generated images close to real images so as to fool the image discriminator (that is, to make it misidentify them).
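For reference, the adversarial training described above corresponds to the minimax objective given in Non-Patent Document 1, shown below in its standard form. Here G is the image generator and D the image discriminator, whose output is read as the probability that its input is a real image. The distribution of real images is p_data, and p_z is the prior from which latent vectors z are drawn. The present patent does not itself state this formula.

```latex
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```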

Non-Patent Document 2 discloses PG (Progressive Growing of) GANs (PGGANs), an improvement on GANs. In PGGANs, the image generator and the image discriminator are trained in a plurality of stages, each corresponding to a different resolution. First, the image discriminator learns to identify which of two images is the generated image: a 4×4-pixel image produced by the image generator, or a real image (an image not produced by the image generator). It feeds the identification result back to the image generator. Taking that result into account, the image generator learns to produce generated images having the features indicated by the latent vector. When training of the 4×4 stage is complete, training proceeds to the next higher resolution (8×8 pixels), and the same processing is repeated stage by stage.
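The progressive schedule described above can be sketched as a list of stage resolutions, each stage doubling the side length of the previous one. This is a minimal illustration with a hypothetical function name. The 4 to 512 range assumed here matches the eight 4×4 to 512×512 convolutional layers of the embodiment described later.

```python
def stage_resolutions(start: int = 4, final: int = 512) -> list:
    """Return the side length (in pixels) of each progressive training stage.

    Each stage doubles the side length of the previous one, as in PGGANs.
    """
    if start <= 0 or final < start:
        raise ValueError("need 0 < start <= final")
    sizes = []
    size = start
    while size <= final:
        sizes.append(size)
        size *= 2
    if sizes[-1] != final:
        raise ValueError("final must equal start * 2**k for some integer k")
    return sizes


if __name__ == "__main__":
    for stage, n in enumerate(stage_resolutions(), start=1):
        print(f"stage {stage}: {n}x{n}")
```

Training runs these stages in ascending order, so the networks settle coarse layout before fine detail.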

Non-Patent Document 3 discloses a technique that takes an image containing a person and pose information indicating that person's posture. From these, it generates an image containing the person in the posture indicated by the pose information.

Non-Patent Document 1: Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, "Generative Adversarial Nets", In NIPS 2014.
Non-Patent Document 2: Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, "Progressive Growing of GANs for Improved Quality, Stability, and Variation", In ICLR 2018.
Non-Patent Document 3: Liqian Ma, Qianru Sun, Xu Jia, Bernt Schiele, Tinne Tuytelaars, Luc Van Gool, "Pose Guided Person Image Generation", In NIPS 2017.

Conventionally, it has been difficult to generate a high-resolution generated image containing an object with a complex structure. (Hereinafter, the structure of an object in an image may simply be called the "image structure".) Here, "high resolution" means a resolution of about 512×512 pixels or more.

The PGGANs of Non-Patent Document 2 can generate high-resolution images with a simple structure (for example, only the face of a person), but they have difficulty with complex structures. For example, they may produce an image of a person whose body faces backward while only the face points forward, or an image of an animal with two faces (see, e.g., FIGS. 14 to 17 of Non-Patent Document 2).

The technique of Non-Patent Document 3, on the other hand, can generate an image after its structure has been specified. However, the edges of the generated images are blurred (see FIGS. 4 to 6 of Non-Patent Document 3), and generating high-resolution images is more difficult still.

There is therefore a demand for a technique capable of generating high-resolution images containing objects with complex structures. It is also desirable that an image discriminator, such as those used in GANs, be able to reliably identify whether a high-resolution image with a complex structure is a generated image.

An object of the present invention is to generate a high-resolution generated image containing an object with a complex structure. Alternatively, an object of the present invention is to make it possible to reliably identify whether a high-resolution image containing an object with a complex structure is a generated image.

The present invention is an image generation device comprising two components. The first is an image generator that learns so that a generated image produced from a latent vector exhibits the characteristics of a training image. The second is an image discriminator that learns, based on the training image and the generated image, to identify which of the two is the image produced by the image generator. The image generator further learns based on the identification result of the image discriminator. The image generator and the image discriminator learn stepwise over a plurality of stages corresponding to different resolutions, proceeding from the stage corresponding to a low resolution toward the stage corresponding to a high resolution. In each stage, the image generator further takes into account structure information indicating the structure of the training image. It learns to generate, from the latent vector, a generated image at the resolution defined for that stage. The image discriminator likewise takes the structure information into account. It learns to identify which image was produced by the image generator: the training image converted to the resolution defined for that stage, or the generated image produced in that stage.
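The stage-wise procedure claimed above can be outlined as the following control-flow skeleton. All hook names (`sample_latent`, `resize`, `generate`, `update_discriminator`, `update_generator`) are hypothetical placeholders for real network code. The sketch only shows the order of operations and which inputs each network receives at each stage; it is not the patent's actual implementation.

```python
from typing import Any, Callable, List, Sequence, Tuple


def train_progressively(
    resolutions: Sequence[int],
    batches: Sequence[Tuple[Any, Any]],
    sample_latent: Callable[[], Any],
    resize: Callable[[Any, int], Any],
    generate: Callable[[Any, Any, int], Any],
    update_discriminator: Callable[[Any, Any, Any, int], None],
    update_generator: Callable[[Any, Any, int], None],
) -> List[str]:
    """Run the low-to-high stage schedule and return a log of visited stages.

    `batches` holds (training_image, structure_info) pairs. Every callable is a
    hypothetical hook standing in for real generator/discriminator code.
    """
    log = []
    for res in sorted(resolutions):  # lowest resolution first
        for image, structure in batches:
            real = resize(image, res)      # training image at this stage's resolution
            cond = resize(structure, res)  # structure info at the same resolution
            z = sample_latent()
            fake = generate(z, cond, res)  # generator conditioned on structure
            # Discriminator judges which of (real, fake) is generated, given cond.
            update_discriminator(real, fake, cond, res)
            # A real implementation would back-propagate the discriminator's
            # judgement of generate(z, cond, res) into the generator here.
            update_generator(z, cond, res)
            log.append(f"{res}x{res}")
    return log
```

Conditioning both networks on the same resized structure information at every stage is what the claim adds on top of plain progressive growing.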

Preferably, the structure information is information indicating the positions of a plurality of feature points of an object in the training image.

Preferably, the structure information is a structure image having coordinate information of the pixels corresponding to the feature points. In each stage, the image generator and the image discriminator learn while taking into account the structure image converted to the resolution defined for that stage.

Preferably, the latent vector is generated based on a predetermined probability distribution.

Preferably, the latent vector is generated based on the training image.

The present invention is also an image generator that learns in a plurality of stages corresponding to different resolutions. Stage by stage, from the stage corresponding to a low resolution toward the stage corresponding to a high resolution, it learns so that a generated image produced from a latent vector at that stage's resolution exhibits the characteristics of a training image at the same resolution. In each stage, the image generator further learns to generate the generated image while taking into account structure information indicating the structure of the training image.

The present invention is also an image discriminator that learns in a plurality of stages corresponding to different resolutions, based on a training image and a generated image produced from a latent vector by an image generator. Stage by stage, from the stage corresponding to a low resolution toward the stage corresponding to a high resolution, it learns to identify which of the training image and the generated image, each converted to that stage's resolution, was produced by the image generator. In each stage, the image discriminator further makes this identification while taking into account structure information indicating the structure of the training image.

The present invention is also an image generation program that causes a computer to function as two components. The first is an image generator that learns so that a generated image produced from a latent vector exhibits the characteristics of a training image. The second is an image discriminator that learns, based on the training image and the generated image, to identify which of the two is the image produced by the image generator. The image generator further learns based on the identification result of the image discriminator. The image generator and the image discriminator learn stepwise over a plurality of stages corresponding to different resolutions, proceeding from the stage corresponding to a low resolution toward the stage corresponding to a high resolution. In each stage, the image generator further takes into account structure information indicating the structure of the training image. It learns to generate, from the latent vector, a generated image at the resolution defined for that stage. The image discriminator likewise takes the structure information into account. It learns to identify which image was produced by the image generator: the training image converted to the resolution defined for that stage, or the generated image produced in that stage.

The present invention is also an image generation method that uses two components. The first is an image generator that learns so that a generated image produced from a latent vector exhibits the characteristics of a training image. The second is an image discriminator that learns, based on the training image and the generated image, to identify which of the two is the image produced by the image generator. The image generator further learns based on the identification result of the image discriminator. The image generator and the image discriminator learn stepwise over a plurality of stages corresponding to different resolutions, proceeding from the stage corresponding to a low resolution toward the stage corresponding to a high resolution. In each stage, the image generator further takes into account structure information indicating the structure of the training image. It learns to generate, from the latent vector, a generated image at the resolution defined for that stage. The image discriminator likewise takes the structure information into account. It learns to identify which image was produced by the image generator: the training image converted to the resolution defined for that stage, or the generated image produced in that stage.

According to the present invention, a high-resolution generated image can be generated after a complex image structure has been specified. Further, according to the present invention, it is possible to reliably identify whether a high-resolution image containing an object with a complex structure is a generated image.

FIG. 1 is a schematic configuration diagram of an image generation device according to the present embodiment.
FIG. 2 is a diagram showing an example of a training image and structure information.
FIG. 3 is a conceptual diagram showing the structures of the image generator and the image discriminator according to the present embodiment.
FIG. 4 is a first diagram showing characters generated by the image generation device according to the present embodiment.
FIG. 5 is a second diagram showing characters generated by the image generation device according to the present embodiment.
FIG. 6 is a flowchart showing the flow of the learning process according to the present embodiment.

<Configuration of Image Generation Device>
Embodiments of the present invention are described below. FIG. 1 is a schematic configuration diagram of an image generation device 10 according to the present embodiment. The image generation device 10 may be, for example, a personal computer or a server, but it may be any device capable of performing the functions described below.

As described in detail later, the image generation device 10 includes learners (the image generator 28 and the image discriminator 30). By training them, it generates a generated image from a latent vector 24 (described later) indicating image features. In particular, the image generation device 10 uses structure information 26 (described later), which indicates the structure of an object in an image. Based on it, the device generates a high-resolution generated image having the structure that the structure information 26 indicates.

The control unit 12 includes, for example, a CPU. It controls each unit of the image generation device 10 according to an image generation program stored in the storage unit 20 described later. In particular, the control unit 12 executes a learning process that trains the image generator 28 and the image discriminator 30 described later. The control unit 12 thus also functions as a learning unit that executes the learning process.

The display unit 14 includes, for example, a liquid crystal display, and displays various screens. In particular, it displays generated images produced by the image generation device 10.

The input unit 16 includes a mouse, a keyboard, a touch panel, or the like, and is used to input user instructions to the image generation device 10.

The communication unit 18 includes, for example, a network adapter, and is used to communicate with other devices via a communication line such as a LAN or the Internet.

The storage unit 20 includes, for example, RAM, ROM, or a hard disk. It stores the image generation program for operating each unit of the image generation device 10. As shown in FIG. 1, the storage unit 20 also stores training images 22, latent vectors 24, structure information 26, the image generator 28, and the image discriminator 30.

The training images 22 are images used to train the image generator 28 and the image discriminator 30. A training image 22 may contain an object with a complex structure. Examples are an image of a person, or of a character, showing enough of the torso and limbs that the posture can be recognized. Objects with complex structures are not limited to humanoid objects. An object with a complex structure can also be described as an object whose feature points can change their positions relative to one another. For example, a person can take various postures, so the positional relationship between feature points (for example, between the head and the right hand) is not fixed but changes with the posture.

The latent vector 24 is information indicating image features. Latent vectors 24 fall into two groups. Those used to train the image generator 28 and the image discriminator 30 are either generated from a predetermined probability distribution or indicate the features of a training image 22. Those used when the trained image generator 28 produces a generated image indicate the features of the image the user wants to generate.

The latent vector 24 may be generated in any manner, but latent vectors 24 for training are generated either from a predetermined probability distribution or from a training image 22. When a predetermined probability distribution is used, the latent vector is obtained by sampling that distribution for the specified number of dimensions. Examples of such distributions include the normal distribution and the uniform distribution. When a training image is used, the training image 22 is input to an encoder that converts training images 22 into latent vectors 24. The encoder's output is a latent vector 24 indicating the features of that training image 22.
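Sampling a training latent vector from a fixed prior, as described above, might look like the following minimal sketch. The uniform range of [-1, 1] and the function name are assumptions, not values from the patent. The encoder-based alternative is not shown.

```python
import random
from typing import List, Optional


def sample_latent(dim: int, dist: str = "normal",
                  rng: Optional[random.Random] = None) -> List[float]:
    """Draw a `dim`-dimensional latent vector from a fixed prior.

    "normal"  -> each component from the standard normal N(0, 1)
    "uniform" -> each component from the uniform distribution on [-1, 1]
                 (this range is an assumption for illustration)
    """
    rng = rng or random.Random()
    if dist == "normal":
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]
    if dist == "uniform":
        return [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    raise ValueError(f"unknown distribution: {dist}")


if __name__ == "__main__":
    z = sample_latent(512, rng=random.Random(0))  # 512 dimensions is an assumed size
    print(len(z))
```

Passing an explicitly seeded `random.Random` makes a sampled latent vector reproducible, which helps when comparing generator outputs across training stages.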

The latent vector 24 is a multidimensional vector representing image features, and the latent space it defines represents the features of the training images. The structure information 26 may be concatenated to the latent vector so that the latent space can be further conditioned on image features.
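Concatenating structure information to the latent vector can be illustrated as follows. This sketch makes two hypothetical assumptions: the structure information is given as a list of (x, y) feature-point coordinates, and these are normalized by the image side length before being appended. The function name is also hypothetical.

```python
from typing import List, Sequence, Tuple


def condition_latent(
    z: Sequence[float],
    keypoints: Sequence[Tuple[float, float]],
    image_size: int,
) -> List[float]:
    """Append normalized feature-point coordinates to a latent vector.

    Each (x, y) pixel coordinate is divided by `image_size`, so the appended
    values lie in [0, 1] regardless of the image resolution.
    """
    flat = [c / image_size for point in keypoints for c in point]
    return list(z) + flat


if __name__ == "__main__":
    print(condition_latent([0.5, -0.5], [(256, 128), (0, 512)], 512))
```

Normalizing the coordinates keeps the same pose description valid at every stage resolution.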

The structure information 26 is information indicating the structure of an image. If an image contains a person, for example, the image structure is that person's posture. The image structure may also include the positions of objects in the image, the positional relationships among a plurality of objects, and the like. Structure information 26 falls into two groups. One indicates the structure of a training image 22 and is used to train the image generator 28 and the image discriminator 30. The other indicates the structure of the image the user wants to generate and is used when the trained image generator 28 produces a generated image.

The structure information 26 used in training may be information indicating the positions of a plurality of feature points of an object in a training image 22. In the present embodiment, it is a structure image having coordinate information of the pixels corresponding to those feature points.

FIG. 2 shows an example of a training image 22 and the structure information 26 indicating its structure. When the training image 22 shows a person's whole body, as in FIG. 2(a), the structure information 26 shows the positions of pixels corresponding to a plurality of feature points of that person, as in FIG. 2(b). The structure information 26 is a structure image in which all non-feature pixels are black (0% luminance). The pixels corresponding to the feature points are white (100% luminance). These feature points are the right eye, left eye, right ear, left ear, nose, right shoulder, left shoulder, right elbow, left elbow, right hand, left hand, right knee, left knee, right foot, and left foot. The structure information 26 also indicates which of these feature points each white pixel corresponds to (for example, whether it is the right eye or the left eye). In this way, the structure information 26 may be an image in which information identifying each feature point is attached to the pixel at that feature point's position.
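A structure image of the kind shown in FIG. 2(b) could be built as follows. Using one channel per feature point is an assumption made here, so that each white pixel's identity (e.g., right eye vs. left eye) is carried by its channel index. The patent only says that such identifying information is attached, without fixing a format. The keypoint list is a hypothetical subset of the feature points named above.

```python
from typing import Dict, List, Tuple

# Hypothetical subset of the feature points named in the text.
KEYPOINTS = ["right_eye", "left_eye", "nose", "right_hand", "left_hand"]


def structure_image(points: Dict[str, Tuple[int, int]],
                    size: int) -> List[List[List[float]]]:
    """Return one size x size plane per entry in KEYPOINTS.

    Plane k is 1.0 (white) at the (x, y) pixel of KEYPOINTS[k] and 0.0 (black)
    elsewhere; a feature point missing from `points` leaves its plane all black.
    """
    planes = []
    for name in KEYPOINTS:
        plane = [[0.0] * size for _ in range(size)]
        if name in points:
            x, y = points[name]
            if not (0 <= x < size and 0 <= y < size):
                raise ValueError(f"{name} at {(x, y)} is outside a {size}x{size} image")
            plane[y][x] = 1.0  # row index is y, column index is x
        planes.append(plane)
    return planes


if __name__ == "__main__":
    img = structure_image({"nose": (256, 120), "right_hand": (180, 300)}, 512)
    print(len(img), sum(sum(row) for plane in img for row in plane))
```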

The structure information 26 is not limited to a structure image like that in FIG. 2(b). It may, for example, be text describing the image structure, in which case the control unit 12 analyzes the text to obtain the structure it expresses. Besides the positions of objects in the image, the structure information 26 may include attribute information indicating object features. When the object is a person, such attributes include whether glasses are worn, clothing, hair color, and race. Nor is the structure information 26 limited to such explicit information; it may, for example, be a plain numerical vector in which that information is embedded.
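Embedding attribute information into a plain numerical vector, as mentioned above, can be sketched with a binary flag and a one-hot code. The attribute names and the hair-color vocabulary below are hypothetical choices for illustration; the patent does not specify an encoding.

```python
from typing import Dict, List

# Hypothetical vocabulary; a real system would define its own attribute set.
HAIR_COLORS = ["black", "brown", "blonde", "red"]


def attribute_vector(attrs: Dict[str, object]) -> List[float]:
    """Encode person attributes as [has_glasses] + one-hot(hair_color).

    Unknown or missing values produce all-zero entries for that attribute.
    """
    vec = [1.0 if attrs.get("glasses") else 0.0]
    hair = attrs.get("hair_color")
    vec += [1.0 if hair == color else 0.0 for color in HAIR_COLORS]
    return vec


if __name__ == "__main__":
    print(attribute_vector({"glasses": True, "hair_color": "brown"}))
```

A vector built this way could, for example, be concatenated to the latent vector in the same manner as positional structure information.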

The image generator 28 generates, from a latent vector 24, an image with the content that the latent vector 24 indicates. Hereinafter, an image produced by the image generator 28 is called a "generated image". In particular, the image generator 28 also uses the structure information 26 to produce the generated image. The image generator 28 is a learner; specifically, it includes a convolutional neural network. The network consists of its parameters and a processing program that processes the input data (the latent vector 24 and the structure information 26). The parameters include the layer structure, the neuron structure of each layer, the number of filters in each layer, filter size, stride, zero-padding width, and the weight of each filter element. Storing the image generator 28 in the storage unit 20 therefore means storing these parameters and the processing program there.
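The filter size, stride, and zero-padding parameters listed above jointly determine a convolution's output resolution through the standard relation below. This is a general illustration, not a parameter set disclosed in the patent. For instance, a 3×3 filter with stride 1 and padding 1 preserves the input resolution, which lets a layer stay at its N×N stage resolution.

```python
def conv_output_size(in_size: int, filter_size: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a 2-D convolution along one axis.

    Standard relation: floor((in + 2 * padding - filter) / stride) + 1.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    span = in_size + 2 * padding - filter_size
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    return span // stride + 1


if __name__ == "__main__":
    print(conv_output_size(512, 3, stride=1, padding=1))  # resolution preserved
    print(conv_output_size(8, 3, stride=2, padding=1))    # resolution halved
```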

The image generator 28 learns from four inputs: the training images 22, the latent vectors 24 for training, the structure information 26 indicating the structure of the training images 22, and the identification results of the image discriminator 30. From these, it learns to produce generated images that exhibit the features of the training images 22 and have the structure indicated by the structure information 26. The structure and training method of the image generator 28 are described in detail later.

The image discriminator 30 identifies whether an input image is a real image (an image that is not a generated image) or a generated image. In the present embodiment, both a real image and a generated image are input to the image discriminator 30, which identifies which of the two is the generated image. Like the image generator 28, the image discriminator 30 is a learner that includes a convolutional neural network. Storing it in the storage unit 20 likewise means storing its parameters and processing program there.
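The patent states that the discriminator receives both a real and a generated image and decides which one is generated, but it does not give a scoring rule. One simple rule is assumed here purely for illustration. The network assigns each image a scalar score (higher means "more likely generated"), and a logistic function of the score difference gives the probability that the second input is the generated one. The function names are hypothetical, and the loss is written in a numerically stable softplus form.

```python
import math


def pair_generated_probability(score_a: float, score_b: float) -> float:
    """Probability that input b (rather than input a) is the generated image."""
    d = score_b - score_a
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)  # avoids overflow for large negative d
    return e / (1.0 + e)


def discriminator_pair_loss(score_real: float, score_fake: float) -> float:
    """Cross-entropy loss when the fake image is input b: -log p(b is generated).

    Computed as softplus(-d) = max(0, -d) + log1p(exp(-|d|)) for stability.
    """
    d = score_fake - score_real
    return max(0.0, -d) + math.log1p(math.exp(-abs(d)))


if __name__ == "__main__":
    print(pair_generated_probability(0.0, 2.0))
    print(discriminator_pair_loss(0.0, 2.0))
```

Minimizing this loss pushes the score of the generated image above that of the real image. A generator trained against it would push the other way, which is the feedback the text describes.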

Based on the training images 22 and the generated images, the image discriminator 30 learns to identify which of a training image 22 and a generated image is the generated image. The structure and training method of the image discriminator 30 are described in detail later.

This concludes the outline of the configuration of the image generation device 10. The structures and training methods of the image generator 28 and the image discriminator 30 are now described in detail with reference to FIG. 3. FIG. 3 shows the structures of the image generator 28 and the image discriminator 30 (see the structure at stage 8), together with a conceptual diagram of the learning process at each learning stage. A dash-dot line is drawn horizontally across roughly the vertical middle of FIG. 3. The part above it shows the structure of the image generator 28, and the part below it shows the structure of the image discriminator 30.

<Structure of image generator>
In FIG. 3, the white boxes represent convolutional layers 40, which are the targets of training. The label N×N on a convolutional layer 40 indicates that the layer corresponds to a (spatial) resolution of N×N pixels; that is, an N×N convolutional layer 40 is trained to produce a generated image with a resolution of N×N pixels.

As shown in FIG. 3, the image generator 28 includes a plurality of convolutional layers 40, each corresponding to a different resolution. In the present embodiment, the image generator 28 has eight convolutional layers 40: 4×4, 8×8, 16×16, 32×32, 64×64, 128×128, 256×256, and 512×512.
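The eight resolutions double from one stage to the next, starting at 4×4. A minimal Python sketch of that schedule follows; `resolution_for_stage`, `BASE_RESOLUTION`, and `NUM_STAGES` are illustrative names, not taken from the patent.

```python
# Hypothetical helper: map stage number n (1-based) to the side length N of
# the N x N resolution used in that stage of the embodiment.
BASE_RESOLUTION = 4  # stage 1 works at 4 x 4
NUM_STAGES = 8       # stages 1..8, ending at 512 x 512

def resolution_for_stage(n: int) -> int:
    """Return N for stage n; each stage doubles the previous side length."""
    if not 1 <= n <= NUM_STAGES:
        raise ValueError(f"stage must be in 1..{NUM_STAGES}, got {n}")
    return BASE_RESOLUTION * 2 ** (n - 1)

print([resolution_for_stage(n) for n in range(1, NUM_STAGES + 1)])
# [4, 8, 16, 32, 64, 128, 256, 512]
```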

In FIG. 3, the gray boxes represent downsampling layers 42. A downsampling layer 42 downsamples (converts) the structure information 26 to a lower resolution. The downsampling layers 42 may or may not be trained. The label N×N on a downsampling layer 42 indicates that the layer corresponds to a resolution of N×N pixels; that is, an N×N downsampling layer 42 downsamples the structure information 26 to N×N and outputs the result.

As shown in FIG. 3, the image generator 28 also includes a plurality of downsampling layers 42, each corresponding to a different resolution. In the present embodiment, the image generator 28 includes downsampling layers 42 for 4×4, 8×8, 16×16, 32×32, 64×64, and 128×128.
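The patent leaves open whether the downsampling layers 42 are learned. One non-learned option, shown here as an assumption rather than the embodiment's actual method, is average pooling. The sketch treats a single-channel structure image as a list of rows; `downsample` is a hypothetical helper.

```python
from typing import List

Image = List[List[float]]  # single-channel image as a list of rows

def downsample(image: Image, target: int) -> Image:
    """Average-pool a square image to target x target (non-learned).

    Assumes the input side length is a multiple of `target`, which holds
    for the power-of-two resolutions used in this embodiment.
    """
    size = len(image)
    if size % target != 0:
        raise ValueError("input size must be a multiple of target")
    f = size // target  # side length of each pooling window
    return [
        [
            sum(image[i * f + di][j * f + dj] for di in range(f) for dj in range(f)) / (f * f)
            for j in range(target)
        ]
        for i in range(target)
    ]
```

For example, pooling a 4×4 image to 2×2 averages each 2×2 block, and a 512×512 structure image pooled to 4×4 uses 128×128 windows.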

<Structure of image discriminator>
As in the image generator 28, the white boxes represent convolutional layers 44, which are the targets of training. The label N×N on a convolutional layer 44 indicates that the layer corresponds to a resolution of N×N pixels; that is, an N×N convolutional layer 44 is trained to identify which of a training image 22 and a generated image, both at N×N pixels, is the generated image.

The image discriminator 30 has the same number of convolutional layers 44 as the image generator 28 has convolutional layers 40. In the present embodiment, the image discriminator 30 has eight convolutional layers 44: 4×4, 8×8, 16×16, 32×32, 64×64, 128×128, 256×256, and 512×512.

The image discriminator 30 also has a plurality of downsampling layers 42. In the present embodiment, it includes downsampling layers 42 for 4×4, 8×8, 16×16, 32×32, 64×64, and 128×128. The downsampling layers 42 of the image discriminator 30 downsample not only the structure information 26 but also the training image 22.

<Training method of image generator and image discriminator>
The image generator 28 and the image discriminator 30 are trained over a plurality of stages, each corresponding to a different resolution; specifically, training proceeds stepwise from the stage corresponding to the lowest resolution to the stage corresponding to the highest. A distinguishing feature of the present embodiment is that the image generator 28 is trained taking into account not only the training image 22 and the latent vector 24 but also the structure information 26 indicating the structure of the training image 22, and that the image discriminator 30 is likewise trained taking that structure information 26 into account. Details are given below.

First, the image generator 28 and the image discriminator 30 begin training at stage 1, which corresponds to the lowest resolution (4×4 pixels in the present embodiment). In stage 1, the image generator 28 trains its 4×4 convolutional layer 40, and the image discriminator 30 trains its 4×4 convolutional layer 44.

Specifically, the image discriminator 30 updates (that is, trains) the parameters of its 4×4 convolutional layer 44 so that it can identify which of the generated image 50a and the training image 22a is the generated image. It does so based on the training image 22a and the structure information 26a, both downsampled to 4×4 pixels by the 4×4 downsampling layer 42, and on the 4×4-pixel generated image 50a produced by the 4×4 convolutional layer 40 of the image generator 28.

Because the 4×4 convolutional layer 44 is trained with the structure information 26a taken into account, it can learn while considering, for example, whether the generated image 50a has the structure indicated by the structure information 26a. That is, if the structure of the generated image 50a is not similar to the structure indicated by the structure information 26a, the 4×4 convolutional layer 44 learns to judge that the generated image 50a is not the training image 22a, in other words, that it is the generated image.
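How the structure information is fed into the convolutional layer 44 is not specified in this section. A common conditioning approach, assumed here purely for illustration, is to append the structure channels to the image channels, so that the training image 22a and the generated image 50a are each judged together with the same structure information 26a. The helper names below are hypothetical.

```python
from typing import List, Tuple

Channel = List[List[float]]  # one N x N channel at the current stage resolution

def conditioned(image: List[Channel], structure: List[Channel]) -> List[Channel]:
    """Concatenate structure channels onto image channels (channel concat)."""
    n = len(image[0])
    for ch in image + structure:
        if len(ch) != n or any(len(row) != n for row in ch):
            raise ValueError("image and structure must share the stage resolution")
    return image + structure

def discriminator_pair(training: List[Channel], generated: List[Channel],
                       structure: List[Channel]) -> Tuple[List[Channel], List[Channel]]:
    """Both candidates are judged against the same structure information."""
    return conditioned(training, structure), conditioned(generated, structure)
```

With a three-channel color image and a one-channel structure image, each discriminator input in this sketch has four channels.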

The image discriminator 30 then outputs its identification result: for example, "1" if it identifies the generated image 50a as the generated image, and "0" if it identifies the training image 22a as the generated image. This result is fed back to the image generator 28. The output of the image discriminator 30 is not limited to "0" and "1"; any two distinct values corresponding respectively to the generated image 50a and the training image 22a may be output.

The 4×4 convolutional layer 40 of the image generator 28 generates a 4×4-pixel generated image 50a from the latent vector 24. It then updates (that is, trains) its parameters so that the generated image 50a approaches the training image 22a, based on the training image 22a and structure information 26a downsampled to 4×4 pixels, the generated image 50a, and the identification result fed back from the image discriminator 30.

Because the 4×4 convolutional layer 40 is trained with the structure information 26a taken into account, it can generate the generated image 50a with reference to the structure indicated by the structure information 26a. This makes it easier to generate a generated image 50a that is closer to the training image 22a.

In stage 1, the image generator 28 (its 4×4 convolutional layer 40) and the image discriminator 30 (its 4×4 convolutional layer 44) repeat the training process described above. In doing so, the image generator 28 learns to "fool" the image discriminator 30, that is, to make it misidentify the generated image 50a as not being a generated image, while the image discriminator 30 learns to correctly distinguish the training image 22a from the generated image 50a. As a result, the image generator 28 becomes able to produce generated images 50a closer to the training image 22a, and the image discriminator 30 becomes able to distinguish the training image 22a from the generated image 50a more accurately.
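The patent states only that the identification result is fed back to the image generator 28. As an illustration, assuming a binary cross-entropy objective (an assumption the source does not confirm), the two opposing goals can be written as losses on `p`, the discriminator's probability that the generated image 50a, rather than the training image 22a, is the generated one. This matches the "1"/"0" output convention described earlier; the function names are hypothetical.

```python
import math

EPS = 1e-12  # keeps log() finite when p is exactly 0 or 1

def discriminator_loss(p: float) -> float:
    """Binary cross-entropy with target 1: the discriminator should output
    "1", i.e. flag the generated image 50a as the generated one."""
    return -math.log(max(p, EPS))

def generator_loss(p: float) -> float:
    """Binary cross-entropy with target 0: the generator tries to fool the
    discriminator into flagging the training image as the generated one."""
    return -math.log(max(1.0 - p, EPS))
```

In this sketch the discriminator minimizes the first loss and the generator the second, so each update pushes `p` in opposite directions.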

When the image generator 28 and the image discriminator 30 reach a Nash equilibrium, that is, when updating the parameters of the image generator 28 no longer fools the image discriminator 30 any further and updating the parameters of the image discriminator 30 no longer improves its accuracy in distinguishing the training image 22a from the generated image 50a, the training of stage 1 (the training of the 4×4 convolutional layers 40 and 44) ends. Training then proceeds to stage 2, which corresponds to the next higher resolution (8×8 pixels in the present embodiment).
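A Nash equilibrium cannot be observed directly during training, so an implementation needs some proxy for this stage-transition test. One possible heuristic, assumed here and not described in the patent, is to advance when neither network's loss has improved over a recent window of updates. `reached_equilibrium`, `window`, and `tol` are hypothetical.

```python
from typing import Sequence

def reached_equilibrium(g_losses: Sequence[float], d_losses: Sequence[float],
                        window: int = 100, tol: float = 1e-3) -> bool:
    """Heuristic stand-in for the stage's Nash-equilibrium test.

    Returns True when, over the last `window` updates, neither network's
    mean loss improved by more than `tol` relative to the preceding window.
    """
    if len(g_losses) < 2 * window or len(d_losses) < 2 * window:
        return False  # not enough history yet

    def improved(losses: Sequence[float]) -> bool:
        prev = sum(losses[-2 * window:-window]) / window
        recent = sum(losses[-window:]) / window
        return prev - recent > tol

    return not improved(g_losses) and not improved(d_losses)
```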

Stage 2 is trained in basically the same way as stage 1. In stage 2, the image generator 28 trains its 8×8 convolutional layer 40, and the image discriminator 30 trains its 8×8 convolutional layer 44.

Specifically, the image discriminator 30 updates the parameters of its 8×8 convolutional layer 44 so that it can identify which of the generated image 50b and the training image 22b is the generated image. It does so based on the training image 22b and the structure information 26b, both downsampled to 8×8 pixels by the 8×8 downsampling layer 42, and on the 8×8-pixel generated image 50b produced by the 8×8 convolutional layer 40 of the image generator 28.

The output of the 8×8 convolutional layer 44 is passed to the 4×4 convolutional layer 44, and the image discriminator 30 outputs its identification result taking into account the processing result of the 4×4 convolutional layer 44. In stage 2 as well, the identification result is fed back to the image generator 28.

The 8×8 convolutional layer 40 of the image generator 28 generates an 8×8-pixel generated image 50b from the latent vector 24 indicating the features of the training image 22, taking into account the processing result of the 4×4 convolutional layer 40. It then updates the parameters of the 8×8 convolutional layer 40 so that the generated image 50b approaches the training image 22b, based on the training image 22b and structure information 26b downsampled to 8×8 pixels, the generated image 50b, and the identification result fed back from the image discriminator 30.
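The layer ordering at stage n can be made concrete: the generator runs its layers from 4×4 up to the stage resolution, and the discriminator runs from the stage resolution down to 4×4, so the layers added at each stage sit at the generator's output end and the discriminator's input end. This sketch only lists layer names; the naming scheme and `layer_order` are hypothetical.

```python
from typing import List, Tuple

def layer_order(stage: int, base: int = 4) -> Tuple[List[str], List[str]]:
    """Return (generator order, discriminator order) of conv layers at a stage."""
    sizes = [base * 2 ** k for k in range(stage)]              # e.g. [4, 8] at stage 2
    generator = [f"G{s}x{s}" for s in sizes]                   # low -> high resolution
    discriminator = [f"D{s}x{s}" for s in reversed(sizes)]     # high -> low resolution
    return generator, discriminator

print(layer_order(2))  # (['G4x4', 'G8x8'], ['D8x8', 'D4x4'])
```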

When the image generator 28 and the image discriminator 30 reach a Nash equilibrium, training proceeds to the next stage.

The image generator 28 and the image discriminator 30 are trained by performing the above process for all stages (up to stage 8 in the present embodiment).

Once trained, the image generator 28 can produce, from a latent vector 24 indicating image features and structure information 26, a high-resolution generated image (512×512 pixels in the present embodiment) that has the features indicated by the latent vector 24 and the structure indicated by the structure information 26. Likewise, the trained image discriminator 30 can determine more accurately, based on a high-resolution image and the structure information 26, whether the high-resolution image is a generated image or a real image.

As described above, the image generator 28 of the present embodiment builds on PGGANs, which enable high-resolution image generation, and is additionally trained with the structure information 26 taken into account; it can therefore generate images that are both high-resolution and structurally complex. Similarly, the image discriminator 30 of the present embodiment builds on PGGANs, which enable high-resolution image discrimination, and is additionally trained with the structure information 26 taken into account; it can therefore discriminate images that are both high-resolution and structurally complex.

In the present embodiment, the image generator 28 and the image discriminator 30 are trained in parallel using GANs, but they may instead be trained separately. Even then, each is trained over a plurality of stages corresponding to different resolutions, with the structure information 26 taken into account.

When the image generator 28 is trained separately, at each stage it generates a generated image 50 from the latent vector 24 and the structure information 26 at the resolution of that stage, and learns so that the generated image 50 approaches the training image 22, based on the training image 22 at that stage's resolution, the generated image 50, and the structure information 26 at that stage's resolution.

When the image discriminator 30 is trained separately, at each stage it learns to identify which of the training image 22 and the generated image 50 is the generated image, based on the training image 22, the generated image 50, and the structure information 26, each at the resolution of that stage.

<Image generation by the trained image generator and image discriminator>
An image generator 28 that has been sufficiently trained by the process above can generate, from a latent vector 24 and structure information 26, a generated image that has the features indicated by the latent vector 24 and the structure indicated by the structure information 26.

The trained image generator 28 has many possible uses; one of them is automatically generating characters for anime, games, and the like. In particular, it makes it easy to automatically generate high-resolution characters in a variety of poses. The anime industry has recently faced a shortage of staff, and making anime production more efficient has become urgent, so there is demand for efficiently generating high-resolution anime characters in various poses.

FIG. 4 shows automatically generated high-resolution characters 50' in various poses. First, a latent vector 24 indicating the features of the character to be generated is produced from a reference image of the character, for example by using the encoder described above. In FIG. 4, for convenience, the latent vector 24 is depicted as the actual reference image of the character. Next, structure information 26 indicating the character's pose is prepared. The structure information 26 may be created by the user or generated automatically from various images using a technique such as OpenPose.
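Claims 2 and 3 describe the structure information as the positions of feature points, carried as a structure image. As an assumed, simplified encoding, keypoints such as those a pose estimator might output can be rasterized into a binary single-channel image at any stage resolution; practical systems often use one Gaussian heatmap per keypoint instead. `structure_image` is a hypothetical helper.

```python
from typing import List, Tuple

def structure_image(keypoints: List[Tuple[float, float]],
                    source_size: int, target_size: int) -> List[List[int]]:
    """Rasterize (x, y) keypoints from a source_size x source_size frame
    into a binary target_size x target_size structure image."""
    img = [[0] * target_size for _ in range(target_size)]
    scale = target_size / source_size
    for x, y in keypoints:
        if not (0 <= x < source_size and 0 <= y < source_size):
            continue  # skip points that fall outside the frame
        col = min(int(x * scale), target_size - 1)
        row = min(int(y * scale), target_size - 1)
        img[row][col] = 1
    return img
```

Calling this once per stage resolution would yield a structure image for each stage directly; this is one design choice among several and an alternative to downsampling a full-resolution structure image.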

By inputting the latent vector 24 and the structure information 26 into the trained image generator 28, a character 50' that has the features indicated by the latent vector 24 and takes the pose indicated by the structure information 26 can be generated automatically. By inputting the same latent vector 24 combined with different pieces of structure information 26, images of the same character in various poses can easily be generated automatically, as shown in FIG. 4.

A character (latent vector 24) can also be generated automatically by sampling a latent vector 24 from a latent space that is built from the training images 22 and represents image features. For example, in the latent space, define a line segment between the coordinates (coordinate a) corresponding to the latent vector 24a, which indicates the features of the leftmost character in FIG. 5, and the coordinates (coordinate b) corresponding to the latent vector 24b, which indicates the features of the rightmost character in FIG. 5. The latent vectors 24c, 24d, 24e, and 24f corresponding to points on that segment (for example, coordinates c, d, e, and f) then represent characters whose features lie between those of the characters indicated by the latent vectors 24a and 24b, as shown in FIG. 5. In the example of FIG. 5, the character indicated by the latent vector 24a wears dark clothes and the character indicated by the latent vector 24b wears light clothes, so the clothes become gradually lighter through the latent vectors 24c, 24d, 24e, and 24f.
Furthermore, when the structure information 26 includes attribute information and the latent space is built with the structure information 26 taken into account, an image having specified features can be generated automatically by specifying the attribute information. For example, given attribute information such as male, black hair, and blue eyes, a corresponding image is generated.
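The line-segment sampling of FIG. 5 amounts to linear interpolation between two latent vectors. A minimal sketch, assuming latent vectors are plain lists of floats; `interpolate` is a hypothetical helper, and requesting four interior points corresponds to coordinates c to f.

```python
from typing import List

Vector = List[float]

def interpolate(z_a: Vector, z_b: Vector, steps: int) -> List[Vector]:
    """Return `steps` evenly spaced points strictly between z_a and z_b
    on the line segment joining them (endpoints excluded)."""
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same dimension")
    points = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)  # fraction of the way from z_a to z_b
        points.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return points
```

Spherical interpolation is sometimes preferred for latents drawn from a Gaussian prior; the patent describes a line segment, which is what this sketch implements.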

In the present embodiment, the image generator 28 and the image discriminator 30 are used in combination, but the image generator 28 can also be used on its own.

The trained image discriminator 30 can likewise be used on its own. For example, when an input image and structure information 26 are input into the trained image discriminator 30, it can identify whether the input image is a generated image produced by the image generator 28 or a real image. At the same time, it can identify whether the input image has the structure indicated by the structure information 26. The image discriminator 30 can thus make determinations such as "the input image is a generated image but does not have the structure indicated by the structure information 26."

<Flow of training process>
The flow of the training process for the image generator 28 and the image discriminator 30 is described below with reference to the flowchart in FIG. 6.

In step S10, the control unit 12 sets the variable n to 1. The variable n indicates the training stage (see FIG. 3) of the image generator 28 and the image discriminator 30; n = 1 means that stage 1 training is performed.

In step S12, the control unit 12 inputs the training image 22 and the structure information 26 (structure image) into the downsampling layers 42 and converts them to the resolution corresponding to stage n.

In step S14, the image discriminator 30 updates (that is, trains) the parameters of the convolutional layer 44 corresponding to stage n, based on the training image 22 and structure information 26 converted in step S12 and on a generated image at the resolution corresponding to stage n.

In step S16, the control unit 12 inputs the training image 22 and structure information 26 converted in step S12, together with the latent vector 24, into the image generator 28.

In step S18, the image generator 28 generates a generated image 50 at the resolution corresponding to stage n, based on the latent vector 24 and the structure information 26.

In step S20, the image discriminator 30 feeds its identification result back to the image generator 28.

In step S22, the image generator 28 updates (that is, trains) the parameters of the convolutional layer 40 corresponding to stage n, based on the training image 22 and structure information 26 converted in step S12, the generated image 50 produced in step S18, and the identification result of the image discriminator 30 fed back in step S20.

In step S24, the control unit 12 determines whether the image generator 28 and the image discriminator 30 have reached a Nash equilibrium. If not, the process returns to step S14 and the training for the current stage is repeated. If so, the process proceeds to step S26.

In step S26, the control unit 12 determines whether training has been completed for all predetermined stages. If not, the process proceeds to step S28, where the variable n is incremented by 1, and then returns to step S12 to perform training for the next stage. When training has been completed for all stages, the training process ends.
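Steps S10 to S28 can be sketched as nested loops: the inner loop repeats S14 to S22 until the equilibrium test of S24 passes, and the outer loop advances the stage. The `StageTrainer` interface, its method names, and `ToyTrainer` are hypothetical stand-ins for the real networks, which are not specified here.

```python
from typing import Any, Dict, Protocol

class StageTrainer(Protocol):
    """Hypothetical interface; method names are illustrative only."""
    def downsample(self, n: int) -> Any: ...                          # S12
    def update_discriminator(self, n: int, inputs: Any) -> None: ...  # S14
    def generate(self, n: int, inputs: Any) -> Any: ...               # S16-S18
    def feedback(self, n: int, generated: Any) -> Any: ...            # S20
    def update_generator(self, n: int, inputs: Any,
                         generated: Any, result: Any) -> None: ...    # S22
    def at_equilibrium(self, n: int) -> bool: ...                     # S24

def train(trainer: StageTrainer, num_stages: int = 8) -> None:
    n = 1                                                # S10
    while True:
        inputs = trainer.downsample(n)                   # S12: stage-n resolution
        while True:
            # S14 needs a stage-n generated image; the trainer is assumed to
            # produce one from the generator's current parameters.
            trainer.update_discriminator(n, inputs)
            generated = trainer.generate(n, inputs)      # S16, S18
            result = trainer.feedback(n, generated)      # S20
            trainer.update_generator(n, inputs, generated, result)  # S22
            if trainer.at_equilibrium(n):                # S24
                break
        if n == num_stages:                              # S26: all stages done
            return
        n += 1                                           # S28

class ToyTrainer:
    """Toy stand-in that 'reaches equilibrium' after a fixed number of rounds."""
    def __init__(self, rounds_per_stage: int = 2) -> None:
        self.rounds_per_stage = rounds_per_stage
        self.rounds: Dict[int, int] = {}  # stage -> completed S14-S22 rounds

    def downsample(self, n: int) -> Any:
        return f"inputs@{4 * 2 ** (n - 1)}"

    def update_discriminator(self, n: int, inputs: Any) -> None:
        pass

    def generate(self, n: int, inputs: Any) -> Any:
        return f"generated@{4 * 2 ** (n - 1)}"

    def feedback(self, n: int, generated: Any) -> Any:
        return 0.5

    def update_generator(self, n: int, inputs: Any, generated: Any, result: Any) -> None:
        self.rounds[n] = self.rounds.get(n, 0) + 1

    def at_equilibrium(self, n: int) -> bool:
        return self.rounds[n] >= self.rounds_per_stage

toy = ToyTrainer()
train(toy)
print(toy.rounds)  # {1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2}
```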

Although an embodiment of the present invention has been described above, the present invention is not limited to this embodiment, and various modifications are possible without departing from the spirit of the invention.

Reference Signs List: 10 image generation device, 12 control unit, 14 display unit, 16 input unit, 18 communication unit, 20 storage unit, 22 training image, 24 latent vector, 26 structure information, 28 image generator, 30 image discriminator, 40, 44 convolutional layers, 42 downsampling layers, 50 generated image.

Claims

An image generation device comprising:
an image generator that learns so that a generated image generated from a latent vector exhibits the features of a training image; and
an image discriminator that learns, based on the training image and the generated image generated by the image generator, to identify which of the training image and the generated image is the image generated by the image generator,
wherein the image generator further learns based on the identification result of the image discriminator,
the image generator and the image discriminator learn stepwise over a plurality of stages corresponding to different resolutions, from the stage corresponding to a low resolution to the stage corresponding to a high resolution, and
at each stage,
the image generator further learns to generate, from the latent vector, the generated image at the resolution determined for that stage, taking into account structure information indicating the structure of the training image, and
the image discriminator further learns to identify, taking the structure information into account, which of the training image converted to the resolution determined for that stage and the generated image generated by the image generator at that stage is the image generated by the image generator.

The image generation device according to claim 1, wherein
the structure information is information indicating the positions of a plurality of feature points of an object in the training image.

The image generation device according to claim 2, wherein
the structure information is a structure image having coordinate information of the pixels corresponding to the feature points, and
at each stage, the image generator and the image discriminator learn taking into account the structure image converted to the resolution determined for that stage.

The image generation device according to claim 1, wherein
the latent vector is generated based on a predetermined probability distribution.

The image generation device according to claim 1, wherein
the latent vector is generated based on the training image.

An image generator that learns over a plurality of stages corresponding to different resolutions, stepwise from the stage corresponding to a low resolution to the stage corresponding to a high resolution, learning at each stage so that a generated image at the resolution determined for that stage, generated from a latent vector, exhibits the features of a training image at the resolution determined for that stage,
wherein, at each stage, the image generator further learns to generate the generated image taking into account structure information indicating the structure of the training image.

An image discriminator that, based on a training image and a generated image generated by an image generator from a latent vector, learns over a plurality of stages corresponding to different resolutions, stepwise from the stage corresponding to a low resolution to the stage corresponding to a high resolution, learning at each stage to identify which of the training image and the generated image, each converted to the resolution determined for that stage, is the image generated by the image generator,
wherein, at each stage, the image discriminator learns to make this identification taking into account structure information indicating the structure of the training image.

An image generation program causing a computer to function as:
an image generator that learns so that a generated image generated from a latent vector exhibits the features of a training image; and
an image discriminator that learns, based on the training image and the generated image generated by the image generator, to identify which of the training image and the generated image is the image generated by the image generator,
wherein the image generator further learns based on the identification result of the image discriminator,
the image generator and the image discriminator learn stepwise over a plurality of stages corresponding to different resolutions, from the stage corresponding to a low resolution to the stage corresponding to a high resolution, and
at each stage,
the image generator further learns to generate, from the latent vector, the generated image at the resolution determined for that stage, taking into account structure information indicating the structure of the training image, and
the image discriminator further learns to identify, taking the structure information into account, which of the training image converted to the resolution determined for that stage and the generated image generated by the image generator at that stage is the image generated by the image generator.

An image generation method using:
an image generator that learns so that a generated image generated from a latent vector exhibits the features of a training image; and
an image discriminator that learns, based on the training image and the generated image generated by the image generator, to identify which of the training image and the generated image is the image generated by the image generator,
wherein the image generator further learns based on the identification result of the image discriminator,
the image generator and the image discriminator learn stepwise over a plurality of stages corresponding to different resolutions, from the stage corresponding to a low resolution to the stage corresponding to a high resolution, and
at each stage,
the image generator further learns to generate, from the latent vector, the generated image at the resolution determined for that stage, taking into account structure information indicating the structure of the training image, and
the image discriminator further learns to identify, taking the structure information into account, which of the training image converted to the resolution determined for that stage and the generated image generated by the image generator at that stage is the image generated by the image generator.