JP2019045984A

JP2019045984A - Data composing apparatus and method

Info

Publication number: JP2019045984A
Application number: JP2017166062A
Authority: JP
Inventors: Martin Klinkigt; Bin Tong; Tomokazu Murakami
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2017-08-30
Filing date: 2017-08-30
Publication date: 2019-03-22
Anticipated expiration: 2037-08-30
Also published as: JP6962747B2; WO2019044064A1

Abstract

PROBLEM TO BE SOLVED: To provide a technique that enables verbal data composition. SOLUTION: A data composing apparatus includes: a database in which element objects, which are compositable data relating to objects describable in a natural language, are stored in advance; an extraction unit that extracts at least one of a concept and a context from a sentence written in the natural language; a conversion unit that converts the concept or the context into a feature vector represented as a vector in a prescribed feature space; and a composing unit that holds in advance a neural network model for composing element objects in accordance with an input feature vector, selects an element object from the database on the basis of the neural network model and the feature vector, and generates composite data by using the element object. SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique that enables verbal data synthesis.

Associating images with natural-language words enables various applications of images. For example, once images and words are associated, images can be searched by words. One technique for associating images with words is image annotation. Image annotation extracts features from an image region of a target image, selects, from images whose features have been learned in advance and to which metadata has been attached, the image whose features are closest to those of the target image, and attaches that image's metadata to the target image. Patent Document 1 discloses a technique that extracts a plurality of features from training images, classifies the features using binary classifiers, creates a learning model for associating identification information with features for each type of identification information and feature, approximates the formula for the conditional probability of the identification information with a sigmoid function, and optimizes the learning model for each piece of identification information by optimizing the parameters of the sigmoid function so that the conditional probability of the identification information is maximized. This makes it possible to attach highly reliable identification information to images.

As an image synthesis technique, there is also a technique that composes an image by placing words such as "mountain" and "sea" on a rectangular canvas.

Patent Document 1: JP 2012-038244 A

In recent years, new ways of using images have been sought. For example, there is a demand for creating a desired image that is expressed in words. However, the conventional image annotation technique described above selects a desired image from a database and cannot create a new image. The technique of composing an image by placing words can express simple positional relationships, but it cannot express temporal and spatial relationships or states of motion such as "running to get on a train", and therefore cannot compose such images. In addition, conventional image synthesis techniques build a model that generates images from text by training on a large number of images and their descriptions, but it is difficult for them to generate an appropriate image for an unknown event that is not included in the training data.
Moreover, if various other kinds of data, such as speech and sensor data, could be associated with natural-language words in the same way as images, the uses of such data would likely expand considerably.

An object of the present invention is to provide a technique that enables verbal data synthesis.

A data synthesizing apparatus according to one aspect of the present invention includes: a database that stores in advance element objects, which are synthesizable data relating to objects describable in natural language; an extraction unit that extracts at least one of a concept and a context from a natural-language sentence; a conversion unit that converts the concept or the context into a feature vector represented as a vector in a predetermined feature space; and a synthesis unit that holds in advance a neural network model for synthesizing element objects according to an input feature vector, selects an element object from the database based on the neural network model and the feature vector, and generates synthesized data using the element object.

According to the present invention, a neural network model that synthesizes element objects according to feature vectors is held in advance, at least one of a concept and a context is extracted from a natural-language sentence and converted into a feature vector, and element objects are synthesized based on the neural network model to generate synthesized data. The synthesized data that the user desires, as expressed in a natural-language sentence, can therefore be generated from the element objects stored in the database. In addition, an image of an unknown scene that was not used for training can be generated.

FIG. 1 is a block diagram showing the physical configuration of an image combining system according to a first embodiment.
FIG. 2 is a block diagram of an image combining device according to the first embodiment.
FIG. 3 is a flowchart of an image combining process according to the first embodiment.
FIG. 4 is a sequence diagram of the image combining process according to the first embodiment.
FIG. 5 is a sequence diagram showing an operation example of the image combining device according to the first embodiment.
FIG. 6 is a block diagram of an image combining device according to a second embodiment.
FIG. 7 is a flowchart of a machine learning process according to the second embodiment.
FIG. 8 is a block diagram showing the physical configuration of an image combining system according to a third embodiment.
FIG. 9 is a block diagram of a terminal according to the third embodiment.
FIG. 10 is a diagram showing an example of data stored in a database according to a fourth embodiment.
FIG. 11 is a block diagram of an image combining device according to a fifth embodiment.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a block diagram showing the physical configuration of the image combining system according to the first embodiment. Referring to FIG. 1, the image combining system includes a terminal 100, an image combining device 200, and a database 300. The terminal 100 and the image combining device 200 are connected by a communication network 101, and the image combining device 200 is connected to the database 300 by a communication network 201. The communication network 101 is, for example, the Internet. The communication network 201 is, for example, a LAN (Local Area Network).

The terminal 100 is an information terminal used directly by the user 10, such as a personal computer, a tablet terminal, or a smartphone. In response to an instruction from the user 10, the terminal 100 sends to the image combining device 200 a sentence that expresses, in natural language, the composite image the user desires, and requests composition of an image that matches the sentence. The terminal 100 also receives the composite image data created by the image combining device 200, records the data in an internal storage device (not shown), and displays the composite image on its screen.

The database 300 stores the data of various images used for image composition so that the image combining device 200 can retrieve them. The database 300 holds the data of images (object images), each of which depicts one of various objects. Each object image carries information (a label) indicating the object it depicts. For example, the name of the depicted object is attached to the object image as a label.
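
As a rough illustration of the labelled storage described above, the following Python sketch models the database 300 as an in-memory store of object images keyed by label. The class names `ObjectImage` and `ObjectImageDatabase`, the pixel representation, and the case-insensitive lookup are assumptions made for illustration; the source does not specify a storage format or query interface.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ObjectImage:
    """One element object: an image plus the label naming what it depicts."""
    label: str                 # e.g. the name of the depicted object
    pixels: list[list[int]]    # hypothetical grayscale raster, row-major


@dataclass
class ObjectImageDatabase:
    """Hypothetical in-memory stand-in for the database 300."""
    _by_label: dict[str, list[ObjectImage]] = field(default_factory=dict)

    def add(self, image: ObjectImage) -> None:
        # Labels are matched case-insensitively in this sketch (an assumption).
        self._by_label.setdefault(image.label.lower(), []).append(image)

    def find(self, label: str) -> list[ObjectImage]:
        # Return every stored object image whose label matches the request.
        return list(self._by_label.get(label.lower(), []))


db = ObjectImageDatabase()
db.add(ObjectImage("man", [[1, 1], [1, 1]]))
db.add(ObjectImage("train", [[2, 2, 2], [2, 2, 2]]))
print([img.label for img in db.find("Man")])
```

A lookup for a label with no stored image returns an empty list in this sketch; how the actual device handles a missing object image is not stated in the source.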

The image combining device 200 is a computer that, in response to a request from the terminal 100 operated by the user 10, uses the data stored in the database 300 to compose the image (composite image) that the user 10 desires.

FIG. 2 is a block diagram of the image combining device according to the first embodiment. Referring to FIG. 2, the image combining device 200 includes a sentence processing unit 210, a concept extraction unit 220, a context extraction unit 230, an embedding conversion unit 240, an image combining unit 250, and an image output unit 260.

The sentence processing unit 210 analyzes the sentence received from the terminal 100, converts it into entities of the smallest unit whose meaning can be interpreted (hereinafter, "minimum entities"), and determines the part of speech of each entity. This puts the sentence into a form that the concept extraction unit 220 and the context extraction unit 230 can process.

The concept extraction unit 220 extracts concepts from the sentence analyzed by the sentence processing unit 210. Specifically, for example, the concept extraction unit 220 may take as input the entities obtained by the analysis in the sentence processing unit 210 and extract those whose part of speech is a noun as concepts.

A concept describes, for example, the relationship between a noun and its attributes or features. For example, related terms such as "animal" and "fur" are described for "dog".

The context extraction unit 230 extracts contexts from the sentence analyzed by the sentence processing unit 210. Specifically, for example, the context extraction unit 230 may extract as contexts: (1) a concept extracted by the concept extraction unit 220, used as-is; (2) a concept connected with a minimum entity whose part of speech is not a noun, such as a verb, among the entities obtained by the analysis in the sentence processing unit 210; and (3) two concepts connected by a verb.

A context describes the relationship between terms as expressed in the sentence. For example, the relationship between "dog" and "run" is described.
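
The extraction rules above can be sketched in Python as follows. This is a minimal illustration, not the units' actual implementation: the input format (a list of `(minimum entity, part of speech)` pairs), the tag names `NOUN` and `VERB`, and the function names are all assumptions. The sketch pairs every concept with every non-noun entity, so it over-generates compared with an extractor that uses the sentence's syntactic structure.

```python
from itertools import combinations

# Hypothetical output of the sentence processing unit:
# (minimum entity, part of speech) pairs in sentence order.
tagged = [("dog", "NOUN"), ("chases", "VERB"), ("ball", "NOUN")]


def extract_concepts(entities):
    """Concepts are the minimum entities tagged as nouns."""
    return [word for word, pos in entities if pos == "NOUN"]


def extract_contexts(entities):
    """Build contexts as tuples of minimum entities, following the three patterns."""
    concepts = extract_concepts(entities)
    others = [word for word, pos in entities if pos != "NOUN"]
    verbs = [word for word, pos in entities if pos == "VERB"]

    contexts = [(c,) for c in concepts]                      # (1) a concept used as-is
    contexts += [(c, o) for c in concepts for o in others]   # (2) concept + non-noun entity
    contexts += [(a, v, b)                                   # (3) concept, verb, concept
                 for a, b in combinations(concepts, 2)       #     pairs kept in sentence order
                 for v in verbs]
    return contexts


print(extract_concepts(tagged))
print(extract_contexts(tagged))
```

For the toy input, the concepts are "dog" and "ball", and the contexts include the single concepts, each concept paired with "chases", and the triple ("dog", "chases", "ball").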

The embedding conversion unit 240 converts the concepts and contexts into feature vectors, which are vectors in a feature space. A feature vector expresses the meaning of a concept or context, which makes it possible to process the meaning of concepts and contexts as feature vectors. In this embodiment, the feature vectors of the concepts and contexts are the inputs to the neural network model.

The image combining unit 250 holds in advance a neural network (NN) model obtained by machine learning. It acquires from the database 300 the object images that correspond to the concept feature vectors generated by the embedding conversion unit 240, and generates a composite image by arranging the object images based on the concept and context feature vectors and the NN model. For example, the NN model takes as input the feature vector of a context that includes concepts, and outputs a composite image or a method of generating it. In the NN model, a high-dimensional plane is generated at each layer from the input feature vector, and at the final plane a composite image that interprets the concept in its context is output. The method of generating the composite image includes, for example, which object images to use and in what positional relationship to arrange them. For example, in accordance with the output of the NN model, the image combining unit 250 may acquire from the database 300 the object images corresponding to the concepts extracted by the concept extraction unit 220, and generate the composite image by arranging those object images based on the contexts extracted by the context extraction unit 230.

This method is characterized in that the NN model is configured so that the features of both the concept and the context are linked to the output image. As a result, images can be composed even for subjects whose relationships have not been learned in advance, which an NN model trained on concepts alone cannot compose. For example, an image with no real-world example, such as "a person swimming in the clouds", can be composed.

The image output unit 260 transmits the data of the composite image generated by the image combining unit 250 to the terminal 100.

FIG. 3 is a flowchart of the image combining process according to the first embodiment. Referring to FIG. 3, first, in step S210, the sentence processing unit 210 analyzes the sentence received from the terminal 100. Next, in step S220, the concept extraction unit 220 extracts concepts from the sentence analyzed by the sentence processing unit 210. Then, in step S230, the context extraction unit 230 extracts contexts from the sentence analyzed by the sentence processing unit 210.

Depending on the complexity of the sentence, an image can be composed by extracting only the concepts. Although the example here extracts contexts after extracting concepts, the order is not limited to this. As another example, concepts and contexts may be extracted at the same time.

Next, in step S240, the embedding conversion unit 240 converts the concepts and contexts into feature vectors, which are vectors in the feature space. For example, Word2Vec can map individual concepts and contexts to features expressed as high-dimensional vectors. The meaning of a concept or context is expressed by its feature vector, which makes it possible to perform computations on the meaning of concepts and contexts.
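
The interface of this conversion step can be sketched in Python as below. Note that this is not Word2Vec: to stay self-contained, the sketch derives a deterministic toy vector from a hash of each word, so the vectors carry no semantic meaning. The dimensionality `DIM`, the function names, and the choice to represent a context as the mean of its entities' vectors are all assumptions for illustration.

```python
from __future__ import annotations

import hashlib

DIM = 8  # hypothetical dimensionality; trained word embeddings are typically much larger


def embed_entity(word: str) -> list[float]:
    """Deterministic toy vector for one minimum entity (stand-in for a trained embedding)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    # Map the first DIM bytes (0..255) to floats in [-1, 1].
    return [b / 127.5 - 1.0 for b in digest[:DIM]]


def embed_context(entities: tuple[str, ...]) -> list[float]:
    """Represent a context as the element-wise mean of its entities' vectors (an assumption)."""
    vectors = [embed_entity(e) for e in entities]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


concept_vec = embed_entity("man")
context_vec = embed_context(("man", "running", "train"))
print(len(concept_vec), len(context_vec))
```

In a real system, `embed_entity` would be replaced by a lookup into a trained embedding model so that related words map to nearby vectors.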

Then, in step S250, the image combining unit 250 acquires from the database 300 the object images that correspond to the concept feature vectors and, using the NN model, generates a composite image by arranging the object images based on the concept and context feature vectors. In step S260, the image output unit 260 transmits the data of the composite image generated by the image combining unit 250 to the terminal 100.

FIG. 4 is a sequence diagram of the image combining process according to the first embodiment. Referring to FIG. 4, in step S501, a sentence is transmitted from the terminal 100 to the image combining device 200. In step S502, the image combining device 200 extracts concepts. In step S503, the image combining device 200 requests from the database 300 images that match the extracted concepts. In step S504, the requested images matching the concepts are returned from the database 300 to the image combining device 200. In step S505, the image combining device 200 extracts contexts. In step S506, a composite image is generated by combining the images that match the concepts in accordance with the contexts. In step S507, the composite image is returned from the image combining device 200 to the terminal 100.
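
The device-side part of this sequence (steps S502 to S506) can be sketched in Python as a single function that takes its collaborators as parameters. The function name `handle_request` and the callables passed in are hypothetical; the lambdas in the usage part are toy stand-ins that exist only to make the sketch runnable and do not reflect how the actual units work.

```python
def handle_request(sentence, extract_concepts, fetch_images, extract_contexts, compose):
    """Device-side handling of one request, mirroring steps S502 to S506."""
    concepts = extract_concepts(sentence)             # S502: extract concepts
    images = {c: fetch_images(c) for c in concepts}   # S503/S504: request and receive images
    contexts = extract_contexts(sentence)             # S505: extract contexts
    return compose(images, contexts)                  # S506: compose according to the contexts


# Toy stand-ins (assumptions): a dict plays the database, strings play the images.
fake_db = {"man": "<man.png>", "train": "<train.png>"}
result = handle_request(
    "Man running to train",                           # S501: sentence sent by the terminal
    extract_concepts=lambda s: [w.lower() for w in s.split() if w.lower() in fake_db],
    fetch_images=lambda c: fake_db[c],
    extract_contexts=lambda s: [tuple(s.lower().split())],
    compose=lambda images, contexts: {"layers": list(images.values()), "contexts": contexts},
)
print(result["layers"])                               # S507 would return this to the terminal
```

Passing the steps in as callables keeps the order of the sequence visible while leaving each unit's behavior open.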

FIG. 5 is a sequence diagram showing an operation example of the image combining device according to the first embodiment.

In step S610, the terminal 100 transmits the sentence "Man running to train at station.", entered by the user 10, to the image combining device 200.

The image combining device 200 splits the sentence into minimum entities. It then extracts concepts from the sentence converted into minimum entities. The minimum entities are, for example, "Man:A", "Running:P", "Train:B", and "at station:C". The image combining device 200 further extracts from the sentence the contexts in which those concepts appear. In the example of FIG. 5, the following contexts can be extracted. These concepts and contexts are the inputs to the NN model. Here, the letters A, B, C, and P are identification codes assigned to concepts or contexts.
Man:A
Train:B
At Station:C
Man:A running:P
Man:A running:P Train:B
Train:B at station:C
Man:A running:P at Station:C

The image combining device 200 acquires the concepts in step S602 and, in step S603, requests from the database 300 the images corresponding to the concepts. In this example, images of "man" and "train" are requested. In step S604, the images corresponding to the concepts are returned from the database 300 to the image combining device 200. In this example, an image of a man and an image of a train stopped at a station are returned. The image combining device 200 acquires the contexts in step S605 and composes the image in step S606. Here, a two-dimensional composite image is generated by superimposing the image of the man on the image of the train. In step S607, the image combining device 200 transmits the data of the created composite image to the terminal 100. An image can be composed from both concepts and contexts, from contexts only, or from concepts only.

Although this embodiment shows an example of composing images, if various other kinds of data, such as audio data and sensor data, can be associated with natural-language words in the same way as images, the uses of such data would likely expand considerably. For such data as well, synthesized data can be generated using element objects, which are synthesizable data.

As described above, the image combining device 200 according to this embodiment includes: a database 300 that stores in advance element objects (object images), which are synthesizable data (image data) relating to objects describable in natural language; an extraction unit (concept extraction unit 220 and context extraction unit 230) that extracts at least one of a concept and a context from a natural-language sentence; a conversion unit (embedding conversion unit 240) that converts the concept or the context into a feature vector represented as a vector in a predetermined feature space; and a synthesis unit (image combining unit 250) that holds in advance a neural network model for synthesizing element objects according to an input feature vector, selects an element object from the database based on the neural network model and the feature vector, and generates synthesized data (a composite image) using that element object. Because the device holds in advance a neural network model that synthesizes element objects according to feature vectors, extracts at least one of a concept and a context from a natural-language sentence, converts them into feature vectors, and synthesizes element objects based on the neural network model to generate synthesized data, the synthesized data that the user desires, as expressed in a natural-language sentence, can be generated from the element objects stored in the database. In the case of images, object images are held in the database in advance as element objects, at least one of a concept and a context is extracted from a natural-language sentence and converted into a feature vector, and object images are combined based on the neural network model to generate a composite image, so the composite image that the user desires, as expressed in a natural-language sentence, can be generated.

The extraction unit treats as concepts those minimum entities that are nouns, where a minimum entity is the smallest unit of a sentence whose meaning can be interpreted, and generates contexts by connecting concepts with other minimum entities. Because nouns are treated as concepts and contexts are generated by connecting concepts with other minimum entities, the things expressed in the sentence can be interpreted and reflected in the synthesized data.

The image combining unit generates the composite image by placing a plurality of object images so that they overlap or sit side by side. This allows a composite image to be generated with a relatively small amount of processing.
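
Placing object images side by side or overlapping can be illustrated with a small pure-Python sketch. Images are represented here as nested lists of integers, and the helper names `blank_canvas` and `paste` are assumptions for illustration. The placement coordinates are fixed by hand; in the device they would follow from the output of the NN model.

```python
def blank_canvas(width, height, fill=0):
    """Create a height x width canvas filled with a background value."""
    return [[fill] * width for _ in range(height)]


def paste(canvas, image, left, top):
    """Copy image pixels onto the canvas at (left, top); later pastes overwrite earlier ones."""
    for dy, row in enumerate(image):
        for dx, value in enumerate(row):
            y, x = top + dy, left + dx
            if 0 <= y < len(canvas) and 0 <= x < len(canvas[0]):  # clip at the canvas edge
                canvas[y][x] = value
    return canvas


train = [[2, 2, 2, 2], [2, 2, 2, 2]]   # toy "train" object image
man = [[1], [1]]                       # toy "man" object image

# Side by side: the man occupies column 0, the train columns 1 to 4.
side_by_side = paste(paste(blank_canvas(5, 2), man, 0, 0), train, 1, 0)
# Overlapping: the man is drawn on top of the train at column 1.
overlapped = paste(paste(blank_canvas(4, 2), train, 0, 0), man, 1, 0)

print(side_by_side)
print(overlapped)
```

Because each paste is a plain copy, the cost grows only with the number of pixels placed, which is in line with the relatively small processing amount noted above.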

In this embodiment, any image synthesis method may be used. For example, the neural network model may be one based on Generative Adversarial Networks, in which two mutually adversarial models, a generator and a discriminator, are trained, and the generator produces the composite image. This makes it possible to generate a composite image relatively close to what the user desires. As a simpler method, the composite image may be generated as a collage in which a plurality of images are combined and arranged.

The first embodiment showed an example in which the image combining system is provided in advance with a trained neural network model and composes images using that NN model. The second embodiment shows an example in which the image combining system trains the NN model and composes images using the trained NN model.

The image combining system according to the second embodiment has the same physical configuration as that of the first embodiment shown in FIG. 1.

FIG. 6 is a block diagram of the image combining device according to the second embodiment. The image combining device according to the second embodiment differs from that of the first embodiment shown in FIG. 2 in that it includes a concept selection unit 310, a context generation unit 320, an image determination unit 350, and an end condition determination unit 360.

The image combining process according to the second embodiment is the same as that of the first embodiment shown in FIGS. 3 and 4.

FIG. 7 is a flowchart of the machine learning process according to the second embodiment.

First, in step S310, the concept selection unit 310 selects a concept from among various known concepts, for example at random or by a predetermined selection method. Next, in step S320, the context generation unit 320 generates a natural context that includes that concept. For example, various natural-language sentences may be stored in advance, and a sentence that includes the selected concept may be extracted from them as the context.

Next, in step S330, the embedding conversion unit 240 converts the concept and context into feature vectors. The embedding conversion unit 240 may do so by the same method as in the image composition described in the first embodiment. The feature vectors generated by the embedding conversion unit 240 are provided to the image combining unit 250, as in the image composition described in the first embodiment.

Next, in step S340, the image combining unit 250 generates a composite image by inputting the concept and context feature vectors into the NN model. The image combining unit 250 composes the image by the same method as in the image composition described in the first embodiment.

Next, in step S350, the image determination unit 350 determines whether the created composite image correctly expresses the concept and context. If it does not, the process returns to step S340 and the image composition is redone. If it does, then in step S360 the end condition determination unit 360 determines whether a predetermined end condition is satisfied. If the end condition is satisfied, the machine learning process ends. If not, the process returns to step S310. The end condition may be, for example, that the machine learning process has looped a predetermined number of times, or that the machine learning process has run for a predetermined time.
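
The control flow of steps S310 to S360 can be sketched in Python as follows. The function `run_learning` and all of its parameters are hypothetical. Two points go beyond the source and are assumptions: the sketch caps the number of retries of step S340 so that it always terminates, and it does not show how the NN model's parameters are updated between attempts, since the description above does not specify this.

```python
import random


def run_learning(concepts, make_context, embed, synthesize, is_correct,
                 max_iterations=3, max_retries=5, seed=0):
    """Loop over S310 to S360, collecting the accepted (concept, context, image) triples."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(max_iterations):                    # S360: end after a fixed number of loops
        concept = rng.choice(concepts)                 # S310: select a concept (here at random)
        context = make_context(concept)                # S320: generate a context containing it
        features = embed(concept, context)             # S330: convert to feature vectors
        for attempt in range(max_retries):             # retry cap is an assumption
            image = synthesize(features, attempt)      # S340: generate a composite image
            if is_correct(image, concept, context):    # S350: judge the composite image
                accepted.append((concept, context, image))
                break
    return accepted


# Toy stand-ins so the sketch runs; the judge accepts the second attempt each time.
history = run_learning(
    concepts=["dog", "train"],
    make_context=lambda c: f"a {c} at the station",
    embed=lambda c, ctx: (c, ctx),
    synthesize=lambda feats, attempt: f"image#{attempt} of {feats[1]}",
    is_correct=lambda img, c, ctx: img.startswith("image#1"),
)
print(len(history))
```

A time-based end condition, which the text also allows, could replace the iteration count by comparing elapsed time against a limit at the top of the outer loop.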

As with the first embodiment, this embodiment is also applicable to data other than image data, such as audio data and sensor data.

As described above, in this embodiment the image combining device 200 further includes a concept selection unit 310 that selects a concept used for training the neural network model, a context generation unit 320 that generates a context used for training the neural network model, and a determination unit (image determination unit 350) that evaluates the composite image generated by the neural network model. The conversion unit (embedding conversion unit 240) converts the concept selected by the concept selection unit 310 and the context generated by the context generation unit 320 into feature vectors. The synthesizing unit (image combining unit 250) selects element objects (object images) from the database 300 based on the neural network model and the feature vectors, and generates composite data using those element objects. The determination unit evaluates the composite data, and if the composite data does not obtain a predetermined evaluation, the processing of the synthesizing unit and the determination unit is repeated. Because the data synthesizing apparatus can itself perform machine learning of data synthesis, it can train the neural network model on its own and then use that model itself.

The first embodiment showed an example in which the image combining device 200 combines an image based on natural language text that the user 10 enters at the terminal 100, and transmits the image to the terminal 100. The third embodiment shows an example in which the terminal performs image combining on its own.

FIG. 8 is a block diagram showing the physical configuration of the image combining system according to the third embodiment. Referring to FIG. 8, the image combining system consists of the terminal 110 alone. The terminal 110 is an information device such as a personal computer, a tablet terminal, or a smartphone. FIG. 9 is a block diagram of the terminal according to the third embodiment. Referring to FIG. 9, the terminal 110 according to the third embodiment includes an input unit 410, a display unit 420, a database storage unit 430, a sentence processing unit 210, a concept extraction unit 220, a context extraction unit 230, an embedding conversion unit 240, and an image combining unit 250.

The sentence processing unit 210, the concept extraction unit 220, the context extraction unit 230, the embedding conversion unit 240, and the image combining unit 250 are the same as those of the image combining device 200 in the first embodiment shown in FIG. 2.

The input unit 410 is an input operation unit with which the user 10 enters a sentence about the image to be generated.

The display unit 420 displays the composite image generated by the image combining unit 250.

The database storage unit 430 is a storage unit that accumulates object image data and corresponds to the database 300 of FIG. 2.

In the first embodiment, a large number of object images are stored in the database 300, and the image combining unit 250 acquires specific object images from the database 300 and creates a composite image by combining them. An object image is an image of a predetermined object such as a person, a train, or a rabbit. However, the present invention is not limited to the configuration and processing of the first embodiment. It suffices that a composite image is generated by combining visual representation objects. Visual representation objects include not only images of objects but also patches that indicate features of an object's appearance. A patch makes it possible to modify the appearance features of an object in an image. Taking the rabbit as an example, such features are "white fur", "short legs", "long ears", and so on; combining a white-fur patch with a rabbit object image yields an image of a "white rabbit".

The fourth embodiment shows an example in which the database 300 stores, as visual representation objects, not only object images but also patches indicating detailed features of an object's appearance, and a desired object image is created using an object image and patches. The basic configuration of the image combining device 200 of the fourth embodiment is the same as that of the first embodiment shown in FIG. 2. In the fourth embodiment, the image combining unit 250 creates a desired object image using an object image and patches, and creates a composite image using the object image thus created.

FIG. 10 is a diagram showing an example of the data stored in the database according to the fourth embodiment. Referring to FIG. 10, the database 300 stores object images 510 and the patches 520 associated with them. Based on the concept and context extracted from the sentence entered by the user 10, the image combining unit 250 uses the NN model to modify object images with patches and combines the modified images to generate a composite image.
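As a rough illustration of what modifying an object image with a patch could look like at the pixel level, the sketch below blends a patch into an object image using a per-pixel mask. This document does not specify how the NN model applies a patch, so the alpha-blending approach, the `apply_patch` function, and the mask representation are all assumptions made for illustration only.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
Image = List[List[Pixel]]      # rows of pixels


def apply_patch(object_image: Image, patch: Image, mask: List[List[float]]) -> Image:
    """Blend `patch` into `object_image`.

    mask[r][c] is the patch weight in [0, 1]: 0 keeps the original pixel,
    1 replaces it with the patch pixel. All three inputs are assumed to
    have the same dimensions.
    """
    result: Image = []
    for img_row, patch_row, mask_row in zip(object_image, patch, mask):
        row: List[Pixel] = []
        for pixel, patch_pixel, alpha in zip(img_row, patch_row, mask_row):
            blended = tuple(
                round((1.0 - alpha) * p + alpha * q)
                for p, q in zip(pixel, patch_pixel)
            )
            row.append(blended)
        result.append(row)
    return result
```

For instance, a "white fur" patch would carry light pixel values, with the mask covering only the fur region of the rabbit object image.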

This embodiment is likewise applicable not only to image data but also to other data such as audio data and sensor data.

As described above, according to this embodiment, the element object (object image) is a visual representation object indicating a visual representation that is an element of an image, and the synthesizing unit (image combining unit 250) generates a composite image by combining visual representation objects based on the feature vectors. A user-desired image expressed in a natural language sentence can thus be synthesized by combining visual representation objects.

In addition, the visual representation objects include patches representing features of an object to be shown in the composite image, and the synthesizing unit modifies an object image with a patch and creates the composite image from the modified object image. Because the patch adapts the object image to the composite image the user wants, a composite image closer to the user's wishes can be generated.

The first embodiment was an example in which image combining is performed using an NN model obtained in advance by training. However, the present invention is not limited to this. As another example, the fifth embodiment performs zero-shot learning using information obtained through image combining with the NN model, and thereby updates the NN model. In the process of generating a composite image, information is obtained about what the object images used to generate the composite image represent (hereinafter "object image information"). In this embodiment, zero-shot learning is performed using the object image information obtained in the process of generating composite data, and the NN model is updated. Because the NN model is updated by zero-shot learning using information obtained during the generation of composite data, the performance of verbal data synthesis can be improved continuously.

The physical configuration of the image combining system according to the fifth embodiment is the same as that of the first embodiment shown in FIG. 1. The image combining device 200 of the fifth embodiment differs in part from that of the first embodiment. FIG. 11 is a block diagram of the image combining device according to the fifth embodiment.

Referring to FIG. 11, the image combining device 200 includes a sentence processing unit 210, a concept extraction unit 220, a context extraction unit 230, an embedding conversion unit 240, an image combining unit 250, an image output unit 260, and a model updating unit 610. The image combining device 200 of the fifth embodiment can perform image combining in the same way as that of the first embodiment, and the sentence processing unit 210, the concept extraction unit 220, the context extraction unit 230, the embedding conversion unit 240, the image combining unit 250, and the image output unit 260 are the same as those of the first embodiment shown in FIG. 2.

Each object image stored in the database 300 has information about the object shown in it attached as metadata (a label). The model updating unit 610 recognizes when the object image information obtained while the image combining unit 250 generates a composite image, that is, information about the object images used to generate that composite image, is information that has not been given to the object images stored in the database 300 (hereinafter "sample-less object image information"). Examples include information about things not stored in the database 300 and information not attached to any object image stored in the database 300. On recognizing sample-less object image information, the model updating unit 610 decides to perform zero-shot learning of the NN model using that information. The model updating unit 610 then updates the NN model by performing zero-shot learning using the object image information obtained in the process of generating the composite image. Thereafter, the image combining unit 250 performs image combining using the updated NN model. Because information for which no sample has been given is recognized and then used for zero-shot learning, such information can be learned efficiently.
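The recognition step described above amounts to checking whether a piece of object image information appears among the labels already attached to stored object images. A minimal sketch follows, assuming labels are plain strings and the database is represented as a mapping from an image identifier to its label set; both representations, and the `find_sampleless_info` function itself, are hypothetical and are not specified in this document.

```python
from typing import Dict, Iterable, Set


def find_sampleless_info(
    used_info: Iterable[str],
    database_labels: Dict[str, Set[str]],
) -> Set[str]:
    """Return the items in `used_info` that no stored object image is labelled with."""
    # Union of every label attached to any stored object image.
    known: Set[str] = set().union(*database_labels.values())
    return {info for info in used_info if info not in known}


# Hypothetical database contents and object image information.
labels = {
    "rabbit_01": {"rabbit", "white fur"},
    "train_07": {"train"},
}
missing = find_sampleless_info(["rabbit", "long ears", "train"], labels)
should_run_zero_shot_learning = bool(missing)
```

In this sketch a non-empty result is what would trigger the decision to perform zero-shot learning.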

The model updating unit 610 may perform zero-shot learning at any time; for example, it may do so at fixed intervals or when instructed by an administrator.

The fifth embodiment showed an example in which the image combining device 200 performs zero-shot learning using information obtained in the process of generating a composite image requested by the user, but the present invention is not limited to this. As another example, in the sixth embodiment a sentence to be used for zero-shot learning of the NN model (hereinafter "sample sentence") is given to the image combining device, which performs concept extraction, context extraction, and embedding conversion on the sample sentence and uses the object image information obtained in that process, that is, information about what the object images represent, for zero-shot learning. When processing a sample sentence given for zero-shot learning of the NN model, however, the image combining unit 250 does not perform image combining. Generating a composite image is a high-load process, but in this embodiment the object image information is obtained without carrying the processing through to actual generation of the composite image, and that information is used for zero-shot learning.

Performing concept extraction, context extraction, and embedding conversion yields information in which the concept and the context are expressed as feature vectors (hereinafter "feature information"). In the sixth embodiment, feature information obtained without going as far as generating a composite image is used for zero-shot learning. To that end, concept extraction, context extraction, and embedding conversion are executed in order to obtain the feature information used for zero-shot learning of the NN model.

The physical configuration of the image combining system according to the sixth embodiment is the same as that of the first embodiment (or the fifth embodiment) shown in FIG. 1. The configuration of the image combining device 200 of the sixth embodiment is the same as that of the image combining device according to the fifth embodiment shown in FIG. 11.

Sample sentences are given to the image combining device 200 for zero-shot learning. A sample sentence is a sentence about a hypothetical composite image and serves to train the NN model. As in the first embodiment, the concept extraction unit 220 extracts a concept from the sample sentence and the context extraction unit 230 extracts a context from it. The embedding conversion unit 240 then converts the concept and the context into feature vectors. This yields the object image information (feature information). When processing a sample sentence, however, the image combining unit 250 does not perform image combining. The model updating unit 610 updates the NN model by performing zero-shot learning using the object image information (feature information). Because the NN model can be updated by zero-shot learning without generating composite data, the amount of processing is small, training is fast, and the performance of verbal data synthesis can be improved quickly.

The fifth embodiment showed an example in which the NN model is updated using a general zero-shot learning method for neural networks. The seventh embodiment shows an example in which the NN model is updated by zero-shot learning through adversarial training using GAN (Generative Adversarial Networks). In the seventh embodiment, not all concept and context features are known, and zero-shot learning is performed. Given a limited subset of concept and context features, the image combining device can select the features of two concepts and synthesize a new feature lying between them. The resulting feature belongs to neither concept but is a mixture of both. One possible approach to computing such features and training the NN model is described below.

The physical configuration of the image combining system according to the seventh embodiment is the same as that of the first embodiment (or the fifth embodiment) shown in FIG. 1. The configuration of the image combining device according to the seventh embodiment is the same as that of the fifth embodiment shown in FIG. 11. The model updating unit 610 of this embodiment generates diverse training samples using a conditional generative model and updates the NN model by zero-shot learning through adversarial training using GAN (Generative Adversarial Networks). Because training samples are generated by the conditional generative model and the neural network model is updated by GAN-based zero-shot learning, data synthesis by the NN model can be improved continuously and the NN model can be made robust.

The zero-shot learning of this embodiment is described next, using the following notation. The training data D is expressed as in equation (1).

Here, the image x_n is an element of the image set, as shown in equation (2).

y_i is the label of the image x_i, and these labels are drawn from the label space of the known classes shown in equation (3).

The label space of the unknown classes is given by equation (4).

Each known class shown in equation (5) can be represented by a one-shot representation or by a word embedding (the feature vector of a word).

The goal of zero-shot learning in this embodiment is to find an optimal semantic embedding (a feature vector representing meaning) that maps an unknown image close to the mapping of its correct class label (the label indicating its class). In other words, the aim is to project images and class labels into a semantic space in which the embedding of an image is close to the embedding of its own class label and differs from the embeddings of the labels of other classes. The loss to be minimized can be expressed as in equation (6).

Here, d(x_i, y_i) is a measure of the similarity between the semantic embedding of an image and the semantic embedding of the label of the same class. The dot product, which is widely used in semantic matching, serves as the similarity measure.
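To make the role of the similarity measure concrete, the sketch below computes a dot product between an image embedding and candidate label embeddings and picks the label with the highest score. It does not reproduce the loss of equation (6), which is not shown in this text; the `dot` and `nearest_label` functions and the example vectors are hypothetical and serve only to illustrate dot-product matching in a shared semantic space.

```python
from typing import Dict, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product d(u, v) of two embeddings of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def nearest_label(
    image_embedding: Sequence[float],
    label_embeddings: Dict[str, Sequence[float]],
) -> str:
    """Return the label whose embedding has the largest dot product with the image."""
    return max(
        label_embeddings,
        key=lambda name: dot(image_embedding, label_embeddings[name]),
    )


# Hypothetical two-dimensional embeddings.
label_vectors = {"rabbit": [0.9, 0.1], "train": [0.1, 0.9]}
predicted = nearest_label([1.0, 0.0], label_vectors)
```

A well-trained embedding would give an image a high score against its own class label and low scores against the others, which is the property the loss is meant to encourage.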

Here, the conditional generative model is denoted Gen(y_i, z). The label y_i is the word embedding (the feature vector of a word) of the image x_i. The generative model Gen is an NN model that outputs, from the same distribution as the image x_i, the visual features used as a condition in image synthesis. Hereinafter, these visual features are denoted by the symbol shown in equation (7).

The interpolation of the word embeddings y_i can be expressed as in equation (8).

The interpolation of equation (8) is taken as the input of the generative model Gen, which yields an interpolation u_i in the visual feature space. Two loss functions (equations (9) and (10)) can be obtained for the generated visual features.
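Equation (8) is not reproduced in this text, so its exact form is unknown here. Purely as an illustration of the idea of a point lying between two word embeddings, the sketch below assumes a simple linear (convex) interpolation with a mixing weight `alpha`; the linear form, the `interpolate` function, and the parameter name are assumptions, not the document's formula.

```python
from typing import List, Sequence


def interpolate(y_i: Sequence[float], y_j: Sequence[float], alpha: float) -> List[float]:
    """Return alpha * y_i + (1 - alpha) * y_j (assumed linear interpolation)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(y_i) != len(y_j):
        raise ValueError("embeddings must have the same dimension")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(y_i, y_j)]


# Hypothetical embeddings of two known classes.
mixed = interpolate([1.0, 0.0], [0.0, 1.0], alpha=0.25)
```

Under this assumption, `alpha` near 1 stays close to the first class and `alpha` near 0 stays close to the second, so intermediate values give embeddings that belong to neither class alone.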

Equations (9) and (10) mean that the mapping of the visual feature interpolated as u_i in the visual feature space lies between the mappings of the word embeddings y_i and y_j. For the visual features generated by the NN model (equation (7)), the loss functions of equations (11) and (12) below can be used.

Minimizing L in equations (11) and (12) means trying to bring the mapping of the visual features generated by the NN model close to that of the image x_i. Equations (9) and (10) seek to determine the interpolation u_i so as not to break the relationships that distinguish x_i, y_i, and y_j. Using equations (11) and (12) makes it possible to train equation (6) with the visual features of more diverse samples.

The embodiments described above are examples given to explain the present invention and are not intended to limit its scope to them. Those skilled in the art can practice the present invention in various other forms without departing from its scope.

Reference Signs List: 10 ... user, 100 ... terminal, 101 ... communication network, 110 ... terminal, 200 ... image combining device, 201 ... communication network, 210 ... sentence processing unit, 220 ... concept extraction unit, 230 ... context extraction unit, 240 ... embedding conversion unit, 250 ... image combining unit, 260 ... image output unit, 300 ... database, 310 ... concept selection unit, 320 ... context generation unit, 350 ... image determination unit, 360 ... termination condition determination unit, 410 ... input unit, 420 ... display unit, 430 ... database storage unit, 510 ... object image, 520 ... patch, 610 ... model updating unit

Claims

A data synthesizing apparatus comprising:
a database in which element objects, which are synthesizable data related to objects that can be described in natural language, are stored in advance;
an extraction unit that extracts at least one of a concept or a context from a sentence written in natural language;
a conversion unit that converts the concept or the context into a feature vector represented as a vector in a predetermined feature space; and
a synthesizing unit that holds in advance a neural network model for combining element objects according to an input feature vector, selects an element object from the database based on the neural network model and the feature vector, and generates synthesized data using the element object.

The element object includes an object image which is data of an image representing a predetermined object,
The composite data includes a composite image composed from the object images,
The neural network model receives a feature vector of at least one of the concept and the context, acquires an object image from the database, and generates a composite image using the acquired object image.
The data synthesizing apparatus according to claim 1.

The data synthesizing apparatus according to claim 1, wherein the extraction unit extracts the concept relating to a minimal entity that is a noun, among minimal entities that are semantically interpretable units of the sentence, and generates the context by connecting the minimal entity that is a noun to another minimal entity.

The image combining unit generates the combined image by overlapping or arranging a plurality of object images.
The data synthesizer according to claim 2.

The neural network model is a neural network model based on generative adversarial networks, which learns using two mutually opposing models, a generator and a discriminator, and the composite image is generated by the generator.
The data synthesizer according to claim 2.

The data synthesizing apparatus according to claim 1, further comprising:
a concept selection unit that selects a concept to be used for training the neural network model;
a context generation unit that generates a context to be used for training the neural network model; and
a determination unit that evaluates a composite image generated by the neural network model, wherein
the conversion unit converts the concept selected by the concept selection unit and the context generated by the context generation unit into a feature vector,
the synthesizing unit selects an element object from the database based on the neural network model and the feature vector, and generates synthesized data using the element object,
the determination unit evaluates the synthesized data, and
if the synthesized data does not obtain a predetermined evaluation, the processing of the synthesizing unit and the determination unit is repeated.

The element object is a visual expression object indicating a visual expression that is an element of an image,
The combining unit generates a combined image by combining the visual representation objects based on the feature vector.
The data synthesizing apparatus according to claim 1.

The visual representation object includes a patch representing a feature of an object to be displayed in the composite image.
The combining unit corrects the object image with the patch, and creates a combined image from the corrected object image.
The data synthesizer according to claim 7.

The data synthesizing apparatus according to claim 1, further comprising an update unit that performs zero-shot learning using information on the element objects used to generate the synthesized data, the information being obtained in the process of generating the synthesized data, and updates the neural network model.

The update unit recognizes that the information on the element object used for generating the composite data, which is obtained in the process of generating the composite data, is information that is not given to the object image stored in the database. Then, it is decided to perform the zero shot learning.
The data synthesizer according to claim 9.

It further comprises an updating unit that performs zero-shot learning using information on element objects and updates the neural network model,
The extraction unit extracts at least one of a concept or a context from sample sentences given for learning of the neural network model;
The conversion unit converts the concept or the context into a feature vector represented by a vector in a predetermined feature space;
The updating unit performs zero-shot learning using information on the feature vector, and updates the neural network model.
The data synthesizing apparatus according to claim 1.

The data synthesizing apparatus according to claim 9 or 11, wherein the update unit updates the neural network model by zero-shot learning through adversarial training using generative adversarial networks.

A data synthesis method in which:
a neural network model for combining element objects according to an input feature vector is held in advance;
a database is provided in which element objects, which are synthesizable data related to objects that can be described in natural language, are stored in advance;
extraction means extracts at least one of a concept or a context from a sentence written in natural language;
conversion means converts the concept or the context into a feature vector represented as a vector in a predetermined feature space; and
synthesizing means selects an element object from the database based on the neural network model and the feature vector, and generates synthesized data using the element object.