JP6877374B2

JP6877374B2 - Method for training a model that outputs a vector representing a tag set corresponding to an image

Info

Publication number: JP6877374B2
Application number: JP2018025847A
Authority: JP
Inventors: 彬童; マルティンクリンキグト; ヨコジョバンニクリスティアント; 真岩山; 義行小林
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2018-02-16
Filing date: 2018-02-16
Publication date: 2021-05-26
Anticipated expiration: 2038-02-16
Also published as: JP2019144639A

Description

The present invention relates to a method for training a model that outputs a vector representing a tag set corresponding to an image.

Techniques for assigning tags that describe an image to that image are known for use in image analysis and retrieval. For example, Patent Document 1 and Patent Document 2 each disclose a system that estimates tags related to an image.

Patent Document 1 discloses: "Initially, during training, a clustering technique is utilized to reduce cluster imbalance in the data that is input into a convolutional neural network (CNN) for training feature data. In embodiments, the clustering technique can also be utilized to compute data point similarity that can be utilized for tag propagation (to tag untagged images). During testing, a diversity-based voting framework is utilized to overcome user tagging biases. In some embodiments, bigram re-weighting can down-weight a keyword that is likely to be part of a bigram based on a predicted tag set." (Abstract)

Patent Document 2 discloses: "Systems, methods, and non-transitory computer-readable media can, in a training phase, generate a first content item representation of a first content item based on a first content item transformation. The first content item can include one or more images and videos. A first user metadata representation of first user metadata can be created based on a first user metadata transformation. The first content item representation and the first user metadata representation can be combined to generate a first combined representation. The first combined representation and a first tag representation of a first tag can be embedded in an embedding space within a first threshold distance of each other." (Abstract)

Patent Document 1: U.S. Patent Application Publication No. 2017/0236055
Patent Document 2: U.S. Patent Application Publication No. 2016/0188592

Machine learning makes it possible to build a model that outputs one or more related tags (a tag set) for an input image. Training such a model properly requires a large amount of training data. In general, the frequency distribution of tags associated with image data is long-tailed: most tags have a low frequency of appearance (number of associated images), and only a few tags appear frequently.

It is difficult to collect a large amount of training data for tag sets that appear infrequently. At the same time, a system that estimates the tag set to be associated with an image is required to estimate even infrequent tag sets with high accuracy.

One aspect of the present disclosure is a computer-executed method that includes training a first model that outputs, from an input image, a vector representing a tag set to be associated with the input image. The computer includes a storage device and a processor that operates according to a program stored in the storage device. The first model includes an encoder that outputs, from an input image, a vector representing a tag set to be associated with the input image, and a decoder that receives the output of the encoder. In the method, the processor inputs a first training image to the encoder, acquires a first output image from the decoder, acquires a first output vector from the encoder, and updates the parameters of the first model based on (i) the error between the first training image and the first output image and (ii) a constraint on the first output vector from the encoder, the constraint being preset based on a first tag set associated in advance with the first training image.
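The parameter update described above can be read as minimizing a combined objective. The following is only a sketch under stated assumptions: the text does not say how the two terms are combined, so the sum, the weighting coefficient \lambda, the squared-error form of the reconstruction term, and the symbols E (encoder), D (decoder), x (first training image), t (first tag set), and c (constraint on the encoder output) are all introduced here for illustration.

```latex
% Assumed form: reconstruction error plus a weighted tag-set constraint.
\mathcal{L}(\theta)
  = \underbrace{\lVert x - D_\theta\bigl(E_\theta(x)\bigr) \rVert^{2}}_{\text{error between training image and output image}}
  + \lambda \, \underbrace{c\bigl(E_\theta(x);\, t\bigr)}_{\text{constraint on the encoder output vector}}
```

Under this reading, the first term keeps the encoder output informative about the image, while the second term ties it to the tag set associated with the image.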

According to one aspect of the present disclosure, a model that can estimate the tag set to be associated with an image with high accuracy can be obtained.

A configuration example of a computer system including a tag estimation device.
A configuration example of the image-text-related database.
A configuration example of the image-tag-related database.
An example of a GUI image with which a user obtains a tag set corresponding to an input image.
An example of a GUI image with which a user obtains images corresponding to an input tag set.
A flowchart of the process by which the operation program determines a recommended tag set for an input image.
A flowchart of the process by which the operation program determines images related to an input tag set.
A flowchart of the process performed by the image-tag-related data generation program.
A configuration example of the tag distribution table.
A flowchart of the method by which the training program trains the matching model.
A flowchart of the processing of the matching model during training.
A schematic configuration example of the semantic representation model.
Input data and output data in the training of the semantic representation model.
A flowchart of the training of the semantic representation model by the training program.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. It should be noted that the present embodiment is merely an example for realizing the present invention and does not limit the technical scope of the present invention. Components common to the figures are given the same reference numerals.

FIG. 1 shows a configuration example of a computer system including a tag estimation device 100. The tag estimation device 100 includes a processor 110, a memory 120, an auxiliary storage device 130, an input/output interface 140, and a network (NW) interface 145. These components are connected to each other by a bus. The memory 120, the auxiliary storage device 130, or a combination of the two is an example of a storage device.

The memory 120 is composed of, for example, semiconductor memory and is mainly used to hold programs and data temporarily. The programs stored in the memory 120 include a matching model 121, a semantic representation model 122, a training program 123, an operation program 124, a data crawler 125, and an image-tag-related data generation program 126.

The memory 120 further stores a tag distribution table 547 created by the image-tag-related data generation program 126. The tag distribution table 547 may also be stored in the auxiliary storage device 130.

The processor 110 executes various processes according to the programs stored in the memory 120. Various functional units are realized by the processor 110 operating according to the programs. For example, the processor 110 operates as a matching model unit, a semantic representation model unit, a learning unit, an operation unit, a data crawler unit, and a training data generation unit according to the respective programs listed above.

The auxiliary storage device 130 stores an image-text-related database (DB) 132 and an image-tag-related database 134. The auxiliary storage device 130 is composed of a large-capacity storage device such as a hard disk drive or a solid state drive and is used to hold programs and data for a long period of time.

A program stored in the auxiliary storage device 130 is loaded into the memory 120 at startup or when needed, and the processor 110 executes the program to carry out the various processes of the tag estimation device 100. Accordingly, a process described below as being executed by a program is a process performed by the processor 110 or the tag estimation device 100.

The network interface 145 is an interface for connecting to a network. In the example of FIG. 1, the tag estimation device 100 connects to the Internet via the network interface 145.

The client device 144 is a device used by a user and accesses the tag estimation device 100 via a network, which is the Internet in the example of FIG. 1. The client device 144 has, for example, a general computer configuration and includes an input device and a display device. The input device is a hardware device with which the user inputs instructions, information, and the like to the tag estimation device 100. The display device is a hardware device that displays various images for input and output. The input device and the display device may be connected to the tag estimation device 100 without going through a network.

The matching model 121 and the semantic representation model 122 are models that are trained (updated) by machine learning. The tag estimation device 100 has a training mode (learning mode) and an operation mode for the matching model 121, and likewise a training mode and an operation mode for the semantic representation model 122.

The matching model 121 and the semantic representation model 122 are each trained by the training program 123 in the training mode. The image-tag-related database 134 is used to train the matching model 121 and the semantic representation model 122.

In the operation mode, the matching model 121 is used by the image-tag-related data generation program 126 to generate, from the data stored in the image-text-related database 132, data to be stored in the image-tag-related database 134.

In the operation mode, the semantic representation model 122 is used by the operation program 124 to estimate a tag set, consisting of one or more tags, that corresponds to an image input by the user. The data in the image-tag-related database 134 is used for estimating the tag set.

The semantic representation model 122 is also used to estimate one or more images corresponding to a tag set input by the user. The data in the image-text-related database 132 is used for this image estimation. Alternatively, the image estimation may use the data in the image-tag-related database 134 without using the semantic representation model 122 or the image-text-related database 132.

The data crawler 125 periodically crawls Web pages on the Internet and collects pairs consisting of an image and the text associated with it. The data crawler 125 stores the collected image-text pairs in the image-text-related database 132.

FIG. 2 shows a configuration example of the image-text-related database 132. The image-text-related database 132 associates each image with its corresponding text and has an image column 321 and a text column 322. The image column 321 stores the collected images. The text column 322 stores the text associated with each image. Each text is composed of multiple words.

As described above, the image-text-related database 132 is used in generating the image-tag-related database 134 and in estimating the images corresponding to a tag set input by the user.

FIG. 3 shows a configuration example of the image-tag-related database 134. The image-tag-related database 134 associates each image with its corresponding tag set and has an image column 331 and a tag column 332. The image column 331 stores images. The tag column 332 stores the tag set associated with each image. Each tag set is composed of one or more tags (words).
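The two databases described above can be pictured as collections of simple records. The following is a minimal sketch, assuming one record per row; the class and field names are hypothetical and only mirror the columns named in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageTextRecord:
    """One row of the image-text-related database (hypothetical layout)."""
    image: bytes  # image column 321: the collected image data
    text: str     # text column 322: text associated with the image


@dataclass
class ImageTagRecord:
    """One row of the image-tag-related database (hypothetical layout)."""
    image: bytes     # image column 331: the image data
    tags: List[str]  # tag column 332: tag set of one or more tags (words)


# Example rows with placeholder image bytes.
text_row = ImageTextRecord(image=b"...", text="A dog catching a frisbee in the park")
tag_row = ImageTagRecord(image=b"...", tags=["dog", "frisbee", "park"])
```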

As described above, the image-tag-related database 134 is used for the matching model 121 and the semantic representation model 122, and is also used to estimate an appropriate tag set describing an image input by the user.

The following describes the estimation of one or more images corresponding to a tag set input by the user and the estimation of a tag set corresponding to an image input by the user. The operation program 124 uses the semantic representation model 122 to estimate one or more images corresponding to an input tag set and to estimate a tag set corresponding to an input image.

FIG. 4A shows an example of a GUI image with which the user obtains a tag set corresponding to an input image. The operation program 124 transmits the image data shown in FIG. 4A to the client device 144. On the client device 144, the user enters the path of the target image in a field 401 and presses an UPLOAD button 402.

The client device 144 acquires the input image (data) from the path indicated in the field 401 and transmits it to the tag estimation device 100. The operation program 124 receives the target image from the client device 144.

In response to the user selecting an "Estimate" button 403 via the input device, the client device 144 transmits to the tag estimation device 100 a request for a tag set describing the transmitted image. In response to the received request, the operation program 124 estimates the tag set related to the target image and transmits it to the client device 144. The client device 144 displays the received tag set in a "Recommended Tags" section 404.

FIG. 4B shows an example of a GUI image with which the user obtains images corresponding to an input tag set. The operation program 124 transmits the image data shown in FIG. 4B to the client device 144. On the client device 144, the user enters the target tag set in a field 451. In response to the user selecting a "Search" button 452 via the input device, the client device 144 transmits to the tag estimation device 100 the target tag set together with a request to search for images related to it.

In response to the received request, the operation program 124 selects one or more images estimated to be related to the target tag set and transmits them to the client device 144. The client device 144 displays the received image or images in a "Related Images" section 454.

FIG. 5 shows a flowchart of the process by which the operation program 124 determines a recommended tag set for an input image. The operation program 124 acquires the image input by the user (S101). The operation program 124 then uses the semantic representation model 122 to generate a semantic representation vector from the acquired image (S102). The semantic representation model 122 generates, from an input image, the semantic representation vector of that image. The configuration and operation of the semantic representation model 122 are described in detail later.

As described later, the semantic representation vector can be regarded as a vector representing the tag set to be associated with the input image, that is, a vector representing the tag set that describes the image. In the example described later, the semantic representation vector corresponds to a vector generated using word embedding techniques.

The operation program 124 sequentially selects tags from the image-tag-related database 134 and generates their vectors (tag vectors) (S103). In this example, the operation program 124 converts each tag into a tag vector using word embedding techniques. The operation program 124 includes a pre-trained word embedding model. In other implementations, words, including tags, may be converted into vectors by a technique other than word embedding.

The operation program 124 compares each generated tag vector with the semantic representation vector and calculates its similarity to the semantic representation vector (S104). The similarity can be calculated using, for example, the dot product or cosine similarity.

Based on the similarity, the operation program 124 determines the tags to include in the recommended tag set for the target image (S105). For example, the operation program 124 forms the recommended tag set from the tags whose similarity is higher than a threshold. A maximum number of tags in the recommended tag set may be set in advance. If the number of tags whose similarity exceeds the threshold is larger than the specified number, the operation program 124 may select the specified number of tags in descending order of similarity. Alternatively, the operation program 124 may select the specified number of tags in descending order of similarity without referring to a similarity threshold.
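Steps S103 to S105 can be sketched as follows. This is a minimal illustration, not the implementation of the operation program 124: it assumes the tag vectors are already available as a dictionary, uses cosine similarity (one of the two measures mentioned), and the function names, parameter names, and toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_a * norm_b)


def recommend_tags(image_vector, tag_vectors, threshold=None, max_tags=None):
    """Return tags in descending order of similarity to image_vector.

    tag_vectors: dict mapping tag -> tag vector (hypothetical layout).
    threshold:   keep only tags whose similarity exceeds it (None = no threshold).
    max_tags:    upper limit on the number of returned tags (None = no limit).
    """
    scored = [(tag, cosine_similarity(image_vector, vec))
              for tag, vec in tag_vectors.items()]          # S104
    if threshold is not None:
        scored = [(tag, s) for tag, s in scored if s > threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)     # S105
    if max_tags is not None:
        scored = scored[:max_tags]
    return [tag for tag, _ in scored]


# Toy 2-dimensional vectors standing in for real embeddings.
tag_vectors = {"dog": [1.0, 0.0], "puppy": [0.9, 0.1], "car": [0.0, 1.0]}
image_vector = [1.0, 0.05]
print(recommend_tags(image_vector, tag_vectors, threshold=0.5, max_tags=2))
```

Passing only `max_tags` corresponds to the variant that selects a fixed number of tags without a similarity threshold.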

As described above, the semantic representation model 122 can be used to estimate the tag set corresponding to an input image with high accuracy. Comparing the vector of each tag with the semantic representation vector allows the tags to be associated with the image to be estimated more accurately.

The operation program 124 may compare the semantic representation vector with the vectors of tag sets made up of multiple tags, for example the tag set of each record in the image-tag-related database 134. A single tag is itself a tag set consisting of one tag. Instead of the image-tag-related database 134, a database that stores only tags may be prepared, and the operation program 124 may compare the vectors of the tags in that database with the semantic representation vector.

FIG. 6 shows a flowchart of the process by which the operation program 124 determines images related to an input tag set. The operation program 124 acquires the tag set input by the user (S151).

The operation program 124 generates a vector (tag set vector) from the acquired tag set using the word embedding model (S152). Specifically, the operation program 124 generates a tag vector for each tag in the tag set using the word embedding model and combines these tag vectors to generate the tag set vector of the tag set.

For example, the operation program 124 may use the simple average of the tag vectors as the tag set vector, or may use a weighted average of the tag vectors as the tag set vector. The weight of a tag is determined, for example, according to the frequency of appearance of the tag in the image-tag-related database 134: the more images a tag is associated with in the image-tag-related database 134, the larger the weight given to it.
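The averaging described above can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, and using the raw image counts directly as weights is only one way to make the weight grow with the frequency of appearance.

```python
def tag_set_vector(tag_vectors, weights=None):
    """Combine tag vectors into a single tag set vector by averaging.

    tag_vectors: list of equal-length vectors, one per tag in the tag set.
    weights:     optional non-negative weight per tag; None means simple average.
    """
    if not tag_vectors:
        raise ValueError("a tag set must contain at least one tag")
    if weights is None:
        weights = [1.0] * len(tag_vectors)
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(tag_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, tag_vectors)) / total
        for i in range(dim)
    ]


# Toy vectors for two tags and their (hypothetical) associated image counts.
vectors = [[1.0, 0.0], [0.0, 1.0]]
image_counts = [3, 1]

simple = tag_set_vector(vectors)                           # simple average
weighted = tag_set_vector(vectors, weights=image_counts)   # frequency-weighted
print(simple, weighted)
```

With these toy values the weighted result is pulled toward the vector of the more frequent tag.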

Next, the operation program 124 sequentially reads images from the image-text-related database 132 and generates their semantic representation vectors using the semantic representation model 122 (S153). The operation program 124 calculates the similarity between each generated semantic representation vector and the tag set vector generated in step S152 (S154). The similarity can be calculated using, for example, the dot product or cosine similarity.

Based on the similarity, the operation program 124 determines the images estimated to be related to the target tag set (S155). For example, the operation program 124 selects images whose similarity is higher than a threshold as the images to present as related to the tag set. A maximum number of images to present may be set in advance. If the number of images whose similarity exceeds the threshold is larger than the specified number, the operation program 124 may select the specified number of images in descending order of similarity. Alternatively, the operation program 124 may select the specified number of images in descending order of similarity without referring to a similarity threshold.

As described above, the semantic representation model 122 can be used to estimate the images related to an input tag set with high accuracy. In step S153, the operation program 124 may refer to the image-tag-related database 134 instead of the image-text-related database 132. In that case, the operation program 124 generates a tag set vector for the tag set of each record in the image-tag-related database 134 and calculates its similarity to the tag set vector generated in step S152.

Next, the process by which the image-tag-related data generation program 126 generates, from the data in the image-text-related database 132, data to be stored in the image-tag-related database 134 is described. FIG. 7 shows a flowchart of the process performed by the image-tag-related data generation program 126. The image-tag-related data generation program 126 executes the process shown in FIG. 7 periodically, for example once a week.

The image-tag-related data generation program 126 analyzes the frequency distribution of tags in the image-tag-related database 134 and generates the tag distribution table 547 (S201). The frequency of appearance of a tag is equal to the number of images associated with that tag in the image-tag-related database 134.

FIG. 8 shows a configuration example of the tag distribution table 547. The tag distribution table 547 includes a tag column 471 and a number-of-images column 472. The tag column 471 lists the tags stored in the image-tag-related database 134. The number-of-images column 472 indicates, for each tag, the number of images associated with it in the image-tag-related database 134.

Returning to FIG. 7, the image-tag-related data generation program 126 refers to the tag distribution table 547 and selects tags with a low frequency of appearance (S202). For example, the image-tag-related data generation program 126 selects a specified number of tags in ascending order of frequency, starting from the least frequent tag. Alternatively, the image-tag-related data generation program 126 may select all tags whose frequency of appearance is lower than a specified value.
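Steps S201 and S202 can be sketched as follows. This is a minimal illustration under assumptions: the records are modeled as (image identifier, tag set) pairs, ties between equally frequent tags are broken alphabetically, and all function and parameter names are hypothetical.

```python
from collections import Counter


def build_tag_distribution(image_tag_records):
    """S201: count, for each tag, the number of images associated with it."""
    counts = Counter()
    for _image_id, tag_set in image_tag_records:
        counts.update(set(tag_set))  # count each tag at most once per image
    return counts


def select_rare_tags(distribution, num_tags=None, max_count=None):
    """S202: select low-frequency tags.

    num_tags:  select this many tags, starting from the least frequent.
    max_count: select only tags whose count is lower than this value.
    """
    # Assumption: ties are broken alphabetically for a deterministic result.
    ordered = sorted(distribution.items(), key=lambda item: (item[1], item[0]))
    if max_count is not None:
        ordered = [(tag, c) for tag, c in ordered if c < max_count]
    if num_tags is not None:
        ordered = ordered[:num_tags]
    return [tag for tag, _ in ordered]


# Toy contents standing in for the image-tag-related database.
records = [
    ("img1", ["dog", "park"]),
    ("img2", ["dog", "beach"]),
    ("img3", ["dog", "park", "frisbee"]),
]
distribution = build_tag_distribution(records)
print(select_rare_tags(distribution, num_tags=2))
print(select_rare_tags(distribution, max_count=3))
```

The `Counter` plays the role of the tag distribution table: its keys correspond to the tag column and its values to the number-of-images column.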

The image-tag-related data generation program 126 selects, from the image-text-related database 132, the image-text pairs (records) whose text contains any of the selected tags (S203).

For each selected image-text pair, the image-tag-related data generation program 126 uses the matching model 121 to calculate the similarity between the image and each word in the text associated with it (S204).

Specifically, the image-tag-related data generation program 126 generates a feature vector of the image and a vector of each word. The image feature vector can be generated by, for example, a pre-trained deep learning model (deep neural network). The word vectors are generated by the pre-trained word embedding model, as described above.

The image-tag-related data generation program 126 inputs the generated image feature vector and one word vector to the matching model 121. The matching model 121 outputs the similarity between the input image feature vector and the input word vector. The processing of the matching model 121 is described in detail later.

For each image-text pair, the image-tag-related data generation program 126 selects, based on the similarity to the image, the words in the text to register as tags (S205). For example, the image-tag-related data generation program 126 selects words whose similarity is higher than a threshold as tags of the image. In this way, the words are ranked by their similarity to the image.

The maximum number of tags that can be selected may be set in advance. In that case, the image-tag related data generation program 126 selects, from the words whose similarity is higher than the threshold, one or more words in descending order of similarity, up to the specified maximum number. Alternatively, the number of tags to be selected may be set in advance. In that case, the image-tag related data generation program 126 selects the specified number of words in descending order of similarity, starting from the word with the highest similarity.
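The selection rule of S205 can be illustrated with a short sketch. This is a minimal Python illustration under stated assumptions, not the implementation described in this specification: the function name `select_tags`, its parameters, and the example similarity values are hypothetical, and the similarities are assumed to have already been obtained from the matching model 121.

```python
from typing import Dict, List, Optional

def select_tags(similarity: Dict[str, float],
                threshold: Optional[float] = None,
                max_tags: Optional[int] = None) -> List[str]:
    """Rank words by similarity to the image and keep the best ones."""
    # Rank words from most to least similar (ties broken alphabetically).
    ranked = sorted(similarity, key=lambda word: (-similarity[word], word))
    # Optionally keep only words whose similarity exceeds the threshold.
    if threshold is not None:
        ranked = [word for word in ranked if similarity[word] > threshold]
    # Optionally cap the number of selected tags.
    if max_tags is not None:
        ranked = ranked[:max_tags]
    return ranked

# Hypothetical similarities between one image and the words of its text.
scores = {"dog": 0.91, "park": 0.74, "the": 0.05, "running": 0.62}
print(select_tags(scores, threshold=0.5, max_tags=2))  # ['dog', 'park']
```

Passing only `max_tags` corresponds to the variant in which a fixed number of words is taken in descending order of similarity without a threshold.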

The one or more words selected as tags constitute the tag set that represents (is associated with) the image. The image-tag related data generation program 126 adds the pair of the image and the selected tag set to the image-tag relational database 134 (S206).

As described above, selecting tags from the text actually associated with an image allows the tags related to the image to be determined more accurately. As described later, the image-tag relational database 134 is used to train the semantic representation model 122, so the semantic representation model 122 can be trained appropriately. In addition, using the matching model 121, which is improved by training, makes it possible to generate more appropriate data to add to the image-tag relational database 134.

Determining the images to add to the image-tag relational database 134 based on the tag distribution table 547 brings the frequency distribution of tags in the image-tag relational database 134 closer to uniform. Selecting images whose associated text contains low-frequency tags increases the likelihood of selecting images with which low-frequency tags are associated.

As a result, data on images whose related tag sets are generally rare (rare images) can be included in the image-tag relational database 134, and the semantic representation model 122 can be trained to estimate tag sets accurately for rare images as well. A rare image is an image whose associated tag set includes a tag with a low frequency of occurrence, or whose tag set consists of a combination of tags that occurs with low frequency.
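The idea of steering new records toward low-frequency tags can be sketched as follows. This is only an illustration under assumptions: the function names, the representation of records as `(image_id, text)` tuples, the whitespace tokenization, and the example data are all hypothetical and are not taken from this specification.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

def tag_frequencies(tag_sets: Iterable[Set[str]]) -> Counter:
    """Count how many stored tag sets contain each tag."""
    counts: Counter = Counter()
    for tags in tag_sets:
        counts.update(tags)
    return counts

def low_frequency_tags(counts: Counter, k: int) -> List[str]:
    """Return the k least frequent tags (ties broken alphabetically)."""
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [tag for tag, _ in ordered[:k]]

def select_records(records: Iterable[Tuple[str, str]],
                   targets: Set[str]) -> List[Tuple[str, str]]:
    """Keep (image_id, text) records whose text mentions a target tag."""
    return [(image_id, text) for image_id, text in records
            if targets & set(text.lower().split())]

# Hypothetical contents of the tag database and of the image-text database.
stored_tag_sets = [{"dog", "park"}, {"dog", "ball"}, {"dog", "cat"}, {"cat", "sofa"}]
records = [("img1", "a dog in the park"),
           ("img2", "a cat on a sofa"),
           ("img3", "children playing with a ball")]

counts = tag_frequencies(stored_tag_sets)
rare = low_frequency_tags(counts, 2)
print(rare)                                # ['ball', 'park']
print(select_records(records, set(rare)))  # records img1 and img3
```

Records selected this way would then go through the similarity-based tag selection before being added to the database, which is what moves the tag frequency distribution toward uniform.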

FIG. 9 is a flowchart showing how the training program 123 trains the matching model 121. FIG. 10 is a flowchart of the processing of the matching model 121 during training. The training program 123 repeats the processing described with reference to FIG. 9 for different images in the image-tag relational database 134.

Referring to FIG. 9, the training program 123 selects an image from the image-tag relational database 134 and generates a feature vector of the image (S221). As described above, the image feature vector can be generated, for example, by a pre-trained deep learning model (deep neural network).

Next, the training program 123 acquires the tag set associated with the selected image from the image-tag relational database 134 and generates a vector of that tag set (related tag set vector) (S222). The related tag set vector is generated in the same way as the tag set vector described with reference to FIG. 6 (S152).

Next, the training program 123 samples, from the image-tag relational database 134, a plurality of tags that are not associated with the selected image (S223). The training program 123 then generates a vector for each sampled tag (unrelated tag vector) (S224). The unrelated tag vectors are generated, for example, using the word embedding model as described above.
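The sampling of unrelated tags in S223 might look like the following sketch. The dictionary layout of the database, the function name `sample_unrelated_tags`, and the use of uniform random sampling are assumptions made for illustration; the specification does not state how the samples are drawn.

```python
import random
from typing import Dict, List, Set

def sample_unrelated_tags(database: Dict[str, Set[str]], image_id: str,
                          n: int, rng: random.Random) -> List[str]:
    """Sample up to n tags that occur in the database but not on the image."""
    related = database[image_id]
    # Every tag that appears anywhere in the database.
    all_tags = set().union(*database.values())
    # Sort the candidates so that sampling is reproducible for a seeded rng.
    candidates = sorted(all_tags - related)
    return rng.sample(candidates, min(n, len(candidates)))

# Hypothetical database mapping image identifiers to their tag sets.
database = {"img1": {"dog", "park"},
            "img2": {"cat", "sofa"},
            "img3": {"ball", "beach"}}
negatives = sample_unrelated_tags(database, "img1", 2, random.Random(0))
print(negatives)
```

Each sampled tag would then be converted to an unrelated tag vector by the word embedding model before being paired with the image feature vector.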

Instead of or in addition to the vector of each tag, a vector of a tag set consisting of a plurality of tags may be generated. For example, the vector may be generated for a tag set that belongs to a record in the image-tag relational database 134 and that either includes one or more tags not associated with the selected image or consists only of tags not associated with the selected image. An unrelated tag vector is also the vector of a tag set consisting of a single tag.

The training program 123 inputs the generated image feature vector, related tag set vector, and unrelated tag vectors to the matching model 121 (S225). Specifically, the training program 123 sequentially inputs to the matching model 121 the pair of the image feature vector and the related tag set vector, and each pair of the image feature vector and an unrelated tag vector.

The matching model 121 outputs the similarity between the two input vectors. The training program 123 thus acquires the similarity between the image feature vector and the related tag set vector, and the similarity between the image feature vector and each of the unrelated tag vectors (S226).

The matching model 121 is updated based on the acquired similarities (S227). Specifically, the matching model 121 is updated (trained) so that the loss of its output decreases. For example, a hinge loss function is used.

For example, the matching model 121 is trained to minimize $\max(0,\ m + \mathrm{sim}(x, y') - \mathrm{sim}(x, y))$. Here, $\mathrm{sim}$ denotes the similarity and $m$ is a preset margin; $x$ is the image feature vector, $y$ is the related tag set vector, and $y'$ is an unrelated tag vector.
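The hinge loss above can be evaluated numerically as in the following sketch. Summing the loss over all sampled unrelated tags, the default margin of 0.2, and the function name `hinge_loss` are assumptions made for illustration; the specification gives only the per-pair expression.

```python
from typing import Sequence

def hinge_loss(sim_related: float, sim_unrelated: Sequence[float],
               margin: float = 0.2) -> float:
    """Sum max(0, m + sim(x, y') - sim(x, y)) over the sampled y'."""
    return sum(max(0.0, margin + s - sim_related) for s in sim_unrelated)

# Hypothetical similarities: one related tag set and three unrelated tags.
loss = hinge_loss(0.8, [0.3, 0.7, 0.9], margin=0.2)
print(round(loss, 6))
```

In this example the first unrelated tag contributes nothing because its similarity is already lower than that of the related tag set by more than the margin; only the second and third unrelated tags produce a positive loss that training would reduce.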

FIG. 10 is a flowchart of the processing of the matching model 121 during training. The matching model 121 acquires an image feature vector as input data (S251). The matching model 121 also acquires an unrelated tag vector or a related tag set vector as input data (S252).

The matching model 121 maps the input image feature vector into a common space (S253). The matching model 121 further maps the input unrelated tag vector or related tag set vector into the same common space (S254). The image feature vector and the tag or tag set vector may have different dimensions; mapping them into the common space gives them the same dimension. The mapping of the input vectors into the common space can be performed by a deep learning model (deep neural network), whose parameters are updated through training by the training program 123.

In the common space, the matching model 121 calculates the similarity between the image feature vector and the unrelated tag vector or related tag set vector (S255). The similarity is defined, for example, as the dot product or the cosine similarity. Canonical correlation analysis (CCA) may also be applied to the matching model 121.
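Steps S253 to S255 can be sketched as follows. As a simplifying assumption, a single linear map stands in for each deep neural network that performs the projection into the common space, and cosine similarity is used; the function names and the example weights are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]

def project(weights: Matrix, vec: Sequence[float]) -> List[float]:
    """Map a vector into the common space (one linear layer, for illustration)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0

def match(image_vec: Sequence[float], tag_vec: Sequence[float],
          w_image: Matrix, w_tag: Matrix) -> float:
    """Similarity of an image vector and a tag vector in the common space."""
    return cosine_similarity(project(w_image, image_vec), project(w_tag, tag_vec))

# A 3-dimensional image vector and 2-dimensional tag vectors share a 2-D common space.
w_image = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_tag = [[1.0, 0.0], [0.0, 1.0]]
print(match([1.0, 0.0, 5.0], [2.0, 0.0], w_image, w_tag))  # 1.0
print(match([1.0, 0.0, 5.0], [0.0, 3.0], w_image, w_tag))  # 0.0
```

In training, the entries of `w_image` and `w_tag` would play the role of the parameters that are updated to reduce the loss.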

FIG. 11 schematically shows a configuration example of the semantic representation model 122. The semantic representation model 122 has an autoencoder configuration and can be implemented as a deep learning model (deep neural network). The semantic representation model 122 includes an encoder 221 and a decoder 222.

The encoder 221 and the decoder 222 have, for example, structures that are symmetrical about the output of the encoder and the input of the decoder (semantic representation vector 225). This allows the semantic representation model 122 to be trained efficiently. The encoder 221 and the decoder 222 each include, for example, a multi-layer CNN (Convolutional Neural Network) and one or more fully connected layers.

The encoder 221 encodes the input image 227 into the semantic representation vector 225 and outputs it. The decoder 222 decodes the input semantic representation vector 225 into the reconstructed image 228 and outputs it. As described with reference to FIGS. 5 and 6, the semantic representation vector 225 output from the encoder 221 is used by the operation program 124. As described later, the semantic representation model 122 is trained so that the difference between the reconstructed image 228 and the input image 227 becomes small.

Training of the semantic representation model 122 is described with reference to FIGS. 12A and 12B. FIG. 12A shows the input data and output data in the training of the semantic representation model 122. An image 227 selected from the image-tag relational database 134 is input to the encoder 221. The decoder 222 receives the semantic representation vector 225 output from the encoder 221 as input and outputs the reconstructed image 228.

Training of the semantic representation model 122 imposes a constraint on the semantic representation vector 225, which is the output of the encoder 221. Examples of the constraint are described later. To impose the constraint on the semantic representation vector 225, the tag set associated with the input image 227 in the image-tag relational database 134 is referenced.

The error between the semantic representation vector 225 output from the encoder 221 and the related tag set vector of the tag set associated with the input image 227 is used in training the semantic representation model 122. In addition, the error between the reconstructed image 228 and the input image 227 is used in training the semantic representation model 122.

In training for the semantic representation vector 225, taking self-reconstruction (the error between the reconstructed image 228 and the input image 227) into account pushes the model to retain enough information to reproduce the input image 227. This suppresses excessive degeneration (generalization) in the training of the semantic representation model 122 and prevents tags or tag sets that are important as image features from being discarded as noise, even if they are rare tags or tag sets with a low frequency of occurrence.

FIG. 12B is a flowchart of the training of the semantic representation model 122 by the training program 123. The training program 123 repeats the processing described with reference to FIG. 12B for different images in the image-tag relational database 134.

The training program 123 selects an image from the image-tag relational database 134 (S301). The training program 123 acquires, from the image-tag relational database 134, the tag set associated with the selected image and generates its vector (related tag set vector) (S302). The related tag set vector is generated in the same way as the tag set vector described with reference to FIG. 6 (S152).

The training program 123 inputs the selected image (input image 227) to the encoder 221 of the semantic representation model 122 (S303). The training program 123 acquires the output image (reconstructed image) 228 and the semantic representation vector 225 from the semantic representation model 122 (S304).

The training program 123 updates the semantic representation model 122 based on the error between the input image and the output image and the error between the semantic representation vector and the related tag set vector. For example, the training program 123 updates the parameters of the semantic representation model 122 so that $\|x - x'\| + \lambda \|s - w\|$ becomes smaller. The constraint on the semantic representation vector 225 is to reduce the error between the semantic representation vector and the related tag set vector.

Here, $x$ and $x'$ are the input image and the output image (reconstructed image), respectively. $\lambda$ is a balancing parameter, $s$ is the semantic representation vector, and $w$ is the related tag set vector. Denoting the encoder 221 by $f$ and the decoder 222 by $f'$, $s = f(x)$ and $x' = f'(f(x))$.
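The combined objective can be computed as in the following sketch. It assumes that images are flattened into lists of numbers and that both norms are Euclidean norms; the specification does not state which norm is used, and the function names and example values are hypothetical.

```python
import math
from typing import Sequence

def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def training_loss(x: Sequence[float], x_rec: Sequence[float],
                  s: Sequence[float], w: Sequence[float],
                  lam: float = 1.0) -> float:
    """Reconstruction error plus lam times the tag-set constraint error."""
    reconstruction_error = l2_distance(x, x_rec)  # ||x - x'||
    constraint_error = l2_distance(s, w)          # ||s - w||
    return reconstruction_error + lam * constraint_error

# Hypothetical flattened image, its reconstruction, the encoder output s,
# and the related tag set vector w.
x, x_rec = [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 4.0]
s, w = [1.0, 1.0], [1.0, 2.0]
print(training_loss(x, x_rec, s, w, lam=0.5))  # 5.5
```

A larger balancing parameter puts more weight on matching the tag set vector, while a smaller one favors faithful reconstruction of the input image.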

Another example of the constraint on the semantic representation vector 225 in the training of the semantic representation model 122 is described next. The training program 123 selects, from the image-tag relational database 134, the tag set associated with the selected image and a tag set that includes, or consists only of, tags not associated with the selected image, and generates a vector representing each of these tag sets (a related tag set vector and an unrelated tag set vector, respectively).

Let $z$ denote the semantic representation vector of the input image, $y$ the related tag set vector of the input image, and $y'$ an unrelated tag set vector that is not related to the input image. The similarity between the related tag set vector $y$ and the unrelated tag set vector $y'$ is smaller than the similarity between the related tag set vector $y$ and itself.

An example of the constraint on the semantic representation vector 225 in the training of the semantic representation model 122 is $L_2(y, z) < L_2(y', z)$, where $L_2$ is the Euclidean distance in the vector space. A plurality of unrelated tag set vectors may be formed so that the constraint includes a plurality of inequalities.
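One way to use such an inequality during training is to turn it into a penalty that is zero whenever the inequality holds with some slack. The sketch below is an assumption about how this could be done, not a method stated in this specification; the hinge form, the `margin` parameter, and the function names are hypothetical.

```python
import math
from typing import Sequence

def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def constraint_penalty(z: Sequence[float], y_related: Sequence[float],
                       y_unrelated: Sequence[Sequence[float]],
                       margin: float = 0.0) -> float:
    """Penalize each unrelated y' that is not farther from z than y is."""
    d_related = l2(y_related, z)
    return sum(max(0.0, d_related - l2(y_neg, z) + margin)
               for y_neg in y_unrelated)

z = [0.0, 0.0]
y = [3.0, 4.0]                                    # distance 5 from z
print(constraint_penalty(z, y, [[6.0, 8.0]]))     # 0.0, y' is farther than y
print(constraint_penalty(z, y, [[0.0, 1.0]]))     # 4.0, y' is closer than y
```

Passing several unrelated tag set vectors corresponds to including several inequalities in the constraint.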

Another example of the constraint on the semantic representation vector 225 in the training of the semantic representation model 122 is described next. The training program 123 selects, from the image-tag relational database 134, images whose tag sets differ in their degree of similarity to the tag set of the input image. For example, as an image whose tag set approximates that of the input image (approximate image), the training program 123 selects an image that shares all but one of its tags with the input image.

The training program 123 further selects, from the image-tag relational database 134, an image whose tags are completely different from those of the input image (dissimilar image). None of the tags associated with the input image is associated with the dissimilar image.

Let $z_1$ be the semantic representation vector of the input image, $z_2$ that of the approximate image, and $z_3$ that of the dissimilar image. An example of the constraint on the semantic representation vector 225 in the training of the semantic representation model 122 is $L_2(z_1, z_2) < L_2(z_1, z_3)$. A plurality of approximate images and/or a plurality of dissimilar images may be selected so that the constraint includes a plurality of inequalities.
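The selection of approximate and dissimilar images and the check of the resulting inequality can be sketched as follows. The reading of "all but one tag in common" as "at most one differing tag on each side", the dictionary layout of the database, and all function names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

def split_by_tag_overlap(database: Dict[str, Set[str]],
                         anchor_id: str) -> Tuple[List[str], List[str]]:
    """Return (approximate, dissimilar) image ids relative to the anchor image."""
    anchor = database[anchor_id]
    approximate: List[str] = []
    dissimilar: List[str] = []
    for image_id, tags in database.items():
        if image_id == anchor_id:
            continue
        if anchor.isdisjoint(tags):
            dissimilar.append(image_id)   # no tag in common
        elif len(anchor - tags) <= 1 and len(tags - anchor) <= 1:
            approximate.append(image_id)  # at most one differing tag per side
    return approximate, dissimilar

def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def satisfies_constraint(z1: Sequence[float], z2: Sequence[float],
                         z3: Sequence[float]) -> bool:
    """Check L2(z1, z2) < L2(z1, z3)."""
    return l2(z1, z2) < l2(z1, z3)

# Hypothetical database mapping image identifiers to tag sets.
database = {"a": {"dog", "park", "ball"},
            "b": {"dog", "park", "frisbee"},
            "c": {"cat", "sofa"},
            "d": {"dog", "sofa"}}
print(split_by_tag_overlap(database, "a"))                          # (['b'], ['c'])
print(satisfies_constraint([0.0, 0.0], [1.0, 0.0], [3.0, 4.0]))     # True
```

Image "d" falls into neither group because it shares one tag with the anchor but differs in two others, so it would not be used for this constraint.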

The constraints on the semantic representation vector 225 described above allow the semantic representation model 122 to be trained appropriately so that it generates a semantic representation vector representing the tag set that should be associated with the input image.

The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments above have been described in detail to explain the present invention in an easily understandable manner, and the invention is not necessarily limited to configurations that include all of the described elements. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations can be added, deleted, or substituted.

Some or all of the configurations, functions, processing units, and the like described above may be implemented in hardware, for example by designing them as integrated circuits. The configurations, functions, and the like described above may also be implemented in software, with a processor interpreting and executing programs that implement the respective functions. Information such as programs, tables, and files that implement the functions can be stored in a memory, in a recording device such as a hard disk or an SSD (Solid State Drive), or on a recording medium such as an IC card or an SD card.

The control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines of a product are necessarily shown. In practice, almost all of the configurations may be considered to be interconnected.

100 tag estimation device, 110 processor, 120 memory, 121 matching model, 122 semantic representation model, 123 training program, 124 operation program, 125 data crawler program, 126 image-tag related data generation program, 130 auxiliary storage device, 132 image-text relational database, 134 image-tag relational database, 144 client device, 145 network interface, 221 encoder, 222 decoder, 225 semantic representation vector, 226 constraint, 227 input image, 228 reconstructed image, 547 tag distribution table

Claims

A method executed by a computer, the method including training a first model that outputs, from an input image, a vector representing a tag set to be associated with the input image, wherein
the computer includes:
a storage device; and
a processor that operates according to a program stored in the storage device,
the first model includes an encoder that outputs, from the input image, the vector representing the tag set to be associated with the input image, and a decoder that receives the output of the encoder as input,
the method comprises, by the processor:
inputting a first training image to the encoder;
acquiring a first output image from the decoder;
acquiring a first output vector from the encoder; and
updating parameters of the first model based on an error between the first training image and the first output image and on a constraint on the first output vector from the encoder, the constraint being preset based on a first tag set associated in advance with the first training image, and
the constraint is to reduce an error between the first output vector from the encoder and a vector of the first tag set.

A method executed by a computer, the method including training a first model that outputs, from an input image, a vector representing a tag set to be associated with the input image, wherein
the computer includes:
a storage device; and
a processor that operates according to a program stored in the storage device,
the first model includes an encoder that outputs, from the input image, the vector representing the tag set to be associated with the input image, and a decoder that receives the output of the encoder as input, and
the method comprises, by the processor:
inputting a first training image to the encoder;
acquiring a first output image from the decoder;
acquiring a first output vector from the encoder;
updating parameters of the first model based on an error between the first training image and the first output image and on a constraint on the first output vector from the encoder, the constraint being preset based on a first tag set associated in advance with the first training image;
analyzing a frequency distribution of tags in a database that stores images used for training the first model and tag sets associated with the respective images; and
adding, to the database, a new image and a new tag set that are associated with each other so that the frequency distribution approaches a uniform state.

The method according to claim 2, wherein
the processor:
acquires the new image and a text that are associated with each other;
determines a similarity between a vector of each word in the text and a feature vector of the new image; and
selects, from the text, the new tag set to be associated with the new image based on the similarity.

The method according to claim 3, wherein
the processor
determines the similarity between the vector of each word in the text and the feature vector of the image by using a second model, and
in training the second model:
determines, by using the second model, a first similarity between a feature vector of a second training image and a vector of a related tag set associated in advance with the second training image;
determines, by using the second model, a second similarity between the feature vector of the second training image and a vector of an unrelated tag set including a word not included in the related tag set; and
updates the second model based on the first similarity and the second similarity.

The method according to claim 3, wherein
the processor further selects, based on the frequency of occurrence of tags in the database, a first tag whose frequency of occurrence is to be increased, and
the text includes the first tag.

A method executed by a computer, the method including training a first model that outputs, from an input image, a vector representing a tag set to be associated with the input image, wherein
the computer includes:
a storage device; and
a processor that operates according to a program stored in the storage device,
the first model includes an encoder that outputs, from the input image, the vector representing the tag set to be associated with the input image, and a decoder that receives the output of the encoder as input, and
the method comprises, by the processor:
inputting a first training image to the encoder;
acquiring a first output image from the decoder;
acquiring a first output vector from the encoder;
updating parameters of the first model based on an error between the first training image and the first output image and on a constraint on the first output vector from the encoder, the constraint being preset based on a first tag set associated in advance with the first training image;
inputting a first user image to the first model;
acquiring, from the encoder, a first user output vector for the first user image;
determining a similarity between the first user output vector and a vector of each of tag sets prepared in advance; and
determining a tag set to be associated with the first user image based on the similarity.

The method according to claim 6, wherein
each of the tag sets prepared in advance consists of one tag.

A method executed by a computer, the method including training a first model that outputs, from an input image, a vector representing a tag set to be associated with the input image, wherein
the computer includes:
a storage device; and
a processor that operates according to a program stored in the storage device,
the first model includes an encoder that outputs, from the input image, the vector representing the tag set to be associated with the input image, and a decoder that receives the output of the encoder as input, and
the method comprises, by the processor:
inputting a first training image to the encoder;
acquiring a first output image from the decoder;
acquiring a first output vector from the encoder;
updating parameters of the first model based on an error between the first training image and the first output image and on a constraint on the first output vector from the encoder, the constraint being preset based on a first tag set associated in advance with the first training image;
acquiring a first user tag set;
sequentially inputting images prepared in advance to the first model and acquiring candidate output vectors from the encoder;
determining a similarity between each of the candidate output vectors and a vector of the first user tag set; and
determining an image to be associated with the first user tag set based on the similarity.

A computer system comprising:
a storage device; and
a processor that operates according to a program stored in the storage device, wherein
the processor
trains a first model including an encoder that outputs, from an input image, a vector representing a tag set to be associated with the input image, and a decoder that receives the output of the encoder as input,
in training the first model, the processor:
inputs a first training image to the encoder;
acquires a first output image from the decoder;
acquires a first output vector from the encoder; and
updates parameters of the first model based on an error between the first training image and the first output image and on a constraint on the first output vector from the encoder, the constraint being preset based on a first tag set associated in advance with the first training image, and
the constraint is to reduce an error between the first output vector from the encoder and a vector of the first tag set.

A computer system comprising:
a storage device; and
a processor that operates according to a program stored in the storage device, wherein
the processor
trains a first model including an encoder that outputs, from an input image, a vector representing a tag set to be associated with the input image, and a decoder that receives the output of the encoder as input,
in training the first model, the processor:
inputs a first training image to the encoder;
acquires a first output image from the decoder;
acquires a first output vector from the encoder; and
updates parameters of the first model based on an error between the first training image and the first output image and on a constraint on the first output vector from the encoder, the constraint being preset based on a first tag set associated in advance with the first training image, and
the processor further analyzes a frequency distribution of tags in a database that stores images used for training the first model and tag sets associated with the respective images, and
adds, to the database, a new image and a new tag set that are associated with each other so that the frequency distribution approaches a uniform state.

A computer system comprising:
a storage device; and
a processor that operates according to a program stored in the storage device, wherein
the processor
trains a first model including an encoder that outputs, from an input image, a vector representing a tag set to be associated with the input image, and a decoder that receives the output of the encoder as input,
in training the first model, the processor:
inputs a first training image to the encoder;
acquires a first output image from the decoder;
acquires a first output vector from the encoder; and
updates parameters of the first model based on an error between the first training image and the first output image and on a constraint on the first output vector from the encoder, the constraint being preset based on a first tag set associated in advance with the first training image, and
the processor further inputs a first user image to the first model,
acquires, from the encoder, a first user output vector for the first user image,
determines a similarity between the first user output vector and a vector of each of tag sets prepared in advance, and
determines a tag set to be associated with the first user image based on the similarity.

A computer system comprising:
a storage device; and
a processor that operates according to a program stored in the storage device, wherein
the processor
trains a first model including an encoder that outputs, from an input image, a vector representing a tag set to be associated with the input image, and a decoder that receives the output of the encoder as input,
in training the first model, the processor:
inputs a first training image to the encoder;
acquires a first output image from the decoder;
acquires a first output vector from the encoder; and
updates parameters of the first model based on an error between the first training image and the first output image and on a constraint on the first output vector from the encoder, the constraint being preset based on a first tag set associated in advance with the first training image, and
the processor further acquires a first user tag set,
sequentially inputs images prepared in advance to the first model and acquires candidate output vectors from the encoder,
determines a similarity between each of the candidate output vectors and a vector of the first user tag set, and
determines an image to be associated with the first user tag set based on the similarity.