JP2021157570A

JP2021157570A - Image similarity estimation system, learning device, estimation device, and program

Info

Publication number: JP2021157570A
Application number: JP2020057919A
Authority: JP
Inventors: 大地小池; Daichi Koike; 高志末永; Takashi Suenaga
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 2020-03-27
Filing date: 2020-03-27
Publication date: 2021-10-07
Anticipated expiration: 2040-03-27
Also published as: JP7394680B2

Abstract

PROBLEM TO BE SOLVED: To provide an image similarity estimation system, a learning device, an estimation device, and a program that can determine the similarity of images taking not only appearance but also concept into account.

SOLUTION: An image similarity estimation system 1 comprises: an appearance information acquisition unit 10 for acquiring appearance information that indicates the appearance of an image; an appearance feature extraction unit 11 for extracting an appearance feature quantity that indicates the appearance feature of an image using the appearance information on the image and an appearance feature extraction model; a classification information acquisition unit 12 for acquiring classification information that indicates the classification of the image; a classification text feature extraction unit 13 for extracting a classification text feature quantity that indicates the feature of words that indicate image classification using the classification information on the image and a classification text feature extraction model; and an overall feature extraction unit 14 for extracting an overall feature quantity, which is the feature of the image as a whole, using the appearance feature quantity of the image, the classification text feature quantity, and a multimodal model.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an image similarity estimation system, a learning device, an estimation device, and a program.

In the examination of a trademark application at the Japan Patent Office, it is determined whether the trademark in the application is similar to trademarks that have already been filed. One technique for determining whether images such as characters and figures are similar performs deep learning based on image features and extracts similar images. For example, Patent Document 1 discloses a technique that identifies a plurality of locations in an image, calculates a feature amount for each identified location, and extracts similar images based on the calculated feature amounts.

Patent Document 1: Japanese Patent No. 5934653

However, the similarity of trademarks is judged by comprehensively observing the impressions, memories, associations, and the like that the applied-for trademark and the cited trademark give to consumers through their appearance, pronunciation, or concept, and by considering whether there is a risk of confusion as to source with the cited trademark when the applied-for trademark is used for the designated goods or designated services (examination guidelines for Article 4, Paragraph 1, Item 11 of the Trademark Act). In other words, the similarity of trademarks is judged comprehensively from the viewpoints of not only appearance but also pronunciation and concept. Consequently, determining only the similarity of images, that is, the similarity of appearance, with the technique of Patent Document 1 is insufficient for judging the similarity of trademarks.

The present invention has been made to solve the above problem, and its object is to provide an image similarity estimation system, a learning device, an estimation device, and a program that can determine the similarity of images in consideration of not only appearance but also concept.

To solve the above problem, one aspect of the present invention is an image similarity estimation system comprising: an appearance information acquisition unit that acquires appearance information indicating the appearance of an image; an appearance feature extraction unit that extracts an appearance feature amount indicating the features of the appearance of an image by using the appearance information of the image and an appearance feature extraction model; a classification information acquisition unit that acquires classification information indicating the classification of an image; a classification text feature extraction unit that extracts a classification text feature amount indicating the features of the wording that indicates the classification of an image by using the classification information of the image and a classification text feature extraction model; an overall feature extraction unit that extracts an overall feature amount, which is a feature of the image as a whole, by using the appearance feature amount of the image, the classification text feature amount, and a multimodal model; a model generation unit that generates the appearance feature extraction model and the multimodal model; and an image similarity estimation unit that estimates the degree of similarity between a target image and a comparison image based on the overall feature amount of the target image and the overall feature amount of the comparison image. The appearance feature extraction model is a model that outputs the appearance feature amount of an image from the appearance information of the image, and the model generation unit generates the appearance feature extraction model by causing a learning model to learn the correspondence between the appearance information and the classification information of learning images. The classification text feature extraction model is a model that extracts the feature amount of wording indicating a classification. The multimodal model is a model that outputs the overall feature amount of an image from the appearance feature amount and the classification text feature amount of the image, and the model generation unit generates the multimodal model by causing a learning model to learn the correspondence between, on the one hand, the appearance feature amount of the learning images extracted by the appearance feature extraction unit and the classification text feature amount of the learning images extracted by the classification text feature extraction unit and, on the other hand, the classification information of the learning images.

In one aspect of the present invention, in the image similarity estimation system described above, the appearance feature extraction model may include an attention mechanism that outputs values obtained by weighting the internal state of a deep learning model, and the model generation unit may cause the attention mechanism to learn weights that correspond to the relationship between the appearance information and the classification information of the learning images.

In one aspect of the present invention, in the image similarity estimation system described above, the classification text feature extraction model may be a model that extracts the features of a wording based on values obtained by weighting word feature amounts, which indicate the features of the words contained in the wording, by the idf values of those words, and the idf values may be calculated by performing statistical processing on an image group, which is a set of classified images.
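The idf weighting described above can be illustrated with a minimal sketch. It assumes that the weighted word features are combined by a weighted average, which the text does not specify; the function name `classification_text_feature` and the dictionary inputs are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def classification_text_feature(
    words: Sequence[str],
    word_vectors: Dict[str, List[float]],
    idf: Dict[str, float],
) -> List[float]:
    """Combine word vectors into one feature, weighting each word by its idf value.

    Assumption: the combination is an idf-weighted average. Words without a
    vector or an idf value are skipped.
    """
    known = [w for w in words if w in word_vectors and w in idf]
    if not known:
        raise ValueError("no word has both a word vector and an idf value")
    total_weight = sum(idf[w] for w in known)
    if total_weight == 0:
        raise ValueError("all idf weights are zero")
    dim = len(word_vectors[known[0]])
    feature = [0.0] * dim
    for w in known:
        for i, x in enumerate(word_vectors[w]):
            feature[i] += idf[w] * x  # rarer words (higher idf) contribute more
    return [x / total_weight for x in feature]
```

With this weighting, a word that appears in few classifications dominates the resulting feature, while a word shared by most classifications contributes little.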

In one aspect of the present invention, in the image similarity estimation system described above, the idf value may be a value calculated using the ratio of the number of images in the image group, which is a set of classified images, to the number of images that contain the classification text feature amount.
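As a rough illustration of this ratio, the sketch below computes an idf value per classification word from a set of classified images. The logarithm follows the conventional idf definition and is an assumption, since the text states only that a ratio is used; the function name `idf_from_classified_images` and the input layout (one list of classification words per image) are likewise hypothetical.

```python
import math
from typing import Dict, Iterable, List


def idf_from_classified_images(labels_per_image: Iterable[List[str]]) -> Dict[str, float]:
    """Return log(N / n_w) for each word w.

    N is the number of classified images and n_w is the number of images whose
    classification wording contains w. The log is an assumed, conventional choice.
    """
    images = [set(words) for words in labels_per_image]
    total = len(images)
    containing: Dict[str, int] = {}
    for words in images:
        for w in words:
            containing[w] = containing.get(w, 0) + 1
    return {w: math.log(total / n) for w, n in containing.items()}
```

A word attached to nearly every image receives a value close to zero, so it carries little weight when classification wordings are compared.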

In one aspect of the present invention, in the image similarity estimation system described above, the model generation unit may perform preprocessing that normalizes the appearance feature amount and the classification text feature amount of the learning images so that both fall within the same range, and may generate the multimodal model by causing a learning model to learn the correspondence between the preprocessed appearance feature amount and classification text feature amount of the learning images and the classification information.
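One possible form of this preprocessing is sketched below. It assumes min-max scaling of each feature vector to the range [0, 1] and assumes that the two normalized vectors are then concatenated into a single input; neither choice is stated in the text, and the names `min_max_scale` and `multimodal_input` are hypothetical.

```python
from typing import List, Sequence


def min_max_scale(vector: Sequence[float]) -> List[float]:
    """Scale a feature vector linearly into [0, 1] (assumed normalization)."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        # A constant vector carries no range information; map it to zeros.
        return [0.0 for _ in vector]
    return [(x - lo) / (hi - lo) for x in vector]


def multimodal_input(appearance: Sequence[float], text: Sequence[float]) -> List[float]:
    """Normalize both feature vectors to the same range, then concatenate them."""
    return min_max_scale(appearance) + min_max_scale(text)
```

Bringing both feature types into the same range keeps one modality from dominating the other merely because its raw values are larger.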

Another aspect of the present invention is a learning device comprising: an appearance information acquisition unit that acquires appearance information indicating the appearance of an image; an appearance feature extraction unit that extracts an appearance feature amount indicating the features of the appearance of an image by using the appearance information of the image and an appearance feature extraction model; a classification information acquisition unit that acquires classification information indicating the classification of an image; a classification text feature extraction unit that extracts a classification text feature amount indicating the features of the wording that indicates the classification of an image by using the classification information of the image and a classification text feature extraction model; an overall feature extraction unit that extracts an overall feature amount, which is a feature of the image as a whole, by using the appearance feature amount of the image, the classification text feature amount, and a multimodal model; and a model generation unit that generates the appearance feature extraction model and the multimodal model. The appearance feature extraction model is a model that outputs the appearance feature amount of an image from the appearance information of the image, and the model generation unit generates the appearance feature extraction model by causing a learning model to learn the correspondence between the appearance information and the classification information of learning images. The classification text feature extraction model is a model that extracts the feature amount of wording indicating a classification. The multimodal model is a model that outputs the overall feature amount of an image from the appearance feature amount and the classification text feature amount of the image, and the model generation unit generates the multimodal model by causing a learning model to learn the correspondence between, on the one hand, the appearance feature amount of the learning images extracted by the appearance feature extraction unit and the classification text feature amount of the learning images extracted by the classification text feature extraction unit and, on the other hand, the classification information of the learning images.

Another aspect of the present invention is an estimation device comprising: an appearance information acquisition unit that acquires appearance information indicating the appearance of an image; an appearance feature extraction unit that extracts an appearance feature amount indicating the features of the appearance of an image by using the appearance information of the image and an appearance feature extraction model; a classification information acquisition unit that acquires classification information indicating the classification of an image; a classification text feature extraction unit that extracts a classification text feature amount indicating the features of the wording that indicates the classification of an image by using the classification information of the image and a classification text feature extraction model; an overall feature extraction unit that extracts an overall feature amount, which is a feature of the image as a whole, by using the appearance feature amount of the image, the classification text feature amount, and a multimodal model; and an image similarity estimation unit that estimates the degree of similarity between a target image and a comparison image based on the overall feature amount of the target image and the overall feature amount of the comparison image. The appearance feature extraction model is a model that outputs the appearance feature amount of an image from the appearance information of the image, and is generated by causing a learning model to learn the correspondence between the appearance information and the classification information of learning images. The classification text feature extraction model is a model that extracts the feature amount of wording indicating a classification. The multimodal model is a model that outputs the overall feature amount of an image from the appearance feature amount and the classification text feature amount of the image, and is generated by causing a learning model to learn the correspondence between, on the one hand, the appearance feature amount of the learning images extracted by the appearance feature extraction unit and the classification text feature amount of the learning images extracted by the classification text feature extraction unit and, on the other hand, the classification information of the learning images.

Another aspect of the present invention is a program for causing a computer to function as: appearance information acquisition means for acquiring appearance information indicating the appearance of an image; appearance feature extraction means for extracting an appearance feature amount indicating the features of the appearance of an image by using the appearance information of the image and an appearance feature extraction model; classification information acquisition means for acquiring classification information indicating the classification of an image; classification text feature extraction means for extracting a classification text feature amount indicating the features of the wording that indicates the classification of an image by using the classification information of the image and a classification text feature extraction model; overall feature extraction means for extracting an overall feature amount, which is a feature of the image as a whole, by using the appearance feature amount of the image, the classification text feature amount, and a multimodal model; and model generation means for generating the appearance feature extraction model and the multimodal model. The appearance feature extraction model is a model that outputs the appearance feature amount of an image from the appearance information of the image, and the model generation means generates the appearance feature extraction model by causing a learning model to learn the correspondence between the appearance information and the classification information of learning images. The classification text feature extraction model is a model that extracts the feature amount of wording indicating a classification. The multimodal model is a model that outputs the overall feature amount of an image from the appearance feature amount and the classification text feature amount of the image, and the model generation means generates the multimodal model by causing a learning model to learn the correspondence between, on the one hand, the appearance feature amount of the learning images extracted by the appearance feature extraction means and the classification text feature amount of the learning images extracted by the classification text feature extraction means and, on the other hand, the classification information of the learning images.

Another aspect of the present invention is a program for causing a computer to function as: appearance information acquisition means for acquiring appearance information indicating the appearance of an image; appearance feature extraction means for extracting an appearance feature amount indicating the features of the appearance of an image by using the appearance information of the image and an appearance feature extraction model; classification information acquisition means for acquiring classification information indicating the classification of an image; classification text feature extraction means for extracting a classification text feature amount indicating the features of the wording that indicates the classification of an image by using the classification information of the image and a classification text feature extraction model; overall feature extraction means for extracting an overall feature amount, which is a feature of the image as a whole, by using the appearance feature amount of the image, the classification text feature amount, and a multimodal model; and image similarity estimation means for estimating the degree of similarity between a target image and a comparison image based on the overall feature amount of the target image and the overall feature amount of the comparison image. The appearance feature extraction model is a model that outputs the appearance feature amount of an image from the appearance information of the image, and is generated by causing a learning model to learn the correspondence between the appearance information and the classification information of learning images. The classification text feature extraction model is a model that extracts the feature amount of wording indicating a classification. The multimodal model is a model that outputs the overall feature amount of an image from the appearance feature amount and the classification text feature amount of the image, and is generated by causing a learning model to learn the correspondence between, on the one hand, the appearance feature amount of the learning images extracted by the appearance feature extraction means and the classification text feature amount of the learning images extracted by the classification text feature extraction means and, on the other hand, the classification information of the learning images.

According to the present invention, the similarity of images can be determined in consideration of not only their appearance but also their concept.

A block diagram showing a configuration example of the image similarity estimation system 1 of the embodiment.
A diagram showing an example of the image G of the embodiment.
A diagram showing an example of the graphic classification Z of the embodiment.
A diagram showing a configuration example of the appearance information 170 of the embodiment.
A diagram showing a configuration example of the classification information 171 of the embodiment.
A diagram explaining the processing performed by the image similarity estimation system 1 of the embodiment.
A diagram explaining the appearance feature extraction model 172 of the embodiment.
A diagram explaining the classification text feature extraction model 173 of the embodiment.
A diagram explaining the multimodal model 174 of the embodiment.
A flowchart showing the flow of processing performed by the image similarity estimation system 1 of the embodiment.
A flowchart showing the flow of processing performed by the image similarity estimation system 1 of the embodiment.
A flowchart showing the flow of processing performed by the image similarity estimation system 1 of the embodiment.

Embodiments of the present invention are described below with reference to the drawings.

The image similarity estimation system 1 of the embodiment estimates the degree to which images are similar to each other. It is applied, for example, to determining the similarity of an applied-for trademark in the examination of trademark applications at the Japan Patent Office.

In trademark examination, similarity is judged in consideration of not only similarity in appearance but also similarity in pronunciation and concept. For example, a logical search expression is created using the graphic classification assigned to trademarks. Executing a search with that expression narrows the already-filed trademarks down to those that may be similar to the applied-for trademark. From the narrowed-down trademarks, those that are similar in appearance, pronunciation, or concept are extracted.

In general, image processing with a deep learning model extracts the appearance features of an image as a multidimensional representation, and the degree of similarity is estimated from how close the vectors representing those features are in the multidimensional space. In other words, the degree of similarity is estimated from the appearance features of the images, so images whose appearance features are entirely different are almost never estimated to be similar. For example, consider two images that depict the same object (for example, a harp), where one is a realistic natural image such as a photograph and the other is a designed illustration. If the appearance features of the two images differ greatly, the two are unlikely to be estimated as similar. That is, an image showing a photograph of a harp and an illustration of a stylized harp are unlikely to be estimated as similar. However, because both share the same concept, "harp", they are often judged to be conceptually similar when the similarity of trademarks is determined. With image processing based on a general deep learning model, it has been difficult to accurately estimate images that are similar in concept in this trademark sense.
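The idea of estimating similarity from the closeness of feature vectors can be sketched as follows. Cosine similarity is used here only as one common closeness measure; the text refers to the distance between vectors without naming a specific metric, so the choice of metric and the function name `cosine_similarity` are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (assumed closeness measure)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)
```

When only appearance features are compared in this way, a photograph and an illustration of the same object can score low even though they share a concept, which is the limitation the paragraph above describes.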

To address this, the image similarity estimation system 1 of the present embodiment performs estimation using the classification text feature extraction model 173, a model trained on the conceptual features of images. That is, the system can extract not only the appearance features of an image but also its conceptual features. This makes it possible to estimate the degree of similarity from a conceptual viewpoint, according to how close the vectors representing the conceptual features extracted from the images are to each other. Images that are similar in concept can therefore be extracted.

Here, the concept of an image refers to wording that indicates the classification of the image, for example, wording corresponding to the graphic classification assigned to a trademark. A conceptual feature in the present embodiment is a feature of the words contained in that wording, for example, a word vector that is a distributed representation of a word. In the following description, the conceptual features of an image may be referred to as classification text features.

In addition, the image similarity estimation system 1 of the present embodiment uses deep learning models to generate the appearance feature extraction model 172 and the multimodal model 174. The appearance feature extraction model 172 is a model trained on the appearance features of images. The multimodal model 174 is a model that extracts the features of the image as a whole (hereinafter also referred to as overall features) based on the feature amounts of both appearance and concept. That is, the system can extract features (overall features) in which the appearance and conceptual feature amounts of an image are integrated. This makes it possible to estimate the degree of similarity from a viewpoint that integrates both appearance and concept, according to how close the vectors that jointly represent the appearance and conceptual features extracted from the images are to each other. Images that are similar when appearance and concept are considered together can therefore be extracted.

FIG. 1 is a block diagram showing a configuration example of the image similarity estimation system 1 of the embodiment. The image similarity estimation system 1 includes, for example, an appearance information acquisition unit 10, an appearance feature extraction unit 11, a classification information acquisition unit 12, a classification text feature extraction unit 13, an overall feature extraction unit 14, a model generation unit 15, an image similarity estimation unit 16, a storage unit 17, and an estimation result output unit 18.

The appearance information acquisition unit 10 acquires information indicating the appearance of an image. This information describes how the image looks, for example, information in which an RGB value is associated with the coordinates of each pixel. The appearance information acquisition unit 10 stores the acquired information in the storage unit 17 as the appearance information 170.
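As a small illustration of appearance information in the form described above, the sketch below maps each pixel coordinate to its RGB value. The in-memory layout (a dictionary keyed by `(x, y)`) and the function name `appearance_information` are hypothetical; the text states only that RGB values are associated with per-pixel coordinates.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]      # (x, y) coordinates of a pixel
RGB = Tuple[int, int, int]   # red, green, and blue values


def appearance_information(rows: List[List[RGB]]) -> Dict[Pixel, RGB]:
    """Associate each pixel coordinate with its RGB value.

    `rows` is a list of image rows from top to bottom, each a list of RGB tuples.
    """
    return {(x, y): rgb for y, row in enumerate(rows) for x, rgb in enumerate(row)}
```

For a 2 x 2 image, the result has four entries, one per pixel, and the entry for `(1, 0)` holds the RGB value of the second pixel in the top row.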

The appearance feature extraction unit 11 extracts the feature amount of the appearance of an image (appearance feature amount) using the appearance information 170 of the image and the appearance feature extraction model 172. The appearance feature extraction model 172 is a model that outputs the appearance feature amount of an image from the appearance information of that image. The appearance feature extraction model 172 is generated by the model generation unit 15 and is described in detail later.

The classification information acquisition unit 12 acquires information indicating the classification of an image. This is information that classifies the content shown in the image, for example information indicating the figure classification of a trademark. The classification information acquisition unit 12 stores the acquired information in the storage unit 17 as the classification information 171.

The classification text feature extraction unit 13 extracts the feature amount of the wording that indicates the classification of an image (classification text feature amount) using the classification information 171 of the image and the classification text feature extraction model 173. The classification text feature extraction model 173 is a model that outputs the classification text feature amount of an image from the classification information of that image. The classification text feature extraction model 173 is generated by the model generation unit 15 and is described in detail later.

The overall feature extraction unit 14 extracts the feature amount of the image as a whole (overall feature amount) using the appearance feature amount of the image, its classification text feature amount, and the multimodal model 174. The overall feature extraction unit 14 acquires the appearance feature amount of the image from the appearance feature extraction unit 11 and the classification text feature amount of the image from the classification text feature extraction unit 13. The multimodal model 174 is a model that outputs the overall feature amount of an image from the appearance feature amount and the classification text feature amount of that image. The multimodal model 174 is described in detail later.

The model generation unit 15 generates the appearance feature extraction model 172. To do so, the model generation unit 15 trains a deep learning model on the correspondence between the appearance information and the classification information of learning images. The model generation unit 15 thereby generates a model that outputs the classification information of an image from the input appearance information of that image, and stores information indicating the generated model in the storage unit 17 as the appearance feature extraction model 172. If the deep learning model is, for example, a CNN (Convolutional Neural Network), the information indicating the model includes the number of units in each of the input, intermediate, and output layers of the CNN, the number of hidden layers, the activation functions, and the coupling coefficients and weights that connect the nodes of each layer.

The model generation unit 15 also generates the multimodal model 174. To do so, the model generation unit 15 trains a deep learning model on the correspondence between the appearance feature amount and classification text feature amount of learning images and their classification information. The model generation unit 15 acquires the appearance feature amount of each learning image extracted by the appearance feature extraction unit 11, and the classification text feature amount of each learning image extracted by the classification text feature extraction unit 13. The model generation unit 15 thereby generates a model that outputs the classification information of an image from the input appearance feature amount and classification text feature amount of that image.

Here, the classification information derived from the appearance feature amount and the classification text feature amount of an image is a feature based on both of those feature amounts, and can therefore be regarded as an overall feature. That is, by training a deep learning model on the correspondence between the appearance feature amount and classification text feature amount of learning images and their classification information, the model generation unit 15 generates a model that outputs the overall features of an image. The model generation unit 15 stores information indicating the generated model in the storage unit 17 as the multimodal model 174.

The image similarity estimation unit 16 estimates the degree of similarity between images (image similarity). The image similarity estimation unit 16 acquires the overall feature amount of each of a plurality of images, as extracted by the overall feature extraction unit 14. The image similarity estimation unit 16 calculates the distance in vector space (for example, cosine similarity) between the overall features extracted from the respective images. For example, the image similarity estimation unit 16 treats the order of the calculated distances as the order of decreasing likelihood of similarity. Alternatively, the image similarity estimation unit 16 may estimate that two images are similar when the calculated distance is less than a predetermined threshold value.
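The ranking and threshold test described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes the distance is taken as 1 minus the cosine similarity, and the function names and the default threshold of 0.2 are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors (assumed distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Sort (image ID, distance) pairs by ascending distance to the query vector."""
    scored = [(image_id, cosine_distance(query, vec))
              for image_id, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


def is_similar(u, v, threshold=0.2):
    """Judge two images similar when their distance is below the threshold.

    The threshold value is a hypothetical placeholder.
    """
    return cosine_distance(u, v) < threshold
```

Smaller distances come first in the ranking, so the head of the returned list holds the candidates most likely to be similar to the query image.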

The estimation result output unit 18 outputs the result estimated by the image similarity estimation unit 16. For example, the estimation result output unit 18 displays the estimation result by outputting it to a display (not shown). Alternatively, the estimation result output unit 18 may print the estimation result by outputting it to a printer (not shown).

The functional units of the image similarity estimation system 1 described above (the appearance information acquisition unit 10, appearance feature extraction unit 11, classification information acquisition unit 12, classification text feature extraction unit 13, overall feature extraction unit 14, model generation unit 15, image similarity estimation unit 16, and estimation result output unit 18) are realized, for example, by a hardware processor such as a CPU (Central Processing Unit) executing a program (software). Some or all of these components may be realized by hardware (including circuitry) such as an LSI (Large Scale Integration), ASIC (Application Specific Integrated Circuit), FPGA (Field Programmable Gate Array), or GPU (Graphics Processing Unit), or by software and hardware working together. The program may be stored in advance in a storage device (a storage device including a non-transitory storage medium) such as an HDD (Hard Disk Drive) or flash memory, or it may be stored on a removable storage medium (a non-transitory storage medium) such as a DVD or CD-ROM and installed when the storage medium is mounted in a drive device.

The storage unit 17 is configured from one storage medium or any combination of storage media. The storage medium is, for example, an HDD (Hard Disk Drive), flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), RAM (Random Access read/write Memory), or ROM (Read Only Memory). The storage unit 17 stores programs for executing the various processes of the image similarity estimation system 1 and temporary data used while those processes run.

The storage unit 17 stores, for example, the appearance information 170, the classification information 171, the appearance feature extraction model 172, the classification text feature extraction model 173, and the multimodal model 174.

Here, the appearance information 170 and the classification information 171 will be described with reference to FIGS. 2 to 5.
FIG. 2 is a diagram showing an example of the image G of the embodiment. FIG. 3 is a diagram showing an example of the figure classification Z of the embodiment. FIG. 4 is a diagram showing a configuration example of the appearance information 170 of the embodiment. FIG. 5 is a diagram showing a configuration example of the classification information 171 of the embodiment.

As shown in FIG. 2, the image G is, for example, an image of an illustration of a nurse drawn inside a circle. As a feature of the appearance of the image G shown in the example of FIG. 2, a figure classification Z such as that shown in FIG. 3 is assigned. In this example, the figure classification Z includes "2.3.1 head, upper body" and "2.3.3 nun, nurse".

As shown in FIG. 4, the appearance information 170 includes, for example, an image ID and appearance information. The image ID is identification information that uniquely identifies an image. The appearance information is information indicating the appearance of the image. In this example, the appearance information consists of the coordinates of each pixel and its RGB value.

As shown in FIG. 5, the classification information 171 includes, for example, an image ID and classification information. The image ID is identification information that uniquely identifies an image. The classification information is information indicating the classification of the image. In this example, the classification information associates each number in the trademark figure classification numbering scheme with the classification wording that corresponds to that number.
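The two records described with reference to FIGS. 4 and 5 could be represented along the following lines. This is only a sketch: the class names, field names, and the image ID "G001" are hypothetical, and the actual layout of the appearance information 170 and the classification information 171 is not specified beyond the fields named above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class AppearanceRecord:
    """One entry of the appearance information: image ID plus per-pixel RGB values."""
    image_id: str
    pixels: Dict[Tuple[int, int], Tuple[int, int, int]]  # (x, y) -> (R, G, B)


@dataclass(frozen=True)
class ClassificationRecord:
    """One entry of the classification information: image ID plus numbered wordings."""
    image_id: str
    classes: Dict[str, str]  # classification number -> classification wording


# Hypothetical example entries for a single image.
appearance = AppearanceRecord("G001", {(0, 0): (255, 255, 255)})
classification = ClassificationRecord(
    "G001",
    {"2.3.1": "head, upper body", "2.3.3": "nun, nurse"},
)
```

Because both records carry the same image ID, the appearance and classification of one image can be looked up together.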

Here, the flow of the process by which the image similarity estimation system 1 extracts the overall features of an image will be described. FIG. 6 is a diagram illustrating the process performed by the image similarity estimation system 1 of the embodiment.

As shown in FIG. 6, the image similarity estimation system 1 inputs the appearance information of the image G into the appearance feature extraction model 172, which outputs the appearance feature amount of the image G. The image similarity estimation system 1 also inputs the classification information of the image G into the classification text feature extraction model 173, which outputs the classification text feature amount of the image G. The image similarity estimation system 1 then inputs the appearance feature amount and the classification text feature amount of the image G into the multimodal model 174, which outputs the overall feature amount of the image G. In this way, the image similarity estimation system 1 uses the appearance feature extraction model 172, the classification text feature extraction model 173, and the multimodal model 174 to extract the overall feature amount of the image G from its appearance information and classification information.

Here, the appearance feature extraction model 172 will be described in detail with reference to FIG. 7. FIG. 7 is a diagram illustrating the appearance feature extraction model 172 of the embodiment. As shown in FIG. 7, the appearance feature extraction model 172 includes, for example, a CNN unit 172A, an attention mechanism 172B, a multiplication unit 172C, and an appearance feature output unit 172D.

The CNN unit 172A is a CNN-based deep learning model. The attention mechanism 172B weights the internal state output from the CNN unit 172A and outputs the result. For example, the attention mechanism 172B gives parts that are unimportant for estimation (for example, background regions of the image) smaller weights than important parts. This focuses the model on the features that are useful for estimation so that they have a greater influence on the estimation result. The multiplication unit 172C multiplies the output from the CNN unit 172A and the output from the attention mechanism 172B each by a weight and outputs the result. The multiplication unit 172C functions, for example, as a switch that outputs either the output from the CNN unit 172A or the output from the attention mechanism 172B. This makes it possible to enable or disable the attention mechanism 172B and to verify how its presence or absence affects estimation accuracy. The appearance feature output unit 172D is the output layer that holds the output of the appearance feature extraction model 172, that is, the appearance feature amount of the image G.
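The weighting and switching behavior described above can be illustrated with a toy sketch. It rests on several assumptions that are not in the source: the CNN internal state is modeled as a list of per-position feature vectors, attention is one scalar weight per position, and the two gated branches are combined by summation. All function and parameter names are hypothetical.

```python
def apply_attention(feature_map, attention_weights):
    """Scale the feature vector at each position by that position's attention weight."""
    if len(feature_map) != len(attention_weights):
        raise ValueError("one attention weight is required per position")
    return [[w * x for x in vec] for vec, w in zip(feature_map, attention_weights)]


def multiplication_unit(cnn_output, attention_output, cnn_gate, attention_gate):
    """Multiply each branch by its gate weight and sum the results (assumed combination).

    Gates (1, 0) pass only the CNN output; gates (0, 1) pass only the attended output,
    so the unit acts as a switch.
    """
    return [
        [cnn_gate * c + attention_gate * a for c, a in zip(c_vec, a_vec)]
        for c_vec, a_vec in zip(cnn_output, attention_output)
    ]


# Two positions: the second is treated as background and receives weight 0.
cnn_state = [[1.0, 2.0], [3.0, 4.0]]
attended = apply_attention(cnn_state, [0.5, 0.0])
without_attention = multiplication_unit(cnn_state, attended, 1.0, 0.0)
with_attention = multiplication_unit(cnn_state, attended, 0.0, 1.0)
```

Comparing downstream accuracy for `without_attention` and `with_attention` corresponds to the verification of the attention mechanism's effect mentioned above.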

For example, the model generation unit 15 first fine-tunes the CNN unit 172A. Specifically, the model generation unit 15 has the CNN unit 172A repeatedly learn the correspondence between the appearance information and the classification information of learning images until a predetermined end condition is satisfied. A learning image is an image used to train a model and already has classification information associated with it. As learning images, for example, trademarks that have already been filed and have been assigned trademark figure classifications are used. The predetermined end condition may be set arbitrarily; for example, it may be that the change in estimation accuracy during training has converged. Alternatively, the predetermined end condition may be that the number of training iterations reaches a predetermined upper limit, or that the estimation accuracy reaches or exceeds a predetermined threshold value.
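The three example end conditions can be combined into a single check, sketched below. The function name, parameter names, and default values are hypothetical, and "convergence" is interpreted here as the accuracy varying by less than a tolerance over the last few iterations, which is one possible reading.

```python
def should_stop(accuracy_history, max_iterations=100, target_accuracy=0.95,
                tolerance=1e-3, patience=3):
    """Return True when any of the example end conditions is met.

    accuracy_history holds one estimation accuracy value per completed iteration.
    """
    n = len(accuracy_history)
    if n == 0:
        return False
    # Condition: the number of training iterations reached the upper limit.
    if n >= max_iterations:
        return True
    # Condition: the accuracy reached or exceeded the threshold.
    if accuracy_history[-1] >= target_accuracy:
        return True
    # Condition: the accuracy has converged over the last `patience` steps.
    if n > patience:
        recent = accuracy_history[-(patience + 1):]
        if max(recent) - min(recent) < tolerance:
            return True
    return False
```

A training loop would append the latest accuracy after each iteration and break as soon as `should_stop` returns True.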

Next, the model generation unit 15 trains the attention mechanism 172B using the fine-tuned CNN unit 172A. The model generation unit 15 inputs the appearance information of a learning image and adjusts the parameters of the attention mechanism 172B so that the classification information that is most probable given the feature amount output from the attention mechanism 172B via the CNN unit 172A approaches the classification information of the learning image.

In this way, the model generation unit 15 generates the appearance feature extraction model 172 in two steps: fine-tuning the CNN unit 172A and training the attention mechanism 172B.

Here, the classification text feature extraction model 173 will be described in detail with reference to FIG. 8. FIG. 8 is a diagram illustrating the classification text feature extraction model 173. As shown in FIG. 8, the classification text feature extraction model 173 includes, for example, an extracted word input layer 173A, a word feature embedding unit 173B, a weighted average unit 173C, and a classification text feature output unit 173D.

The extracted word input layer 173A is the input layer that receives the words extracted from the wording that indicates the classification of the image G. For example, each word obtained by segmenting the classification wording of the image G is input to the extracted word input layer 173A. If the classification wording is "head, upper body", then "head" and "upper body" are each input to the extracted word input layer 173A. In the example of FIG. 8, "head" is input to w1 and "upper body" is input to w2 of the extracted word input layer 173A. As in this example, the extracted word input layer 173A may be given as many nodes as there are words. When the classification wording is not already segmented into words, it may be split by part of speech through morphological analysis, and the words that indicate the classification (for example, nouns) may be extracted from it.
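For wording that is already delimited, the word extraction step can be sketched as a simple split. This is an illustrative assumption only: the delimiter set and the function name are hypothetical, and wording without delimiters would need a morphological analyzer, which is not shown here.

```python
import re


def split_classification_wording(wording):
    """Split delimited classification wording into its words.

    Handles the Japanese comma "、" as well as ASCII and full-width commas.
    """
    parts = re.split(r"[、,，]", wording)
    return [part.strip() for part in parts if part.strip()]


# Each resulting word would be assigned to one node (w1, w2, ...) of the input layer.
words = split_classification_wording("頭部、上半身")
```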

The word feature embedding unit 173B receives the features of the words input to the respective nodes of the extracted word input layer 173A. A word feature is a so-called distributed representation of the word: information indicating the characteristics of the word, obtained for example by inputting the word into a natural language processing model such as Word2Vec (hereinafter W2V) trained on a corpus.

Here, some figure classification information, and trademark figure classifications in particular, is assigned at a relatively broad conceptual level because similar trademarks must be retrieved without omission. An example of such a broad concept is the classification "26.1.1 circle". A great many images contain circles, so if the features of wording that indicates such a broad classification are used, many images will appear similar and the candidates are unlikely to be narrowed down in any meaningful way. In other words, reflecting the features of wording that indicates a relatively broad classification may degrade estimation accuracy.

As a countermeasure, the present embodiment applies weights that reduce the influence of words that cannot be expected to help narrow down the candidates. Specifically, the weighted average unit 173C weights the word vector (word feature amount) extracted for each word by the idf value of that word, and outputs the weighted average of the word vectors. The idf value is given by equation (1) below.

idf(X) = log(N_total / N_X) ... (1)

In equation (1), idf(X) is the idf value of the word X. N_total is the total number of images to which a figure classification has been assigned. N_X is the number of images assigned a figure classification that contains the word X. As equation (1) shows, a word contained in classifications assigned to many of the images has a small idf value, and a word contained in classifications assigned to few of the images has a large idf value. Weighting by these idf values lets the features of words that are effective for narrowing down have a greater influence on the classification text feature amount, while suppressing the influence of word features that cannot be expected to help with narrowing down.
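Equation (1) can be computed directly from per-image word sets, as in the sketch below. The function name and the input layout (image ID mapped to the words in its classification wording) are hypothetical, and the natural logarithm is assumed because the source does not state the base.

```python
import math


def idf_values(image_words):
    """Compute idf(X) = log(N_total / N_X) for every word X.

    image_words maps an image ID to the collection of words found in the
    classification wording assigned to that image.
    """
    n_total = len(image_words)
    image_counts = {}
    for words in image_words.values():
        for word in set(words):  # count each image at most once per word
            image_counts[word] = image_counts.get(word, 0) + 1
    return {word: math.log(n_total / n_x) for word, n_x in image_counts.items()}


# Hypothetical data: "circle" appears on every image, "nurse" on only one.
example = {
    "img1": {"circle", "nurse"},
    "img2": {"circle"},
    "img3": {"circle", "star"},
    "img4": {"circle", "star"},
}
idf = idf_values(example)
```

In this example a word present on every image gets an idf of 0, so it contributes nothing once the word vectors are weighted, while a rare word gets the largest weight.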

The classification text feature output unit 173D is the output layer that holds the output of the classification text feature extraction model 173, that is, the classification text feature amount of the image G.

Here, the multimodal model 174 will be described in detail with reference to FIG. 9. FIG. 9 is a diagram illustrating the multimodal model 174. As shown in FIG. 9, the multimodal model 174 includes, for example, a feature combination input layer 174A, a fully connected layer 174B, and an overall feature output unit 174C.

The feature combination input layer 174A is the input layer of the multimodal model 174 and receives the appearance feature amount and the classification text feature amount of the image G. The overall feature output unit 174C is the output layer that holds the output of the multimodal model 174, that is, the overall feature amount of the image G. The fully connected layer 174B is an FC (Full Connection) layer that fully connects the feature combination input layer 174A and the overall feature output unit 174C.
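The forward pass through these three parts can be sketched in a few lines. This is a simplified illustration: the weights and biases are placeholder values rather than trained parameters, no activation function is applied because none is specified, and the function names are hypothetical.

```python
def combine_features(appearance_features, text_features):
    """Feature combination input layer: concatenate the two feature vectors."""
    return list(appearance_features) + list(text_features)


def fully_connected(inputs, weights, biases):
    """Fully connected layer: every output unit sees every input unit."""
    if any(len(row) != len(inputs) for row in weights):
        raise ValueError("each weight row must match the input length")
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# Placeholder parameters: three inputs mapped to two overall feature values.
combined = combine_features([1.0, 2.0], [3.0])
overall = fully_connected(combined,
                          weights=[[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]],
                          biases=[0.0, 1.0])
```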

Here, the appearance feature amount of the image G is output from the appearance feature extraction model 172, and the classification text feature amount of the image G is output from the classification text feature extraction model 173. Because the two feature amounts come from different models, their value ranges may not be comparable. If feature amounts with different value ranges are simply combined and input as they are, the model may learn only the correspondence between one of the feature amounts and the output, which makes biased estimates that do not reflect the other feature amount more likely.

As a countermeasure, the present embodiment performs preprocessing that normalizes the two feature amounts input to the multimodal model 174. Specifically, the model generation unit 15 multiplies one of the feature amounts by a predetermined uniform value so that the appearance feature amount and the classification text feature amount of the image G fall within comparable ranges (for example, 0 to 1). If necessary, the model generation unit 15 multiplies the other feature amount by a different uniform value. This allows the model generation unit 15 to train the multimodal model 174 to output an overall feature amount that takes both feature amounts into account.
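One way to pick the uniform multipliers is sketched below. The source only says that a predetermined uniform value is used, so deriving it from the largest magnitude in the training features is an assumption, as are the function names and the example values.

```python
def uniform_scale_factor(feature_vectors, upper=1.0):
    """Return one multiplier that brings the largest magnitude to `upper`."""
    peak = max(abs(x) for vec in feature_vectors for x in vec)
    return upper / peak if peak != 0.0 else 1.0


def scale(vec, factor):
    """Multiply every element of a feature vector by the same factor."""
    return [factor * x for x in vec]


# Hypothetical training features with very different value ranges.
appearance_train = [[0.0, 2.0], [1.0, 4.0]]
text_train = [[0.125, 0.5], [0.25, 0.25]]

appearance_factor = uniform_scale_factor(appearance_train)  # 1 / 4.0
text_factor = uniform_scale_factor(text_train)              # 1 / 0.5

# Both parts of the combined input now lie in a comparable range.
model_input = scale([2.0, 4.0], appearance_factor) + scale([0.25, 0.5], text_factor)
```

The same two factors would have to be reused at estimation time so that new images are scaled consistently with the training data.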

Here, the flow of the processing performed by the image similarity estimation system 1 will be described with reference to FIGS. 10 to 12. FIGS. 10 to 12 are flowcharts showing the flow of the processing performed by the image similarity estimation system 1 of the embodiment.

FIG. 10 shows the flow of the process in which the image similarity estimation system 1 extracts the classification text feature amount of an image using the classification text feature extraction model 173. The image similarity estimation system 1 acquires the classification information of the image G (step S10). Using the classification information, the image similarity estimation system 1 segments the wording that indicates the classification of the image G into individual words (step S11). The image similarity estimation system 1 extracts a word vector for each word (step S12).

Meanwhile, the image similarity estimation system 1 calculates the idf value of each word (step S13). The image similarity estimation system 1 weights (multiplies) the word vector of each word by the idf value of that word (step S14). The image similarity estimation system 1 outputs the weighted average of the idf-weighted word vectors as the classification text feature amount of the image G (step S15).
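Steps S14 and S15 amount to an idf-weighted mean of the word vectors, which can be sketched as follows. The function name and the example vectors and weights are hypothetical; dividing by the sum of the weights is the usual definition of a weighted average and is assumed here.

```python
def idf_weighted_average(word_vectors, idf_weights):
    """Weight each word vector by its idf value and average over the words."""
    if not word_vectors or len(word_vectors) != len(idf_weights):
        raise ValueError("one idf weight is required per word vector")
    total = sum(idf_weights)
    if total == 0:
        raise ValueError("all idf weights are zero")
    dim = len(word_vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(word_vectors, idf_weights)) / total
        for i in range(dim)
    ]


# Two hypothetical 2-dimensional word vectors; the first word is rarer (idf 3.0).
text_feature = idf_weighted_average([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
```

The result leans toward the vector of the rarer word, which is the intended effect of the idf weighting.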

Although FIG. 10 illustrates a flow in which the word vectors are extracted in step S12 and the idf values of the words are then calculated in step S13, it is sufficient that the word vectors can be multiplied by the idf values by step S14 at the latest. The word vectors may instead be extracted by performing steps S10 to S12 after the idf values have been calculated. Alternatively, the idf values may be calculated in a process that is independent of the word vector extraction shown in FIG. 10.

FIG. 11 shows the flow of the process in which the image similarity estimation system 1 generates the appearance feature extraction model 172 and the multimodal model 174. The image similarity estimation system 1 acquires the appearance information of the learning images (step S20). The image similarity estimation system 1 acquires the classification information of the learning images (step S21). The image similarity estimation system 1 trains the CNN unit 172A of the appearance feature extraction model 172 on the correspondence between the appearance information and the classification information of the learning images (step S22). The image similarity estimation system 1 trains the attention mechanism 172B of the appearance feature extraction model 172 so that the classification information most likely to be output, based on the internal state of the CNN unit 172A obtained by inputting the appearance information of a learning image, approaches the classification information of that learning image (step S23). The image similarity estimation system 1 thereby generates the appearance feature extraction model 172.

The image similarity estimation system 1 extracts the appearance feature amount of the learning image by inputting the appearance information of the learning image into the appearance feature extraction model 172 (step S24). The image similarity estimation system 1 extracts the classification text feature amount of the learning image by inputting the classification information of the learning image into the classification text feature extraction model 173 (step S25). The image similarity estimation system 1 normalizes the appearance feature amount and the classification text feature amount of the learning image (step S26). The image similarity estimation system 1 generates the multimodal model 174 by training the fully connected layer 174B (adjusting its parameters) so that the classification information most likely to be output, based on the normalized appearance feature amount and classification text feature amount of the learning image, approaches the classification information of the learning image (which here corresponds to the overall feature amount) (step S27). The image similarity estimation system 1 stores the generated appearance feature extraction model 172 and multimodal model 174 (step S28).

Although FIG. 11 illustrates a flow in which the appearance feature amount is extracted in step S24 and the classification text feature amount is then extracted in step S25, it suffices that the two feature amounts (the appearance feature amount and the classification text feature amount) can be normalized in step S26. The appearance feature amount may therefore be extracted after the classification text feature amount.

FIG. 12 shows the flow of processing in which the image similarity estimation system 1 estimates the similarity between two images (here, a target image and a comparison image). The image similarity estimation system 1 acquires the appearance information of the target image (step S30) and extracts the appearance feature amount of the target image by using the acquired information and the appearance feature extraction model 172 (step S31). The image similarity estimation system 1 also acquires the classification information of the target image (step S32) and extracts the classification text feature amount of the target image by using the acquired information and the classification text feature extraction model 173 (step S33). The image similarity estimation system 1 then extracts the overall feature amount of the target image by using the appearance feature amount and the classification text feature amount of the target image and the multimodal model 174 (step S34).
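The composition of steps S31, S33, and S34 can be sketched as follows. This is only an illustration: the function name `extract_overall_feature`, the three callables, and the toy stand-in models are hypothetical and are not taken from the embodiment, which uses the trained models 172, 173, and 174.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def extract_overall_feature(
    appearance_info: Sequence[float],
    classification_info: Sequence[str],
    appearance_model: Callable[[Sequence[float]], Vector],
    text_model: Callable[[Sequence[str]], Vector],
    multimodal_model: Callable[[Vector, Vector], Vector],
) -> Vector:
    """Chain the three models in the order of steps S31, S33, and S34."""
    appearance_feature = appearance_model(appearance_info)      # step S31
    text_feature = text_model(classification_info)              # step S33
    return multimodal_model(appearance_feature, text_feature)   # step S34


# Toy stand-ins (assumptions for illustration only, not the trained models).
def toy_appearance_model(pixels: Sequence[float]) -> Vector:
    return [sum(pixels) / len(pixels)]


def toy_text_model(terms: Sequence[str]) -> Vector:
    return [float(len(terms))]


def toy_multimodal_model(appearance: Vector, text: Vector) -> Vector:
    return list(appearance) + list(text)


overall = extract_overall_feature(
    [0.0, 1.0], ["circle", "star"],
    toy_appearance_model, toy_text_model, toy_multimodal_model,
)
```

The same function would be applied unchanged to each comparison image, which is why the flow for the comparison image is described as the same as that for the target image.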

Meanwhile, the image similarity estimation system 1 extracts the overall feature amount of the comparison image (step S35). The flow of processing in which the image similarity estimation system 1 extracts the overall feature amount of the comparison image is the same as that for the target image.

The image similarity estimation system 1 determines whether the overall feature amounts have been calculated for all comparison images whose similarity to the target image is to be estimated (step S36). The image similarity estimation system 1 calculates, as a cosine similarity, the distance in the vector space between the overall feature of the target image and the overall feature of each comparison image (step S37).
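The cosine similarity of step S37 can be sketched with the standard formula, the dot product divided by the product of the vector norms. The function names, the dictionary of comparison images, and the choice to return 0.0 for a zero vector are assumptions made for this illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two overall feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_by_similarity(
    target: Sequence[float],
    comparisons: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every comparison image against the target, most similar first."""
    scored = [(name, cosine_similarity(target, vec)) for name, vec in comparisons.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_by_similarity(
    [1.0, 0.0],
    {"a": [0.0, 1.0], "b": [1.0, 1.0], "c": [3.0, 0.0]},
)
```

Because cosine similarity depends only on direction, comparison image "c" ranks first here even though its vector is three times as long as the target's.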

Although FIG. 12 illustrates a flow in which the overall feature amounts are first calculated for all comparison images whose similarity to the target image is to be estimated (step S36) and the respective cosine similarities are then calculated in step S37, it suffices that the similarity between the target image and each comparison image can be calculated. The cosine similarity may instead be calculated each time the overall feature amount of a comparison image is extracted.

As described above, the image similarity estimation system 1 of the embodiment includes the appearance information acquisition unit 10, the appearance feature extraction unit 11, the classification information acquisition unit 12, the classification text feature extraction unit 13, the overall feature extraction unit 14, the model generation unit 15, and the image similarity estimation unit 16. The appearance information acquisition unit 10 acquires the appearance information 170 indicating the appearance of an image G. The appearance feature extraction unit 11 extracts the appearance feature amount, which indicates the features of the appearance of the image G, by using the appearance information 170 of the image G and the appearance feature extraction model 172. The classification information acquisition unit 12 acquires the classification information 171 indicating the classification of the image G. The classification text feature extraction unit 13 extracts the classification text feature amount, which indicates the features of the wording that indicates the classification of the image G, by using the classification information 171 of the image G and the classification text feature extraction model 173. The overall feature extraction unit 14 extracts the overall feature amount, which is a feature of the image G as a whole, by using the appearance feature amount and the classification text feature amount of the image G and the multimodal model 174. The model generation unit 15 generates the appearance feature extraction model 172 and the multimodal model 174. The image similarity estimation unit 16 estimates the degree of similarity between the target image and the comparison image based on the overall feature amount of the target image and the overall feature amount of the comparison image. The image similarity estimation system 1 of the embodiment can thereby extract features that reflect both the appearance and the concept of the image G, and can determine whether images are similar in consideration of not only appearance but also concept.

In the image similarity estimation system 1 of the embodiment, the appearance feature extraction model 172 includes the attention mechanism 172B, which outputs a value obtained by weighting the internal state of a deep learning model. The model generation unit 15 causes the attention mechanism 172B to learn weights that correspond to the relationship between the appearance information and the classification information of the learning image. The image similarity estimation system 1 of the embodiment can thereby focus on the parts of the internal state of the appearance feature extraction model 172 that are effective for extracting appearance features, and can extract the appearance feature amount more accurately.
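One common way to weight an internal state is a softmax-weighted sum, sketched below. The text above does not specify the form of the attention mechanism 172B, so the softmax form, the function names, and the representation of the internal state as a list of vectors with one learned score each are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(states: Sequence[Sequence[float]], scores: Sequence[float]) -> List[float]:
    """Return the weighted sum of internal state vectors.

    `states` holds one vector per position of the internal state, and
    `scores` holds one score per position (hypothetically, learned values).
    """
    if len(states) != len(scores):
        raise ValueError("one score is required per state vector")
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]


uniform = attend([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
focused = attend([[1.0, 0.0], [0.0, 1.0]], [100.0, 0.0])
```

With equal scores the output is the plain average of the state vectors; a much larger score on one position makes the output almost identical to that position's vector, which is the sense in which the mechanism focuses on the useful parts of the internal state.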

In the image similarity estimation system 1 of the embodiment, the classification text feature extraction model 173 is a model that extracts the features of the wording based on values obtained by weighting the word feature amount of each word included in the wording by the idf value of that word. The idf value is calculated by performing statistical processing on an image group, which is a set of classified images. The image similarity estimation system 1 of the embodiment can thereby weaken the influence of figure classifications that are not expected to help narrow down candidates and strengthen the influence of figure classifications that are effective for narrowing down. A classification text feature amount that is more effective for narrowing down can therefore be extracted.

In the image similarity estimation system 1 of the embodiment, the idf value is calculated by using the ratio of the number of images assigned the same classification as the image from which the classification text feature amount is extracted to the number of images in the image group, which is a set of classified images. The image similarity estimation system 1 of the embodiment thereby achieves the same effect as described above.
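A minimal sketch of idf weighting over classification terms follows. It assumes the usual logarithmic form log(N / n), where N is the number of images in the image group and n is the number of images that carry a given classification term; the logarithm, the summation of weighted word vectors, and all names and sample data are assumptions for illustration rather than details stated above.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def compute_idf(image_classifications: Sequence[Iterable[str]]) -> Dict[str, float]:
    """Compute an idf value per classification term from a set of classified images."""
    n_images = len(image_classifications)
    image_counts: Counter = Counter()
    for terms in image_classifications:
        image_counts.update(set(terms))  # count each term once per image
    return {term: math.log(n_images / count) for term, count in image_counts.items()}


def classification_text_feature(
    terms: Iterable[str],
    word_vectors: Dict[str, Sequence[float]],
    idf: Dict[str, float],
) -> List[float]:
    """Sum the word vectors of the terms, each scaled by its idf value."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for term in terms:
        weight = idf[term]
        for d, value in enumerate(word_vectors[term]):
            total[d] += weight * value
    return total


# Hypothetical image group: "circle" appears on every image, "star" on one.
group = [["circle", "star"], ["circle"], ["circle", "lion"], ["circle"]]
idf = compute_idf(group)
vectors = {"circle": [1.0, 0.0], "star": [0.0, 1.0], "lion": [1.0, 1.0]}
feature = classification_text_feature(["circle", "star"], vectors, idf)
```

In this toy group the term "circle" gets an idf of zero because every image carries it, so it contributes nothing to the feature, while the rarer term "star" dominates. This mirrors the stated aim of weakening classifications that do not help narrow down candidates.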

In the image similarity estimation system 1 of the embodiment, the model generation unit 15 normalizes the appearance feature amount and the classification text feature amount of the learning image so that both fall within the same range. The model generation unit 15 generates the multimodal model 174 by causing a learning model to learn the correspondence between the normalized appearance feature amount and classification text feature amount of the learning image and the classification information. The image similarity estimation system 1 of the embodiment can thereby extract an overall feature amount that reflects both feature amounts without being biased toward either one. Similar images can therefore be estimated in view of both appearance and concept.
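Bringing the two feature amounts into the same range can be sketched with min-max scaling to [0, 1]. The text above states only that both are normalized into the same range, so min-max scaling, per-vector scaling, and concatenation as the input to the multimodal model are assumptions for illustration, as are the function names.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale a feature vector linearly so its entries lie in [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # assumption: a constant vector maps to zeros
    return [(v - low) / (high - low) for v in values]


def fuse_inputs(appearance_feature: Sequence[float], text_feature: Sequence[float]) -> List[float]:
    """Normalize both feature vectors and concatenate them into one input."""
    return min_max_normalize(appearance_feature) + min_max_normalize(text_feature)


fused = fuse_inputs([10.0, 20.0, 30.0], [0.1, 0.3])
```

Without this step, the appearance values in the example (tens) would numerically swamp the text values (tenths); after scaling, both halves of the fused vector span the same range.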

The image similarity estimation system 1 of the embodiment may also be applied as a learning device. In this case, the learning device includes the appearance information acquisition unit 10, the appearance feature extraction unit 11, the classification information acquisition unit 12, the classification text feature extraction unit 13, the overall feature extraction unit 14, and the model generation unit 15. The learning device can thereby generate a model capable of extracting overall features that reflect both the appearance and the concept of the image G.

The image similarity estimation system 1 of the embodiment may also be applied as an estimation device. In this case, the estimation device includes the appearance information acquisition unit 10, the appearance feature extraction unit 11, the classification information acquisition unit 12, the classification text feature extraction unit 13, the overall feature extraction unit 14, and the image similarity estimation unit 16. The estimation device can thereby extract overall features that reflect both the appearance and the concept of the image G, and can therefore estimate similar images in consideration of both.

All or part of the image similarity estimation system 1 of the above-described embodiment may be realized by a computer. In that case, a program for realizing its functions may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into a computer system and executed. The term "computer system" as used here includes an OS and hardware such as peripheral devices. The term "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or via a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as volatile memory inside a computer system that serves as a server or a client in that case. The program may realize only part of the functions described above, may realize those functions in combination with a program already recorded in the computer system, or may be realized by using a programmable logic device such as an FPGA (Field Programmable Gate Array).

Although an embodiment of the present invention has been described in detail above with reference to the drawings, the specific configuration is not limited to this embodiment and includes designs and the like within a range that does not depart from the gist of the present invention.

1 Image similarity estimation system
10 Appearance information acquisition unit
11 Appearance feature extraction unit
12 Classification information acquisition unit
13 Classification text feature extraction unit
14 Overall feature extraction unit
15 Model generation unit
16 Image similarity estimation unit
17 Storage unit
18 Estimation result output unit
170 Appearance information
171 Classification information
172 Appearance feature extraction model
173 Classification text feature extraction model
174 Multimodal model

Claims

An image similarity estimation system comprising:
an appearance information acquisition unit that acquires appearance information indicating the appearance of an image;
an appearance feature extraction unit that extracts an appearance feature amount indicating a feature of the appearance of the image by using the appearance information of the image and an appearance feature extraction model;
a classification information acquisition unit that acquires classification information indicating the classification of the image;
a classification text feature extraction unit that extracts a classification text feature amount indicating a feature of the wording that indicates the classification of the image by using the classification information of the image and a classification text feature extraction model;
an overall feature extraction unit that extracts an overall feature amount, which is a feature of the image as a whole, by using the appearance feature amount of the image, the classification text feature amount, and a multimodal model;
a model generation unit that generates the appearance feature extraction model and the multimodal model; and
an image similarity estimation unit that estimates the degree of similarity between a target image and a comparison image based on the overall feature amount of the target image and the overall feature amount of the comparison image,
wherein the appearance feature extraction model is a model that outputs the appearance feature amount of the image from the appearance information of the image,
the model generation unit generates the appearance feature extraction model by causing a learning model to learn the correspondence between the appearance information and the classification information of a learning image,
the classification text feature extraction model is a model that extracts a feature amount of the wording indicating the classification,
the multimodal model is a model that outputs the overall feature amount of the image from the appearance feature amount and the classification text feature amount of the image, and
the model generation unit generates the multimodal model by causing a learning model to learn the correspondence between, on the one hand, the appearance feature amount of the learning image extracted by the appearance feature extraction unit and the classification text feature amount of the learning image extracted by the classification text feature extraction unit and, on the other hand, the classification information of the learning image.

The image similarity estimation system according to claim 1, wherein
the appearance feature extraction model includes an attention mechanism that outputs a value obtained by weighting the internal state of a deep learning model, and
the model generation unit causes the attention mechanism to learn weights that correspond to the relationship between the appearance information and the classification information of the learning image.

The image similarity estimation system according to claim 1 or 2, wherein
the classification text feature extraction model is a model that extracts the features of the wording based on values obtained by weighting the word feature amount of each word included in the wording by the idf value of that word, and
the idf value is calculated by performing statistical processing on an image group, which is a set of classified images.

The image similarity estimation system according to claim 3, wherein
the idf value is calculated by using the ratio of the number of images assigned the same classification as the image from which the classification text feature amount is extracted to the number of images in the image group, which is a set of classified images.

The image similarity estimation system according to any one of claims 1 to 4, wherein
the model generation unit normalizes the appearance feature amount and the classification text feature amount of the learning image so that both fall within the same range, and generates the multimodal model by causing a learning model to learn the correspondence between the normalized appearance feature amount and classification text feature amount of the learning image and the classification information.

A learning device comprising:
an appearance information acquisition unit that acquires appearance information indicating the appearance of an image;
an appearance feature extraction unit that extracts an appearance feature amount indicating a feature of the appearance of the image by using the appearance information of the image and an appearance feature extraction model;
a classification information acquisition unit that acquires classification information indicating the classification of the image;
a classification text feature extraction unit that extracts a classification text feature amount indicating a feature of the wording that indicates the classification of the image by using the classification information of the image and a classification text feature extraction model;
an overall feature extraction unit that extracts an overall feature amount, which is a feature of the image as a whole, by using the appearance feature amount of the image, the classification text feature amount, and a multimodal model; and
a model generation unit that generates the appearance feature extraction model and the multimodal model,
wherein the appearance feature extraction model is a model that outputs the appearance feature amount of the image from the appearance information of the image,
the model generation unit generates the appearance feature extraction model by causing a learning model to learn the correspondence between the appearance information and the classification information of a learning image,
the classification text feature extraction model is a model that extracts a feature amount of the wording indicating the classification,
the multimodal model is a model that outputs the overall feature amount of the image from the appearance feature amount and the classification text feature amount of the image, and
the model generation unit generates the multimodal model by causing a learning model to learn the correspondence between, on the one hand, the appearance feature amount of the learning image extracted by the appearance feature extraction unit and the classification text feature amount of the learning image extracted by the classification text feature extraction unit and, on the other hand, the classification information of the learning image.

An estimation device comprising:
an appearance information acquisition unit that acquires appearance information indicating the appearance of an image;
an appearance feature extraction unit that extracts an appearance feature amount indicating a feature of the appearance of the image by using the appearance information of the image and an appearance feature extraction model;
a classification information acquisition unit that acquires classification information indicating the classification of the image;
a classification text feature extraction unit that extracts a classification text feature amount indicating a feature of the wording that indicates the classification of the image by using the classification information of the image and a classification text feature extraction model;
an overall feature extraction unit that extracts an overall feature amount, which is a feature of the image as a whole, by using the appearance feature amount of the image, the classification text feature amount, and a multimodal model; and
an image similarity estimation unit that estimates the degree of similarity between a target image and a comparison image based on the overall feature amount of the target image and the overall feature amount of the comparison image,
wherein the appearance feature extraction model is a model that outputs the appearance feature amount of the image from the appearance information of the image, and is generated by causing a learning model to learn the correspondence between the appearance information and the classification information of a learning image,
the classification text feature extraction model is a model that extracts a feature amount of the wording indicating the classification, and
the multimodal model is a model that outputs the overall feature amount of the image from the appearance feature amount and the classification text feature amount of the image, and is generated by causing a learning model to learn the correspondence between, on the one hand, the appearance feature amount of the learning image extracted by the appearance feature extraction unit and the classification text feature amount of the learning image extracted by the classification text feature extraction unit and, on the other hand, the classification information of the learning image.

A program for causing a computer to function as:
appearance information acquisition means for acquiring appearance information indicating the appearance of an image;
appearance feature extraction means for extracting an appearance feature amount indicating a feature of the appearance of the image by using the appearance information of the image and an appearance feature extraction model;
classification information acquisition means for acquiring classification information indicating the classification of the image;
classification text feature extraction means for extracting a classification text feature amount indicating a feature of the wording that indicates the classification of the image by using the classification information of the image and a classification text feature extraction model;
overall feature extraction means for extracting an overall feature amount, which is a feature of the image as a whole, by using the appearance feature amount of the image, the classification text feature amount, and a multimodal model; and
model generation means for generating the appearance feature extraction model and the multimodal model,
wherein the appearance feature extraction model is a model that outputs the appearance feature amount of the image from the appearance information of the image,
the model generation means generates the appearance feature extraction model by causing a learning model to learn the correspondence between the appearance information and the classification information of a learning image,
the classification text feature extraction model is a model that extracts a feature amount of the wording indicating the classification,
the multimodal model is a model that outputs the overall feature amount of the image from the appearance feature amount and the classification text feature amount of the image, and
the model generation means generates the multimodal model by causing a learning model to learn the correspondence between, on the one hand, the appearance feature amount of the learning image extracted by the appearance feature extraction means and the classification text feature amount of the learning image extracted by the classification text feature extraction means and, on the other hand, the classification information of the learning image.

A program for causing a computer to function as:
appearance information acquisition means for acquiring appearance information indicating the appearance of an image;
appearance feature extraction means for extracting an appearance feature amount indicating a feature of the appearance of the image by using the appearance information of the image and an appearance feature extraction model;
classification information acquisition means for acquiring classification information indicating the classification of the image;
classification text feature extraction means for extracting a classification text feature amount indicating a feature of the wording that indicates the classification of the image by using the classification information of the image and a classification text feature extraction model;
overall feature extraction means for extracting an overall feature amount, which is a feature of the image as a whole, by using the appearance feature amount of the image, the classification text feature amount, and a multimodal model; and
image similarity estimation means for estimating the degree of similarity between a target image and a comparison image based on the overall feature amount of the target image and the overall feature amount of the comparison image,
wherein the appearance feature extraction model is a model that outputs the appearance feature amount of the image from the appearance information of the image, and is generated by causing a learning model to learn the correspondence between the appearance information and the classification information of a learning image,
the classification text feature extraction model is a model that extracts a feature amount of the wording indicating the classification, and
the multimodal model is a model that outputs the overall feature amount of the image from the appearance feature amount and the classification text feature amount of the image, and is generated by causing a learning model to learn the correspondence between, on the one hand, the appearance feature amount of the learning image extracted by the appearance feature extraction means and the classification text feature amount of the learning image extracted by the classification text feature extraction means and, on the other hand, the classification information of the learning image.