JP7394680B2

JP7394680B2 - Image similarity estimation system, learning device, estimation device, and program

Info

Publication number: JP7394680B2
Application number: JP2020057919A
Authority: JP
Inventors: 大地小池; 高志末永
Original assignee: 株式会社Ｎｔｔデータグループ
Priority date: 2020-03-27
Filing date: 2020-03-27
Publication date: 2023-12-08
Anticipated expiration: 2040-03-27
Also published as: JP2021157570A

Description

特許法第３０条第２項適用（１）発行者名一般社団法人電子情報通信学会刊行物名信学技報，ｖｏｌ．１１９，ｎｏ．３８６，ＭＶＥ２０１９－３０，ｐｐ．３１－３２，２０２０年１月発行年月日令和２年１月１６日（２）公開日令和２年１月２３日集会名メディアエクスペリエンス・バーチャル環境基礎研究会（ＭＶＥ）開催場所奈良先端科学技術大学院大学情報科学棟Ａ会場（奈良県生駒市高山町８９１６番地の５（けいはんな学研都市））Application of Article 30, Paragraph 2 of the Patent Act (1) Publisher name: Institute of Electronics, Information and Communication Engineers Publication name: IEICE Technical Report, vol. 119, no. 386, MVE2019-30, pp. 31-32, January 2020 Publication date January 16, 2020 (2) Publication date January 23, 2020 Meeting name Media Experience Virtual Environment Basic Study Group (MVE) Venue Nara Advanced Research Center Graduate University of Science and Technology Information Science Building Venue A (8916-5 Takayama-cho, Ikoma City, Nara Prefecture (Keihanna Science City))

本発明は、画像類似度推定システム、学習装置、推定装置、及びプログラムに関する。 The present invention relates to an image similarity estimation system, a learning device, an estimation device, and a program.

特許庁における商標出願の審査においては、出願に係る商標と、既に出願済みの商標とが類似するか否かが判断される。文字や図形等などの画像が類似するか否かを判定する技術として、画像の特徴に基づく深層学習を行い、類似する画像を抽出するものがある。例えば、特許文献１には、画像の複数個所を特定し、特定したそれぞれの箇所の特徴量を算出し、算出したそれぞれの特徴量に基づき、類似する画像を抽出する技術が開示されている。 When examining a trademark application at the Japan Patent Office, it is determined whether the trademark in the application is similar to a trademark that has already been applied for. As a technique for determining whether images of characters, figures, etc. are similar, there is a technique that performs deep learning based on image characteristics to extract similar images. For example, Patent Document 1 discloses a technique for identifying multiple locations in an image, calculating feature amounts for each of the identified locations, and extracting similar images based on the calculated feature amounts.

特許第５９３４６５３号公報Patent No. 5934653

しかしながら、商標の類否は、出願商標及び引用商標がその外観、称呼又は観念等によって需要者に与える印象、記憶、連想等を総合して全体的に観察し、出願商標を指定商品又は指定役務に使用した場合に引用商標と出所混同のおそれがあるか否かにより判断する（商標法第４条第１項第１１号の審査基準）。つまり、外観のみならず、称呼及び観念のそれぞれの観点から、総合的に商標の類否が判断される。このため、特許文献１の技術を用いて画像の類似性、つまり外観の類似性のみを判定するだけでは、商標の類比を判断するうえで不十分となる問題があった。 However, the similarity of trademarks is determined by comprehensively observing the impression, memory, association, etc. that the applied trademark and cited trademark give to consumers through their appearance, pronunciation, concept, etc., and Judgment will be made based on whether there is a risk of confusion with the source of the cited trademark when used in a trademark (examination criteria under Article 4, Paragraph 1, Item 11 of the Trademark Law). In other words, the similarity of trademarks is judged comprehensively from the viewpoints of not only appearance but also pronunciation and concept. For this reason, there is a problem in that simply determining the similarity of images, that is, the similarity of appearance only using the technique of Patent Document 1, is insufficient for determining the similarity of trademarks.

本発明は、上記問題を解決すべくなされたもので、その目的は、外観のみならず観念を考慮して画像の類否を判定することができる画像類似度推定システム、学習装置、推定装置、及びプログラムを提供することにある。 The present invention has been made to solve the above-mentioned problems, and its purpose is to provide an image similarity estimation system, a learning device, an estimation device, and a learning device capable of determining the similarity of images by considering not only the appearance but also the concept. and programs.

上記問題を解決するために、本発明の一態様は、画像の外観を示す外観情報を取得する外観情報取得部と、画像における前記外観情報、及び外観特徴抽出モデルを用いて、当該画像の外観の特徴を示す外観特徴量を抽出する外観特徴抽出部と、画像の分類を示す分類情報を取得する分類情報取得部と、画像における前記分類情報、及び分類テキスト特徴抽出モデルを用いて、当該画像の分類を示す文言の特徴を示す分類テキスト特徴量を抽出する分類テキスト特徴抽出部と、画像における前記外観特徴量、前記分類テキスト特徴量、及びマルチモーダルモデルを用いて、当該画像における画像全体の特徴である全体特徴量を抽出する全体特徴抽出部と、前記外観特徴抽出モデル、及び前記マルチモーダルモデルを生成するモデル生成部と、対象画像における前記全体特徴量、及び比較画像における前記全体特徴量に基づいて、前記対象画像と前記比較画像の類似度合いを推定する画像類似度推定部と、を備え、前記外観特徴抽出モデルは、画像における前記外観情報から当該画像における前記外観特徴量を出力するモデルであり、前記モデル生成部は、学習用画像における前記外観情報と前記分類情報との対応関係を学習モデルに学習させることにより、前記外観特徴抽出モデルを生成し、前記分類テキスト特徴抽出モデルは、分類を示す文言の特徴量を抽出するモデルであり、前記マルチモーダルモデルは、画像における前記外観特徴量及び前記分類テキスト特徴量から、当該画像における前記全体特徴量を出力するモデルであり、前記モデル生成部は、前記外観特徴抽出部によって抽出された前記学習用画像における前記外観特徴量、及び前記分類テキスト特徴抽出部によって抽出された前記学習用画像における前記分類テキスト特徴量と、前記学習用画像における前記分類情報との対応関係を学習モデルに学習させることにより、前記マルチモーダルモデルを生成する、ことを特徴とする画像類似度推定システムである。 In order to solve the above problem, one aspect of the present invention uses an appearance information acquisition unit that acquires appearance information indicating the appearance of an image, the appearance information in the image, and an appearance feature extraction model to determine the appearance of the image. an appearance feature extraction unit that extracts appearance feature quantities indicating the characteristics of the image; a classification information acquisition unit that acquires classification information indicating the classification of the image; A classified text feature extraction unit that extracts a classified text feature indicating the feature of the wording that indicates the classification of the image, the appearance feature in the image, the classified text feature, and a multimodal model to extract the entire image in the image. an overall feature extraction unit that extracts an overall feature amount that is a feature; a model generation unit that generates the appearance feature extraction model and the multimodal model; the overall feature amount in the target image and the overall feature amount in the comparison image. an image similarity estimating unit that estimates a degree of similarity between the target image and the comparison image based on the above, and the appearance feature extraction model outputs the appearance feature amount in the image from the appearance information in the image. The model generation unit generates the appearance feature extraction model by making the learning model learn the correspondence between the appearance information and the classification information in the learning image, and the classified text feature extraction model , the multimodal model is a model that outputs the overall feature amount of the image from the appearance feature amount and the classified text feature amount of the image; The model generation unit includes the appearance feature amount in the training image extracted by the appearance feature extraction unit, the classified text feature amount in the training image extracted by the classified text feature extraction unit, and the training image. The image similarity estimation system is characterized in that the multimodal model is generated by causing a learning model to learn the correspondence relationship between the images and the classification information.

また、本発明の一態様は、上記に記載の画像類似度推定システムにおいて、前記外観特徴抽出モデルは、深層学習の学習モデルの内部状態に重み付けした値を出力するアテンション機構を含み、前記モデル生成部は、前記アテンション機構に、前記学習用画像における前記外観情報と前記分類情報との対応関係に応じた重みを学習させるようにしてもよい。 Further, in one aspect of the present invention, in the image similarity estimation system described above, the appearance feature extraction model includes an attention mechanism that outputs a value weighted to an internal state of a deep learning learning model, and the model generation The unit may cause the attention mechanism to learn a weight according to a correspondence relationship between the appearance information and the classification information in the learning image.

また、本発明の一態様は、上記に記載の画像類似度推定システムにおいて、前記分類テキスト特徴抽出モデルは、文言に含まれる単語の特徴量を示す単語特徴量を、前記単語のｉｄｆ値で重みづけした値に基づいて当該文言の特徴を抽出するモデルであり、前記ｉｄｆ値は、分類済みの画像の集合である画像群に統計処理を行うことにより算出される値であるようにしてもよい。 In addition, one aspect of the present invention is that in the image similarity estimation system described above, the classified text feature extraction model weights word features indicating features of words included in a sentence using an idf value of the word. The idf value may be a value calculated by performing statistical processing on an image group that is a set of classified images. .

また、本発明の一態様は、上記に記載の画像類似度推定システムにおいて、前記ｉｄｆ値は、分類済みの画像の集合である画像群の数における、前記分類テキスト特徴量を含む画像の数に対する割合を用いて算出される値であるようにしてもよい。 Further, in one aspect of the present invention, in the image similarity estimation system described above, the idf value is determined based on the number of images including the classified text feature amount in the number of image groups that are a set of classified images. The value may be calculated using a ratio.

また、本発明の一態様は、上記に記載の画像類似度推定システムにおいて、前記モデル生成部は、前記学習用画像における前記外観特徴量及び前記分類テキスト特徴量が、同一の範囲内に含まれるデータとなるように正規化する前処理を行い、前記前処理を行った前記学習用画像における、前記外観特徴量及び前記分類テキスト特徴量と前記分類情報との対応関係を学習モデルに学習させることにより、前記マルチモーダルモデルを生成するようにしてもよい。 Further, in one aspect of the present invention, in the image similarity estimation system described above, the model generation unit is configured such that the appearance feature amount and the classified text feature amount in the learning image are included within the same range. performing preprocessing to normalize the data, and causing a learning model to learn the correspondence between the appearance feature amount, the classified text feature amount, and the classification information in the learning image subjected to the preprocessing. The multimodal model may be generated by:

また、本発明の一態様は、画像の外観を示す外観情報を取得する外観情報取得部と、画像における前記外観情報、及び外観特徴抽出モデルを用いて、当該画像の外観の特徴を示す外観特徴量を抽出する外観特徴抽出部と、画像の分類を示す分類情報を取得する分類情報取得部と、画像における前記分類情報、及び分類テキスト特徴抽出モデルを用いて、当該画像の分類を示す文言の特徴を示す分類テキスト特徴量を抽出する分類テキスト特徴抽出部と、画像における前記外観特徴量、前記分類テキスト特徴量、及びマルチモーダルモデルを用いて、当該画像における画像全体の特徴である全体特徴量を抽出する全体特徴抽出部と、前記外観特徴抽出モデル、及び前記マルチモーダルモデルを生成するモデル生成部と、を備え、前記外観特徴抽出モデルは、画像における前記外観情報から当該画像における前記外観特徴量を出力するモデルであり、前記モデル生成部は、学習用画像における前記外観情報と前記分類情報との対応関係を学習モデルに学習させることにより、前記外観特徴抽出モデルを生成し、前記分類テキスト特徴抽出モデルは、分類を示す文言の特徴量を抽出するモデルであり、前記マルチモーダルモデルは、画像における前記外観特徴量及び前記分類テキスト特徴量から、当該画像における前記全体特徴量を出力するモデルであり、前記モデル生成部は、前記外観特徴抽出部によって抽出された前記学習用画像における前記外観特徴量、及び前記分類テキスト特徴抽出部によって抽出された前記学習用画像における前記分類テキスト特徴量と、前記学習用画像における前記分類情報との対応関係を学習モデルに学習させることにより、前記マルチモーダルモデルを生成する学習装置である。 Further, one aspect of the present invention provides an appearance information acquisition unit that acquires appearance information indicating the appearance of an image, and an appearance feature indicating the appearance feature of the image using the appearance information in the image and an appearance feature extraction model. an appearance feature extraction unit that extracts the quantity; a classification information acquisition unit that acquires classification information indicating the classification of the image; A classified text feature extraction unit that extracts a classified text feature indicating a feature, and a global feature that is a feature of the entire image in the image using the appearance feature, the classified text feature, and a multimodal model in the image. and a model generation unit that generates the appearance feature extraction model and the multimodal model, and the appearance feature extraction model extracts the appearance features in the image from the appearance information in the image. The model generation unit generates the appearance feature extraction model by making the learning model learn the correspondence between the appearance information and the classification information in the learning image, and generates the appearance feature extraction model and generates the classification text. The feature extraction model is a model that extracts the feature amount of words indicating classification, and the multimodal model is a model that outputs the overall feature amount of the image from the appearance feature amount and the classification text feature amount of the image. The model generation unit is configured to include the appearance feature amount in the training image extracted by the appearance feature extraction unit, and the classified text feature amount in the training image extracted by the classification text feature extraction unit. , a learning device that generates the multimodal model by causing a learning model to learn a correspondence relationship with the classification information in the learning image.

また、本発明の一態様は、画像の外観を示す外観情報を取得する外観情報取得部と、画像における前記外観情報、及び外観特徴抽出モデルを用いて、当該画像の外観の特徴を示す外観特徴量を抽出する外観特徴抽出部と、画像の分類を示す分類情報を取得する分類情報取得部と、画像における前記分類情報、及び分類テキスト特徴抽出モデルを用いて、当該画像の分類を示す文言の特徴を示す分類テキスト特徴量を抽出する分類テキスト特徴抽出部と、画像における前記外観特徴量、前記分類テキスト特徴量、及びマルチモーダルモデルを用いて、当該画像における画像全体の特徴である全体特徴量を抽出する全体特徴抽出部と、対象画像における前記全体特徴量、及び比較画像における前記全体特徴量に基づいて、前記対象画像と前記比較画像の類似度合いを推定する画像類似度推定部と、を備え、前記外観特徴抽出モデルは、画像における前記外観情報から当該画像における前記外観特徴量を出力するモデルであり、学習用画像における前記外観情報と前記分類情報との対応関係を学習モデルに学習させることにより生成されたモデルであり、前記分類テキスト特徴抽出モデルは、分類を示す文言の特徴量を抽出するモデルであり、前記マルチモーダルモデルは、画像における前記外観特徴量及び前記分類テキスト特徴量から、当該画像における前記全体特徴量を出力するモデルであり、前記外観特徴抽出部によって抽出された前記学習用画像における前記外観特徴量、及び前記分類テキスト特徴抽出部によって抽出された前記学習用画像における前記分類テキスト特徴量と、前記学習用画像における前記分類情報との対応関係を学習モデルに学習させることにより生成されたモデルである推定装置である。 Further, one aspect of the present invention provides an appearance information acquisition unit that acquires appearance information indicating the appearance of an image, and an appearance feature indicating the appearance feature of the image using the appearance information in the image and an appearance feature extraction model. an appearance feature extraction unit that extracts the quantity; a classification information acquisition unit that acquires classification information indicating the classification of the image; A classified text feature extraction unit that extracts a classified text feature indicating a feature, and a global feature that is a feature of the entire image in the image using the appearance feature, the classified text feature, and a multimodal model in the image. an image similarity estimating unit that estimates a degree of similarity between the target image and the comparison image based on the overall feature amount in the target image and the overall feature amount in the comparison image. The appearance feature extraction model is a model that outputs the appearance feature amount in the image from the appearance information in the image, and causes the learning model to learn the correspondence between the appearance information and the classification information in the learning image. The classified text feature extraction model is a model that extracts the feature amount of the wording indicating the classification, and the multimodal model is a model that extracts the feature amount of the wording that indicates the classification. , is a model that outputs the overall feature amount in the image, the appearance feature amount in the training image extracted by the appearance feature extraction unit, and the training image extracted by the classified text feature extraction unit. The estimation device is a model generated by causing a learning model to learn the correspondence between the classified text feature amount and the classification information in the learning image.

また、本発明の一態様は、コンピュータを、画像の外観を示す外観情報を取得する外観情報取得手段、画像における前記外観情報、及び外観特徴抽出モデルを用いて、当該画像の外観の特徴を示す外観特徴量を抽出する外観特徴抽出手段、画像の分類を示す分類情報を取得する分類情報取得手段、画像における前記分類情報、及び分類テキスト特徴抽出モデルを用いて、当該画像の分類を示す文言の特徴を示す分類テキスト特徴量を抽出する分類テキスト特徴抽出手段、画像における前記外観特徴量、前記分類テキスト特徴量、及びマルチモーダルモデルを用いて、当該画像における画像全体の特徴である全体特徴量を抽出する全体特徴抽出手段、前記外観特徴抽出モデル、及び前記マルチモーダルモデルを生成するモデル生成手段、として機能させるためのプログラムであって、前記外観特徴抽出モデルは、画像における前記外観情報から当該画像における前記外観特徴量を出力するモデルであり、前記モデル生成手段において、学習用画像における前記外観情報と前記分類情報との対応関係を学習モデルに学習させることにより、前記外観特徴抽出モデルが生成され、前記分類テキスト特徴抽出モデルは、分類を示す文言の特徴量を抽出するモデルであり、前記マルチモーダルモデルは、画像における前記外観特徴量及び前記分類テキスト特徴量から、当該画像における前記全体特徴量を出力するモデルであり、前記モデル生成手段において、前記外観特徴抽出手段によって抽出された前記学習用画像における前記外観特徴量、及び前記分類テキスト特徴抽出手段によって抽出された前記学習用画像における前記分類テキスト特徴量と、前記学習用画像における前記分類情報との対応関係を学習モデルに学習させることにより、前記マルチモーダルモデルが生成される、プログラムである。 Further, one aspect of the present invention provides a computer that uses an appearance information acquisition unit that acquires appearance information indicating the appearance of an image, the appearance information in the image, and an appearance feature extraction model to indicate the appearance characteristics of the image. Appearance feature extraction means for extracting appearance feature amounts, classification information acquisition means for acquiring classification information indicating the classification of the image, the classification information in the image, and a classification text feature extraction model to extract the text indicating the classification of the image. A classified text feature extraction means for extracting a classified text feature indicating a feature, the appearance feature in the image, the classified text feature, and a multimodal model to extract the overall feature that is the feature of the entire image in the image. A program for functioning as overall feature extraction means for extracting, the appearance feature extraction model, and model generation means for generating the multimodal model, wherein the appearance feature extraction model extracts the appearance information from the image from the appearance information in the image. The model outputs the appearance feature amount in the model generation means, and the appearance feature extraction model is generated by causing the learning model to learn the correspondence between the appearance information and the classification information in the learning image. , the classification text feature extraction model is a model for extracting the feature amount of a wording indicating classification, and the multimodal model extracts the overall feature amount in the image from the appearance feature amount and the classification text feature amount in the image. In the model generating means, the appearance feature quantity in the learning image extracted by the appearance feature extraction means and the classification in the learning image extracted by the classified text feature extraction means are used. This program generates the multimodal model by causing a learning model to learn the correspondence between text features and the classification information in the learning image.

また、本発明の一態様は、コンピュータを、画像の外観を示す外観情報を取得する外観情報取得手段、画像における前記外観情報、及び外観特徴抽出モデルを用いて、当該画像の外観の特徴を示す外観特徴量を抽出する外観特徴抽出手段、画像の分類を示す分類情報を取得する分類情報取得手段、画像における前記分類情報、及び分類テキスト特徴抽出モデルを用いて、当該画像の分類を示す文言の特徴を示す分類テキスト特徴量を抽出する分類テキスト特徴抽出手段、画像における前記外観特徴量、前記分類テキスト特徴量、及びマルチモーダルモデルを用いて、当該画像における画像全体の特徴である全体特徴量を抽出する全体特徴抽出部と、対象画像における前記全体特徴量、及び比較画像における前記全体特徴量に基づいて、前記対象画像と前記比較画像の類似度合いを推定する画像類似度推定手段、として機能させるためのプログラムであって、前記外観特徴抽出モデルは、画像における前記外観情報から当該画像における前記外観特徴量を出力するモデルであり、学習用画像における前記外観情報と前記分類情報との対応関係を学習モデルに学習させることにより生成されたモデルであり、前記分類テキスト特徴抽出モデルは、分類を示す文言の特徴量を抽出するモデルであり、前記マルチモーダルモデルは、画像における前記外観特徴量及び前記分類テキスト特徴量から、当該画像における前記全体特徴量を出力するモデルであり、前記外観特徴抽出手段によって抽出された前記学習用画像における前記外観特徴量、及び前記分類テキスト特徴抽出手段によって抽出された前記学習用画像における前記分類テキスト特徴量と、前記学習用画像における前記分類情報との対応関係を学習モデルに学習させることにより生成されたモデルである、プログラムである。 Further, one aspect of the present invention provides a computer that uses an appearance information acquisition unit that acquires appearance information indicating the appearance of an image, the appearance information in the image, and an appearance feature extraction model to indicate the appearance characteristics of the image. Appearance feature extraction means for extracting appearance feature amounts, classification information acquisition means for acquiring classification information indicating the classification of the image, the classification information in the image, and a classification text feature extraction model to extract the text indicating the classification of the image. A classified text feature extraction means for extracting a classified text feature indicating a feature, the appearance feature in the image, the classified text feature, and a multimodal model to extract the overall feature that is the feature of the entire image in the image. It functions as an overall feature extraction unit that extracts, and an image similarity estimation unit that estimates a degree of similarity between the target image and the comparison image based on the overall feature amount in the target image and the overall feature amount in the comparison image. The appearance feature extraction model is a model that outputs the appearance feature amount of the image from the appearance information of the image, and calculates the correspondence between the appearance information and the classification information in the learning image. The model is a model generated by training a learning model, the classification text feature extraction model is a model that extracts the feature amount of a wording indicating classification, and the multimodal model is a model that extracts the feature amount of the wording indicating the classification, and the multimodal model is a model that extracts the feature amount of the wording indicating the classification. It is a model that outputs the overall feature amount of the image from the classified text feature amount, and the appearance feature amount in the training image extracted by the appearance feature extraction means and the appearance feature amount extracted by the classified text feature extraction means. The program is a model that is generated by causing a learning model to learn the correspondence between the classified text feature amount in the learning image and the classification information in the learning image.

この発明によれば、外観のみならず観念を考慮して画像の類否を判定することができる。 According to this invention, it is possible to determine the similarity of images by considering not only the appearance but also the concept.

実施形態の画像類似度推定システム１の構成例を示すブロック図である。1 is a block diagram showing a configuration example of an image similarity estimation system 1 according to an embodiment. FIG. 実施形態の画像Ｇの例を示す図である。It is a figure which shows the example of the image G of embodiment. 実施形態の図形分類Ｚの例を示す図である。It is a figure which shows the example of figure classification Z of embodiment. 実施形態の外観情報１７０の構成例を示す図である。It is a figure showing an example of composition of appearance information 170 of an embodiment. 実施形態の分類情報１７１の構成例を示す図である。It is a figure showing an example of composition of classification information 171 of an embodiment. 実施形態の画像類似度推定システム１が行う処理を説明する図である。FIG. 2 is a diagram illustrating processing performed by the image similarity estimation system 1 of the embodiment. 実施形態の外観特徴抽出モデル１７２を説明する図である。FIG. 3 is a diagram illustrating an appearance feature extraction model 172 according to the embodiment. 実施形態の分類テキスト特徴抽出モデル１７３を説明する図である。It is a diagram explaining a classified text feature extraction model 173 of the embodiment. 実施形態のマルチモーダルモデル１７４を説明する図である。It is a figure explaining the multimodal model 174 of an embodiment. 実施形態の画像類似度推定システム１が行う処理の流れを示すフロー図である。It is a flow diagram showing the flow of processing performed by the image similarity estimation system 1 of the embodiment. 実施形態の画像類似度推定システム１が行う処理の流れを示すフロー図である。It is a flow diagram showing the flow of processing performed by the image similarity estimation system 1 of the embodiment. 実施形態の画像類似度推定システム１が行う処理の流れを示すフロー図である。It is a flow diagram showing the flow of processing performed by the image similarity estimation system 1 of the embodiment.

以下、本発明の実施形態について図面を参照して説明する。 Embodiments of the present invention will be described below with reference to the drawings.

実施形態の画像類似度推定システム１は、画像同士が類似する度合いを推定するシステムである。画像類似度推定システム１は、例えば、特許庁における商標出願の審査における、出願に係る商標の類似の判定に適用される。 The image similarity estimation system 1 of the embodiment is a system that estimates the degree of similarity between images. The image similarity estimation system 1 is applied, for example, to determining the similarity of trademarks related to applications in the examination of trademark applications at the Japan Patent Office.

商標の審査においては、外観の類似のみならず、称呼や概念的な類似を考慮した類似が判断される。例えば、商標の審査では、商標に付与される図形分類を用いて検索の論理式が作成される。そして、作成された論理式を用いた検索が実行されることにより、既に出願済みの商標の中から、出願に係る商標に類似する可能性がある商標の絞り込みが行われる。絞り込まれた商標の中から、外観、称呼、又は概念が類似するものが抽出される。 When examining trademarks, similarity is determined not only by appearance, but also by appellation and conceptual similarity. For example, in trademark examination, a logical search formula is created using the graphic classification assigned to the trademark. Then, by executing a search using the created logical formula, trademarks that are likely to be similar to the trademark in the application are narrowed down from among the trademarks that have already been applied for. From the narrowed down trademarks, those with similar appearance, name, or concept are extracted.

一般的に、深層学習のモデルを用いた画像処理では、画像における外観の特徴が多次元で抽出される。そして、外観の特徴を多次元空間で表現したベクトル同士の距離の近さに応じて類似度合いが推定される。すなわち、画像における外観の特徴から、類似度合いが推定される。このため、外観の特徴が全く異なる画像を類似すると推定することはほとんどあり得ない。例えば、同じ物体（例えば、たて琴など）を表現した画像であって、一方が写真など写実的な自然画像であり、他方がデザインされたイラスト画像である場合を考える。この場合、両画像における外観の特徴が大きく異なっている場合には、両者が類似すると推定されることは困難である。すなわち、たて琴の写真を示す画像と、たて琴をデザインしたイラスト画像とが類似すると推定されることは困難である。しかしながら、「たて琴」という概念が同一であることから、商標の類否判定においては、しばしば、両者が概念的に類似すると判断される場合がある。一般的な深層学習のモデルを用いた画像処理では、このような商標における概念が類似する画像を精度よく推定することが困難であった。 Generally, in image processing using a deep learning model, external features in an image are extracted in multiple dimensions. Then, the degree of similarity is estimated according to the closeness of the vectors expressing the appearance features in a multidimensional space. That is, the degree of similarity is estimated from the appearance characteristics of the images. For this reason, it is almost impossible to estimate that images with completely different external features are similar. For example, consider a case where two images depict the same object (such as a harp), one of which is a realistic natural image such as a photograph, and the other a designed illustration image. In this case, if the external features of both images are significantly different, it is difficult to estimate that the two images are similar. That is, it is difficult to estimate that an image showing a photograph of a harp and an illustration image in which a harp is designed are similar. However, since the concept of "tatekoto" is the same, when determining similarity of trademarks, it is often determined that the two are conceptually similar. In image processing using general deep learning models, it is difficult to accurately estimate images with similar concepts in trademarks.

この対策として、本実施形態の画像類似度推定システム１では、分類テキスト特徴抽出モデル１７３を用いた推定を行う。分類テキスト特徴抽出モデル１７３は、画像における概念の特徴を学習させたモデルである。すなわち、本実施形態の画像類似度推定システム１では、画像における外観の特徴のみならず、画像における概念の特徴を抽出することができる。これにより、画像から抽出した概念の特徴を示すベクトル同士の距離の近さに応じて、概念の観点から類似度合いを推定することが可能となる。したがって、概念が類似する画像を抽出することができる。 As a countermeasure against this, the image similarity estimation system 1 of this embodiment performs estimation using the classified text feature extraction model 173. The classified text feature extraction model 173 is a model that has learned the features of concepts in images. That is, in the image similarity estimation system 1 of this embodiment, it is possible to extract not only external features in images but also conceptual features in images. This makes it possible to estimate the degree of similarity from the concept point of view, depending on the closeness of the vectors representing the characteristics of the concept extracted from the image. Therefore, images with similar concepts can be extracted.

なお、ここでの画像における概念とは、画像の分類を示す文言であり、例えば、商標に付与された図形分類に相当する文言である。本実施形態における概念の特徴とは、文言に含まれる単語の特徴であり、例えば、単語を分散表現した単語ベクトルである。以下の説明では、画像における概念の特徴を、分類テキスト特徴と称する場合がある。 Note that the concept in the image here is a wording that indicates the classification of the image, and is, for example, a wording that corresponds to a graphical classification given to a trademark. The feature of a concept in this embodiment is a feature of a word included in a sentence, and is, for example, a word vector that is a distributed representation of a word. In the following description, conceptual features in images may be referred to as classified text features.

また、本実施形態の画像類似度推定システム１では、深層学習のモデルを用いて外観特徴抽出モデル１７２と分類テキスト特徴抽出モデル１７３を生成する。外観特徴抽出モデル１７２は、画像における外観の特徴を学習させたモデルである。分類テキスト特徴抽出モデル１７３は、外観と概念のそれぞれの特徴量に基づいて画像全体の特徴（以下、全体特徴ともいう）を抽出するモデルである。すなわち、本実施形態の画像類似度推定システム１では、画像における外観と概念のそれぞれの特徴量を統合させた特徴（全体特徴）を抽出することができる。これにより、画像から抽出した外観と概念の特徴を統合的に示すベクトル同士の距離の近さに応じて、外観と概念の両方を統合させた観点から類似度合いを推定することが可能となる。したがって、外観と概念とを統合的にみて類似する画像を抽出することができる。 Furthermore, the image similarity estimation system 1 of this embodiment generates an appearance feature extraction model 172 and a classified text feature extraction model 173 using a deep learning model. The appearance feature extraction model 172 is a model that has learned appearance features in an image. The classified text feature extraction model 173 is a model that extracts features of the entire image (hereinafter also referred to as overall features) based on the respective feature amounts of appearance and concept. That is, in the image similarity estimation system 1 of this embodiment, it is possible to extract a feature (overall feature) that integrates the respective feature amounts of appearance and concept in an image. This makes it possible to estimate the degree of similarity from the perspective of integrating both appearance and concept, depending on the closeness of the vectors that integrally represent the features of appearance and concept extracted from the image. Therefore, it is possible to extract similar images based on an integrated view of appearance and concept.

図１は、実施形態の画像類似度推定システム１の構成例を示すブロック図である。画像類似度推定システム１は、例えば、外観情報取得部１０と、外観特徴抽出部１１と、分類情報取得部１２と、分類テキスト特徴抽出部１３と、全体特徴抽出部１４と、モデル生成部１５と、画像類似度推定部１６と、記憶部１７と、推定結果出力部１８とを備える。 FIG. 1 is a block diagram showing a configuration example of an image similarity estimation system 1 according to an embodiment. The image similarity estimation system 1 includes, for example, an appearance information acquisition section 10, an appearance feature extraction section 11, a classification information acquisition section 12, a classified text feature extraction section 13, an overall feature extraction section 14, and a model generation section 15. , an image similarity estimation section 16 , a storage section 17 , and an estimation result output section 18 .

外観情報取得部１０は、画像における外観を示す情報を取得する。画像における外観を示す情報は、画像の見た目を示す情報であって、例えば、画素ごとの座標にＲＧＢ値が対応づけられた情報である。外観情報取得部１０は、取得した情報を、記憶部１７の外観情報１７０として記憶させる。 The appearance information acquisition unit 10 acquires information indicating the appearance in an image. The information indicating the appearance of the image is information indicating the appearance of the image, and is, for example, information in which RGB values are associated with the coordinates of each pixel. The appearance information acquisition unit 10 stores the acquired information as appearance information 170 in the storage unit 17.

外観特徴抽出部１１は、画像における外観情報１７０、及び外観特徴抽出モデル１７２を用いて、当該画像における外観の特徴量（外観特徴量）を抽出する。外観特徴抽出モデル１７２は、画像における外観情報から当該画像における外観特徴量を出力するモデルである。外観特徴抽出モデル１７２は、モデル生成部１５によって生成される。外観特徴抽出モデル１７２の詳細については後で詳しく説明する。 The appearance feature extraction unit 11 uses the appearance information 170 in the image and the appearance feature extraction model 172 to extract the appearance feature amount (appearance feature amount) in the image. The appearance feature extraction model 172 is a model that outputs appearance feature amounts in an image from appearance information in the image. The appearance feature extraction model 172 is generated by the model generation unit 15. Details of the appearance feature extraction model 172 will be explained in detail later.

分類情報取得部１２は、画像における分類を示す情報を取得する。画像における分類を示す情報は、画像に示された内容を分類する情報であって、例えば、商標における図形分類を示す情報である。分類情報取得部１２は、取得した情報を、記憶部１７の分類情報１７１として記憶させる。 The classification information acquisition unit 12 acquires information indicating the classification of the image. The information indicating the classification in the image is information for classifying the content shown in the image, and is, for example, information indicating the graphic classification in a trademark. The classification information acquisition unit 12 stores the acquired information as classification information 171 in the storage unit 17.

分類テキスト特徴抽出部１３は、画像における分類情報１７１、及び分類テキスト特徴抽出モデル１７３を用いて、当該画像における分類を示す文言の特徴量（分類テキスト特徴量）を抽出する。分類テキスト特徴抽出モデル１７３は、画像における分類情報から当該画像における分類テキスト特徴量を出力するモデルである。分類テキスト特徴抽出モデル１７３は、モデル生成部１５によって生成される。分類テキスト特徴抽出モデル１７３の詳細については後で詳しく説明する。 The classified text feature extraction unit 13 uses the classification information 171 in the image and the classified text feature extraction model 173 to extract the feature amount of the wording indicating the classification in the image (classified text feature amount). The classified text feature extraction model 173 is a model that outputs classified text feature amounts in an image from classification information in the image. The classified text feature extraction model 173 is generated by the model generation unit 15. The details of the classified text feature extraction model 173 will be explained in detail later.

全体特徴抽出部１４は、画像における外観特徴量、分類テキスト特徴量、及びマルチモーダルモデル１７４を用いて、当該画像における画像全体の特徴量（全体特徴量）を抽出する。全体特徴抽出部１４は、画像における外観特徴量を外観特徴抽出部１１から取得する。全体特徴抽出部１４は、画像における分類テキスト特徴量を分類テキスト特徴抽出部１３から取得する。マルチモーダルモデル１７４は、画像における外観特徴量及び分類テキスト特徴量から、当該画像における全体特徴量を出力するモデルである。マルチモーダルモデル１７４の詳細については後で詳しく説明する。 The overall feature extraction unit 14 extracts the overall feature amount (overall feature amount) of the image using the appearance feature amount, the classified text feature amount, and the multimodal model 174 in the image. The overall feature extractor 14 acquires the appearance feature amount in the image from the appearance feature extractor 11. The overall feature extraction unit 14 acquires the classified text feature amount in the image from the classified text feature extraction unit 13 . The multimodal model 174 is a model that outputs the overall feature amount of an image from the appearance feature amount and classified text feature amount of the image. Details of the multimodal model 174 will be explained in detail later.

モデル生成部１５は、外観特徴抽出モデル１７２を生成する。この際、モデル生成部１５は、学習用画像における外観情報と分類情報との対応関係を深層学習のモデルに学習させる。これにより、モデル生成部１５は、入力された画像の外観情報から、当該画像における分類情報を出力するモデルを生成し、生成したモデルを示す情報を記憶部１７の外観特徴抽出モデル１７２として記憶させる。モデルを示す情報は、例えば、深層学習のモデルがＣＮＮ（Convolutional Neural Network）の学習モデルであれば、ＣＮＮの入力層、中間層、出力層の各層のユニット数、隠れ層の層数、活性化関数などを示す情報や、各階層のノードを結合する結合係数や重みを示す情報である。 The model generation unit 15 generates an appearance feature extraction model 172. At this time, the model generation unit 15 causes the deep learning model to learn the correspondence between appearance information and classification information in the learning image. Thereby, the model generation unit 15 generates a model that outputs classification information for the image from the appearance information of the input image, and stores information indicating the generated model as the appearance feature extraction model 172 in the storage unit 17. . For example, if the deep learning model is a CNN (Convolutional Neural Network) learning model, the information indicating the model includes the number of units in each layer of the CNN input layer, intermediate layer, and output layer, the number of hidden layers, activation This is information indicating functions, etc., and information indicating coupling coefficients and weights for coupling nodes in each layer.

また、モデル生成部１５は、マルチモーダルモデル１７４を生成する。この際、モデル生成部１５は、学習用画像における外観特徴量及び分類テキスト特徴量と、分類情報との対応関係を深層学習のモデルに学習させる。モデル生成部１５は、外観特徴抽出部１１によって抽出された学習用画像における外観特徴量を取得する。モデル生成部１５は、分類テキスト特徴抽出部１３によって抽出された学習用画像における分類テキスト特徴量を取得する。これにより、モデル生成部１５は、入力された画像の外観特徴量及び分類テキスト特徴量から、当該画像における分類情報を出力するモデルを生成する。 Furthermore, the model generation unit 15 generates a multimodal model 174. At this time, the model generation unit 15 causes the deep learning model to learn the correspondence between the appearance feature amount and the classified text feature amount in the learning image and the classification information. The model generation unit 15 acquires the appearance feature amount in the learning image extracted by the appearance feature extraction unit 11. The model generation unit 15 acquires the classified text feature amount in the learning image extracted by the classified text feature extraction unit 13. Thereby, the model generation unit 15 generates a model that outputs classification information for the image from the appearance feature amount and the classification text feature amount of the input image.

ここで、画像の外観特徴量及び分類テキスト特徴量から抽出された分類情報は、画像の外観特徴量及び分類テキスト特徴量の双方に基づく特徴であり、全体特徴ということができる。すなわち、モデル生成部１５は、学習用画像における外観特徴量及び分類テキスト特徴量と、分類情報との対応関係を深層学習のモデルに学習させることにより、当該画像における全体特徴を出力するモデルを生成する。モデル生成部１５は、作成したモデルを示す情報を記憶部１７のマルチモーダルモデル１７４として記憶させる。 Here, the classification information extracted from the appearance feature amount and the classified text feature amount of the image is a feature based on both the image appearance feature amount and the classified text feature amount, and can be called an overall feature. That is, the model generation unit 15 generates a model that outputs the overall feature of the image by having a deep learning model learn the correspondence between the appearance feature amount and the classified text feature amount in the training image and the classification information. do. The model generation unit 15 causes the storage unit 17 to store information indicating the created model as a multimodal model 174.

画像類似度推定部１６は、画像の類似度合い（画像類似度）を推定する。画像類似度推定部１６は、複数の画像のそれぞれについて全体特徴量を取得する。画像類似度推定部１６は、全体特徴抽出部１４によって抽出された画像の全体特徴量を取得する。画像類似度推定部１６は、それぞれの画像から抽出された全体特徴における互いのベクトル空間上の距離（例えば、コサイン類似度）を算出する。例えば、画像類似度推定部１６は、算出した距離の順番を、類似する可能性が高い順序として推定する。或いは、画像類似度推定部１６は、算出した距離が所定の閾値未満であった場合、両画像が類似すると推定するようにしてもよい。 The image similarity estimating unit 16 estimates the degree of image similarity (image similarity). The image similarity estimating unit 16 acquires the overall feature amount for each of the plurality of images. The image similarity estimation unit 16 obtains the overall feature amount of the image extracted by the overall feature extraction unit 14. The image similarity estimating unit 16 calculates the distance (for example, cosine similarity) between the overall features extracted from each image in the vector space. For example, the image similarity estimating unit 16 estimates the order of the calculated distances as the order in which there is a high possibility of similarity. Alternatively, the image similarity estimating unit 16 may estimate that both images are similar if the calculated distance is less than a predetermined threshold.

推定結果出力部１８は、画像類似度推定部１６によって推定された推定結果を出力する。推定結果出力部１８は、例えば、推定結果を図示しないディスプレイに出力することにより、推定結果を表示させる。或いは、推定結果出力部１８は、推定結果を図示しないプリンタに出力することにより、推定結果を印刷するようにしてもよい。 The estimation result output unit 18 outputs the estimation result estimated by the image similarity estimation unit 16. The estimation result output unit 18 displays the estimation result by outputting the estimation result to a display (not shown), for example. Alternatively, the estimation result output unit 18 may print the estimation result by outputting the estimation result to a printer (not shown).

上述した画像類似度推定システム１の機能部（外観情報取得部１０、外観特徴抽出部１１、分類情報取得部１２、分類テキスト特徴抽出部１３、全体特徴抽出部１４、モデル生成部１５、画像類似度推定部１６、及び推定結果出力部１８）は、例えば、ＣＰＵ（Central Processing Unit）などのハードウェアプロセッサがプログラム（ソフトウェア）を実行することにより実現される。これらの構成要素のうち一部または全部は、ＬＳＩ（Large Scale Integration）やＡＳＩＣ（Application Specific Integrated Circuit）、ＦＰＧＡ（Field Programmable Gate Array）、ＧＰＵ（Graphics Processing Unit）などのハードウェア（回路部；circuitryを含む）によって実現されてもよいし、ソフトウェアとハードウェアの協働によって実現されてもよい。プログラムは、予めＨＤＤ（Hard Disk Drive）やフラッシュメモリなどの記憶装置（非一過性の記憶媒体を備える記憶装置）に格納されていてもよいし、ＤＶＤやＣＤ－ＲＯＭなどの着脱可能な記憶媒体（非一過性の記憶媒体）に格納されており、記憶媒体がドライブ装置に装着されることでインストールされてもよい。 Functional units of the image similarity estimation system 1 described above (appearance information acquisition unit 10, appearance feature extraction unit 11, classification information acquisition unit 12, classified text feature extraction unit 13, overall feature extraction unit 14, model generation unit 15, image similarity The degree estimation unit 16 and the estimation result output unit 18) are realized by, for example, a hardware processor such as a CPU (Central Processing Unit) executing a program (software). Some or all of these components are hardware (circuit parts) such as LSI (Large Scale Integration), ASIC (Application Specific Integrated Circuit), FPGA (Field Programmable Gate Array), and GPU (Graphics Processing Unit). ), or may be realized by collaboration of software and hardware. The program may be stored in advance in a storage device (a storage device with a non-transitory storage medium) such as an HDD (Hard Disk Drive) or flash memory, or may be stored in a removable storage device such as a DVD or CD-ROM. It is stored in a medium (non-transitory storage medium), and may be installed by loading the storage medium into a drive device.

記憶部１７は、少なくとも１つの記憶媒体を任意に組み合わせることによって構成される。記憶媒体は、例えば、ＨＤＤ（Hard Disk Drive）、フラッシュメモリ、ＥＥＰＲＯＭ（Electrically Erasable Programmable Read Only Memory）、ＲＡＭ（Random Access read/write Memory）、ＲＯＭ（Read Only Memory）である。記憶部１７は、画像類似度推定システム１の各種処理を実行するためのプログラム、及び各種処理を行う際に利用される一時的なデータを記憶する。 The storage unit 17 is configured by arbitrarily combining at least one storage medium. The storage medium is, for example, a HDD (Hard Disk Drive), a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), a RAM (Random Access read/write Memory), or a ROM (Read Only Memory). The storage unit 17 stores programs for executing various processes of the image similarity estimation system 1 and temporary data used when performing the various processes.

記憶部１７は、例えば、外観情報１７０と、分類情報１７１と、外観特徴抽出モデル１７２と、分類テキスト特徴抽出モデル１７３と、マルチモーダルモデル１７４とを記憶する。 The storage unit 17 stores, for example, appearance information 170, classification information 171, appearance feature extraction model 172, classified text feature extraction model 173, and multimodal model 174.

ここで、外観情報１７０と、分類情報１７１について、図２から図５を用いて説明する。
図２は、実施形態の画像Ｇの例を示すブロック図である。図３は、実施形態の図形分類Ｚの例を示す図である。図４は、実施形態の外観情報１７０の構成例を示す図である。図５は、実施形態の分類情報１７１の構成例を示す図である。 Here, the appearance information 170 and the classification information 171 will be explained using FIGS. 2 to 5.
FIG. 2 is a block diagram showing an example of the image G of the embodiment. FIG. 3 is a diagram showing an example of figure classification Z according to the embodiment. FIG. 4 is a diagram illustrating a configuration example of appearance information 170 according to the embodiment. FIG. 5 is a diagram showing a configuration example of the classification information 171 according to the embodiment.

図２に示すように、画像Ｇは、例えば、円の中に描かれた看護師のイラストを示す画像である。図２の例に示す画像Ｇにおける外観の特徴として、例えば、図３に示すような図形分類Ｚが付与される。この例では、図形分類Ｚは、「２．３．１頭部、上半身」及び「２．３．３尼僧、看護婦」などである。 As shown in FIG. 2, the image G is, for example, an image showing an illustration of a nurse drawn in a circle. As a feature of the appearance of the image G shown in the example of FIG. 2, for example, a figure classification Z as shown in FIG. 3 is assigned. In this example, the figure classifications Z are "2.3.1 Head, upper body" and "2.3.3 Nun, nurse".

図４に示すように外観情報１７０は、例えば、画像ＩＤと外観情報とを備える。画像ＩＤは画像を一意に識別する識別情報である。外観情報は、画像における外観を示す情報である。この例では、外観情報として、画素ごとの座標とＲＧＢ値とを示す情報が示されている。 As shown in FIG. 4, the appearance information 170 includes, for example, an image ID and appearance information. The image ID is identification information that uniquely identifies an image. Appearance information is information indicating the appearance in an image. In this example, information indicating the coordinates and RGB values of each pixel is shown as appearance information.

図５に示すように分類情報１７１は、例えば、画像ＩＤと分類情報とを備える。画像ＩＤは画像を一意に識別する識別情報である。分類情報は、画像における分類を示す情報である。この例では、分類情報として、商標における図形分類の番号体系とその番号体系に対応する分類の文言とが対応づけられた情報が示されている。 As shown in FIG. 5, the classification information 171 includes, for example, an image ID and classification information. The image ID is identification information that uniquely identifies an image. Classification information is information indicating classification in an image. In this example, as classification information, information is shown in which a numbering system for a figure classification in a trademark is associated with a wording of a classification corresponding to the numbering system.

ここで、画像類似度推定システム１が画像の全体特徴を抽出する処理の流れを説明する。図６は、実施形態の画像類似度推定システム１が行う処理を説明する図である。 Here, a process flow in which the image similarity estimation system 1 extracts the overall features of an image will be explained. FIG. 6 is a diagram illustrating processing performed by the image similarity estimation system 1 of the embodiment.

図６に示すように、画像類似度推定システム１は、画像Ｇにおける外観情報を外観特徴抽出モデル１７２に入力させることにより、外観特徴抽出モデル１７２から画像Ｇの外観特徴量を出力させる。また、画像類似度推定システム１は、画像Ｇにおける分類情報を分類テキスト特徴抽出モデル１７３に入力させることにより、分類テキスト特徴抽出モデル１７３から画像Ｇの分類テキスト特徴量を出力させる。そして、画像類似度推定システム１は、マルチモーダルモデル１７４に、画像Ｇにおける外観特徴量及び分類テキスト特徴量を入力させることにより、マルチモーダルモデル１７４から、画像Ｇにおける全体特徴量を出力させる。このように、画像類似度推定システム１では、外観特徴抽出モデル１７２、分類テキスト特徴抽出モデル１７３、及びマルチモーダルモデル１７４を用いて、画像Ｇにおける外観情報及び分類情報から、画像Ｇの全体特徴量を抽出する。 As shown in FIG. 6, the image similarity estimation system 1 causes the appearance feature extraction model 172 to output the appearance feature amount of the image G by inputting appearance information on the image G to the appearance feature extraction model 172. Furthermore, the image similarity estimation system 1 causes the classified text feature extraction model 173 to output the classified text feature amount of the image G by inputting the classification information on the image G to the classified text feature extraction model 173. Then, the image similarity estimation system 1 inputs the appearance feature amount and the classified text feature amount in the image G to the multimodal model 174, thereby causing the multimodal model 174 to output the overall feature amount in the image G. In this way, the image similarity estimation system 1 uses the appearance feature extraction model 172, the classified text feature extraction model 173, and the multimodal model 174 to calculate the overall feature amount of the image G from the appearance information and classification information in the image G. Extract.

ここで、外観特徴抽出モデル１７２について、図７を用いて詳しく説明する。図７は、実施形態の外観特徴抽出モデル１７２を説明する図である。図７に示すように、外観特徴抽出モデル１７２は、例えば、ＣＮＮ部１７２Ａと、アテンション機構１７２Ｂと、乗算部１７２Ｃと、外観特徴出力部１７２Ｄとを備える。 Here, the appearance feature extraction model 172 will be explained in detail using FIG. 7. FIG. 7 is a diagram illustrating the appearance feature extraction model 172 of the embodiment. As shown in FIG. 7, the appearance feature extraction model 172 includes, for example, a CNN section 172A, an attention mechanism 172B, a multiplication section 172C, and an appearance feature output section 172D.

ＣＮＮ部１７２Ａは、ＣＮＮによる深層学習のモデルである。アテンション機構１７２Ｂは、ＣＮＮ部１７２Ａから出力される内部状態に重みを付けて出力する機構である。例えば、アテンション機構１７２Ｂは、推定に重要でない部分（例えば、画像における背景の領域など）に、重要な部分と比較して小さな重みづけを行う。これにより、推定に有効な特徴に焦点をあて、推定結果により大きな影響を与えることが可能となる。加算部１７２Ｃは、ＣＮＮ部１７２Ａからの出力と、アテンション機構１７２Ｂからの出力とのそれぞれに重みを乗算して出力する。乗算部１７２Ｃは、例えば、ＣＮＮ部１７２Ａからの出力、又はアテンション機構１７２Ｂからの出力のいずれか一方を出力するスイッチとして機能する。これにより、アテンション機構１７２Ｂの有無を制御し、アテンション機構１７２Ｂの有無が推定の精度に与える影響を検証することが可能となる。外観特徴出力部１７２Ｄは、外観特徴抽出モデル１７２からの出力、つまり画像Ｇにおける外観特徴量が格納される出力層である。 The CNN unit 172A is a deep learning model using CNN. The attention mechanism 172B is a mechanism that weights and outputs the internal state output from the CNN unit 172A. For example, the attention mechanism 172B weights a portion that is not important for estimation (for example, a background region in an image) smaller than an important portion. This makes it possible to focus on features that are effective for estimation and have a greater influence on the estimation results. The adding unit 172C multiplies the output from the CNN unit 172A and the output from the attention mechanism 172B by a weight, and outputs the result. The multiplication unit 172C functions as a switch that outputs either the output from the CNN unit 172A or the output from the attention mechanism 172B, for example. This makes it possible to control the presence or absence of the attention mechanism 172B and to verify the influence of the presence or absence of the attention mechanism 172B on estimation accuracy. The appearance feature output unit 172D is an output layer in which the output from the appearance feature extraction model 172, that is, the appearance feature amount in the image G is stored.

例えば、まず、モデル生成部１５は、ＣＮＮ部１７２Ａのファインチューニングを行う。具体的に、モデル生成部１５は、ＣＮＮ部１７２Ａに、学習用画像における外観情報と分類情報との対応関係を、所定の終了条件を満たすまで繰り返し学習させる。学習用画像は、モデルの学習に用いられる画像であって、画像に対して、既にその分類情報が対応づけられている画像である。学習用画像は、例えば、出願済みの商標であって、商標における図形分類が付与されているものが用いられる。所定の終了条件は、任意に定められた条件であってよいが、例えば、学習段階における推定の精度の変化が収束することである。或いは所定の終了条件は、学習の回数が所定の上限に到達する、或いは推定の精度が所定の閾値以上になる、などの条件であってもよい。 For example, first, the model generation unit 15 performs fine tuning of the CNN unit 172A. Specifically, the model generation unit 15 causes the CNN unit 172A to repeatedly learn the correspondence between appearance information and classification information in the learning image until a predetermined termination condition is satisfied. The learning image is an image used for model learning, and is an image in which classification information has already been associated with the image. The learning image used is, for example, a trademark that has already been applied for and has been given a graphic classification in the trademark. The predetermined termination condition may be an arbitrarily determined condition, but for example, the predetermined termination condition is that the change in estimation accuracy in the learning stage converges. Alternatively, the predetermined termination condition may be that the number of times of learning reaches a predetermined upper limit, or that the accuracy of estimation exceeds a predetermined threshold.

次に、モデル生成部１５は、ファインチューニングをしたＣＮＮ部１７２Ａを用いて、アテンション機構１７２Ｂを学習させる。モデル生成部１５は、学習用画像における外観情報を入力することにより、ＣＮＮ部１７２Ａを介してアテンション機構１７２Ｂから出力される特徴量に基づき付与される確率が高い分類情報が、学習用画像における分類情報に近づくように、アテンション機構１７２Ｂにおけるパラメータを調整することにより、アテンション機構１７２Ｂを学習させる。 Next, the model generation unit 15 causes the attention mechanism 172B to learn using the fine-tuned CNN unit 172A. By inputting the appearance information in the learning image, the model generation unit 15 generates classification information that has a high probability of being assigned based on the feature amount output from the attention mechanism 172B via the CNN unit 172A. The attention mechanism 172B is made to learn by adjusting the parameters in the attention mechanism 172B so as to approach the information.

このように、モデル生成部１５は、ＣＮＮ部１７２Ａのファインチューニング、及びアテンション機構１７２Ｂの学習の二つの手順を行うことにより、外観特徴抽出モデル１７２を生成する。 In this way, the model generation unit 15 generates the appearance feature extraction model 172 by performing two procedures: fine tuning of the CNN unit 172A and learning of the attention mechanism 172B.

ここで、分類テキスト特徴抽出モデル１７３について、図８を用いて詳しく説明する。図８は、分類テキスト特徴抽出モデル１７３を説明する図である。図８に示すように、分類テキスト特徴抽出モデル１７３は、例えば、抽出単語入力層１７３Ａと、単語特徴埋込部１７３Ｂと、加重平均部１７３Ｃと、分類テキスト特徴出力部１７３Ｄとを備える。 Here, the classified text feature extraction model 173 will be explained in detail using FIG. 8. FIG. 8 is a diagram illustrating the classified text feature extraction model 173. As shown in FIG. 8, the classified text feature extraction model 173 includes, for example, an extracted word input layer 173A, a word feature embedding section 173B, a weighted average section 173C, and a classified text feature output section 173D.

抽出単語入力層１７３Ａは、画像Ｇの分類を示す文言から抽出された単語が入力される入力層である。抽出単語入力層１７３Ａには、例えば、画像Ｇの分類を示す文言において分かち書きされた単語のそれぞれが入力される。例えば、分類を示す文言が「頭部、上半身」である場合、抽出単語入力層１７３Ａには、「頭部」と「上半身」がそれぞれ入力される。図８の例では、例えば、抽出単語入力層１７３Ａにおける、ｗ１に「頭部」が入力され、ｗ２に「上半身」が入力される。この例のように、抽出単語入力層１７３Ａには、単語の数に応じた数のノードが設定されてよい。また、分類を示す文言が分かち書きされていない場合に、分類を示す文言を形態素解析することにより、品詞ごとに分離して、分類を示す文言から、分類を示す単語（例えば、名詞など）を抽出するようにしてもよい。 The extracted word input layer 173A is an input layer into which words extracted from the text indicating the classification of the image G are input. For example, each of the words separated in the text indicating the classification of the image G is input to the extracted word input layer 173A. For example, if the wording indicating the classification is "head, upper body", "head" and "upper body" are respectively input to the extracted word input layer 173A. In the example of FIG. 8, for example, in the extracted word input layer 173A, "head" is input to w1, and "upper body" is input to w2. As in this example, the number of nodes corresponding to the number of words may be set in the extracted word input layer 173A. In addition, when the text indicating the classification is not written separately, by morphologically analyzing the text indicating the classification, it is separated by part of speech and the word indicating the classification (for example, a noun) is extracted from the text indicating the classification. You may also do so.

単語特徴埋込部１７３Ｂには、抽出単語入力層１７３Ａのそれぞれのノードに入力された単語の特徴が出力される。単語の特徴は、いわゆる単語の分散表現であり、例えば、コーパスを用いて学習したＷｏｒｄ２Ｖｅｃ（以下、Ｗ２Ｖ）などの自然言語処理モデルに単語を入力させることにより得られる、単語の特徴を示す情報である。 The word feature embedding unit 173B outputs the features of the words input to each node of the extracted word input layer 173A. Word features are so-called distributed representations of words, and are information indicating the features of words that can be obtained by inputting words into a natural language processing model such as Word2Vec (hereinafter referred to as W2V) trained using a corpus. be.

ここで、図形の分類情報、特に商標における図形分類には、類似する商標を漏れなく抽出する必要があることから、比較的広い概念で図形分類が付与されているものがある。ここでの広い概念とは、例えば、「２６．１．１円」などの分類である。円が用いられている画像は数多く存在しており、この様な比較的広い概念での分類を示す文言の特徴を用いると、多数の画像が類似することになり、実質的な絞り込みとならない可能性が高い。つまり、比較的広い概念での分類を示す文言の特徴を反映させると、推定の精度を劣化させてしまう可能性がある。 Here, some graphic classification information, particularly graphic classification of trademarks, is given a relatively broad concept of graphic classification because it is necessary to extract all similar trademarks. The broad concept here is, for example, a classification such as "26.1.1 yen". There are many images that use circles, and if we use the characteristics of the wording that indicates classification based on a relatively broad concept, many images will be similar, and it may not be possible to narrow down the search effectively. Highly sexual. In other words, if the characteristics of the wording indicating a classification based on a relatively broad concept are reflected, the accuracy of estimation may deteriorate.

この対策として、本実施形態では、絞り込みの効果が期待できない単語の影響が小さくなるように重みづけを行う。具体的に、加重平均部１７３Ｃは、単語から抽出された単語ベクトル（単語の特徴量）に、その単語のｉｄｆ値で重みづけし、単語ベクトルごとに加重平均した値を出力する。ｉｄｆ値は以下の（１）式で示される値である。 As a countermeasure for this, in this embodiment, weighting is performed so that the influence of words for which no narrowing effect can be expected is reduced. Specifically, the weighted average unit 173C weights the word vector (feature amount of the word) extracted from the word with the idf value of the word, and outputs the weighted average value for each word vector. The idf value is a value expressed by the following equation (1).

ｉｄｆ（Ｘ）＝ｌｏｇ（Ｎ＿ｔｏｔａｌ／Ｎ＿Ｘ） …（１） idf(X)=log(N_total/N_X)...(1)

（１）式において、ｉｄｆ（Ｘ）は単語（Ｘ）におけるｉｄｆ値である。Ｎ＿ｔｏｔａｌは、図形分類が付与された画像の総数である。Ｎ＿Ｘは、単語（Ｘ）を含む図形分類が付与された画像の数である。（１）式に示す通り、画像の総数に対して多くの画像に付与されている分類に含まれる単語におけるｉｄｆ値は小さな値となり、画像の総数に対して少ない画像に付与されている分類に含まれる単語におけるｉｄｆ値は大きな値となる。このようなｉｄｆ値で重みづけがなされることにより、絞り込みに有効な単語の特徴を、分類テキスト特徴量により大きく影響させることができる。その一方で、絞り込みに効果が期待できない単語の特徴が分類テキスト特徴量に与える影響を抑制させることができる。 In equation (1), idf(X) is the idf value for word (X). N_total is the total number of images to which graphic classifications have been assigned. N_X is the number of images to which a graphic classification including the word (X) is assigned. As shown in equation (1), the idf value for words included in classifications that are assigned to many images relative to the total number of images is small, and The idf value of the included words becomes a large value. By weighting with such idf values, the word features that are effective for narrowing down can be made to have a greater influence on the classified text feature amounts. On the other hand, it is possible to suppress the influence of word features that are not expected to be effective in narrowing down the classification text features.

分類テキスト特徴出力部１７３Ｄは、分類テキスト特徴抽出モデル１７３からの出力、つまり画像Ｇにおける分類テキスト特徴量が格納される出力層である。 The classified text feature output unit 173D is an output layer in which the output from the classified text feature extraction model 173, that is, the classified text feature amount in the image G is stored.

ここで、マルチモーダルモデル１７４について、図９を用いて詳しく説明する。図９は、マルチモーダルモデル１７４を説明する図である。図９に示すように、マルチモーダルモデル１７４は、例えば、特徴結合入力層１７４Ａと、全結合層１７４Ｂと、全体特徴出力部１７４Ｃとを備える。 Here, the multimodal model 174 will be explained in detail using FIG. 9. FIG. 9 is a diagram illustrating the multimodal model 174. As shown in FIG. 9, the multimodal model 174 includes, for example, a feature combination input layer 174A, a fully connected layer 174B, and an overall feature output section 174C.

特徴結合入力層１７４Ａは、画像Ｇにおける外観特徴量及び分類テキスト特徴量が入力される、マルチモーダルモデル１７４の入力層である。全体特徴出力部１７４Ｃは、マルチモーダルモデル１７４からの出力、つまり画像Ｇにおける全体特徴量が格納される出力層である。全結合層１７４Ｂは、特徴結合入力層１７４Ａと全体特徴出力部１７４Ｃとの間を全結合するＦＣ（Full Connection）層である。 The feature combination input layer 174A is an input layer of the multimodal model 174 to which the appearance feature amount and the classified text feature amount in the image G are input. The overall feature output unit 174C is an output layer in which the output from the multimodal model 174, that is, the overall feature amount in the image G is stored. The fully connected layer 174B is an FC (Full Connection) layer that fully connects between the feature combination input layer 174A and the overall feature output section 174C.

ここで、画像Ｇにおける外観特徴量は、外観特徴抽出モデル１７２から出力される。また、画像Ｇにおける分類テキスト特徴量は、分類テキスト特徴抽出モデル１７３から出力される。それぞれの特徴量が、互いに異なるモデルから出力されることから、それぞれの特徴量が取り得る範囲が、同じような範囲とならない可能性がある。このような取り得る範囲が異なる特徴量を単純にそのまま統合させて入力させてしまうと、モデルが一方の特徴量と出力との対応関係のみを学習してしまい、他方の特徴量が反映されていない偏った推定がなされる可能性が高くなる。 Here, the appearance feature amount in the image G is output from the appearance feature extraction model 172. Further, the classified text feature amount in the image G is output from the classified text feature extraction model 173. Since the respective feature quantities are output from different models, the possible ranges of the respective feature quantities may not be the same. If such features with different possible ranges are simply integrated and input as is, the model will only learn the correspondence between one feature and the output, and the other feature will not be reflected. This increases the possibility that biased estimates will be made.

このための対策として、本実施形態では、マルチモーダルモデル１７４に入力させる二つの特徴量を正規化する前処理を行う。具体的に、モデル生成部１５は、画像Ｇにおける外観特徴量と、画像Ｇにおける分類テキスト特徴量とが同程度の範囲（例えば、０から１）となるように、一方の特徴量に所定の一律の値を乗算する。モデル生成部１５は、必要に応じて他方の特徴量に、一方の特徴量に乗算した値とは異なる別の一律の値を乗算する。これにより、モデル生成部１５は、マルチモーダルモデル１７４を、二つの特徴量の両方を考慮して全体特徴量を出力するように学習させることができる。 As a countermeasure for this, in the present embodiment, preprocessing is performed to normalize the two feature quantities input to the multimodal model 174. Specifically, the model generation unit 15 sets a predetermined value to one of the feature amounts so that the appearance feature amount in the image G and the classified text feature amount in the image G are in the same range (for example, from 0 to 1). Multiply by a uniform value. The model generation unit 15 multiplies the other feature amount by a uniform value different from the value multiplied by one feature amount, as necessary. Thereby, the model generation unit 15 can train the multimodal model 174 to take both of the two feature quantities into consideration and output the entire feature quantity.

ここで、画像類似度推定システム１が行う処理の流れについて、図１０から図１２を用いて説明する。図１０から図１２は、実施形態の画像類似度推定システム１が行う処理の流れを示すフロー図である。 Here, the flow of processing performed by the image similarity estimation system 1 will be explained using FIGS. 10 to 12. 10 to 12 are flowcharts showing the flow of processing performed by the image similarity estimation system 1 of the embodiment.

図１０には、画像類似度推定システム１が分類テキスト特徴抽出モデル１７３を用いて画像から分類テキスト特徴量を抽出する処理の流れが示されている。画像類似度推定システム１は、画像Ｇの分類情報を取得する（ステップＳ１０）。画像類似度推定システム１は、分類情報を用いて、画像Ｇの分類を示す文言を単語ごとに分離（分かち書き）する（ステップＳ１１）。画像類似度推定システム１は、単語それぞれの単語ベクトルを抽出する（ステップＳ１２）。 FIG. 10 shows a flow of processing in which the image similarity estimation system 1 extracts classified text features from an image using the classified text feature extraction model 173. The image similarity estimation system 1 acquires classification information of the image G (step S10). The image similarity estimation system 1 uses the classification information to separate (separate) the wording indicating the classification of the image G into words (step S11). The image similarity estimation system 1 extracts word vectors for each word (step S12).

一方、画像類似度推定システム１は、単語それぞれのｉｄｆ値を算出する（ステップＳ１３）。画像類似度推定システム１は、単語の単語ベクトルに、その単語のｉｄｆ値を重みづけ（乗算）する（ステップＳ１４）。画像類似度推定システム１は、重みづけしたそれぞれの単語における単語ベクトルを、単語ベクトルごとに加重平均した値を、画像Ｇにおける分類テキスト特徴量として出力する（ステップＳ１５）。 On the other hand, the image similarity estimation system 1 calculates an idf value for each word (step S13). The image similarity estimation system 1 weights (multiplies) the word vector of a word by the idf value of the word (step S14). The image similarity estimation system 1 outputs the weighted average of the weighted word vectors for each word vector as the classified text feature amount in the image G (step S15).

なお、図１０では、ステップＳ１２で単語ベクトルを抽出した後に、ステップＳ１３で単語のｉｄｆ値を算出する流れを例示して説明したが、少なくともステップＳ１４において単語ベクトルにｉｄｆ値が乗算できればよく、単語のｉｄｆ値を算出した後に、ステップＳ１０～Ｓ１２に示す処理を行うことにより単語ベクトルを抽出してもよい。或いは、図１０における単語ベクトルを抽出する処理とは独立させた処理として、ｉｄｆ値を算出する処理を行ってもよい。 In addition, in FIG. 10, the flow of calculating the idf value of the word in step S13 after extracting the word vector in step S12 was explained as an example, but it is only necessary that the word vector can be multiplied by the idf value in step S14, After calculating the idf value of , word vectors may be extracted by performing the processing shown in steps S10 to S12. Alternatively, the process of calculating the idf value may be performed as a process independent of the process of extracting word vectors in FIG. 10.

図１１には、画像類似度推定システム１が、外観特徴抽出モデル１７２、及びマルチモーダルモデル１７４を生成する処理の流れが示されている。画像類似度推定システム１は、学習用画像の外観情報を取得する（ステップＳ２０）。画像類似度推定システム１は、学習用画像の分類情報を取得する（ステップＳ２１）。画像類似度推定システム１は、学習用画像の外観情報と分類情報との対応関係をＣＮＮ部１７２Ａに学習させることにより、外観特徴抽出モデル１７２のＣＮＮ部１７２Ａを学習させる（ステップＳ２２）。画像類似度推定システム１は、学習用画像の外観情報を入力させることにより得られるＣＮＮ部１７２Ａの内部状態に基づき出力される可能性の高い分類情報が、学習用画像の分類情報に近づくように、外観特徴抽出モデル１７２のアテンション機構１７２Ｂを学習させる（ステップＳ２３）。これにより、画像類似度推定システム１は、外観特徴抽出モデル１７２を生成する。 FIG. 11 shows a process flow in which the image similarity estimation system 1 generates the appearance feature extraction model 172 and the multimodal model 174. The image similarity estimation system 1 acquires appearance information of the learning image (step S20). The image similarity estimation system 1 acquires classification information of the learning image (step S21). The image similarity estimation system 1 causes the CNN unit 172A of the appearance feature extraction model 172 to learn the correspondence between the appearance information of the learning image and the classification information (step S22). The image similarity estimation system 1 is configured so that the classification information that is likely to be output based on the internal state of the CNN unit 172A obtained by inputting the appearance information of the training image approaches the classification information of the training image. , the attention mechanism 172B of the appearance feature extraction model 172 is trained (step S23). Thereby, the image similarity estimation system 1 generates the appearance feature extraction model 172.

画像類似度推定システム１は、外観特徴抽出モデル１７２に学習用画像の外観情報を入力させることにより、学習用画像の外観特徴量を抽出する（ステップＳ２４）。画像類似度推定システム１は、分類テキスト特徴抽出モデル１７３に学習用画像の分類情報を入力させることにより、学習用画像の分類テキスト特徴量を抽出する（ステップＳ２５）。画像類似度推定システム１は、学習用画像の外観特徴量と分類テキスト特徴量とを正規化する処理を行う（ステップＳ２６）。画像類似度推定システム１は、正規化する処理をした学習用画像の外観特徴量と分類テキスト特徴量に基づき出力される可能性が高い分類情報が、学習用画像の分類情報（ここでは全体特徴量に相当する）近づくように、全結合層１７４Ｂを学習させる（パラメータを調整する）ことにより、マルチモーダルモデル１７４を生成する（ステップＳ２７）。画像類似度推定システム１は、生成した外観特徴抽出モデル１７２、マルチモーダルモデル１７４を記憶させる（ステップＳ２８）。 The image similarity estimation system 1 extracts the appearance feature amount of the learning image by inputting the appearance information of the learning image to the appearance feature extraction model 172 (step S24). The image similarity estimation system 1 extracts the classified text feature amount of the training image by inputting the classification information of the training image to the classified text feature extraction model 173 (step S25). The image similarity estimation system 1 performs a process of normalizing the appearance feature amount and the classified text feature amount of the learning image (step S26). In the image similarity estimation system 1, classification information that is likely to be output based on the appearance features and classified text features of the training images that have undergone normalization processing is the classification information of the training images (here, the overall The multimodal model 174 is generated by training the fully connected layer 174B (adjusting the parameters) so that it approaches the amount (equivalent to the amount) (step S27). The image similarity estimation system 1 stores the generated appearance feature extraction model 172 and multimodal model 174 (step S28).

なお、図１１では、ステップＳ２４で外観特徴量を抽出した後に、ステップＳ２５で分類テキスト特徴量を抽出する流れを例示して説明したが、少なくともステップＳ２６において二つの特徴量（外観特徴量と分類テキスト特徴量）が正規化できればよく、分類テキスト特徴量を抽出した後に外観特徴量を抽出してもよい。 In FIG. 11, the flow of extracting the appearance feature amount in step S24 and then extracting the classified text feature amount in step S25 was explained as an example, but at least two feature amounts (appearance feature amount and classification text feature amount) are extracted in step S26. It is only necessary that the text feature amount) can be normalized, and the appearance feature amount may be extracted after the classified text feature amount is extracted.

図１２には、画像類似度推定システム１が、二つの画像（ここでは対象画像と比較画像）の類似度を推定する処理の流れが示されている。画像類似度推定システム１は、対象画像における外観情報を取得し（ステップＳ３０）、取得した情報と外観特徴抽出モデル１７２とを用いて、対象画像における外観特徴量を抽出する（ステップＳ３１）。また、画像類似度推定システム１は、対象画像における分類情報を取得し（ステップＳ３２）、取得した情報と分類テキスト特徴抽出モデル１７３とを用いて、対象画像における分類テキスト特徴量を抽出する（ステップＳ３３）。そして、画像類似度推定システム１は、対象画像における外観特徴量と、分類テキスト特徴量と、マルチモーダルモデル１７４を用いて、対象画像における全体特徴量を抽出する（ステップＳ３４）。 FIG. 12 shows a process flow in which the image similarity estimation system 1 estimates the similarity between two images (here, a target image and a comparison image). The image similarity estimation system 1 acquires appearance information in the target image (step S30), and uses the acquired information and the appearance feature extraction model 172 to extract appearance features in the target image (step S31). The image similarity estimation system 1 also acquires classification information in the target image (step S32), and uses the acquired information and the classified text feature extraction model 173 to extract classified text features in the target image (step S32). S33). Then, the image similarity estimation system 1 extracts the overall feature amount in the target image using the appearance feature amount, the classified text feature amount, and the multimodal model 174 in the target image (step S34).

一方、画像類似度推定システム１は、比較画像における全体特徴量を抽出する（ステップＳ３５）。画像類似度推定システム１が比較画像における全体特徴量を抽出する処理の流れは、対象画像における全体特徴量を抽出する処理の流れと同様である。 On the other hand, the image similarity estimation system 1 extracts the overall feature amount in the comparison image (step S35). The process flow in which the image similarity estimation system 1 extracts the overall feature amount in the comparison image is similar to the process flow in which the overall feature amount is extracted in the target image.

画像類似度推定システム１は、対象画像との類似度を推定したい全ての比較画像についてその全体特徴量を算出したか否かを判定する（ステップＳ３６）。画像類似度推定システム１は、対象画像と比較画像のそれぞれの全体特徴におけるベクトル空間上の距離をコサイン類似度として算出する（ステップＳ３７）。 The image similarity estimation system 1 determines whether the overall feature amount has been calculated for all comparison images whose similarity with the target image is to be estimated (step S36). The image similarity estimation system 1 calculates the distance in the vector space in the overall features of each of the target image and the comparison image as a cosine similarity (step S37).

なお、図１２では、ステップＳ３６で対象画像との類似度を推定したい全ての比較画像についてその全体特徴量を算出した後に、ステップＳ３６でそれぞれのコサイン類似度を算出する流れを例示して説明したが、少なくとも対象画像と比較画像の類似度を算出できればよく、比較画像における全体特徴量を抽出する都度、コサイン類似度を算出するようにしてもよい。 In addition, FIG. 12 illustrates and explains the flow of calculating the cosine similarity of each image in step S36 after calculating the overall feature amount of all comparison images whose similarity with the target image is to be estimated in step S36. However, it is only necessary to be able to calculate at least the similarity between the target image and the comparison image, and the cosine similarity may be calculated each time the overall feature amount in the comparison image is extracted.

以上説明したように、実施形態の画像類似度推定システム１は、外観情報取得部１０と、外観特徴抽出部１１と、分類情報取得部１２と、分類テキスト特徴抽出部１３と、全体特徴抽出部１４と、モデル生成部１５と、画像類似度推定部１６とを備える。外観情報取得部１０は画像Ｇの外観を示す外観情報１７０を取得する。外観特徴抽出部１１は画像Ｇにおける外観情報１７０、及び外観特徴抽出モデル１７２を用いて、画像Ｇの外観の特徴を示す外観特徴量を抽出する。分類情報取得部１２は画像Ｇの分類を示す分類情報１７１を取得する。分類テキスト特徴抽出部１３は、画像Ｇにおける分類情報１７１、及び分類テキスト特徴抽出モデル１７３を用いて、画像Ｇの分類を示す文言の特徴を示す分類テキスト特徴量を抽出する。全体特徴抽出部１４は、画像Ｇにおける外観特徴量、分類テキスト特徴量、及びマルチモーダルモデル１７４を用いて、画像Ｇにおける画像全体の特徴である全体特徴量を抽出する。モデル生成部１５は、外観特徴抽出モデル１７２と、マルチモーダルモデル１７４を生成する。画像類似度推定部１６は、対象画像における全体特徴量、及び比較画像における全体特徴量に基づいて、対象画像と比較画像の類似度合いを推定する。これにより、実施形態の画像類似度推定システム１は、画像Ｇにおける外観と概念との両方を考慮した特徴を抽出することができ、外観のみならず観念を考慮して画像の類否を判定することが可能である。 As described above, the image similarity estimation system 1 of the embodiment includes the appearance information acquisition section 10, the appearance feature extraction section 11, the classification information acquisition section 12, the classified text feature extraction section 13, and the overall feature extraction section. 14, a model generation section 15, and an image similarity estimation section 16. The appearance information acquisition unit 10 acquires appearance information 170 indicating the appearance of the image G. The appearance feature extraction unit 11 uses the appearance information 170 in the image G and the appearance feature extraction model 172 to extract appearance feature amounts indicating the appearance characteristics of the image G. The classification information acquisition unit 12 acquires classification information 171 indicating the classification of the image G. The classified text feature extraction unit 13 uses the classification information 171 in the image G and the classified text feature extraction model 173 to extract a classified text feature quantity that indicates the feature of the wording that indicates the classification of the image G. The overall feature extraction unit 14 extracts the overall feature amount, which is the feature of the entire image G, using the appearance feature amount, the classified text feature amount, and the multimodal model 174 in the image G. The model generation unit 15 generates an appearance feature extraction model 172 and a multimodal model 174. The image similarity estimating unit 16 estimates the degree of similarity between the target image and the comparison image based on the overall feature amount in the target image and the overall feature amount in the comparison image. As a result, the image similarity estimation system 1 of the embodiment can extract features that take into account both the appearance and the concept in the image G, and determine the similarity of images by considering not only the appearance but also the concept. Is possible.

また、実施形態の画像類似度推定システム１では、外観特徴抽出モデル１７２は、深層学習の学習モデルの内部状態に重み付けした値を出力するアテンション機構１７２Ｂを含む。モデル生成部１５は、アテンション機構１７２Ｂに、学習用画像における外観情報と前記分類情報との対応関係に応じた重みを学習させる。これにより、実施形態の画像類似度推定システム１では、外観特徴抽出モデル１７２の内部状態において外観の特徴を抽出するのに有効なものに焦点をあてることができ、より精度よく外観特徴量を抽出することが可能となる。 Further, in the image similarity estimation system 1 of the embodiment, the appearance feature extraction model 172 includes an attention mechanism 172B that outputs a value weighted to the internal state of the deep learning learning model. The model generation unit 15 causes the attention mechanism 172B to learn weights according to the correspondence between the appearance information in the learning image and the classification information. As a result, the image similarity estimation system 1 of the embodiment can focus on the internal state of the appearance feature extraction model 172 that is effective for extracting appearance features, and extract appearance feature amounts with higher accuracy. It becomes possible to do so.

また、実施形態の画像類似度推定システム１では、分類テキスト特徴抽出モデル１７３は、文言に含まれる単語の特徴量を示す単語特徴量を、単語のｉｄｆ値で重みづけした値に基づいて当該文言の特徴を抽出するモデルである。ｉｄｆ値は、分類済みの画像の集合である画像群に統計処理を行うことにより算出される値である。これにより、実施形態の画像類似度推定システム１では、絞り込みに効果が期待できない図形分類の影響を弱め、絞り込みに有効な図形分類の影響を強めることができる。したがって、絞り込みにより効果的な分類テキスト特徴量を抽出することが可能となる。 In addition, in the image similarity estimation system 1 of the embodiment, the classified text feature extraction model 173 extracts the word feature amount representing the feature amount of the word included in the sentence based on the value weighted by the idf value of the word. This is a model that extracts the features of The idf value is a value calculated by performing statistical processing on an image group that is a set of classified images. Thereby, in the image similarity estimation system 1 of the embodiment, it is possible to weaken the influence of figure classification, which is not expected to be effective in narrowing down, and strengthen the influence of figure classification, which is effective in narrowing down. Therefore, it becomes possible to extract effective classified text features by narrowing down.

また、実施形態の画像類似度推定システム１では、ｉｄｆ値は、分類済みの画像の集合である画像群の数に対する、前記分類テキスト特徴量を抽出する画像と同じ分類とされた画像の数の割合を用いて算出される値である。これにより、実施形態の画像類似度推定システム１では、上述した効果と同様の効果を奏する。 Furthermore, in the image similarity estimation system 1 of the embodiment, the idf value is the number of images that are classified in the same manner as the image from which the classified text feature is extracted, with respect to the number of image groups that are a set of classified images. This is a value calculated using a ratio. Thereby, the image similarity estimation system 1 of the embodiment achieves the same effects as those described above.

また、実施形態の画像類似度推定システム１では、モデル生成部１５は、学習用画像における外観特徴量及び分類テキスト特徴量が、同一の範囲内に含まれるデータとなるように正規化する処理を行う。モデル生成部１５は、正規化する処理を行った前記学習用画像における、外観特徴量及び分類テキスト特徴量と前記分類情報との対応関係を学習モデルに学習させることにより、マルチモーダルモデル１７４を生成する。これにより、実施形態の画像類似度推定システム１では、二つの特徴量の一方に偏ることなく、両方の特徴量が共に反映された全体特徴量を抽出することができる。したがって、外観と概念の双方を鑑みて類似する画像を推定することができる。 Further, in the image similarity estimation system 1 of the embodiment, the model generation unit 15 performs a process of normalizing the appearance feature amount and the classification text feature amount in the learning image so that they are data included in the same range. conduct. The model generation unit 15 generates a multimodal model 174 by having a learning model learn the correspondence between the appearance feature amount, the classified text feature amount, and the classification information in the learning image that has been subjected to the normalization process. do. Thereby, in the image similarity estimation system 1 of the embodiment, it is possible to extract the entire feature amount in which both feature amounts are reflected, without being biased towards one of the two feature amounts. Therefore, similar images can be estimated based on both appearance and concept.

また、実施形態の画像類似度推定システム１は、学習装置として適用されてもよい。この場合、学習装置は、外観情報取得部１０と、外観特徴抽出部１１と、分類情報取得部１２と、分類テキスト特徴抽出部１３と、全体特徴抽出部１４と、モデル生成部１５とを備える。これにより、学習装置は、画像Ｇにおける外観と概念とを考慮した全体特徴を抽出することができるモデルを生成することができる。 Moreover, the image similarity estimation system 1 of the embodiment may be applied as a learning device. In this case, the learning device includes an appearance information acquisition section 10, an appearance feature extraction section 11, a classification information acquisition section 12, a classified text feature extraction section 13, an overall feature extraction section 14, and a model generation section 15. . Thereby, the learning device can generate a model that can extract the overall features in consideration of the appearance and concept of the image G.

また、実施形態の画像類似度推定システム１は、推定装置として適用されてもよい。この場合、推定装置は、外観情報取得部１０と、外観特徴抽出部１１と、分類情報取得部１２と、分類テキスト特徴抽出部１３と、全体特徴抽出部１４と、画像類似度推定部１６とを備える。これにより、推定装置は、画像Ｇにおける外観と概念とを考慮した全体特徴を抽出することができる。したがって、画像Ｇにおける外観と概念とを考慮して、類似する画像を推定することが可能である。 Moreover, the image similarity estimation system 1 of the embodiment may be applied as an estimation device. In this case, the estimation device includes an appearance information acquisition section 10, an appearance feature extraction section 11, a classification information acquisition section 12, a classified text feature extraction section 13, an overall feature extraction section 14, and an image similarity estimation section 16. Equipped with. Thereby, the estimation device can extract the overall feature in consideration of the appearance and concept of the image G. Therefore, it is possible to estimate similar images by considering the appearance and concept of image G.

上述した実施形態における画像類似度推定システム１の全部又は一部をコンピュータで実現するようにしてもよい。その場合、この機能を実現するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することによって実現してもよい。なお、ここでいう「コンピュータシステム」とは、ＯＳや周辺機器等のハードウェアを含むものとする。また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ－ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムを送信する場合の通信線のように、短時間の間、動的にプログラムを保持するもの、その場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含んでもよい。また上記プログラムは、前述した機能の一部を実現するためのものであってもよく、さらに前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるものであってもよく、ＦＰＧＡ（Field Programmable Gate Array）等のプログラマブルロジックデバイスを用いて実現されるものであってもよい。 All or part of the image similarity estimation system 1 in the embodiment described above may be realized by a computer. In that case, a program for realizing this function may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into a computer system and executed. Note that the "computer system" herein includes hardware such as an OS and peripheral devices. Furthermore, the term "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and storage devices such as hard disks built into computer systems. Furthermore, a "computer-readable recording medium" refers to a storage medium that dynamically stores a program for a short period of time, such as a communication line when transmitting a program via a network such as the Internet or a communication line such as a telephone line. It may also include a device that retains a program for a certain period of time, such as a volatile memory inside a computer system that is a server or client in that case. Further, the above-mentioned program may be one for realizing a part of the above-mentioned functions, or may be one that can realize the above-mentioned functions in combination with a program already recorded in the computer system. It may be realized using a programmable logic device such as an FPGA (Field Programmable Gate Array).

以上、この発明の実施形態について図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、この発明の要旨を逸脱しない範囲の設計等も含まれる。 Although the embodiments of the present invention have been described above in detail with reference to the drawings, the specific configuration is not limited to these embodiments, and includes designs within the scope of the gist of the present invention.

１画像類似度推定システム
１０外観情報取得部
１１外観特徴抽出部
１２分類情報取得部
１３分類テキスト特徴抽出部
１４全体特徴抽出部
１５モデル生成部
１６画像類似度推定部
１７記憶部
１８推定結果出力部
１７０外観情報
１７１分類情報
１７２外観特徴抽出モデル
１７３分類テキスト特徴抽出モデル
１７４マルチモーダルモデル 1 Image similarity estimation system 10 Appearance information acquisition unit 11 Appearance feature extraction unit 12 Classification information acquisition unit 13 Classified text feature extraction unit 14 Overall feature extraction unit 15 Model generation unit 16 Image similarity estimation unit 17 Storage unit 18 Estimation result output unit 170 Appearance information 171 Classification information 172 Appearance feature extraction model 173 Classified text feature extraction model 174 Multimodal model

Claims

an appearance information acquisition unit that acquires appearance information indicating the appearance of the image;
an appearance feature extraction unit that uses the appearance information in the image and the appearance feature extraction model to extract appearance feature amounts indicating the appearance characteristics of the image;
a classification information acquisition unit that acquires classification information indicating the classification of the image;
a classified text feature extraction unit that uses the classification information in the image and the classified text feature extraction model to extract a classified text feature quantity that represents a feature of a wording that indicates the classification of the image;
an overall feature extraction unit that extracts an overall feature that is a feature of the entire image in the image using the appearance feature, the classified text feature, and the multimodal model in the image;
a model generation unit that generates the appearance feature extraction model and the multimodal model;
an image similarity estimation unit that estimates a degree of similarity between the target image and the comparison image based on the overall feature amount in the target image and the overall feature amount in the comparison image;
Equipped with
The appearance feature extraction model is a model that outputs the appearance feature amount in the image from the appearance information in the image,
The model generation unit generates the appearance feature extraction model by causing a learning model to learn a correspondence relationship between the appearance information and the classification information in the learning image,
The classified text feature extraction model is a model that extracts feature amounts of text indicating classification,
The multimodal model is a model that outputs the overall feature amount in the image from the appearance feature amount and the classified text feature amount in the image,
The model generation unit includes the appearance feature amount in the training image extracted by the appearance feature extraction unit, the classified text feature amount in the training image extracted by the classified text feature extraction unit, and the training image. generating the multimodal model by causing a learning model to learn the correspondence relationship with the classification information in the image for use;
Image similarity estimation system.

The appearance feature extraction model includes an attention mechanism that outputs a weighted value for the internal state of the deep learning learning model,
The model generation unit causes the attention mechanism to learn weights according to the correspondence between the appearance information and the classification information in the learning image.
The image similarity estimation system according to claim 1.

The classified text feature extraction model is a model that extracts features of a word based on a value obtained by weighting a word feature representing a feature of a word included in the word with an idf value of the word,
The idf value is a value calculated by performing statistical processing on an image group that is a set of classified images.
The image similarity estimation system according to claim 1 or claim 2.

The idf value is a value calculated using the ratio of the number of image groups that are a set of classified images to the number of images that include the classified text feature amount,
The image similarity estimation system according to claim 3.

The model generation unit performs normalization processing such that the appearance feature amount and the classified text feature amount in the training image are data included in the same range, and performs the normalization processing. generating the multimodal model by causing a learning model to learn the correspondence between the appearance feature amount, the classified text feature amount, and the classification information in the learning image;
The image similarity estimation system according to any one of claims 1 to 4.

an appearance information acquisition unit that acquires appearance information indicating the appearance of the image;
an appearance feature extraction unit that uses the appearance information in the image and the appearance feature extraction model to extract appearance feature amounts indicating the appearance characteristics of the image;
a classification information acquisition unit that acquires classification information indicating the classification of the image;
a classified text feature extraction unit that uses the classification information in the image and the classified text feature extraction model to extract a classified text feature quantity that represents a feature of a wording that indicates the classification of the image;
an overall feature extraction unit that extracts an overall feature that is a feature of the entire image in the image using the appearance feature, the classified text feature, and the multimodal model in the image;
a model generation unit that generates the appearance feature extraction model and the multimodal model;
Equipped with
The appearance feature extraction model is a model that outputs the appearance feature amount in the image from the appearance information in the image,
The model generation unit generates the appearance feature extraction model by causing a learning model to learn a correspondence relationship between the appearance information and the classification information in the learning image,
The classified text feature extraction model is a model that extracts feature amounts of text indicating classification,
The multimodal model is a model that outputs the overall feature amount in the image from the appearance feature amount and the classified text feature amount in the image,
The model generation unit includes the appearance feature amount in the training image extracted by the appearance feature extraction unit, the classified text feature amount in the training image extracted by the classified text feature extraction unit, and the training image. generating the multimodal model by causing a learning model to learn the correspondence relationship with the classification information in the image for use;
learning device.

an appearance information acquisition unit that acquires appearance information indicating the appearance of the image;
an appearance feature extraction unit that uses the appearance information in the image and the appearance feature extraction model to extract appearance feature amounts indicating the appearance characteristics of the image;
a classification information acquisition unit that acquires classification information indicating the classification of the image;
a classified text feature extraction unit that uses the classification information in the image and the classified text feature extraction model to extract a classified text feature quantity that represents a feature of a wording that indicates the classification of the image;
an overall feature extraction unit that extracts an overall feature that is a feature of the entire image in the image using the appearance feature, the classified text feature, and the multimodal model in the image;
an image similarity estimation unit that estimates a degree of similarity between the target image and the comparison image based on the overall feature amount in the target image and the overall feature amount in the comparison image;
Equipped with
The appearance feature extraction model is a model that outputs the appearance feature amount of the image from the appearance information in the image, and by making the learning model learn the correspondence between the appearance information and the classification information in the learning image. The generated model is
The classified text feature extraction model is a model that extracts feature amounts of text indicating classification,
The multimodal model is a model that outputs the overall feature amount in the image from the appearance feature amount and the classified text feature amount in the image, and the multimodal model outputs the overall feature amount in the image, and the appearance feature amount in the learning image extracted by the appearance feature extraction unit. A model generated by causing a learning model to learn the correspondence between the feature amount and the classified text feature amount in the training image extracted by the classified text feature extraction unit and the classification information in the training image. is,
Estimation device.

computer,
Appearance information acquisition means for acquiring appearance information indicating the appearance of the image;
Appearance feature extraction means for extracting appearance feature amounts indicating the appearance characteristics of the image using the appearance information in the image and the appearance feature extraction model;
classification information acquisition means for acquiring classification information indicating the classification of the image;
Classified text feature extraction means that uses the classification information in the image and a classified text feature extraction model to extract a classified text feature quantity that represents a feature of a wording that indicates the classification of the image;
overall feature extraction means for extracting an overall feature amount that is a feature of the entire image in the image using the appearance feature amount, the classified text feature amount, and the multimodal model in the image;
model generation means for generating the appearance feature extraction model and the multimodal model;
It is a program to function as
The appearance feature extraction model is a model that outputs the appearance feature amount in the image from the appearance information in the image,
In the model generation means, the appearance feature extraction model is generated by causing a learning model to learn the correspondence between the appearance information and the classification information in the learning image,
The classified text feature extraction model is a model that extracts feature amounts of text indicating classification,
The multimodal model is a model that outputs the overall feature amount in the image from the appearance feature amount and the classified text feature amount in the image,
In the model generation means, the appearance feature amount in the learning image extracted by the appearance feature extraction means, the classified text feature amount in the learning image extracted by the classified text feature extraction means, and the learning The multimodal model is generated by causing a learning model to learn the correspondence relationship with the classification information in the image for use.
program.

computer,
Appearance information acquisition means for acquiring appearance information indicating the appearance of the image;
Appearance feature extraction means for extracting appearance feature amounts indicating the appearance characteristics of the image using the appearance information in the image and the appearance feature extraction model;
classification information acquisition means for acquiring classification information indicating the classification of the image;
Classified text feature extraction means that uses the classification information in the image and a classified text feature extraction model to extract a classified text feature quantity that represents a feature of a wording that indicates the classification of the image;
an overall feature extraction unit that extracts an overall feature that is a feature of the entire image in the image using the appearance feature, the classified text feature, and the multimodal model in the image;
image similarity estimating means for estimating the degree of similarity between the target image and the comparison image based on the overall feature amount in the target image and the overall feature amount in the comparison image;
It is a program to function as
The appearance feature extraction model is a model that outputs the appearance feature amount of the image from the appearance information in the image, and by making the learning model learn the correspondence between the appearance information and the classification information in the learning image. The generated model is
The classified text feature extraction model is a model that extracts feature amounts of text indicating classification,
The multimodal model is a model that outputs the overall feature amount in the image from the appearance feature amount and the classified text feature amount in the image, and the multimodal model outputs the overall feature amount in the image, and the appearance feature amount in the learning image extracted by the appearance feature extraction means. A model generated by having a learning model learn the correspondence between the feature amount and the classified text feature amount in the training image extracted by the classified text feature extraction means and the classification information in the training image. is,
program.