JP2017091192A

JP2017091192A - Method and device for learning between documents in different languages using images, and method and device for searching cross-lingual document

Info

Publication number: JP2017091192A
Application number: JP2015220107A
Authority: JP
Inventors: Ruika Funaki (舟木類佳); Hideki Nakayama (中山英樹)
Original assignee: University of Tokyo NUC
Current assignee: University of Tokyo NUC
Priority date: 2015-11-10
Filing date: 2015-11-10
Publication date: 2017-05-25
Anticipated expiration: 2035-11-10
Also published as: JP6712796B2

Abstract

PROBLEM TO BE SOLVED: To provide a method for learning between documents in different languages that uses images indirectly.
SOLUTION: A first training data set consisting of pairs of a first-language document and an image is prepared, and a second training data set consisting of pairs of a second-language document and an image is prepared. In the first training data set, a first feature vector is extracted from each first-language document and a second feature vector from each image. In the second training data set, a third feature vector is extracted from each second-language document and a second feature vector from each image. Generalized canonical correlation analysis is then performed on the first, second, and third feature vectors so that the first and third feature vectors are mapped to each other through the second feature vector.
SELECTED DRAWING: Figure 3

Description

The present invention relates to cross-language document retrieval.

Cross-language (cross-lingual) document retrieval is a technique in which, for example, a Japanese document is given as input and related or similar English documents are retrieved. In the prior art, training such a system requires a very large parallel corpus (for example, a set of documents written in both Japanese and English). Because such data is generally difficult to obtain, the approach has been hard to put into practice.

Specifically, translating a large number of documents by hand requires a great deal of labor. Bilingual documents could also be crawled from the Web and used as training data, but most documents on the Web are written in a single language. Collecting a sufficient quantity of multilingual documents is therefore not easy, and it is harder still for minor languages.

The inventors therefore focused on the multimedia information that has become abundant on the Web in recent years, in particular pairs of a document and an image. When two documents written in different languages both contain images and the image features are similar, the text contained in the documents can be expected to be similar as well. Images also have the advantage of being a universal representation: their semantic content can be understood regardless of the viewer's native language, and documents in any country may contain images.

Machine learning in traditional image recognition was far weaker than machine learning in the field of natural language processing. In recent years, however, breakthroughs in deep learning have brought the accuracy of image recognition rapidly closer to the human level (Non-Patent Document 12).
Douglas J Carroll. 1968. Generalization of canonical correlation analysis to three or more sets of variables. In Proceedings of the 76th Annual Convention of the American Psychological Association, volume 3, pages 227-228.
Jon Robers Kettenring. 1971. Canonical Analysis of Several Sets of Variables. Biometrika, 58(3):433-451.
Michel Velden and Yoshio Takane. 2012. Generalized Canonical Correlation Analysis with Missing Values. Computational Statistics, 27(3):551-571.
Jan Rupnik, Andrej Muhic, and Primo Skraba. 2012. Cross-Lingual Document Retrieval through Hub Languages. In Neural Information Processing Systems Workshop.
Harold Hotelling. 1936. Relations between Two Sets of Variates. Biometrika, 28:321-377.
David R Hardoon, Sandor Szedmak, and John Shawe-Taylor. 2004. Canonical Correlation Analysis: an Overview with Application to Learning Methods. Neural Computation, 16(12):2639-2664.
Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R.G. Lanckriet, Roger Levy, and Nuno Vasconcelos. 2010. A New Approach to Cross-modal Multimedia Retrieval. In Proceedings of the International Conference on Multimedia, pages 251-260.
Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. 2014. A Multiview Embedding Space for Modeling Internet Images, Tags, and their Semantics. International Journal of Computer Vision, 106(2):210-233.
Alexei Vinokourov, John Shawe-Taylor, and Nello Cristianini. 2002. Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis. In Advances in Neural Information Processing Systems, pages 1473-1480.
Yaoyong Li and John Shawe-Taylor. 2004. Using KCCA for Japanese-English Cross-language Information Retrieval and Classification. In Learning Methods for Text Understanding.
Raghavendra Udupa and Mitesh M Khapra. 2010. Improving the Multilingual User Experience of Wikipedia Using Cross-Language Name Search. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 492-500.
Hao Fang, Saurabh Gupta, Forrest Iandola, K. Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. 2015. From Captions to Visual Concepts and Back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Return of the Devil in the Details: Delving Deep into Convolutional Nets. In Proceedings of the British Machine Vision Conference, pages 1-11.
Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2013. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In Proceedings of the International Conference on Machine Learning, pages 647-655.
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675-678.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision.
Christian Szegedy, Scott Reed, Pierre Sermanet, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going deeper with convolutions. CoRR, abs/1409.4842.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pages 1097-1105.
Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying Conditional Random Fields to Japanese Morphological Analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 230-237.
Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting Image Annotations Using Amazon’s Mechanical Turk. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pages 139-147.
Florent Perronnin, Jorge Sánchez, and Thomas Mensink. 2010. Improving the Fisher kernel for large-scale image classification. In Proceedings of the European Conference on Computer Vision, pages 143-156.
David G Lowe. 1999. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150-1157.

An object of the present invention is to propose a method and a device for learning between documents in different languages that use images indirectly, and to perform cross-language document retrieval using the resulting learned model.

The image-mediated method for learning between documents in different languages according to the present invention comprises:
preparing a first training data set consisting of pairs of a first-language document and an image;
preparing a second training data set consisting of pairs of a second-language document and an image;
in the first training data set, extracting a first feature vector from each first-language document and a second feature vector from each image;
in the second training data set, extracting a third feature vector from each second-language document and a second feature vector from each image; and
performing generalized canonical correlation analysis using the first, second, and third feature vectors, thereby mapping the first feature vector and the third feature vector to each other with the second feature vector as an intermediary.

In one aspect, the first feature vector and the third feature vector are extracted using bag of words.
Bag of words is one representative example of a method for extracting document features, and the invention is not limited to it.

In one aspect, the second feature vector is extracted using a convolutional neural network (CNN).
Examples of CNNs include AlexNet, CaffeNet, GoogLeNet, and VGG net.

In one aspect, the first, second, and third feature vectors are reduced in dimension by dimension reduction means.
A typical example of the dimension reduction means is principal component analysis (PCA).

In one aspect, the first training data set includes multimedia data obtained by crawling the Web in the first language, and
the second training data set includes multimedia data obtained by crawling the Web in the second language.

In one aspect, a third training data set consisting of pairs of a first-language document and a second-language document is additionally prepared;
in the third training data set, a first feature vector is extracted from the first-language text and a third feature vector is extracted from the second-language text; and
the generalized canonical correlation analysis additionally uses the first and third feature vectors extracted from the third training data set.
This aspect corresponds to the few-shot learning described later.

The image-mediated device for learning between documents in different languages according to the present invention comprises:
a first training data set consisting of pairs of a first-language document and an image;
a second training data set consisting of pairs of a second-language document and an image;
first feature vector extraction means for extracting a first feature vector from a first-language document;
second feature vector extraction means for extracting a second feature vector from an image;
third feature vector extraction means for extracting a third feature vector from a second-language document; and
generalized canonical correlation analysis means,
wherein the generalized canonical correlation analysis means performs generalized canonical correlation analysis using the first, second, and third feature vectors, thereby mapping the first feature vector and the third feature vector to each other with the second feature vector as an intermediary.

The cross-language document retrieval method according to the present invention uses the learned model obtained by the above method for learning between documents in different languages, wherein
the learned model defines a first projection coefficient from the first-language space (the space of the first feature vector) to the canonical space and a second projection coefficient from the second-language space (the space of the third feature vector) to the canonical space, and the method comprises:
extracting a first feature vector from a first-language query document;
projecting the extracted first feature vector onto the canonical space using the first projection coefficient to obtain a first projected feature vector;
extracting a third feature vector from a second-language target document candidate;
projecting the extracted third feature vector onto the canonical space using the second projection coefficient to obtain a third projected feature vector; and
determining the target document using the similarity between the first projected feature vector and the third projected feature vector.

The cross-language document retrieval device according to the present invention uses the learned model obtained by the above method for learning between documents in different languages, wherein
the learned model defines a first projection coefficient from the first-language space to the canonical space and a second projection coefficient from the second-language space to the canonical space, and the device comprises:
means for extracting a first feature vector from a first-language query document;
means for projecting the extracted first feature vector onto the canonical space using the first projection coefficient to obtain a first projected feature vector;
means for extracting a third feature vector from a second-language target document candidate;
means for projecting the extracted third feature vector onto the canonical space using the second projection coefficient to obtain a third projected feature vector; and
means for determining the target document using the similarity between the first projected feature vector and the third projected feature vector.

According to the present invention, mediating through images makes learning between documents in different languages possible even when no parallel corpus exists (zero-shot learning).
According to the present invention, mediating through images also makes such learning possible when only a small parallel corpus is available (few-shot learning).

Research that enables learning between documents in different languages by using documents in a hub language such as English (Non-Patent Document 4) still requires a parallel corpus,
that is, information spanning two languages.
In contrast, the present invention is characterized in that, provided multimedia information accompanies the documents (see FIG. 1), learning can be performed using only information confined to a single language.
Because images are a universal representation, they appear in documents of every language (documents confined to one language) and serve as a bridge (intermediary) during learning.

The present invention focuses on the multimedia information that has become abundant on the Web in recent years. By using a learning method that uses images indirectly, and building on recent breakthroughs in image recognition technology, it achieves cross-language document retrieval with no parallel corpus at all (or with only a small one).

FIG. 1 is a conceptual diagram of image-mediated learning.
FIG. 2 shows a web document containing image data.
FIG. 3 is a schematic diagram of the image-mediated system for learning between documents in different languages according to the present invention.
FIG. 4 is a schematic diagram of the cross-language document retrieval system according to the present invention.
FIG. 5 is a schematic diagram of a system for learning between documents in different languages according to one embodiment of the present invention.
FIG. 6 is a conceptual diagram showing the projection of Japanese document feature vectors, English document feature vectors, and image feature vectors onto the canonical space using generalized canonical correlation analysis.
FIG. 7 is a conceptual diagram of nearest neighbor search in the projected space (canonical space).
FIG. 8 illustrates the data set used in the experiments. For each image, five English sentences and five Japanese sentences corresponding to the respective English sentences are provided. The feature vector extracted from the five Japanese sentences and the feature vector extracted from the image form a pair, and the feature vector extracted from the five English sentences and the feature vector extracted from the image form a pair.
FIG. 9 shows the results of retrieval accuracy experiment 1, in which the number of samples in [train-E/I] and [train-I/J] was varied and, in addition, the number of samples in [train-E/J] was varied.
FIG. 10 shows the results of retrieval accuracy experiment 2, a comparative experiment using several types of image features.

[A] Overview of the image-mediated learning system
An overview of the image-mediated learning system according to the present invention is described with reference to FIG. 3.
[A-1] Data used for learning
First, a first training data set and a second training data set are prepared. The first training data set consists of pairs of a first-language document and an image. The second training data set consists of pairs of a second-language document and an image. In the experimental example described later, Japanese and English are used as the first and second languages, but the two languages may be any pair of different languages. In one aspect, the images in the first training data set and the images in the second training data set are entirely different, so that the two sets do not overlap at all. This does not, however, exclude the case where the same image appears in both sets.

In the first training data set, a first feature vector is extracted from each first-language document and a second feature vector from each image. In the second training data set, a third feature vector is extracted from each second-language document and a second feature vector from each image. Learning is performed using these feature vectors. Note that although all feature vectors extracted from images are collectively called second feature vectors, the second feature vectors obtained from the first training data set differ from those obtained from the second training data set.

The examples described later use the UIUC Pascal Sentence Dataset (Non-Patent Document 19), but a variety of data can serve as the source of the training data sets. In one aspect, the first training data set includes multimedia data obtained by crawling the Web in the first language, and the second training data set includes multimedia data obtained by crawling the Web in the second language. Social networking services such as Twitter, news articles, and blog posts offer abundant text accompanied by multimedia data, which can be obtained by crawling. The training data sets are not limited to data obtained from the Web. For example, paper media containing images such as photographs together with text (books, advertising flyers) may be scanned to obtain pairs of image data and document data. Alternatively, pairs of text and multimedia may be obtained from the video and subtitles of television programs.

[A-2] Extraction of text features
In one aspect, the first and third feature vectors are extracted using bag of words (BoW). BoW is one representative example of a method for extracting document features; non-limiting examples of other methods include the following:
TF-IDF features;
the average (or a similar aggregate) of the Word2vec word vectors of all words in the document;
document vectors obtained with Paragraph2vec;
features based on an N-gram language model;
BoW reduced in dimension using LSI (Latent Semantic Indexing, also called LSA: Latent Semantic Analysis);
BoW reduced in dimension using LDA (Latent Dirichlet Allocation).
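As a minimal sketch of the bag-of-words option, the following Python code counts word occurrences against a fixed vocabulary. The function names `build_vocabulary` and `bow_vector` are hypothetical, and the code assumes the documents have already been split into tokens (for Japanese this would require a morphological analyzer, which is outside the sketch).

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Assign an index to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_vector(tokens, vocab):
    """Return the count of each vocabulary word in one tokenized document."""
    counts = Counter(t for t in tokens if t in vocab)  # ignore unknown words
    vector = [0] * len(vocab)
    for token, count in counts.items():
        vector[vocab[token]] = count
    return vector


# Toy example with pre-tokenized documents (illustrative data only).
docs = [["a", "dog", "runs"], ["a", "cat", "and", "a", "dog"]]
vocab = build_vocabulary(docs)
vectors = [bow_vector(d, vocab) for d in docs]
```

Each language would get its own vocabulary, so the first and third feature vectors generally live in spaces of different dimension.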

[A-3] Extraction of image features
In one aspect, the second feature vector is extracted using a convolutional neural network (CNN). The CNN is the deep learning approach that has been most successful in image recognition (Non-Patent Documents 12 to 17). Examples of CNNs include AlexNet, CaffeNet, GoogLeNet, and VGG net. CNN architectures vary from designer to designer. During training, a CNN updates the weights connecting its layers so that inputs are classified correctly. A trained network therefore has fixed weights, and when an image is input to it, the activation values are computed from those weights. These values are used as the features. There is no restriction on which layer's values are used. Layers close to the final layer of the network are generally considered to yield good features, but those skilled in the art will understand that the best layer can be selected by experiment. The second feature vector usable in the present invention is not limited to CNN features and also includes, for example, Fisher vectors and Bag of Features (Bag of Visual Words).

[A-4] Dimension reduction of feature vectors
In one aspect, the first, second, and third feature vectors may be reduced in dimension by dimension reduction means. Dimension reduction of each feature vector is optional, but in view of computation time it is desirable when the data is large. Principal component analysis (PCA) is a typical example of the dimension reduction means, but independent component analysis (ICA), LDA, or LSI may also be used.
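The PCA option can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the function names `fit_pca` and `apply_pca` are hypothetical, and the number of retained components is a free parameter that the source does not specify here.

```python
import numpy as np


def fit_pca(X, n_components):
    """Learn the mean and the top principal directions of the rows of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T


def apply_pca(X, mean, components):
    """Project feature rows onto the learned principal directions."""
    return (X - mean) @ components
```

The same fitted mean and components would be reused at retrieval time, so that query and candidate documents pass through the same reduction as the training data.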

[A-5] Generalized canonical correlation analysis (GCCA)
Generalized canonical correlation analysis generalizes canonical correlation analysis, which handles two sets of variables, to three or more sets of variables, and can map data so as to maximize the sum of the correlations between multiple modalities. By performing generalized canonical correlation analysis using the first, second, and third feature vectors, the first and third feature vectors are mapped to each other with the second feature vector as an intermediary.

２つの変数群を扱う正準相関分析（CCA）を用いた学習やCLDRへの応用は知られている（非特許文献５〜１１）。GCCAは、ｍ個（本実施形態では、ｍ＝３）のモダリティ用の一般化CCAである。 Learning using canonical correlation analysis (CCA) that handles two variable groups and its application to CLDR are known (Non-Patent Documents 5 to 11). GCCA is a generalized CCA for m modalities (m = 3 in this embodiment).

GCCAの代表的な例には、非特許文献１（Carroll）、非特許文献２（Kettenring）、非特許文献３（Velden et al）、非特許文献４（Rupnik et al）に記載されたものが知られている。非特許文献４では、非特許文献２の手法（部分的に非特許文献１にも言及）が採用されている。後述する実験では、非特許文献２のGCCAを採用したが、本発明を実現するために他のGCCAを採用し得ることが当業者に理解される。 Typical examples of GCCA include those described in Non-Patent Document 1 (Carroll), Non-Patent Document 2 (Kettenring), Non-Patent Document 3 (Velden et al), and Non-Patent Document 4 (Rupnik et al). Are known. In Non-Patent Document 4, the technique of Non-Patent Document 2 (partly referred to in Non-Patent Document 1) is adopted. In the experiment described later, the GCCA of Non-Patent Document 2 was adopted, but it will be understood by those skilled in the art that other GCCA can be adopted to realize the present invention.

[A-6] Image-Mediated Cross-Language Learning Model
GCCA determines a first projection coefficient from the first language space to the canonical space and a second projection coefficient from the second language space to the canonical space. In other words, the learning model obtained by the learning method between different-language documents defines these two projection coefficients. The first feature vector extracted from a first language document is converted into a first projected feature vector by the first projection coefficient, and the third feature vector extracted from a second language document is converted into a third projected feature vector by the second projection coefficient. The first and third projected feature vectors can then be compared in the canonical space (in a broad sense, a "latent space"), which serves as a common or joint space.

[A-7] Few-Shot Learning
One ideal form of the present invention is Zero-Shot learning, which achieves cross-language document retrieval without any parallel corpus, but Few-Shot learning using a small parallel corpus may also be performed. In that case, a third training data set consisting of pairs of first language documents and second language documents is prepared. In the third training data set, the first feature vector is extracted from the first language text and the third feature vector from the second language text, and these first and third feature vectors are additionally used in the generalized canonical correlation analysis.

[A-8] Hardware Configuration
The learning system according to the present invention consists of one or more computers, each comprising, as hardware, processing means (a CPU or the like), storage means (hard disk, RAM, ROM, or the like), input means, and output or display means, and, as software, a control program that operates the computer. The first training data set, consisting of pairs of first language documents and images, and the second training data set, consisting of pairs of second language documents and images, are stored in the storage means. The means for extracting features from text data and the means for extracting features from image data are implemented by the processing means. The first feature vector extracted from the first language document, the second feature vector extracted from the image, and the third feature vector extracted from the second language document are stored in the storage means. The processing means executes generalized canonical correlation analysis using the first, second, and third feature vectors, whereby the first and third feature vectors are mapped to each other with the second feature vector as the intermediary. Specifically, the first projection coefficient from the first language space to the canonical space and the second projection coefficient from the second language space to the canonical space are calculated and stored in the storage means.

[B] Overview of the Cross-Language Document Search System
An overview of the cross-language document search system according to the present invention is described with reference to FIG. 4.
[B-1] Image-Mediated Cross-Language Learning Model
The learning model obtained by the learning method between different-language documents defines a first projection coefficient from the first language space to the canonical space and a second projection coefficient from the second language space to the canonical space. The first feature vector extracted from a first language document is converted into a first projected feature vector by the first projection coefficient. The third feature vector extracted from a second language document is converted into a third projected feature vector by the second projection coefficient. The similarity between a first language document and a second language document can be estimated from the similarity between the first projected feature vector and the third projected feature vector.

[B-2] Search
When a first language query document is input to the search system, a first feature vector is extracted from the input text data. The extracted first feature vector is projected onto the canonical space using the first projection coefficient to obtain a first projected feature vector.

Meanwhile, a third feature vector is extracted from each second language target document candidate and projected onto the canonical space using the second projection coefficient to obtain a third projected feature vector. The third projected feature vector for each second language document may be computed in advance and stored in the storage unit, and the stored vectors may then be used to calculate the similarity described next.

The target document is determined using the similarity between the first projected feature vector and the third projected feature vector. The similarity is typically expressed as a distance between vectors, such as the Euclidean distance, Mahalanobis distance, or Manhattan distance. Cosine similarity may also be used. Typically, the candidate with the highest similarity is output as the second language target document. Alternatively, several candidates with high similarity may be output as second language target documents and may be displayed ranked by similarity.
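The similarity measures and the ranked output described here can be sketched in plain Python. The helper names, document IDs, and two-dimensional vectors below are illustrative assumptions; the Mahalanobis distance is omitted because it additionally needs a covariance estimate.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_candidates(query, candidates, top_k=3):
    """Return (doc_id, distance) pairs, closest candidate first."""
    scored = [(doc_id, euclidean(query, vec)) for doc_id, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]

# Projected query vector and precomputed projected candidate vectors (toy values).
query_vec = [0.9, 0.1]
candidate_vecs = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [0.7, 0.3]}
ranking = rank_candidates(query_vec, candidate_vecs)
```

With a distance, smaller values mean more similar documents; with cosine similarity the sort order would be reversed.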

There is no particular restriction on how the second language target document candidates are chosen. In one aspect, all second language documents available when the first language query document is input are candidates. For example, all data obtained by crawling the second language Web may be targeted.

[B-3] Hardware Configuration
The search system according to the present invention consists of one or more computers, each comprising, as hardware, processing means (a CPU or the like), storage means (hard disk, RAM, ROM, or the like), input means, and output or display means, and, as software, a control program that operates the computer. The user terminal likewise consists of one or more computers, each comprising processing means, storage means, input means, output or display means, a control program that operates the computer, and the like.

The search system and the user terminal each have transmitting and receiving means for exchanging information over a computer network such as the Internet, and are communicably connected to each other over such a network. The search system is also connected to existing search engines over such a network. The screen of the user terminal displays, for example, a query screen; text data in the first language is entered through the input means of the user terminal and sent to the search system as a search query. A document may be uploaded instead of entering text data. In a recommendation system or the like, in which the search system automatically extracts and recommends similar text, there is no interaction on the user side and the search results are simply displayed on the user terminal. In one aspect, multiple second languages are selectable, and one or more of them are designated. The search system calculates the similarity between the search query and the second language target document candidates and makes the search results viewable from the user terminal.

[C] Examples
[C-1] Data Used
This example uses the distinct data divisions shown in Table 1.
[train-E/I]: training documents consisting of pairs of English text and images
[train-I/J]: training documents consisting of pairs of Japanese text and images
[train-E/J]: training documents consisting of pairs of English text and Japanese text
[test-E/J]: test documents consisting of pairs of English text and Japanese text
The data divisions do not overlap. For example, the image data in [train-E/I] differ from the image data in [train-I/J].
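One way to obtain non-overlapping divisions is to shuffle the document IDs once and cut the shuffled list into consecutive slices. The sketch below assumes a pool of 1000 documents and division sizes of 400, 400, 100, and 100; the function name `split_divisions` and the fixed seed are hypothetical.

```python
import random

def split_divisions(doc_ids, sizes, seed=0):
    """Shuffle document IDs and cut them into disjoint, named divisions."""
    ids = list(doc_ids)
    if sum(sizes.values()) > len(ids):
        raise ValueError("not enough documents for the requested sizes")
    random.Random(seed).shuffle(ids)
    divisions, start = {}, 0
    for name, size in sizes.items():
        divisions[name] = ids[start:start + size]
        start += size
    return divisions

# Illustrative sizes only; each document contributes just the two
# modalities named by its division.
sizes = {"train-E/I": 400, "train-I/J": 400, "train-E/J": 100, "test-E/J": 100}
divisions = split_divisions(range(1000), sizes)
```

Because every ID lands in exactly one slice, an image used in [train-E/I] can never reappear in [train-I/J].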

As shown in Table 1, an ID is defined for each modality of each division. For example, E1 denotes the English text features in the [train-E/I] division. A typical CLDR system based on a parallel corpus uses only [train-E/J] as training data and is evaluated on [test-E/J]. In the Zero-Shot learning scenario of this experiment, which uses no [train-E/J] data, only [train-E/I] and [train-I/J] are used as training data. In the Few-Shot learning scenario, a small number of [train-E/J] samples are used as well. In this specification, these two kinds of learning are collectively called image-mediated learning.

[C-2] System Overview
FIG. 5 shows an overview of the system according to the example. Learning uses three kinds of features: English, image, and Japanese. In training, the first feature vector is extracted from English text, the second feature vector from images, and the third feature vector from Japanese text; the extracted features are reduced in dimension by principal component analysis (PCA); and the reduced features are learned by GCCA.

In testing, the features obtained from the Japanese query text are reduced in dimension by PCA. As the arrows in FIG. 5 indicate, the coefficients used for this low-dimensional PCA projection are those learned by PCA in the training phase. The reduced query features are projected using the projection coefficient that GCCA obtained for Japanese. The features obtained from the English text are likewise reduced by PCA and projected using the projection coefficient that GCCA obtained for English. A nearest neighbor search from the Japanese text to the English texts is then performed in the joint space.

[C-3] Image Feature Extraction
Image features are extracted using a convolutional neural network. This example applies CNN models provided with Caffe (Non-Patent Document 16) that were pre-trained on the ILSVRC2012 dataset (Non-Patent Document 15). The experiments used the following as image feature vectors: the features of the pool5/7x7_s1 layer of GoogLeNet (Non-Patent Document 17), the features of the fc6 layer of VGG (Non-Patent Document 13), and the features of the fc6 layer of CaffeNet (Non-Patent Document 16, Non-Patent Document 18).

[C-4] Text Features
Bag-of-words (BoW) features weighted by TF-IDF (term frequency-inverse document frequency) were used as the English and Japanese text features. The MeCab library (Non-Patent Document 18) was used to segment Japanese sentences into words by morphological analysis. No preprocessing such as stop-word removal or stemming was performed in the experiments, although such preprocessing is optional.
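A TF-IDF-weighted bag-of-words vector can be sketched as below. TF-IDF has several variants; this sketch assumes raw term counts multiplied by log(N / document frequency), which may differ from the weighting used in the experiments. Whitespace splitting stands in for tokenization here; Japanese text would first need word segmentation by a morphological analyzer.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors): one TF-IDF-weighted BoW vector per document."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append([counts[tok] * idf[tok] for tok in vocab])
    return vocab, vectors

docs = [
    "a dog runs on the grass".split(),
    "a cat sits on the sofa".split(),
    "a dog sits".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

A term that occurs in every document, such as "a" above, receives weight zero, so the vector emphasizes terms that distinguish one document from the others.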

[C-5] Generalized Canonical Correlation Analysis (GCCA)
GCCA maps data so as to maximize the sum of the correlations between multiple modalities (see FIG. 6). This example adopts the GCCA of Non-Patent Document 2. The computation of GCCA is itself known, and the GCCA described below is only an example that does not limit the GCCA used in the present invention.
Let E, I, and J denote English text, images, and Japanese text, respectively. For a feature vector $x_k$, the canonical variable is $z_k = h_k^{\top}(x_k - \bar{x}_k)$, where $\bar{x}_k$ is the mean of the feature vectors, $h_k$ is the projection coefficient, and $k \in \{E, I, J\}$. The canonical variable (projected feature vector) can thus be computed from each feature vector and the corresponding projection coefficient.
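As a small worked example of computing canonical variables from feature vectors, their mean, and a projection coefficient, the sketch below assumes NumPy, row-wise samples, and a projection matrix whose columns are canonical axes. The numbers and the name `canonical_variables` are made up for illustration.

```python
import numpy as np

def canonical_variables(X, mean, H):
    """Rows of the result are z = H^T (x - mean) for each row x of X."""
    return (X - mean) @ H

X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # toy feature vectors
x_bar = X_train.mean(axis=0)                               # mean feature vector
H_k = np.array([[1.0], [0.5]])                             # one canonical axis (made up)
Z = canonical_variables(X_train, x_bar, H_k)
```

Subtracting the training mean before projecting means the projected training vectors are centered at the origin of the canonical space.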

GCCA solves a maximization problem under a constraint, which yields Equation (1); the formulas for the maximization problem, the constraint, and Equation (1) are not reproduced in this text. The projection coefficients $h_k$ are computed by solving the generalized eigenvalue problem of Equation (1) so as to maximize the sum of the correlations over all pairs of modalities.
Here, $\Sigma_{ij}$ is the covariance matrix of modalities $i$ and $j$, with $i, j \in \{E, I, J\}$. $\Sigma_{EJ}$ is computed from $E_3$ and $J_3$, $\Sigma_{IJ}$ from $I_2$ and $J_2$, $\Sigma_{II}$ from $I_1$ and $I_2$, and $\Sigma_{JJ}$ from $J_2$ and $J_3$. Unlike $\Sigma_{EI}$ and $\Sigma_{IJ}$, $\Sigma_{EJ}$ may have zero training samples, notably in Zero-Shot learning; in that case it does not contribute to the maximization problem, and $\Sigma_{EJ}$ is filled with zeros.
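For orientation, one common way to write a three-modality GCCA as a generalized eigenvalue problem is shown below. It follows from maximizing $\sum_{k<l} h_k^{\top}\Sigma_{kl}h_l$ subject to the single constraint $\sum_k h_k^{\top}\Sigma_{kk}h_k = 1$. This form is an assumption given for illustration; it is not necessarily identical to Equation (1) of this publication and may differ from it by a constant factor on $\rho$.

```latex
\begin{equation*}
\begin{pmatrix}
0 & \Sigma_{EI} & \Sigma_{EJ} \\
\Sigma_{IE} & 0 & \Sigma_{IJ} \\
\Sigma_{JE} & \Sigma_{JI} & 0
\end{pmatrix}
h = \rho
\begin{pmatrix}
\Sigma_{EE} & 0 & 0 \\
0 & \Sigma_{II} & 0 \\
0 & 0 & \Sigma_{JJ}
\end{pmatrix}
h,
\qquad
h = \begin{pmatrix} h_E \\ h_I \\ h_J \end{pmatrix}
\end{equation*}
```

In the Zero-Shot case the blocks $\Sigma_{EJ}$ and $\Sigma_{JE}$ are zero, so English and Japanese are coupled only through their covariances with the image modality.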

The canonical axes are standardized, and a regularization term with parameter $\alpha$ is added to prevent overfitting; the standardization and regularization formulas are not reproduced in this text.
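The following NumPy sketch shows how a Zero-Shot GCCA of this kind could be computed. It rests on several assumptions that are not taken from the patent: the eigenproblem has the block form C h = ρ D h with zero diagonal blocks in C, the regularization adds α times the identity to the auto-covariance blocks, and the axes are normalized so that hᵀDh = 1. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def cov(A, B):
    """Cross-covariance of paired row-wise samples A (n, da) and B (n, db)."""
    return (A - A.mean(axis=0)).T @ (B - B.mean(axis=0)) / len(A)

def gcca_zero_shot(E1, I1, I2, J2, alpha=0.01, n_axes=2):
    """Solve C h = rho D h using pairs (E1, I1) and (I2, J2); Sigma_EJ stays zero."""
    dE, dI, dJ = E1.shape[1], I1.shape[1], J2.shape[1]
    e, i = slice(0, dE), slice(dE, dE + dI)
    j = slice(dE + dI, dE + dI + dJ)
    I_all = np.vstack([I1, I2])

    # Off-diagonal blocks: only modality pairs that were observed together.
    C = np.zeros((dE + dI + dJ, dE + dI + dJ))
    C[e, i] = cov(E1, I1)
    C[i, e] = C[e, i].T
    C[i, j] = cov(I2, J2)
    C[j, i] = C[i, j].T

    # Block-diagonal auto-covariances plus the regularization term alpha * I.
    D = np.zeros_like(C)
    D[e, e] = cov(E1, E1)
    D[i, i] = cov(I_all, I_all)   # Sigma_II uses the images of both divisions
    D[j, j] = cov(J2, J2)
    D += alpha * np.eye(len(D))

    # Whiten with D^(-1/2) so the problem becomes a symmetric eigenproblem.
    w, V = np.linalg.eigh(D)
    D_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
    rho, U = np.linalg.eigh(D_inv_sqrt @ C @ D_inv_sqrt)
    order = np.argsort(rho)[::-1][:n_axes]        # largest eigenvalues first
    H = D_inv_sqrt @ U[:, order]                  # satisfies h^T D h = 1 per axis
    return H[e], H[i], H[j], rho[order]

# Toy data: a 2-D latent "content" vector drives all three modalities.
rng = np.random.default_rng(0)
A_E, A_I, A_J = (rng.normal(size=(2, d)) for d in (6, 5, 7))
s1, s2 = rng.normal(size=(300, 2)), rng.normal(size=(300, 2))
E1 = s1 @ A_E + 0.1 * rng.normal(size=(300, 6))
I1 = s1 @ A_I + 0.1 * rng.normal(size=(300, 5))
I2 = s2 @ A_I + 0.1 * rng.normal(size=(300, 5))
J2 = s2 @ A_J + 0.1 * rng.normal(size=(300, 7))

h_E, h_I, h_J, rho = gcca_zero_shot(E1, I1, I2, J2)
```

A Few-Shot variant would additionally fill the Σ_EJ block from the paired English and Japanese samples and pool those samples into Σ_EE and Σ_JJ.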

[C-6] Nearest Neighbor Search in the Joint Space
Given a query document in the first language, related documents in the second language are found by computing the distance between the query document and each candidate document in the joint space. The feature vector in the joint space (projected feature vector) is computed as $z_k = h_k^{\top}(x_k - \bar{x}_k)$, where $h_k$ is obtained by GCCA.

For example, if the query document is in Japanese and the target documents are in English, the nearest neighbor in the joint space is $\hat{i} = \arg\min_i d(z_E^i, z_J^j)$, where $z_E^i$ and $z_J^j$ are the projected feature vectors of the target document and the query document, respectively, and $d(\cdot)$ is a distance function. In this example, the distance function is the Euclidean distance.

[D] Experiments
[D-1] Data Set Used in the Experiments
The UIUC Pascal Sentence Dataset (Non-Patent Document 19) contains 1000 images, each annotated with five English sentences describing its content. The data set was created for research on generating text from images; to use it for the image-mediated CLDR of this embodiment, a Japanese translation of each English sentence was prepared (see FIG. 8). In this experiment, the five sentences for each image are treated together as a single text. In this setup, therefore, the data set consists of 1000 documents, each comprising one image, the corresponding English text, and the corresponding Japanese text. Each symbol in the conceptual diagram of FIG. 6 represents a feature vector extracted from five Japanese sentences, a feature vector extracted from one image, or a feature vector extracted from five English sentences.

[D-2] Evaluation
Data were drawn at random for each data division in Table 1 so that the divisions do not overlap. Experiments were run while varying the sample size of [train-E/I] and [train-I/J], specifically 100, 200, 300, and 400 samples. In addition, the number of [train-E/J] samples was increased stepwise from 0 to 100 to create Few-Shot learning scenarios. The size of the test data [test-E/J] was set to 100.
Under this setup, GCCA-based image-mediated CLDR was run and compared with conventional CCA-based CLDR trained only on [train-E/J] data.

Performance was evaluated by the top-1 Japanese-to-English retrieval accuracy on the test data. With 100 test samples, the chance rate is 1%. Each experiment was repeated for 50 trials with randomly resampled data, and the average score was used. All features were reduced to 100 dimensions by PCA, and α was set to 0.01.
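The top-1 accuracy metric can be sketched as follows, assuming the j-th query and the j-th target form the true pair and the Euclidean distance is used. The function name and the four toy vector pairs are illustrative; in the experiments the score would be averaged over repeated trials.

```python
import math

def top1_accuracy(query_vecs, target_vecs):
    """Fraction of queries whose nearest target has the same index (the true pair)."""
    hits = 0
    for j, q in enumerate(query_vecs):
        nearest = min(range(len(target_vecs)),
                      key=lambda i: math.dist(target_vecs[i], q))
        hits += nearest == j
    return hits / len(query_vecs)

z_J = [[0.1, 0.0], [1.0, 0.9], [2.1, 2.0], [3.0, 3.3]]   # projected queries (toy)
z_E = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [2.0, 2.0]]   # projected targets (toy)
accuracy = top1_accuracy(z_J, z_E)

# With 100 test pairs and one correct target per query, random guessing gives 1%.
chance_rate = 1 / 100
```

In this toy data the last two targets are swapped relative to their queries, so only the first two queries retrieve their true partner.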

As FIG. 9 shows, in both the Zero-Shot and Few-Shot learning scenarios, accuracy improves as the amount of text-image data increases. Further improvement in accuracy can be expected from increasing the amount of text-image data.

Table 3 summarizes the results of the Zero-Shot learning scenario (Zero-Shot learning accuracy). Image features were extracted with GoogLeNet, and the text features were bag-of-words (BoW) weighted by TF-IDF.
As FIG. 9 shows, the performance of both GCCA and CCA improves as the [train-E/J] sample size increases. As might be expected, once the sample size is large enough for direct learning between English text and Japanese text, CCA outperforms GCCA. When the amount of [train-E/J] data is small, however, GCCA outperforms the CCA baseline, so image-mediated learning is useful, including in the Zero-Shot learning scenario.

[D-3] Effect of Image Features
The effect of image feature performance in this embodiment was examined (FIG. 10, Table 4). Three different features extracted with the following CNNs were used as image features.
Features of the pool5/7x7_s1 layer of GoogLeNet (Non-Patent Document 17)
Features of the fc6 layer of VGG (Non-Patent Document 13)
Features of the fc6 layer of CaffeNet (Non-Patent Document 16, Non-Patent Document 18)

The Fisher vector (Non-Patent Document 20), which was widely used before deep learning was applied to image feature extraction, was also tested. For the Fisher vector, SIFT descriptors (Non-Patent Document 21) were reduced to 64 dimensions by principal component analysis, and a Gaussian mixture model with 64 components was used. Four spatial grids were used for the final feature extraction.

Table 4 shows the Zero-Shot learning accuracy for several image features. The sample size of [train-E/I] and [train-I/J] is 400. The text features were bag-of-words (BoW) weighted by TF-IDF.
The order of accuracy across the image features matched the known order of recognition accuracy of those features, namely GoogLeNet, VGG net, CaffeNet, and Fisher vector from highest to lowest (Non-Patent Document 12). That the same ranking holds in image-mediated CLDR means that better image features yield higher retrieval accuracy in image-mediated CLDR.

Claims

A method for learning between documents in different languages via images, comprising:
preparing a first training data set consisting of pairs of first language documents and images;
preparing a second training data set consisting of pairs of second language documents and images;
extracting, in the first training data set, a first feature vector from each first language document and a second feature vector from each image;
extracting, in the second training data set, a third feature vector from each second language document and a second feature vector from each image; and
performing generalized canonical correlation analysis using the first feature vector, the second feature vector, and the third feature vector, thereby mapping the first feature vector and the third feature vector to each other through the second feature vector.

The learning method according to claim 1, wherein the first feature vector and the third feature vector are extracted using Bag of words.

The learning method according to claim 1, wherein the second feature vector is extracted using a convolutional neural network.

The learning method according to claim 1, wherein the first feature vector, the second feature vector, and the third feature vector are dimensionally reduced.

The first training data set includes multimedia data obtained by crawling from the web in a first language;
The second training data set includes multimedia data acquired by crawling from the Web in a second language.
The learning method according to claim 1.

The learning method according to claim 1, further comprising:
preparing a third training data set consisting of pairs of first language documents and second language documents; and
extracting, in the third training data set, a first feature vector from the first language text and a third feature vector from the second language text,
wherein the first feature vector and the third feature vector extracted from the third training data set are additionally used in the generalized canonical correlation analysis.

A device for learning between documents in different languages via images, comprising:
a first training data set consisting of pairs of first language documents and images;
a second training data set consisting of pairs of second language documents and images;
first feature vector extraction means for extracting a first feature vector from a first language document;
second feature vector extraction means for extracting a second feature vector from an image;
third feature vector extraction means for extracting a third feature vector from a second language document; and
generalized canonical correlation analysis means,
wherein the generalized canonical correlation analysis means performs generalized canonical correlation analysis using the first feature vector, the second feature vector, and the third feature vector, thereby mapping the first feature vector and the third feature vector to each other through the second feature vector.

The learning apparatus according to claim 7, wherein the first feature vector extraction unit and the third feature vector extraction unit acquire Bag of words.

The learning apparatus according to claim 7, wherein the second feature vector extraction unit is a convolutional neural network.

The learning device includes principal component analysis means,
The learning device according to claim 7, wherein the first feature vector, the second feature vector, and the third feature vector are dimensionally reduced by a dimension reduction means.

The first training data set includes multimedia data obtained by crawling from the web in a first language;
The second training data set includes multimedia data acquired by crawling from the Web in a second language.
The learning device according to claim 7.

The learning device according to claim 7, further comprising a third training data set consisting of pairs of first language documents and second language documents,
wherein a first feature vector and a third feature vector extracted from the third training data set are additionally used in the generalized canonical correlation analysis.

A cross-language document search method using a learning model obtained by a learning method between different language documents according to any one of claims 1 to 6,
In the learning model, a first projection coefficient from the first language space to the canonical space and a second projection coefficient from the second language space to the canonical space are defined,
Extracting a first feature vector from a first language query document;
Projecting the extracted first feature vector onto a canonical space using the first projection coefficient to obtain a first projected feature vector;
Extracting a third feature vector from the second language target document candidate;
Projecting the extracted third feature vector onto the canonical space using the second projection coefficient to obtain a third projected feature vector;
Determining a target document using the similarity between the first and third projected feature vectors;
Cross-language document retrieval method.

A cross-language document search apparatus using a learning model obtained by a learning method between different language documents according to any one of claims 1 to 6,
In the learning model, a first projection coefficient from the first language space to the canonical space and a second projection coefficient from the second language space to the canonical space are defined,
Means for extracting a first feature vector from a first language query document;
Means for projecting the extracted first feature vector into a canonical space using a first projection coefficient to obtain a first projected feature vector;
Means for extracting a third feature vector from the second language target document candidate;
Means for projecting the extracted third feature vector onto a canonical space using a second projection coefficient to obtain a third projected feature vector;
Means for determining a target document using the similarity between the first projected feature vector and the third projected feature vector;
A cross-language document search device.