JP2010009237A

JP2010009237A - Multi-language similar document retrieval device, method and program, and computer-readable recording medium

Info

Publication number: JP2010009237A
Application number: JP2008166339A
Authority: JP
Inventors: Masahiro Oku; 雅博奥; Naoto Abe; 直人阿部; Katsuto Bessho; 克人別所; Toshiro Uchiyama; 俊郎内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-06-25
Filing date: 2008-06-25
Publication date: 2010-01-14

Abstract

PROBLEM TO BE SOLVED: To enable similarity retrieval even for retrieval target documents written in a natural language different from that of the input similarity-retrieval keyword or document.

SOLUTION: For n (n ≥ 2) natural languages, this multi-language similar document retrieval device calculates in advance the concept vectors of words in each natural language and stores them in a word concept base. For a group of retrieval target documents, the device also calculates in advance the concept vector of each retrieval target document, using the word concept base for the natural language in which the document is written, and stores it in a retrieval target document concept base. When an input key is entered, the device translates it into the other natural languages to create translation keys, calculates an input document translation concept vector for each key using the word concept base for the respective natural language, compares each input document translation concept vector with the concept vectors of the retrieval target documents in the retrieval target document concept base for that natural language, and calculates the similarity.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a similar document search technique for retrieving search target documents similar to an input keyword or document, and in particular to a multilingual similar document search apparatus, method, program, and computer-readable recording medium that enable similarity search even for search target documents written in a natural language different from that of the input similarity-search keyword or document.

Conventional techniques for similarity search across languages include the following.

(1) A keyword written in one natural language is expanded into synonyms, and the translations of those synonyms are used as queries to search the search target documents in each language.

(2) A parallel corpus, consisting of a text corpus in one natural language and its translation into another natural language, is used to place all words of both languages in a multidimensional space according to their co-occurrence tendencies. The similarity of words in different languages is calculated from their similarity (distance) in this multidimensional space (see, for example, Non-Patent Document 1). Multilingual similarity search is then realized on the basis of this word-to-word similarity.

Non-Patent Document 1: Masuichi, H., Flournoy, R., Kaufmann, S., and Peters, S., "A Bootstrapping method for Extracting Bilingual Text Pairs", Coling 2000, pp. 1066-1070 (2000).

However, the conventional techniques above have the following problems.

A: Method (1) yields similar documents for each language, but the similarity between them cannot be calculated.

B: Method (2) can calculate similarity across languages, but a parallel corpus written in multiple languages must be prepared, and preparing it requires many man-hours.

The present invention has been made in view of the above points. Its object is to provide a multilingual similar document search apparatus, method, program, and computer-readable recording medium that enable similarity search even for search target documents written in a natural language different from that of the input similarity-search keyword or document.

FIG. 1 is a diagram showing the principle configuration of the present invention.

The present invention (claim 1) is a multilingual similar document search apparatus that obtains a similarity when the input search key and the search target documents are in different languages, the apparatus comprising, for n (n ≥ 2) natural languages:
word concept base storage processing means 1 for calculating concept vectors of words for each natural language and storing them in word concept base storage means 8;
search target document concept base storage processing means 2 for calculating, for a group of search target documents, the concept vector of each search target document using the word concept base storage means 8 for the natural language in which that document is written, and storing it in search target document concept base storage means 10;
machine translation means 3 for translating a word group or document written in one natural language and used for the search (hereinafter, "input key") 11 into another natural language, thereby creating a translated word group or document (hereinafter, "translation key");
input document translation concept vector calculation means 4 for calculating input document translation concept vectors for the input key 11 and for the translation key created by the machine translation means 3, using the word concept base storage means 8 for each respective natural language; and
similarity calculation means 5 for calculating a similarity by comparing the input document translation concept vectors obtained by the input document translation concept vector calculation means 4 with the concept vectors of the search target documents in the search target document concept base storage means 10 for each respective natural language.

The present invention (claim 2) further includes similar document display means for displaying the search target documents in descending order of the similarity calculated by the similarity calculation means 5.

The present invention (claim 3) includes, in the similar document display means, translation means for translating results written in a natural language different from that of the input into the same natural language as the input.

FIG. 2 is a diagram for explaining the principle of the present invention.

The present invention (claim 4) is a multilingual similar document search method that obtains a similarity when the input search key and the search target documents are in different languages, wherein, for n (n ≥ 2) natural languages,
a word concept base and a search target document concept base are constructed in advance by two steps: a word concept base storage processing step (step 1) in which word concept base storage processing means calculates in advance concept vectors of words for each natural language and stores them in word concept base storage means; and
a search target document concept base storage processing step (step 2) in which search target document concept base storage processing means calculates in advance, for a group of search target documents, the concept vector of each search target document using the word concept base storage means for the natural language in which that document is written, and stores it in search target document concept base storage means;
and the method performs:
a machine translation step (step 3) in which, when a word group or document written in one natural language and used for the search (hereinafter, "input key") is input, machine translation means translates the input key into another natural language, thereby creating a translated word group or document (hereinafter, "translation key");
an input document translation concept vector calculation step (step 4) in which input document translation concept vector calculation means calculates input document translation concept vectors for the input key and for the translation key created in the machine translation step (step 3), using the word concept base storage means for each respective natural language; and
a similarity calculation step (step 5) in which similarity calculation means calculates a similarity by comparing the input document translation concept vectors obtained in the input document translation concept vector calculation step (step 4) with the concept vectors of the search target documents in the search target document concept base storage means for each respective natural language.

The present invention (claim 5) further performs a similar document display step of displaying the search target documents in descending order of the similarity calculated in the similarity calculation step (step 5).

The present invention (claim 6) further includes, in the similar document display step, a translation step of translating results written in a natural language different from that of the input into the same natural language as the input.

The present invention (claim 7) is a multilingual similar document search program for causing a computer to function as each of the means constituting the multilingual similar document search apparatus according to any one of claims 1 to 3.

The present invention (claim 8) is a computer-readable recording medium storing the multilingual similar document search program according to claim 7.

As described above, the present invention provides the following effects.

A: Unlike conventional method (1), using similarity based on concept vectors allows similarities across languages to be calculated on the same scale.

B: Unlike conventional method (2), no parallel corpus is needed; it is sufficient to prepare a corpus for each language, so corpus preparation does not require many man-hours.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 shows the configuration of a multilingual similar document search apparatus according to an embodiment of the present invention.

The multilingual similar document search apparatus shown in the figure comprises a word concept base storage processing unit 1, a search target document concept base storage processing unit 2, a machine translation unit 3, an input document translation concept vector calculation unit 4, a similarity calculation unit 5, a similar document display unit 6, a plurality of word concept bases 8, and a plurality of search target document concept bases 10. In terms of hardware, it is composed of a CPU and memory.

The word concept base storage processing unit 1 calculates the word concept vectors of each natural language from corpora 7, each written in one of the n natural languages, and stores them in the word concept bases 8. (The suffix "-i" in the figure indicates the natural language in which the item is written; for example, "-1" is Japanese and "-2" is English.) The corpora 7 input to the word concept base storage processing unit 1 are stored in database form and are read out by the word concept base storage processing unit 1.

The search target document concept base storage processing unit 2 calculates search target document concept vectors from the search target documents 9, each written in one of the n natural languages, by referring to the word concept base 8 of the respective natural language, and stores them in the search target document concept bases 10. The input search target documents 9 are stored in database form and are read out by the search target document concept base storage processing unit 2.

The machine translation unit 3 translates the input key 11, which is a word group or document entered by the user for the search, into another natural language to create a translation key. For example, if the input is in Japanese, it is translated into other natural languages such as English and French, and the results are used as translation keys.
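As a rough illustration of how the machine translation unit 3 could produce one translation key per target language, the following sketch assumes a hypothetical `translate(text, source_lang, target_lang)` callable standing in for a real machine translation engine. The names `make_translation_keys`, `toy_translate`, and `TOY_TABLE` are illustrative and do not come from the patent.

```python
from typing import Callable, Dict, List

# Hypothetical translator signature: (text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def make_translation_keys(
    input_key: str, source_lang: str, languages: List[str], translate: Translator
) -> Dict[str, str]:
    """Translate the input key into every language other than its own."""
    return {
        lang: translate(input_key, source_lang, lang)
        for lang in languages
        if lang != source_lang
    }


# Toy word-for-word table used only for this example (an assumption, not real MT).
TOY_TABLE = {
    ("ja", "en"): {"猫": "cat", "牛乳": "milk"},
    ("ja", "fr"): {"猫": "chat", "牛乳": "lait"},
}


def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    """Replace each whitespace-separated word using TOY_TABLE; keep unknown words."""
    table = TOY_TABLE.get((source_lang, target_lang), {})
    return " ".join(table.get(word, word) for word in text.split())


# With n = 3 languages, a Japanese input key yields n - 1 = 2 translation keys.
keys = make_translation_keys("猫 牛乳", "ja", ["ja", "en", "fr"], toy_translate)
print(keys)
```

Passing the translator in as a parameter keeps the sketch independent of any particular translation engine.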

The input document translation concept vector calculation unit 4 calculates input document translation concept vectors for the input key 11 and for the translation keys of the input key 11 obtained by the machine translation unit 3, using the word concept base 8 for each respective natural language.
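The calculation performed by unit 4 might look like the following minimal sketch. It follows the rule given in the operation steps of this description: a single word uses its own concept vector, while a word group or document uses the centroid of its words' vectors. Representing a word concept base as a dictionary of equal-length lists, skipping words missing from the base, and the function names are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

Vector = List[float]


def key_concept_vector(words: List[str], word_base: Dict[str, Vector]) -> Optional[Vector]:
    """Return the word's own vector for one word, or the centroid for several words."""
    # Words missing from the concept base are skipped (an assumption of this sketch).
    vectors = [word_base[w] for w in words if w in word_base]
    if not vectors:
        return None
    if len(vectors) == 1:
        return list(vectors[0])
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def input_document_translation_vectors(
    keys_by_lang: Dict[str, List[str]], word_bases: Dict[str, Dict[str, Vector]]
) -> Dict[str, Vector]:
    """Compute one vector per language, each with that language's own concept base."""
    result = {}
    for lang, words in keys_by_lang.items():
        vec = key_concept_vector(words, word_bases[lang])
        if vec is not None:
            result[lang] = vec
    return result


# Toy concept bases with two dimensions per language (made-up values).
word_bases = {
    "ja": {"猫": [1.0, 0.0], "牛乳": [0.0, 1.0]},
    "en": {"cat": [0.8, 0.2], "milk": [0.2, 0.8]},
}
vectors = input_document_translation_vectors(
    {"ja": ["猫", "牛乳"], "en": ["cat"]}, word_bases
)
print(vectors)
```

Each language's key is only ever looked up in that language's own concept base, so the bases need not share dimensions or a common vector space.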

The similarity calculation unit 5 calculates the similarity by comparing the input document translation concept vectors obtained by the input document translation concept vector calculation unit 4 with the search target document concept base 10 for each respective natural language.
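The description does not name a specific similarity measure. As one plausible choice, the sketch below uses cosine similarity between the query vector and every document vector in a single language's document concept base. Cosine similarity, the dictionary layout, and the function names are assumptions of this example rather than details given in the patent.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_documents(query: Vector, document_base: Dict[str, Vector]) -> Dict[str, float]:
    """Score every document vector in one language's concept base against the query."""
    return {doc_id: cosine_similarity(query, vec) for doc_id, vec in document_base.items()}


# Toy document concept base for one language (made-up values).
document_base = {"doc1": [1.0, 0.0], "doc2": [0.0, 1.0], "doc3": [1.0, 1.0]}
scores = score_documents([1.0, 0.0], document_base)
print(scores)
```

Because cosine similarity is bounded and independent of vector length, scores obtained in different languages can be placed on a common scale, which is the property the description relies on when it ranks documents across languages.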

The similar document display unit 6 displays the search target documents 9 in descending order of the similarity calculated by the similarity calculation unit 5.

The word concept bases 8 store word concept vectors for each of the n natural languages.

The search target documents 9 are documents in each of the n natural languages.

The search target document concept bases 10 store search target document concept vectors for each of the n natural languages.

The operation of the above configuration is described below.

FIG. 4 is a flowchart of the operation in one embodiment of the present invention.

In the following description, it is assumed that the word concept bases 8 for the n natural languages have been constructed in advance, and that the search target document concept bases 10 for the n natural languages have also been constructed using them.

The method of constructing the word concept bases 8 and the search target document concept bases 10 is not particularly limited. For example, they can be constructed by the methods described in Reference 1 (Schutze, "Dimensions of Meaning", Proceedings of Supercomputing 92, pp. 787-796) or Reference 2 (Kasahara, Matsuzawa, Ishikawa, "Similarity judgment of everyday words using a Japanese dictionary", Transactions of the Information Processing Society of Japan, Vol. 38, No. 7, pp. 1272-1284 (1997)).
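Since the construction method is left open, the following is only a simplified sketch of one way a monolingual word concept base could be built from a tokenized corpus: counting how often each word co-occurs with a fixed set of context words inside a sliding window. It is not the method of either cited reference. The function name, the `window` parameter, and the choice of context words are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List

Vector = List[float]


def build_word_concept_base(
    sentences: List[List[str]], context_words: List[str], window: int = 2
) -> Dict[str, Vector]:
    """Map each word to its co-occurrence counts with the chosen context words."""
    index = {word: i for i, word in enumerate(context_words)}
    counts: Dict[str, Vector] = defaultdict(lambda: [0.0] * len(context_words))
    for tokens in sentences:
        for pos, word in enumerate(tokens):
            start = max(0, pos - window)
            end = min(len(tokens), pos + window + 1)
            for j in range(start, end):
                # Count neighbours inside the window, excluding the word itself.
                if j != pos and tokens[j] in index:
                    counts[word][index[tokens[j]]] += 1.0
    return dict(counts)


# Toy monolingual corpus (already tokenized); one such corpus per language is assumed.
corpus_en = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
]
base_en = build_word_concept_base(corpus_en, ["drinks", "milk", "water"], window=2)
print(base_en["cat"], base_en["dog"])
```

Only a monolingual corpus is needed for each language, which matches effect B stated earlier: no parallel corpus has to be prepared.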

Step 101) The machine translation unit 3 translates the input key 11, which is a word group or document written in a certain language, into the (n-1) other natural languages to create translation keys. It then sends the input key and the translation keys to the input document translation concept vector calculation unit 4.

Step 102) The input document translation concept vector calculation unit 4 calculates the concept vector of the input key using the word concept base 8 constructed from the same natural language as the input key 11.

Step 103) The input document translation concept vector calculation unit 4 further calculates, for every translation key, the concept vector of the translation key using the word concept base 8 constructed from the same natural language as that translation key.

Step 104) The input document translation concept vector calculation unit 4 branches the processing depending on whether the input key and translation keys are single words.

Step 105) If the input key and translation keys are single words, the obtained concept vector is used as the input document translation concept vector for each language.

Step 106) If the input key and translation keys are word groups or documents, the centroid of the concept vectors obtained for each language is calculated and used as the input document translation concept vector for that language.

Step 107) The obtained input document translation concept vector for each language is sent to the similarity calculation unit 5. The similarity calculation unit 5 thus receives input document translation concept vectors for n natural languages.

Step 108) The similarity calculation unit 5 calculates the similarity between the input document translation concept vector for the input key 11 and every search target document concept vector in the search target document concept base 10 of the corresponding natural language.

Step 109) The similarity calculation unit 5 calculates the similarity between the input document translation concept vector for each translation key and every search target document concept vector in the search target document concept base 10 of the corresponding natural language.

Step 110) The similarity calculation unit 5 then sends the calculation results obtained in steps 108 and 109 to the similar document display unit 6. That is, it sends the calculation results based on the individual similarities for the n natural languages (the natural language of the input key and the (n-1) natural languages of the translation keys) to the similar document display unit 6.

Step 111) The similar document display unit 6 temporarily stores the calculation results sent from the similarity calculation unit 5 in a memory (not shown) and, regardless of the natural language, sorts the concept vectors of the search target document concept bases 10 in descending order of similarity. It then retrieves the corresponding documents from the search target documents 9 written in the corresponding natural languages and displays them.
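The merging and sorting in steps 110 and 111 can be sketched as follows, assuming the similarity results arrive as one dictionary of document scores per language. The function name `rank_across_languages`, the optional `top_k` cut-off, and the sample scores are illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple


def rank_across_languages(
    scores_by_lang: Dict[str, Dict[str, float]], top_k: Optional[int] = None
) -> List[Tuple[str, str, float]]:
    """Merge per-language scores and sort them by similarity, ignoring the language."""
    merged = [
        (lang, doc_id, score)
        for lang, scores in scores_by_lang.items()
        for doc_id, score in scores.items()
    ]
    merged.sort(key=lambda item: item[2], reverse=True)
    return merged if top_k is None else merged[:top_k]


# Made-up similarity results for a Japanese input key and its English translation key.
scores_by_lang = {
    "ja": {"ja-1": 0.91, "ja-2": 0.40},
    "en": {"en-1": 0.75, "en-2": 0.95},
}
ranking = rank_across_languages(scores_by_lang)
for lang, doc_id, score in ranking:
    print(f"{score:.2f} [{lang}] {doc_id}")
```

Documents from different languages end up interleaved in a single list, ordered only by their similarity scores.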

At this time, as shown in FIG. 5, a similar document machine translation unit 62 may be provided in the similar document display unit 6 so that results in the (n-1) natural languages different from that of the input are machine-translated into the same natural language as the input before being displayed.
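The optional translation of results before display could be sketched as below. The example again assumes a hypothetical `translate(text, source_lang, target_lang)` callable in place of a real machine translation engine. `render_results`, `fake_translate`, and the sample documents are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str, str, str], str]


def render_results(
    ranked: List[Tuple[str, str, float]],
    documents: Dict[str, Dict[str, str]],
    input_lang: str,
    translate: Translator,
) -> List[str]:
    """Build display lines, translating documents not written in the input language."""
    lines = []
    for lang, doc_id, score in ranked:
        text = documents[lang][doc_id]
        if lang != input_lang:
            text = translate(text, lang, input_lang)
        lines.append(f"{score:.2f} [{lang}] {text}")
    return lines


def fake_translate(text: str, source_lang: str, target_lang: str) -> str:
    """Stand-in for machine translation: only tags the text with the language pair."""
    return f"({source_lang}->{target_lang}) {text}"


documents = {
    "ja": {"ja-1": "猫が牛乳を飲む。"},
    "en": {"en-2": "A cat drinks milk."},
}
ranked = [("en", "en-2", 0.95), ("ja", "ja-1", 0.91)]
lines = render_results(ranked, documents, "ja", fake_translate)
print("\n".join(lines))
```

Results already in the input language are passed through unchanged, so only the results in the other (n-1) languages incur a translation call.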

The operations of the components of the multilingual similar document search apparatus 13 shown in FIG. 3 can be built as a program, which can be installed and executed on a computer used as the multilingual similar document search apparatus, or distributed via a network.

The constructed program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer or distributed.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to similarity search techniques for retrieving documents similar to a keyword or document, and in particular to similarity search across multiple languages.

FIG. 1 is a diagram showing the principle configuration of the present invention.
FIG. 2 is a diagram for explaining the principle of the present invention.
FIG. 3 is a configuration diagram of the multilingual similar document search apparatus in one embodiment of the present invention.
FIG. 4 is a flowchart of the operation in one embodiment of the present invention.
FIG. 5 is a configuration example of the similar document display unit in one embodiment of the present invention.

Explanation of symbols

1 Word concept base storage processing means, word concept base storage processing unit
2 Search target document concept base storage processing means, search target document concept base storage processing unit
3 Machine translation means, machine translation unit
4 Input document translation concept vector calculation means, input document translation concept vector calculation unit
5 Similarity calculation means, similarity calculation unit
6 Similar document display unit
7 Corpus
8 Word concept base storage means, word concept base storage unit
9 Search target document
10 Search target document concept base storage means, search target document concept base
11 Input key
12 Cross-language similar document search result
13 Multilingual similar document search apparatus
61 Similar document display control unit
62 Similar document machine translation unit

Claims

1. A multilingual similar document search device for obtaining a similarity when the input search key and the search target documents are in different languages, characterized by comprising, for n (n ≥ 2) natural languages:
word concept base storage processing means for calculating concept vectors of words for each natural language and storing them in word concept base storage means;
search target document concept base storage processing means for calculating, for a group of search target documents, the concept vector of each search target document using the word concept base storage means for the natural language in which that document is written, and storing it in search target document concept base storage means;
machine translation means for translating a word group or document written in one natural language and used for the search (hereinafter, "input key") into another natural language, thereby creating a translated word group or document (hereinafter, "translation key");
input document translation concept vector calculation means for calculating input document translation concept vectors for the input key and for the translation key created by the machine translation means, using the word concept base storage means for each respective natural language; and
similarity calculation means for calculating a similarity by comparing the input document translation concept vectors obtained by the input document translation concept vector calculation means with the concept vectors of the search target documents in the search target document concept base storage means for each respective natural language.

2. The multilingual similar document search device according to claim 1, further comprising similar document display means for displaying search target documents in descending order of similarity calculated by the similarity calculation means.

3. The multilingual similar document search device according to claim 2, wherein the similar document display means further comprises translation means for translating a result in a natural language different from that of the input into the same natural language as the input.

4. A multilingual similar document search method for obtaining a similarity when the input search key and the search target documents are in different languages, characterized in that, for n (n ≥ 2) natural languages,
a word concept base and a search target document concept base are constructed in advance by two steps: a word concept base storage processing step in which word concept base storage processing means calculates in advance concept vectors of words for each natural language and stores them in word concept base storage means; and
a search target document concept base storage processing step in which search target document concept base storage processing means calculates in advance, for a group of search target documents, the concept vector of each search target document using the word concept base storage means for the natural language in which that document is written, and stores it in search target document concept base storage means;
and the method performs:
a machine translation step in which, when a word group or document written in one natural language and used for the search (hereinafter, "input key") is input, machine translation means translates the input key into another natural language, thereby creating a translated word group or document (hereinafter, "translation key");
an input document translation concept vector calculation step in which input document translation concept vector calculation means calculates input document translation concept vectors for the input key and for the translation key created in the machine translation step, using the word concept base storage means for each respective natural language; and
a similarity calculation step in which similarity calculation means calculates a similarity by comparing the input document translation concept vectors obtained in the input document translation concept vector calculation step with the concept vectors of the search target documents in the search target document concept base storage means for each respective natural language.

5. The multilingual similar document search method according to claim 4, further comprising a similar document display step of displaying search target documents in descending order of similarity calculated in the similarity calculation step.

6. The multilingual similar document search method according to claim 5, wherein the similar document display step further comprises a translation step of translating a result in a natural language different from that of the input into the same natural language as the input.

7. A multilingual similar document search program for causing a computer to function as each means constituting the multilingual similar document search device according to any one of claims 1 to 3.

8. A computer-readable recording medium storing the multilingual similar document search program according to claim 7.