JP2007025939A

JP2007025939A - Multilingual document retrieval device, multilingual document retrieval method and program for retrieving multilingual document

Info

Publication number: JP2007025939A
Application number: JP2005205370A
Authority: JP
Inventors: Hiraki Ishikawa; 開石川; Toru Akamine; 亨赤峯
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2005-07-14
Filing date: 2005-07-14
Publication date: 2007-02-01
Anticipated expiration: 2025-07-14
Also published as: JP4640593B2

Abstract

PROBLEM TO BE SOLVED: To address the problem that conventional methods for retrieving multilingual documents are unsuitable for systems in which categories must be newly established or classification axes reviewed as needed, because those methods require a one-to-one correspondence of categories between different languages.

SOLUTION: Translated documents are generated from the original documents of each language by translation processing, and each translated document is classified into a document category of its own language. When a retrieval query is input, matching candidates that match the query are retrieved from the original documents and translated documents in the same language as the query. For the document paired with each matching candidate, that is, the document related to it as original document and translated document, the category is determined. Further, among the documents paired with the other documents belonging to the determined category, those in the same language as the retrieval query are extracted as related candidates, and the related candidates and the previously retrieved matching candidates are output as the retrieval result.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a technique for retrieving, by computer, desired document data from a group of document data written in various languages such as Japanese and English, and in particular to a technique for executing a search over a group of document data in which unique categories are provided for each language.

Conventionally, methods for searching multilingual documents by computer fall into two broad groups: translating the search query, which is entered as search keywords, into the language of the search target and then searching; and translating the search target into the language of the search query and then searching. Compared with the latter, the former has the advantage that the documents to be searched need not be translated in advance. On the other hand, because search accuracy depends on the accuracy of query translation, it is difficult to obtain results that meet the searcher's needs.

Techniques conventionally proposed for searching multilingual documents include, for example, those described in Patent Documents 1 to 6 listed below. In the technique described in Patent Document 1, where the documents of each language are classified into categories in advance and categories with corresponding content are associated one-to-one between languages, a matching category is selected from the search results in the source language, and the search in the target language is restricted to the category corresponding to the selected category.

In the technique described in Patent Document 2, in a multilingual search that translates the query, synonyms of the search terms are obtained from a thesaurus before and after the search query is translated and are added to the query. The technique described in Patent Document 3 addresses the problem, in a multilingual search that translates the query, that an appropriate translated query cannot be obtained when the query contains few words and a word is new to the translation dictionary: related words are extracted from the source-language search results for the query using TFIDF scores, and translations of those related words are added to the translated query.

In the technique described in Patent Document 4, in a multilingual search that translates the query, the searcher is asked to select documents relevant to the search request from the source-language results of an initial search request; a list of words characteristic of the search request is generated from the selected documents based on statistics of the words that appear in them; and this list is translated into the target language and used to search in the target language. In the technique described in Patent Document 5, in a multilingual search that translates the query, search results are classified and grouped using hierarchical Bayesian clustering so that the multilingual search results are presented in a unified manner, with the aim of making document selection from the search results more efficient.

In the technique described in Patent Document 6, in a multilingual search that translates the query, metadata such as issue date and time, category, and field is attached to documents, and the searcher is asked to select screening conditions on that metadata from the source-language search results for the search request. These conditions are then used to narrow down the documents in the other language, so that foreign-language documents that are difficult to read can be narrowed down efficiently by way of screening in the source language.
JP 2002-189745 A; JP H08-305728 A; JP 2003-208441 A; JP 2001-022787 A; JP 2003-76710 A; JP 2003-050821 A

However, to obtain appropriate results with the conventional method of narrowing search results by category correspondence, the correspondence of categories between different languages must be one-to-one. Such a condition is unsuited to, for example, a system in which new original documents are added to the search target one after another and, as a result, new categories must be created or classification axes reviewed from time to time. A typical example of such a system is a knowledge base used in a service provider's contact center, that is, a database in which customer service staff register and share reports on the cases they have handled.

An object of the present invention is to provide a technique capable of presenting more appropriate documents even when the documents to be searched are managed separately for each language and each document is classified under categories unique to its language.

A multilingual document search apparatus according to the present invention comprises a processor and a storage medium that stores, by language, a plurality of original documents to which document categories defined for each language are assigned. The processor has: means for generating a translated document by translating an original document; means for storing the translated document in the storage medium by language in association with that original document; means for determining the document category of the translated document from among the document categories of the same language as the translated document; means for searching the original documents and translated documents in the same language as an input search query for matching candidates that match the search query; and means for recognizing the document category of the translated document or original document corresponding to a matching candidate, extracting as related candidates those documents in the same language as the search query from among the translated documents or original documents corresponding to the other documents belonging to that document category, and outputting the related candidates and the matching candidates as search results.

In a multilingual document search method according to the present invention, a processor connected to a storage medium that stores, by language, a plurality of original documents to which document categories defined for each language are assigned performs: a step of generating a translated document by translating an original document; a step of storing the translated document in the storage medium by language in association with that original document; a step of determining the document category of the translated document from among the document categories of the same language as the translated document; a step of searching the original documents and translated documents in the same language as a search query input through input means for matching candidates that match the search query; and a step of recognizing the document category of the translated document or original document corresponding to a matching candidate, extracting as related candidates those documents in the same language as the search query from among the translated documents or original documents corresponding to the other documents belonging to that document category, and outputting the related candidates and the matching candidates from output means as search results.

The basic concept of the present invention is as follows. As preprocessing for document search, translated documents are generated from the original documents of each language by translation processing, and each generated translated document is classified into a document category of its own language. For example, when the original document is in Japanese, the translated English document is classified into an English document category.

After this preprocessing, when a search query is input in a certain language, matching candidates that match the search query are searched for among the original documents and translated documents in that language. In addition, the category is determined for the document that is paired in language with each detected matching candidate, that is, the document related to the matching candidate as original document and translated document.

Furthermore, among the documents paired with the other documents belonging to the determined category, those in the same language as the search query are detected as related candidates. As a result, matching candidates and related candidates in the same language as the search query are detected, and these candidates are output as search results. The object of the present invention can be achieved by adopting such a configuration for multilingual document search.

According to the present invention, in multilingual document search processing, related candidates are extracted, in addition to the matching candidates that match the search query, by using the category to which the document paired with each matching candidate belongs. Appropriate search results can therefore be presented even when the classification axes and granularity of the categories differ between languages and the correspondence of categories between different languages is not one-to-one. This makes it possible to improve search accuracy even in a system in which new original documents are added to the search target one after another.

The best mode for carrying out the present invention will be described in detail with reference to the drawings. In this embodiment, the document data to be searched is in two languages, Japanese and English, and each of the two languages is assumed to have its own unique document categories. Having unique document categories means that the makeup of the document categories, such as literature, politics, and science, is not common to the languages. In the following description, the Japanese categories and the English categories refer to the document categories associated in advance with the original documents of each language.

Referring to FIG. 1, the computer serving as the multilingual document search apparatus of this embodiment comprises input means 100 such as a keyboard, a processor 200 that operates under program control, a storage medium 300 such as a hard disk, and output means 400 such as a display and a printer.

The processor 200 has, as its functional configuration, document search integration means 201, Japanese document search means 202, and English document search means 203, which perform processing related to multilingual document search, and English-Japanese translation processing means 204, Japanese-English translation processing means 205, Japanese category discrimination means 206, and English category discrimination means 207, which perform preprocessing prior to document search. These functional components correspond to functions realized by the processor 200 executing a program (not shown) stored in the storage medium 300 or the like.

The storage medium 300 has Japanese document holding means 301 that stores Japanese original documents; English document holding means 302 that stores English original documents; Japanese category holding means 303 and English category holding means 304 that store the associations between original documents and the document categories set independently for each language; English-Japanese translated document holding means 305 that stores the translated documents produced by the English-Japanese translation processing means 204; Japanese-English translated document holding means 306 that stores the translated documents produced by the Japanese-English translation processing means 205; and English-Japanese bilingual dictionary holding means 307 and Japanese-English bilingual dictionary holding means 308 that store the translation dictionaries between Japanese and English.
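As a rough illustration of how the holding means 301 to 308 relate to one another, the storage layout can be sketched as a set of in-memory mappings. The class name `Store`, its field names, the document ids, the sample texts, and the use of Python dictionaries are all assumptions made for this sketch; the specification states only what each holding means stores.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical in-memory stand-in for storage medium 300."""
    ja_docs: dict = field(default_factory=dict)           # 301: doc id -> Japanese original text
    en_docs: dict = field(default_factory=dict)           # 302: doc id -> English original text
    ja_categories: dict = field(default_factory=dict)     # 303: Japanese category -> set of doc ids
    en_categories: dict = field(default_factory=dict)     # 304: English category -> set of doc ids
    en_ja_translated: dict = field(default_factory=dict)  # 305: translated id -> (source id, text)
    ja_en_translated: dict = field(default_factory=dict)  # 306: translated id -> (source id, text)
    en_ja_dictionary: dict = field(default_factory=dict)  # 307: English word -> Japanese candidates
    ja_en_dictionary: dict = field(default_factory=dict)  # 308: Japanese word -> English candidates


store = Store()
store.ja_docs["ja-001"] = "用紙が詰まる"          # made-up sample text
store.ja_categories["Ja"] = {"ja-001"}
# A translated document is kept together with a link back to its source document.
store.ja_en_translated["ja-001'"] = ("ja-001", "The paper jams")
source_id, english_text = store.ja_en_translated["ja-001'"]
```

Keeping the source id next to each translated text is one simple way to preserve the original-to-translation pairing that the later search steps rely on.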

The overall operation of this embodiment will be described in detail with reference to the flowchart shown in FIG. 2. First, before a search is executed, it is checked whether the data necessary for document search is registered in the storage medium 300 (step A0). If it is not, the preprocessing described next is executed.

The Japanese-English translation processing means 205 takes as input the Japanese documents stored in the Japanese document holding means 301, performs Japanese-English translation, and records each translation result as a Japanese-English translated document in the Japanese-English translated document holding means 306 together with its correspondence to the source Japanese document (step A1). It further outputs the Japanese-English translated document and the source Japanese document to the English category discrimination means 207, together with the word boundary information for each language obtained in the course of translation.

The English category discrimination means 207 classifies each input Japanese-English translated document into the English categories (step A2). This corresponds to treating a Japanese-English translated document that is currently unclassified as a new English document and judging which of the English categories it falls under. Specifically, the English category discrimination means 207 judges, for the target Japanese-English translated document, which of the English categories stored in the English category holding means 304 it is classified into. Note that a single Japanese-English translated document may be classified into multiple categories, or into none.

There are two ways to classify a Japanese-English translated document into the English categories. One is to classify the translated document as it is; the other is to classify it using the source Japanese document. In the former, the target Japanese-English translated document is compared with the existing English documents, and its English category is determined based on the degree of word agreement between them.

The latter method, classification using the Japanese document, is described in detail below. The English category discrimination means 207 obtains the correspondence between the English categories and the English documents belonging to each category from the English category holding means 304, and obtains the English documents belonging to each category from the English document holding means 302. It then calculates the similarity between the source Japanese document and each English document based on, for example, the proportion of words that match between the two documents.

A method conventionally known in this field is used to calculate the similarity between a Japanese document and an English document. For example, each word appearing in a document can be treated as a dimension, a document vector can be formed whose elements are the word's occurrence frequency or that frequency multiplied by a word weight (tf*idf; tf: Term Frequency, idf: Inverse Document Frequency), and the inner product of two document vectors can be taken as the similarity of the two documents.
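The document-vector approach mentioned above can be sketched in a few lines. This is a minimal illustration, not the implementation of the embodiment: the function names, the toy token lists, and the smoothed idf formula are assumptions, and the documents are assumed to be already tokenized into a shared vocabulary.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse tf*idf vector (dict) per document."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: count each word once per document
    # Assumption: smoothed idf, log((1 + n) / (1 + df)) + 1, so that no weight is zero.
    idf = {word: math.log((1 + n) / (1 + count)) + 1.0 for word, count in df.items()}
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({word: tf[word] * idf[word] for word in tf})
    return vectors


def inner_product(u, v):
    """Inner product of two sparse vectors; only shared words contribute."""
    if len(u) > len(v):
        u, v = v, u
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())


docs = [
    ["printer", "paper", "jam", "paper"],
    ["printer", "paper", "tray"],
    ["invoice", "payment", "refund"],
]
vecs = tfidf_vectors(docs)
sim_related = inner_product(vecs[0], vecs[1])    # shares "printer" and "paper"
sim_unrelated = inner_product(vecs[0], vecs[2])  # shares no words
```

Documents with no words in common score zero, while shared words contribute in proportion to their weighted frequencies in both documents.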

When collating words in the Japanese document against words in the English document, the English category discrimination means 207 uses the word boundary information to recognize the words that make up the Japanese and English documents. It then obtains translation candidates for each word in the Japanese document from the Japanese-English bilingual dictionary holding means 308 and looks in the English document for words that match these candidates. If a translation candidate for a word in the Japanese document differs from the translation used for that word in the Japanese-English translated document, inserting it at the corresponding position in the translated document as a second translation candidate, for example in parentheses, can be expected to mitigate both the loss of search accuracy caused by translation-word selection errors and the reading difficulty caused by translation errors.
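The dictionary-based collation described above can be illustrated with a small sketch in which a Japanese word counts as matched when any of its translation candidates appears in the English document. The function name, the toy dictionary, and the choice of "fraction of matched words" as the score are assumptions for illustration; the text itself only requires that candidates from the bilingual dictionary be compared with the words of the English document.

```python
def cross_lingual_overlap(ja_words, en_words, ja_en_dict):
    """Fraction of Japanese words with at least one translation candidate in the English document."""
    if not ja_words:
        return 0.0
    en_vocab = set(en_words)
    matched = 0
    for word in ja_words:
        candidates = ja_en_dict.get(word, [])
        if any(candidate in en_vocab for candidate in candidates):
            matched += 1
    return matched / len(ja_words)


# Toy bilingual dictionary (hypothetical entries).
JA_EN = {
    "印刷": ["print", "printing"],
    "用紙": ["paper", "sheet", "form"],
    "詰まり": ["jam", "clog"],
}
ja_doc = ["印刷", "用紙", "詰まり"]
en_doc = ["the", "sheet", "caused", "a", "jam", "while", "printing"]

score = cross_lingual_overlap(ja_doc, en_doc, JA_EN)

# For comparison: keep only the first candidate per word, as a single fixed translation would.
first_only = {word: candidates[:1] for word, candidates in JA_EN.items()}
baseline = cross_lingual_overlap(ja_doc, en_doc, first_only)
```

In this toy data all three Japanese words match when every candidate is considered, but only one matches when each word is limited to a single translation.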

Based on the calculated similarities between the Japanese document and the English documents, the English category discrimination means 207 selects an English category that contains an English document extremely similar to the Japanese document, or an English category in which most of the English documents show at least a certain level of similarity to the Japanese document. It then takes the selected English category as the category of the previously unclassified Japanese-English translated document, and stores the association between this translated document and that English category in the English category holding means 304.
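The two selection criteria above (one extremely similar document, or most documents above a certain similarity) might be expressed as follows. The threshold values `high`, `floor`, and `majority`, the function name, and the sample similarity values are assumptions chosen only to make the sketch concrete; the specification does not give numeric values.

```python
def select_categories(similarities_by_category, high=0.8, floor=0.3, majority=0.5):
    """Return the categories that satisfy either selection criterion.

    similarities_by_category: category name -> list of similarities between the
    source document and each document already in that category.
    """
    selected = []
    for category, sims in similarities_by_category.items():
        if not sims:
            continue
        has_very_similar = max(sims) >= high                       # criterion 1
        share_above_floor = sum(s >= floor for s in sims) / len(sims)
        if has_very_similar or share_above_floor > majority:        # criterion 2
            selected.append(category)
    return selected


sims = {
    "Eb": [0.9, 0.1, 0.2],         # one extremely similar document
    "Ec": [0.4, 0.5, 0.35, 0.1],   # three of four documents above the floor
    "Ed": [0.1, 0.2, 0.05],        # neither criterion holds
}
result = select_categories(sims)
```

Because every category is tested independently, a document can end up in several categories or in none, which is consistent with the classification step described earlier.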

The above method, which uses the source Japanese document to categorize a Japanese-English translated document, has the following advantage. In general, translation processing settles on a single translation for each source word, so even when the translation is correct, English words that have the same meaning in Japanese may fail to match character for character, and a relevant document may go undetected. By contrast, in the above method of collating against English documents while translating the words of the Japanese document with a bilingual dictionary, a word is counted if any of its multiple translation candidates matches a word in the document being compared, which gives flexibility to document collation across languages.

The English-Japanese translation processing means 204 translates the English documents stored in the English document holding means 302 into Japanese by a procedure similar to that performed by the Japanese-English translation processing means 205 (step A3). The Japanese category discrimination means 206 performs, in the English-to-Japanese direction, the same processing that the English category discrimination means 207 performed in the Japanese-to-English direction (step A4). In the procedure shown in FIG. 2, English-Japanese translation is performed after Japanese-English translation, but the order is not limited to this and may be the reverse of that illustrated.

After the above preprocessing is complete, when the user enters search keywords and a search query is thereby input from the input means 100 (step A5), search processing for the query is performed by the following procedure.

First, the document search integration means 201 determines the language of the input search query (step A5-1). If the language of the search query is Japanese, it searches for the Japanese matching candidates and related candidates described later, in cooperation with the Japanese document search means 202 (step A6). If the language of the search query is English, the document search integration means 201 searches for English matching candidates and related candidates in cooperation with the English document search means 203 (step A7).
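The branch on query language in steps A5-1, A6, and A7 can be sketched as below. The specification does not say how the language is determined; the character-range heuristic used here (any hiragana, katakana, or CJK ideograph means Japanese) and the function names are assumptions for illustration only.

```python
def detect_language(query):
    """Return "ja" if the query contains kana or CJK ideographs, otherwise "en"."""
    for ch in query:
        code = ord(ch)
        # U+3040-U+30FF: hiragana and katakana; U+4E00-U+9FFF: CJK unified ideographs.
        if 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
            return "ja"
    return "en"


def route(query, search_japanese, search_english):
    """Hand the query to the searcher for its language."""
    if detect_language(query) == "ja":
        return search_japanese(query)  # step A6
    return search_english(query)       # step A7


# Stand-in searchers that just report which branch handled the query.
routed_ja = route("用紙 詰まり", lambda q: ("ja", q), lambda q: ("en", q))
routed_en = route("paper jam", lambda q: ("ja", q), lambda q: ("en", q))
```

A two-language system only needs a binary decision, so a simple script check is enough for this sketch; a system covering more languages would need a different detector.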

The procedure for searching for matching candidates and related candidates will now be described with reference to the flowchart shown in FIG. 3. The case in which the search query is in Japanese is described below as an example.

The document search integration means 201 outputs the Japanese search query to the Japanese document search means 202. For the input search query, the Japanese document search means 202 searches all the Japanese documents in the Japanese document holding means 301 and all the English-Japanese translated documents in the English-Japanese translated document holding means 305 for documents that satisfy the conditions of the query, and outputs the resulting set of documents to the document search integration means 201 as matching candidates (step S1). The matching candidates need not consist solely of Japanese documents or solely of English-Japanese translated documents; the two may be mixed.

On obtaining the matching candidates from the Japanese document search means 202, the document search integration means 201 obtains from the Japanese category holding means 303 the Japanese categories to which the Japanese documents and English-Japanese translated documents that are matching candidates belong (step S2). If a category is obtained, that is, if a matching candidate belongs to some Japanese category (step S3: Yes), the other documents included in that category are extracted as related candidates (step S4).

If multiple categories are obtained for one document, related candidates are extracted from each category. If no other document exists in the target category, or if the matching candidate does not belong to any category (step S3: No), the process moves on to the next step.
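Steps S2 to S4, including the multi-category and no-category cases just described, can be sketched as follows. The function name, the category names, and the document ids are hypothetical, and dropping documents that are already matching candidates from the related set is an assumption made to avoid listing a document twice.

```python
def related_from_own_categories(matches, categories):
    """Collect the other members of every category that contains a matching candidate.

    categories: category name -> set of document ids in the query language
    (original documents and translated documents alike).
    """
    related = set()
    for doc_id in matches:
        for members in categories.values():
            if doc_id in members:               # step S3: Yes
                related |= members - {doc_id}   # step S4
        # A document found in no category (step S3: No) simply contributes nothing.
    # Assumption: documents that are already matching candidates are not listed again.
    return related - set(matches)


categories = {
    "billing": {"ja-1", "ja-2", "en-7'"},
    "returns": {"ja-2", "ja-3"},
    "single": {"ja-9"},
}
multi = related_from_own_categories(["ja-2"], categories)            # member of two categories
alone = related_from_own_categories(["ja-9"], categories)            # category has no other member
uncategorized = related_from_own_categories(["en-8'"], categories)   # belongs to no category
```

Only the first call yields related candidates; the other two correspond to the cases in which the process moves straight on to the next step.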

Next, the document search integration means 201 looks up the documents paired with the matching candidates, that is, the Japanese-English translated document that is the translation of a Japanese document and the English document that is the translation source of an English-Japanese translated document, and obtains their categories from the English category holding means 304 (step S5). If a document paired with a matching candidate belongs to some English category (step S6: Yes), attention is turned to the other documents belonging to that category, and the document paired with each of them, that is, a Japanese document or an English-Japanese translated document, is extracted as a related candidate (step S7).
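Steps S5 to S7 hop from a matching candidate to its paired document, across that document's category, and back to the query language. A minimal sketch, assuming a symmetric `pair_of` mapping between each original and its translation; the function name, ids, and category names are hypothetical.

```python
def related_via_paired_documents(matches, pair_of, other_language_categories):
    """Find related candidates through the categories of the paired documents."""
    related = set()
    for doc_id in matches:
        partner = pair_of.get(doc_id)                   # step S5: the paired document
        if partner is None:
            continue
        for members in other_language_categories.values():
            if partner not in members:                  # step S6: No for this category
                continue
            for other in members - {partner}:           # step S7: other members ...
                counterpart = pair_of.get(other)        # ... mapped back to the query language
                if counterpart is not None:
                    related.add(counterpart)
    # Assumption: documents that are already matching candidates are not listed again.
    return related - set(matches)


pair_of = {
    "ja-1": "ja-1'", "ja-1'": "ja-1",
    "en-5": "en-5'", "en-5'": "en-5",
    "en-6": "en-6'", "en-6'": "en-6",
}
english_categories = {"support": {"ja-1'", "en-5"}, "sales": {"en-6"}}

result = related_via_paired_documents(["ja-1"], pair_of, english_categories)
no_neighbours = related_via_paired_documents(["en-6'"], pair_of, english_categories)
```

Here the Japanese document `ja-1` reaches the English-Japanese translated document `en-5'` because their counterparts share the English category `support`, without any direct mapping between Japanese and English categories.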

Finally, the document search integration means 201 outputs the Japanese matching candidates and related candidates obtained by the above procedure together to the output means 400 (step S8).

The above description covers the case in which the search query is in Japanese. When the query is in English, the document search integration means 201 outputs the query to the English document search means 203, and the same search processing is then performed in accordance with the procedure shown in FIG. 3. English matching candidates and related candidates are thereby obtained for the English search query.

Through the above processing, the matching candidates and related candidates are presented to the user by the output means 400 as the search results for the search query input earlier (FIG. 2: step A8).

《Concrete Example》
To deepen understanding of this embodiment, it is described in detail below using a concrete example. Here, as shown in FIG. 4, the Japanese documents d_J1, d_J2, and d_J3 in the Japanese document holding means 301 are classified into category Ja, and the English documents d_E1, d_E2, and d_E3 in the English document holding means 302 are classified into category Eb. FIG. 5 shows an example of these original documents. The following describes an example in which multilingual documents are searched using the Japanese search query q_J shown in FIG. 5.

The Japanese-English translation processing means 205 generates the Japanese-English translated documents d_J1', d_J2', and d_J3' from the Japanese documents d_J1, d_J2, and d_J3 and records them in the Japanese-English translated document holding means 306 (FIG. 2: step A1). The English-Japanese translation processing means 204 generates the English-Japanese translated documents d_E1', d_E2', and d_E3' from the English documents d_E1, d_E2, and d_E3 and records them in the English-Japanese translated document holding means 305 (FIG. 2: step A3). FIG. 6 schematically shows the relationship between the translated documents generated by this translation processing and the original documents.

The English category determination means 207 classifies the Japanese-English translated documents d_J1', d_J2', and d_J3' into English categories (FIG. 2: Step A2). Here, as shown in FIG. 7, assume that d_J1' is classified into the same English category Eb as the English documents d_E1, d_E2, and d_E3, and that the remaining translated documents d_J2' and d_J3' are classified into the separate English categories Ec and Ed, respectively.

Likewise, the Japanese category determination means 206 classifies the English-Japanese translated documents d_E1', d_E2', and d_E3' into Japanese categories (FIG. 2: Step A4). Here, as shown in FIG. 7, assume that none of the English-Japanese translated documents d_E1', d_E2', and d_E3' generated this time falls under any Japanese category.

After preprocessing is completed in the state shown in FIG. 7, when the Japanese search query q_J is input (FIG. 2: Step A5), the Japanese document search means 202 searches for matching candidates among the documents in the same language as the query, namely the Japanese documents d_J1, d_J2, and d_J3 and the English-Japanese translated documents d_E1', d_E2', and d_E3'. Assume that, as shown in (1) of FIG. 8, the English-Japanese translated documents d_E1' and d_E2' are retrieved as matching candidates for q_J.

Next, the document search integration means 201 obtains the categories of the matching candidates d_E1' and d_E2' and the categories of their translation sources, the English documents d_E1 and d_E2, and extracts related candidates from each category. Because d_E1' and d_E2' belong to no Japanese category, no related candidates are extracted at this point. Had the matching candidate been the Japanese document d_J1, the other documents belonging to its category Ja, namely d_J2 and d_J3, would have been extracted as related candidates.

On the other hand, the English documents d_E1 and d_E2, which are the translation sources of d_E1' and d_E2', belong to the English category Eb, as shown in (2) of FIG. 8. Attention therefore turns to the documents in category Eb other than d_E1 and d_E2, namely the English document d_E3 and the Japanese-English translated document d_J1', and the document that forms a language pair with each of them is extracted as a related candidate. That is, as shown in (3) of FIG. 8, the English-Japanese translated document d_E3', which is the translation of d_E3, and the Japanese document d_J1, which is the translation source of d_J1', become related candidates.

Finally, as shown in (4) of FIG. 8, the document search integration means 201 outputs to the output means 400, as the search result for the search query q_J, the two matching candidates d_E1' and d_E2' together with the two related candidates, the English-Japanese translated document d_E3' and the Japanese document d_J1.
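The walk-through of FIGS. 7 and 8 can be traced with a small Python sketch. It is an illustration only, not the implementation of the embodiment: the `Doc` record, the `paired` and `related_candidates` helpers, and the language tags "ja" and "en" are hypothetical names introduced here, and the corpus simply encodes the state that the text assumes after preprocessing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Doc:
    doc_id: str
    lang: str                     # language the text is written in
    category: Optional[str]       # category in that language, None if unclassified
    source: Optional[str] = None  # doc_id of the original if this is a translation


def build_corpus() -> List[Doc]:
    """Encode the state after preprocessing as described for FIG. 7."""
    docs = []
    for i in (1, 2, 3):
        docs.append(Doc(f"d_J{i}", "ja", "Ja"))
        docs.append(Doc(f"d_E{i}", "en", "Eb"))
        # The English-Japanese translations matched no Japanese category.
        docs.append(Doc(f"d_E{i}'", "ja", None, source=f"d_E{i}"))
    # The Japanese-English translations were classified into Eb, Ec and Ed.
    for i, cat in ((1, "Eb"), (2, "Ec"), (3, "Ed")):
        docs.append(Doc(f"d_J{i}'", "en", cat, source=f"d_J{i}"))
    return docs


def paired(doc: Doc, docs: List[Doc]) -> List[Doc]:
    """Documents in an original/translation relation with `doc`."""
    if doc.source is not None:
        return [d for d in docs if d.doc_id == doc.source]
    return [d for d in docs if d.source == doc.doc_id]


def related_candidates(matches: List[Doc], docs: List[Doc], query_lang: str) -> List[Doc]:
    found: List[Doc] = []
    for match in matches:
        # Look at the category of the match itself and of its paired document.
        for anchor in [match] + paired(match, docs):
            if anchor.category is None:
                continue
            for other in docs:
                if other.lang != anchor.lang or other.category != anchor.category:
                    continue
                # A document already in the query language qualifies directly;
                # otherwise its paired document is examined.
                hits = [other] if other.lang == query_lang else paired(other, docs)
                for hit in hits:
                    if hit.lang == query_lang and hit not in matches and hit not in found:
                        found.append(hit)
    return found


docs = build_corpus()
by_id = {d.doc_id: d for d in docs}
matches = [by_id["d_E1'"], by_id["d_E2'"]]  # result (1) of FIG. 8
related = related_candidates(matches, docs, "ja")
print([d.doc_id for d in matches + related])
```

With the matching candidates d_E1' and d_E2', the sketch returns d_E3' and d_J1 as related candidates, in line with (3) and (4) of FIG. 8. Keying the lookup on the pair of language and category reflects the fact that each language has its own category set, so no one-to-one mapping between Ja and Eb is needed.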

According to the embodiment described above, the multilingual document search process extracts related candidates by using the category to which the document paired with each matching candidate belongs, in addition to the matching candidates that match the search query, and outputs the matching candidates and the related candidates together. Appropriate search results can therefore be presented even in a system in which the classification axes and granularity of the categories differ between languages and the correspondence between categories of different languages is not one-to-one.

Although the above embodiment searches two languages, the present invention is also applicable to multilingual document search involving three or more languages. In that case, control is performed so that the documents extracted as related candidates are in the same language as the search query. As an example, a multilingual search over three languages, Japanese, English, and German, is described below.

To perform document search in these three languages, German document search means, German category determination means, German-Japanese translation processing means, Japanese-German translation processing means, German-English translation processing means, and English-German translation processing means are added to the configuration of the processor 200 shown in FIG. 1. Components for German, similar to those shown in FIG. 1 for each language, are added to the storage medium 300. The preprocessing by the processor 200 additionally includes translating the German original documents into Japanese and English, translating the Japanese and English original documents into German, and classifying each resulting translated document into a category of its language.

Next, search processing in these three languages is described using a concrete example. As shown in FIG. 9, assume that the original documents in each language, namely the Japanese documents d_J1, d_J2, and d_J3, the English documents d_E1, d_E2, and d_E3, and the German documents d_G1 and d_G2, have been classified in advance into the corresponding Japanese category Ja, English category Eb, and German category Gc, respectively.

Assume also that the categories of the translated documents of each language's original documents have been determined as shown in FIG. 10. The superscript (J) in the figure indicates a document translated into Japanese; for example, d_E1^(J) denotes the English-Japanese translated document obtained by translating the English document d_E1 into Japanese. Likewise, (E) indicates a translation into English and (G) a translation into German.

After preprocessing is completed in the state shown in FIG. 10, when the Japanese query q_J is input, the Japanese document search means searches for matching candidates among the documents in the same language as the query: the Japanese documents d_J1, d_J2, and d_J3, the English-Japanese translated documents d_E1^(J), d_E2^(J), and d_E3^(J), and the German-Japanese translated documents d_G1^(J) and d_G2^(J). Assume that, as shown in (1) of FIG. 11, the English-Japanese translated documents d_E1^(J) and d_E2^(J) are retrieved as matching candidates for q_J.

Next, the document search integration means obtains the categories of the matching candidates d_E1^(J) and d_E2^(J) and the categories of their translation sources, the English documents d_E1 and d_E2, and extracts related candidates from each category. Because d_E1^(J) and d_E2^(J) belong to no Japanese category, no related candidates are extracted at this point.

On the other hand, the English documents d_E1 and d_E2, which are the translation sources of d_E1^(J) and d_E2^(J), belong to the English category Eb, as shown in (2) of FIG. 11. Attention therefore turns to the documents in category Eb other than d_E1 and d_E2, namely the English document d_E3, the Japanese-English translated document d_J1^(E), and the German-English translated documents d_G1^(E) and d_G2^(E), and the document that forms a language pair with each of them is extracted as a related candidate.

As shown in (3) of FIG. 11, the related candidates extracted first are the English-Japanese translated document d_E3^(J), which is the translation of the English document d_E3, and the Japanese document d_J1, which is the translation source of the Japanese-English translated document d_J1^(E).

For the German-English translated documents d_G1^(E) and d_G2^(E), which also received attention, the translation sources are the German documents d_G1 and d_G2, whose language does not match that of the current search query (Japanese). In this case, d_G1^(E) and d_G2^(E) may be excluded from related-candidate extraction, but more related candidates can be obtained by adding the procedure shown in (3)' of FIG. 11. That is, the Japanese translations of the German documents d_G1 and d_G2 that are the translation sources of d_G1^(E) and d_G2^(E), namely the German-Japanese translated documents d_G1^(J) and d_G2^(J), are added to the related candidates.

Finally, as shown in (4) of FIG. 11, the document search integration means outputs to the output means, as the search result for the search query q_J, the two English-Japanese translated documents d_E1^(J) and d_E2^(J) as matching candidates, together with the Japanese document d_J1, the English-Japanese translated document d_E3^(J), and the German-Japanese translated documents d_G1^(J) and d_G2^(J) as related candidates.

In document search over three or more languages, the category (Eb) of the documents paired with the matching candidates may contain documents unrelated to the language of the search query, such as the German-English translated documents d_G1^(E) and d_G2^(E). Adding the search shown in (3)', which uses these documents as intermediaries, widens the search range and, as a result, allows more search results to be presented to the user.
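The effect of the optional step (3)' can be sketched in Python as follows. This is a hedged illustration rather than the patented implementation: `Doc`, `direct_pairs`, `family`, `related_candidates`, and the `use_pivot` flag are hypothetical names. The corpus contains only the documents whose categories are stated in the text; the German-Japanese translations are assumed to be unclassified, and the remaining translations of FIG. 10 are omitted because their categories are not given here.

```python
from typing import List, NamedTuple, Optional


class Doc(NamedTuple):
    doc_id: str
    lang: str                # language the text is written in
    category: Optional[str]  # category in that language, None if unclassified
    source: Optional[str]    # doc_id of the original; None for an original


def direct_pairs(doc: Doc, docs: List[Doc]) -> List[Doc]:
    """The original of a translation, or the translations of an original."""
    if doc.source is not None:
        return [d for d in docs if d.doc_id == doc.source]
    return [d for d in docs if d.source == doc.doc_id]


def family(doc: Doc, docs: List[Doc]) -> List[Doc]:
    """Every other document that shares the same original as `doc`."""
    root = doc.source or doc.doc_id
    return [d for d in docs if d != doc and root in (d.doc_id, d.source)]


def related_candidates(matches, docs, query_lang, use_pivot):
    # With use_pivot, a translation whose original is in a third language
    # is followed through to the query-language translation of that original.
    pairs = family if use_pivot else direct_pairs
    found = []
    for match in matches:
        for anchor in [match] + direct_pairs(match, docs):
            if anchor.category is None:
                continue
            for other in docs:
                if (other.lang, other.category) != (anchor.lang, anchor.category):
                    continue
                hits = [other] if other.lang == query_lang else pairs(other, docs)
                for hit in hits:
                    if hit.lang == query_lang and hit not in matches and hit not in found:
                        found.append(hit)
    return found


docs = [Doc(f"d_J{i}", "ja", "Ja", None) for i in (1, 2, 3)]
docs += [Doc(f"d_E{i}", "en", "Eb", None) for i in (1, 2, 3)]
docs += [Doc(f"d_G{i}", "de", "Gc", None) for i in (1, 2)]
docs += [Doc(f"d_E{i}^(J)", "ja", None, f"d_E{i}") for i in (1, 2, 3)]
# Assumption: the Japanese categories of the German-Japanese translations are not known here.
docs += [Doc(f"d_G{i}^(J)", "ja", None, f"d_G{i}") for i in (1, 2)]
docs += [Doc("d_J1^(E)", "en", "Eb", "d_J1")]
docs += [Doc(f"d_G{i}^(E)", "en", "Eb", f"d_G{i}") for i in (1, 2)]

matches = [d for d in docs if d.doc_id in ("d_E1^(J)", "d_E2^(J)")]
without_pivot = [d.doc_id for d in related_candidates(matches, docs, "ja", False)]
with_pivot = [d.doc_id for d in related_candidates(matches, docs, "ja", True)]
print(without_pivot)
print(with_pivot)
```

Without the pivot step, only d_E3^(J) and d_J1 are returned. With it, d_G1^(J) and d_G2^(J) are added, which corresponds to the wider result set described for (3)' and (4) of FIG. 11.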

The present invention is suitable for searching category-classified multilingual documents such as patents, papers, product and service FAQs, contact center response records, and office documents. It is also applicable to document sharing systems that classify and search such multilingual documents.

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation procedure of the embodiment.
FIG. 3 is a flowchart showing the procedure for document search in the embodiment.
FIG. 4 is an explanatory diagram (part 1) for describing a concrete example of the embodiment.
FIG. 5 is an explanatory diagram (part 2) for describing a concrete example of the embodiment.
FIG. 6 is an explanatory diagram (part 3) for describing a concrete example of the embodiment.
FIG. 7 is an explanatory diagram (part 4) for describing a concrete example of the embodiment.
FIG. 8 is an explanatory diagram (part 5) for describing a concrete example of the embodiment.
FIG. 9 is an explanatory diagram (part 1) for describing a concrete example of another embodiment.
FIG. 10 is an explanatory diagram (part 2) for describing a concrete example of another embodiment.
FIG. 11 is an explanatory diagram (part 3) for describing a concrete example of another embodiment.

Explanation of symbols

100: input means
200: processor
300: storage medium
400: output means
201: document search integration means, 202: Japanese document search means, 203: English document search means, 204: English-Japanese translation processing means, 205: Japanese-English translation processing means, 206: Japanese category determination means, 207: English category determination means
301: Japanese document holding means, 302: English document holding means, 303: Japanese category holding means, 304: English category holding means, 305: English-Japanese translated document holding means, 306: Japanese-English translated document holding means, 307: English-Japanese bilingual dictionary holding means, 308: Japanese-English bilingual dictionary holding means

Claims

A multilingual document search apparatus comprising a processor and a storage medium that stores, by language, a plurality of original documents each assigned a document category defined for each language, wherein
the processor comprises:
means for generating a translated document by translating the original document;
means for storing the translated document in the storage medium by language in association with the original document;
means for obtaining a document category of the translated document from the document categories of the same language as the translated document;
means for searching for matching candidates that match an input search query among the original documents and translated documents in the same language as the search query; and
means for recognizing the document category of the translated document or original document corresponding to each matching candidate, extracting as related candidates the documents in the same language as the search query from among the translated documents or original documents corresponding to other documents belonging to that document category, and outputting the related candidates and the matching candidates as a search result.

The multilingual document search apparatus according to claim 1, wherein, when extracting the related candidates, if the other documents belonging to the recognized document category include a translated document whose corresponding original document is in a language different from that of the search query, the processor adds to the related candidates, from among the translated documents of that original document, the translated document in the same language as the search query.

The multilingual document search apparatus according to claim 1, wherein, when obtaining the document category of the translated document, the processor collates the original documents belonging to a document category of the relevant language with the translation-source original document of the translated document by word-by-word bilingual translation, and determines the document category of the translated document based on the similarity between the two documents obtained by the collation.

The multilingual document search apparatus according to claim 1, wherein, when obtaining the document category of the translated document, the processor collates the original documents belonging to a document category of the relevant language with the translated document word by word, and determines the document category of the translated document based on the similarity between the two documents obtained by the collation.

A program for causing a computer to function as the multilingual document search apparatus according to any one of claims 1 to 4.

A multilingual document search method in which a processor, connected to a storage medium that stores, by language, a plurality of original documents each assigned a document category defined for each language, performs the steps of:
generating a translated document by translating the original document;
storing the translated document in the storage medium by language in association with the original document;
obtaining a document category of the translated document from the document categories of the same language as the translated document;
searching for matching candidates that match a search query input by input means among the original documents and translated documents in the same language as the search query; and
recognizing the document category of the translated document or original document corresponding to each matching candidate, extracting as related candidates the documents in the same language as the search query from among the translated documents or original documents corresponding to other documents belonging to that document category, and outputting the related candidates and the matching candidates from output means as a search result.

The multilingual document search method according to claim 6, wherein, in the step of outputting the search result, if the other documents belonging to the recognized document category include a translated document whose corresponding original document is in a language different from that of the search query, the processor adds to the related candidates, from among the translated documents of that original document, the translated document in the same language as the search query.

The multilingual document search method according to claim 6 or 7, wherein, in the step of obtaining the document category of the translated document, the processor collates the original documents belonging to a document category of the relevant language with the translation-source original document of the translated document by word-by-word bilingual translation, and determines the document category of the translated document based on the similarity between the two documents obtained by the collation.

In the step, wherein the processor determines a document category of the translation document,
8. The original document belonging to the document category of the language and the translated document are collated for each word, and the document category of the translated document is determined based on the similarity between the two documents by the collation. The described multilingual document search method.