JP2002259445A

JP2002259445A - Corresponding category retrieval system and method

Info

Publication number: JP2002259445A
Application number: JP2001058303A
Authority: JP
Inventors: Hiroshi Masuichi; 博増市
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2001-03-02
Filing date: 2001-03-02
Publication date: 2002-09-13
Anticipated expiration: 2021-03-02
Also published as: JP4013489B2

Abstract

PROBLEM TO BE SOLVED: To make it possible to easily determine the correspondence between categories. SOLUTION: A word vector producing means 13 produces a word vector set from learning data. A document vector producing means 14 refers to the word vector set and produces document vector sets for both categories. A multilingual retrieval means 15 extracts document pairs having close semantic contents on the basis of the document vectors. A retrieval result holding means 16 adds the obtained document pairs to a learning data holding part 12 as part of the learning data. This processing is executed repeatedly. A category corresponding relation determining means 17 determines that both categories are similar if, for instance, the ratio of the total number of document pairs stored in the retrieval result holding means 16 to the total number of documents included in both categories is greater than a predetermined threshold.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a technique that targets category structures, in which a document set is classified into a plurality of categories, and that determines the correspondence between categories across a plurality of category structures constructed for different languages.

[0002]

[Prior Art] One way to facilitate access to a large document set is to classify the document set into a plurality of categories. Once the document set has been classified into categories, a desired document can be obtained efficiently by searching only the category to which the document sought by the user is assumed to belong. In some cases a document set is classified into categories manually, and many techniques for automating category classification have also been proposed, such as those described in the reference "情報検索論 認知的アプローチへの展望" (Information Retrieval Theory: Prospects for a Cognitive Approach), David Ellis, Maruzen Co., Ltd. (1994).

[0003] When such categorized document sets (hereinafter also called category structures) have been constructed for a plurality of languages, determining the correspondence between categories across the category structures (that is, which categories contain document sets with similar semantic content) is important for document search across languages (multilingual document search). Specifically, search accuracy can be improved by restricting the document set in the target language (the language used secondarily in the search) to categories close to the search request expressed in the source language (the language used directly in the search).

[0004] The main approach to determining such category correspondences automatically is to repurpose a multilingual document search technique. For example, the reference Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann and Stanley Peters, "Query Translation Method for Cross Language Information Retrieval", The Proceedings of Machine Translation Summit VII '99 Workshop on Machine Translation for Cross Language Information Retrieval, (1999) proposes a technique that uses a set of translation pairs (a parallel corpus) as learning data, represents each document written in a different language as a document vector in the same vector space, and performs multilingual document search by taking the cosine between vectors as the similarity between documents. With this technique, the sum of the document vectors of all documents belonging to a category can be defined as a category vector, and the cosine between category vectors can be defined as the similarity, which makes it possible to determine the correspondence between categories across a plurality of category structures constructed for different languages.
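The category-vector comparison described above can be sketched as follows. This is a minimal illustration, not the cited reference's implementation: the function names and the two-dimensional toy vectors are assumptions, and the document vectors are taken as already computed in a shared vector space.

```python
import math


def vector_sum(vectors):
    """Component-wise sum of equal-length vectors (the category vector)."""
    return [sum(components) for components in zip(*vectors)]


def cosine(u, v):
    """Cosine between u and v; defined here as 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def category_similarity(doc_vectors_a, doc_vectors_b):
    """Cosine between the category vectors of two categories."""
    return cosine(vector_sum(doc_vectors_a), vector_sum(doc_vectors_b))


# Hypothetical toy document vectors, already in one shared space.
japanese_category = [[1.0, 0.0], [0.8, 0.2]]
english_sports = [[0.9, 0.1], [1.0, 0.1]]
english_finance = [[0.0, 1.0], [0.1, 0.9]]

sports_score = category_similarity(japanese_category, english_sports)
finance_score = category_similarity(japanese_category, english_finance)
```

In this toy data the Japanese category scores higher against `english_sports` than against `english_finance`, which is the kind of ranking the prior-art approach relies on.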

[0005]

[Problem to Be Solved by the Invention] At present, however, it is difficult to say that the above technique determines corresponding categories with accuracy sufficient for practical use. In general, the biggest factor that lowers the accuracy of multilingual information retrieval is the semantic ambiguity of words and phrases. When a word (phrase) in a first language is translated into a word (phrase) in a second language, many translation candidates exist. For example, the English word "base" has various translation candidates depending on the field: "基地" (a military base) as a military term, "塁" (a base on the field) as a baseball term, "支持母体" (a support base) as a political term, "基数" (a radix) as a mathematical term, "塩基" (a chemical base) as a chemical term, "期体" as a grammatical term, and "（塗料の）主成分" (the main component of a paint) as an architectural term. Because these translation candidates are in most cases field-dependent, it is said that multilingual information retrieval achieves high accuracy when the search target is limited to a document set in a specific field. In other words, each category has translations appropriate to its field, and multilingual document search must be performed using the appropriate translations for each field. The document-vector technique described above uses a single parallel corpus as learning data, so it cannot generate document vectors appropriate to each field and therefore cannot solve the problem of semantic ambiguity. If a parallel corpus could be prepared for each category, document vectors appropriate to each field could be generated, but parallel corpora are generally hard to obtain, and in practice such an approach is infeasible. Besides multilingual document search techniques that use a parallel corpus as learning data, many techniques that use a bilingual dictionary have also been proposed. The situation is exactly the same, however: a general bilingual dictionary does not solve the problem of semantic ambiguity, and preparing a bilingual dictionary for each field (category) is practically impossible.

[0006] The present invention has been made in view of these points, and its object is to provide a system capable of determining the correspondence between categories with high accuracy without preparing a parallel corpus for each category.

[0007]

[Means for Solving the Problem] The reference Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann and Stanley Peters, "A Bootstrapping method for Extracting Bilingual Text Pairs", The Proceedings of The 18th International Conference on Computational Linguistics, pp. 1066-1070 (2000) proposes the following technique for determining similar bilingual document pairs.

[0008] "A multilingual document search is performed using a certain parallel corpus as the initial learning data, and similar bilingual document pairs are determined from a document set in which documents in two languages are mixed. The obtained document pairs are added to the initial learning data, and the multilingual document search is performed again on the basis of the resulting learning data. By alternately repeating this multilingual document search and the addition of the obtained document pairs to the learning data, the learning pairs are grown, and a highly accurate parallel corpus (bilingual document pairs) is finally obtained."

[0009] As stated in the above reference, this technique works effectively only when the documents in the document set targeted by the multilingual document search have similar semantic content (that is, belong to the same field). The present invention turns this property to its advantage. Specifically, the above technique is applied with the multilingual document search targeting the combination of a first category containing a set of documents written in a first language and a second category containing a set of documents written in a second language. If the learning pairs grow, the fields of the first category and the second category are judged to be similar.

[0010] As shown in FIG. 1, one configuration of the present invention comprises: category structure holding means (1) for holding a first category structure generated for a first language and a second category structure generated for a second language; learning data holding means (2) for holding learning data used when performing a multilingual document search; multilingual document search means (3) for performing, using the learning data held in the learning data holding means, a multilingual document search on a category in the first category structure and a category in the second category structure held in the category structure holding means, and determining similar bilingual document pairs of the first language and the second language; search result holding means (4) for holding the document pairs obtained by the multilingual document search and adding the document pairs to the learning data holding means; and category correspondence determining means (5) for determining the correspondence between categories by referring to the document pairs held in the search result holding means. In this configuration, the multilingual search by the multilingual document search means and the addition of document pairs to the learning data by the search result holding means are repeated alternately.

[0011] The present invention can be realized not only as an apparatus or a system but also as a method, and at least part of it can of course be configured as a computer program.

[0012] The above configuration of the present invention and its other configurations are set out clearly in the claims and are described in detail below by way of an embodiment.

[0013]

[Embodiments of the Invention] An embodiment of the present invention is described below.

[0014] FIG. 2 shows the configuration of a corresponding category search system according to an embodiment of the present invention. Although this embodiment is described for Japanese and English, the same effect can be obtained for any language to which morphological analysis (the process of dividing a sentence into words) can be applied.

[0015] In FIG. 2, the category structure holding means 11 holds, inside a computer, the category structures (the first category structure and the second category structure) in which a plurality of Japanese documents and a plurality of English documents are respectively classified into categories and stored.

[0016] The learning data holding means 12 holds a set of Japanese-English translated document pairs (a Japanese-English parallel corpus) as the initial learning data. This parallel corpus is not restricted to any particular field; it is an easily obtainable parallel corpus of general content. When the learning data holding means 12 receives Japanese-English document pairs from the search result holding means 16, it adds them to the parallel corpus that forms the initial learning data and holds the result as a new Japanese-English parallel corpus.

[0017] The word vector generating means 13 uses the Japanese-English parallel corpus held in the learning data holding means 12 as learning data and computes a corresponding multidimensional vector (word vector) for every Japanese word and every English word contained in it. The algorithm for computing the word vectors is described below.

[0018] [Step 1]: Morphological analysis is applied to all Japanese documents and English documents contained in the learning data.

[Step 2]: Of all the words obtained in Step 1, the n words that occur most frequently in the learning data are selected. The n words obtained here are called feature expression words. The value of n is on the order of several thousand.

[Step 3]: A matrix is created whose rows correspond to all the Japanese and English words obtained in Step 1 and whose columns correspond to the feature expression words. If the words obtained in Step 1 comprise 100,000 distinct words and n is 3,000, the result is a matrix of 100,000 rows by 3,000 columns. Each element of this matrix records how many times the word corresponding to the element's row and the feature expression word corresponding to its column co-occur (appear together) across all the Japanese-English translated document pairs contained in the learning data. That is, each Japanese-English translation pair is regarded as a single document, and co-occurrences within that document are counted. The matrix obtained in this way is called a co-occurrence matrix. A co-occurrence matrix that represents every Japanese word and every English word as an n-dimensional vector can thus be created. Each such vector can be said to indicate the contexts in which the word tends to appear.
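Steps 2 and 3 above can be sketched as follows. This is a minimal illustration under stated assumptions: the input is already tokenized (Step 1 is taken as done), a co-occurrence is counted once per translation pair in which both words appear, and ties in frequency are broken alphabetically so the result is deterministic. The specification does not fix these details, and `build_cooccurrence` is a hypothetical name.

```python
from collections import Counter


def build_cooccurrence(translation_pairs, n_features):
    """Return (vocabulary, feature_words, rows) for tokenized translation pairs.

    rows[word][j] is the number of translation pairs, each treated as one
    merged document, that contain both `word` and feature_words[j].
    """
    # Treat each Japanese-English pair as a single document.
    merged_docs = [list(ja) + list(en) for ja, en in translation_pairs]

    # Step 2: choose the n most frequent words as feature expression words.
    frequency = Counter(token for doc in merged_docs for token in doc)
    ranked = sorted(frequency.items(), key=lambda item: (-item[1], item[0]))
    feature_words = [word for word, _ in ranked[:n_features]]

    # Step 3: one row per distinct word, one column per feature word.
    vocabulary = sorted(frequency)
    rows = {word: [0] * len(feature_words) for word in vocabulary}
    for doc in merged_docs:
        present = set(doc)
        for word in present:
            for j, feature in enumerate(feature_words):
                if feature in present:
                    rows[word][j] += 1
    return vocabulary, feature_words, rows


# Hypothetical toy corpus of three tokenized translation pairs.
pairs = [
    (["野球", "塁"], ["baseball", "base"]),
    (["野球", "試合"], ["baseball", "game"]),
    (["化学", "塩基"], ["chemistry", "base"]),
]
vocabulary, feature_words, rows = build_cooccurrence(pairs, n_features=3)
```

In the toy corpus, "塁" and "塩基" both co-occur with "base", but only "塁" co-occurs with "baseball", so their rows differ. This is the context information the word vectors are meant to capture.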

[0019] [Step 4]: Because the n-dimensional vectors obtained in Step 3 have a large number of dimensions, the computation time of the later processing would become enormous. To keep the computation within a practical time, the original n-dimensional vectors are therefore compressed into n'-dimensional vectors (several hundred dimensions) by a matrix dimension reduction technique. Various dimension reduction techniques exist. A representative example is Singular Value Decomposition, which is explained in detail in Berry, M., Do, T., O'Brien, G., Krishna, V. and Varadhan, S., "SVDPACKC USER'S GUIDE", Tech. Rep. CS-93-194, University of Tennessee, Knoxville, TN (1993). The n'-dimensional vectors obtained in this way for all Japanese words and English words are called word vectors.
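One common way to write the compression in Step 4 is as a truncated singular value decomposition of the co-occurrence matrix. The following is a sketch, not the specification's own formula: the symbols $C$, $U_{n'}$, $\Sigma_{n'}$, $V_{n'}$ and $w_i$ are introduced here for illustration, and taking the rows of $U_{n'}\Sigma_{n'}$ as the word vectors is an assumption, since the specification names Singular Value Decomposition only as a representative technique.

```latex
% C: co-occurrence matrix with m rows (all words) and n columns (feature words),
% e.g. m = 100{,}000 and n = 3{,}000 in the example of Step 3.
C \;\approx\; U_{n'} \, \Sigma_{n'} \, V_{n'}^{\top},
\qquad
U_{n'} \in \mathbb{R}^{m \times n'},\quad
\Sigma_{n'} \in \mathbb{R}^{n' \times n'},\quad
V_{n'} \in \mathbb{R}^{n \times n'},\quad
n' \ll n

% Assumed choice of word vector for the i-th word (row i of the reduced matrix):
w_i \;=\; \bigl( U_{n'} \, \Sigma_{n'} \bigr)_{i,:} \;\in\; \mathbb{R}^{n'}
```

Here $\Sigma_{n'}$ holds the $n'$ largest singular values, so each word keeps only its $n'$ strongest directions of co-occurrence.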

[0020] The document vector generating means 14 uses the word vectors obtained by the word vector generating means 13 to compute document vectors for all Japanese documents in category A and all English documents in category B held in the category structure holding means 11. First, morphological analysis is applied to all Japanese documents in category A and all English documents in category B held in the category structure holding means 11, dividing them into words. Next, for each document, the sum of the word vectors of all words contained in the document is computed and normalized (scaled to length 1), and the resulting vector is taken as the document vector. Words for which the word vector generating means 13 has not generated a word vector are ignored.
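The document vector computation just described can be sketched as follows. The function name `document_vector` and the choice to return `None` for a document with no known words are assumptions made for illustration; the specification only says that unknown words are ignored.

```python
import math


def document_vector(tokens, word_vectors):
    """Normalized sum of the word vectors of the known words in `tokens`.

    Returns None when no token has a word vector or the sum is the zero
    vector (a convention chosen for this sketch).
    """
    # Words without a generated word vector are ignored.
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return None
    total = [sum(components) for components in zip(*known)]
    length = math.sqrt(sum(x * x for x in total))
    if length == 0.0:
        return None
    # Scale to unit length so inner products between documents are comparable.
    return [x / length for x in total]


# Hypothetical two-dimensional word vectors.
toy_word_vectors = {"base": [3.0, 0.0], "game": [0.0, 4.0]}
toy_document = document_vector(["base", "game", "unknown"], toy_word_vectors)
```

Because the result has length 1, the inner product of two document vectors equals their cosine, which is why the search step can compare inner products directly.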

[0021] The multilingual search means 15 searches for similar Japanese-English document pairs within a category pair consisting of an arbitrary Japanese category (category A) in the first category structure and an arbitrary English category (category B) in the second category structure held in the category structure holding means 11. The following processing is therefore performed for every category pair of a Japanese category and an English category (every combination of a category A and a category B).

[0022] First, by referring to the document vectors obtained from the document vector generating means 14, pairs of a Japanese document and an English document that satisfy the following condition are extracted from the set of all documents belonging to category A and category B.

[0023] "The English document vector most closely related to (having the largest inner product with) the document vector of the Japanese document in the pair is the English document vector in the pair, and conversely, the Japanese document vector most closely related to the English document vector in the pair is the Japanese document vector in the pair."

[0024] Next, from the Japanese-English document pairs that satisfy the above condition, those pairs are extracted for which the inner product between the Japanese and English document vectors of the pair is larger than a preset threshold. The Japanese-English document pairs obtained in this way are extremely close in semantic content and can be used as learning data.
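The mutual best-match condition and the threshold filter above can be sketched as follows. The function names and document identifiers are hypothetical, the vectors are assumed to be unit-length document vectors keyed by document ID, and ties between equally good matches are resolved by dictionary order, which the specification does not address.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def best_match(query_vector, candidates):
    """ID of the candidate whose vector has the largest inner product."""
    return max(candidates, key=lambda doc_id: dot(query_vector, candidates[doc_id]))


def extract_pairs(ja_vectors, en_vectors, threshold):
    """Mutually best-matching (ja_id, en_id) pairs scoring above `threshold`."""
    if not ja_vectors or not en_vectors:
        return []
    best_en = {ja: best_match(vec, en_vectors) for ja, vec in ja_vectors.items()}
    best_ja = {en: best_match(vec, ja_vectors) for en, vec in en_vectors.items()}
    pairs = []
    for ja, en in best_en.items():
        # Keep the pair only if each document is the other's best match
        # and the inner product exceeds the preset threshold.
        if best_ja[en] == ja and dot(ja_vectors[ja], en_vectors[en]) > threshold:
            pairs.append((ja, en))
    return pairs


# Hypothetical unit-length document vectors.
ja_docs = {"ja1": [1.0, 0.0], "ja2": [0.0, 1.0], "ja3": [0.6, 0.8]}
en_docs = {"en1": [0.96, 0.28], "en2": [0.28, 0.96]}
loose_pairs = extract_pairs(ja_docs, en_docs, threshold=0.9)
strict_pairs = extract_pairs(ja_docs, en_docs, threshold=0.97)
```

In the toy data, "ja3" prefers "en2", but "en2" prefers "ja2", so "ja3" is left unpaired. The mutual condition therefore yields at most one pair per document.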

[0025] The search result holding means 16 holds, inside the computer, the set of Japanese-English document pairs obtained from the multilingual search means 15 for category A and category B. The obtained set of document pairs is passed to the learning data holding means 12 as part of the new learning data.

[0026] In this way, the following processing is performed repeatedly: (1) the word vector generating means 13 generates a word vector set on the basis of the learning data held in the learning data holding means 12; (2) the document vector generating means 14 generates document vector sets; (3) the multilingual search means 15 extracts Japanese-English document pairs with close semantic content; and (4) the search result holding means 16 adds the obtained document pairs to the learning data holding means 12 as part of the learning data (replacing the previously added pairs if pairs have already been added). As a result, the number of document pairs held in the search result holding means 16 increases gradually only when the Japanese document set in category A and the English document set in category B are close in semantic content (that is, when category A and category B belong to the same field).
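The repeated processing (1) to (4) can be sketched as a loop. This is a minimal skeleton under stated assumptions: the three stages are passed in as hypothetical callables, the replacement rule in (4) is interpreted as "learning data = initial corpus + most recently found pairs", and the toy stages at the bottom only mimic growth so the loop can be run. They are not real vector computations.

```python
def bootstrap(initial_corpus, category_a, category_b,
              build_word_vectors, build_doc_vectors, search_pairs, max_rounds):
    """Alternate search and learning-data update; return (pairs, history).

    history[i] is the number of document pairs found in round i.
    """
    found_pairs = []
    history = []
    for _ in range(max_rounds):
        # (4) of the previous round: earlier added pairs are replaced.
        training = list(initial_corpus) + list(found_pairs)
        word_vectors = build_word_vectors(training)               # (1)
        vectors_a = build_doc_vectors(category_a, word_vectors)   # (2)
        vectors_b = build_doc_vectors(category_b, word_vectors)
        found_pairs = search_pairs(vectors_a, vectors_b)          # (3)
        history.append(len(found_pairs))
    return found_pairs, history


# Toy stand-ins: more learning data lets the "search" find more pairs.
def toy_word_vectors(training_pairs):
    return len(training_pairs)


def toy_doc_vectors(documents, word_vectors):
    return (documents, word_vectors)


def toy_search(vectors_a, vectors_b):
    docs_a, capacity = vectors_a
    docs_b, _ = vectors_b
    return list(zip(docs_a, docs_b))[:capacity]


final_pairs, history = bootstrap(
    [("seed-ja", "seed-en")], ["a1", "a2", "a3"], ["b1", "b2", "b3"],
    toy_word_vectors, toy_doc_vectors, toy_search, max_rounds=4)
```

The `history` list is what the determination step would inspect. A count that keeps growing suggests the two categories share a field, while a flat count suggests they do not.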

[0027] The category correspondence determining means 17 refers to the ratio of "the total number of document pairs held in the search result holding means 16" to "the total number of documents contained in category A and category B". If this ratio is larger than a predetermined threshold T, category A and category B are determined to be a similar (same-field) category pair. If the ratio does not exceed the threshold T even though the above repeated processing has been performed on category A and category B at least a fixed number of times, category A and category B are determined not to be a similar (same-field) category pair.
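The ratio test above can be sketched as follows. The function names, the `None` return value for "not yet decided", and the sample numbers are assumptions for illustration.

```python
def pair_ratio(num_pairs, num_docs_a, num_docs_b):
    """Document pairs held, divided by the total documents in both categories."""
    total_docs = num_docs_a + num_docs_b
    return num_pairs / total_docs if total_docs else 0.0


def decide(ratios_per_round, threshold_t, min_rounds):
    """True: similar category pair. False: not similar. None: keep iterating."""
    if any(ratio > threshold_t for ratio in ratios_per_round):
        return True
    # The threshold was never exceeded; give up once enough rounds have run.
    if len(ratios_per_round) >= min_rounds:
        return False
    return None


sample_ratio = pair_ratio(30, 50, 50)
grows = decide([0.05, 0.2, 0.3], threshold_t=0.25, min_rounds=5)
stalls = decide([0.01, 0.02, 0.02], threshold_t=0.25, min_rounds=3)
pending = decide([0.01], threshold_t=0.25, min_rounds=3)
```

Note that if each document appears in at most one pair, as under the mutual best-match condition of [0023], the number of pairs cannot exceed the size of the smaller category, so this ratio never exceeds 0.5 and T has to be chosen below that value.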

[0028] The determination of the category correspondence may be completed after only a single document search. The threshold may also be set in multiple stages. For example, a configuration may be adopted in which, at a predetermined iteration of the document search, the categories are judged not to correspond if the value is below a threshold a (a < b), the document search is repeated and the same judgment is made again if the value is at least the threshold a but below a threshold b, and the categories are immediately judged to correspond if the value is at least the threshold b. In short, any configuration may be adopted as long as the categories are judged to correspond when the document search results show signs that affirm the category correspondence.
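The two-stage threshold example above can be sketched as a three-way decision. The function name and the returned labels are hypothetical, and the compared value is assumed to be the pair ratio of [0027], which the paragraph itself does not state explicitly.

```python
def multistage_decision(ratio, threshold_a, threshold_b):
    """Classify one round's ratio using two thresholds with a < b."""
    if not threshold_a < threshold_b:
        raise ValueError("threshold_a must be smaller than threshold_b")
    if ratio < threshold_a:
        return "no_correspondence"
    if ratio >= threshold_b:
        return "correspondence"
    # Between the thresholds: repeat the document search and judge again.
    return "search_again"


low = multistage_decision(0.02, threshold_a=0.05, threshold_b=0.25)
middle = multistage_decision(0.10, threshold_a=0.05, threshold_b=0.25)
high = multistage_decision(0.30, threshold_a=0.05, threshold_b=0.25)
```

A caller would loop while the result is `"search_again"`, which lets clearly unrelated category pairs be rejected early and clearly related ones be accepted without further iterations.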

[0029] With this configuration, the category correspondence determining means 17 determines whether a correspondence exists for every category pair of a Japanese category and an English category (every combination of a category A and a category B), which makes it possible to determine the category correspondences between the first category structure and the second category structure exhaustively.
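The exhaustive determination over all category pairs can be sketched as follows. The predicate `categories_correspond` stands in for the whole per-pair procedure (iterative search plus ratio test) and is hypothetical, as are the toy category names and topic labels.

```python
from itertools import product


def correspondence_table(japanese_categories, english_categories,
                         categories_correspond):
    """All (category A, category B) pairs that the predicate accepts."""
    return [
        (category_a, category_b)
        for category_a, category_b in product(japanese_categories,
                                              english_categories)
        if categories_correspond(category_a, category_b)
    ]


# Toy predicate: compare hand-assigned topic labels instead of running
# the real per-pair procedure.
toy_topics = {"野球": "sports", "化学": "science",
              "Baseball": "sports", "Chemistry": "science",
              "Finance": "economy"}
table = correspondence_table(
    ["野球", "化学"], ["Baseball", "Chemistry", "Finance"],
    lambda a, b: toy_topics[a] == toy_topics[b])
```

The number of per-pair runs is the product of the two category counts, so in practice the cost of each run (for example, the iteration limit) determines whether the exhaustive pass is affordable.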

[0030] This embodiment uses the multilingual document search technique based on the vector space method with a parallel corpus as learning data, as described in the aforementioned reference Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann and Stanley Peters, "Query Translation Method for Cross Language Information Retrieval", The Proceedings of Machine Translation Summit VII '99 Workshop on Machine Translation for Cross Language Information Retrieval, (1999). However, the same effect can be obtained with any multilingual document search technique that allows search results to be added to the learning data repeatedly (see FIG. 1).

[0031] For example, the same effect can be obtained by a technique in which the documents written in the first language in category A are translated into the second language by a machine translation system, and the multilingual document search is then performed using a general document search technique intended for a single language.

[0032] An example of realizing a machine translation system with a parallel corpus as learning data is given in the reference Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, "The mathematics of statistical Machine Translation: Parameter estimation", Computational Linguistics, 32:263-311, 1993.

[0033]

[Effects of the Invention] As described above, the present invention avoids the problem of word sense disambiguation without preparing learning data for each field, and makes it possible to determine, with high accuracy, the correspondence between categories across a plurality of category structures constructed for different languages.

[Brief description of the drawings]

FIG. 1 is a diagram showing the configuration of a typical corresponding category search system according to the present invention.

FIG. 2 is a diagram showing the configuration of a corresponding category search system according to one embodiment of the present invention.

[Explanation of symbols]

11 Category structure holding means
12 Learning data holding means
13 Word vector generating means
14 Document vector generating means
15 Multilingual search means
16 Search result holding means
17 Category correspondence determining means

Claims

[Claims]

1. A corresponding category search system comprising: category structure holding means for holding a first category structure generated for a first language and a second category structure generated for a second language; learning data holding means for holding learning data used when performing a multilingual document search; multilingual document search means for performing, using the learning data held in the learning data holding means, a multilingual document search on a category in the first category structure and a category in the second category structure held in the category structure holding means, and determining similar bilingual document pairs of the first language and the second language; search result holding means for holding the document pairs obtained by the multilingual document search and adding the document pairs to the learning data holding means; and category correspondence determining means for determining the correspondence between categories by referring to the document pairs held in the search result holding means.

2. A corresponding category search system comprising: category structure holding means for holding a first category structure generated for a first language and a second category structure generated for a second language; learning data holding means for holding translated document pairs as learning data used when performing a multilingual document search; multilingual document search means for performing, using the learning data held in the learning data holding means, a multilingual document search based on the vector space method on a category in the first category structure and a category in the second category structure held in the category structure holding means, and determining similar bilingual document pairs of the first language and the second language; search result holding means for holding the document pairs obtained by the multilingual document search and adding the document pairs to the learning data holding means; and category correspondence determining means for determining the correspondence between categories by referring to the document pairs held in the search result holding means.

3. A corresponding category search system comprising: category structure holding means for holding a first category structure generated for a first language and a second category structure generated for a second language; learning data holding means for holding translated document pairs as learning data used when performing a multilingual document search; multilingual document search means for translating, using the learning data held in the learning data holding means, documents written in the first language that belong to a category in the first category structure held in the category structure holding means into the second language, performing a document search on the resulting set of translated documents and the set of documents belonging to a category in the second category structure, and determining similar bilingual document pairs of the first language and the second language; search result holding means for holding the obtained document pairs and adding the document pairs to the learning data holding means; and category correspondence determining means for determining the correspondence between categories by referring to the document pairs held in the search result holding means.

4. A category structure holding step of storing a first category structure generated for a first language and a second category structure generated for a second language; a learning data holding step of storing learning data used when performing a multilingual document search; a multilingual document search step of performing, using the learning data stored in the learning data holding step, a multilingual document search on the categories in the first category structure and the categories in the second category structure stored in the category structure holding step, and determining similar document pairs, each consisting of a document in the first language and a document in the second language; a search result holding step of holding the document pairs obtained in the multilingual document search step and adding the document pairs as learning data; and a category correspondence determining step of determining a correspondence between categories by referring to the document pairs stored in the search result holding step.

5. A category structure holding step of storing a first category structure generated for a first language and a second category structure generated for a second language; a learning data holding step of storing learning data used when performing a multilingual document search; a multilingual document search step of performing, using the learning data stored in the learning data holding step, a multilingual document search on the categories in the first category structure and the categories in the second category structure stored in the category structure holding step, and determining similar document pairs, each consisting of a document in the first language and a document in the second language; a search result holding step of holding the document pairs obtained in the multilingual document search step and adding the document pairs as learning data; and a category correspondence determining step of determining a correspondence between categories by referring to the document pairs stored in the search result holding step.

6. Category structure holding means for holding a first category structure generated for a first language and a second category structure generated for a second language; learning data holding means for holding learning data used when performing a multilingual document search; multilingual document search means for performing, using the learning data held in the learning data holding means, a multilingual document search on the categories in the first category structure and the categories in the second category structure held in the category structure holding means, and determining similar document pairs, each consisting of a document in the first language and a document in the second language; and category correspondence determining means for determining a correspondence between categories based on the document pairs obtained by the multilingual document search; the foregoing means constituting a corresponding category search system.
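
The arrangement recited above (learning data, multilingual document search, search result holding that feeds pairs back into the learning data, and category correspondence determination) can be illustrated by the following minimal sketch. It is not the implementation of the invention. The function names, the co-occurrence form of the word vectors, the greedy best-match pairing, the threshold values, and the round limit are all assumptions made for illustration. The final ratio test follows the example given in the abstract (number of held pairs divided by the total number of documents in both categories).

```python
import math
from collections import Counter


def word_vectors(pairs):
    """Assumed form: each word maps to counts indexed by learning-pair number."""
    vecs = {}
    for i, (doc1, doc2) in enumerate(pairs):
        for word, n in Counter(doc1 + doc2).items():
            vecs.setdefault(word, Counter())[i] += n
    return vecs


def doc_vector(doc, vecs):
    """Document vector = sum of the vectors of its known words."""
    total = Counter()
    for word in doc:
        total.update(vecs.get(word, {}))
    return total


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def find_corresponding(cat1, cat2, seed_pairs, sim_threshold=0.9,
                       ratio_threshold=0.4, max_rounds=5):
    """Return (categories_correspond, held_pairs) for two categories.

    cat1 and cat2 are lists of tokenized documents in the first and second
    language; seed_pairs is the initial learning data (translation pairs).
    """
    learning = list(seed_pairs)      # learning data holding
    held = []                        # search result holding: (i, j) indices
    used1, used2 = set(), set()
    for _ in range(max_rounds):
        vecs = word_vectors(learning)
        v2 = {j: doc_vector(d, vecs)
              for j, d in enumerate(cat2) if j not in used2}
        new_pairs = []
        for i, d1 in enumerate(cat1):
            if i in used1 or not v2:
                continue
            v1 = doc_vector(d1, vecs)
            # Greedy choice of the most similar remaining document (assumption).
            j, score = max(((j, cosine(v1, v)) for j, v in v2.items()),
                           key=lambda t: t[1])
            if score >= sim_threshold:
                new_pairs.append((i, j))
                used1.add(i)
                used2.add(j)
                del v2[j]
        if not new_pairs:
            break
        held.extend(new_pairs)
        # Feed the retrieved pairs back into the learning data.
        learning.extend((cat1[i], cat2[j]) for i, j in new_pairs)
    total_docs = len(cat1) + len(cat2)
    ratio = len(held) / total_docs if total_docs else 0.0
    return ratio > ratio_threshold, held
```

In this sketch a document whose words are all absent from the learning data cannot be matched in the first round. It may become matchable in a later round once a retrieved pair has been added to the learning data, which is the effect the search result holding means is recited to provide.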