JP2001357065A

JP2001357065A - Method and device for retrieving similar sentence and recording medium having similar sentence retrieval program recorded thereon

Info

Publication number: JP2001357065A
Application number: JP2000178367A
Authority: JP
Inventors: Takayuki Adachi (足立貴行); Kura Furuse (古瀬蔵)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2000-06-14
Filing date: 2000-06-14
Publication date: 2001-12-26

Abstract

PROBLEM TO BE SOLVED: To provide a method and a device for retrieving similar sentences that can retrieve grammatically or semantically similar sentences, and a recording medium on which a similar sentence retrieval program is recorded. SOLUTION: First, each similar candidate sentence in an example sentence collection is annotated with the location and type of every part that can be grammatically or semantically replaced, deleted, or added. The input sentence is likewise annotated with the location and type of every part that can be replaced or that can match an addable part of a similar candidate sentence. Then, when the similarity between the input sentence and each similar candidate sentence is calculated, the parts where the two sentences differ are processed by matching replaceable parts of the same type, deleting unnecessary parts, and adding missing parts. The similar candidate sentence with the highest similarity is extracted as the similar sentence, together with its similarity.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field to which the invention belongs] The present invention relates to a similar sentence retrieval method and device for retrieving sentences similar to a natural-language input sentence, and to a recording medium on which a similar sentence retrieval program is recorded. When a translation corresponding to the retrieved similar sentence exists, that translation is also extracted. The invention can further be applied as part of an example-based translation method and device that generate a translation of the input sentence from the similar sentence and its translation.

[0002]

[Prior art] One conventional similar sentence retrieval method, conventional method 1, is described in Emmanuel Planas, et al., "Formalizing Translation Memories", MT Summit VII, September 1999. This method works on the units produced by morphological analysis and extends the matching targets beyond the surface forms of words to their base forms and parts of speech. It retrieves similar sentences by comparing the similarity of two sentences on five measures, in the order listed: the proportion of words in the input sentence matched by surface form, the proportion of words in the input sentence matched by base form, the proportion of words in the input sentence matched by part of speech, the proportion of words in the candidate sentence that are shared with the input sentence, and the proportion of words in the input sentence that are shared with the candidate sentence.

[0003] Another similar sentence retrieval method, conventional method 2, is described in Japanese Patent Laid-Open No. 6-290210, "Natural language translation device". This method generates syntactic surface patterns from the input sentence and from the sentences to be searched, compares them, and retrieves similar sentences according to the similarity of the patterns.

[0004] Another similar sentence retrieval method, conventional method 3, is described in Eiichiro Sumita and Yutaka Tsutsumi, "A practical retrieval method of similar examples for translation support", IEICE Transactions D-II, Vol. J74-D-II, No. 10, 1991. In this method, after morphological analysis, a sentence that exactly matches the input sentence is searched for. If none is found, the parts of speech of the input sentence are generalized and a sentence that exactly matches the generalized input sentence is searched for.

[0005]

[Problems to be solved by the invention] However, because conventional method 1 matches at the level of the individual morphemes produced by morphological analysis, it cannot match an expression consisting of several morphemes in one sentence against an expression in another sentence when calculating similarity.

[0006] In conventional method 2, a verb must always appear in the pattern, so sentences in which the verb is omitted cannot be handled.

[0007] In conventional method 3, only the input sentence is generalized; the sentences being searched are not, so the range of application of the search is narrow.

[0008] Consider retrieval used as part of example-based translation, in which similar sentences are retrieved by comparing the input sentence with the same-language sentences of bilingual examples, and the translation of the similar sentence is then edited to produce the translation. Conventional methods 1 and 3 do not perform retrieval that anticipates the case in which phrases that the similar sentence lacks relative to the input sentence are added to it, and the corresponding phrases are added to its translation, to generate an appropriate translation.

[0009] Furthermore, conventional methods 1 to 3 retrieve grammatically similar sentences; retrieval of semantically similar sentences is not considered.

[0010] The present invention has been made in view of the above circumstances. Its object is to provide a similar sentence retrieval method and device, and a recording medium on which a similar sentence retrieval program is recorded, that can retrieve grammatically or semantically similar sentences even when the similarity based on surface form alone is not high, and that can shorten the similarity calculation time by eliminating in advance sentences that do not resemble the input sentence.

[0011]

[Means for solving the problems] To achieve the above object, the present invention provides a similar sentence retrieval method for retrieving, from an example sentence collection, a sentence similar to an input sentence. In this method, the similar candidate sentences in the example sentence collection are annotated in advance with information at each location where grammatical or semantic replacement, deletion, or addition is possible. The input sentence is likewise annotated at each location where replacement is possible or where a match with an addable location of a similar candidate sentence is possible. When the similarity between the input sentence and a similar candidate sentence is calculated, the locations where the sentences differ are processed by matching replaceable locations of the same type, deleting unnecessary locations, and adding missing locations. The similar candidate sentence with the highest similarity is extracted as the similar sentence, together with its similarity.

[0012] In the above similar sentence retrieval method, the present invention is further characterized in that, in addition to the similar candidate sentence with the highest similarity, a predetermined number of similar candidate sentences are output as similar sentences in descending order of similarity.
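The selection of a predetermined number of best candidates can be sketched as follows. This is a minimal illustration that assumes the similarities have already been computed; the function name top_similar and the sample scores are hypothetical and do not come from the patent.

```python
import heapq


def top_similar(scored, n):
    """Return the n (sentence, similarity) pairs with the highest similarity.

    `scored` is an iterable of (sentence, similarity) pairs. How the
    similarity is computed is outside the scope of this sketch.
    """
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Hypothetical scores, for illustration only.
scored = [("sentence A", 0.62), ("sentence B", 0.91), ("sentence C", 0.75)]
best_two = top_similar(scored, 2)
```

With n = 1 this reduces to the basic method, which returns only the most similar candidate.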

[0013] In the above similar sentence retrieval method, the present invention is further characterized in that the data on which the replacement, deletion, and addition information is based is created in two parts: general-purpose data and data that depends on the document domain. The domain-dependent data is created automatically as follows. Information is assigned to the example sentences using the existing general-purpose or domain-dependent data, and similar sentences are collected from the example sentence collection after locations that are both replaceable and deletable have been removed. For locations in these sentences that carry no replacement information, where the surface forms or replacement types of the preceding and following locations agree, the set of items whose information at that location is the same but whose surface forms differ is created as new replacement-target data. At the same time, new deletion-target data is created by taking into account the new replacement-target data and the surrounding surface forms.

[0014] In the above similar sentence retrieval method, the present invention is further characterized in that, when the example sentence collection contains a large number of sentences, only the similar candidate sentences in which the number of phrases identical to phrases of the input sentence is at least a predetermined threshold are kept as the new similar candidate sentences.
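The threshold-based narrowing of candidates can be sketched as follows. This is a minimal illustration under stated assumptions: sentences are already split into phrases, shared phrases are counted as distinct items (the text does not say whether repeats count), and the name filter_candidates and the sample data are hypothetical.

```python
def filter_candidates(input_words, examples, threshold):
    """Keep the example sentences sharing at least `threshold` distinct
    phrases with the input sentence.

    `examples` maps a sentence number to that sentence's list of phrases.
    """
    input_set = set(input_words)
    kept = []
    for number, words in examples.items():
        shared = len(input_set & set(words))
        if shared >= threshold:
            kept.append(number)
    return kept


# Made-up English glosses, for illustration only.
examples = {
    1: ["the", "goal", "was", "scored"],
    2: ["it", "rained"],
    3: ["he", "scored", "a", "goal"],
}
candidates = filter_candidates(["he", "scored", "the", "goal"], examples, 3)
```

Only the surviving sentence numbers need to go through the more expensive similarity calculation.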

[0015] In the above similar sentence retrieval method, the present invention is further characterized in that bilingual examples, each pairing a sentence of the example sentence collection with its translation, are used to extract a sentence similar to the input sentence together with its translation.

[0016] The similar sentence retrieval device of the present invention comprises: an example section that stores a plurality of example sentences; input means that reads an input sentence; example analysis and information assignment means that analyzes into phrases the similar candidate sentences obtained from the example sentences in the example section, and assigns information to each location where grammatical or semantic replacement, deletion, or addition is possible; analysis and information assignment means that analyzes into phrases the input sentence read by the input means, and assigns information to each location where grammatical or semantic replacement is possible or where a match with an addable location of a similar candidate sentence is possible; search means that, when calculating the similarity between the input sentence and each analyzed similar candidate sentence, takes into account, at the locations where the sentences differ, matching of replaceable locations of the same type, deletion of unnecessary locations, and addition of missing locations, and extracts the similar candidate sentence with the highest similarity as the similar sentence; and output means that outputs the similar sentence extracted by the search means together with its similarity.

[0017] In the above similar sentence retrieval device, the present invention is further characterized in that the search means extracts as similar sentences, in addition to the similar candidate sentence with the highest similarity, a predetermined number of similar candidate sentences in descending order of similarity.

[0018] In the above similar sentence retrieval device, the present invention is further characterized in that the data on which the assignment of replacement, deletion, and addition information is based is described in two parts, general-purpose data and data that depends on the document domain, and in that the device has data creation means that creates the domain-dependent data automatically as follows. Similar sentences are collected from the sentences of the example sentence collection after locations that are both replaceable and deletable, identified using the existing general-purpose or domain-dependent data, have been removed. For locations in these sentences that carry no replacement information, where the surface forms or replacement types of the preceding and following locations agree, the set of items whose information at that location is the same but whose surface forms differ is created as new replacement-target data. At the same time, new deletion-target data is created by taking into account the new replacement-target data and the surrounding surface forms.

[0019] In the above similar sentence retrieval device, the present invention is further characterized in that the search means first selects, as the similar candidate sentences to be searched, the sentences in which the number of phrases identical to phrases of the input sentence is at least a predetermined threshold.

[0020] In the above similar sentence retrieval device, the present invention is further characterized in that, when bilingual examples in which a translation is associated with each example sentence are used, the device has output means that outputs the similar sentence extracted by the search means together with its translation.

[0021] The present invention also provides a recording medium on which is recorded a similar sentence retrieval program for retrieving, from an example sentence collection, a sentence similar to an input sentence. The program causes a computer to execute the following processing. The similar candidate sentences in the example sentence collection are annotated in advance with information at each location where grammatical or semantic replacement, deletion, or addition is possible. The input sentence is likewise annotated at each location where replacement is possible or where a match with an addable location of a similar candidate sentence is possible. When the similarity between the input sentence and a similar candidate sentence is calculated, the locations where the sentences differ are processed by matching replaceable locations of the same type, deleting unnecessary locations, and adding missing locations. The similar candidate sentence with the highest similarity is extracted as the similar sentence, together with its similarity.

[0022] In the above recording medium on which the similar sentence retrieval program is recorded, the program further causes the computer to execute processing that outputs as similar sentences, in addition to the similar candidate sentence with the highest similarity, a predetermined number of similar candidate sentences in descending order of similarity.

[0023] In the above recording medium on which the similar sentence retrieval program is recorded, the program further causes the computer to execute the following processing. The data on which the replacement, deletion, and addition information is based is created in two parts: general-purpose data and data that depends on the document domain. To create the domain-dependent data automatically, information is assigned to the example sentences using the existing general-purpose or domain-dependent data, and similar sentences are collected from the example sentence collection after locations that are both replaceable and deletable have been removed. For locations in these sentences that carry no replacement information, where the surface forms or replacement types of the preceding and following locations agree, the set of items whose information at that location is the same but whose surface forms differ is created as new replacement-target data. At the same time, new deletion-target data is created by taking into account the new replacement-target data and the surrounding surface forms.

[0024] In the above recording medium on which the similar sentence retrieval program is recorded, the program further causes the computer to execute processing that, when the example sentence collection contains a large number of sentences, keeps as the new similar candidate sentences only those in which the number of phrases identical to phrases of the input sentence is at least a predetermined threshold.

[0025] In the above recording medium on which the similar sentence retrieval program is recorded, the program further causes the computer to execute processing that uses bilingual examples, each pairing a sentence of the example sentence collection with its translation, to extract a sentence similar to the input sentence together with its translation.

[0026] In the method of the present invention for retrieving a sentence similar to an input sentence from the similar candidate sentences in a bilingual example sentence collection, the input sentence is annotated with information on its grammatically or semantically replaceable locations and on the locations that can match addable locations of similar candidate sentences, and the similar candidate sentences are annotated in advance with information on their grammatically or semantically replaceable, deletable, and addable locations. For locations where the input sentence and a similar candidate sentence are expressed differently, replacement (in both the input sentence and the similar candidate sentence) and deletion and addition (in the similar candidate sentence) are performed, and the similarity is then calculated. The similar candidate sentence with the highest similarity is extracted as the similar sentence together with its similarity, and the translation of the similar sentence is extracted at the same time.

[0027] As its device configuration, the invention has: an example section that stores data on bilingual examples; input means that reads an input sentence; example analysis and information assignment means that analyzes into phrases the similar candidate sentences from the example section, and assigns information on the location and type of each part where grammatical or semantic replacement, deletion, or addition is possible; analysis and information assignment means that analyzes into phrases the input sentence from the input means, and assigns information to each location where grammatical or semantic replacement is possible or where a match with an addable location of a similar candidate sentence is possible; search means that, for locations where the phrases of the analyzed similar candidate sentence and the analyzed input sentence differ, replaces phrases in the input sentence and the similar candidate sentence, deletes phrases from the similar candidate sentence, or adds phrases to the similar candidate sentence, then calculates the similarity to the input sentence and extracts the similar candidate sentence with the highest similarity as the similar sentence together with its translation; and output means that outputs the search result.

[0028] In addition, the example section has data creation means that analyzes the example sentences in advance and, for the analyzed phrases, sets the information at each location where replacement, deletion, or addition is possible, either automatically or manually.

[0029]

[Embodiments of the invention] An embodiment of the present invention is described in detail below with reference to the drawings.

[0030] FIG. 1 shows the processing procedure and the configuration of a similar sentence retrieval device according to one embodiment of the present invention. Reference numeral 1 denotes an input unit that inputs a sentence in a first natural language. Reference numeral 2 denotes an analysis and information assignment unit which, as shown in FIG. 2(b), decomposes the sentence read by input unit 1 into phrases by morphological analysis or the like, and marks the replaceable locations and the locations that can match addable locations of similar candidate sentences, both described later. Reference numeral 3 denotes a search unit that compares the analyzed input sentence with the analyzed example sentences described later to retrieve similar sentences. Reference numeral 4 denotes an output unit that outputs the similar sentence extracted by search unit 3, its similarity, and its translation. Reference numeral 5 denotes an example section that includes the bilingual example collection and related data described later.

[0031] As shown in FIG. 13, search unit 3 takes the input sentence that has been analyzed and annotated by analysis and information assignment unit 2 and searches the analyzed bilingual example collection 60 for sentences similar to it. First, similar candidate sentence extraction unit 301 narrows the example sentences down to similar candidate sentences, namely those that contain at least a threshold number of the phrases in the input sentence. Next, similar candidate sentence and input sentence processing unit 302 uses the annotated input sentence and similar candidate sentence to make the two sentences grammatically or semantically alike: it replaces the locations where the input sentence and the similar candidate sentence differ with a symbol of the same type, deletes unnecessary locations that appear only in the similar candidate sentence, and adds phrases at locations that the similar candidate sentence lacks. Finally, similarity calculation unit 303 calculates the similarity between the processed input sentence and the processed similar candidate sentences, and extracts the sentence with the highest similarity together with that similarity.
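The processing and similarity steps performed by units 302 and 303 can be sketched as follows. This is a rough illustration under several assumptions that are not in the source. The Token class and all function names are hypothetical. Every replaceable token is turned into a type symbol rather than only the differing ones, which gives the same matches. The addition of missing phrases is left out. The score comes from Python's difflib.SequenceMatcher as a stand-in, because this section does not give the patent's own similarity formula.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Token:
    surface: str
    replace_type: Optional[str] = None  # e.g. "time"; None if not replaceable
    deletable: bool = False             # used for candidate sentences only


def symbolize(tokens: List[Token]) -> List[str]:
    """Replace every replaceable token with a symbol for its type."""
    return [f"[{t.replace_type}]" if t.replace_type else t.surface for t in tokens]


def process_candidate(candidate: List[Token], input_symbols: List[str]) -> List[str]:
    """Symbolize the candidate and drop deletable tokens absent from the input."""
    present = set(input_symbols)
    kept = []
    for token, symbol in zip(candidate, symbolize(candidate)):
        if token.deletable and symbol not in present:
            continue  # unnecessary location found only in the candidate
        kept.append(symbol)
    return kept


def similarity(a: List[str], b: List[str]) -> float:
    """Stand-in score in [0, 1]; NOT the patent's own similarity formula."""
    return SequenceMatcher(None, a, b).ratio()


# Made-up English glosses, for illustration only.
input_tokens = [Token("yesterday", "time"), Token("he"), Token("scored")]
candidate_tokens = [
    Token("today", "time"),
    Token("he"),
    Token("suddenly", deletable=True),
    Token("scored"),
]

processed_input = symbolize(input_tokens)
processed_candidate = process_candidate(candidate_tokens, processed_input)
raw_score = similarity([t.surface for t in input_tokens],
                       [t.surface for t in candidate_tokens])
processed_score = similarity(processed_input, processed_candidate)
```

In this example the two sentences differ on the surface, but after the time expressions are replaced by the same symbol and the deletable adverb is dropped, the processed sequences coincide and the stand-in score rises accordingly.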

[0032] Reference numerals 10 to 60 denote the components of example section 5 that are prepared in advance as databases for similar sentence retrieval: the bilingual example collection, its example analysis and information assignment unit, and so on. Reference numeral 40 denotes the bilingual example collection, which stores a plurality of bilingual examples, each pairing a numbered Japanese sentence (first natural language) with the corresponding English sentence (second natural language), as illustrated in FIG. 3. FIG. 3 shows Japanese-English bilingual examples about soccer. Such bilingual examples are prepared per domain and per language pair as needed.

[0033] Reference numeral 10 denotes example-independent phrase data (a database) that is used in common and does not depend on the examples of any particular domain; FIG. 4 shows an example. In FIG. 4 the data is divided into dictionaries, which list phrases as they are; bilingual patterns, which list sequences of parts of speech and surface forms; and rules, which determine the targets described below according to given conditions. For each of these, the replacement targets and addition-match targets for input sentences, and the replacement targets, deletion targets, and addition targets for similar candidate sentences, are defined in advance. These targets are designated beforehand, manually or automatically. An addition-match target in an input sentence is something that is checked for a match against an addition target of a similar candidate sentence, and the addition-match targets are the same as the deletion targets for similar candidate sentences.

[0034] The replacement targets in input sentences and similar candidate sentences are patterns such as conjunctions, adverbs, numerals, and adnominals, identified from the results of morphological analysis by analysis and information assignment unit 2 and example analysis and information assignment unit 50. Adjectives and adjectival verbs are designated by rule, separately for each conjugation and type.

[0035] The addition-match targets in input sentences and the deletion targets in similar candidate sentences are mainly modifying words and independent words. Based on the results of morphological analysis by analysis and information assignment unit 2 and example analysis and information assignment unit 50, conjunctions, adverbs, adnominals, the attributive forms of adjectives and of adjectival verbs, the comma "、", a noun followed by "に" (ni), and the like are designated as deletion targets.
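The designation of deletion targets from morphological analysis results can be sketched as follows. This is only an illustration of the categories listed above. The part-of-speech labels are assumed English names rather than the output of any particular analyzer, the Morph class and the deletable_spans function are hypothetical, and the example sentence is made up.

```python
from typing import List, NamedTuple, Tuple


class Morph(NamedTuple):
    surface: str
    pos: str        # part of speech, e.g. "noun", "adverb" (assumed labels)
    form: str = ""  # conjugated form, e.g. "attributive" (assumed label)


SINGLE_POS = {"conjunction", "adverb", "adnominal"}
CONJUGATING_POS = {"adjective", "adjectival verb"}


def deletable_spans(morphs: List[Morph]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of deletion targets."""
    spans = []
    i = 0
    while i < len(morphs):
        m = morphs[i]
        if m.pos in SINGLE_POS or m.surface == "、":
            spans.append((i, i + 1))
        elif m.pos in CONJUGATING_POS and m.form == "attributive":
            spans.append((i, i + 1))
        elif m.pos == "noun" and i + 1 < len(morphs) and morphs[i + 1].surface == "に":
            spans.append((i, i + 2))  # noun + "に" forms one two-word target
            i += 1                    # skip the particle just consumed
        i += 1
    return spans


# Made-up sentence: 前半 に 、 彼 が 突然 得点 し た
sentence = [
    Morph("前半", "noun"), Morph("に", "particle"), Morph("、", "symbol"),
    Morph("彼", "noun"), Morph("が", "particle"), Morph("突然", "adverb"),
    Morph("得点", "noun"), Morph("し", "verb"), Morph("た", "auxiliary"),
]
spans = deletable_spans(sentence)
```

Each span can then be recorded as the range of a deletable phrase in the annotated example sentence.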

[0036] For an addition target in a similar candidate sentence, its type and the position at which the phrase is to be added are specified. Under the rule in FIG. 4, a pattern consisting of the n words before and the m words after a deletable location is extracted from an example sentence 1, and an example sentence 2 is sought that contains this pattern with the deletable location missing. If the words corresponding to the n preceding words and the m following words each agree in part of speech, type, and conjugated form, and example sentence 2 does not contain the element of the same deletable location as example sentence 1, then addable-location information is assigned between the n preceding words and the m following words of example sentence 2.
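The rule for finding an addable location can be sketched as follows. This is one possible reading of the rule, not the patent's implementation. It assumes that "does not contain the element" means the surface forms of the deletable phrase do not occur in example sentence 2. The Morph class, the function find_addable_position, and the sample sentences are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Morph(NamedTuple):
    surface: str
    pos: str
    conj_type: str = ""
    conj_form: str = ""


def key(morph: Morph) -> Tuple[str, str, str]:
    """Features compared by the rule: part of speech, type, conjugated form."""
    return (morph.pos, morph.conj_type, morph.conj_form)


def find_addable_position(s1: List[Morph], del_start: int, del_end: int,
                          s2: List[Morph], n: int, m: int) -> Optional[int]:
    """Return the index in s2 where the deletable phrase s1[del_start:del_end]
    could be added, or None if the rule does not apply."""
    if del_end <= del_start or del_start < n or del_end + m > len(s1):
        return None  # empty phrase, or not enough context around it
    deleted = [x.surface for x in s1[del_start:del_end]]
    surfaces2 = [x.surface for x in s2]
    width = len(deleted)
    if any(surfaces2[j:j + width] == deleted for j in range(len(s2) - width + 1)):
        return None  # sentence 2 already contains the phrase
    context = ([key(x) for x in s1[del_start - n:del_start]]
               + [key(x) for x in s1[del_end:del_end + m]])
    keys2 = [key(x) for x in s2]
    for j in range(len(s2) - len(context) + 1):
        if keys2[j:j + len(context)] == context:
            return j + n  # position between the n and m context words
    return None


# Made-up sentences: "彼 が 突然 得点 し た" and "彼女 が 得点 し た".
s1 = [Morph("彼", "noun"), Morph("が", "particle"), Morph("突然", "adverb"),
      Morph("得点", "noun"), Morph("し", "verb"), Morph("た", "auxiliary")]
s2 = [Morph("彼女", "noun"), Morph("が", "particle"),
      Morph("得点", "noun"), Morph("し", "verb"), Morph("た", "auxiliary")]
position = find_addable_position(s1, 2, 3, s2, n=1, m=1)
```

Here the adverb at index 2 of the first sentence is the deletable location, and the returned index marks the gap in the second sentence where a phrase of the same type could be added.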

[0037] For the other replacement, deletion, and addition targets, the minimum necessary dictionaries, patterns, and rules are specified by hand or by using existing general dictionaries. Examples are "[time]" and "[time] "に" (ni)" in the pattern column of FIG. 4.

[0038] Reference numeral 20 denotes example-dependent phrase data (a database), prepared for each domain. It supplements the domain-independent data with domain-specific expressions that are not covered by the domain-independent replacement, deletion, addition-match, and addition targets. FIG. 5(b) shows an example of example-dependent phrase data 20; it has the same structure as FIG. 4.

[0039] Reference numeral 30 denotes a data creation unit that automatically creates example-dependent data; for instance, it can obtain replacement targets. To obtain replacement targets automatically, example analysis and information assignment unit 50 first assigns information to the example sentences of bilingual example collection (database) 40, applying the example-independent and then the example-dependent phrase correspondence data (FIG. 5(a)) in that order. Next, based on that information, two kinds of deletable location are removed: those where a replaceable location and a deletable location cover exactly the same range, and those in which every independent word (noun, adjective, and so on) contained in the deletable location is a replaceable location. The similarity calculation between an input sentence and a similar candidate sentence described later (considering replacement only) is then applied between the resulting examples, and sentences whose similarity is at least a threshold T are extracted. If, in each example, L words that are not replacement targets have the same part-of-speech sequence and the K elements before and after them are identical, the L-word phrases of those examples are defined as a new replacement target. FIG. 6 shows an example. The item added as a result is "[Replacement 1]" in FIG. 5(b). In this case, "PK" and "free kick" are both nouns and therefore match grammatically; but because similar sentences are selected from domain-dependent example sentences and the match is constrained by agreement of the surrounding elements, the items obtained are also semantically closer. When the replacement target obtained is a noun, "[Replacement 1] "で" (de)" is made a deletion target. The threshold T is set high at first and is lowered step by step to a predetermined lower limit while replacement targets are extracted.
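The discovery of new replacement targets from two similar example sentences can be sketched as follows. This is a simplified illustration that assumes the two sentences have already passed the similarity threshold T and that their deletable locations have already been removed. The Morph class, the function new_replacement_pairs, and the sample sentences are hypothetical; only the pairing of "PK" with "フリーキック" (free kick) comes from the text. The stepwise lowering of T is not shown.

```python
from typing import List, NamedTuple, Optional, Tuple


class Morph(NamedTuple):
    surface: str
    pos: str
    replace_type: Optional[str] = None  # set if already a replacement target


def element(m: Morph) -> str:
    """Context element: the replacement symbol if there is one, else the surface form."""
    return f"[{m.replace_type}]" if m.replace_type else m.surface


def new_replacement_pairs(s1: List[Morph], s2: List[Morph],
                          L: int, K: int) -> List[Tuple[str, str]]:
    """Find L-word phrases in s1 and s2 that could form a new replacement target."""
    found = []
    for i in range(K, len(s1) - L - K + 1):
        for j in range(K, len(s2) - L - K + 1):
            a, b = s1[i:i + L], s2[j:j + L]
            if any(m.replace_type for m in a + b):
                continue  # already covered by an existing replacement target
            if [m.pos for m in a] != [m.pos for m in b]:
                continue  # part-of-speech sequences must agree
            if [m.surface for m in a] == [m.surface for m in b]:
                continue  # identical phrases give nothing new
            before_ok = ([element(m) for m in s1[i - K:i]]
                         == [element(m) for m in s2[j - K:j]])
            after_ok = ([element(m) for m in s1[i + L:i + L + K]]
                        == [element(m) for m in s2[j + L:j + L + K]])
            if before_ok and after_ok:
                found.append(("".join(m.surface for m in a),
                              "".join(m.surface for m in b)))
    return found


def sentence_with(noun: str) -> List[Morph]:
    """Made-up sentence frame: 彼 が <noun> で 得点 し た."""
    return [Morph("彼", "noun"), Morph("が", "particle"), Morph(noun, "noun"),
            Morph("で", "particle"), Morph("得点", "noun"),
            Morph("し", "verb"), Morph("た", "auxiliary")]


pairs = new_replacement_pairs(sentence_with("PK"), sentence_with("フリーキック"), L=1, K=1)
```

Requiring the K neighbouring elements to agree is what keeps two arbitrary nouns from being grouped together: only phrases that occur in the same context are paired.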

[0040] Reference numeral 50 denotes an example analysis / information adding unit. It analyzes, by morphological analysis or the like as shown in FIG. 2(a), the candidate sentences to be extracted as similar sentences, drawing on the example-independent phrase data 10, the example-dependent phrase data 20, and the bilingual example collection 40. In the same way as the analysis / information adding unit 2 for the input sentence (FIG. 2(b)), it identifies the portions where replacement, deletion, or addition is possible and adds that information.

[0041] Reference numeral 60 denotes an analyzed bilingual example collection (database). It holds the output of the example analysis / information adding unit 50 so that the search unit 3 can compare it with the annotated input sentence. Examples are shown in FIGS. 8 and 9.

[0042] FIG. 8 is obtained by analyzing each Japanese example sentence of the bilingual example collection shown in FIG. 3 (FIG. 7) and marking the portions where replacement, deletion, and addition are possible according to the example-independent phrase data of FIG. 4 and the example-dependent phrase data of FIG. 5(b). Information is added from the analysis result in the order example-independent, then example-dependent, and within each in the order replacement targets, deletion targets, addition targets. Example-independent and example-dependent phrase data of the same kind are treated as one set regardless of their origin (for example, [time]). The parts of speech in FIG. 7 and the replacement, deletion, and addition columns in FIG. 8 are written with the abbreviations listed in FIG. 9. The deletion column of FIG. 8, however, indicates the range of the phrase to be deleted. The analyzed bilingual example collection 60 also stores a correspondence table between the words in the sentences and sentence numbers (FIG. 10), produced by the processing of FIG. 2(a).

[0043] In the embodiments, the example analysis / information adding unit 50 and the analyzed bilingual example collection 60 are described for the case where analyzed data corresponding to the data 10 and 20 and the bilingual example collection 40 is held in advance. Alternatively, the examples may be extracted from the data 10 and 20 and the bilingual example collection 40 and annotated each time processing is performed.

[0044]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings. In the following embodiments the input phrases are Japanese and the phrases of the translation of the retrieved similar sentence are English, but the invention is not limited to this combination.

[0045] [Embodiment 1] First, the data in the example section 5 of FIG. 1 is prepared in advance.

[0046] Taking FIG. 3 as the bilingual examples of the bilingual example collection 40 of FIG. 1, the example analysis / information adding unit 50 of FIG. 1 produces FIG. 7 through the morphological analysis in the example-sentence analysis of FIG. 2. In FIG. 7, phrases are separated by "|" and parts of speech by "/", and the part of speech, type number, and inflected form are noted. Then, using FIG. 7 together with the example-independent phrase data 10 and the example-dependent phrase data 20, the portions where replacement, deletion, and addition are possible are identified for each example sentence and annotated, which yields FIG. 8. At the same time, the correspondence table between the words contained in the example sentences and sentence numbers (FIG. 10) is created. FIGS. 8 and 10 are stored in the analyzed bilingual example collection 60 of FIG. 1.

[0047] Part of the example-dependent phrase data 20 of FIG. 1 (FIG. 5(b)) used in creating FIG. 8 is created automatically. First, to add replacement entries to the example-dependent phrase correspondence data, the bilingual example collection 40 is annotated using the example-independent phrase data 10 (FIG. 4) and the example-dependent phrase data 20 (FIG. 5(a)). Next, following FIG. 6, deletable portions that occupy exactly the same range as a replaceable portion, and deletable portions in which every independent word is a replacement portion, are cut from the sentences. Then the similarity between the resulting examples is calculated by the same method as the similarity calculation between an input sentence and a similar candidate sentence described later (considering replacement only), and sentences whose similarity is at or above the threshold T are extracted. Next, when L words that are not replacement targets in each example have the same part-of-speech sequence, and the K elements before and after them are the same, the L-word phrase of each example is defined as a replacement target. As a result, "[Replacement 1]" is added to the replacement-target dictionary column of FIG. 5(b), together with "bilingual dictionary: [Replacement 1]". If the obtained replacement target [Replacement 1] is a noun, "[Replacement 1] “で”" is added to the deletion-target patterns, as is done for other words. It is also possible to set the threshold T high at first and lower it step by step to a predetermined lower limit while extracting replacement targets.
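The context-based extraction of new replacement targets described in this paragraph can be sketched roughly as follows. This is a minimal illustration rather than the patent's implementation: it assumes the two example sentences are already tokenized, have the same length, and are compared position by position, and it omits the pre-selection by similarity threshold T. The names `Token`, `element`, and `new_replacement_targets` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    surface: str                    # word as written
    pos: str                        # part of speech
    category: Optional[str] = None  # replacement category; None if not a replacement target


def element(tok: Token) -> str:
    """An element is the replacement category if one is assigned, else the surface form."""
    return tok.category or tok.surface


def new_replacement_targets(
    a: List[Token], b: List[Token], L: int = 1, K: int = 1
) -> List[Tuple[Tuple[str, ...], Tuple[str, ...]]]:
    found = []
    if len(a) != len(b):  # assumption of this sketch: position-aligned sentences
        return found
    for i in range(K, len(a) - L - K + 1):
        wa, wb = a[i:i + L], b[i:i + L]
        # Skip windows that already contain a replacement target.
        if any(t.category for t in wa + wb):
            continue
        # The L words must share a part-of-speech sequence but differ in surface form.
        if [t.pos for t in wa] != [t.pos for t in wb]:
            continue
        if [t.surface for t in wa] == [t.surface for t in wb]:
            continue
        # The K elements before and after the window must be identical.
        before = [element(t) for t in a[i - K:i]] == [element(t) for t in b[i - K:i]]
        after = ([element(t) for t in a[i + L:i + L + K]]
                 == [element(t) for t in b[i + L:i + L + K]])
        if before and after:
            found.append((tuple(t.surface for t in wa), tuple(t.surface for t in wb)))
    return found
```

With tokens for 「中田／が／ＰＫ／で／貴重な」 and 「中山／が／フリーキック／で／貴重な」, where the proper nouns carry a replacement category, the sketch pairs "ＰＫ" with "フリーキック" because both are nouns surrounded by the same elements.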

[0048] Next, in actual processing, the input sentence 「中田がＰＫで貴重な得点をあげた。」 ("Nakata scored a valuable goal from a PK.") is entered from the input unit 1 of FIG. 1, and the analysis / information adding unit 2 of FIG. 1 produces FIG. 11 through the morphological analysis in the input-sentence analysis of FIG. 2. Identifying the replaceable portions and the portions that can match an addition in a similar candidate sentence, and annotating them, yields FIG. 12.

[0049] Next, the search unit 3 of FIG. 1 searches the example sentences of FIG. 3 (Japanese examples) for sentences similar to the input sentence. The search unit 3 is configured as shown in FIG. 13.

[0050] The similar candidate sentence extraction unit 301 selects, in advance, sentences that contain at least a predetermined proportion of the words of the input sentence, so that example sentences with little resemblance need not be processed. In practice, the words of the input sentence are looked up in FIG. 10, and sentences for which the ratio of the number of matching words to the number of words in the input sentence is at or above a threshold are selected. Here sentence 1 = 8/10 = 0.8, sentence 2 = 8/10 = 0.8, sentence 3 = 9/10 = 0.9, and sentence 4 = 1/10 = 0.1, so with a threshold of 0.7, sentences 1 to 3 are selected.
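The pre-selection described above can be sketched as a lookup in a word-to-sentence-number table, analogous in spirit to FIG. 10. This is a minimal sketch under stated assumptions: sentences are given as lists of already segmented words, every token of the input sentence is counted once, and the function names and toy data are hypothetical rather than taken from the patent.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_word_index(examples: Dict[int, Iterable[str]]) -> Dict[str, Set[int]]:
    """Map each word to the set of sentence numbers that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sent_no, words in examples.items():
        for word in words:
            index[word].add(sent_no)
    return index


def select_candidates(
    input_words: List[str], index: Dict[str, Set[int]], threshold: float = 0.7
) -> Dict[int, float]:
    """Return {sentence number: ratio} for sentences whose share of matching
    input words is at or above the threshold."""
    hits: Dict[int, int] = defaultdict(int)
    for word in input_words:
        for sent_no in index.get(word, ()):
            hits[sent_no] += 1
    n = len(input_words)
    return {s: c / n for s, c in hits.items() if n and c / n >= threshold}


# Toy data: sentence 1 shares 3 of the 4 input words, sentence 2 only 1.
examples = {1: ["a", "b", "c", "d"], 2: ["a", "x", "y", "z"]}
index = build_word_index(examples)
selected = select_candidates(["a", "b", "c", "q"], index, threshold=0.7)
```

Only the sentences that survive this filter go on to the more expensive generalization and similarity steps.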

[0051] The similar candidate sentence / input sentence processing unit 302 generalizes replacement portions, deletes unnecessary phrases, and adds necessary phrases in the similar candidate sentence, and generalizes replacement portions in the input sentence, so that the two become more similar. FIG. 14 shows the example of the input sentence and sentence 3. First, from the analysis results of the input sentence and the similar candidate sentence, surface-form matches are checked phrase by phrase. In the first step of FIG. 14, each phrase of the input sentence, starting from the first, is checked against sentence 3; “中田が”, “ＰＫで”, “貴重な”, and “得点を” match, so 1 is stored as the mark indicating correspondence. Next, for phrases whose surface forms do not match, matches are checked phrase by phrase after phrase replacement has been applied. When replacement is applied, a single phrase may have several possible representations, as with sentence 3 in the second step of FIG. 14. Representations that retain more of the surface form take priority: the alternatives are stored side by side in descending order of the total number of elements (words and replaceable portions) in the phrase and, within that, in ascending order of the number of words covered by replaceable portions, and matches are checked in that order. For “３０分に” in sentence 3 there are “［数］分に” and “［時間］に”, and the former takes priority. As a result, in the second step of FIG. 14, “［動＿２４１＿用］た。” of the input sentence does not match. In the third step of FIG. 14, for phrases that do not match even after replacement, matches are checked in units of every word and replacement portion the phrase can take. Here the alternatives are stored side by side in the priority order of replaceable portions spanning multiple words, single-word surface forms, and single-word replaceable portions, and matches are checked in that order. As a result, “た” and “。” match, so 1 is stored as the mark indicating correspondence. In the fourth step of FIG. 14, portions of the similar candidate sentence that have never matched are deleted if they are deletable, and if something identical to a never-matched portion of the input sentence exists as an addable portion of the similar candidate sentence, it is added to the candidate. First, the portions of the input sentence and sentence 3 that were matched in the preceding steps are identified, and then the deletable portions of sentence 3 are examined. For deletable portions, it is first checked whether an entire phrase can be deleted; if not, the remaining deletable portions are combined and the combination that deletes the most words is chosen. As a result, “そして、” and “３０分に” are deleted. Next the addable portions of sentence 3 are examined, but there are none, so the input sentence is processed into 「中田／が／ＰＫ／で／貴重な／得点／を／あげ／た／。」 and sentence 3 into 「中田／が／ＰＫ／で／貴重な／得点／を／し／た／。」. Processing sentences 1 and 2 in the same way gives FIG. 15.
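The priority rule for alternative representations of one phrase (more elements first, then fewer words absorbed into replaceable portions) can be expressed as a sort key. This is a small sketch, assuming each alternative is given as its element list plus a count of the words covered by replaceable portions; `rank_alternatives` and the tuple layout are hypothetical.

```python
from typing import List, Tuple

# (elements of the phrase, number of words covered by replaceable portions)
Alternative = Tuple[List[str], int]


def rank_alternatives(alternatives: List[Alternative]) -> List[Alternative]:
    """Order alternatives so that those keeping more of the surface form come first:
    descending element count, then ascending number of replaced words."""
    return sorted(alternatives, key=lambda alt: (-len(alt[0]), alt[1]))


# The two generalizations of "３０分に" mentioned in the text.
alternatives = [
    (["[時間]", "に"], 2),      # "３０" and "分" both absorbed into [時間]
    (["[数]", "分", "に"], 1),  # only "３０" absorbed into [数]
]
ranked = rank_alternatives(alternatives)
```

Here `ranked[0]` is the “［数］分に” form, which agrees with the statement that the former takes priority.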

[0052] Next, the similarity calculation unit 303 calculates the similarity between the processed input sentence and each similar candidate sentence. The similarity formula used here is given below, but any other suitable method may be used instead.

[0053] Similarity = 2 × (number of matching elements) / (number of elements in the input sentence + number of elements in the similar candidate sentence)
As a result, sentence 2 is selected as the similar sentence, as shown in FIG. 15.
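The similarity formula is simple enough to state directly in code. The sketch below is illustrative only; the function name and the handling of two empty sentences are assumptions, not something the patent specifies.

```python
def similarity(matched: int, n_input: int, n_candidate: int) -> float:
    """Similarity = 2 * matched / (n_input + n_candidate)."""
    total = n_input + n_candidate
    if total == 0:
        return 0.0  # assumption of this sketch: two empty sentences score 0
    return 2 * matched / total


# 9 matching elements between two 10-element sentences, as in 2 x 9 / (10 + 10).
score = similarity(9, 10, 10)
print(score)  # 0.9
```

The measure is symmetric in the two sentences and reaches 1.0 only when every element of both sentences is matched.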

[0054] Finally, the output unit of FIG. 1 outputs the similar sentence and its translation. In this example, sentence 2, 「中山がフリーキックで貴重な得点をあげた。」, and "Nakayama added a valuable goal from a free kick." are output. If several candidates have the same similarity, then, as shown in FIG. 15, the element (word or replaceable portion) match degree after deletion, the element match degree, and the phrase match degrees of the respective matching steps are compared in that order, and the candidate with the higher value is selected at the first point where a difference appears. Element matching means matching at the level of the words and replacement portions contained in a phrase.
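The tie-breaking rule amounts to a lexicographic comparison: similarity first, then the individual match degrees in the stated order. A minimal sketch, assuming each candidate is a dictionary carrying its similarity and an ordered tuple of match degrees; the key names and numbers are hypothetical.

```python
from typing import Dict, List


def pick_similar_sentence(candidates: List[Dict]) -> Dict:
    """Choose the candidate with the highest similarity; on ties, compare the
    match degrees one by one in the order they are listed."""
    return max(candidates, key=lambda c: (c["similarity"], *c["match_degrees"]))


candidates = [
    {"id": 1, "similarity": 0.9, "match_degrees": (0.9, 0.8, 0.7, 0.7)},
    {"id": 2, "similarity": 0.9, "match_degrees": (0.9, 0.9, 0.7, 0.7)},
]
best = pick_similar_sentence(candidates)
```

Python compares tuples element by element, so the first differing match degree decides the result; if everything is equal, `max` keeps the earliest candidate in the list.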

[0055] If "[Replacement 1]" and "bilingual dictionary: [Replacement 1]" could not be added automatically as replacement targets in FIG. 5(b), the similarity of each of sentences 1 to 3 would be 2 × 9 / (10 + 10) = 0.9. If, in addition, multiple similar sentences were not output as the result and the sentence with the lowest sentence number were output among those of equal similarity, sentence 1 would be selected, even though its overall meaning is less similar to the input sentence than that of the other sentences.

[0056] This example shows two things. Because data that depends on the document's field is created automatically from example sentences of a specific field, the number of grammatical or semantic replacement portions increases, which enables a similarity calculation that takes finer points into account. In addition, removing sentences dissimilar to the input sentence in advance shortens the time required for the similarity calculation.

[0057] [Embodiment 2] This embodiment is described in the same manner as Embodiment 1.

[0058] First, the data in the example section 5 of FIG. 1 is prepared in advance.

[0059] Taking FIG. 16 as the bilingual examples of the bilingual example collection 40 of FIG. 1, the example analysis / information adding unit 50 of FIG. 1 produces FIG. 17 through the morphological analysis in the example-sentence analysis of FIG. 2(a). In FIG. 17, phrases are separated by "|" and parts of speech by "/", and the part of speech, type number, and inflected form are noted. Then, using FIG. 17 together with the example-independent phrase data 10 and the example-dependent phrase data 20, the portions where replacement, deletion, and addition are possible are identified for each example sentence and annotated, which yields FIG. 18. At the same time, the correspondence table between the words contained in each sentence of the bilingual examples and sentence numbers (FIG. 19) is created. FIGS. 18 and 19 are stored in the analyzed bilingual example collection 60 of FIG. 1.

[0060] Because the bilingual examples in this case are those of FIG. 16, FIG. 5(a) is used as the example-dependent phrase data 20. In addition, from the replaceable and deletable portions of FIGS. 17 and 18 and the addition-target rule in the example-independent phrase data of FIG. 4 (with m = 2), an addable portion is assigned to the head of sentence 1 in FIG. 18 (comparing the head of sentence 2, 「そして／、／固＿１９０＿*／が」, with the head of sentence 1, 「固＿１９０＿*／が」, “そして／、” is a deletable portion).

[0061] Next, in actual processing, the input sentence 「そして、中田が貴重な得点をした。」 ("And Nakata scored a valuable goal.") is entered from the input unit 1 of FIG. 1, and the analysis / information adding unit 2 of FIG. 1 produces FIG. 20 through the morphological analysis in the input-sentence analysis of FIG. 2(b). Identifying the replaceable portions and the portions that can match an addition in a similar candidate sentence, and annotating them, yields FIG. 21.

[0062] Next, the search unit 3 of FIG. 1 searches the example sentences of FIG. 16 (Japanese examples) for sentences similar to the input sentence. The description follows FIG. 13.

[0063] The similar candidate sentence extraction unit 301 selects, in advance, sentences that contain at least a predetermined proportion of the words of the input sentence, so that example sentences with little resemblance need not be processed. In practice, the words of the input sentence are looked up in FIG. 19, and sentences for which the ratio of the number of matching words to the number of words in the input sentence is at or above a threshold are selected. Here sentence 1 = 7/10 = 0.7 and sentence 2 = 8/10 = 0.8, so with a threshold of 0.7, both sentence 1 and sentence 2 are selected.

[0064] The similar candidate sentence / input sentence processing unit 302 generalizes replacement portions, deletes unnecessary phrases, and adds necessary phrases in the similar candidate sentence, and generalizes replacement portions in the input sentence, so that the two become more similar. FIG. 22 shows the example of the input sentence and sentence 1. First, checking surface-form matches phrase by phrase from the analysis results of the input sentence and the similar candidate sentence, “貴重な”, “得点を”, and “した。” match in the first step. Next, checking the phrases whose surface forms do not match, phrase by phrase after phrase replacement, “［固＿１９０］が” matches in the second step. For phrases that do not match even after replacement, matches would be checked over every word and replacement-portion element the phrase can take, but sentence 1 has no such phrase, so the third step is skipped. In the fourth step, portions of the similar candidate sentence that have never matched are deleted if they are deletable, and if the similar candidate sentence has an addable portion identical to a never-matched portion of the input sentence, it is added to the candidate. First, the portions of the input sentence and sentence 1 that were matched in the preceding steps are identified. Next, the deletable portions of sentence 1 are examined; there are none, so no deletion is performed. The input sentence still has an unmatched portion and sentence 1 has an addable portion, so it is checked whether they match. The processing is the same as in the earlier matching steps, except that the still-unmatched portion on the input side is matched against the addable portion on the sentence 1 side. As a result, they match on “［接］、”, so the input sentence is processed into 「（接、）／［固＿１９０］／が／貴重な／得点／を／し／た／。」 and sentence 1 into 「（接、）／［固＿１９０］／が／貴重な／得点／を／し／た／。」. Processing sentence 2 in the same way gives FIG. 23.
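The addition step can be sketched as inserting an addable portion into the candidate only when the same run of elements is still unmatched in the input sentence. This is an illustration under assumptions: elements are plain strings, addable portions are keyed by the insertion position in the candidate, and “［接］、” is treated as two elements. The function names are hypothetical.

```python
from typing import Dict, List, Sequence


def contains_run(seq: Sequence[str], run: Sequence[str]) -> bool:
    """True if `run` occurs in `seq` as a contiguous run of elements."""
    n = len(run)
    return any(list(seq[i:i + n]) == list(run) for i in range(len(seq) - n + 1))


def add_missing(
    candidate: List[str], addable: Dict[int, List[str]], unmatched_input: List[str]
) -> List[str]:
    """Insert each addable portion that also appears among the unmatched input elements."""
    result = list(candidate)
    # Work from the rightmost position so that earlier indices stay valid.
    for pos in sorted(addable, reverse=True):
        slot = addable[pos]
        if slot and contains_run(unmatched_input, slot):
            result[pos:pos] = slot
    return result


candidate = ["[固_190]", "が", "貴重な", "得点", "を", "し", "た", "。"]
addable = {0: ["[接]", "、"]}        # addable portion at the head of the sentence
unmatched_input = ["[接]", "、"]     # input elements with no counterpart so far
processed = add_missing(candidate, addable, unmatched_input)
```

After the insertion the candidate has the same ten elements as the generalized input, whereas without it the candidate would keep only eight elements and score lower.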

[0065] Next, the similarity calculation unit 303 calculates the similarity between the processed input sentence and each similar candidate sentence.

[0066] As a result, sentence 1 is selected as the similar sentence, as shown in FIG. 23.

[0067] Finally, the output unit 4 of FIG. 1 outputs the similar sentence and its translation. In this example, sentence 1, 「中山が貴重な得点をした。」, and "Nakayama added a valuable goal." are output. If several candidates have the same similarity, then, as shown in FIG. 23, the element (word or replaceable portion) match degree after deletion, the element match degree, and the phrase match degrees of the respective matching steps are compared in that order, and the candidate with the higher value is selected at the first point where a difference appears. Element matching means matching at the level of the words and replacement portions contained in a phrase.

[0068] If the similar candidate sentence / input sentence processing unit 302 of FIG. 13 could not handle addition targets in similar candidate sentences, “［接、］” could not be added to sentence 1. The similarity of sentence 1 would then be 2 × 8 / (10 + 8) = 0.88, and sentence 2, whose overall meaning is less similar to the input sentence than that of sentence 1, would be selected.

[0069] This example shows that a finer-grained similarity calculation is possible when addition is considered as well as replacement and deletion.

[0070] [Embodiment 3] This embodiment is described in the same manner as Embodiments 1 and 2.

[0071] First, the data in the example section 5 of FIG. 1 is prepared in advance.

[0072] Taking FIG. 24 as the bilingual examples of the bilingual example collection 40 of FIG. 1, the example analysis / information adding unit 50 of FIG. 1 produces FIG. 25 through the morphological analysis in the example-sentence analysis of FIG. 2(a). In FIG. 25, phrases are separated by "|" and parts of speech by "/", and the part of speech, type number, and inflected form are noted. Then, using FIG. 25 together with the example-independent phrase data 10 and the example-dependent phrase data 20, the portions where replacement, deletion, and addition are possible are identified for each example and annotated, which yields FIG. 26. At the same time, the correspondence table between the words contained in each sentence of the bilingual examples and sentence numbers (FIG. 27) is created. FIGS. 26 and 27 are stored in the analyzed bilingual example collection 60 of FIG. 1.

[0073] Because the bilingual examples in this case are those of FIG. 24, FIG. 5(a) is used as the example-dependent phrase data.

[0074] Next, in actual processing, the input sentence 「中山が３０分に貴重な得点をした。」 ("Nakayama scored a valuable goal in the 30th minute.") is entered from the input unit 1 of FIG. 1, and the analysis / information adding unit 2 of FIG. 1 produces FIG. 28 through the morphological analysis in the input-sentence analysis of FIG. 2(b). Identifying the replaceable portions and the portions that can match an addition in a similar candidate sentence, and annotating them, yields FIG. 29.

[0075] Next, the search unit 3 of FIG. 1 searches the example sentences of FIG. 24 (Japanese examples) for sentences similar to the input sentence. The description follows FIG. 13.

[0076] The similar candidate sentence extraction unit 301 selects, in advance, sentences that contain at least a predetermined proportion of the words of the input sentence, so that example sentences with little resemblance need not be processed. In practice, the words of the input sentence are looked up in FIG. 27, and sentences for which the ratio of the number of matching words to the number of words in the input sentence is at or above a threshold are selected. Here sentence 1 = 9/11 = 0.81 and sentence 2 = 10/11 = 0.90, so with a threshold of 0.7, both sentence 1 and sentence 2 are selected.

[0077] The similar candidate sentence / input sentence processing unit 302 generalizes replacement portions, deletes unnecessary phrases, and adds necessary phrases in each similar candidate sentence, and generalizes replacement portions in the input sentence, so that the two become more similar. Processing by the same method as in Embodiments 1 and 2 gives the sentences shown in FIG. 30.

[0078] The similarity calculation unit 303 calculates the similarity between the processed input sentence and each similar candidate sentence.

[0079] As a result, sentence 1 is selected as the similar sentence, as shown in FIG. 30.

[0080] Finally, the output unit 4 of FIG. 1 outputs the similar sentence and its translation. In this example, sentence 1, 「中山が開始直後貴重な得点をした。」, and "After beginning, Nakayama added a valuable goal." are output. If several candidates have the same similarity, then, as shown in FIG. 30, the element (word or replaceable portion) match degree after deletion, the element match degree, and the phrase match degrees of the respective matching steps are compared in that order, and the candidate with the higher value is selected at the first point where a difference appears. Element matching means matching at the level of the words and replacement portions contained in a phrase.

[0081] If the similar candidate sentence / input sentence processing unit 302 of FIG. 13 could not treat a sequence of multiple morphemes as a single replacement target, the similarity of sentence 1 would be 2 × 9 / (11 + 11) = 0.81, and sentence 2, whose overall meaning is less similar to the input sentence than that of the other sentence, would be selected.

[0082] This example shows that allowing a replaceable portion to span multiple words enables a similarity calculation that takes finer points into account.

[0083] The search unit 3 of FIG. 1 may also extract as similar sentences, in addition to the similar candidate sentence with the highest similarity, a predetermined number of similar candidate sentences in descending order of similarity.
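Returning a predetermined number of candidates in descending order of similarity is a standard top-N selection. A minimal sketch using Python's `heapq.nlargest`, assuming the candidates arrive as `(sentence number, similarity)` pairs; the function name and sample scores are hypothetical.

```python
import heapq
from typing import List, Tuple


def top_n(scored: List[Tuple[int, float]], n: int) -> List[Tuple[int, float]]:
    """Return the n highest-scoring (sentence number, similarity) pairs, best first."""
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


scored = [(1, 0.9), (2, 1.0), (3, 0.9), (4, 0.1)]
best_three = top_n(scored, 3)
```

For small example collections a plain sort would work equally well; `nlargest` simply avoids sorting the whole list when only a few results are needed.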

[0084] The similar sentence retrieval method of the present invention is executed, specifically, by a computer such as a personal computer (PC) on the basis of a similar sentence retrieval program recorded in advance on a predetermined computer-readable recording medium. That is, the computer-readable recording medium stores a similar sentence retrieval program that retrieves sentences similar to an input sentence from an example sentence collection, and the program causes a computer to execute the following processing. Information is assigned in advance to the portions of the similar candidate sentences in the example sentence collection where grammatical or semantic replacement, deletion, or addition is possible. Information is likewise assigned to the portions of the input sentence where replacement is possible or that can match an addable portion of a similar candidate sentence. Then, when the similarity between the input sentence and each similar candidate sentence is calculated, the differing portions of the sentences are processed taking into account matches between replacement portions of the same kind, deletion of unnecessary portions, and addition of missing portions, and the similar candidate sentence with the highest similarity is extracted as the similar sentence together with its similarity.

[0085] Further, in the recording medium storing the similar sentence retrieval program, the present invention causes the computer to execute processing that outputs as similar sentences, in addition to the similar candidate sentence with the highest similarity, a predetermined number of similar candidate sentences in descending order of similarity.

[0086] In the recording medium storing the similar sentence retrieval program, the present invention also causes the computer to execute the following processing. The data that serves as the basis for attaching the replacement, deletion, and addition information is created separately as general-purpose data and as data that depends on the document field. In automatically creating the field-dependent data, information is attached to the example sentences using the existing general-purpose or field-dependent data, and similar sentences are collected from the example sentence collection after the parts that are both replaceable and deletable have been removed. At positions in these sentences to which no replacement information is attached, expressions whose preceding and following parts agree in notation or replacement kind, and which carry the same information but differ in notation, are gathered into a set that becomes new replacement data. At the same time, new deletion data is created in consideration of the new replacement data and the surrounding notation.
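The automatic creation of new replacement data can be sketched as a grouping by context. The sketch assumes that each token is a `(surface, part_of_speech, replace_kind)` tuple, that the "information of the position" is the part of speech, and that the preceding and following tokens are compared by replacement kind when one is attached and by surface otherwise. These representations and the function names are hypothetical. The creation of new deletion data is omitted.

```python
from collections import defaultdict
from typing import List, Optional, Sequence, Set, Tuple

Tok = Tuple[str, str, Optional[str]]  # (surface, part_of_speech, replace_kind)


def context_key(tok: Tok) -> str:
    """Compare neighbours by replacement kind if present, else by surface."""
    surface, _pos, kind = tok
    return kind if kind is not None else surface


def extract_replacement_sets(sentences: Sequence[Sequence[Tok]]) -> List[Set[str]]:
    groups = defaultdict(set)
    for sent in sentences:
        # Sentence boundary markers give the first and last tokens a context.
        padded = [("<s>", "BOS", None)] + list(sent) + [("</s>", "EOS", None)]
        for prev, tok, nxt in zip(padded, padded[1:], padded[2:]):
            surface, pos, kind = tok
            if kind is None:  # only positions without replacement information
                groups[(context_key(prev), pos, context_key(nxt))].add(surface)
    # Same context and part of speech but different surfaces: new replacement set.
    return [members for members in groups.values() if len(members) > 1]
```

For two example sentences that differ in a single untagged word of the same part of speech, the two differing words form one new replacement set.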

[0087] In the recording medium storing the similar sentence retrieval program, the present invention also causes the computer to execute processing that, when the example sentence collection contains a large number of sentences, takes as the new similar candidate sentences only those similar candidate sentences in which the number of words or phrases identical to those of the input sentence is equal to or greater than a predetermined threshold.
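The narrowing of candidates by shared words can be sketched with an inverted index from words to sentence numbers. The sketch assumes each example sentence is already split into words and is identified by its position in the list. The function names `build_index` and `prefilter` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Set


def build_index(examples: Sequence[Sequence[str]]) -> Dict[str, Set[int]]:
    """Map each word to the numbers of the example sentences containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sentence_no, words in enumerate(examples):
        for word in set(words):
            index[word].add(sentence_no)
    return index


def prefilter(input_words: Sequence[str], index: Dict[str, Set[int]], threshold: int) -> List[int]:
    """Keep sentences sharing at least `threshold` distinct words with the input."""
    shared = Counter()
    for word in set(input_words):
        for sentence_no in index.get(word, ()):
            shared[sentence_no] += 1
    return sorted(no for no, count in shared.items() if count >= threshold)
```

Only the sentence numbers returned by `prefilter` would then be passed on to the more expensive similarity calculation.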

[0088] In the recording medium storing the similar sentence retrieval program, the present invention also causes the computer to execute processing that extracts a sentence similar to the input sentence together with its translation, using bilingual examples each consisting of a sentence of the example sentence collection paired with its translation.
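Retrieval over bilingual examples can be sketched as follows. The similarity used here is `difflib.SequenceMatcher.ratio()` from the Python standard library, swapped in purely as a stand-in and not the similarity calculation of the invention. The function name and the sample sentence pair used to exercise it are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple

BilingualExample = Tuple[List[str], str]  # (source-language words, translation)


def retrieve_with_translation(
    input_words: Sequence[str], examples: Sequence[BilingualExample]
) -> Tuple[List[str], str, float]:
    """Return the most similar source sentence, its translation, and the score."""

    def score(example: BilingualExample) -> float:
        source_words, _translation = example
        # Stand-in similarity: ratio of matching words in the two sequences.
        return SequenceMatcher(None, list(input_words), source_words).ratio()

    best = max(examples, key=score)
    return best[0], best[1], score(best)
```

Because the translation is stored alongside each example sentence, selecting the similar sentence yields its translation without any further lookup.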

[0089]

[Effects of the Invention] As described above, according to the present invention, a similar sentence is selected for a read-in input sentence in a first natural language using an example collection that contains either sentences in the first natural language only or pairs of such sentences with corresponding sentences in a second natural language. In comparing the similarity between the analyzed input sentence and the analyzed example sentences, the invention takes into account not only exact notation, replacement, and deletion but also addition, and it treats multi-word sequences as possible units of replacement, deletion, and addition. Sentences that are grammatically or semantically similar can therefore be retrieved even when similarity based on notation alone is not high. In addition, part of the data on which the replacement, deletion, and addition information is based can be obtained automatically from bilingual examples in a specific field. Furthermore, removing in advance the sentences that do not resemble the input sentence shortens the time required for similarity calculation.

[0090] In particular, as part of processing that translates an input sentence by editing the translation of a bilingual example sentence, the invention can be used to select a bilingual example whose translation is easy to edit into an appropriate translation.

[Brief description of the drawings]

FIG. 1 is an explanatory diagram of the processing procedure of the similar sentence retrieval method and the configuration of the similar sentence retrieval device according to an embodiment of the present invention.

FIG. 2 is an explanatory diagram of the analysis processing according to the embodiment of the present invention.

FIG. 3 is an explanatory diagram showing example (1) of a bilingual example collection according to the embodiment of the present invention.

FIG. 4 is an explanatory diagram showing an example of example-independent phrase data according to the embodiment of the present invention.

FIG. 5 is an explanatory diagram showing an example of example-dependent phrase data according to the embodiment of the present invention.

FIG. 6 is an explanatory diagram showing an example of automatic extraction of example-dependent phrase data according to the embodiment of the present invention.

FIG. 7 is an explanatory diagram showing the morphological analysis result of the Japanese examples in example (1) of the bilingual example collection according to the embodiment of the present invention.

FIG. 8 is an explanatory diagram showing the analyzed example collection for the Japanese examples in example (1) of the bilingual example collection according to the embodiment of the present invention.

FIG. 9 is an explanatory diagram of parts of speech and categories according to the embodiment of the present invention.

FIG. 10 is an explanatory diagram showing the correspondence between the words contained in each sentence and the sentence numbers for the Japanese examples in example (1) of the bilingual example collection according to the embodiment of the present invention.

FIG. 11 is an explanatory diagram showing morphological analysis result (1) of an input sentence according to the embodiment of the present invention.

FIG. 12 is an explanatory diagram showing analysis result (1) of an input sentence according to the embodiment of the present invention.

FIG. 13 is an explanatory diagram showing the configuration of the retrieval unit according to the embodiment of the present invention.

FIG. 14 is an explanatory diagram showing example (1) of the processing of similar candidate example sentences and an input sentence according to the embodiment of the present invention.

FIG. 15 is an explanatory diagram showing example (1) of the sentences used for similarity calculation and their similarities according to the embodiment of the present invention.

FIG. 16 is an explanatory diagram showing example (2) of a bilingual example collection according to the embodiment of the present invention.

FIG. 17 is an explanatory diagram showing the morphological analysis result of the Japanese examples in example (2) of the bilingual example collection according to the embodiment of the present invention.

FIG. 18 is an explanatory diagram showing the analyzed example collection for the Japanese examples in example (2) of the bilingual example collection according to the embodiment of the present invention.

FIG. 19 is an explanatory diagram showing the correspondence between the words contained in each sentence and the sentence numbers for the Japanese examples in example (2) of the bilingual example collection according to the embodiment of the present invention.

FIG. 20 is an explanatory diagram showing morphological analysis result (2) of an input sentence according to the embodiment of the present invention.

FIG. 21 is an explanatory diagram showing analysis result (2) of an input sentence according to the embodiment of the present invention.

FIG. 22 is an explanatory diagram showing example (2) of the processing of similar candidate example sentences and an input sentence according to the embodiment of the present invention.

FIG. 23 is an explanatory diagram showing example (2) of the sentences used for similarity calculation and their similarities according to the embodiment of the present invention.

FIG. 24 is an explanatory diagram showing example (3) of a bilingual example collection according to the embodiment of the present invention.

FIG. 25 is an explanatory diagram showing the morphological analysis result of the Japanese examples in example (3) of the bilingual example collection according to the embodiment of the present invention.

FIG. 26 is an explanatory diagram showing the analyzed example collection for the Japanese examples in example (3) of the bilingual example collection according to the embodiment of the present invention.

FIG. 27 is an explanatory diagram showing the correspondence between the words contained in each sentence and the sentence numbers for the Japanese examples in example (3) of the bilingual example collection according to the embodiment of the present invention.

FIG. 28 is an explanatory diagram showing morphological analysis result (3) of an input sentence according to the embodiment of the present invention.

FIG. 29 is an explanatory diagram showing analysis result (3) of an input sentence according to the embodiment of the present invention.

FIG. 30 is an explanatory diagram showing example (3) of the sentences used for similarity calculation and their similarities according to the embodiment of the present invention.

[Explanation of symbols]

1 Input unit
2 Analysis and information-attaching unit
3 Retrieval unit
4 Output unit
5 Example unit
10 Example-independent phrase data
20 Example-dependent phrase data
30 Data creation unit
40 Bilingual example collection
50 Example analysis and information-attaching unit
60 Analyzed bilingual example collection
301 Similar candidate sentence extraction unit
302 Similar candidate sentence and input sentence processing unit
303 Similarity calculation unit

Claims

[Claims]

1. A similar sentence retrieval method for retrieving a sentence similar to an input sentence from an example sentence collection, characterized in that: information is attached in advance to each part of the similar candidate sentences of the example sentence collection that can be grammatically or semantically replaced, deleted, or added; information is likewise attached to each part of the input sentence that can be replaced or that can match an addable part of a similar candidate sentence; when the similarity between the input sentence and a similar candidate sentence is calculated, the differing parts of the two sentences are processed in consideration of matches between replaceable parts of the same kind, deletion of unnecessary parts, and addition of missing parts; and the similar candidate sentence with the highest similarity is extracted as the similar sentence together with its similarity.

2. The similar sentence retrieval method according to claim 1, wherein, in addition to the similar candidate sentence with the highest similarity, a predetermined number of similar candidate sentences are output as similar sentences in descending order of similarity.

3. The similar sentence retrieval method according to claim 1 or 2, wherein: the data serving as the basis for attaching the replacement, deletion, and addition information is created separately as general-purpose data and as data that depends on the document field; and, in automatically creating the field-dependent data, information is attached to the example sentences using the existing general-purpose or field-dependent data, similar sentences are collected from the example sentence collection after the parts that are both replaceable and deletable have been removed, the set of expressions that occur at positions to which no replacement information is attached, whose preceding and following parts agree in notation or replacement kind, and that carry the same information but differ in notation is created as new replacement data, and, at the same time, new deletion data is created in consideration of the new replacement data and the surrounding notation.

4. The similar sentence retrieval method according to claim 1, 2 or 3, wherein, when the example sentence collection contains a large number of sentences, the similar candidate sentences in which the number of words or phrases identical to those of the input sentence is equal to or greater than a predetermined threshold are taken as the new similar candidate sentences.

5. A similar sentence retrieval method, wherein a sentence similar to an input sentence and its translation are extracted using bilingual examples, each of which is a pair of a sentence of the example sentence collection and its translation.

6. A similar sentence retrieval device comprising: an example unit in which a plurality of example sentences are stored; input means for reading an input sentence; example analysis and information-attaching means for analyzing, phrase by phrase, the similar candidate sentences obtained from the example sentences of the example unit and attaching information to each part that can be grammatically or semantically replaced, deleted, or added; analysis and information-attaching means for analyzing, phrase by phrase, the input sentence read by the input means and attaching information to each part that can be grammatically or semantically replaced or that can match an addable part of a similar candidate sentence; retrieval means for calculating, for the analyzed similar candidate sentences, the similarity between the input sentence and each similar candidate sentence in consideration of matches between replaceable parts of the same kind in the differing parts of the two sentences, deletion of unnecessary parts, and addition of missing parts, and for extracting the similar candidate sentence with the highest similarity as the similar sentence; and output means for outputting the similar sentence extracted by the retrieval means together with its similarity.

7. The similar sentence retrieval device according to claim 1, wherein the retrieval means extracts, in addition to the similar candidate sentence with the highest similarity, a predetermined number of similar candidate sentences as similar sentences in descending order of similarity.

8. The similar sentence retrieval device according to claim 6, further comprising data creation means, wherein: the data serving as the basis for attaching the replacement, deletion, and addition information is held separately as general-purpose data and as data that depends on the document field; and, in automatically creating the field-dependent data, the data creation means uses the existing general-purpose or field-dependent data to collect similar sentences from the sentences of the example sentence collection from which the parts that are both replaceable and deletable have been removed, creates, as new replacement data, the set of expressions that occur at positions to which no replacement information is attached, whose preceding and following parts agree in notation or replacement kind, and that carry the same information but differ in notation, and, at the same time, creates new deletion data in consideration of the new replacement data and the surrounding notation.

9. The similar sentence retrieval device according to claim 6, 7, or 8, wherein the retrieval means takes in advance, as the similar candidate sentences, the sentences that contain at least a predetermined threshold number of the words or phrases of the input sentence.

10. The similar sentence retrieval device according to claim 6, 7, 8 or 9, comprising bilingual examples in which a translation is associated with each example sentence, and output means for outputting the similar sentence extracted by the retrieval means together with its translation.

11. A computer-readable recording medium storing a similar sentence retrieval program for retrieving a sentence similar to an input sentence from an example sentence collection, the program causing a computer to execute processing in which: information is attached in advance to each part of the similar candidate sentences of the example sentence collection that can be grammatically or semantically replaced, deleted, or added; information is likewise attached to each part of the input sentence that can be replaced or that can match an addable part of a similar candidate sentence; when the similarity between the input sentence and a similar candidate sentence is calculated, the differing parts of the two sentences are processed in consideration of matches between replaceable parts of the same kind, deletion of unnecessary parts, and addition of missing parts; and the similar candidate sentence with the highest similarity is extracted as the similar sentence together with its similarity.

12. The computer-readable recording medium storing the similar sentence retrieval program according to claim 11, wherein the program causes the computer to execute processing that outputs, in addition to the similar candidate sentence with the highest similarity, a predetermined number of similar candidate sentences as similar sentences in descending order of similarity.

13. The computer-readable recording medium storing the similar sentence retrieval program according to claim 11 or 12, wherein the program causes the computer to execute processing in which: the data serving as the basis for attaching the replacement, deletion, and addition information is created separately as general-purpose data and as data that depends on the document field; and, in automatically creating the field-dependent data, information is attached to the example sentences using the existing general-purpose or field-dependent data, similar sentences are collected from the example sentence collection after the parts that are both replaceable and deletable have been removed, the set of expressions that occur at positions to which no replacement information is attached, whose preceding and following parts agree in notation or replacement kind, and that carry the same information but differ in notation is created as new replacement data, and, at the same time, new deletion data is created in consideration of the new replacement data and the surrounding notation.

14. The computer-readable recording medium storing the similar sentence retrieval program according to claim 11, 12 or 13, wherein the program causes the computer to execute processing that, when the example sentence collection contains a large number of sentences, takes as the new similar candidate sentences the similar candidate sentences in which the number of words or phrases identical to those of the input sentence is equal to or greater than a predetermined threshold.

15. The computer-readable recording medium storing the similar sentence retrieval program according to claim 11, 12, 13, or 14, wherein the program causes the computer to execute processing that extracts a sentence similar to the input sentence together with its translation, using bilingual examples each of which is a pair of a sentence of the example sentence collection and its translation.