JP2012113422A

JP2012113422A - Document processing apparatus, method and program

Info

Publication number: JP2012113422A
Application number: JP2010260265A
Authority: JP
Inventors: Seimin Ooyo; セイミンオウヨウ; Do Kevin; ドゥケヴィン; Masaaki Nagata; 昌明永田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-11-22
Filing date: 2010-11-22
Publication date: 2012-06-14
Anticipated expiration: 2030-11-22
Also published as: JP5441872B2

Abstract

PROBLEM TO BE SOLVED: To support editors of articles written through a joint support system by automatically extracting, from one of two texts written in different languages, new information that is not contained in the other text.

SOLUTION: An English article (sentences e) and a Chinese article (sentences c) are translated into the same language. The similarity between each translated English sentence e' and each translated Chinese sentence c' is computed for all combinations of e' and c'. Among the sentences e', the N sentences whose maximum similarity is smallest are extracted, and a flag "+1", indicating a sentence that includes new information, is assigned to the corresponding English sentences before translation. A graph is then created in which the relationships between nodes representing the English and Chinese sentences are expressed as edges weighted according to similarity. A label of 1 is given to each node representing an English sentence flagged "+1", and 0 is given to the others. The position before or after the Chinese sentence whose node receives the largest label through label propagation is determined as the insertion position.

Description

The present invention relates to a document processing apparatus, method, and program, and in particular to a document processing apparatus, method, and program that compare texts written in two different languages and extract, from one text, new information that is not included in the other.

In recent years, free web-based encyclopedias, of which Wikipedia is the best-known example, have attracted attention. Wikipedia uses a collaborative writing system: articles are written through the cooperation of many editors, a single article exists in multiple language versions, and each language version is maintained by a different group of users. As a result, articles in different language versions do not necessarily contain the same content even when they cover the same topic. An editor may therefore obtain information from the article in another language version to ensure that the article in one language version consists of correct or up-to-date information.

Several systems have been proposed to support editors of such collaborative writing systems. For example, a system that matches infoboxes across different language versions of Wikipedia has been proposed (see Non-Patent Document 1). An algorithm that inserts new information into existing text while taking the hierarchical structure of the article into account has also been proposed (see Non-Patent Document 2), as has a method that collects information from the web and automatically generates simple Wikipedia articles (see Non-Patent Document 3).

Eytan Adar, Michael Skinner, and Daniel S. Weld. 2009. Information arbitrage across multi-lingual Wikipedia. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, 94-103.
Erdong Chen, Benjamin Snyder, and Regina Barzilay. 2007. Incremental Text Structuring with Online Hierarchical Ranking. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Conference on Computational Natural Language Learning 2007, 83-91.
Christina Sauper and Regina Barzilay. 2009. Automatically generating Wikipedia articles: a structure-aware approach. In Proceedings of the 47th Annual Meeting of the ACL, 208-216.

However, the technique described in Non-Patent Document 1 handles infoboxes, which are items in a predetermined format, and therefore cannot be applied to the body text of articles.

In addition, the techniques described in Non-Patent Documents 2 and 3 target only articles in a single language and do not consider articles written in different languages.

The present invention has been made in view of the above problems. Its object is to provide a text processing apparatus, method, and program that can support editors of articles written through a joint support system by comparing texts written in different languages and automatically extracting, from one text, new information that is not included in the other.

To achieve the above object, the text processing apparatus of the present invention comprises translation means, similarity calculation means, and new information extraction means. The translation means obtains one or both of a third text, which is a translation of a first text containing one or more sentences written in a first language, and a fourth text, which is a translation of a second text containing one or more sentences written in a second language different from the first language, so that two texts written in the same language are obtained. When the translation means has obtained the third text, the similarity calculation means calculates the similarity between each sentence of the third text and each sentence of the second text or of the fourth text obtained by the translation means, for all combinations of those sentences; when the translation means has not obtained the third text, it calculates the similarity between each sentence of the first text and each sentence of the fourth text, for all combinations of those sentences. When the translation means has obtained the third text, the new information extraction means finds, for each sentence of the third text, the maximum similarity, that is, the largest of the similarities calculated by the similarity calculation means, and extracts each sentence of the first text that corresponds to a sentence of the third text whose maximum similarity is at or below a predetermined threshold as a sentence containing new information not included in the second text; when the translation means has not obtained the third text, it finds the maximum similarity for each sentence of the first text and extracts each sentence of the first text whose maximum similarity is at or below the predetermined threshold as a sentence containing new information not included in the second text.

According to the text processing apparatus of the present invention, the translation means first obtains one or both of a third text, which is a translation of a first text containing a plurality of sentences written in a first language, and a fourth text, which is a translation of a second text containing a plurality of sentences written in a second language different from the first language, so that two texts written in the same language are obtained. When the translation means has obtained the third text, that is, when the first text has been translated, the similarity calculation means calculates the similarity between each sentence of the third text and each sentence of the second text or of the fourth text obtained by the translation means, for all combinations of those sentences. When the translation means has not obtained the third text, that is, when the first text has not been translated, it calculates the similarity between each sentence of the first text and each sentence of the fourth text, for all combinations of those sentences.

Next, when the translation means has obtained the third text, the new information extraction means finds, for each sentence of the third text, the maximum similarity, that is, the largest of the similarities calculated by the similarity calculation means, and extracts each sentence of the first text that corresponds to a sentence of the third text whose maximum similarity is at or below a predetermined threshold as a sentence containing new information not included in the second text. When the translation means has not obtained the third text, it finds the maximum similarity for each sentence of the first text and extracts each sentence of the first text whose maximum similarity is at or below the predetermined threshold as a sentence containing new information not included in the second text.

In this way, one or both of the texts written in different languages are translated to obtain two texts written in the same language, and the similarity between the sentences of the two texts is calculated for all combinations. Each sentence of the first text that corresponds to a sentence of the third text whose maximum similarity is at or below the predetermined threshold, or each sentence of the first text whose maximum similarity is at or below that threshold, is automatically extracted as a sentence containing new information not included in the second text. This makes it possible to support editors of articles written through a joint support system.

The text processing apparatus of the present invention may further comprise: feature extraction means for extracting features from each sentence of the first text; classifier learning means for learning, using the features of each sentence of the first text extracted by the feature extraction means and the extraction result of the new information extraction means for each sentence of the first text, a classifier that identifies from the features of a sentence of the first text whether that sentence contains new information; and re-extraction means for re-extracting the sentences containing new information on the basis of the identification results obtained by inputting each sentence of the first text into the classifier learned by the classifier learning means. This improves the accuracy with which sentences containing new information are extracted.

The text processing apparatus of the present invention may further comprise determination means that generates a graph having nodes representing the sentences of the first text and the second text and edges that express the relationships between nodes and are weighted according to the similarity between those nodes; assigns to each node corresponding to a sentence of the first text a label based on whether that sentence contains new information; and, on the basis of the labels that label propagation over the graph assigns to the nodes corresponding to the sentences of the second text, determines the position before or after the sentence of the second text whose node has the label closest to the label of the node corresponding to the sentence containing new information as the position at which to insert that sentence. Automatically determining a suitable position for inserting a new sentence in this way makes it possible to support editors of articles written through a joint support system.
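The label propagation step can be illustrated with a minimal sketch. The passage above does not give the update rule, so the code assumes one common variant: labeled nodes are clamped to their seed values and every other node repeatedly takes the weighted average of its neighbors' labels. The label values 1 and 0 follow the abstract; the node names, edge weights, and the helpers `propagate_labels` and `add_edge` are hypothetical.

```python
def propagate_labels(neighbors, seed_labels, iterations=50):
    """Iteratively average labels over weighted neighbors.

    neighbors: dict mapping node -> {neighbor: weight} (symmetric graph).
    seed_labels: nodes whose labels stay fixed (assumed clamping rule).
    Nodes without a seed label start at 0.0.
    """
    labels = {node: seed_labels.get(node, 0.0) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            total = sum(nbrs.values())
            if node in seed_labels or total == 0.0:
                updated[node] = labels[node]  # clamped or isolated node
            else:
                updated[node] = sum(w * labels[n] for n, w in nbrs.items()) / total
        labels = updated
    return labels


def add_edge(neighbors, a, b, weight):
    """Store an undirected weighted edge in both directions."""
    neighbors.setdefault(a, {})[b] = weight
    neighbors.setdefault(b, {})[a] = weight


# Hypothetical toy graph: two sentences of the first text, two of the second.
neighbors = {}
add_edge(neighbors, "e1", "c1", 0.1)
add_edge(neighbors, "e1", "c2", 0.8)
add_edge(neighbors, "e2", "c1", 0.9)
add_edge(neighbors, "e2", "c2", 0.1)
add_edge(neighbors, "c1", "c2", 1.0)  # structural edge within the second text

# e1 carries new information (label 1); e2 does not (label 0).
labels = propagate_labels(neighbors, {"e1": 1.0, "e2": 0.0})
best = max(["c1", "c2"], key=lambda c: labels[c])
```

In this toy graph the sentence `c2` ends up with the label closest to 1, so the new sentence would be inserted before or after `c2`.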

The text processing method of the present invention is a method comprising: obtaining one or both of a third text, which is a translation of a first text containing one or more sentences written in a first language, and a fourth text, which is a translation of a second text containing one or more sentences written in a second language different from the first language, so that two texts written in the same language are obtained; when the third text has been obtained, calculating the similarity between each sentence of the third text and each sentence of the second text or of the fourth text, for all combinations of those sentences, and, when the third text has not been obtained, calculating the similarity between each sentence of the first text and each sentence of the fourth text, for all combinations of those sentences; and, when the third text has been obtained, finding for each sentence of the third text the maximum similarity, that is, the largest of the calculated similarities, and extracting each sentence of the first text that corresponds to a sentence of the third text whose maximum similarity is at or below a predetermined threshold as a sentence containing new information not included in the second text, and, when the third text has not been obtained, finding the maximum similarity for each sentence of the first text and extracting each sentence of the first text whose maximum similarity is at or below the predetermined threshold as a sentence containing new information not included in the second text.

The text processing method of the present invention may further extract features from each sentence of the first text; learn, using the extracted features of each sentence of the first text and an extraction result indicating whether each sentence of the first text contains new information, a classifier that identifies from the features of a sentence of the first text whether that sentence contains new information; and re-extract the sentences containing new information on the basis of the identification results obtained by inputting each sentence of the first text into the learned classifier.

The text processing method of the present invention may further generate a graph having nodes representing the sentences of the first text and the second text and edges that express the relationships between nodes and are weighted according to the similarity between those nodes; assign to each node corresponding to a sentence of the first text a label based on whether that sentence contains new information; and, on the basis of the labels that label propagation over the graph assigns to the nodes corresponding to the sentences of the second text, determine the position before or after the sentence of the second text whose node has the label closest to the label of the node corresponding to the sentence containing new information as the position at which to insert that sentence.

The text processing program of the present invention is a program for causing a computer to function as each of the means constituting the text processing apparatus described above.

As described above, according to the text processing apparatus, method, and program of the present invention, texts written in different languages are translated into the same language, the similarity between each sentence of the first translated text and each sentence of the second translated text is calculated for all combinations, and each sentence of the first text that corresponds to a sentence of the first translated text whose maximum similarity is at or below a predetermined threshold is automatically extracted as a sentence containing new information not included in the second text. This provides the effect that editors of articles written through a joint support system can be supported.

A block diagram showing the functional configuration of the text processing apparatus of the first embodiment.
A conceptual diagram for explaining the creation of the graph used for label propagation.
A flowchart showing the text processing routine in the text processing apparatus of the first embodiment.
A flowchart showing the new information identification processing routine in the text processing apparatus of the first embodiment.
A flowchart showing the information insertion position search processing routine in the text processing apparatus of the first embodiment.
A block diagram showing the functional configuration of the text processing apparatus of the second embodiment.
A flowchart showing the new information identification processing routine in the text processing apparatus of the second embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings. The first embodiment describes the case in which an article written in English is compared with an article written in Chinese, sentences containing new information not included in the Chinese article (the second text) are extracted from the English article (the first text), and an appropriate position in the Chinese article at which to insert those sentences is determined.

The text processing apparatus 10 according to the first embodiment is a computer comprising a CPU, a RAM, and a ROM that stores a program for executing the text processing routine described later. It may also include an HDD as storage means. Functionally, as shown in FIG. 1, the computer can be represented as comprising a data reading unit 12, a new information identification unit 20, and an information insertion position search unit 40.

The data reading unit 12 reads an article written in English and an article written in Chinese from an internal or external storage device. Alternatively, it may read articles stored in an external device via a network. The English article read here is assumed to be a text consisting of L sentences (e_1, e_2, ..., e_L), and the Chinese article a text consisting of M sentences (c_1, c_2, ..., c_M).

The new information identification unit 20 can in turn be represented as comprising a preprocessing unit 22, a machine translation unit 24, a similarity calculation unit 26, and a flag assignment unit 28.

The preprocessing unit 22 applies preprocessing, such as recognition of sections and paragraphs and stemming, to both the article written in English (the first text) and the article written in Chinese (the second text).

The machine translation unit 24 translates one or both of the preprocessed English article and the preprocessed Chinese article to obtain two articles written in the same language. The target language may be one of the languages of the articles that were read (English or Chinese) or a third language different from both. For example, when the target language is English, only the article written in Chinese (the second text) is translated into English. The article written in English (the first text) and the translation of the Chinese article (the fourth text) are then both written in the same language (English). Similarly, when the target language is Chinese, only the article written in English (the first text) is translated into Chinese, so that the translation of the English article (the third text) and the article written in Chinese (the second text) are both written in the same language (Chinese). When the target language is a third language (for example, Japanese), both the article written in English (the first text) and the article written in Chinese (the second text) are translated, so that the translation of the English article (the third text) and the translation of the Chinese article (the fourth text) are both written in the same language (Japanese).

In this embodiment, regardless of whether translation has been performed, the English article, when handled as one of the two articles written in the same language, is written as L sentences (e_1', e_2', ..., e_L'), and the Chinese article, when handled as one of the two articles written in the same language, is written as M sentences (c_1', c_2', ..., c_M'). That is, when the English article has not been translated, (e_1, e_2, ..., e_L) = (e_1', e_2', ..., e_L'), and when the Chinese article has not been translated, (c_1, c_2, ..., c_M) = (c_1', c_2', ..., c_M'). Hereinafter in this embodiment, a "translated" article means an article after the two articles written in different languages have become articles written in the same language, regardless of whether that particular article was itself translated.
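The choice of which article to translate, and the primed notation, can be sketched as follows. This is only an illustration: the function `to_common_language`, the language codes, and the stand-in `fake_translate` callable are hypothetical, and any real machine translation tool would take the place of the stand-in.

```python
def to_common_language(english, chinese, target, translate):
    """Return (e_prime, c_prime), both in the target language.

    An article already written in the target language is passed through
    unchanged, so e_prime == english or c_prime == chinese in that case.
    """
    e_prime = english if target == "en" else translate(english, "en", target)
    c_prime = chinese if target == "zh" else translate(chinese, "zh", target)
    return e_prime, c_prime


def fake_translate(sentences, src, tgt):
    """Hypothetical stand-in for a machine translation tool."""
    return [f"[{src}->{tgt}] {s}" for s in sentences]


# Target language English: only the Chinese article is translated.
e_prime, c_prime = to_common_language(["Hello."], ["你好。"], "en", fake_translate)

# Target language Japanese (a third language): both articles are translated.
e_ja, c_ja = to_common_language(["Hello."], ["你好。"], "ja", fake_translate)
```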

A well-known machine translation tool can be used for the translation in the machine translation unit 24. Any tool that can convert articles written in two different languages into the same language will do; the unit is not limited to a particular machine translation tool.

The similarity calculation unit 26 calculates the similarity between each sentence (e_1', e_2', ..., e_L') of the article obtained by translating the English article in the machine translation unit 24 (hereinafter, the "translated English article") and each sentence (c_1', c_2', ..., c_M') of the article obtained by translating the Chinese article in the machine translation unit 24 (hereinafter, the "translated Chinese article"). First, letting n be the vocabulary size (the number of elements in the union of the words appearing in the two articles), each sentence of the translated English article is represented as a vector of word weights. Specifically, the vector representation of the i-th sentence e_i' of the translated English article is denoted T^e_i, with T^e_i = (w_1, w_2, ..., w_n). Likewise, for the translated Chinese article, the vector representation of the j-th sentence c_j' is denoted T^c_j and is a vector of word weights. Here each w_i is a word weight obtained by an algorithm for extracting characteristic words from a document; it can be obtained, for example, with the TF-IDF algorithm used in fields such as information retrieval and document summarization.
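As an illustration of the vector representation, the following sketch builds TF-IDF vectors over a shared vocabulary. The passage does not say which TF-IDF variant is used, so the code makes its own assumptions: each sentence of the two articles counts as one "document" for the IDF term, and the weight is the raw term frequency multiplied by log(N / document frequency). The function name and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return (vocab, vectors) for a list of tokenized sentences.

    Assumption: every sentence is one "document" for IDF, and
    weight = term frequency * log(number of sentences / document frequency).
    """
    vocab = sorted({w for s in sentences for w in s})
    n_sent = len(sentences)
    df = Counter(w for s in sentences for w in set(s))
    idf = {w: math.log(n_sent / df[w]) for w in vocab}
    vectors = []
    for s in sentences:
        tf = Counter(s)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors


# Hypothetical sentences, already translated into one language and tokenized.
english = [["tokyo", "is", "the", "capital"],
           ["the", "tower", "opened", "in", "1958"]]
chinese = [["tokyo", "is", "the", "capital", "city"]]

# Both articles share one vocabulary of size n.
vocab, vectors = tfidf_vectors(english + chinese)
T_e, T_c = vectors[:len(english)], vectors[len(english):]
```

Under this weighting a word that occurs in every sentence, such as "the" here, receives weight 0, while a word unique to one sentence receives the largest weight.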

Next, the similarity between two sentences e_i' and c_j' is calculated using the vectors T^e_i and T^c_j described above. The similarity can be calculated, for example, as the cosine similarity in vector space given by equation (1). The similarity is calculated for all combinations of e_i' (T^e_i) and c_j' (T^c_j), with i = 1, 2, ..., L and j = 1, 2, ..., M.
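Equation (1) is not reproduced in this text. The sketch below therefore uses the standard definition of cosine similarity, the dot product of the two vectors divided by the product of their Euclidean norms, and computes it for every pair of sentences. Returning 0.0 for an all-zero vector is an assumption of the sketch, and the function names and sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors.

    Assumption: returns 0.0 when either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(T_e, T_c):
    """sim[i][j] = cosine(T_e[i], T_c[j]) for all L x M combinations."""
    return [[cosine(te, tc) for tc in T_c] for te in T_e]


# Hypothetical weight vectors for L = 2 and M = 2 sentences.
T_e = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]
T_c = [[2.0, 0.0, 4.0], [0.0, 0.0, 1.0]]
sim = similarity_matrix(T_e, T_c)
```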

The calculation of similarity is not limited to the above method, and other methods may be used. For example, the similarity can be calculated according to whether the two sentences contain the same hyperlink. Other methods may also be used for the word weights, such as giving high weights to words that denote a person, a place, or an event.
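The hyperlink-based alternative might look like the following sketch. It assumes that the link targets of both articles have already been mapped to language-independent identifiers, and it uses a simple binary score; neither detail is specified in the text, and the function name and identifiers are hypothetical.

```python
def hyperlink_similarity(links_a, links_b):
    """Return 1.0 if two sentences share at least one link target, else 0.0.

    Assumption: link targets are language-independent identifiers.
    """
    return 1.0 if set(links_a) & set(links_b) else 0.0


# Hypothetical link identifiers attached to three sentences.
shared = hyperlink_similarity(["Q1490", "Q17"], ["Q17"])
disjoint = hyperlink_similarity(["Q1490"], ["Q148"])
```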

On the basis of the similarities calculated by the similarity calculation unit 26, the flag assignment unit 28 assigns to each sentence e_i of the article written in English a flag indicating whether the sentence contains new information (information not included in the Chinese article). Specifically, for each sentence e_i' of the translated English article, it finds the maximum similarity, that is, the highest of the similarities calculated between e_i' and all sentences c_j' of the translated Chinese article. It then extracts the N sentences (e_min1', e_min2', ..., e_minN') with the smallest maximum similarity and assigns the flag "+1", indicating a sentence that contains new information, to the corresponding sentences (e_min1, e_min2, ..., e_minN) of the English article before translation. All other sentences receive the flag "-1", indicating that they do not contain new information. This rests on the assumption that an English sentence with low similarity to every sentence of the Chinese article is likely to contain new information.
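The flag assignment described above can be sketched as follows, given a similarity matrix with one row per translated English sentence. The function name and sample values are hypothetical; ties are broken by sentence order, which is an assumption, and the Chinese article is assumed to contain at least one sentence.

```python
def flag_new_information(sim, n):
    """Return one flag (+1 or -1) per English sentence.

    sim[i][j] is the similarity between translated English sentence i and
    translated Chinese sentence j. The n sentences with the smallest
    maximum similarity get +1; all others get -1.
    """
    max_sim = [max(row) for row in sim]
    # Stable sort: sentences with equal maximum similarity keep article order.
    order = sorted(range(len(sim)), key=lambda i: max_sim[i])
    new_idx = set(order[:n])
    return [1 if i in new_idx else -1 for i in range(len(sim))]


sim = [
    [0.90, 0.10],  # closely matches the first Chinese sentence
    [0.05, 0.20],  # matches nothing well, so it is likely new information
    [0.30, 0.60],
]
flags = flag_new_information(sim, n=1)
```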

Here, as one example of extracting sentences whose maximum similarity is equal to or less than a predetermined threshold, the case has been described in which the sentences are sorted in ascending order of maximum similarity and the first predetermined number of them are extracted as sentences containing new information (that is, the maximum similarity of the last extracted sentence serves as the threshold). Alternatively, the maximum similarity may be compared with a fixed threshold set in advance, and the sentences whose maximum similarity is equal to or less than that threshold may be extracted.
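
The two flagging variants described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the apparatus's actual implementation: the function names and the sample similarity matrix are hypothetical.

```python
def max_similarities(similarity):
    """similarity[i][j] is the similarity between translated English
    sentence e_i' and translated Chinese sentence c_j'."""
    return [max(row) for row in similarity]


def assign_flags_top_n(similarity, n):
    """Flag +1 for the n sentences with the smallest maximum similarity, -1 otherwise."""
    max_sim = max_similarities(similarity)
    order = sorted(range(len(max_sim)), key=lambda i: max_sim[i])
    new_info = set(order[:n])
    return [1 if i in new_info else -1 for i in range(len(max_sim))]


def assign_flags_by_threshold(similarity, threshold):
    """Flag +1 for every sentence whose maximum similarity is at or below a fixed threshold."""
    return [1 if m <= threshold else -1 for m in max_similarities(similarity)]


# Hypothetical similarities for 4 English sentences and 3 Chinese sentences.
sim = [
    [0.9, 0.2, 0.1],
    [0.1, 0.3, 0.2],
    [0.4, 0.8, 0.5],
    [0.05, 0.1, 0.0],
]
flags_top_n = assign_flags_top_n(sim, 2)
flags_threshold = assign_flags_by_threshold(sim, 0.3)
```

With this sample data both variants flag the second and fourth English sentences, whose best match in the Chinese article is weakest.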

For each English sentence to which the flag assigning unit 28 has assigned the flag "+1" (a sentence containing new information), the information insertion position search unit 40 searches for the most suitable position at which to insert that sentence into the article written in Chinese. Label propagation is used to solve this problem.

Specifically, as shown for example in FIG. 2, a graph G = (V, E) is created that has nodes representing each sentence of the article written in English and each sentence of the article written in Chinese, and edges that represent the relationships between nodes, weighted according to the similarity between the nodes. V is the set of nodes, representing the L English sentences (e_1, e_2, ..., e_L) and the M Chinese sentences (c_1, c_2, ..., c_M). E is the set of edges. There are L × M edges between the English nodes and the Chinese nodes, and the weight of each such edge is the cosine similarity between e_i', the translation of the English sentence e_i, and c_j', the translation of the Chinese sentence c_j. Edges are also formed between English nodes and between Chinese nodes. For example, when two Chinese sentences c_i and c_j are in the same paragraph, an edge is formed between the nodes of c_i and c_j. The weight of this edge is calculated as w_ij = 1 / dist(c_i, c_j), where dist is the distance between c_i and c_j (the number of sentences between them). This graph can represent all the similarity links as well as information about document structure.
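
The graph construction can be sketched as a symmetric weight matrix. The sketch below is illustrative only and makes two assumptions that the text does not state: dist is taken as the difference of sentence indices (so adjacent sentences have distance 1, which avoids division by zero), and same-paragraph edges are formed for the English nodes in the same way as for the Chinese nodes. The names `build_graph`, `en_paragraph`, and `zh_paragraph` are hypothetical.

```python
def build_graph(cross_sim, en_paragraph, zh_paragraph):
    """Return a symmetric weight matrix; English nodes come first, then Chinese nodes.

    cross_sim[i][j]: cosine similarity between e_i' and c_j'.
    en_paragraph / zh_paragraph: paragraph id of each sentence, in document order.
    """
    n_en, n_zh = len(en_paragraph), len(zh_paragraph)
    size = n_en + n_zh
    weights = [[0.0] * size for _ in range(size)]

    # L x M cross-lingual edges weighted by cosine similarity.
    for i in range(n_en):
        for j in range(n_zh):
            weights[i][n_en + j] = weights[n_en + j][i] = cross_sim[i][j]

    def add_paragraph_edges(paragraph_ids, offset):
        # Same-paragraph edges weighted by 1 / dist (assumed: dist = index difference).
        for a in range(len(paragraph_ids)):
            for b in range(a + 1, len(paragraph_ids)):
                if paragraph_ids[a] == paragraph_ids[b]:
                    w = 1.0 / (b - a)
                    weights[offset + a][offset + b] = w
                    weights[offset + b][offset + a] = w

    add_paragraph_edges(en_paragraph, 0)
    add_paragraph_edges(zh_paragraph, n_en)
    return weights


# Hypothetical example: 2 English sentences, 3 Chinese sentences.
cross_sim = [[0.5, 0.1, 0.0], [0.2, 0.0, 0.7]]
graph = build_graph(cross_sim, en_paragraph=[0, 0], zh_paragraph=[0, 0, 1])
```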

Next, to initialize the graph, the label "1" is assigned to the nodes representing English sentences flagged "+1", and the label "0" is assigned to the other English nodes. At this stage, the labels of the Chinese nodes are undetermined. Label propagation then propagates the labels assigned to the English nodes to the nodes whose labels are undetermined (the Chinese nodes) according to the relationships between nodes. This can be regarded as running a Markov chain on the graph created as described above.

In label propagation, the values assigned to the unlabeled nodes can be calculated either by iterative Markov chain computation or by direct eigenvector computation. When direct eigenvector computation is used, the values to be assigned are obtained by solving equation (2) below.

Here, f is the (N+M)-dimensional vector of labels. f is restricted to "1" or "0" at the English nodes and is undetermined at the Chinese nodes. The objective function above realizes label propagation by forcing a pair of nodes (i, j) to have similar labels f_i and f_j when the edge weight w_ij is large. f can be calculated by finding the eigenvectors of the graph Laplacian.

After propagation, each Chinese node has a numerical label in the range [0, 1]. As noted above, nodes joined by a heavily weighted edge receive similar labels, so the English sentence corresponding to a node labeled "1" and the Chinese sentence corresponding to the node whose label is closest to "1" are closely related. Accordingly, the position immediately before or after the Chinese sentence corresponding to the node whose label is closest to "1", here the node with the largest label, can be determined as the optimal position at which to insert the sentence containing new information.
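
The iterative variant of label propagation and the choice of insertion position can be sketched as below. This is a commonly used clamped-averaging scheme offered as an assumption; it is not necessarily the exact computation of equation (2). The function names and the example weight matrix are hypothetical.

```python
def propagate_labels(weights, labels, iterations=200):
    """Iteratively propagate labels over a weighted graph.

    weights: symmetric matrix of non-negative edge weights.
    labels:  1 or 0 for labeled (English) nodes, None for unlabeled (Chinese) nodes.
    Labeled nodes stay clamped; each unlabeled node repeatedly takes the
    weighted average of its neighbours' current values.
    """
    f = [0.0 if v is None else float(v) for v in labels]
    unlabeled = [k for k, v in enumerate(labels) if v is None]
    for _ in range(iterations):
        updated = f[:]
        for k in unlabeled:
            total = sum(weights[k])
            if total > 0:
                updated[k] = sum(w * f[m] for m, w in enumerate(weights[k])) / total
        f = updated
    return f


def best_insertion_node(f, labels):
    """Index of the unlabeled node whose propagated label is largest."""
    unlabeled = [k for k, v in enumerate(labels) if v is None]
    return max(unlabeled, key=lambda k: f[k])


# Hypothetical graph: nodes 0-1 are English, nodes 2-4 are Chinese.
weights = [
    [0.0, 0.0, 0.8, 0.1, 0.1],
    [0.0, 0.0, 0.1, 0.1, 0.7],
    [0.8, 0.1, 0.0, 1.0, 0.0],
    [0.1, 0.1, 1.0, 0.0, 0.0],
    [0.1, 0.7, 0.0, 0.0, 0.0],
]
labels = [1, 0, None, None, None]  # English node 0 carries new information
f = propagate_labels(weights, labels)
best = best_insertion_node(f, labels)
```

In this example the first Chinese node (index 2) ends with the largest label, so the new sentence would be inserted before or after the corresponding Chinese sentence.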

Next, the text processing routine executed by the text processing apparatus 10 of the first embodiment will be described with reference to FIG. 3.

In step 100, new information identification processing is executed, which extracts from the first text the sentences containing new information not included in the second text.

The new information identification processing routine will now be described with reference to FIG. 4. Here too, the case is described in which sentences containing new information not included in the article written in Chinese (the second text) are extracted from the article written in English (the first text).

In step 120, the article written in English and the article written in Chinese are read from an internal or external storage device. Alternatively, articles stored on an external device may be read over a network. The English article read here is assumed to be a text consisting of L sentences (e_1, e_2, ..., e_L), and the Chinese article a text consisting of M sentences (c_1, c_2, ..., c_M).

Next, in step 122, preprocessing such as section and paragraph recognition and stemming is applied to each of the English and Chinese articles read in step 120.

Next, in step 124, the English article and the Chinese article preprocessed in step 122 are each translated into the same language. Here, the English article is translated into L sentences (e_1', e_2', ..., e_L'), and the Chinese article into M sentences (c_1', c_2', ..., c_M').

Next, in step 126, the i-th sentence e_i' of the translated English article (e_1', e_2', ..., e_L') obtained in step 124 is represented by a vector T^e_i = (w_1, w_2, ..., w_n) of word weights obtained using, for example, the TF-IDF algorithm, where n is the vocabulary size. Likewise, the j-th sentence c_j' of the translated Chinese article is represented by a vector T^c_j. The similarity between two sentences e_i' and c_j' is then calculated, for example as the cosine similarity in the vector space shown in equation (1), for all combinations of e_i' (T^e_i) and c_j' (T^c_j) (i = 1, 2, ..., L; j = 1, 2, ..., M).
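
Step 126 can be sketched with a small pure-Python TF-IDF and cosine similarity. The weighting formula below (raw term frequency times a smoothed inverse document frequency, computed over the sentences of both translated articles together) is one common variant chosen for illustration; the text does not specify which TF-IDF variant is used. All function names and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """sentences: list of token lists. Returns one sparse {term: weight} dict per sentence."""
    n = len(sentences)
    doc_freq = Counter(term for tokens in sentences for term in set(tokens))
    vectors = []
    for tokens in sentences:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(english, chinese):
    """Similarity of every translated English sentence to every translated Chinese sentence."""
    vectors = tfidf_vectors(english + chinese)
    en_vecs, zh_vecs = vectors[:len(english)], vectors[len(english):]
    return [[cosine(u, v) for v in zh_vecs] for u in en_vecs]


# Hypothetical tokenized sentences, already translated into a common language.
english = [["the", "bridge", "opened", "in", "1932"], ["it", "carries", "rail", "traffic"]]
chinese = [["the", "bridge", "opened", "in", "1932"], ["the", "bridge", "is", "steel"]]
sim = similarity_matrix(english, chinese)
```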

Next, in step 128, for each sentence e_i' of the translated English article, the maximum similarity is determined, that is, the highest of the similarities to the sentences c_j' of the translated Chinese article calculated in step 126. The N sentences (e_min1', e_min2', ..., e_minN') with the smallest maximum similarity are then extracted. The flag "+1", indicating a sentence containing new information, is assigned to the corresponding sentences (e_min1, e_min2, ..., e_minN) of the untranslated English article, the flag "-1", indicating that no new information is contained, is assigned to the other English sentences, and the routine returns.

Returning to the text processing routine (FIG. 3), the process proceeds to step 102, where information insertion position search processing is executed to search for the position at which to insert each extracted sentence containing new information.

The information insertion position search processing routine will now be described with reference to FIG. 5.

In step 140, a graph G = (V, E) is created that has nodes representing each sentence (e_1, e_2, ..., e_L) of the English article and each sentence (c_1, c_2, ..., c_M) of the Chinese article, and edges representing the relationships between nodes, weighted according to the similarity between the nodes.

Next, in step 142, to initialize the graph created in step 140, the label "1" is assigned to the nodes representing English sentences flagged "+1" in the new information identification processing (FIG. 4), and the label "0" is assigned to the other English nodes.

Next, in step 144, labels are propagated from the labeled English nodes to the unlabeled Chinese nodes. Through label propagation, each Chinese node receives a numerical label in the range [0, 1].

Next, in step 146, the position before or after the Chinese sentence corresponding to the Chinese node with the largest label value assigned by the label propagation in step 144 is determined as the optimal position at which to insert the sentence containing new information, and the routine returns.

Returning to the text processing routine (FIG. 3), the process proceeds to step 104, where data on the position at which to insert the sentence containing new information, as determined by the information insertion position search processing, is output, and the processing ends. In the description here, the flag information assigned in the new information identification processing, that is, the data on the sentences containing new information, is passed to the subsequent information insertion position search processing. However, the data on the sentences containing new information may instead be output as a processing result without being passed to the subsequent stage.

As described above, the text processing apparatus of the first embodiment translates texts written in different languages (an English article and a Chinese article) into the same language, calculates the similarity between each sentence of the first translated text (the translated English article) and each sentence of the second translated text (the translated Chinese article) for all combinations, and automatically extracts, as sentences containing new information not included in the second text, the sentences of the first text that correspond to sentences of the first translated text whose maximum similarity is equal to or less than a predetermined threshold. This makes it possible to support the editors of articles written with a joint support system.

In addition, a graph is created in which each sentence is a node and the relationships between nodes are edges weighted according to similarity. Labels based on whether a sentence contains new information are assigned to the nodes corresponding to the sentences of the first text (the English nodes), and the labels are propagated from those nodes so that the nodes corresponding to the sentences of the second text (the Chinese nodes), whose label values are undetermined, receive labels. A position suitable for inserting a new sentence is then determined automatically from the label values assigned to the nodes corresponding to the sentences of the second text. This also makes it possible to support the editors of articles written with a joint support system.

Next, a second embodiment will be described. The second embodiment covers the case in which sentences containing new information are extracted using the identification results of a classifier. As before, the description covers the case in which sentences containing new information not included in the article written in Chinese (the second text) are extracted from the article written in English (the first text).

As shown in FIG. 6, the text processing apparatus 210 according to the second embodiment differs from the text processing apparatus 10 according to the first embodiment in the configuration of the new information identification unit. The differences from the first embodiment are described below.

The new information identification unit 220 of the text processing apparatus 210 according to the second embodiment can be represented as the configuration of the new information identification unit 20 of the text processing apparatus 10 according to the first embodiment with a feature extraction unit 30, a classifier learning unit 32, and a flag update unit 34 added, and with the flag assigning unit 28 replaced by a flag assigning unit 228.

The flag assigning unit 228 determines, for each sentence e_i' of the translated English article, the maximum similarity, that is, the highest of the similarities calculated between e_i' and every sentence c_j' of the translated Chinese article. It sorts the sentences e_i' in descending order of maximum similarity, extracts the N sentences (e_min1', e_min2', ..., e_minN') with the smallest maximum similarity, and assigns the flag "+1", indicating a sentence containing new information, to the corresponding sentences (e_min1, e_min2, ..., e_minN) of the untranslated English article. It also extracts the N sentences (e_max1', e_max2', ..., e_maxN') with the largest maximum similarity and assigns the flag "-1", indicating that no new information is contained, to the corresponding sentences (e_max1, e_max2, ..., e_maxN) of the untranslated English article. Sentences that fall in neither group of N receive no flag.
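
The seed flagging performed by the flag assigning unit 228 can be sketched as follows. The function name `seed_flags` and the use of `None` for unflagged sentences are assumptions made for illustration, as is the requirement that the two groups of N do not overlap.

```python
def seed_flags(max_sim, n):
    """Flag the n least similar sentences +1, the n most similar -1, and the rest None.

    max_sim[i] is the maximum similarity of translated English sentence e_i'
    to any translated Chinese sentence.
    """
    if n < 1 or 2 * n > len(max_sim):
        raise ValueError("n must satisfy 1 <= n <= len(max_sim) / 2")
    order = sorted(range(len(max_sim)), key=lambda i: max_sim[i])
    flags = [None] * len(max_sim)
    for i in order[:n]:
        flags[i] = 1    # likely to contain new information
    for i in order[-n:]:
        flags[i] = -1   # likely to contain no new information
    return flags


# Hypothetical maximum similarities for five English sentences.
flags = seed_flags([0.9, 0.3, 0.8, 0.1, 0.5], 2)
```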

For each sentence e_i of the article written in English, the feature extraction unit 30 extracts features that the classifier learning unit 32, described later, uses to train a classifier. For example, the following features can be extracted.

Similarity: the maximum cosine similarity between e_i', the translation of a sentence e_i of the English article, and the sentences c_j', the translations of the sentences c_j of the Chinese article
Neighbor similarity: the maximum cosine similarity between e_{i-1}' and e_{i+1}', the translations of the adjacent (immediately preceding and immediately following) sentences e_{i-1} and e_{i+1}, and the sentences c_j', the translations of the sentences c_j of the Chinese article
Entropy: the entropy of the similarities between e_i' and the sentences c_j', calculated by converting those similarities into conditional probabilities using equation (3) below

It is not necessary to use all three of these features; at least one of them suffices. They may also be used in combination with other features.
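
The three features can be sketched as follows. Because equation (3) is not reproduced in this text, the conversion to conditional probabilities below (dividing each similarity by the sum of the row) is an assumption, as are the value 0.0 used for a missing neighbor at the start or end of the article and the names `entropy_feature` and `extract_features`.

```python
import math


def entropy_feature(row):
    """Entropy of one sentence's similarities after normalising them to sum to 1 (assumed)."""
    total = sum(row)
    if total == 0:
        return 0.0
    probs = [s / total for s in row if s > 0]
    return -sum(p * math.log(p) for p in probs)


def extract_features(similarity):
    """Return (max similarity, previous neighbour's max, next neighbour's max, entropy) per sentence."""
    max_sim = [max(row) for row in similarity]
    features = []
    for i, row in enumerate(similarity):
        prev_sim = max_sim[i - 1] if i > 0 else 0.0
        next_sim = max_sim[i + 1] if i + 1 < len(similarity) else 0.0
        features.append((max_sim[i], prev_sim, next_sim, entropy_feature(row)))
    return features


# Hypothetical similarities for 2 English sentences and 2 Chinese sentences.
features = extract_features([[0.5, 0.5], [1.0, 0.0]])
```

A sentence that is equally similar to every Chinese sentence has high entropy, while a sentence with one clear match has entropy near zero.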

The classifier learning unit 32 supplies a support vector machine (SVM) with the features of each sentence e_i of the English article extracted by the feature extraction unit 30 and the flags assigned to the sentences e_i by the flag assigning unit 28, and trains a classifier that, given the features of a sentence e_i of the English article as input, returns a result (flag) identifying whether e_i contains new information. Any conventionally known learning technique may be used; the method is not limited to SVM.

The flag update unit 34 inputs the features of each sentence e_i of the English article into the classifier trained by the classifier learning unit 32 and, based on the identification result, assigns to each sentence e_i either the flag "+1", indicating a sentence containing new information, or the flag "-1", indicating that no new information is contained. For sentences that have already been flagged by the flag assigning unit 228, the flag is updated according to the classifier's identification result.
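
The train-then-reflag cycle can be sketched as below. To keep the example self-contained, a simple perceptron is used as a stand-in for the SVM named in the text; this substitution, the function names, and the sample features are all assumptions made for illustration.

```python
def train_perceptron(samples, targets, epochs=100):
    """Train a linear classifier (perceptron stand-in for the SVM) on +1/-1 targets."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, t in zip(samples, targets):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if t * score <= 0:  # misclassified: move the boundary toward this sample
                weights = [w + t * xi for w, xi in zip(weights, x)]
                bias += t
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def update_flags(features, seed_flags):
    """Train on the sentences that carry a seed flag, then flag every sentence."""
    labeled = [(x, s) for x, s in zip(features, seed_flags) if s is not None]
    weights, bias = train_perceptron([x for x, _ in labeled], [s for _, s in labeled])
    return [
        1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1
        for x in features
    ]


# Hypothetical one-dimensional features (maximum similarity) and seed flags.
features = [[0.9], [0.2], [0.8], [0.1], [0.5]]
seeds = [-1, 1, -1, 1, None]
flags = update_flags(features, seeds)
```

Every sentence, including the one without a seed flag, ends up with a flag of +1 or -1 from the trained classifier.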

Next, the text processing routine executed by the text processing apparatus 210 of the second embodiment will be described. It differs from the text processing routine of the first embodiment only in the new information identification processing, so the new information identification processing routine of the second embodiment is described with reference to FIG. 7. Processing identical to the new information identification processing of the first embodiment is given the same reference numerals, and detailed description is omitted.

Through steps 120 to 128, the English article consisting of L sentences (e_1, e_2, ..., e_L) and the Chinese article consisting of M sentences (c_1, c_2, ..., c_M) are read, and each sentence e_i of the English article is given a flag indicating whether it contains new information not included in the Chinese article.

Next, in step 200, features for use in training the classifier are extracted for each sentence e_i of the English article.

Next, in step 202, the features of each sentence e_i of the English article extracted in step 200 and the flag of each sentence e_i assigned in step 128 are supplied to the SVM, and a classifier is trained that, given the features of a sentence e_i of the English article as input, returns a result (flag) identifying whether e_i contains new information.

Next, in step 204, the features of each sentence e_i of the English article extracted in step 200 are input into the classifier trained in step 202, and, based on the identification result, each sentence e_i is given either the flag "+1", indicating a sentence containing new information, or the flag "-1", indicating that no new information is contained. For sentences already flagged in step 128, the flag is updated according to the classifier's identification result obtained in this step, and the routine returns.

As described above, the text processing apparatus of the second embodiment trains a classifier using the features extracted from each sentence of the first text (the English article) and the flags indicating whether new information is contained, and updates the flags using the identification results of the trained classifier. This improves the accuracy of extracting sentences containing new information.

The first and second embodiments have been described for the case in which an article written in English and an article written in Chinese are used as the articles written in two different languages, but the invention is also applicable to articles written in other languages.

Next, the following experimental results are described to illustrate the effects of the above embodiments.

(Experiment 1) Evaluation of the extraction of sentences containing new information
Nine English articles (A to I) were collected from Wikipedia, together with the corresponding Chinese articles. The English articles were ones selected by Wikipedia editors as thoroughly revised, mature, and comprehensive in content.

Next, the English and Chinese versions of each article were compared manually, and the sentences containing new information were identified in the English articles. In addition, in the Chinese versions, flags were assigned to the sentences stating the same information as a sentence of the English article.

Next, the following two methods were compared in addition to the first and second embodiments.

Correct flags: a method that, as in the second embodiment, trains a classifier and extracts sentences containing new information according to the classifier's identification results, but in which the SVM is trained only on data carrying the correct flags indicating whether new information is contained
Random: flags indicating whether new information is contained are assigned to the English sentences at random
The performance of each method was evaluated by AUC (area under the precision-recall curve). This evaluation method does not require a similarity threshold to be specified, and in general a higher AUC value at the same recall level is taken to indicate better accuracy. The evaluation results are shown in Table 1 below.
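
One common way to approximate the area under a precision-recall curve is average precision, sketched below. The text does not say how the area was computed in the experiment, so this step-wise approximation, the function name, and the sample scores are assumptions. Scores here are the negated maximum similarity, so that a higher score means a sentence is more likely to contain new information.

```python
def average_precision(scores, is_new):
    """Step-wise area under the precision-recall curve.

    scores: higher means more likely to contain new information.
    is_new: gold labels (True if the sentence really contains new information).
    Tied scores are not given special treatment in this sketch.
    """
    ranked = sorted(zip(scores, is_new), key=lambda pair: pair[0], reverse=True)
    total_positive = sum(1 for _, label in ranked if label)
    if total_positive == 0:
        return 0.0
    hits = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            # Precision at this rank, weighted by the recall gained (1 / total_positive).
            area += (hits / rank) / total_positive
    return area


# Hypothetical example: four sentences, two of which contain new information.
ap = average_precision([-0.1, -0.3, -0.5, -0.9], [True, False, True, False])
```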

The first and second embodiments achieved high AUC values of 70 to 95 in most cases. Overall, the second embodiment achieved higher AUC values than the first. These results show that new information can be obtained almost entirely automatically.

(Experiment 2) Evaluation of the determination of insertion positions for sentences containing new information
Next, an experiment was conducted to evaluate the method of determining the most suitable insertion position by label propagation. For each article, English sentences that match a sentence in the Chinese version were first selected at random. The evaluation relies on the fact that the correct position in the Chinese article is known for these English sentences.

A method that uses manual alignment served as the baseline. First, it was assumed that some sentences in the English article had been manually aligned with sentences in the Chinese article, so that for a given English sentence the corresponding Chinese sentence containing the same information was known. If e_i is an English sentence containing new information, the most suitable insertion position is considered to be after the Chinese sentence c_j that matches e_{i-1}. If no Chinese sentence matched e_{i-1}, then e_{i-2}, e_{i-3}, and so on were examined. This method requires manual work and is not fully automatic. It is called the manual alignment-based method.
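
The baseline's back-off rule can be sketched as follows. The function name, the dictionary representation of the manual alignment, and the choice to return position 0 when no earlier English sentence is aligned are assumptions made for illustration; the text does not say what happens in that last case.

```python
def manual_alignment_position(i, alignment):
    """Insertion slot in the Chinese article for English sentence i.

    alignment maps the index of an English sentence to the index of the
    Chinese sentence it was manually matched with. The nearest preceding
    aligned English sentence is searched for (e_{i-1}, then e_{i-2}, ...),
    and the new sentence goes right after its Chinese counterpart.
    """
    for k in range(i - 1, -1, -1):
        if k in alignment:
            return alignment[k] + 1
    return 0  # assumed fallback: insert at the beginning of the article


# Hypothetical alignment: e_0 matches c_0 and e_2 matches c_3.
alignment = {0: 0, 2: 3}
positions = [manual_alignment_position(i, alignment) for i in (3, 2, 1, 0)]
```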

Two variations of label propagation were used. They differ in how edges are formed when the graph used for label propagation is created.

Paragraph-based edge formation: the same technique as the information insertion position search unit 40 of the first and second embodiments
Section-based edge formation: the same technique as the information insertion position search unit 40 of the first and second embodiments, except that edges are also formed between two sentences (nodes) in the same section (another example of the first and second embodiments)
The following three measures were used to evaluate the performance of each method.

Average distance: the distance between the predicted position and the correct insertion position
Section accuracy: a measure reflecting whether the sentence was inserted into the correct section
Paragraph accuracy: a measure reflecting whether the sentence was inserted into the correct paragraph
The experiment was run with 30% and 50% of the sentences selected from the English articles. The results are shown in Table 2 below.
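
The three measures can be sketched as follows. The exact definitions used in the experiment are not given, so the choices here (distance as the absolute difference of Chinese sentence indices, accuracy as the fraction of test sentences placed in the right section or paragraph) and the function name are assumptions.

```python
def evaluate_positions(predicted, gold, section_of, paragraph_of):
    """Return (average distance, section accuracy, paragraph accuracy).

    predicted / gold: index of the Chinese sentence chosen / correct for each test sentence.
    section_of / paragraph_of: section id and paragraph id of each Chinese sentence.
    """
    n = len(predicted)
    pairs = list(zip(predicted, gold))
    average_distance = sum(abs(p - g) for p, g in pairs) / n
    section_accuracy = sum(section_of[p] == section_of[g] for p, g in pairs) / n
    paragraph_accuracy = sum(paragraph_of[p] == paragraph_of[g] for p, g in pairs) / n
    return average_distance, section_accuracy, paragraph_accuracy


# Hypothetical Chinese article with 6 sentences, 2 sections, and 4 paragraphs.
section_of = [0, 0, 0, 1, 1, 1]
paragraph_of = [0, 0, 1, 2, 2, 3]
avg_dist, sec_acc, par_acc = evaluate_positions([0, 3, 5], [1, 3, 2], section_of, paragraph_of)
```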

The techniques of the first and second embodiments (paragraph-based edge formation and section-based edge formation) were confirmed to outperform the manual alignment-based method.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

Although this specification describes embodiments in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

Description of symbols
10, 210 Text processing apparatus
12 Data reading unit
20, 220 New information identification unit
22 Preprocessing unit
24 Machine translation unit
26 Similarity calculation unit
28, 228 Flag assigning unit
30 Feature extraction unit
32 Classifier learning unit
34 Flag update unit
40 Information insertion position search unit

Claims

A sentence processing apparatus comprising:
translation means for obtaining one or both of a third text, obtained by translating a first text including one or more sentences written in a first language, and a fourth text, obtained by translating a second text including one or more sentences written in a second language different from the first language, so that two texts written in the same language are obtained;
similarity calculation means for calculating, when the third text is obtained by the translation means, the similarity between each sentence of the third text and each sentence of the second text or of the fourth text obtained by the translation means, for all combinations of those sentences, and for calculating, when the third text is not obtained by the translation means, the similarity between each sentence of the first text and each sentence of the fourth text, for all combinations of those sentences; and
new information extraction means for, when the third text is obtained by the translation means, obtaining for each sentence of the third text a maximum similarity, which is the largest of the similarities calculated by the similarity calculation means, and extracting the sentence of the first text corresponding to each sentence of the third text whose maximum similarity is equal to or less than a predetermined threshold as a sentence containing new information not included in the second text, and for, when the third text is not obtained by the translation means, obtaining the maximum similarity for each sentence of the first text and extracting each sentence of the first text whose maximum similarity is equal to or less than the predetermined threshold as a sentence containing new information not included in the second text.
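The extraction step recited above (compute all pairwise similarities, take the maximum per sentence, keep the sentences at or below a threshold) can be sketched as follows. This is only an illustration under stated assumptions: the claim does not fix a similarity measure, so a token-overlap Jaccard score is swapped in as a stand-in, translation is assumed to have been done beforehand, and the names `jaccard` and `extract_new_sentences` are hypothetical.

```python
def jaccard(a, b):
    """Stand-in similarity (an assumption): overlap of whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def extract_new_sentences(first_text, comparable_first, other_text,
                          threshold, similarity=jaccard):
    """Return the sentences of first_text judged to carry new information.

    first_text: original first-language sentences.
    comparable_first: the same sentences in the comparison language
        (their translations, or first_text itself if it was not translated).
    other_text: second-language sentences in the comparison language.
    """
    if len(first_text) != len(comparable_first):
        raise ValueError("first_text and comparable_first must align one-to-one")
    new_sentences = []
    for original, comparable in zip(first_text, comparable_first):
        # Maximum similarity over all sentences of the other text.
        max_similarity = max(
            (similarity(comparable, s) for s in other_text), default=0.0
        )
        # A weak best match suggests the other text lacks this information.
        if max_similarity <= threshold:
            new_sentences.append(original)
    return new_sentences
```

Passing the untranslated sentences as `comparable_first` covers the case in which only the second-language text is translated.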

The sentence processing apparatus according to claim 1, further comprising:
feature extraction means for extracting features from each sentence of the first text;
classifier learning means for learning, using the features of each sentence of the first text extracted by the feature extraction means and the extraction result obtained for each sentence of the first text by the new information extraction means, a classifier that identifies, based on the features of a sentence, whether that sentence of the first text contains new information; and
re-extraction means for re-extracting sentences containing the new information based on an identification result obtained by inputting each sentence of the first text to the classifier learned by the classifier learning means.

The sentence processing apparatus according to claim 1, further comprising determination means for: generating a graph having nodes representing each sentence of the first text and of the second text, and edges weighted according to the similarity between the nodes and the relationship between the nodes; giving the node corresponding to each sentence of the first text a label based on whether or not that sentence contains the new information; and determining, based on the labels given by a label propagation method using the graph to the nodes corresponding to the sentences of the second text, the position before or after the sentence of the second text corresponding to the node whose label is closest to the label given to the node corresponding to the sentence containing the new information, as the position at which the sentence containing the new information is to be inserted.
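The graph-based position search recited above can be illustrated with a small sketch. It assumes one common form of label propagation (repeated weighted averaging over neighbors, with the labeled nodes held fixed); the claim does not specify the update rule, so this choice, the dictionary-based graph encoding, and the names `propagate_labels` and `choose_insertion_anchor` are all assumptions. Following the abstract, the node for the sentence with new information is labeled 1 and the other first-language nodes 0, so the second-language node with the highest propagated score is the one whose label is closest to 1.

```python
def propagate_labels(weights, seed_labels, iterations=100):
    """Propagate labels over a weighted undirected graph.

    weights: dict mapping each node to {neighbor: edge weight}; every
        neighbor must itself be a key of weights.
    seed_labels: dict of labeled nodes and their fixed (clamped) values.
    Returns a dict with a score for every node.
    """
    scores = {node: seed_labels.get(node, 0.0) for node in weights}
    for _ in range(iterations):
        updated = {}
        for node, neighbors in weights.items():
            if node in seed_labels:
                updated[node] = seed_labels[node]  # labeled nodes stay fixed
                continue
            total = sum(neighbors.values())
            if total > 0:
                # Weighted average of the neighbors' current scores.
                updated[node] = sum(
                    w * scores[n] for n, w in neighbors.items()
                ) / total
            else:
                updated[node] = 0.0
        scores = updated
    return scores


def choose_insertion_anchor(weights, seed_labels, target_nodes):
    """Pick the second-language node whose propagated score is highest."""
    scores = propagate_labels(weights, seed_labels)
    return max(target_nodes, key=lambda node: scores[node])
```

The new sentence would then be inserted immediately before or after the sentence that the returned node represents.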

A sentence processing method comprising:
obtaining one or both of a third text, obtained by translating a first text including one or more sentences written in a first language, and a fourth text, obtained by translating a second text including one or more sentences written in a second language different from the first language, so that two texts written in the same language are obtained;
when the third text is obtained, calculating the similarity between each sentence of the third text and each sentence of the second text or of the fourth text, for all combinations of those sentences, and, when the third text is not obtained, calculating the similarity between each sentence of the first text and each sentence of the fourth text, for all combinations of those sentences; and
when the third text is obtained, obtaining for each sentence of the third text a maximum similarity, which is the largest of the calculated similarities, and extracting the sentence of the first text corresponding to each sentence of the third text whose maximum similarity is equal to or less than a predetermined threshold as a sentence containing new information not included in the second text, and, when the third text is not obtained, obtaining the maximum similarity for each sentence of the first text and extracting each sentence of the first text whose maximum similarity is equal to or less than the predetermined threshold as a sentence containing new information not included in the second text.

The sentence processing method according to claim 4, further comprising:
extracting features from each sentence of the first text;
learning, using the extracted features of each sentence of the first text and an extraction result indicating whether each sentence of the first text contains new information, a classifier that identifies, based on the features of a sentence, whether that sentence of the first text contains new information; and
re-extracting sentences containing the new information based on an identification result obtained by inputting each sentence of the first text to the learned classifier.

The sentence processing method according to claim 4, further comprising: generating a graph having nodes representing each sentence of the first text and of the second text, and edges weighted according to the similarity between the nodes and the relationship between the nodes; giving the node corresponding to each sentence of the first text a label based on whether or not that sentence contains the new information; and determining, based on the labels given by a label propagation method using the graph to the nodes corresponding to the sentences of the second text, the position before or after the sentence of the second text corresponding to the node whose label is closest to the label given to the node corresponding to the sentence containing the new information, as the position at which the sentence containing the new information is to be inserted.

A sentence processing program for causing a computer to function as each means constituting the sentence processing apparatus according to any one of claims 1 to 3.