JP2015088064A - Text summarization device, text summarization method, and program

Info

Publication number: JP2015088064A
Application number: JP2013227578A
Authority: JP
Inventors: Katsuto Bessho (別所克人); Hitoshi Nishikawa (西川仁); Toshiaki Makino (牧野俊朗); Yoshihiro Matsuo (松尾義博)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-10-31
Filing date: 2013-10-31
Publication date: 2015-05-07

Abstract

PROBLEM TO BE SOLVED: To output a summary that follows the meaning of an entire text.
SOLUTION: Analysis means 22 performs sentence segmentation and morphological analysis on an input text. Input-text-concept-vector generation means 24 generates an input-text concept vector representing the meaning of the input text by referring to a word concept base 20. Partial-sentence extraction means 26 extracts, from the input text, a set of partial sentences, each of which is a set of clauses in a sentence. Partial-sentence ranking means 28 generates, for each partial sentence extracted by the partial-sentence extraction means 26, a partial-sentence concept vector representing its meaning by referring to the word concept base 20, calculates the similarity between the input-text concept vector and the partial-sentence concept vector, and outputs partial sentences with high similarity as a summary.

Description

The present invention relates to a text summarization device, method, and program that output a summary sentence accurately representing the content of an entire text.

One text summarization technique is described in Non-Patent Document 1. That technique assigns an importance score to each sentence in the text and presents a group of sentences chosen so that the total importance is as high as possible, as many different words as possible appear, and readability is as high as possible. Here, the importance of a sentence is the sum of the importance of its constituent words, and the importance of a constituent word is assumed to be its occurrence frequency or IDF.

Non-Patent Document 1: Hitoshi Nishikawa, Takaaki Hasegawa, Yoshihiro Matsuo, Genichiro Kikui, 「文の選択と順序付けを同時に行う評価文書要約モデル」 ("A summarization model for evaluative documents that selects and orders sentences simultaneously"), 人工知能学会論文誌 (Journal of the Japanese Society for Artificial Intelligence), Vol. 28, No. 1, H, pp. 88-99, 2013.

The first problem with the conventional technique is as follows. Suppose, for example, that the text to be summarized has the content shown in FIG. 10(A). With the conventional technique, the first sentence, "Mr. A grilled saury in a tent while singing.", has the largest number of constituent words and therefore the highest importance, so the first sentence is output as the summary.

However, the second sentence states that sardines were boiled and the third states that fish was cooked, so the text as a whole is about cooking fish. The output sentence, by contrast, states that saury, which is only one kind of fish, was grilled, and so deviates from the meaning of the text as a whole.

Thus, because the conventional technique computes the importance of a sentence as the sum of the importance of its constituent words, the extracted sentence does not necessarily follow the meaning of the entire text to be summarized and may deviate from it.
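The scoring rule criticized above can be illustrated with a minimal sketch. It assumes that word importance is plain occurrence frequency (the source also allows IDF) and ignores the diversity and readability terms of Non-Patent Document 1. The function name and the token lists are hypothetical, chosen only to resemble the FIG. 10(A) example.

```python
from collections import Counter


def sentence_importance(sentences):
    """Return one score per sentence: the sum of its words' importance.

    Word importance is plain occurrence frequency over the whole text
    (an assumption for this sketch; IDF could be substituted).
    """
    freq = Counter(word for sentence in sentences for word in sentence)
    return [sum(freq[word] for word in sentence) for sentence in sentences]


# Hypothetical content-word lists for a three-sentence text in the
# spirit of FIG. 10(A); the real tokenization is not given in the source.
sentences = [
    ["Mr. A", "tent", "saury", "sing", "grill"],
    ["sardine", "boil"],
    ["fish", "cook"],
]
scores = sentence_importance(sentences)
best = max(range(len(sentences)), key=lambda i: scores[i])
print(scores, best)
```

Because every word occurs once, the score reduces to sentence length, so the longest sentence (index 0) is selected even though the third sentence describes the text as a whole.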

A first object of the present invention is to output, from a text to be summarized, a summary that follows the meaning of the entire text.

The second problem with the conventional technique is as follows. Suppose, for example, that the text to be summarized has the content shown in FIG. 10(B). With the conventional technique, the first sentence, "Saury was grilled.", has the highest importance, so the first sentence is output as the summary.

However, the second sentence states that sardines were boiled, so the text as a whole is about cooking fish. The output sentence states only that saury, one kind of fish, was grilled, and thus represents only part of the content of the text.

Thus, the conventional technique extracts sentences of high importance, but the extracted sentences represent only part of the content of the text to be summarized and do not cover all of it.

A second object of the present invention is to output, from a text to be summarized, a summary that covers the content of the entire text.

The third problem with the conventional technique is as follows. Suppose, for example, that the text to be summarized has the content shown in FIG. 10(C). With the conventional technique, the first sentence, "Mr. A grilled saury in a tent while singing.", and the second sentence, "Mr. B boiled autumn sardines.", have many constituent words and therefore high importance, so the first and second sentences are output as the summary.

However, the first and second sentences state that fish was cooked, while the third and fourth state that fruit was grown, so the text contains several different topics. The output sentences concern only the topic of cooking fish and thus represent only some of the topics in the text.

Thus, with the conventional technique, the extracted sentences may fail to cover all of the topics in the text to be summarized.

A third object of the present invention is to output, from a text to be summarized, a summary that covers all of the topics in the text.

To achieve the above objects, a text summarization device according to a first invention comprises: analysis means that performs sentence segmentation and morphological analysis on an input text; a word concept base that stores a set of pairs, each consisting of a word and a word concept vector representing the meaning of that word; input text concept vector generation means that generates an input text concept vector representing the meaning of the input text by referring to the word concept base; partial sentence extraction means that extracts, from the input text, a set of partial sentences, each of which is a set of clauses in a sentence; and partial sentence ranking means that, for each partial sentence extracted by the partial sentence extraction means, generates a partial sentence concept vector representing the meaning of the partial sentence by referring to the word concept base, calculates the similarity between the input text concept vector and the partial sentence concept vector, and outputs partial sentences with high similarity as summary sentences.

A text summarization method according to a second invention is a method in a text summarization device that includes analysis means, a word concept base storing a set of pairs each consisting of a word and a word concept vector representing the meaning of that word, input text concept vector generation means, partial sentence extraction means, and partial sentence ranking means. The method comprises: a step in which the analysis means performs sentence segmentation and morphological analysis on an input text; a step in which the input text concept vector generation means generates an input text concept vector representing the meaning of the input text by referring to the word concept base; a step in which the partial sentence extraction means extracts, from the input text, a set of partial sentences, each of which is a set of clauses in a sentence; and a step in which the partial sentence ranking means, for each partial sentence extracted by the partial sentence extraction means, generates a partial sentence concept vector representing the meaning of the partial sentence by referring to the word concept base, calculates the similarity between the input text concept vector and the partial sentence concept vector, and outputs partial sentences with high similarity as summary sentences.

A text summarization device according to a third invention comprises: analysis means that performs sentence segmentation and morphological analysis on an input text; a word concept base that stores a set of pairs, each consisting of a word and a word concept vector representing the meaning of that word; input text concept vector generation means that generates an input text concept vector representing the meaning of the input text by referring to the word concept base; independent part clustering means that clusters the set of independent parts obtained from the clauses in the input text; superordinate concept word acquisition means that, for each independent part cluster obtained by the independent part clustering means, acquires a word that is a superordinate concept of the independent parts belonging to that cluster; partial sentence extraction means that extracts, from the input text, a set of partial sentences, each of which is a set of clauses in a sentence; partial sentence conversion means that, for each partial sentence extracted by the partial sentence extraction means, generates a converted partial sentence in which each independent part in the partial sentence is converted to the superordinate concept word, obtained by the superordinate concept word acquisition means, of the independent part cluster to which that independent part belongs; and converted partial sentence ranking means that, for each converted partial sentence obtained by the partial sentence conversion means, generates a converted partial sentence concept vector representing the meaning of the converted partial sentence by referring to the word concept base, calculates the similarity between the input text concept vector and the converted partial sentence concept vector, and outputs converted partial sentences with high similarity as summary sentences.

A text summarization method according to a fourth invention is a method in a text summarization device that includes analysis means, a word concept base storing a set of pairs each consisting of a word and a word concept vector representing the meaning of that word, input text concept vector generation means, independent part clustering means, superordinate concept word acquisition means, partial sentence extraction means, partial sentence conversion means, and converted partial sentence ranking means. The method comprises: a step in which the analysis means performs sentence segmentation and morphological analysis on an input text; a step in which the input text concept vector generation means generates an input text concept vector representing the meaning of the input text by referring to the word concept base; a step in which the independent part clustering means clusters the set of independent parts obtained from the clauses in the input text; a step in which the superordinate concept word acquisition means, for each independent part cluster obtained by the independent part clustering means, acquires a word that is a superordinate concept of the independent parts belonging to that cluster; a step in which the partial sentence extraction means extracts, from the input text, a set of partial sentences, each of which is a set of clauses in a sentence; a step in which the partial sentence conversion means, for each partial sentence extracted by the partial sentence extraction means, generates a converted partial sentence in which each independent part in the partial sentence is converted to the superordinate concept word, obtained by the superordinate concept word acquisition means, of the independent part cluster to which that independent part belongs; and a step in which the converted partial sentence ranking means, for each converted partial sentence obtained by the partial sentence conversion means, generates a converted partial sentence concept vector representing the meaning of the converted partial sentence by referring to the word concept base, calculates the similarity between the input text concept vector and the converted partial sentence concept vector, and outputs converted partial sentences with high similarity as summary sentences.

The first and second inventions may further include partial text clustering means that clusters a set of partial texts obtained from an input document. For each cluster obtained by the partial text clustering means, the set of partial texts constituting the cluster is then treated as the input text, and a summary sentence is output by performing the processing of the analysis means, the input text concept vector generation means, the partial sentence extraction means, and the partial sentence ranking means.

The third and fourth inventions may likewise further include partial text clustering means that clusters a set of partial texts obtained from an input document. For each cluster obtained by the partial text clustering means, the set of partial texts constituting the cluster is then treated as the input text, and a summary sentence is output by performing the processing of the analysis means, the input text concept vector generation means, the independent part clustering means, the superordinate concept word acquisition means, the partial sentence extraction means, the partial sentence conversion means, and the converted partial sentence ranking means.

A program according to the present invention is a program for causing a computer to function as each of the means of the text summarization device according to the present invention.

The text summarization device, method, and program according to the present invention have the effect of being able to output a summary that follows the meaning of the entire text.

FIG. 1 is a diagram for explaining superordinate concept words.
FIG. 2 is a block diagram showing a configuration example of the text summarization device according to the first embodiment of the present invention.
FIG. 3 is a diagram showing an example of the set stored in the word concept base.
FIG. 4 is a diagram for explaining the dependency relations in a sentence.
FIG. 5 is a flowchart showing the text summarization processing routine of the text summarization device according to the first embodiment of the present invention.
FIG. 6 is a block diagram showing a configuration example of the text summarization device according to the second embodiment of the present invention.
FIG. 7 is a flowchart showing the text summarization processing routine of the text summarization device according to the second embodiment of the present invention.
FIG. 8 is a block diagram showing a configuration example of the text summarization device according to the third embodiment of the present invention.
FIG. 9 is a flowchart showing the text summarization processing routine of the text summarization device according to the third embodiment of the present invention.
FIG. 10 is a diagram for explaining the prior art.

<Summary of the Invention>
The configuration of the present invention for solving the first problem of the conventional technique provides the following effect.

A partial sentence output by the present invention has a partial sentence concept vector, representing its meaning, that is highly similar to the input text concept vector representing the meaning of the input text, so the meaning of the partial sentence is close to that of the input text. The output partial sentence is therefore a summary that follows the meaning of the entire input text.

Taking the text to be summarized in FIG. 10(A) as an example, the partial sentence "cooked fish" in the third sentence is output because its concept vector has the highest similarity to the concept vector of the text. This partial sentence is a summary that follows the meaning of the entire text.
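The ranking step can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the concept vector of a text or partial sentence is taken to be the sum of its word concept vectors, the similarity is cosine similarity, and the four-dimensional vectors in `WORD_CONCEPT_BASE` are invented toy values. All names are hypothetical.

```python
import math

# Hypothetical word concept base: word -> toy 4-D concept vector.
# Dimensions loosely mean (saury-like, sardine-like, grilling, boiling).
WORD_CONCEPT_BASE = {
    "saury":   [1.0, 0.0, 0.0, 0.0],
    "sardine": [0.0, 1.0, 0.0, 0.0],
    "fish":    [1.0, 1.0, 0.0, 0.0],
    "grill":   [0.0, 0.0, 1.0, 0.0],
    "boil":    [0.0, 0.0, 0.0, 1.0],
    "cook":    [0.0, 0.0, 1.0, 1.0],
}
DIMS = 4


def concept_vector(words):
    """Sum the concept vectors of the words (assumed composition rule)."""
    vec = [0.0] * DIMS
    for word in words:
        for i, x in enumerate(WORD_CONCEPT_BASE.get(word, [0.0] * DIMS)):
            vec[i] += x
    return vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_partial_sentences(text_words, partial_sentences):
    """Return (similarity, partial_sentence) pairs, most similar first."""
    text_vec = concept_vector(text_words)
    scored = [(cosine(text_vec, concept_vector(p)), p) for p in partial_sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


partials = [["saury", "grill"], ["sardine", "boil"], ["fish", "cook"]]
text_words = [word for partial in partials for word in partial]
ranking = rank_partial_sentences(text_words, partials)
print(ranking[0])
```

With these toy vectors the partial sentence `["fish", "cook"]` points in the same direction as the whole-text vector, so it ranks first, mirroring the "cooked fish" outcome described above.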

The effect provided by the configuration of the present invention for solving the second problem of the conventional technique is described next.

The underlying idea that led to this configuration is described first.

Taking the text to be summarized in FIG. 10(B) as an example, the words "saury" and "sardine" in the text are semantically close, and the set of these words can be associated with the superordinate concept word "fish". Using the superordinate concept word "fish" in the summary sentence gives the summary higher coverage of the content of the entire text than using the subordinate concept words "saury" or "sardine". Similarly, the words "grilled" and "boiled" in the text are semantically close, and the set of these words can be associated with the superordinate concept word "cooked". Using "cooked" in the summary sentence gives higher coverage of the content of the entire text than using the subordinate concept words "grilled" or "boiled".

Accordingly, if the summary sentence "cooked fish" is generated using the superordinate concept words of the word groups in the text to be summarized, that summary sentence covers the content of the entire text.
The effects are described below in line with the configuration of the present invention.

The independent part clustering means clusters the independent parts in the text to be summarized, and the superordinate concept word acquisition means associates each independent part cluster with a word corresponding to the superordinate concept of the independent parts belonging to it. In FIG. 1(A), clustering the independent parts in the text to be summarized and assigning superordinate concept words gives the following result: the cluster consisting of "saury", "sardine", and "mackerel" is associated with the superordinate concept word "fish"; the cluster consisting of "grilled" and "boiled" with "cooked"; the cluster consisting of "house" (家), "dwelling" (住まい), and "one's own home" (自宅) with "home" (家庭); and the cluster consisting of "classical", "jazz", and "pop" with "music".

Next, consider generating a sentence by joining the superordinate concept words of the independent part clusters with attached words (function words) and using it as the summary sentence. This is achieved by the partial sentence conversion means, which converts each partial sentence in the text to be summarized by replacing each independent part in it with the superordinate concept word of the independent part cluster to which that independent part belongs. For example, for the partial sentence 「家でさんまを焼いた」 ("grilled saury at the house") in the text to be summarized, the independent parts 「家」 ("house"), 「さんま」 ("saury"), and 「焼いた」 ("grilled") are converted to 「家庭」 ("home"), 「魚」 ("fish"), and 「調理した」 ("cooked"), respectively, so that the partial sentence becomes 「家庭で魚を調理した」 ("cooked fish at home").
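The conversion just described can be sketched as a lookup from each independent part to the superordinate concept word of its cluster. The sketch assumes that clustering and superordinate concept word acquisition have already been done and that each clause is available as an (independent part, attached word) pair. The function names and data layout are hypothetical; the words are those of the example above.

```python
def build_superordinate_map(clusters):
    """Map every cluster member to its cluster's superordinate concept word.

    `clusters` is a list of (members, superordinate_word) pairs.
    """
    return {member: word for members, word in clusters for member in members}


def convert_partial_sentence(clauses, superordinate_map):
    """Replace each independent part; attached words are left untouched.

    `clauses` is a list of (independent_part, attached_word) pairs.
    Independent parts that belong to no cluster are kept as they are.
    """
    return [(superordinate_map.get(ind, ind), att) for ind, att in clauses]


def to_text(clauses):
    """Join the clauses back into a surface string."""
    return "".join(ind + att for ind, att in clauses)


# Clusters as in the FIG. 1(A) description (only those needed here).
clusters = [
    (["さんま", "いわし", "さば"], "魚"),
    (["焼いた", "煮た"], "調理した"),
    (["家", "住まい", "自宅"], "家庭"),
]
partial = [("家", "で"), ("さんま", "を"), ("焼いた", "")]
converted = convert_partial_sentence(partial, build_superordinate_map(clusters))
print(to_text(converted))
```

Running this prints 家庭で魚を調理した, the converted partial sentence given in the example.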

The converted partial sentence ranking means then selects, from among the converted partial sentences, those whose partial sentence concept vector, representing their meaning, is highly similar to the input text concept vector representing the meaning of the input text. A partial sentence selected in this way is a summary that both follows the meaning of the entire text and covers its content.

In this way the configuration of the present invention solves the second problem, but the following modification of the configuration is also conceivable.

In some independent part clusters obtained by the independent part clustering means, the set of member independent parts is a set of synonyms (e.g., {さんま, さんま, 秋刀魚, サンマ}, all of which mean "saury"). If, for such a cluster, the superordinate concept word acquisition means acquires the superordinate concept word of its members (e.g., "fish") and the partial sentence conversion means converts them to that word, the resulting partial sentence does still cover the content of the entire text to be summarized.

However, using the original independent part in the summary sentence is more appropriate, because it keeps the same coverage of the content while giving more detailed information. The configuration can therefore be such that, for an independent part cluster whose members form a set of synonyms, the superordinate concept word acquisition means does not acquire a superordinate concept word, and the partial sentence conversion means does not convert an independent part in a partial sentence to a superordinate concept word when the cluster it belongs to is a set of synonyms. With this configuration, a summary sentence can be output that covers the content of the entire text to be summarized while keeping the information as detailed as possible.

Taking the text to be summarized in FIG. 1(B) as an example, for the independent part cluster {さんま, さんま} ("saury") obtained by the independent part clustering means, the superordinate concept word acquisition means acquires no superordinate concept word because the cluster is a set of synonyms, whereas for the independent part cluster {grilled, boiled} it acquires the superordinate concept word "cooked". For the partial sentence "grilled saury", the partial sentence conversion means leaves the independent part "saury" unconverted because its cluster is a set of synonyms, and converts the independent part "grilled" to "cooked", the superordinate concept word of its cluster. This yields the converted partial sentence "cooked saury", which the converted partial sentence ranking means outputs as the summary sentence. This converted partial sentence covers the content of the entire text to be summarized in FIG. 1(B) while keeping the information as detailed as possible.
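The synonym exception can be sketched as a filter applied when the replacement map is built. The source does not say here how a cluster is recognized as a set of synonyms, so the sketch assumes a hypothetical dictionary `CANONICAL` that maps spelling variants to one canonical form; a cluster counts as a synonym set when all members share the same canonical form. All names are illustrative.

```python
# Hypothetical synonym dictionary: spelling variant -> canonical form.
CANONICAL = {"秋刀魚": "さんま", "サンマ": "さんま"}


def is_synonym_set(members):
    """True when every member reduces to the same canonical form."""
    return len({CANONICAL.get(member, member) for member in members}) == 1


def build_superordinate_map(clusters):
    """Map members to superordinate words, skipping synonym-only clusters."""
    mapping = {}
    for members, word in clusters:
        if is_synonym_set(members):
            continue  # keep the original, more detailed wording
        for member in members:
            mapping[member] = word
    return mapping


clusters = [
    (["さんま", "サンマ"], "魚"),       # synonyms only: not generalized
    (["焼いた", "煮た"], "調理した"),   # distinct concepts: generalized
]
mapping = build_superordinate_map(clusters)
partial = [("さんま", "を"), ("焼いた", "")]
converted = "".join(mapping.get(ind, ind) + att for ind, att in partial)
print(converted)
```

This prints さんまを調理した ("cooked saury"). Note that under this test a single-member cluster is also treated as a synonym set, which seems reasonable because generalizing it would not improve coverage.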

The effect provided by the configuration of the present invention for solving the third problem of the conventional technique is described next.

The partial text clustering means clusters the set of partial texts constituting the input document by topic. Each resulting cluster is a set of partial texts on a single topic. Each such set is used as the input either to the processing of the first embodiment described later (the analysis means, input text concept vector generation means, partial sentence extraction means, and partial sentence ranking means) or to the processing of the second embodiment described later (the analysis means, input text concept vector generation means, independent part clustering means, superordinate concept word acquisition means, partial sentence extraction means, partial sentence conversion means, and converted partial sentence ranking means). Because this produces a summary sentence for each topic, a summary covering all of the topics in the original input document can be output.

Taking the text to be summarized in FIG. 10(C) as an example, clustering the set of sentences in the text with the partial text clustering means yields two partial text sets: a cluster about cooking fish, consisting of the first and second sentences, and a cluster about growing fruit, consisting of the third and fourth sentences.

Using each partial text set as input and performing either the processing of the first embodiment (the analysis means, input text concept vector generation means, partial sentence extraction means, and partial sentence ranking means) or the processing of the second embodiment of the present invention (the analysis means, input text concept vector generation means, independent part clustering means, superordinate concept word acquisition means, partial sentence extraction means, partial sentence conversion means, and converted partial sentence ranking means) yields one summary about cooking fish and one about growing fruit.
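The per-topic flow can be sketched as "cluster first, then summarize each cluster". The clustering algorithm is not specified in this part of the source, so the sketch substitutes a simple greedy single-pass clustering on cosine similarity with a hypothetical threshold. The two-dimensional sentence vectors and the `summarize` callback are placeholders for the concept vectors and the first or second embodiment's processing.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_by_topic(vectors, threshold=0.5):
    """Greedy stand-in clustering: join the most similar cluster if its
    summed vector is at least `threshold` similar, else start a new one."""
    clusters, sums = [], []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c, total in enumerate(sums):
            sim = cosine(vec, total)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
            sums.append(list(vec))
        else:
            clusters[best].append(i)
            sums[best] = [a + b for a, b in zip(sums[best], vec)]
    return clusters


def summarize_by_topic(sentences, vectors, summarize, threshold=0.5):
    """Run `summarize` once per topic cluster and collect the results."""
    return [
        summarize([sentences[i] for i in indices])
        for indices in cluster_by_topic(vectors, threshold)
    ]


# Toy vectors: dimension 0 ~ "cooking fish", dimension 1 ~ "growing fruit".
sentences = ["s1", "s2", "s3", "s4"]
vectors = [[1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 0.9]]
clusters = cluster_by_topic(vectors)
summaries = summarize_by_topic(sentences, vectors, lambda group: group[0])
print(clusters, summaries)
```

Here the first two sentences form one cluster and the last two another, so one summary is produced per topic, as in the FIG. 10(C) example. The placeholder `summarize` simply returns the first sentence of each cluster.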

[First Embodiment]
<System Configuration>
A first embodiment of the present invention is described in detail below with reference to the drawings. FIG. 2 is a block diagram showing a text summarization device 100 according to the first embodiment of the present invention. The text summarization device 100 outputs a summary sentence corresponding to an input text. The text summarization device 100 is constituted by a computer having a CPU, a RAM, and a ROM that stores a program for executing a text summarization processing routine described later, and is functionally configured as follows.

As shown in FIG. 2, the text summarization device 100 according to the present embodiment comprises an input means 1, a computing means 2, and an output means 3.

The input means 1 accepts the input text to be summarized.

The computing means 2 comprises a word concept base 20, an analysis means 22, an input text concept vector generation means 24, a partial sentence extraction means 26, and a partial sentence ranking means 28.

The word concept base 20 stores a set of pairs, each consisting of a word (a content word such as a noun, verb, or adjective) and a word concept vector representing the meaning of that word. FIG. 3 shows an example of the set stored in the word concept base 20. The word concept base 20 is generated, for example, by the method of the following reference.

Reference: Katsuto Bessho, Toshio Uchiyama, Tadasu Uchiyama, Ryoji Kataoka, and Masahiro Oku, "Corpus concept base generation method based on co-occurrence between words and semantic attributes" (in Japanese), IPSJ Journal, Vol. 49, No. 12, pp. 3997-4006, Dec. 2008.

Each word is registered in its dictionary form (shūshikei), and lookups in the word concept base 20 use the dictionary form of the word. The concept vector of each word lies in a D-dimensional space, and the concept vectors of semantically similar words are located close to each other. The similarity between two words is calculated, for example, as the cosine between the corresponding concept vectors.
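The cosine-based word similarity mentioned above can be sketched as follows. This is a minimal illustration, not the device's implementation: the three-dimensional vectors, the English keys, and the handling of zero vectors are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine between two concept vectors of the same dimension D."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical word concept base keyed by dictionary form (toy 3-dimensional vectors).
concept_base = {
    "sardine": [0.9, 0.1, 0.0],
    "mackerel": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.2, 0.9],
}

fish_pair = cosine(concept_base["sardine"], concept_base["mackerel"])
mixed_pair = cosine(concept_base["sardine"], concept_base["apple"])
```

With these toy vectors, the two fish words come out more similar to each other than a fish word and a fruit word.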

The analysis means 22 splits the input text accepted by the input means 1 into sentences and performs morphological analysis on each resulting sentence. When subtrees are used as the partial sentences extracted by the partial sentence extraction means 26, the analysis means 22 may additionally perform dependency analysis.

For example, when the input text is the content of FIG. 10(A), sentence splitting divides it into sentences delimited by full stops. Morphological analysis and dependency analysis of the second sentence of FIG. 10(A), 「Ｂ氏が秋のいわしを煮た」 ("Mr. B boiled autumn sardines"), yield a sequence of morphemes, a sequence of clauses (bunsetsu), and the dependency relations between clauses, as shown in FIG. 4. Each morpheme consists of a surface form and accompanying information such as its part of speech and dictionary form (for a non-conjugating word, the surface form is regarded as the dictionary form). Each clause consists of one independent part and zero or one attached part. Each independent part and each attached part consists of one or more morphemes. An independent part contains at least one independent word, and an attached part consists of one or more attached words.

Whether two independent parts are the same may be determined from the character string obtained by concatenating the surface forms of all morphemes of the independent part except the last with the dictionary form of the last morpheme.
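The identification string described above can be sketched as follows. The representation of a morpheme as a (surface form, dictionary form) pair and the sample morphemes are assumptions for illustration; a real morphological analyzer would supply this information.

```python
def independent_part_key(morphemes):
    """Build the identification string of an independent part.

    morphemes: list of (surface_form, dictionary_form) pairs, in order.
    All morphemes except the last contribute their surface form;
    the last morpheme contributes its dictionary form.
    """
    if not morphemes:
        return ""
    leading = "".join(surface for surface, _ in morphemes[:-1])
    return leading + morphemes[-1][1]

# Hypothetical analysis results: the same verb in two conjugated forms.
key_past = independent_part_key([("煮", "煮る")])
key_negative_stem = independent_part_key([("煮", "煮る")])
# Hypothetical compound independent part made of two morphemes.
key_compound = independent_part_key([("取り", "取る"), ("出し", "出す")])
```

Under this sketch, differently conjugated occurrences of the same independent part map to one key.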

In the present embodiment, the surface form of an independent part means the character string formed from the surface forms of all its constituent morphemes, and the dictionary form of an independent part means the character string in which only the last morpheme is replaced by its dictionary form. When two clauses are in a dependency relation, a dependency link is placed between them.

As in FIG. 4, the clause 「いわしを」 ("sardines") depends on the clause 「煮た」 ("boiled"), so a dependency link is placed with the clause 「いわしを」 as the link source and the clause 「煮た」 as the link destination.

The input text concept vector generation means 24 generates an input text concept vector representing the meaning of the input text by referring to the word concept base 20. One example of the generation method is as follows. For each document constituting the input text, the concept vectors of the content words in the independent parts of that document are obtained from the word concept base 20, and the sum of these word concept vectors, normalized to length 1, is taken as the concept vector of the document. The sum of the document concept vectors, normalized to length 1, is taken as the concept vector of the input text.
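The two-stage sum-and-normalize procedure above can be sketched as follows. The function names are hypothetical, each document is assumed to be already reduced to a list of content words in dictionary form, and skipping words that are absent from the concept base is an assumption of this sketch.

```python
import math

def normalize(vec):
    """Scale a vector to length 1 (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)

def sum_vectors(vectors, dim):
    """Component-wise sum of a list of dim-dimensional vectors."""
    total = [0.0] * dim
    for vec in vectors:
        for i, x in enumerate(vec):
            total[i] += x
    return total

def document_vector(content_words, concept_base, dim):
    """Normalized sum of the concept vectors of a document's content words."""
    # Assumption: words missing from the concept base are ignored.
    found = [concept_base[w] for w in content_words if w in concept_base]
    return normalize(sum_vectors(found, dim))

def input_text_vector(documents, concept_base, dim):
    """Normalized sum of the document concept vectors."""
    doc_vectors = [document_vector(d, concept_base, dim) for d in documents]
    return normalize(sum_vectors(doc_vectors, dim))

# Toy 2-dimensional concept base (hypothetical values).
toy_base = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
text_vec = input_text_vector([["a", "b"], ["a"]], toy_base, 2)
```

Because each document vector is normalized before the final sum, every document contributes equally to the input text vector regardless of its length.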

The partial sentence extraction means 26 extracts from the input text a set of partial sentences, each of which is a set of clauses within a sentence.
One example of the set of partial sentences taken from each sentence is the set consisting only of the sentence itself. In this case, the unit of the summary is a sentence of the input text.
Another example is the set of subtrees, that is, sets of clauses connected by dependency relations, obtained from the dependency analysis performed by the analysis means 22. In the dependency analysis result of a sentence, a subtree is the set of clauses of a connected graph formed by starting from a clause as the base point and following dependency links toward the dependency source any number of times. With the clause 「煮た」 ("boiled") of FIG. 4 as the base point, the subtrees are 「煮た」 ("boiled"), 「いわしを煮た」 ("boiled sardines"), 「秋のいわしを煮た」 ("boiled autumn sardines"), 「Ｂ氏が煮た」 ("Mr. B boiled"), 「Ｂ氏がいわしを煮た」 ("Mr. B boiled sardines"), and 「Ｂ氏が秋のいわしを煮た」 ("Mr. B boiled autumn sardines"). With the clause 「いわし」 ("sardines") as the base point, the subtrees are 「いわし」 ("sardines") and 「秋のいわし」 ("autumn sardines").
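The subtree enumeration described above can be sketched as a recursion over the dependency tree. The data layout (a `children` map from a clause index to the indices of the clauses that depend on it) and the function names are assumptions for illustration; the clause strings are those of the FIG. 4 example and include their attached parts.

```python
from itertools import product

def subtrees(base, children):
    """Enumerate all connected clause sets that contain `base` and are reached
    by following dependency links toward the dependents of `base`."""
    options_per_child = []
    for child in children.get(base, []):
        # Each dependent is either left out (None) or contributes one of its own subtrees.
        options_per_child.append([None] + subtrees(child, children))
    result = []
    for choice in product(*options_per_child):
        nodes = {base}
        for part in choice:
            if part is not None:
                nodes |= part
        result.append(frozenset(nodes))
    return result

def render(nodes, clauses):
    """Concatenate the selected clauses in their original order."""
    return "".join(clauses[i] for i in sorted(nodes))

# Clauses of the example sentence and their dependency structure:
# clause 0 and clause 2 depend on clause 3; clause 1 depends on clause 2.
clauses = ["Ｂ氏が", "秋の", "いわしを", "煮た"]
children = {3: [0, 2], 2: [1]}

from_boiled = {render(s, clauses) for s in subtrees(3, children)}
from_sardines = {render(s, clauses) for s in subtrees(2, children)}
```

The number of subtrees grows quickly with the number of dependents, which is one reason a restriction such as a fixed clause count may be useful in practice.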

In the processing of the partial sentence extraction means 26, only partial sentences whose number of constituent clauses equals a specified number may be extracted. Alternatively, the language likelihood of each partial sentence may be calculated using, for example, N-grams over morpheme surface forms or parts of speech, and only partial sentences with a high language likelihood may be extracted.

For each partial sentence extracted by the partial sentence extraction means 26, the partial sentence ranking means 28 generates a partial sentence concept vector representing the meaning of the partial sentence by referring to the word concept base 20, calculates the similarity between the input text concept vector and the partial sentence concept vector, and outputs the partial sentences with high similarity as the summary.

One example of the processing method is as follows. For each partial sentence, the concept vectors of the content words in the independent parts of the partial sentence are obtained from the word concept base 20, and the sum of these word concept vectors, normalized to length 1, is taken as the concept vector of the partial sentence.
Another example of the processing method is as follows. For each independent part in each partial sentence, the concept vectors of the content words constituting the independent part are obtained from the word concept base 20, and the sum of these word concept vectors, normalized to length 1, is taken as the concept vector of the independent part. The sum of the concept vectors of the independent parts in the partial sentence, normalized to length 1, is taken as the concept vector of the partial sentence.
The similarity between the input text and the partial sentence is calculated as the inner product of the input text concept vector and the partial sentence concept vector.

Then, either a specified number of partial sentences in descending order of similarity, or the partial sentences whose similarity is at or above a specified threshold, are output as the summary. In doing so, a partial sentence output second or later may be rejected if it is the same character string as a partial sentence already output, shares a clause with one, or appears in the same sentence as one.
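The ranking and selection step can be sketched as follows. The candidate layout (a dict with `text`, `vector`, and `sentence` keys) and the function names are hypothetical. Of the optional rejection rules, this sketch applies two (same character string, same source sentence); that choice is an assumption, not a requirement of the text.

```python
def inner_product(u, v):
    """Inner product of two concept vectors of the same dimension."""
    return sum(a * b for a, b in zip(u, v))

def select_summary(candidates, text_vec, max_count=None, threshold=None):
    """Pick partial sentences in descending order of similarity to the input text.

    candidates: list of dicts with keys 'text', 'vector' (unit length), and
    'sentence' (index of the sentence the partial sentence came from).
    """
    scored = [(inner_product(text_vec, c["vector"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    chosen = []
    for score, cand in scored:
        if max_count is not None and len(chosen) >= max_count:
            break
        if threshold is not None and score < threshold:
            break
        # Reject a candidate that repeats an earlier output or comes from the same sentence.
        redundant = any(
            cand["text"] == c["text"] or cand["sentence"] == c["sentence"]
            for c in chosen
        )
        if not redundant:
            chosen.append(cand)
    return [c["text"] for c in chosen]

# Toy candidates (hypothetical vectors).
toy_candidates = [
    {"text": "A", "vector": [1.0, 0.0], "sentence": 0},
    {"text": "B", "vector": [0.9, 0.1], "sentence": 0},
    {"text": "C", "vector": [0.0, 1.0], "sentence": 1},
]
top_two = select_summary(toy_candidates, [1.0, 0.0], max_count=2)
above_half = select_summary(toy_candidates, [1.0, 0.0], threshold=0.5)
```

Here candidate "B" is skipped in both calls because it comes from the same sentence as the higher-ranked "A".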

The output means 3 outputs the summary sentences produced by the partial sentence ranking means 28.

<Operation of the Text Summarization Device>
Next, the operation of the text summarization device 100 according to the present embodiment is described. When the input text to be summarized is input to the text summarization device 100, the input means 1 accepts it. The text summarization device 100 then executes the text summarization processing routine shown in FIG. 5.

First, in step S100, the analysis means 22 splits the input text accepted by the input means 1 into sentences and performs morphological analysis on each resulting sentence. Dependency analysis is also performed where required.

Next, in step S102, the input text concept vector generation means 24 generates an input text concept vector representing the meaning of the input text by referring to the word concept base 20.

In step S104, the partial sentence extraction means 26 extracts from the input text a set of partial sentences, each of which is a set of clauses within a sentence. Where applicable, a set of subtrees, each a set of clauses connected by dependency relations, is extracted from the result of the dependency analysis in step S100.

In step S106, for each partial sentence extracted in step S104, the partial sentence ranking means 28 generates a partial sentence concept vector representing the meaning of the partial sentence by referring to the word concept base 20, calculates the similarity between the input text concept vector and the partial sentence concept vector, and outputs the partial sentences with high similarity as the summary.

In step S108, the output means 3 outputs the summary sentences produced in step S106 as the result, and the text summarization processing routine ends.

As described above, according to the first embodiment, an input text concept vector representing the meaning of the input text is generated by referring to a word concept base that stores a set of pairs of a word and a word concept vector representing the meaning of that word. A set of partial sentences, each a set of clauses within a sentence, is extracted from the input text. For each extracted partial sentence, a partial sentence concept vector representing its meaning is generated by referring to the word concept base, the similarity between the input text concept vector and the partial sentence concept vector is calculated, and the partial sentences with high similarity are output as the summary.

Because the partial sentence concept vector of each output partial sentence has a high similarity to the input text concept vector, the meaning of the partial sentence is close to the meaning of the input text. The output partial sentences therefore form a summary in line with the meaning of the entire input text.

[Second Embodiment]
Next, a second embodiment is described. The second embodiment differs from the first embodiment in that independent parts in the input text are replaced with superordinate concept words. In the text summarization device 200 according to the second embodiment, components identical to those of the text summarization device 100 according to the first embodiment are given the same reference numerals, and their detailed description is omitted.

<System Configuration>
Like the text summarization device 100 according to the first embodiment, the text summarization device 200 according to the second embodiment is implemented as a computer comprising a CPU, a RAM, and a ROM. Functionally, as shown in FIG. 6, this computer can be represented as a configuration including an input means 1, a computing means 2, and an output means 3.

The computing means 2 comprises a word concept base 20, an analysis means 22, an input text concept vector generation means 24, a thesaurus database 220, a language model database 222, an independent part clustering means 224, a superordinate concept word acquisition means 226, a partial sentence extraction means 26, a partial sentence conversion means 228, and a converted partial sentence ranking means 230.

The thesaurus database 220 stores superordinate and subordinate relations between words in advance.

The language model database 222 stores N-gram data obtained from a language corpus.

The independent part clustering means 224 clusters the set of independent parts obtained from the clauses in the input text. As elements of this set, independent parts with the same dictionary form but different positions in the text may be treated as distinct elements, or independent parts with the same dictionary form may be treated as the same element even when their positions differ.

For example, for each independent part in the input text, the independent part clustering means 224 obtains the concept vectors of the content words constituting the independent part from the word concept base 20 and takes the sum of these word concept vectors, normalized to length 1, as the concept vector of the independent part. The set of independent parts in the input text is clustered by clustering the corresponding set of independent part concept vectors. Clustering is performed, for example, by the k-means method.

Clustering produces a set {C_i} of independent part clusters C_i.
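The k-means clustering of independent part concept vectors can be sketched as follows. This is a plain Lloyd-style iteration written for illustration; the deterministic initialization (first k vectors as initial centroids), the fixed iteration count, and the toy vectors are assumptions of this sketch.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iterations=20):
    """Return a cluster index for each vector."""
    # Assumption: the first k vectors serve as initial centroids (deterministic).
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Toy unit-length concept vectors for four independent parts (hypothetical values).
part_vectors = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [0.6, 0.8]]
labels = kmeans(part_vectors, 2)
```

Each distinct label corresponds to one independent part cluster C_i.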

For each independent part cluster obtained by the independent part clustering means 224, the superordinate concept word acquisition means 226 acquires words that are superordinate concepts of the independent parts belonging to that cluster.

Specifically, for each independent part cluster, the superordinate concept word acquisition means 226 looks up the independent parts belonging to the cluster in the thesaurus 220 and acquires their superordinate concept words. In general several superordinate concept words are acquired, and each can be given a score according to the number of independent parts it corresponds to. Alternatively, a superordinate concept word common to the independent parts of the cluster may be acquired; in this case too, a score can be given to the word according to its distance on the thesaurus from each independent part.
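The count-based scoring of superordinate concept words can be sketched as follows. The thesaurus is modeled as a hypothetical dict from a word to its superordinate words, and using the raw count of covered independent parts as the score is one simple reading of "according to the number of corresponding independent parts".

```python
from collections import Counter

def superordinate_candidates(cluster, hypernyms):
    """Return (superordinate word, score) pairs for one independent part cluster.

    cluster: list of independent parts (dictionary forms).
    hypernyms: hypothetical thesaurus lookup, word -> list of superordinate words.
    The score is the number of cluster members the word is superordinate to.
    """
    counts = Counter()
    for part in cluster:
        for upper in set(hypernyms.get(part, [])):
            counts[upper] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy thesaurus (hypothetical entries).
toy_thesaurus = {
    "sardine": ["fish"],
    "mackerel": ["fish"],
    "salmon": ["fish", "river fish"],
}
fish_cluster_candidates = superordinate_candidates(
    ["sardine", "mackerel", "salmon"], toy_thesaurus
)
```

The resulting list plays the role of the set of (superordinate concept word, score) pairs for one cluster.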

As another way of acquiring superordinate concept words, a method based on arithmetic operations on the word concept vectors in the word concept base 20 is also conceivable. In this case too, there are generally several superordinate concept words, and each acquired word can be given a score.

In this way, for the i-th independent part cluster C_i, a set {(u_ij, s_ij)} of pairs is derived, each consisting of the j-th corresponding superordinate concept word u_ij and its score s_ij.

When an independent part cluster is a set of synonyms, the superordinate concept words of that cluster need not be acquired. One way to judge whether an independent part cluster is a set of synonyms is to refer to information such as the thesaurus 220 or a word dictionary and check whether the member independent words are synonyms of one another.

As another method, in the processing of the independent part clustering means 224, the concept vector of each independent part cluster is obtained (the centroid of the concept vectors of its member independent parts, or that centroid normalized to length 1), and the standard deviation, namely the square root of the mean squared distance between the cluster concept vector and the concept vectors of the members, is calculated. If this standard deviation is at or below a certain threshold, the member concept vectors are very close to each other, so the cluster is judged to be a set of synonyms.
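The spread-based synonym test can be sketched as follows. This sketch uses the unnormalized centroid as the cluster concept vector (one of the two options in the text); the function names, threshold value, and toy vectors are assumptions.

```python
import math

def rms_spread(vectors):
    """Square root of the mean squared distance from the members to their centroid."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    mean_sq = sum(
        sum((x - c) ** 2 for x, c in zip(v, centroid)) for v in vectors
    ) / len(vectors)
    return math.sqrt(mean_sq)

def looks_like_synonym_set(vectors, threshold):
    """Judge a cluster to be a synonym set when its spread is at or below the threshold."""
    return rms_spread(vectors) <= threshold

# Toy clusters (hypothetical vectors): one tight, one widely spread.
tight_cluster = [[1.0, 0.0], [0.99, 0.14]]
wide_cluster = [[1.0, 0.0], [0.0, 1.0]]
tight_is_synonyms = looks_like_synonym_set(tight_cluster, 0.2)
wide_is_synonyms = looks_like_synonym_set(wide_cluster, 0.2)
```

A suitable threshold would need to be tuned for the concept base in use.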

For each partial sentence extracted by the partial sentence extraction means 26, the partial sentence conversion means 228 generates a converted partial sentence in which each independent part of the partial sentence is converted to a superordinate concept word, obtained by the superordinate concept word acquisition means 226, of the independent part cluster to which that independent part belongs. The present embodiment describes the case where the partial sentence extraction means 26 takes a set of subtrees.

Let the sequence of clauses in the target subtree be {(w_k, h_k)}, where w_k and h_k are the independent part and the attached part, respectively, of the k-th clause in the subtree. Let the set of dependency relations be {(k_r0, k_r1)}, where k_r0 and k_r1 are the clause number of the dependency source and the clause number of the dependency destination, respectively, of the r-th dependency relation.

For the independent part w_k, from the superordinate concept words of the independent part cluster to which w_k belongs, the set of pairs of a superordinate concept word and its score is taken, restricted to those words whose surface form can be grammatically connected to the attached part h_k. Whether the connection is grammatically possible can be judged by checking, against the grammar rules used in morphological analysis or the like, whether the part of speech P_0 of the last morpheme of the superordinate concept word can connect to the part of speech P_1 of the first morpheme of the attached part h_k. Alternatively, the connection may be judged possible when the following value, obtained from a language corpus and stored in the language model database 222, is at or above a certain threshold.

(number of clauses in which the part of speech of the last morpheme of the independent part is P_0 and the part of speech of the first morpheme of the attached part is P_1) /
(number of clauses in which the part of speech of the last morpheme of the independent part is P_0)
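The corpus-based connectability ratio above can be sketched as follows. The corpus is modeled as a hypothetical list of clauses, each reduced to the pair (part of speech of the last morpheme of the independent part, part of speech of the first morpheme of the attached part, or None when the clause has no attached part); the part-of-speech labels and the threshold are illustrative.

```python
from collections import Counter

def connection_ratios(clauses):
    """Compute count(P_0 followed by P_1) / count(P_0) for every observed pair."""
    pair_counts = Counter()
    final_counts = Counter()
    for p0, p1 in clauses:
        final_counts[p0] += 1  # denominator: clauses whose independent part ends in P_0
        if p1 is not None:
            pair_counts[(p0, p1)] += 1  # numerator: P_0 directly followed by P_1
    return {pair: n / final_counts[pair[0]] for pair, n in pair_counts.items()}

def connectable(p0, p1, ratios, threshold):
    """Judge P_0 and P_1 connectable when the ratio is at or above the threshold."""
    return ratios.get((p0, p1), 0.0) >= threshold

# Toy corpus statistics (hypothetical clauses).
toy_clauses = [
    ("noun", "particle"),
    ("noun", "particle"),
    ("noun", None),
    ("verb", "auxiliary"),
]
toy_ratios = connection_ratios(toy_clauses)
```

In a real system these ratios would be precomputed from a large corpus and stored, which matches the role the text gives to the language model database 222.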

Here, if an attached part h_k' that has the same function as the attached part h_k but a different surface form can be connected to the surface form of a superordinate concept word, the pair of that superordinate concept word and its score may also be included. Then, if that word is chosen as the superordinate concept word of the independent part w_k in the processing of the partial sentence conversion means 228 described below, h_k is converted to h_k' when the partial sentence is converted.

When the independent part cluster to which w_k belongs is a set of synonyms, the set {(w_k, s_k)} consisting only of the pair of the independent part w_k and its score s_k (set to the maximum value the score can take) may be used instead. In what follows, the same notation as for the candidate set above is used for this set as well, for convenience.

Next, for each independent part w_k, the superordinate concept word is selected from its candidate set so that the following score V of the entire subtree is maximized.

Here, the score of the r-th dependency relation is the score of the dependency from the k_r0-th clause to the k_r1-th clause. This score is the following value, obtained from a language corpus and stored in the language model database 222.

In the above expression, the independent part of the dependency source or destination may instead be replaced by the semantic category to which it belongs (as listed in a word dictionary) before the value is calculated.

The assignment of superordinate concept words that maximizes the score V of the entire subtree can be found efficiently with the Viterbi algorithm.
The language likelihood of the target subtree may also be calculated using, for example, N-grams over morpheme surface forms or parts of speech, and factored into the subtree's score V.
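The maximization can be illustrated with a Viterbi-style dynamic program over the dependency tree. The expression for V is not reproduced in this text, so the sketch below assumes an additive form (the sum of the chosen words' scores plus one score per dependency link); that form, the function names, and the toy scores are all assumptions, not the patent's definition.

```python
def best_assignment(root, children, candidates, edge_score):
    """Choose one word per clause so that the assumed additive score is maximal.

    children: clause index -> list of clause indices that depend on it.
    candidates: clause index -> non-empty list of (word, score) pairs.
    edge_score: function (dependent_word, head_word) -> float.
    Returns (best total score, {clause index: chosen word}).
    """
    memo = {}

    def best(node, word_index):
        # Best score of the subtree under `node`, given the word chosen for `node`.
        key = (node, word_index)
        if key in memo:
            return memo[key]
        word, score = candidates[node][word_index]
        total = score
        picks = {node: word}
        for child in children.get(node, []):
            best_child = None
            for ci, (child_word, _) in enumerate(candidates[child]):
                sub_total, sub_picks = best(child, ci)
                value = sub_total + edge_score(child_word, word)
                if best_child is None or value > best_child[0]:
                    best_child = (value, sub_picks)
            total += best_child[0]
            picks.update(best_child[1])
        memo[key] = (total, picks)
        return memo[key]

    results = [best(root, i) for i in range(len(candidates[root]))]
    return max(results, key=lambda r: r[0])

# Toy example: clause 0 depends on clause 1 (hypothetical words and scores).
toy_children = {1: [0]}
toy_candidates = {0: [("fish", 0.5), ("food", 0.6)], 1: [("cook", 1.0)]}
toy_edges = {("fish", "cook"): 0.9, ("food", "cook"): 0.1}
best_total, best_words = best_assignment(
    1, toy_children, toy_candidates, lambda dep, head: toy_edges.get((dep, head), 0.0)
)
```

Memoizing on (clause, chosen word) keeps the work proportional to the number of dependency links times the square of the candidate count, instead of enumerating every combination.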

In this way, the target subtree is converted into a sequence of clauses whose independent parts have been converted.
When a partial sentence is not a subtree with a dependency structure, the score V is calculated with the terms for the dependency from [Math. 11] to [Math. 12] removed.

For each converted partial sentence obtained by the partial sentence conversion means 228, the converted partial sentence ranking means 230 generates a converted partial sentence concept vector representing the meaning of the converted partial sentence by referring to the word concept base 20, calculates the similarity between the input text concept vector and the converted partial sentence concept vector, and outputs the converted partial sentences with high similarity as the summary. The processing of the converted partial sentence ranking means 230 is the same as that of the partial sentence ranking means 28 of the first embodiment, except that it operates on converted partial sentences.

<Operation of the Text Summarization Device>
Next, the operation of the text summarization device 200 according to the present embodiment is described. When the input text to be summarized is input to the text summarization device 200, the input means 1 accepts it. The text summarization device 200 then executes the text summarization processing routine shown in FIG. 7. Processing identical to that of the first embodiment is given the same reference numerals, and its description is omitted.

In step S200, the independent part clustering means 224 clusters the set of independent parts obtained from the clauses in the input text.

Next, in step S202, for each independent part cluster obtained in step S200, the superordinate concept word acquisition means 226 looks up the independent parts belonging to the cluster in the thesaurus 220 and acquires their superordinate concept words.

In step S204, for each partial sentence extracted in step S104, the partial sentence conversion means 228 generates a converted partial sentence in which each independent part of the partial sentence is converted to a superordinate concept word, obtained in step S202, of the independent part cluster to which that independent part belongs.

The other components and operations of the text summarization device 200 according to the second embodiment are the same as in the first embodiment, and their description is omitted.

As described above, according to the second embodiment, the set of independent parts obtained from the clauses in the input text is clustered; for each resulting independent part cluster, words that are superordinate concepts of its member independent parts are acquired; and converted partial sentences are generated in which each independent part is converted to a superordinate concept word of the cluster to which it belongs. For each converted partial sentence, a converted partial sentence concept vector representing its meaning is generated by referring to the word concept base, the similarity between the input text concept vector and the converted partial sentence concept vector is calculated, and the partial sentences with high similarity are output as the summary. A summary that covers the content of the entire text can thereby be output.

[Third Embodiment]
Next, a third embodiment is described. The third embodiment differs from the first and second embodiments in that a summary sentence is generated for each topic in the input document. In the text summarization device 300 according to the third embodiment, components identical to those of the text summarization device 100 according to the first embodiment are given the same reference numerals, and their detailed description is omitted.

<System Configuration>
Like the text summarization apparatus 100 according to the first embodiment, the text summarization apparatus 300 according to the third embodiment is configured as a computer including a CPU, a RAM, and a ROM. Functionally, as shown in FIG. 8, this computer can be represented as a configuration including an input means 1, a computing means 2, and an output means 3.

The computing means 2 includes a word concept base 20, a partial text clustering means 320, an analysis means 22, an input text concept vector generation means 24, a partial sentence extraction means 26, a partial sentence ranking means 28, and a control means 322.

The partial text clustering means 320 clusters the set of partial texts obtained from the input document.
Specifically, sentence division and morphological analysis are performed on each partial text constituting the input document. Then, for each partial text, a partial text concept vector representing the meaning of that partial text is generated by referring to the word concept base 20.

An example of the generation method is as follows. For each partial text, the concept vectors of the content words constituting the independent parts in the partial text are obtained from the word concept base 20, and the sum of these word concept vectors, normalized to length 1, is used as the concept vector of the partial text.
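The generation method described here (sum the word concept vectors, then normalize to length 1) can be sketched as follows. The toy word concept base, its three-dimensional values, and the function names are assumptions made for illustration; the specification does not say how words missing from the base are handled, so this sketch simply skips them.

```python
import math

# Hypothetical toy word concept base: word -> concept vector (values invented).
WORD_CONCEPT_BASE = {
    "train": [0.9, 0.1, 0.0],
    "station": [0.8, 0.2, 0.0],
    "curry": [0.0, 0.1, 0.9],
}

def normalize(vec):
    """Scale a vector to length 1; return it unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def partial_text_concept_vector(content_words, concept_base):
    """Sum the concept vectors of the known content words and normalize to length 1."""
    dim = len(next(iter(concept_base.values())))
    total = [0.0] * dim
    for word in content_words:
        vec = concept_base.get(word)
        if vec is None:
            continue  # Assumption: words missing from the base are skipped.
        total = [t + v for t, v in zip(total, vec)]
    return normalize(total)

vector = partial_text_concept_vector(["train", "station", "unknown"], WORD_CONCEPT_BASE)
```

Because every partial text vector has length 1, long and short partial texts are compared on direction only, not on the number of content words they contain.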

The set of partial texts in the input document is clustered by clustering the corresponding set of partial text concept vectors. The clustering is performed by, for example, the k-means method.
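As a rough illustration of clustering partial text concept vectors with the k-means method, the following sketch uses plain Python. The seeding rule (the first k vectors), the fixed iteration count, and the toy two-dimensional vectors (not normalized, for readability) are assumptions; the specification only names k-means as one possible method.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns one cluster index per input vector."""
    # Assumption: the first k vectors seed the centroids, for reproducibility.
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # An empty cluster keeps its previous centroid.
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels = kmeans(vectors, k=2)
```

Each resulting label groups partial texts with similar concept vectors, which the embodiment treats as partial texts on the same topic.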

For each cluster obtained by the partial text clustering means 320, the processes of the analysis means 22, the input text concept vector generation means 24, the partial sentence extraction means 26, and the partial sentence ranking means 28 are performed with the set of partial texts constituting that cluster as the input.

The control means 322 determines whether there remains a set of partial texts constituting an unprocessed cluster and, if there is an unprocessed cluster, causes the processes from the analysis means 22 onward to be performed on the set of partial texts constituting that cluster.

The output means 3 combines and outputs the summary sentences output for the respective clusters by the partial sentence ranking means 28.

Note that sentence division and morphological analysis have already been performed, in the processing of the partial text clustering means 320, on the set of partial texts that is input to the analysis means 22 and the subsequent processing. The apparatus may therefore be configured so that the processing of the analysis means 22 is skipped, or so that only dependency analysis is performed.

In addition, if, after clustering is finished in the processing of the partial text clustering means 320, the concept vector of each cluster is calculated as the sum of the concept vectors of the partial texts constituting the cluster, normalized to length 1, this vector serves as the concept vector of the partial text set that is input to the analysis means 22 and the subsequent processing. The apparatus may therefore be configured so that the processing of the input text concept vector generation means 24 is omitted.
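The shortcut described here, computing each cluster's concept vector directly from its members' partial text concept vectors, might look like the sketch below. The function names and the sample vectors and labels are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to length 1; return it unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cluster_concept_vectors(vectors, labels, k):
    """Return, per cluster, the normalized sum of its members' concept vectors."""
    dim = len(vectors[0])
    sums = [[0.0] * dim for _ in range(k)]
    for vec, label in zip(vectors, labels):
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
    return [normalize(s) for s in sums]

part_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
cluster_vectors = cluster_concept_vectors(part_vectors, labels=[0, 1, 1], k=2)
```

Each entry of `cluster_vectors` could then stand in for the input text concept vector when the partial sentences of that cluster are ranked.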

<Operation of the Text Summarization Apparatus>
Next, the operation of the text summarization apparatus 300 according to the present embodiment will be described. When an input document to be summarized is input to the text summarization apparatus 300, the input means 1 accepts the input document. The text summarization apparatus 300 then executes the text summarization processing routine shown in FIG. 9. Processes identical to those of the first embodiment are denoted by the same reference numerals, and description thereof is omitted.

First, in step S300, the partial text clustering means 320 obtains, for each partial text obtained from the input document, the concept vectors of the content words constituting the independent parts in that partial text from the word concept base, computes a partial text concept vector, and clusters the set of partial texts on the basis of the partial text concept vectors.

Next, in step S302, the set of partial texts of the target cluster is set.

In steps S100 to S106, the processes of the analysis means 22, the input text concept vector generation means 24, the partial sentence extraction means 26, and the partial sentence ranking means 28 are performed with the set of partial texts constituting the target cluster as the input.

In step S304, the control means 322 determines whether there remains a set of partial texts constituting an unprocessed cluster. If there is an unprocessed cluster, the process returns to step S302. If, on the other hand, the processing of steps S302 to S106 has been executed for all clusters, the process proceeds to step S108.

In step S108, the output means 3 combines and outputs the summary sentences output for the respective clusters in step S106, and the text summarization processing routine ends.
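The control flow of steps S302 to S108 can be sketched as a loop over clusters. This is only an outline under simplifying assumptions: clustering (S300) is taken as already done, each cluster arrives as a cluster concept vector plus pre-vectorized candidate partial sentences, steps S100 to S106 are reduced to a cosine similarity ranking, and the summaries are joined with a space. All names and sample data are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def summarize_cluster(candidates, cluster_vector, top_n=1):
    """Stand-in for S100-S106: rank (text, vector) candidates against the cluster vector."""
    ranked = sorted(candidates, key=lambda c: cosine(c[1], cluster_vector), reverse=True)
    return [text for text, _ in ranked[:top_n]]

def summarize_document(clusters):
    """clusters: list of (cluster_vector, candidates), i.e. the result of S300."""
    summaries = []
    pending = list(clusters)
    while pending:                                   # S304: any unprocessed cluster left?
        cluster_vector, candidates = pending.pop(0)  # S302: select the target cluster
        summaries.extend(summarize_cluster(candidates, cluster_vector))  # S100-S106
    return " ".join(summaries)                       # S108: combine and output

clusters = [
    ([1.0, 0.0], [("Trains were delayed.", [0.9, 0.1]), ("It rained.", [0.5, 0.5])]),
    ([0.0, 1.0], [("Curry sold out.", [0.1, 0.9]), ("Lunch was late.", [0.6, 0.4])]),
]
summary = summarize_document(clusters)
```

In the flowchart the S304 check follows the processing of a cluster, whereas this loop tests it before each pass; the set of clusters processed is the same.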

The other configurations and operations of the text summarization apparatus 300 according to the third embodiment are the same as those of the first embodiment, and description thereof is omitted.

As described above, according to the third embodiment, the set of partial texts obtained from the input document is clustered, partial sentences with high similarity are selected for each cluster, and the selected partial sentences are combined and output as the summary. Because a summary is thus produced for each topic, it is possible to output a summary that covers all the topics in the input document.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention. For example, the technique of clustering the set of partial texts in the input document and performing each process for each cluster, with the set of partial texts constituting that cluster as the input, may be applied to the second embodiment. In that case, for each cluster obtained by the partial text clustering means 320, the processes of the analysis means 22, the input text concept vector generation means 24, the independent part clustering means 224, the superordinate concept word acquisition means 226, the partial sentence extraction means 26, the partial sentence conversion means 228, and the converted partial sentence ranking means 230 are performed with the set of partial texts constituting that cluster as the input.

Further, at least one of the word concept base 20, the thesaurus database 220, and the language model database 222 may be provided externally and connected to the text summarization apparatus via a network.

The text summarization apparatus described above has a computer system inside. Where a WWW system is used, the term "computer system" also includes the homepage providing environment (or display environment).

Although this specification describes embodiments in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

The present invention is applicable to techniques for outputting a summary that accurately represents the content of an entire text.

DESCRIPTION OF SYMBOLS
1 Input means
2 Computing means
3 Output means
20 Word concept base
22 Analysis means
24 Input text concept vector generation means
26 Partial sentence extraction means
28 Partial sentence ranking means
100, 200, 300 Text summarization apparatus
220 Thesaurus database
222 Language model database
224 Independent part clustering means
226 Superordinate concept word acquisition means
228 Partial sentence conversion means
230 Converted partial sentence ranking means
320 Partial text clustering means
322 Control means

Claims

A text summarization apparatus comprising:
an analysis means for performing sentence division and morphological analysis on an input text;
a word concept base storing a set of pairs each consisting of a word and a word concept vector representing the meaning of the word;
an input text concept vector generation means for generating an input text concept vector representing the meaning of the input text by referring to the word concept base;
a partial sentence extraction means for extracting, from the input text, a set of partial sentences, each being a set of clauses in a sentence; and
a partial sentence ranking means for generating, for each partial sentence extracted by the partial sentence extraction means, a partial sentence concept vector representing the meaning of the partial sentence by referring to the word concept base, calculating a similarity between the input text concept vector and the partial sentence concept vector, and outputting partial sentences with high similarity as a summary.

A text summarization apparatus comprising:
an analysis means for performing sentence division and morphological analysis on an input text;
a word concept base storing a set of pairs each consisting of a word and a word concept vector representing the meaning of the word;
an input text concept vector generation means for generating an input text concept vector representing the meaning of the input text by referring to the word concept base;
an independent part clustering means for clustering a set of independent parts obtained from clauses in the input text;
a superordinate concept word acquisition means for acquiring, for each independent part cluster obtained by the independent part clustering means, a word that is a superordinate concept of the independent parts belonging to the independent part cluster;
a partial sentence extraction means for extracting, from the input text, a set of partial sentences, each being a set of clauses in a sentence;
a partial sentence conversion means for generating, for each partial sentence extracted by the partial sentence extraction means, a converted partial sentence in which each independent part in the partial sentence is converted into the superordinate concept word, acquired by the superordinate concept word acquisition means, of the independent part cluster to which the independent part belongs; and
a converted partial sentence ranking means for generating, for each converted partial sentence obtained by the partial sentence conversion means, a converted partial sentence concept vector representing the meaning of the converted partial sentence by referring to the word concept base, calculating a similarity between the input text concept vector and the converted partial sentence concept vector, and outputting converted partial sentences with high similarity as a summary.

The text summarization apparatus according to claim 1, further comprising a partial text clustering means for clustering a set of partial texts obtained from an input document,
wherein, for each cluster obtained by the partial text clustering means, a summary is output by performing the processes of the analysis means, the input text concept vector generation means, the partial sentence extraction means, and the partial sentence ranking means with the set of partial texts constituting the cluster as the input text.

The text summarization apparatus according to claim 2, further comprising a partial text clustering means for clustering a set of partial texts obtained from an input document,
wherein, for each cluster obtained by the partial text clustering means, a summary is output by performing the processes of the analysis means, the input text concept vector generation means, the independent part clustering means, the superordinate concept word acquisition means, the partial sentence extraction means, the partial sentence conversion means, and the converted partial sentence ranking means with the set of partial texts constituting the cluster as the input text.

A text summarization method in a text summarization apparatus including an analysis means, a word concept base storing a set of pairs each consisting of a word and a word concept vector representing the meaning of the word, an input text concept vector generation means, a partial sentence extraction means, and a partial sentence ranking means, the method comprising the steps of:
performing, by the analysis means, sentence division and morphological analysis on an input text;
generating, by the input text concept vector generation means, an input text concept vector representing the meaning of the input text by referring to the word concept base;
extracting, by the partial sentence extraction means, a set of partial sentences, each being a set of clauses in a sentence, from the input text; and
generating, by the partial sentence ranking means, for each partial sentence extracted by the partial sentence extraction means, a partial sentence concept vector representing the meaning of the partial sentence by referring to the word concept base, calculating a similarity between the input text concept vector and the partial sentence concept vector, and outputting partial sentences with high similarity as a summary.

A text summarization method in a text summarization apparatus including an analysis means, a word concept base storing a set of pairs each consisting of a word and a word concept vector representing the meaning of the word, an input text concept vector generation means, an independent part clustering means, a superordinate concept word acquisition means, a partial sentence extraction means, a partial sentence conversion means, and a converted partial sentence ranking means, the method comprising the steps of:
performing, by the analysis means, sentence division and morphological analysis on an input text;
generating, by the input text concept vector generation means, an input text concept vector representing the meaning of the input text by referring to the word concept base;
clustering, by the independent part clustering means, a set of independent parts obtained from clauses in the input text;
acquiring, by the superordinate concept word acquisition means, for each independent part cluster obtained by the independent part clustering means, a word that is a superordinate concept of the independent parts belonging to the independent part cluster;
extracting, by the partial sentence extraction means, a set of partial sentences, each being a set of clauses in a sentence, from the input text;
generating, by the partial sentence conversion means, for each partial sentence extracted by the partial sentence extraction means, a converted partial sentence in which each independent part in the partial sentence is converted into the superordinate concept word, acquired by the superordinate concept word acquisition means, of the independent part cluster to which the independent part belongs; and
generating, by the converted partial sentence ranking means, for each converted partial sentence obtained by the partial sentence conversion means, a converted partial sentence concept vector representing the meaning of the converted partial sentence by referring to the word concept base, calculating a similarity between the input text concept vector and the converted partial sentence concept vector, and outputting converted partial sentences with high similarity as a summary.

A program for causing a computer to function as each means of the text summarization apparatus according to any one of claims 1 to 4.