JP2010128677A - Text summarization apparatus, method therefor, and program - Google Patents

Text summarization apparatus, method therefor, and program

Publication number
JP2010128677A
JP2010128677A (application number JP2008301058A)
Authority
JP
Japan
Prior art keywords
sentence
word
document
input
input document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2008301058A
Other languages
Japanese (ja)
Other versions
JP4942727B2 (en)
Inventor
Takaaki Hasegawa (長谷川 隆明)
Hitoshi Nishikawa (西川 仁)
Kenji Imamura (今村 賢治)
Genichiro Kikui (菊井 玄一郎)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2008301058A
Publication of JP2010128677A
Application granted
Publication of JP4942727B2
Active legal status (current)
Anticipated expiration of legal status


Abstract

PROBLEM TO BE SOLVED: To create a query-biased summary matching a search word input by a user, for a CGM (consumer-generated media) document or the like.

SOLUTION: A relevance calculation unit 4 acquires, from a word frequency table 1 storing the document frequency of an arbitrary word in a corpus and the co-occurrence frequency of two or more arbitrary words, the document frequency of a search word received by a search word input unit 2, the document frequencies of words included in an input document received by a document input unit 3, and the co-occurrence frequencies of the search word with those words, and calculates relevance values from them. An important sentence selection unit 5 calculates, for every sentence composing the input document, a sentence importance based on the relevance between each word in the sentence and the search word, and selects sentences of high importance. A control unit 6 selects the sentences composing the input document in descending order of sentence importance within a predesignated character limit or summarization rate, rearranges them in order of appearance, and outputs them.

COPYRIGHT: (C) 2010, JPO & INPIT

Description

The present invention relates to a technique for summarizing the content of a text (document) obtained through a user's search, in accordance with that user's interests.

Considering the purpose for which a summary is used, summarization methods can be broadly divided into two kinds. One creates a generic summary usable for any purpose and any user; the other creates a query-biased summary adapted to a user's specific interests. As query-biased methods, two approaches have been proposed: one adds weight for the user's query on top of summarization based on style information, the general importance of words, and so on when extracting important sentences (see Non-Patent Document 1); the other, when clustering the set of documents retrieved for a user's query, adds weight to terms with a high information gain ratio, that is, terms that contribute strongly to further subdividing a cluster (see Non-Patent Document 2).

Anastasios Tombros et al., "Advantages of Query Biased Summaries in Information Retrieval", Research and Development in Information Retrieval, 1998, pp. 2-10.
Tatsunori Mori, "Word Importance Calculation Based on Information Gain Ratio in Document Summarization for Search-Result Display", Journal of Natural Language Processing, Vol. 9, No. 4, 2002, pp. 3-32.

However, the conventional methods described above are aimed mainly at newspaper articles, and several problems arise when they are applied to CGM (consumer-generated media) such as blogs, which have grown explosively in recent years.

That is, in CGM a wide variety of users publish information, and there is no uniform style like that of the newspaper articles assumed by generic summarization methods. It has therefore been difficult to extract important sentences using cues such as sentence position, which are effective only when the style is uniform.

Further, in a summarization method used, for example, to judge whether a document retrieved by a search engine is relevant to a user's interest, only a small number of characters can be output. Giving high weight to the search words contained in the query, which are already known to the user, therefore raises the risk that information still unknown to the user cannot be output. In addition, a document retrieved by a search engine may contain many sentences that include the user's search words, so deciding which of those sentences to select within the limited character count has been an important problem.

Moreover, the clustering needed to obtain the information gain ratio imposes a heavier processing load as the number of documents grows, so its affinity with search engines, where fast responses are expected, is relatively low. Furthermore, since a single CGM document is likely to contain multiple topics, clustering may not work well and the desired terms may not be obtained.

An object of the present invention is to generate, for CGM documents and the like, a query-biased summary that matches the search words input by a user.

To achieve this object, the present invention provides a text summarization apparatus that selects at least one sentence from an input document composed of a plurality of sentences and generates a summary corresponding to that document. The apparatus comprises: a word frequency table storing the document frequency of an arbitrary word in a corpus and the co-occurrence frequency of two or more arbitrary words; a search word input unit that accepts a search word input by a user; a document input unit that accepts a morphologically analyzed input document; a relevance calculation unit that acquires from the word frequency table the document frequency corresponding to the search word, the document frequencies corresponding to the words included in the input document, and the co-occurrence frequencies corresponding to the search word and those words, and calculates from them the relevance between the search word and each word included in the input document; an important sentence selection unit that, for every sentence constituting the input document, calculates a sentence importance based on the relevance, calculated by the relevance calculation unit, between each word in the sentence and the search word, and selects sentences of high importance; and a control unit that controls the above units, selects the sentences constituting the input document in descending order of sentence importance within a predesignated character limit or summarization rate, rearranges them in order of appearance, and outputs them.

According to the present invention, sentences containing words related to the user's search words are extracted from the input document, so a summary matching the user's search request can be generated. In particular, for an input document retrieved by a user's search request, that is, one containing many sentences that include the search words, selecting sentences that contain words related to the search words yields a summary better matched to the request than conventional important-sentence extraction based on generic importance plus extra weight for the search words. Furthermore, even a sentence that does not contain the search words themselves can be selected if it contains many words related to them. Also, since no style information is used, the method is effective for documents in a wide variety of styles.

Next, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 shows an example of an embodiment of the text summarization apparatus of the present invention. The apparatus of this embodiment comprises a word frequency table 1, a search word input unit 2, a document input unit 3, a relevance calculation unit 4, an important sentence selection unit 5, and a control unit 6.

The word frequency table 1 stores the document frequency of an arbitrary word and the co-occurrence frequency of two or more arbitrary words in a predetermined corpus (a set of documents). Here, the document frequency of a word in the corpus means the number of documents obtained when the corpus is searched with that word as the keyword (that is, the number of documents in the corpus in which the word appears), and the co-occurrence frequency of two or more words means the number of documents obtained by an AND search with those words as keywords (that is, the number of documents in which they all appear together).

Alternatively, an Internet search engine can be used to search for one word, or for two or more words, as keywords, and the number of hit documents can serve as the document frequency or co-occurrence frequency. (In that case the "predetermined corpus" is the set of all documents on the Internet accessible through the search engine.)

FIG. 2 shows an example of the word frequency table 1; here it stores the document (co-occurrence) frequencies corresponding to 1 to n arbitrary keywords. For example, when the search term described later is 「土用 丑の日」 (the midsummer "Day of the Ox"; the space marks a word boundary), the document frequency 944 is obtained with 「土用」 as keyword 1 and 「丑の日」 as keyword 2. When a word appearing in the input document described later is 「鰻」 (eel), the document frequency 4170 is obtained with 「鰻」 as keyword 1. The co-occurrence frequency of the search term 「土用 丑の日」 and the word 「鰻」 is the document frequency 260 obtained with 「土用」 as keyword 1, 「丑の日」 as keyword 2, and 「鰻」 as keyword 3. From these, the relevance between the word 「鰻」 and the search term 「土用 丑の日」 can be computed as described later.
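As a minimal sketch of how such a table can be consulted (the data structure and names are hypothetical; the counts are taken from the example above), the table can be modeled as a mapping from a sorted tuple of keywords to the number of documents containing all of them:

```python
# Hypothetical in-memory model of the word frequency table: keys are
# sorted tuples of keywords, values are the number of corpus documents
# containing all of the keywords (an AND search over the corpus).
word_freq_table = {
    ("丑の日", "土用"): 944,        # search term 「土用 丑の日」
    ("鰻",): 4170,                  # word 「鰻」 (eel)
    ("丑の日", "土用", "鰻"): 260,  # co-occurrence of all three
}

def doc_freq(*keywords: str) -> int:
    """Return the document (or co-occurrence) frequency of the keywords."""
    return word_freq_table.get(tuple(sorted(keywords)), 0)

print(doc_freq("土用", "丑の日"))        # 944
print(doc_freq("鰻"))                    # 4170
print(doc_freq("土用", "丑の日", "鰻"))  # 260
```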

The search word input unit 2 accepts a search word from the user, entered directly from a keyboard or the like (not shown), read from a storage medium, or received from another device via a communication medium.

The document input unit 3 accepts the input document to be summarized, read from storage means (not shown) or received from another device via a communication medium. The input document is assumed to consist of a plurality of sentences, each already morphologically analyzed. FIG. 3 shows an example of an input document before morphological analysis, and FIG. 4 shows one sentence of the document of FIG. 3 after morphological analysis.

The relevance calculation unit 4 acquires from the word frequency table 1 the document frequency corresponding to the search word accepted by the search word input unit 2, the document frequencies corresponding to the words contained in the input document accepted by the document input unit 3 (more precisely, the words in the sentences constituting it), and the co-occurrence frequencies corresponding to the search word and those words, and from them calculates the relevance between the search word and each word (or word sequence) in the input document.

Pointwise mutual information (PMI) is one example of such a relevance measure. The words of the input document whose document frequencies are looked up may be restricted by part of speech, for example to nouns, and words whose acquired document frequency is low may be excluded from the relevance calculation.

The pointwise mutual information (PMI) between the search term q and a word w in the input document is computed as follows.

PMI(q, w) = log{ [C(q, w)/N] / [ (C(q)/N) × (C(w)/N) ] } = log{ N × C(q, w) / [ C(q) × C(w) ] }

Here, C(q) is the document frequency of the search term q, C(w) is the document frequency of the word w in the input document, C(q, w) is the co-occurrence frequency of q and w, and N is the number of documents stored in the corpus. When the search term consists of multiple terms, the document frequency of the full term set, or its co-occurrence frequency with a word in a sentence, may be small. When these fall below preset document-frequency or co-occurrence thresholds, one may instead, for example, acquire the document frequency of each individual term and its co-occurrence frequency with the word, compute the relevance between each term and the word, and use the average of those relevance values as the relevance between the multi-term search term and the word.
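Using the quantities C(q), C(w), C(q, w), and N as defined here, the PMI computation can be sketched as follows; the corpus size N in the example call is illustrative only, since the patent does not fix a value:

```python
import math

def pmi(c_q: int, c_w: int, c_qw: int, n: int) -> float:
    """Pointwise mutual information between a search term q and a word w.

    c_q  -- document frequency C(q) of the search term
    c_w  -- document frequency C(w) of the word
    c_qw -- co-occurrence frequency C(q, w)
    n    -- number of documents N in the corpus
    """
    if c_q == 0 or c_w == 0 or c_qw == 0:
        return 0.0  # treat degenerate cases as "no association"
    return math.log((c_qw / n) / ((c_q / n) * (c_w / n)))

# Frequencies from the 「土用 丑の日」 / 「鰻」 example, with an assumed
# corpus size of N = 100000 documents.
print(pmi(944, 4170, 260, 100_000))
```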

The relevance calculation unit 4 may also take into account a general, query-independent importance for each word of the input document, by computing the word's frequency of appearance in the input document (TF) and the reciprocal of the number of corpus documents in which it appears (IDF), and taking the product of these with the PMI. Other measures based on word co-occurrence, such as the log-likelihood ratio, can also be used as the relevance between the search term and a word of the input document.
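One way to realize the TF × IDF × PMI product mentioned here is sketched below. The text only speaks of the reciprocal of the document count, so using log(N/df), the common IDF formulation, is an assumption on our part:

```python
import math

def combined_weight(tf: int, df: int, n: int, pmi_value: float) -> float:
    """Query-independent TF x IDF weight multiplied by the PMI relevance.

    tf        -- occurrences of the word in the input document (TF)
    df        -- number of corpus documents containing the word
    n         -- total number of corpus documents
    pmi_value -- PMI between the word and the search term

    IDF is taken as log(n / df), a common formulation; the text itself
    only mentions the reciprocal of df, so this exact form is assumed.
    """
    if df == 0:
        return 0.0
    return tf * math.log(n / df) * pmi_value
```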

The important sentence selection unit 5 calculates, for every sentence of the input document accepted by the document input unit 3, a sentence importance based on the relevance, computed by the relevance calculation unit 4, between each word in the sentence and the search word, and selects the sentences of high importance. As the sentence importance, one may use the sum of the relevance values between the words of the sentence and the search term, their average, their maximum, or an average over their logarithms or square roots.
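The aggregation choices listed here (sum, average, or maximum of the per-word relevance values) can be sketched as a small helper; the function and mode names are hypothetical:

```python
def sentence_importance(relevances, mode="sum"):
    """Aggregate the relevance of each word in a sentence to the search
    term into a single sentence importance. The modes correspond to the
    sum, average, and maximum options named in the text."""
    vals = list(relevances)
    if not vals:
        return 0.0
    if mode == "sum":
        return sum(vals)
    if mode == "avg":
        return sum(vals) / len(vals)
    if mode == "max":
        return max(vals)
    raise ValueError(f"unknown mode: {mode}")

print(sentence_importance([2.0, 0.5, 1.5]))          # 4.0
print(sentence_importance([2.0, 0.5, 1.5], "max"))   # 2.0
```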

The operation of the important sentence selection unit 5 is described using the five-sentence input document of FIG. 3 as an example.

Taking this input document (after morphological analysis) as input and restricting the relevance calculation to nouns yields the noun IDF and PMI (pointwise mutual information) values shown in FIG. 5. When two important sentences are to be selected from the five, FIG. 6 shows the result when sentence importance is the sum of the IDFs of the words, and FIG. 7 shows the result when it is the sum of their PMIs. With IDF, place names such as 「長野」 (Nagano) and 「岡谷」 (Okaya) have high IDF values, so the fifth sentence is output as most important. With PMI, the PMI values of 「夏」 (summer), 「栄養」 (nutrition), and 「習慣」 (custom) are relatively high while those of 「人口」 (population) and 「寒」 (cold) are low, so the second sentence is among the two selected. The PMI-based summary therefore contains words more closely related to the search term.

The important sentence selection unit 5 can also use the known technique MMR (Maximal Marginal Relevance) to compute a score that is higher when sentence importance is high and similarity between sentences is low, thereby selecting highly novel sentences while avoiding overlap in content with sentences already selected.

MMR can be computed by the following formula.

MMR = arg max_{Pi ∈ R\S} [ λ · Sim1(Pi, Q) − (1 − λ) · max_{Pj ∈ S} Sim2(Pi, Pj) ]

Here, Q is the search term, R is the set of all sentences in the input document, S is the subset of R already selected as important sentences, R\S is the subset of R not yet selected, Sim1(Pi, Q) is a function giving the degree of match between the search term Q and a sentence Pi, and Sim2(Pi, Pj) is the similarity between a candidate sentence Pi and a sentence Pj contained in S.

Here, the sentence importance is used in place of Sim1(Pi, Q), cosine similarity is adopted as the measure of similarity to already selected sentences, and the MMR score is computed from them. To align the dynamic ranges of the two terms, each sentence's importance is normalized by the importance of the most important sentence. Alternatively, the logarithm of each sentence's importance may be taken first and normalized by the maximum logarithm. A concrete example follows.
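A minimal sketch of this scoring, with sentence importance normalized by the maximum importance and cosine similarity computed over bag-of-words count vectors (the function names and vector representation are our assumptions):

```python
import math

def cosine(v1: dict, v2: dict) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(c * v2.get(t, 0) for t, c in v1.items())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def mmr_score(importance, vector, selected_vectors, max_importance, lam=0.5):
    """MMR-style score: lambda times the normalized sentence importance
    minus (1 - lambda) times the maximum cosine similarity to any
    already-selected sentence."""
    norm_imp = importance / max_importance if max_importance else 0.0
    redundancy = max((cosine(vector, s) for s in selected_vectors), default=0.0)
    return lam * norm_imp - (1 - lam) * redundancy

# A sentence with maximal importance and no term overlap with the
# already-selected sentence scores lambda * 1 = 0.5.
print(mmr_score(1.0, {"a": 1}, [{"b": 1}], 1.0))  # 0.5
```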

For example, suppose that in the input document of FIG. 3 the second sentence is selected first. The first sentence then has a normalized importance of 1 and a cosine similarity of 0, while the fifth sentence has a normalized importance of 0.717 and a cosine similarity of 0.659. With λ = 0.5, their MMR scores are 0.5 and 0.029 respectively, so a large gap opens between them. Also, when the input document has a title, redundancy with the title can be avoided by treating the title as an already selected sentence when computing the MMR score.

The control unit 6 controls the units described above; it selects the sentences constituting the input document in descending order of sentence importance or score within the predesignated character limit or summarization rate, rearranges them in order of appearance, and outputs them.

FIG. 8 shows the flow of processing in the important sentence selection unit 5 and the control unit 6 when the score based on sentence importance and inter-sentence similarity is used.

While sentences to be processed remain in the input document (s1), the important sentence selection unit 5 selects one, computes the relevance between each word in that sentence and the search word (s2), and computes the sentence importance from those relevance values (s3). Then, for every sentence already selected (s4), it computes the similarity to the current sentence (s5), records the maximum of these similarities (s6), and computes a score from the sentence's importance and the recorded similarity (s7). This is repeated until no sentences remain to be processed (s1).

The control unit 6 then selects the sentences of the input document in descending order of MMR-based score (s8), stops selecting when the character limit or summarization rate would be exceeded (s9), and rearranges the selected sentences in order of appearance before outputting them (s10).
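The selection loop of steps s8 to s10 can be sketched as follows, here simplified to a character limit; the function name and signature are hypothetical:

```python
def build_summary(sentences, scores, char_limit):
    """Select sentences in descending score order, stop once adding a
    sentence would exceed the character limit (step s9), and return the
    selected sentences restored to order of appearance (step s10)."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        if used + len(sentences[i]) > char_limit:
            break  # selection stops when the limit would be exceeded
        chosen.append(i)
        used += len(sentences[i])
    return [sentences[i] for i in sorted(chosen)]

print(build_summary(["AAAA", "BB", "CCC"], [0.2, 0.9, 0.5], 6))  # ['BB', 'CCC']
```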

By selecting sentences in this way, considering both the importance based on words related to the search term and the similarity to sentences already selected, a summary can be generated that matches the user's search request with little redundancy.

The present invention can also be realized by installing, on a general-purpose computer via a medium or a communication line, a program implementing the functions shown in the configuration diagram of FIG. 1 or a program comprising the procedure shown in the flowchart of FIG. 8.

FIG. 1 is a configuration diagram showing an example of an embodiment of the text summarization apparatus of the present invention.
FIG. 2 is an explanatory diagram showing an example of the word frequency table.
FIG. 3 is an explanatory diagram showing an example of an input document before morphological analysis.
FIG. 4 is an explanatory diagram showing one sentence of the input document of FIG. 3 (after morphological analysis).
FIG. 5 is an explanatory diagram showing the IDF and PMI of the nouns in the input document of FIG. 3.
FIG. 6 is an explanatory diagram showing an example of important sentences selected by IDF.
FIG. 7 is an explanatory diagram showing an example of important sentences selected by PMI.
FIG. 8 is a flowchart of the processing in the important sentence selection unit and the control unit.

Explanation of symbols

1: word frequency table; 2: search word input unit; 3: document input unit; 4: relevance calculation unit; 5: important sentence selection unit; 6: control unit.

Claims (5)

1. A text summarization apparatus that selects at least one sentence from an input document composed of a plurality of sentences and generates a summary corresponding to the input document, the apparatus comprising:
a word frequency table storing the document frequency of an arbitrary word in a corpus and the co-occurrence frequency of two or more arbitrary words;
a search word input unit that accepts a search word input by a user;
a document input unit that accepts a morphologically analyzed input document;
a relevance calculation unit that acquires, from the word frequency table, the document frequency corresponding to the search word, the document frequencies corresponding to the words included in the input document, and the co-occurrence frequencies corresponding to the search word and those words, and calculates from them the relevance between the search word and each word included in the input document;
an important sentence selection unit that, for every sentence constituting the input document, calculates a sentence importance based on the relevance, calculated by the relevance calculation unit, between each word included in the sentence and the search word, and selects sentences of high sentence importance; and
a control unit that controls the above units, selects the sentences constituting the input document in descending order of sentence importance within a predesignated character limit or summarization rate, rearranges them in order of appearance, and outputs them.
2. The text summarization apparatus according to claim 1, comprising:
an important sentence selection unit that, for all sentences included in the input document, calculates the sentence importance based on the relevance, calculated by the relevance calculation unit, between each word included in the sentence and the search word, also calculates the similarity between sentences, obtains a score that is higher the higher the sentence importance and the lower the inter-sentence similarity, and selects sentences of high score; and
a control unit that controls the above units, selects the sentences constituting the input document in descending order of the score within a predesignated character limit or summarization rate, rearranges them in order of appearance, and outputs them.
A text summarization method for generating a summary of an input document composed of a plurality of sentences by selecting at least one sentence from the input document, comprising the steps of:
a search word input unit accepting a search word input by a user;
a document input unit accepting an input document that has already been morphologically analyzed;
a relevance calculation unit obtaining, from a word frequency table that stores the document frequency of any word in a corpus and the co-occurrence frequency of any two or more words, the document frequency corresponding to the search word, the document frequency corresponding to each word included in the input document, and the co-occurrence frequency corresponding to the search word and each word included in the input document, and calculating from these a degree of relevance between the search word and each word included in the input document;
an important sentence selection unit, for every sentence constituting the input document, calculating a sentence importance based on the degree of relevance between the search word and each word included in the sentence, as calculated in the relevance calculation step, and selecting sentences of high sentence importance; and
a control unit selecting sentences constituting the input document in descending order of sentence importance within a pre-specified limit on the number of characters or the summarization ratio, rearranging the selected sentences in order of appearance, and outputting them.
A text summarization method characterized by the above.
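The final control step, take sentences in descending importance while a length budget holds, then restore document order, can be sketched as follows. The function and parameter names are illustrative; a summarization-ratio limit would simply replace `max_chars` with a fraction of the input's total length:

```python
def summarize(sentences, importances, max_chars):
    """Select sentences in descending importance while the running total
    stays within max_chars, then restore appearance order for output."""
    by_importance = sorted(range(len(sentences)),
                           key=lambda i: importances[i], reverse=True)
    chosen, used = [], 0
    for i in by_importance:
        if used + len(sentences[i]) <= max_chars:
            chosen.append(i)
            used += len(sentences[i])
    # sorted(chosen) restores the sentences' original order of appearance
    return "".join(sentences[i] for i in sorted(chosen))
```

Note that a long high-importance sentence that overflows the budget is skipped rather than truncated, letting shorter lower-ranked sentences still fill the remaining space.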
the important sentence selection unit, for every sentence included in the input document, calculating the sentence importance based on the degree of relevance between the search word and each word included in the sentence, as calculated in the relevance calculation step, also calculating the similarity between sentences, obtaining a score that becomes higher as the sentence importance becomes higher and the similarity between sentences becomes lower, and selecting sentences with high scores; and
the control unit selecting sentences constituting the input document in descending order of the score within a pre-specified limit on the number of characters or the summarization ratio, rearranging them in order of appearance, and outputting them.
The text summarization method according to claim 3.
A program for causing a computer to function as each of the means of the text summarization apparatus according to claim 1 or 2.
JP2008301058A 2008-11-26 2008-11-26 Text summarization apparatus, method and program thereof Active JP4942727B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2008301058A JP4942727B2 (en) 2008-11-26 2008-11-26 Text summarization apparatus, method and program thereof


Publications (2)

Publication Number Publication Date
JP2010128677A true JP2010128677A (en) 2010-06-10
JP4942727B2 (en) 2012-05-30

Family

ID=42329039

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2008301058A Active JP4942727B2 (en) 2008-11-26 2008-11-26 Text summarization apparatus, method and program thereof

Country Status (1)

Country Link
JP (1) JP4942727B2 (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11184865A (en) * 1997-12-19 1999-07-09 Matsushita Electric Ind Co Ltd Document summarizing device

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012221293A (en) * 2011-04-11 2012-11-12 Nippon Telegr & Teleph Corp <Ntt> Document summarization device, document summarization method, and program
JP2013186608A (en) * 2012-03-07 2013-09-19 Nippon Telegr & Teleph Corp <Ntt> Document summarization device and method and program
JP2015132899A (en) * 2014-01-09 2015-07-23 日本放送協会 Keyword extraction device and program
US9734144B2 (en) 2014-09-18 2017-08-15 Empire Technology Development Llc Three-dimensional latent semantic analysis
US9767193B2 (en) 2015-03-27 2017-09-19 Fujitsu Limited Generation apparatus and method
US10430468B2 (en) 2015-09-09 2019-10-01 Uberple Co., Ltd. Method and system for extracting sentences
JP2017174059A (en) * 2016-03-23 2017-09-28 株式会社東芝 Information processor, information processing method, and program
KR20180032541A (en) * 2018-03-20 2018-03-30 주식회사 위버플 Method and system for extracting sentences
KR102034302B1 (en) * 2018-03-20 2019-10-18 주식회사 딥서치 Method and system for extracting sentences
JP2020067987A (en) * 2018-10-26 2020-04-30 楽天株式会社 Summary creation device, summary creation method, and program
US11061950B2 (en) 2018-10-26 2021-07-13 Rakuten, Inc. Summary generating device, summary generating method, and information storage medium
US20220277138A1 (en) * 2019-07-17 2022-09-01 Nippon Telegraph And Telephone Corporation Training data generation device, training data generation method and training data generation program
JP2021131769A (en) * 2020-02-20 2021-09-09 ソフトバンク株式会社 Summary generation program, summary generation device, and summary generation method
JP7152437B2 (en) 2020-02-20 2022-10-12 ソフトバンク株式会社 Summary generation program, summary generation device and summary generation method
CN113343669A (en) * 2021-05-20 2021-09-03 北京明略软件系统有限公司 Method and system for learning word vector, electronic equipment and storage medium

Also Published As

Publication number Publication date
JP4942727B2 (en) 2012-05-30

Similar Documents

Publication Publication Date Title
JP4942727B2 (en) Text summarization apparatus, method and program thereof
US20190147000A1 (en) Systems and methods for performing search and retrieval of electronic documents using a big index
JP3820242B2 (en) Question answer type document search system and question answer type document search program
Culpepper et al. Dynamic cutoff prediction in multi-stage retrieval systems
US20090228468A1 (en) Using core words to extract key phrases from documents
US10152478B2 (en) Apparatus, system and method for string disambiguation and entity ranking
JP2009525520A (en) Evaluation method for ranking and sorting electronic documents in search result list based on relevance, and database search engine
WO2014050002A1 (en) Query degree-of-similarity evaluation system, evaluation method, and program
EP2192503A1 (en) Optimised tag based searching
Song et al. A novel term weighting scheme based on discrimination power obtained from past retrieval results
US20140289260A1 (en) Keyword Determination
Luk et al. A comparison of Chinese document indexing strategies and retrieval models
Kim et al. Does selective search benefit from WAND optimization?
JP4074564B2 (en) Computer-executable dimension reduction method, program for executing the dimension reduction method, dimension reduction apparatus, and search engine apparatus using the dimension reduction apparatus
JP5565568B2 (en) Information recommendation device, information recommendation method and program
Yusuf et al. Arabic text stemming using query expansion method
JP5251099B2 (en) Term co-occurrence degree extraction device, term co-occurrence degree extraction method, and term co-occurrence degree extraction program
JP2009086903A (en) Retrieval service device
JP2004192546A (en) Information retrieval method, device, program, and recording medium
JP4009937B2 (en) Document search device, document search program, and medium storing document search program
US8745078B2 (en) Control computer and file search method using the same
JP4452527B2 (en) Document search device, document search method, and document search program
Fatmawati et al. Implementation of the common phrase index method on the phrase query for information retrieval
JP2010003266A (en) Query generation device, method, program and computer-readable recording medium
JP2002117043A (en) Device and method for document retrieval, and recording medium with recorded program for implementing the same method

Legal Events

RD04 Notification of resignation of power of attorney (JAPANESE INTERMEDIATE CODE: A7424; effective 20101215)
A977 Report on retrieval (JAPANESE INTERMEDIATE CODE: A971007; effective 20110511)
A131 Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131; effective 20110530)
RD04 Notification of resignation of power of attorney (JAPANESE INTERMEDIATE CODE: A7424; effective 20110613)
RD02 Notification of acceptance of power of attorney (JAPANESE INTERMEDIATE CODE: A7422; effective 20110614)
A521 Written amendment (JAPANESE INTERMEDIATE CODE: A821; effective 20110615)
RD03 Notification of appointment of power of attorney (JAPANESE INTERMEDIATE CODE: A7423; effective 20110616)
A521 Written amendment (JAPANESE INTERMEDIATE CODE: A523; effective 20110725)
TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model) (JAPANESE INTERMEDIATE CODE: A01; effective 20120224)
A61 First payment of annual fees (during grant procedure) (JAPANESE INTERMEDIATE CODE: A61; effective 20120228)
R150 Certificate of patent or registration of utility model (ref document number: 4942727; country of ref document: JP; JAPANESE INTERMEDIATE CODE: R150)
FPAY Renewal fee payment (PAYMENT UNTIL: 20150309; year of fee payment: 3)
S531 Written request for registration of change of domicile (JAPANESE INTERMEDIATE CODE: R313531)
R350 Written notification of registration of transfer (JAPANESE INTERMEDIATE CODE: R350)