JP2010128677A - Text summarization apparatus, method therefor, and program - Google Patents

Text summarization apparatus, method therefor, and program

Publication number
JP2010128677A
JP2010128677A (application number JP2008301058A)
Authority
JP
Japan
Prior art keywords
sentence
word
document
input
input document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2008301058A
Other languages
Japanese (ja)
Other versions
JP4942727B2 (en)
Inventor
Takaaki Hasegawa (長谷川 隆明)
Hitoshi Nishikawa (西川 仁)
Kenji Imamura (今村 賢治)
Genichiro Kikui (菊井 玄一郎)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2008301058A
Publication of JP2010128677A
Application granted
Publication of JP4942727B2
Active legal status (current)
Anticipated expiration of legal status


Abstract

PROBLEM TO BE SOLVED: To create a query-biased summary matching a search word input by a user, for a CGM (consumer-generated media) document or the like.

SOLUTION: A relevance calculation unit 4 acquires, from a word frequency table 1 storing the document frequency of an arbitrary word in a corpus and the co-occurrence frequency of two or more arbitrary words, the document frequency of a search word received by a search word input unit 2, the document frequencies of words included in an input document received by a document input unit 3, and the co-occurrence frequencies of the search word with those words, and calculates relevance values from them. An important sentence selection unit 5 calculates, for every sentence composing the input document, a sentence importance based on the relevance between each word in the sentence and the search word, and selects sentences of high importance. A control unit 6 selects the sentences composing the input document in descending order of sentence importance within a predesignated character limit or summarization rate, rearranges them in order of appearance, and outputs them.

COPYRIGHT: (C) 2010, JPO & INPIT

Description

The present invention relates to a technique for summarizing the content of a text (document) obtained through a user's search, in accordance with that user's interests.

Considering the purpose for which a summary is used, summarization methods can be broadly divided into two kinds. One creates a generic summary usable for any purpose and any user; the other creates a query-biased summary adapted to a user's specific interests. As query-biased methods, two approaches have been proposed: one adds weight for the user's query on top of summarization based on style information, the general importance of words, and so on when extracting important sentences (see Non-Patent Document 1); the other, when clustering the set of documents retrieved for a user's query, adds weight to terms with a high information gain ratio, that is, terms that contribute strongly to further subdividing a cluster (see Non-Patent Document 2).

Anastasios Tombros et al., "Advantages of Query Biased Summaries in Information Retrieval", Research and Development in Information Retrieval, 1998, pp. 2-10.
Tatsunori Mori, "Word Importance Calculation Based on Information Gain Ratio in Document Summarization for Search-Result Display", Journal of Natural Language Processing, Vol. 9, No. 4, 2002, pp. 3-32.

However, the conventional methods described above are aimed mainly at newspaper articles, and several problems arise when they are applied to CGM (consumer-generated media) such as blogs, which have grown explosively in recent years.

That is, in CGM a wide variety of users publish information, and there is no uniform style like that of the newspaper articles assumed by generic summarization methods. It has therefore been difficult to extract important sentences using cues such as sentence position, which are effective only when the style is uniform.

Further, in a summarization method used, for example, to judge whether a document retrieved by a search engine is relevant to a user's interest, only a small number of characters can be output. Giving high weight to the search words contained in the query, which are already known to the user, therefore raises the risk that information still unknown to the user cannot be output. In addition, a document retrieved by a search engine may contain many sentences that include the user's search words, so deciding which of those sentences to select within the limited character count has been an important problem.

Moreover, the clustering needed to obtain the information gain ratio imposes a heavier processing load as the number of documents grows, so its affinity with search engines, where fast responses are expected, is relatively low. Furthermore, since a single CGM document is likely to contain multiple topics, clustering may not work well and the desired terms may not be obtained.

An object of the present invention is to generate, for CGM documents and the like, a query-biased summary that matches the search words input by a user.

To achieve this object, the present invention provides a text summarization apparatus that selects at least one sentence from an input document composed of a plurality of sentences and generates a summary corresponding to that document. The apparatus comprises: a word frequency table storing the document frequency of an arbitrary word in a corpus and the co-occurrence frequency of two or more arbitrary words; a search word input unit that accepts a search word input by a user; a document input unit that accepts a morphologically analyzed input document; a relevance calculation unit that acquires from the word frequency table the document frequency corresponding to the search word, the document frequencies corresponding to the words included in the input document, and the co-occurrence frequencies corresponding to the search word and those words, and calculates from them the relevance between the search word and each word included in the input document; an important sentence selection unit that, for every sentence constituting the input document, calculates a sentence importance based on the relevance, calculated by the relevance calculation unit, between each word in the sentence and the search word, and selects sentences of high importance; and a control unit that controls the above units, selects the sentences constituting the input document in descending order of sentence importance within a predesignated character limit or summarization rate, rearranges them in order of appearance, and outputs them.

According to the present invention, sentences containing words related to the user's search words are extracted from the input document, so a summary matching the user's search request can be generated. In particular, for an input document retrieved by a user's search request, that is, one containing many sentences that include the search words, selecting sentences that contain words related to the search words yields a summary better matched to the request than conventional important-sentence extraction based on generic importance plus extra weight for the search words. Furthermore, even a sentence that does not contain the search words themselves can be selected if it contains many words related to them. Also, since no style information is used, the method is effective for documents in a wide variety of styles.

Next, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 shows an example of an embodiment of the text summarization apparatus of the present invention. The apparatus of this embodiment comprises a word frequency table 1, a search word input unit 2, a document input unit 3, a relevance calculation unit 4, an important sentence selection unit 5, and a control unit 6.

The word frequency table 1 stores the document frequency of an arbitrary word and the co-occurrence frequency of two or more arbitrary words in a predetermined corpus (a set of documents). Here, the document frequency of a word in the corpus means the number of documents obtained when the corpus is searched with that word as the keyword (that is, the number of documents in the corpus in which the word appears), and the co-occurrence frequency of two or more words means the number of documents obtained by an AND search with those words as keywords (that is, the number of documents in which they all appear together).

Alternatively, an Internet search engine can be used to search for one word, or for two or more words, as keywords, and the number of hit documents can serve as the document frequency or co-occurrence frequency. (In that case the "predetermined corpus" is the set of all documents on the Internet accessible through the search engine.)

FIG. 2 shows an example of the word frequency table 1; here it stores the document (co-occurrence) frequencies corresponding to 1 to n arbitrary keywords. For example, when the search term described later is 「土用 丑の日」 (the midsummer "Day of the Ox"; the space marks a word boundary), the document frequency 944 is obtained with 「土用」 as keyword 1 and 「丑の日」 as keyword 2. When a word appearing in the input document described later is 「鰻」 (eel), the document frequency 4170 is obtained with 「鰻」 as keyword 1. The co-occurrence frequency of the search term 「土用 丑の日」 and the word 「鰻」 is the document frequency 260 obtained with 「土用」 as keyword 1, 「丑の日」 as keyword 2, and 「鰻」 as keyword 3. From these, the relevance between the word 「鰻」 and the search term 「土用 丑の日」 can be computed as described later.
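As a minimal sketch of how such a table can be consulted (the data structure and names are hypothetical; the counts are taken from the example above), the table can be modeled as a mapping from a sorted tuple of keywords to the number of documents containing all of them:

```python
# Hypothetical in-memory model of the word frequency table: keys are
# sorted tuples of keywords, values are the number of corpus documents
# containing all of the keywords (an AND search over the corpus).
word_freq_table = {
    ("丑の日", "土用"): 944,        # search term 「土用 丑の日」
    ("鰻",): 4170,                  # word 「鰻」 (eel)
    ("丑の日", "土用", "鰻"): 260,  # co-occurrence of all three
}

def doc_freq(*keywords: str) -> int:
    """Return the document (or co-occurrence) frequency of the keywords."""
    return word_freq_table.get(tuple(sorted(keywords)), 0)

print(doc_freq("土用", "丑の日"))        # 944
print(doc_freq("鰻"))                    # 4170
print(doc_freq("土用", "丑の日", "鰻"))  # 260
```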

The search word input unit 2 accepts a search word from the user, entered directly from a keyboard or the like (not shown), read from a storage medium, or received from another device via a communication medium.

The document input unit 3 accepts the input document to be summarized, read from storage means (not shown) or received from another device via a communication medium. The input document is assumed to consist of a plurality of sentences, each already morphologically analyzed. FIG. 3 shows an example of an input document before morphological analysis, and FIG. 4 shows one sentence of the document of FIG. 3 after morphological analysis.

The relevance calculation unit 4 acquires from the word frequency table 1 the document frequency corresponding to the search word accepted by the search word input unit 2, the document frequencies corresponding to the words contained in the input document accepted by the document input unit 3 (more precisely, the words in the sentences constituting it), and the co-occurrence frequencies corresponding to the search word and those words, and from them calculates the relevance between the search word and each word (or word sequence) in the input document.

Pointwise mutual information (PMI) is one example of such a relevance measure. The words of the input document whose document frequencies are looked up may be restricted by part of speech, for example to nouns, and words whose acquired document frequency is low may be excluded from the relevance calculation.

The pointwise mutual information (PMI) between the search term q and a word w in the input document is computed as follows.

PMI(q, w) = log{ [C(q, w)/N] / [ (C(q)/N) × (C(w)/N) ] } = log{ N × C(q, w) / [ C(q) × C(w) ] }

Here, C(q) is the document frequency of the search term q, C(w) is the document frequency of the word w in the input document, C(q, w) is the co-occurrence frequency of q and w, and N is the number of documents stored in the corpus. When the search term consists of multiple terms, the document frequency of the full term set, or its co-occurrence frequency with a word in a sentence, may be small. When these fall below preset document-frequency or co-occurrence thresholds, one may instead, for example, acquire the document frequency of each individual term and its co-occurrence frequency with the word, compute the relevance between each term and the word, and use the average of those relevance values as the relevance between the multi-term search term and the word.
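Using the quantities C(q), C(w), C(q, w), and N as defined here, the PMI computation can be sketched as follows; the corpus size N in the example call is illustrative only, since the patent does not fix a value:

```python
import math

def pmi(c_q: int, c_w: int, c_qw: int, n: int) -> float:
    """Pointwise mutual information between a search term q and a word w.

    c_q  -- document frequency C(q) of the search term
    c_w  -- document frequency C(w) of the word
    c_qw -- co-occurrence frequency C(q, w)
    n    -- number of documents N in the corpus
    """
    if c_q == 0 or c_w == 0 or c_qw == 0:
        return 0.0  # treat degenerate cases as "no association"
    return math.log((c_qw / n) / ((c_q / n) * (c_w / n)))

# Frequencies from the 「土用 丑の日」 / 「鰻」 example, with an assumed
# corpus size of N = 100000 documents.
print(pmi(944, 4170, 260, 100_000))
```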

The relevance calculation unit 4 may also take into account a general, query-independent importance for each word of the input document, by computing the word's frequency of appearance in the input document (TF) and the reciprocal of the number of corpus documents in which it appears (IDF), and taking the product of these with the PMI. Other measures based on word co-occurrence, such as the log-likelihood ratio, can also be used as the relevance between the search term and a word of the input document.
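One way to realize the TF × IDF × PMI product mentioned here is sketched below. The text only speaks of the reciprocal of the document count, so using log(N/df), the common IDF formulation, is an assumption on our part:

```python
import math

def combined_weight(tf: int, df: int, n: int, pmi_value: float) -> float:
    """Query-independent TF x IDF weight multiplied by the PMI relevance.

    tf        -- occurrences of the word in the input document (TF)
    df        -- number of corpus documents containing the word
    n         -- total number of corpus documents
    pmi_value -- PMI between the word and the search term

    IDF is taken as log(n / df), a common formulation; the text itself
    only mentions the reciprocal of df, so this exact form is assumed.
    """
    if df == 0:
        return 0.0
    return tf * math.log(n / df) * pmi_value
```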

The important sentence selection unit 5 calculates, for every sentence of the input document accepted by the document input unit 3, a sentence importance based on the relevance, computed by the relevance calculation unit 4, between each word in the sentence and the search word, and selects the sentences of high importance. As the sentence importance, one may use the sum of the relevance values between the words of the sentence and the search term, their average, their maximum, or an average over their logarithms or square roots.
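The aggregation choices listed here (sum, average, or maximum of the per-word relevance values) can be sketched as a small helper; the function and mode names are hypothetical:

```python
def sentence_importance(relevances, mode="sum"):
    """Aggregate the relevance of each word in a sentence to the search
    term into a single sentence importance. The modes correspond to the
    sum, average, and maximum options named in the text."""
    vals = list(relevances)
    if not vals:
        return 0.0
    if mode == "sum":
        return sum(vals)
    if mode == "avg":
        return sum(vals) / len(vals)
    if mode == "max":
        return max(vals)
    raise ValueError(f"unknown mode: {mode}")

print(sentence_importance([2.0, 0.5, 1.5]))          # 4.0
print(sentence_importance([2.0, 0.5, 1.5], "max"))   # 2.0
```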

The operation of the important sentence selection unit 5 is described using the five-sentence input document of FIG. 3 as an example.

Taking this input document (after morphological analysis) as input and restricting the relevance calculation to nouns yields the noun IDF and PMI (pointwise mutual information) values shown in FIG. 5. When two important sentences are to be selected from the five, FIG. 6 shows the result when sentence importance is the sum of the IDFs of the words, and FIG. 7 shows the result when it is the sum of their PMIs. With IDF, place names such as 「長野」 (Nagano) and 「岡谷」 (Okaya) have high IDF values, so the fifth sentence is output as most important. With PMI, the PMI values of 「夏」 (summer), 「栄養」 (nutrition), and 「習慣」 (custom) are relatively high while those of 「人口」 (population) and 「寒」 (cold) are low, so the second sentence is among the two selected. The PMI-based summary therefore contains words more closely related to the search term.

The important sentence selection unit 5 can also use the known technique MMR (Maximal Marginal Relevance) to compute a score that is higher when sentence importance is high and similarity between sentences is low, thereby selecting highly novel sentences while avoiding overlap in content with sentences already selected.

MMR can be computed by the following formula.

MMR = arg max_{Pi ∈ R\S} [ λ · Sim1(Pi, Q) − (1 − λ) · max_{Pj ∈ S} Sim2(Pi, Pj) ]

Here, Q is the search term, R is the set of all sentences in the input document, S is the subset of R already selected as important sentences, R\S is the subset of R not yet selected, Sim1(Pi, Q) is a function giving the degree of match between the search term Q and a sentence Pi, and Sim2(Pi, Pj) is the similarity between a candidate sentence Pi and a sentence Pj contained in S.

Here, the sentence importance is used in place of Sim1(Pi, Q), cosine similarity is adopted as the measure of similarity to already selected sentences, and the MMR score is computed from them. To align the dynamic ranges of the two terms, each sentence's importance is normalized by the importance of the most important sentence. Alternatively, the logarithm of each sentence's importance may be taken first and normalized by the maximum logarithm. A concrete example follows.
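A minimal sketch of this scoring, with sentence importance normalized by the maximum importance and cosine similarity computed over bag-of-words count vectors (the function names and vector representation are our assumptions):

```python
import math

def cosine(v1: dict, v2: dict) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(c * v2.get(t, 0) for t, c in v1.items())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def mmr_score(importance, vector, selected_vectors, max_importance, lam=0.5):
    """MMR-style score: lambda times the normalized sentence importance
    minus (1 - lambda) times the maximum cosine similarity to any
    already-selected sentence."""
    norm_imp = importance / max_importance if max_importance else 0.0
    redundancy = max((cosine(vector, s) for s in selected_vectors), default=0.0)
    return lam * norm_imp - (1 - lam) * redundancy

# A sentence with maximal importance and no term overlap with the
# already-selected sentence scores lambda * 1 = 0.5.
print(mmr_score(1.0, {"a": 1}, [{"b": 1}], 1.0))  # 0.5
```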

For example, suppose that in the input document of FIG. 3 the second sentence is selected first. The first sentence then has a normalized importance of 1 and a cosine similarity of 0, while the fifth sentence has a normalized importance of 0.717 and a cosine similarity of 0.659. With λ = 0.5, their MMR scores are 0.5 and 0.029 respectively, so a large gap opens between them. Also, when the input document has a title, redundancy with the title can be avoided by treating the title as an already selected sentence when computing the MMR score.

The control unit 6 controls the units described above; it selects the sentences constituting the input document in descending order of sentence importance or score within the predesignated character limit or summarization rate, rearranges them in order of appearance, and outputs them.

FIG. 8 shows the flow of processing in the important sentence selection unit 5 and the control unit 6 when the score based on sentence importance and inter-sentence similarity is used.

While sentences to be processed remain in the input document (s1), the important sentence selection unit 5 selects one, computes the relevance between each word in that sentence and the search word (s2), and computes the sentence importance from those relevance values (s3). Then, for every sentence already selected (s4), it computes the similarity to the current sentence (s5), records the maximum of these similarities (s6), and computes a score from the sentence's importance and the recorded similarity (s7). This is repeated until no sentences remain to be processed (s1).

The control unit 6 then selects the sentences of the input document in descending order of MMR-based score (s8), stops selecting when the character limit or summarization rate would be exceeded (s9), and rearranges the selected sentences in order of appearance before outputting them (s10).
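The selection loop of steps s8 to s10 can be sketched as follows, here simplified to a character limit; the function name and signature are hypothetical:

```python
def build_summary(sentences, scores, char_limit):
    """Select sentences in descending score order, stop once adding a
    sentence would exceed the character limit (step s9), and return the
    selected sentences restored to order of appearance (step s10)."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        if used + len(sentences[i]) > char_limit:
            break  # selection stops when the limit would be exceeded
        chosen.append(i)
        used += len(sentences[i])
    return [sentences[i] for i in sorted(chosen)]

print(build_summary(["AAAA", "BB", "CCC"], [0.2, 0.9, 0.5], 6))  # ['BB', 'CCC']
```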

By selecting sentences in this way, considering both the importance based on words related to the search term and the similarity to sentences already selected, a summary can be generated that matches the user's search request with little redundancy.

The present invention can also be realized by installing, on a general-purpose computer via a medium or a communication line, a program implementing the functions shown in the configuration diagram of FIG. 1 or a program comprising the procedure shown in the flowchart of FIG. 8.

FIG. 1 is a configuration diagram showing an example of an embodiment of the text summarization apparatus of the present invention.
FIG. 2 is an explanatory diagram showing an example of the word frequency table.
FIG. 3 is an explanatory diagram showing an example of an input document before morphological analysis.
FIG. 4 is an explanatory diagram showing one sentence of the input document of FIG. 3 (after morphological analysis).
FIG. 5 is an explanatory diagram showing the IDF and PMI of the nouns in the input document of FIG. 3.
FIG. 6 is an explanatory diagram showing an example of important sentences selected by IDF.
FIG. 7 is an explanatory diagram showing an example of important sentences selected by PMI.
FIG. 8 is a flowchart of the processing in the important sentence selection unit and the control unit.

Explanation of symbols

1: word frequency table; 2: search word input unit; 3: document input unit; 4: relevance calculation unit; 5: important sentence selection unit; 6: control unit.

Claims (5)

1. A text summarization apparatus that selects at least one sentence from an input document composed of a plurality of sentences and generates a summary corresponding to the input document, the apparatus comprising:
a word frequency table storing the document frequency of an arbitrary word in a corpus and the co-occurrence frequency of two or more arbitrary words;
a search word input unit that accepts a search word input by a user;
a document input unit that accepts a morphologically analyzed input document;
a relevance calculation unit that acquires, from the word frequency table, the document frequency corresponding to the search word, the document frequencies corresponding to the words included in the input document, and the co-occurrence frequencies corresponding to the search word and those words, and calculates from them the relevance between the search word and each word included in the input document;
an important sentence selection unit that, for every sentence constituting the input document, calculates a sentence importance based on the relevance, calculated by the relevance calculation unit, between each word included in the sentence and the search word, and selects sentences of high sentence importance; and
a control unit that controls the above units, selects the sentences constituting the input document in descending order of sentence importance within a predesignated character limit or summarization rate, rearranges them in order of appearance, and outputs them.
2. The text summarization apparatus according to claim 1, comprising:
an important sentence selection unit that, for all sentences included in the input document, calculates the sentence importance based on the relevance, calculated by the relevance calculation unit, between each word included in the sentence and the search word, also calculates the similarity between sentences, obtains a score that is higher the higher the sentence importance and the lower the inter-sentence similarity, and selects sentences of high score; and
a control unit that controls the above units, selects the sentences constituting the input document in descending order of the score within a predesignated character limit or summarization rate, rearranges them in order of appearance, and outputs them.
A text summarization method for generating a summary of an input document composed of a plurality of sentences by selecting at least one sentence from the input document, comprising the steps of:
a search word input unit accepting a search word input by a user;
a document input unit accepting an input document that has already been morphologically analyzed;
a relevance calculation unit obtaining, from a word frequency table that stores the document frequency of any word in a corpus and the co-occurrence frequency of any two or more words, the document frequency corresponding to the search word, the document frequency corresponding to each word included in the input document, and the co-occurrence frequency corresponding to the search word and each word included in the input document, and calculating from these a degree of relevance between the search word and each word included in the input document;
an important sentence selection unit, for every sentence constituting the input document, calculating a sentence importance based on the degree of relevance between the search word and each word included in the sentence, as calculated in the relevance calculation step, and selecting sentences of high sentence importance; and
a control unit selecting sentences constituting the input document in descending order of sentence importance within a pre-specified limit on the number of characters or the summarization ratio, rearranging the selected sentences in order of appearance, and outputting them.
A text summarization method characterized by the above.
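The final control step, take sentences in descending importance while a length budget holds, then restore document order, can be sketched as follows. The function and parameter names are illustrative; a summarization-ratio limit would simply replace `max_chars` with a fraction of the input's total length:

```python
def summarize(sentences, importances, max_chars):
    """Select sentences in descending importance while the running total
    stays within max_chars, then restore appearance order for output."""
    by_importance = sorted(range(len(sentences)),
                           key=lambda i: importances[i], reverse=True)
    chosen, used = [], 0
    for i in by_importance:
        if used + len(sentences[i]) <= max_chars:
            chosen.append(i)
            used += len(sentences[i])
    # sorted(chosen) restores the sentences' original order of appearance
    return "".join(sentences[i] for i in sorted(chosen))
```

Note that a long high-importance sentence that overflows the budget is skipped rather than truncated, letting shorter lower-ranked sentences still fill the remaining space.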
the important sentence selection unit, for every sentence included in the input document, calculating the sentence importance based on the degree of relevance between the search word and each word included in the sentence, as calculated in the relevance calculation step, also calculating the similarity between sentences, obtaining a score that becomes higher as the sentence importance becomes higher and the similarity between sentences becomes lower, and selecting sentences with high scores; and
the control unit selecting sentences constituting the input document in descending order of the score within a pre-specified limit on the number of characters or the summarization ratio, rearranging them in order of appearance, and outputting them.
The text summarization method according to claim 3.
A program for causing a computer to function as each of the means of the text summarization apparatus according to claim 1 or 2.
JP2008301058A 2008-11-26 2008-11-26 Text summarization apparatus, method and program thereof Active JP4942727B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2008301058A JP4942727B2 (en) 2008-11-26 2008-11-26 Text summarization apparatus, method and program thereof


Publications (2)

Publication Number Publication Date
JP2010128677A true JP2010128677A (en) 2010-06-10
JP4942727B2 (en) 2012-05-30

Family

ID=42329039

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2008301058A Active JP4942727B2 (en) 2008-11-26 2008-11-26 Text summarization apparatus, method and program thereof

Country Status (1)

Country Link
JP (1) JP4942727B2 (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11184865A (en) * 1997-12-19 1999-07-09 Matsushita Electric Ind Co Ltd Document summarizing device

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012221293A (en) * 2011-04-11 2012-11-12 Nippon Telegr & Teleph Corp <Ntt> Document summarization device, document summarization method, and program
JP2013186608A (en) * 2012-03-07 2013-09-19 Nippon Telegr & Teleph Corp <Ntt> Document summarization device and method and program
JP2015132899A (en) * 2014-01-09 2015-07-23 日本放送協会 Keyword extraction device and program
US9734144B2 (en) 2014-09-18 2017-08-15 Empire Technology Development Llc Three-dimensional latent semantic analysis
US9767193B2 (en) 2015-03-27 2017-09-19 Fujitsu Limited Generation apparatus and method
US10430468B2 (en) 2015-09-09 2019-10-01 Uberple Co., Ltd. Method and system for extracting sentences
JP2017174059A (en) * 2016-03-23 2017-09-28 株式会社東芝 Information processor, information processing method, and program
KR20180032541A (en) * 2018-03-20 2018-03-30 주식회사 위버플 Method and system for extracting sentences
KR102034302B1 (en) * 2018-03-20 2019-10-18 주식회사 딥서치 Method and system for extracting sentences
JP2020067987A (en) * 2018-10-26 2020-04-30 楽天株式会社 Summary creation device, summary creation method, and program
US11061950B2 (en) 2018-10-26 2021-07-13 Rakuten, Inc. Summary generating device, summary generating method, and information storage medium
US20220277138A1 (en) * 2019-07-17 2022-09-01 Nippon Telegraph And Telephone Corporation Training data generation device, training data generation method and training data generation program
JP2021131769A (en) * 2020-02-20 2021-09-09 ソフトバンク株式会社 Summary generation program, summary generation device, and summary generation method
JP7152437B2 (en) 2020-02-20 2022-10-12 ソフトバンク株式会社 Summary generation program, summary generation device and summary generation method
CN113343669A (en) * 2021-05-20 2021-09-03 北京明略软件系统有限公司 Method and system for learning word vector, electronic equipment and storage medium

Also Published As

Publication number Publication date
JP4942727B2 (en) 2012-05-30

Similar Documents

Publication Publication Date Title
JP4942727B2 (en) Text summarization apparatus, method and program thereof
US20190147000A1 (en) Systems and methods for performing search and retrieval of electronic documents using a big index
JP3820242B2 (en) Question answer type document search system and question answer type document search program
Culpepper et al. Dynamic cutoff prediction in multi-stage retrieval systems
US20090228468A1 (en) Using core words to extract key phrases from documents
US10152478B2 (en) Apparatus, system and method for string disambiguation and entity ranking
JP2009525520A (en) Evaluation method for ranking and sorting electronic documents in search result list based on relevance, and database search engine
WO2014050002A1 (en) Query degree-of-similarity evaluation system, evaluation method, and program
EP2192503A1 (en) Optimised tag based searching
Song et al. A novel term weighting scheme based on discrimination power obtained from past retrieval results
US20140289260A1 (en) Keyword Determination
Luk et al. A comparison of Chinese document indexing strategies and retrieval models
Kim et al. Does selective search benefit from WAND optimization?
JP4074564B2 (en) Computer-executable dimension reduction method, program for executing the dimension reduction method, dimension reduction apparatus, and search engine apparatus using the dimension reduction apparatus
JP5565568B2 (en) Information recommendation device, information recommendation method and program
Yusuf et al. Arabic text stemming using query expansion method
JP5251099B2 (en) Term co-occurrence degree extraction device, term co-occurrence degree extraction method, and term co-occurrence degree extraction program
JP2009086903A (en) Retrieval service device
JP2004192546A (en) Information retrieval method, device, program, and recording medium
JP4009937B2 (en) Document search device, document search program, and medium storing document search program
US8745078B2 (en) Control computer and file search method using the same
JP4452527B2 (en) Document search device, document search method, and document search program
Fatmawati et al. Implementation of the common phrase index method on the phrase query for information retrieval
JP2010003266A (en) Query generation device, method, program and computer-readable recording medium
JP2002117043A (en) Device and method for document retrieval, and recording medium with recorded program for implementing the same method

Legal Events

RD04 Notification of resignation of power of attorney (JAPANESE INTERMEDIATE CODE: A7424; effective 20101215)
A977 Report on retrieval (JAPANESE INTERMEDIATE CODE: A971007; effective 20110511)
A131 Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131; effective 20110530)
RD04 Notification of resignation of power of attorney (JAPANESE INTERMEDIATE CODE: A7424; effective 20110613)
RD02 Notification of acceptance of power of attorney (JAPANESE INTERMEDIATE CODE: A7422; effective 20110614)
A521 Written amendment (JAPANESE INTERMEDIATE CODE: A821; effective 20110615)
RD03 Notification of appointment of power of attorney (JAPANESE INTERMEDIATE CODE: A7423; effective 20110616)
A521 Written amendment (JAPANESE INTERMEDIATE CODE: A523; effective 20110725)
TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model) (JAPANESE INTERMEDIATE CODE: A01; effective 20120224)
A61 First payment of annual fees (during grant procedure) (JAPANESE INTERMEDIATE CODE: A61; effective 20120228)
R150 Certificate of patent or registration of utility model (ref document number: 4942727; country of ref document: JP; JAPANESE INTERMEDIATE CODE: R150)
FPAY Renewal fee payment (PAYMENT UNTIL: 20150309; year of fee payment: 3)
S531 Written request for registration of change of domicile (JAPANESE INTERMEDIATE CODE: R313531)
R350 Written notification of registration of transfer (JAPANESE INTERMEDIATE CODE: R350)