JP5523929B2

JP5523929B2 - Text summarization apparatus, text summarization method, and text summarization program

Info

Publication number: JP5523929B2
Application number: JP2010117403A
Authority: JP
Inventors: Takaaki Hasegawa; Hitoshi Nishikawa; Yoshihiro Matsuo; Genichiro Kikui
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-05-21
Filing date: 2010-05-21
Publication date: 2014-06-18
Anticipated expiration: 2030-05-21
Also published as: JP2011243166A

Description

The present invention relates to a technique for summarizing text (documents).

It is known in the prior art (see Non-Patent Document 1) that combining important sentence extraction, which creates a summary by selecting sentences from a text made up of multiple sentences, with sentence shortening, which shortens the sentences themselves, makes it possible to include more of the original information in the summary even within a limited summary length.

As conventional sentence shortening methods, Non-Patent Document 2 describes extracting a set of important words from a sentence and joining them using an n-gram model. Non-Patent Document 3 describes shortening a sentence by deleting unimportant subtrees from the whole sentence, based on the dependency structure among the phrases (bunsetsu) that contain those words, so that the important subtrees remain. Non-Patent Document 4 proposes shortening a sentence by selecting phrases based on the strength of the dependencies between phrases, learned from a corpus.

Non-Patent Document 1: Manabu Okumura and Hidetsugu Nanba, "Automatic Text Summarization", Ohmsha, 2005, pp. 55-56
Non-Patent Document 2: Chiori Hori and Sadaoki Furui, "An Attempt at Automatic Summarization of Lecture Speech", Proceedings of the Workshop on Spoken Language Science and Engineering, 2001, pp. 165-171
Non-Patent Document 3: Takeshi Omori, Hidetaka Masuda, and Hiroshi Nakagawa, "Summarization of Web Newspaper Articles and Its Evaluation Using Articles for Mobile Terminals", IPSJ SIG Technical Report, Information Processing Society of Japan, 2003, NL153-1, pp. 1-8
Non-Patent Document 4: Kiwamu Yamagata, Satoshi Fukutomi, Kazuyuki Takagi, and Kazuhiko Ozeki, "Sentence Compression Using Statistical Information about Dependency Path Length", Text, Speech and Dialogue, Springer Berlin / Heidelberg, 2006, Volume 4188, pp. 127-134

However, the prior art can produce summaries with low readability. Conventional sentence shortening methods select important words using corpus statistics such as tf*idf, or use criteria based on dependency strengths between phrases learned from a corpus. When sentence shortening based on such criteria is combined with important sentence extraction, either the sentences selected by important sentence extraction are shortened and then arranged, or every sentence is shortened first and the shortened sentences are then selected and arranged. In either case a summary with low readability may be generated. For example, consider summarizing the following adjacent sentences n and n+1.

Sentence n: (important word 1, unimportant word 1, unimportant word 2)
Sentence n+1: (important word 2, unimportant word 1, important word 3)
A word that adjacent sentences have in common (for example, unimportant word 1) may make the summary more readable if it is included, but it may be judged not very important in the text as a whole and be deleted by sentence shortening.

Conversely, consider different words that are each important in the text as a whole: for example, important word 1 in sentence n and important words 2 and 3 in sentence n+1, where important word 1 and important word 2 are only weakly related and indicate different topics. Including one of these words (for example, important word 2 in sentence n+1) in the summary may shift the topic and lower readability. Because the word is judged important in the text as a whole, however, it is not deleted by sentence shortening and may end up in the summary.

Thus, the problem with the conventional combination of important sentence extraction and sentence shortening is that the two processes are independent and that only global information (tf*idf, or dependency strengths obtained from a corpus) is used as the measure for selecting words and phrases. Local information about the adjacent shortened sentences that make up the summary therefore cannot be used, and a summary with low readability may be generated.

To solve the above problem, the text summarization technique of the present invention stores in advance a weight parameter for each feature, where a feature is an order-sensitive combination of two feature elements, and a sentence element score for each sentence element that makes up a sentence. It converts the input sentences to generate one or more converted sentences for each input sentence, extracts feature elements from the converted sentences, obtains the content score of each converted sentence using the sentence element scores of the sentence elements it contains, obtains the connection score of the converted sentences using the extracted feature elements and the weight parameters, and searches for a permutation of converted sentences for which the sum of the content scores and the connection scores is the maximum or an approximation of the maximum.

The text summarization technique according to the present invention generates one or more converted sentences for each input sentence, obtains the content score of each converted sentence and the connection scores between converted sentences, and searches for a permutation of converted sentences that fits within the summary length. In addition to global information obtained from a corpus or the like, it can therefore use local information about how converted sentences connect to each other, and can generate a summary with high content coverage and high readability.
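To make the maximized quantity concrete, the following minimal Python sketch exhaustively scores every ordering of converted sentences that fits the summary length and keeps the best one. It is not the search procedure of the embodiment. The function name `best_summary`, the data layout, the use of character counts as the length, and the rule that at most one variant per input sentence is used are all assumptions made for this illustration.

```python
from itertools import permutations


def best_summary(candidates, content, connect, max_len):
    """Brute-force search over orderings of converted sentences.

    candidates: list of (source_id, text) pairs (hypothetical layout).
    content:    dict mapping text -> content score.
    connect:    function (previous_text, next_text) -> connection score.
    max_len:    summary length limit, counted in characters (assumption).
    """
    best, best_score = (), float("-inf")
    for r in range(1, len(candidates) + 1):
        for perm in permutations(candidates, r):
            ids = [source_id for source_id, _ in perm]
            if len(set(ids)) != len(ids):
                continue  # assumption: one variant per input sentence
            texts = [text for _, text in perm]
            if sum(len(t) for t in texts) > max_len:
                continue  # does not fit within the summary length
            score = sum(content[t] for t in texts)
            score += sum(connect(a, b) for a, b in zip(texts, texts[1:]))
            if score > best_score:
                best, best_score = tuple(texts), score
    return best, best_score
```

Exhaustive enumeration is practical only for a handful of candidates; it is shown here only to spell out the sum of content scores and connection scores that the search maximizes.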

FIG. 1 is a diagram showing a configuration example of the text summarization apparatuses 100 and 200.
FIG. 2 is a diagram showing the processing flow of the text summarization apparatus 100.
FIG. 3 is a diagram showing a data example of morphologically analyzed sentence 1.
FIG. 4 is a diagram showing a data example of morphologically analyzed sentence 2.
FIG. 5 is a diagram showing a data example of morphologically analyzed sentence 3.
FIG. 6 is a diagram showing a data example of morphologically analyzed sentence 4.
FIG. 7 is a diagram showing a configuration example of the converted sentence generation unit 112.
FIG. 8 is a diagram showing the processing flow of the converted sentence generation unit 112.
FIG. 9 is a diagram showing a data example of a language model.
FIG. 10 is a diagram showing the processing flow of the shortened sentence generation unit 112a.
FIG. 11 is a diagram for explaining the method of generating a feature vector.
FIG. 12 is a diagram showing a pseudocode example of a learning algorithm using the averaged perceptron.
FIG. 13 is a diagram showing a flowchart example corresponding to FIG. 12.
FIG. 14 is a diagram showing a data example of weight parameters.
FIG. 15 is a diagram showing a pseudocode example of the Held and Karp algorithm used to find the maximum value.
FIG. 16 is a diagram for explaining dynamic programming and beam search.
FIG. 17 is a diagram showing a processing flow example of the important sentence permutation search unit.
FIG. 18 is a block diagram of the text summarization apparatus 100.
FIG. 19 is a diagram showing a configuration example of the converted sentence generation unit 212.
FIG. 20 is a diagram showing the processing flow of the converted sentence generation unit 212.
FIG. 21 is a diagram showing a data example of weight parameters (conditional probabilities).

Embodiments of the present invention are described in detail below.

<Text Summarization Apparatus 100>
The text summarization apparatus 100 according to the first embodiment is described with reference to FIGS. 1 and 2. The text summarization apparatus 100 includes a storage unit 103, an input unit 131, a converted sentence generation unit 112, a feature element extraction unit 113, a content score calculation unit 115, a connection score calculation unit 117, an important sentence permutation search unit 119, and an output unit 135. For example, the feature element extraction unit 113, the content score calculation unit 115, the connection score calculation unit 117, and the important sentence permutation search unit 119 have the same configuration as in the text summarization apparatus described in the unpublished Japanese Patent Application No. 2010-010906.

The text summarization apparatus 100 summarizes text data consisting of one or more input sentences. Details follow. The text data consists of one or more passages; a passage expresses the thoughts or feelings of a speaker or writer and consists of one or more sentences.

<Storage Unit 103 and Input Unit 131>
The storage unit 103 stores in advance a weight parameter for each feature, where a feature is an order-sensitive combination of two feature elements, and a sentence element score for each sentence element that makes up a sentence.

How each kind of data is generated is described later. The storage unit 103 also stores and reads out, as needed, the input and output data and the intermediate data of each computation, which is how each computation proceeds. The data does not necessarily have to be stored in the storage unit 103, however; it may be passed directly between units.

Text data to be summarized is input to the text summarization apparatus 100 via the input unit 131. In this embodiment, a numerical summary length that limits the length of the summary is also input via the input unit 131. The summary length may be expressed in any unit that specifies the length of a summary, such as the number of characters, bytes, or words, and is set as appropriate by the person creating the summary.
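The three length units mentioned above can give quite different values for the same Japanese text, as the following small Python sketch shows. The function name `summary_length` and the `unit` parameter are hypothetical, UTF-8 is assumed for the byte count, and whitespace splitting is only a stand-in for word counting (Japanese text would need morphological analysis to be split into words).

```python
def summary_length(text, unit="chars"):
    """Measure a summary in characters, bytes, or words."""
    if unit == "chars":
        return len(text)
    if unit == "bytes":
        return len(text.encode("utf-8"))  # assumption: UTF-8 encoding
    if unit == "words":
        return len(text.split())  # stand-in for a real word segmenter
    raise ValueError(f"unknown unit: {unit}")


print(summary_length("最高です。"))           # 5
print(summary_length("最高です。", "bytes"))  # 15
```

Whichever unit is chosen, the same unit has to be used for the limit and for every candidate sentence.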

The input text data may already have undergone morphological analysis, named entity extraction, and dependency parsing. The input unit 131 is a keyboard, a file, a communication line, an input terminal for any of these, or the like, and is not limited to any particular form. As an example, FIGS. 3 to 6 show data for text data containing the following sentences 1 to 4.
Sentence 1: 「ここのピザを食べましたが、ピザに対するイメージが大きく変わりました。」 ("I ate the pizza here, and my image of pizza changed greatly.")
Sentence 2: 「イタリアで修行してきたシェフによる料理はどれも最高です。」 ("Every dish by the chef who trained in Italy is superb.")
Sentence 3: 「お料理は、イタリアンをベースとした創作料理で大変おいしいです。」 ("The food is creative cuisine based on Italian and is very delicious.")
Sentence 4: 「このお店はイタリアンの老舗であり、横浜駅から徒歩１０分の場所に位置する。」 ("This restaurant is a long-established Italian restaurant, located a 10-minute walk from Yokohama Station.")

<Converted Sentence Generation Unit 112>
The converted sentence generation unit 112 receives the input sentences via the input unit 131, converts them to generate one or more converted sentences for each input sentence (hereinafter "converted sentence variations") (s112), and sends them to the feature element extraction unit 113. For example, the converted sentence generation unit 112 includes only the shortened sentence generation unit 112a shown in FIG. 7.

(Shortened Sentence Generation Unit 112a)
Using the input sentences, the shortened sentence generation unit 112a generates one or more shortened sentences for each input sentence (hereinafter "shortened sentence variations") (s112a, see FIG. 8) and sends them to the feature element extraction unit 113 as converted sentences.

The shortened sentence generation unit 112a generates shortened sentences that do not read unnaturally. The method of generating shortened sentences is not limited, and an existing sentence shortening method may be used. For example, shortened sentence variations may be generated using the method described in Reference 1, or shortened sentence candidates that are simply subtrees may be generated as the variations using only the dependency structure.
[Reference 1] Takaaki Hasegawa, Hitoshi Nishikawa, Kenji Imamura, Genichiro Kikui, and Manabu Okumura, "Generation of Summary Text from Web Pages for Mobile Devices", Transactions of the Japanese Society for Artificial Intelligence, 2010, Vol. 25, No. 1, pp. 133-143

As one example, the following describes a method in which every subtree that contains the root phrase under the dependency relations is treated as a shortened sentence candidate, the generation probability of each candidate is computed using a language model based on word n-grams or part-of-speech n-grams, the candidates are narrowed down to the top N by generation probability, and these are output as the shortened sentence variations. Alternatively, a threshold may be set and the candidates narrowed down to those whose generation probability is at or above the threshold. The original sentence, that is, the unshortened input sentence, may also be included as one of the shortened sentence variations.
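The candidate generation described above can be sketched in Python as follows. This is a minimal illustration, not the procedure of FIG. 10: the names `rooted_subtrees` and `top_n_candidates` are hypothetical, the dependency structure given for sentence 2 is an assumption (the analyzed data of FIG. 4 is not reproduced in this text), and `score` stands in for the language-model generation probability.

```python
from itertools import combinations


def rooted_subtrees(heads):
    """Yield index tuples for every subtree that contains the root phrase.

    heads[i] is the index of the phrase that phrase i depends on;
    the root phrase has head -1.
    """
    root = heads.index(-1)
    others = [i for i in range(len(heads)) if i != root]
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            kept = set(subset) | {root}
            # Keep a phrase only if the phrase it depends on is also kept.
            if all(heads[i] in kept for i in subset):
                yield tuple(sorted(kept))


def top_n_candidates(phrases, heads, score, n):
    """Return the n highest-scoring shortened sentence candidates."""
    candidates = ["".join(phrases[i] for i in idx)
                  for idx in rooted_subtrees(heads)]
    return sorted(candidates, key=score, reverse=True)[:n]


# Sentence 2 split into phrases, with an assumed dependency structure.
phrases = ["イタリアで", "修行してきた", "シェフによる", "料理は", "どれも", "最高です。"]
heads = [1, 2, 3, 5, 5, -1]
print(len(list(rooted_subtrees(heads))))  # 10
```

Under the assumed structure the enumeration yields ten candidates, the same number of variations that the text lists for sentence 2. Enumerating all subsets is exponential in the number of phrases, which is acceptable only for short sentences.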

In this case, the storage unit 103 stores word connection probabilities in advance as a language model. For example, the language model stores n-gram probabilities of words or parts of speech, estimated from the frequencies of word or part-of-speech sequences in a large text corpus (see FIG. 9). A processing example of the shortened sentence generation unit 112a is described below with reference to FIG. 10.

The following processing is performed for every input sentence. First, a dependency-parsed sentence is input (s112a-1), and a pointer is set to the end of the sentence (s112a-2). If the pointer is not at the beginning of the sentence (s112a-3), the phrase at the pointer is attached to the front of every shortened sentence candidate of that sentence (s112a-4). If the phrase at the pointer is the root phrase representing the predicate, or if the phrase on which it depends exists (s112a-5), each candidate with the root phrase or the phrase at the pointer attached is generated as a new shortened sentence candidate, and its generation probability is computed (s112a-6). The pointer is then moved one phrase toward the beginning of the sentence (s112a-7). If the phrase at the pointer does not satisfy this condition (s112a-5), no new candidate is generated and the pointer is simply moved one phrase toward the beginning of the sentence (s112a-7). These steps are repeated until no phrase remains to be processed, that is, up to the beginning of the sentence (s112a-3).

During this process, the search may be performed as a beam search that limits the number of shortened sentence candidates to at most the top N by generation probability. Other efficient techniques such as dynamic programming may also be applied during the search. Once all phrases have been processed (s112a-3), the top N (N-best) candidates by generation probability, or the candidates whose generation probability is at or above a predetermined threshold, are selected (s112a-8) and output as the shortened sentence variations. The generation probability is obtained, for example, by the following equation.

Here, W is the word sequence of the shortened sentence candidate, w_i is the i-th word, and w_0 and w_{n+1} are the sentence-start symbol <s> and the sentence-end symbol </s>, respectively. The generation probability may also be normalized by taking its geometric mean over the length in words.
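One plausible reading of the symbols defined here is an n-gram chain probability over the padded word sequence. The Python sketch below assumes a bigram model (n = 2), which the text does not fix; the function name `generation_probability`, the dictionary layout of `bigram_prob`, and the absence of smoothing are likewise assumptions for illustration.

```python
import math


def generation_probability(words, bigram_prob, normalize=False):
    """Probability of a word sequence under a bigram model (assumed order).

    bigram_prob maps (previous_word, word) -> P(word | previous_word).
    An unseen bigram raises KeyError because no smoothing is applied.
    """
    seq = ["<s>"] + list(words) + ["</s>"]  # w_0 and w_{n+1}
    log_p = sum(math.log(bigram_prob[(prev, cur)])
                for prev, cur in zip(seq, seq[1:]))
    if normalize:
        # Geometric mean over the number of words.
        log_p /= max(len(words), 1)
    return math.exp(log_p)
```

Working in log space avoids underflow for long sentences. The normalized variant reduces the bias toward very short candidates, since an unnormalized product shrinks with every additional word.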

As an example, the shortened sentence variations generated from sentence 1 are shown below. In this example, the 12 shortened sentence candidates with the highest generation probabilities are used as the variations. Character counts refer to the Japanese text.
Shortened sentence 1-1: 「大きく変わりました。」 ("It changed greatly.") (10 characters)
Shortened sentence 1-2: 「イメージが変わりました。」 ("My image changed.") (12 characters)
Shortened sentence 1-3: 「イメージが大きく変わりました。」 ("My image changed greatly.") (15 characters)
Shortened sentence 1-4: 「ピザに対するイメージが変わりました。」 ("My image of pizza changed.") (18 characters)
Shortened sentence 1-5: 「ピザに対するイメージが大きく変わりました。」 ("My image of pizza changed greatly.") (21 characters)
Shortened sentence 1-6: 「ピザを食べましたが、イメージが変わりました。」 ("I ate the pizza, and my image changed.") (22 characters)
Shortened sentence 1-7: 「ピザを食べましたが、イメージが大きく変わりました。」 ("I ate the pizza, and my image changed greatly.") (25 characters)
Shortened sentence 1-8: 「ピザを食べましたが、ピザに対するイメージが変わりました。」 ("I ate the pizza, and my image of pizza changed.") (28 characters)
Shortened sentence 1-9: 「ピザを食べましたが、ピザに対するイメージが大きく変わりました。」 ("I ate the pizza, and my image of pizza changed greatly.") (31 characters)
Shortened sentence 1-10: 「ここのピザを食べましたが、イメージが大きく変わりました。」 ("I ate the pizza here, and my image changed greatly.") (28 characters)
Shortened sentence 1-11: 「ここのピザを食べましたが、ピザに対するイメージが変わりました。」 ("I ate the pizza here, and my image of pizza changed.") (31 characters)
Shortened sentence 1-12: 「ここのピザを食べましたが、ピザに対するイメージが大きく変わりました。」 ("I ate the pizza here, and my image of pizza changed greatly.") (34 characters)

Similarly, the shortened sentence variations generated from sentence 2 are shown below.
Shortened sentence 2-1: 「最高です。」 ("It is superb.") (5 characters)
Shortened sentence 2-2: 「どれも最高です。」 ("Every one is superb.") (8 characters)
Shortened sentence 2-3: 「料理は最高です。」 ("The dishes are superb.") (8 characters)
Shortened sentence 2-4: 「料理はどれも最高です。」 ("Every dish is superb.") (11 characters)
Shortened sentence 2-5: 「シェフによる料理は最高です。」 ("The dishes by the chef are superb.") (14 characters)
Shortened sentence 2-6: 「シェフによる料理はどれも最高です。」 ("Every dish by the chef is superb.") (17 characters)
Shortened sentence 2-7: 「修行してきたシェフによる料理は最高です。」 ("The dishes by the chef who trained are superb.") (20 characters)
Shortened sentence 2-8: 「修行してきたシェフによる料理はどれも最高です。」 ("Every dish by the chef who trained is superb.") (23 characters)
Shortened sentence 2-9: 「イタリアで修行してきたシェフによる料理は最高です。」 ("The dishes by the chef who trained in Italy are superb.") (25 characters)
Shortened sentence 2-10: 「イタリアで修行してきたシェフによる料理はどれも最高です。」 ("Every dish by the chef who trained in Italy is superb.") (28 characters)

Similarly, the shortened sentence variations generated from sentence 3 are shown below.
Shortened sentence 3-1: 「おいしいです。」 ("It is delicious.") (7 characters)
Shortened sentence 3-2: 「大変おいしいです。」 ("It is very delicious.") (9 characters)
Shortened sentence 3-3: 「創作料理でおいしいです。」 ("It is creative cuisine and delicious.") (12 characters)
Shortened sentence 3-4: 「創作料理で大変おいしいです。」 ("It is creative cuisine and very delicious.") (14 characters)
Shortened sentence 3-5: 「イタリアンをベースとした創作料理でおいしいです。」 ("It is creative cuisine based on Italian and delicious.") (24 characters)
Shortened sentence 3-6: 「イタリアンをベースとした創作料理で大変おいしいです。」 ("It is creative cuisine based on Italian and very delicious.") (26 characters)
Shortened sentence 3-7: 「お料理は、おいしいです。」 ("The food is delicious.") (12 characters)
Shortened sentence 3-8: 「お料理は、大変おいしいです。」 ("The food is very delicious.") (14 characters)
Shortened sentence 3-9: 「お料理は、創作料理でおいしいです。」 ("The food is creative cuisine and delicious.") (17 characters)
Shortened sentence 3-10: 「お料理は、創作料理で大変おいしいです。」 ("The food is creative cuisine and very delicious.") (19 characters)
Shortened sentence 3-11: 「お料理は、イタリアンをベースとした創作料理でおいしいです。」 ("The food is creative cuisine based on Italian and delicious.") (29 characters)
Shortened sentence 3-12: 「お料理は、イタリアンをベースとした創作料理で大変おいしいです。」 ("The food is creative cuisine based on Italian and very delicious.") (31 characters)

<Feature Element Extraction Unit 113>
The feature element extraction unit 113 receives the converted sentences, extracts feature elements (for example, content words: nouns, verbs, and adjectives) from them (s113), and sends the feature elements to the connection score calculation unit 117. Because the converted sentence generation unit 112 generates the converted sentence variations from input sentences that have already undergone morphological analysis, named entity extraction, and dependency parsing (see FIGS. 3 to 6), the feature elements can be extracted based on the morphological analysis data attached to each converted sentence.
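Extracting content words from morphologically analyzed data amounts to filtering by part of speech, as in the following Python sketch. The `(base_form, part_of_speech)` tuple layout, the English tag names, and the function name `extract_feature_elements` are assumptions for illustration; the actual data format of FIGS. 3 to 6 is not reproduced in this text.

```python
# Parts of speech treated as content words (tag names are hypothetical).
CONTENT_POS = {"noun", "verb", "adjective"}


def extract_feature_elements(morphemes):
    """Return the base forms of content words, in sentence order.

    morphemes: list of (base_form, part_of_speech) pairs.
    """
    return [base for base, pos in morphemes if pos in CONTENT_POS]


# Shortened sentence 3-2, with illustrative analysis results.
morphemes = [("大変", "adverb"), ("おいしい", "adjective"),
             ("です", "auxiliary verb"), ("。", "symbol")]
print(extract_feature_elements(morphemes))  # ['おいしい']
```

Using base forms rather than surface forms lets inflected variants of the same word map to the same feature element, which is a design choice of this sketch rather than something the text specifies.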

<Content Score Calculation Unit 115>
The content score calculation unit 115 receives the sentence elements contained in each converted sentence, obtains the content score of each converted sentence using the sentence element scores of those elements (s115), and sends the content scores to the important sentence permutation search unit 119.

For example, when feature elements are used as the sentence elements (in this embodiment the feature elements are content words), the content score calculation unit 115 takes the output of the feature element extraction unit 113 as its input, looks up the sentence element score of each input content word in the storage unit 103 using the content word as the key, and sums the sentence element scores of all content words contained in the sentence. This sum is the content score and can be expressed by the following equation.

Content(s) = Σ_{p∈s} Weight(p)

Here, Content(s) is the content score of sentence s, and Weight(p) is the sentence element score of a content word p contained in sentence s.
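The lookup-and-sum computation can be sketched in Python as follows. The function name `content_score` and the plain dictionary standing in for the storage unit 103 are hypothetical. Two details are assumptions that the text does not settle: a content word that occurs more than once in a sentence is counted once, and a word with no stored score contributes 0.

```python
def content_score(content_words, weight):
    """Sum the sentence element scores of a sentence's content words.

    content_words: content words extracted from one converted sentence.
    weight:        dict mapping content word -> sentence element score.
    """
    # Assumption: each distinct word counts once; unknown words score 0.
    return sum(weight.get(p, 0.0) for p in set(content_words))


weight = {"料理": 2.0, "最高": 1.5}  # illustrative scores
print(content_score(["シェフ", "料理", "最高"], weight))  # 3.5
```

Because the score is a plain sum, longer converted sentences tend to score higher, which is what makes the summary length limit the binding constraint in the search.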

[Method of Calculating Sentence Element Scores]
The sentence element score of each content word is stored in the storage unit 103 in advance. The text summarization apparatus 100 has, for example, a sentence element score calculation unit (not shown) that obtains the sentence element scores. As a sentence element score, for example, the number of times the word appears in the text to be summarized can be used.

Alternatively, the sentence element score calculation unit obtains sentence element scores in advance from the sentence elements contained in a text set prepared for learning sentence element scores. For example, it counts in advance the number of sentences in that text set that contain each sentence element, records the count cnt in a database, and obtains the sentence element score from this database.

For example, if sentence elements that occur more often in the learning text set are more important, the sentence element score calculation unit computes the score so that it increases as cnt increases; in this case the score is cnt itself, the logarithm of cnt, or the like. If sentence elements that occur more often in the learning text set are less important, it computes the score so that it decreases as cnt increases; in this case the score is the number of texts in the text set divided by cnt, the logarithm of that quotient, or the like. With this configuration, an appropriate content score can be calculated for the set of sentences to be summarized.
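The four scoring variants named above (cnt, log cnt, the number of texts divided by cnt, and the logarithm of that quotient) can be written as one small Python function. The function name `sentence_element_score` and its parameter names are hypothetical, and the natural logarithm is an assumption since the text does not state a base.

```python
import math


def sentence_element_score(cnt, num_texts, frequent_is_important=True,
                           use_log=True):
    """Score a sentence element from its count in the learning text set.

    cnt:       number of sentences containing the element (must be > 0).
    num_texts: number of texts in the learning text set.
    """
    if cnt <= 0:
        raise ValueError("cnt must be positive")
    if frequent_is_important:
        # Score grows with cnt.
        return math.log(cnt) if use_log else float(cnt)
    # Score shrinks as cnt grows (an idf-like weighting).
    ratio = num_texts / cnt
    return math.log(ratio) if use_log else ratio


print(sentence_element_score(4, 100, frequent_is_important=False,
                             use_log=False))  # 25.0
```

The logarithmic variants dampen the influence of very frequent or very rare elements, which keeps a single word from dominating the content score of a sentence.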

Of course, the results of information extraction, such as evaluative information, can also be used as sentence elements. In that case, instead of the content words described above, sentence element scores are given to expressions that evaluate some target, such as "the image quality is good" or "the food is delicious", and content scores are given to sentences based on those scores.

<Connection Score Calculation Unit 117>
The connection score calculation unit 117 receives the feature elements, obtains the connection scores of the converted sentences using the weight parameters corresponding to the features obtained from the feature elements (s117), and sends them to the important sentence permutation search unit 119. The connection score of two sentences is a value indicating how well the two sentences connect.

For example, suppose there are the sentences "I had a meal yesterday" and "But it was not very good". These two sentences read naturally in the order "I had a meal yesterday", "But it was not very good", but very unnaturally in the order "But it was not very good", "I had a meal yesterday". This is because the sentence "But it was not very good" implicitly presupposes that the topic of a meal has already appeared in the preceding sentence.

Similarly, when a summary is generated by joining multiple sentences, the generated summary may be very hard to read and unnatural unless the sentences can be ordered appropriately.

Suppose a score can be assigned to how well sentences connect, and the order “I ate a meal yesterday”, “But it was not very good” can be given a higher score than the order “But it was not very good”, “I ate a meal yesterday”. Sentences can then be ordered according to the score. That is, given two sentences si and sj, the scores of the order si, sj and of the order sj, si are each calculated, and the order with the higher score is adopted.

Therefore, the connection score is defined first. In this embodiment, as one example, the connection score for the case where sentence sj appears immediately after sentence si is defined by the following function Connect(sj|si).

Connect(sj|si) = w^Tφ(si,sj)  (2)
Here, w is the weight parameter described above, φ(si,sj) is a binary vector representing the connection between sentence si and sentence sj (hereinafter the “feature vector”), and T denotes transposition. w^Tφ(si,sj) is the inner product of w and φ(si,sj). As one example, the weight parameter w is calculated in advance by the method described later and stored in the storage unit 103, from which it is read when summarization is performed.

The connection score calculation unit 117 includes, for example, a feature vector generation unit 117a and a calculation unit 117b.
(Feature vector generation unit 117a)
The feature vector generation unit 117a takes each element of the Cartesian product of the feature elements contained in the two sentences si and sj as a feature of the two sentences, and generates a feature vector φ(si,sj) in which the dimensions corresponding to those features are 1 and all other dimensions are 0.

As one example, the features representing the connection between two sentences are given by the Cartesian product of the content words (nouns, verbs, and adjectives) contained in the two sentences. This is described with reference to FIG. 11. Suppose sentence si is 「昨日ご飯を食べました」 (“I ate a meal yesterday”) and sentence sj is 「しかしあまりおいしくありませんでした」 (“But it was not very good”). Sentence si contains the content words 「昨日」 (yesterday), 「ご飯」 (meal), and 「食べ」 (eat), and sentence sj contains the content words 「おいし」 (delicious) and 「あ」. As shown in FIG. 11, their Cartesian product consists of six word pairs: (昨日, おいし), (昨日, あ), (ご飯, おいし), (ご飯, あ), (食べ, おいし), and (食べ, あ). The feature vector φ(si,sj) is a binary vector in which the dimensions corresponding to these six features are 1. Unless features are pruned, the dimensionality of the feature vector equals the number of features that appear in the text set used for training. In practice the vector therefore has far more dimensions, but for simplicity FIG. 11 shows a six-dimensional vector corresponding to the six features in the figure. Besides the features shown above, dependencies between words, named entities, and the like can also be used as features.
Feature pruning is a process that, when the weight parameters are calculated, deletes features considered to matter little as indicators of how well sentences connect, thereby reducing the number of features.
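The construction of φ(si,sj) from the example sentences can be sketched as follows. This is only an illustration: the helper names are hypothetical, the content words are assumed to have been extracted already by a morphological analyzer, and the extra training feature (料理, 野菜) is made up so that the vector contains a zero dimension.

```python
from itertools import product


def pair_features(words_i, words_j):
    """Cartesian product of the content words of s_i and s_j, as a set of pairs."""
    return set(product(words_i, words_j))


def to_binary_vector(features, feature_index):
    """Dense 0/1 vector: 1 where the feature occurs, 0 in every other dimension."""
    vec = [0] * len(feature_index)
    for feature in features:
        if feature in feature_index:  # features unseen in training have no dimension
            vec[feature_index[feature]] = 1
    return vec


s_i = ["昨日", "ご飯", "食べ"]   # content words of "I ate a meal yesterday"
s_j = ["おいし", "あ"]           # content words of "But it was not very good"
feats = pair_features(s_i, s_j)

# Hypothetical feature inventory from training: the six pairs plus one unrelated pair.
inventory = sorted(feats | {("料理", "野菜")})
feature_index = {feature: dim for dim, feature in enumerate(inventory)}
phi = to_binary_vector(feats, feature_index)
print(len(feats), phi)
```

In practice the inventory is very large and each vector is almost entirely zero, so a sparse representation (for example, just the set of active features) is the more natural choice.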

(Calculation unit 117b)
The calculation unit 117b obtains the inner product of the weight parameter w and the feature vector φ(si,sj) as the connection score Connect(sj|si) of the two sentences. That is, it evaluates Expression (2) using the weight parameter and the feature vector.
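Because φ(si,sj) is binary, the inner product in Expression (2) reduces to summing the weights of the active features. A minimal sketch, assuming the weights are held in a dictionary keyed by word pair (the weight values below are invented for illustration and are not taken from FIG. 14):

```python
def connect(w, words_i, words_j):
    """Connect(s_j | s_i) = w^T phi(s_i, s_j), with phi stored implicitly.

    Every (word of s_i, word of s_j) pair is an active feature; pairs that
    have no learned weight contribute 0.
    """
    return sum(w.get((a, b), 0.0) for a in set(words_i) for b in set(words_j))


# Hypothetical weights; real values would come from training.
w = {("ご飯", "おいし"): 1.5, ("食べ", "おいし"): 0.5, ("おいし", "ご飯"): -1.0}
s_i = ["昨日", "ご飯", "食べ"]
s_j = ["おいし", "あ"]

forward = connect(w, s_i, s_j)    # s_j placed after s_i
backward = connect(w, s_j, s_i)   # s_i placed after s_j
print(forward, backward)
```

The features are ordered pairs, so the score is asymmetric, which is what allows the two possible orders of a sentence pair to be compared.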

(Calculation unit for the connection score of three or more sentences)
The connection score calculation unit 117 uses the two-sentence connection scores Connect(sj|si) to obtain the connection score w^TΦ(x,y) of a sequence of three or more sentences (s117). In this case, the connection score represents how well a set of three or more sentences connects as a whole. x denotes the given set of sentences, and y denotes an ordering of those sentences.

For example, given sentences s1, s2, and s3, there are six possible orderings. Of these six, the ordering with the highest connection score is adopted as the ordering of the three given sentences. This requires calculating the connection score of a sequence of three or more sentences. Here, such a sequence is decomposed into adjacent two-sentence sequences, and the sum of the connection scores of the decomposed pairs is taken as the connection score of the whole sequence. The connection score w^TΦ(x,y) of a sequence of three or more sentences is defined as follows.
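The defining expression is not reproduced in this text. Based on the surrounding description (a sum of pairwise connection scores, with s0 and s4 marking the head and the end of the document), for the ordering s2, s3, s1 it presumably takes the following form. The equation number (3) is inferred from the neighboring Expressions (2) and (4).

```latex
w^{T}\Phi(x,y)
  = \mathrm{Connect}(s_2 \mid s_0)
  + \mathrm{Connect}(s_3 \mid s_2)
  + \mathrm{Connect}(s_1 \mid s_3)
  + \mathrm{Connect}(s_4 \mid s_1)
  \tag{3}
```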

y denotes the ordering of the sentences. In this example, it indicates that the given set of sentences x = {s1, s2, s3} is arranged in the order s2, s3, s1. In addition, s0 and s4 denote the head and the end of the document, respectively. That is, they indicate that sentence s2 is the first of the three sentences and sentence s1 is the last. Introducing s0 and s4 makes it possible to take into account sentences that tend to appear at the head of a document and sentences that tend to appear at its end.
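For three sentences the best ordering can still be found by enumerating all six permutations. The sketch below scores each ordering as the sum of its adjacent-pair scores, with head and end markers added. The marker names and the pair scores are invented for illustration.

```python
from itertools import permutations

HEAD, END = "s0", "s_end"   # boundary markers (s0 and s4 in the three-sentence example)


def sequence_score(order, pair_score):
    """Sum of the connection scores of adjacent pairs, boundaries included."""
    path = [HEAD, *order, END]
    return sum(pair_score.get((prev, nxt), 0.0) for prev, nxt in zip(path, path[1:]))


def best_order(sentences, pair_score):
    """Exhaustive search over all orderings (n! of them), for small n only."""
    return max(permutations(sentences), key=lambda y: sequence_score(y, pair_score))


# Hypothetical pairwise scores Connect(next | previous); missing pairs score 0.
pair_score = {("s0", "s2"): 1.0, ("s2", "s3"): 2.0, ("s3", "s1"): 1.0, ("s1", "s_end"): 0.5}
best = best_order(["s1", "s2", "s3"], pair_score)
print(best, sequence_score(best, pair_score))
```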

[Method of calculating the weight parameter]
The method of calculating the weight parameter w is described. Here it is assumed that the sentence order of a text written by a human is correct, and that a weight parameter w that can reproduce that order is a good parameter. In other words, a parameter w that can arrange a given set of sentences in a readable order is assumed to be one that, given the set of sentences contained in some human-written text, can restore them to their original order or arrange them in an order close to it. Under this assumption, the parameter w is estimated from a set of human-written texts. As one example, the weight parameter w can be calculated from a text set by the algorithm shown in FIGS. 12 and 13 (see Reference 2).
[Reference 2] Michael Collins, “Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms”, In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2002, Volume 10, pp. 1-8

The text summarization apparatus 100 has, for example, a weight parameter calculation unit (not shown) that obtains the weight parameter. The weight parameter calculation unit receives training data τ consisting of Q pairs of a sentence set x_q and the correct ordering y_q of that set, where q = 1, 2, ..., Q, and learns the weight parameter w by the algorithm shown in FIG. 12. Specifically, the Q training examples are taken up one at a time, and the sentences in x_q are ordered using the current w (line 4 of FIG. 12). That is, among the possible orderings, the ordering y' with the maximum connection score w^TΦ(x,y) under the current weight parameter w is obtained (the argmax operation on line 4 of FIG. 12). The argmax operation is described in detail later. If y_q ≠ y' (that is, if the ordering with the maximum connection score differs from the correct ordering), the current weight parameter w has failed to reproduce the correct ordering. In that case, the weight parameter w is updated so that the sentences can be ordered correctly (line 5 of FIG. 12). If the current parameter w reproduced the correct ordering, the weight parameter is not updated (line 6 of FIG. 12).

This is done for the Q training examples and repeated N times, updating the weight parameter w along the way. The sum v of the weight parameter w at each moment is divided by N×Q, the number of times w was added, to average it. The result is taken as the final weight parameter w (line 9 of FIG. 12) and stored in the storage unit 103. N depends on the nature of the weight parameter w to be obtained and can be determined as appropriate, for example by experiment, before w is calculated. The training data τ may be any text in which the head and end of each document and the sentence boundaries are marked. Of course, refinements are possible, such as learning the parameter w only from texts of the same genre as the text to be summarized. As an example, estimated weight parameters are shown in FIG. 14. The feature column in FIG. 14 corresponds to the features shown in FIG. 11, and the weight column gives the weight parameter of each feature. According to FIG. 14, when sentences are ordered, placing a sentence containing a word such as “vegetables”, “oil”, or “friend” after a sentence containing the word “cooking” tends to give a correct order. On the other hand, placing a sentence containing “night view” or “elegant” after a sentence containing “cooking” tends to give an incorrect order.
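The training loop described above can be sketched as an averaged perceptron. FIGS. 12 and 13 are not reproduced in this text, so the update rule used here, w ← w + Φ(x_q, y_q) − Φ(x_q, y'), is the standard structured perceptron update from Reference 2 and is an assumption about the figure. The function names and toy documents are also illustrative, and the argmax is done exhaustively, which is only workable for very small sentence sets.

```python
from itertools import permutations

HEAD, END = ("<head>",), ("<end>",)   # boundary pseudo-sentences


def seq_features(order):
    """Phi(x, y): counts of word-pair features over adjacent sentences."""
    path = [HEAD, *order, END]
    feats = {}
    for prev, nxt in zip(path, path[1:]):
        for pair in {(a, b) for a in prev for b in nxt}:   # binary phi per adjacent pair
            feats[pair] = feats.get(pair, 0) + 1
    return feats


def seq_score(w, order):
    return sum(w.get(f, 0.0) * c for f, c in seq_features(order).items())


def argmax_order(w, sentences):
    """Exhaustive argmax over all orderings of the given sentences."""
    return max(permutations(sentences), key=lambda y: seq_score(w, y))


def train_averaged_perceptron(documents, n_epochs):
    w, v = {}, {}
    for _ in range(n_epochs):                      # N passes
        for correct in documents:                  # Q training examples
            correct = tuple(correct)
            predicted = argmax_order(w, sorted(correct))
            if predicted != correct:               # failed to reproduce the order
                for f, c in seq_features(correct).items():
                    w[f] = w.get(f, 0.0) + c
                for f, c in seq_features(predicted).items():
                    w[f] = w.get(f, 0.0) - c
            for f, val in w.items():               # accumulate w after every example
                v[f] = v.get(f, 0.0) + val
    total = n_epochs * len(documents)              # N x Q
    return {f: val / total for f, val in v.items()}


# Toy documents: each sentence is a tuple of content words, in the correct order.
doc1 = (("rice", "eat"), ("delicious",))
doc2 = (("pizza", "order"), ("hot",))
weights = train_averaged_perceptron([doc1, doc2], n_epochs=3)
print(argmax_order(weights, sorted(doc1)))
```

Averaging the weights over all steps, rather than keeping only the final w, is what the division by N×Q in the text corresponds to.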

(argmax operation)
The argmax operation finds, among the orderings that can be formed from the sentences in the sentence set x_q, the ordering y' with the maximum connection score. This is a so-called traveling salesman problem, for which an exact solution is difficult to obtain in a short time. For example, for Q sentences the ordering with the highest connection score must be found among Q! orderings, so the amount of computation grows at least exponentially as Q increases.

Therefore, as one example, an approximate solution can be obtained using dynamic programming and beam search, as a substitute for the exact argmax operation. Specifically, as one example, the Held and Karp algorithm, a kind of dynamic programming, is used (see Reference 3). FIG. 15 shows the search for an approximate sentence ordering by the Held and Karp algorithm.
[Reference 3] Michael Held and Richard M. Karp, “A dynamic programming approach to sequencing problems”, Journal of the Society for Industrial and Applied Mathematics (SIAM), 1962, Vol. 10, No. 1, pp. 196-210

S is the set of sentences to be ordered, where s0 denotes the head of the document and s(Q+1) denotes its end. The problem is thus to search, among all paths that start at s0, pass through each of the sentences s1 to sQ exactly once, and arrive at s(Q+1), for the path with the highest score. M is a matrix storing the connection scores between all sentences in S. For example, M_{k,j} is the connection score of sentences sk and sj, that is, w^Tφ(sk,sj). H_i(C,sk) is the hypothesis that has already passed through C⊆S and added sentence sk at step i, together with the score of that hypothesis. H^* is the path whose sentence ordering has the highest connection score.

When ordering sentences, the Held and Karp algorithm searches efficiently by discarding all but the highest-scoring hypothesis whenever several hypotheses share the same last-selected sentence and the same set of previously selected sentences, regardless of their order (line 5 of FIG. 15). For example, the orderings s1, s2, s3 and s2, s1, s3 shown by broken lines in FIG. 16 both end with s3 and contain the same previously selected sentences, so there is no need to expand both hypotheses. If, for example, s2, s1, s3 has the higher connection score, only the hypothesis for that ordering is expanded and the hypothesis for s1, s2, s3 is discarded.

Even so, the search space remains vast, so only the top b hypotheses by connection score at step i are expanded into hypotheses at step i+1 (beam search; line 4 of FIG. 15). That is, all hypotheses outside the top b are discarded. For example, if the ordering s3, s1, s2 shown by the dash-dot line in FIG. 16 is not within the top b at step i = 3, this hypothesis need not be expanded from i = 4 onward. This narrows the search space considerably and allows an approximate solution to be found even more efficiently.
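The combination of Held and Karp recombination and beam pruning can be sketched as below. FIG. 15 is not reproduced in this text, so this is an illustration of the two ideas as described rather than a transcription of the figure. The function name, the matrix layout (index 0 for the head, n+1 for the end), and the scores are assumptions.

```python
def order_sentences(n, M, beam):
    """Approximate best ordering of sentences 1..n.

    M[k][j] is the connection score of sentence j following sentence k;
    index 0 is the document head and index n + 1 the document end.
    """
    # A hypothesis is keyed by (set of sentences used so far, last sentence).
    hyps = {(frozenset(), 0): (0.0, [])}
    for _ in range(n):
        expanded = {}
        for (visited, last), (score, path) in hyps.items():
            for j in range(1, n + 1):
                if j in visited:
                    continue
                key = (visited | {j}, j)
                cand = (score + M[last][j], path + [j])
                # Recombination: same set and same last sentence, keep only the best.
                if key not in expanded or cand[0] > expanded[key][0]:
                    expanded[key] = cand
        # Beam pruning: only the top `beam` hypotheses survive to the next step.
        ranked = sorted(expanded.items(), key=lambda kv: kv[1][0], reverse=True)
        hyps = dict(ranked[:beam])
    # Close every surviving hypothesis with the end marker and pick the best.
    (visited, last), (score, path) = max(
        hyps.items(), key=lambda kv: kv[1][0] + M[kv[0][1]][n + 1])
    return path, score + M[last][n + 1]


# Hypothetical 3-sentence example; all unspecified scores are 0.
n = 3
M = [[0.0] * (n + 2) for _ in range(n + 2)]
M[0][2], M[2][3], M[3][1], M[1][4] = 1.0, 2.0, 1.0, 0.5
print(order_sentences(n, M, beam=10))
```

With a beam wide enough to hold every recombined state the search is exact. A narrow beam trades that guarantee for speed.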

For example, with Q = 100 and b = 20, and without the Held and Karp algorithm, only the top 20 of the 100 hypotheses generated at step i = 1 are expanded. Each of these 20 is expanded into 99 hypotheses, and of the resulting 1980 hypotheses again only the top 20 need be expanded. Without beam search, each of the 100 hypotheses is expanded into 99 hypotheses, and each of the resulting 9900 hypotheses is expanded into 98 more. Because this process is repeated up to step i = Q, the amount of computation is enormous compared with the case where beam search is used.
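The counts in this example can be reproduced with a few lines of arithmetic. The helper below is illustrative only. It counts how many hypotheses are generated at each step when no recombination is applied, with and without a beam.

```python
def hypotheses_generated(q, steps, beam=None):
    """Number of hypotheses generated at each of the first `steps` steps."""
    counts, alive = [], 1
    for i in range(steps):
        generated = alive * (q - i)   # each live hypothesis can append any unused sentence
        counts.append(generated)
        alive = generated if beam is None else min(generated, beam)
    return counts


with_beam = hypotheses_generated(100, 3, beam=20)
without_beam = hypotheses_generated(100, 3)
print(with_beam, without_beam)
```

With the beam the per-step count stays below b × Q, whereas without it the count is multiplied by the number of remaining sentences at every step.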

The argmax operation is performed within the process of obtaining the weight parameter w (line 4 of FIG. 12), and the connection scores M are used during the operation (see FIG. 15). These connection scores are obtained from Expression (2) and the like, using the weight parameter (still being updated) at the time the argmax operation is called.

<Important sentence permutation search unit 119>
The important sentence permutation search unit 119 receives the content scores and the connection scores, searches for the permutation of converted sentences for which their sum is the maximum or an approximation of the maximum (s119), and transmits it to the output unit 135.

Based on the connection scores and the content scores, the important sentence permutation search unit 119 searches for an ordering of one or more converted sentences, obtained from the text data to be summarized, that fits within the summary length. At this time, a constraint may be added so that permutations among variations of converted sentences generated from the same input sentence are not considered. In that case, the connection score calculation unit also need not obtain connection scores between variations generated from the same input sentence. For example, an input sentence identifier is attached to each shortened-sentence variation, and neither connection scores nor permutations are obtained between shortened sentences that have the same input sentence identifier.

Let S^* be the sentence ordering for which the sum of the content score and the connection score is the maximum or an approximation of the maximum, let U be the set of all sentence orderings that can be constructed from the text to be summarized, and let S be any one ordering in U. Sentences are not simply selected one by one. Rather, the best ordering S^* is selected from U by considering both the content score and the connection score. As one example, S^* can be defined as follows.
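The defining expression is not reproduced in this text. From the explanation of λ, (si,sj)∈S, length(S), and K, and from the argmax form quoted in the description, it presumably reads as follows, with the equation number (4) taken from the later reference to Expression (4).

```latex
S^{*} = \operatorname*{argmax}_{S \in U}
  \Bigl[ \sum_{s \in S} \mathrm{Content}(s)
       + \lambda \sum_{(s_i, s_j) \in S} \mathrm{Connect}(s_j \mid s_i) \Bigr]
  \quad \text{subject to } \mathrm{length}(S) \le K
  \tag{4}
```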

λ is a parameter that controls whether the content score or the connection score is weighted more heavily. (si,sj)∈S indicates that, of two adjacent sentences si and sj in the ordering S, si appears before sj. length(S) denotes the length of the ordering S, and K denotes the summary length.

The best summary S^* is the one that maximizes the sum of the total ΣContent(s) of the content scores Content(s) of all sentences s in S^* and the total ΣConnect(sj|si) of the connection scores Connect(sj|si) between adjacent sentences among them. Here, the maximum of the sum of the content score and the connection score covers not only argmax[ΣContent(s)+ΣConnect(sj|si)] but also the λ-weighted form argmax[ΣContent(s)+λΣConnect(sj|si)].

However, the problem of finding such an S^*, that is, of performing the argmax operation in Expression (4), is also one for which an exact solution is difficult to obtain in a short time. To deal with this, as when learning the weight parameter w, a search is performed using dynamic programming and beam search to obtain an approximate solution. Specifically, H_i(C,sk) in FIG. 15 is taken to represent the sum of the connection score and the content score (when learning w it represents the connection score only), and only hypotheses with a high sum are expanded in turn. Unlike when learning w, however, the summary size is limited, so not all sentences need to be ordered. Therefore, while searching with the Held and Karp algorithm and beam search shown in FIG. 15, a hypothesis that would exceed the summary size whichever sentence is added is no longer expanded and is saved separately as a summary candidate. After all hypotheses have been expanded, an approximate solution is obtained by choosing the highest-scoring candidate among the saved summary candidates.

A processing example of the important sentence permutation search unit 119 is described below with reference to FIG. 17. Let the set of H(i) sentence orderings at step i be S(i) = {S(i,1), S(i,2), ..., S(i,H(i))}. The text to be summarized contains Q sentences, and their set is denoted Z = {s1, s2, ..., sQ}.

First, initialization is performed (s119a). It is determined whether the ordering S(i,h) at step i already contains sentence sq (s119b). If not, sq is appended to S(i,h) to generate the ordering S(i+1,k) (s119c). It is then determined whether the size of S(i+1,k) is at most the summary size K (s119d). If it is larger, the ordering S(i,h) before sq was appended is saved (s119e), and the ordering S(i+1,k) is not expanded further. For example, in FIG. 16, if appending s2 to the ordering s3, s1 shown by the dash-dot line would exceed the summary size K, the ordering s3, s1 is saved and the ordering s3, s1, s2 is not expanded.

This processing is performed for all orderings S at step i (s119g, h), and further for all sentences contained in the text to be summarized (s119i, j).

For each ordering in the generated set S(i+1) = {S(i+1,1), S(i+1,2), ..., S(i+1,k)}, the sum of its content score and its connection score is obtained (s119k). With k' = 1, 2, ..., k, the content score of each ordering written as Content(S(i+1,k')), and its connection score written as Connect(S(i+1,k')), this sum is denoted sum(S(i+1,k')) and is expressed in terms of these two scores.

sum(S(i+1,k')) is obtained for every k'. It is then determined whether there are orderings whose last-added sentence is the same and whose set of already-covered sentences is the same. If there are, it is determined whether sum(S(i+1,k')) is the maximum among them (s119m), and if it is not, the corresponding hypothesis is discarded (s119n). Next, it is determined whether each sum is among the top b (s119p). If it is not, the corresponding hypothesis is discarded (s119n). Discarded orderings are not expanded further.

i is then updated (s119q), and the above processing (s119b to s119q) is repeated. Since K is normally smaller than the size of the source text, every hypothesis is either discarded or saved before all sentences in the source text have been ordered. The ordering corresponding to the largest sum among the saved hypotheses is then taken as S^*.
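The search with a length limit can be sketched along the same lines. This is a simplified illustration rather than a transcription of FIG. 17. A hypothesis is saved only once no remaining sentence fits, following the prose description of the approximate search. The end-of-document marker and the constraint on variations of the same input sentence are omitted. All names and scores are assumptions.

```python
def search_summary(sentences, content, connect, max_len, beam, lam=1.0):
    """Approximate search for the best ordering within `max_len` characters.

    content[j]      : content score of sentence j
    connect[(k, j)] : connection score of sentence j following sentence k
                      (k = -1 stands for the document head); missing pairs score 0
    """
    hyps = {(frozenset(), -1): (0.0, [])}
    saved = []                                   # finished summary candidates
    while hyps:
        expanded = {}
        for (used, last), (score, path) in hyps.items():
            length = sum(len(sentences[k]) for k in path)
            extended = False
            for j in range(len(sentences)):
                if j in used or length + len(sentences[j]) > max_len:
                    continue
                extended = True
                key = (used | {j}, j)
                cand = (score + content[j] + lam * connect.get((last, j), 0.0), path + [j])
                if key not in expanded or cand[0] > expanded[key][0]:   # recombination
                    expanded[key] = cand
            if not extended and path:            # nothing more fits: save as a candidate
                saved.append((score, path))
        ranked = sorted(expanded.items(), key=lambda kv: kv[1][0], reverse=True)
        hyps = dict(ranked[:beam])               # beam pruning
    return max(saved, key=lambda sp: sp[0]) if saved else (0.0, [])


# Hypothetical input: three sentences of 5, 4, and 6 characters, limit of 10.
sentences = ["aaaaa", "bbbb", "cccccc"]
content = [3.0, 2.0, 2.5]
connect = {(0, 1): 1.0, (1, 0): -1.0}
result = search_summary(sentences, content, connect, max_len=10, beam=10)
print(result)
```

Selection and ordering are decided together here: the pair (0, 1) wins both because its content scores are high and because sentence 1 follows sentence 0 well.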

<Output unit 135>
The output unit 135 receives the ordering of converted sentences found by the important sentence permutation search unit 119, formats this ordering into text, and outputs it.

For example, when text data consisting of sentence 1, sentence 2, and sentence 3 (see FIGS. 3 to 5) and a summary length of 45 characters are input to the text summarization apparatus 100, the following output (summary) can be obtained. The character counts refer to the Japanese text.
Sentence 2: 「イタリアで修行してきたシェフによる料理は最高です。」 (“The food by the chef who trained in Italy is the best.”; 25 characters, shortening rate 0.89)
↓
Sentence 1: 「ピザに対するイメージが変わりました。」 (“My image of pizza has changed.”; 18 characters, shortening rate 0.53)
Summary: 「イタリアで修行してきたシェフによる料理は最高です。ピザに対するイメージが変わりました。」 (“The food by the chef who trained in Italy is the best. My image of pizza has changed.”; 43 characters, summary rate 0.46)
This means that the shortened form of sentence 1 follows the shortened form of sentence 2, and that the shortening rate differs from sentence to sentence.
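The character counts in this example can be checked directly. The original sentence lengths are not given in this part of the text. The values 28 and 34 below are inferred from the stated shortening rates, assuming the rate is the shortened length divided by the original length, so they should be read as an inference rather than as data from FIGS. 3 to 5.

```python
sentence2 = "イタリアで修行してきたシェフによる料理は最高です。"
sentence1 = "ピザに対するイメージが変わりました。"
summary = sentence2 + sentence1
summary_limit = 45

lengths = (len(sentence2), len(sentence1), len(summary))
fits = len(summary) <= summary_limit

# Assumption: shortening rate = shortened length / original length.
implied_original2 = round(len(sentence2) / 0.89)   # inferred, not stated in the text
implied_original1 = round(len(sentence1) / 0.53)   # inferred, not stated in the text
print(lengths, fits, implied_original2, implied_original1)
```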

<Effects>
With this configuration, plausible converted-sentence variations are generated for each input sentence, the content score of each converted sentence and the connection scores between converted sentences are obtained, and a permutation is searched for that takes both into account at the same time, so the resulting permutation of converted sentences is both important and natural. Local information can be used in addition to global information, with the effect that a summary with high content coverage and high readability can be generated. In other words, it becomes possible to generate a summary that covers much of the important content of the original text and is easy to read.

The present invention is particularly advantageous when the summary length is short. According to this embodiment, how readily adjacent shortened sentences connect in the summary is taken into account, so sentences can be shortened without losing the information essential to that connection, and an easy-to-read summary can be generated even within a limited summary length.

Furthermore, for the problem of generating a summary from multiple texts, shortening sentences does not sacrifice the readability of the summary, and the content coverage of the summary can be improved by including other valuable information in the space freed by shortening.

Moreover, according to this embodiment, the optimal permutation of shortened sentences uniquely determines the optimal shortening rate for each sentence. This has the effect that a parameter for the shortening rate of each sentence is no longer needed.

<Hardware configuration>
FIG. 18 is a block diagram illustrating the hardware configuration of the text summarization apparatus 100 in this embodiment. As illustrated in FIG. 18, the text summarization apparatus 100 of this example has a CPU (Central Processing Unit) 11, an input unit 12, an output unit 13, an auxiliary storage device 14, a ROM (Read Only Memory) 15, a RAM (Random Access Memory) 16, and a bus 17.

The CPU 11 of this example has a control unit 11a, an arithmetic unit 11b, and a register 11c, and executes various kinds of arithmetic processing according to the programs read into the register 11c. The input unit 12 is an input interface, keyboard, mouse, or the like through which data is input, and the output unit 13 is an output interface, display, printer, or the like through which data is output. The auxiliary storage device 14 is, for example, a hard disk or semiconductor memory, and stores the program for causing a computer to function as the text summarization apparatus 100 together with various data. The program and data are loaded into the RAM 16 and used by the CPU 11 and other components. The bus 17 connects the CPU 11, the input unit 12, the output unit 13, the auxiliary storage device 14, the ROM 15, and the RAM 16 so that they can communicate with one another. Specific examples of such hardware include personal computers, server apparatuses, and workstations.

<Program structure>
As described above, the auxiliary storage device 14 stores the programs for executing each process of the text summarization device 100 of this embodiment. The programs that make up the text summarization program may be written as a single program sequence, or at least some of them may be stored in a library as separate modules.

<Cooperation between hardware and program>
In accordance with the loaded OS program, the CPU 11 loads the above programs and various data stored in the auxiliary storage device 14 into the RAM 16. The addresses in the RAM 16 at which the programs and data are written are stored in the register 11c of the CPU 11. The control unit 11a of the CPU 11 reads these addresses from the register 11c in sequence, reads the programs and data from the areas of the RAM 16 indicated by the addresses, causes the arithmetic unit 11b to execute the operations specified by the programs in sequence, and stores the results in the register 11c.

FIG. 1 is a block diagram illustrating the functional configuration of the text summarization device 100 that is formed when the above programs are loaded into and executed by the CPU 11 in this way.

Here, the storage unit 103 corresponds to any one of the auxiliary storage device 14, the RAM 16, the register 11c, or other buffer or cache memory, or to a storage area that combines them. The conversion sentence generation unit 112, the feature element extraction unit 113, the content score calculation unit 115, the connection score calculation unit 117, and the important sentence permutation search unit 119 are formed by having the CPU 11 execute the text summarization program.

[Modification 1]
Only the parts that differ from the first embodiment are described, with reference to FIGS. 7 and 8.

The conversion sentence generation unit 112 includes only the division unit 112b shown in FIG. 7. The storage unit 103 stores division rules for dividing complex sentences and compound sentences into simple sentences.

The division unit 112b receives an input sentence via the input unit 131, refers to the division rules in the storage unit 103, and, when the input sentence is a complex or compound sentence, divides it into simple sentences (s112b). It then transmits the resulting two or more simple sentences to the feature element extraction unit 113 as variations of converted sentences.

For example, the storage unit 103 stores a division rule that rewrites the morpheme sequence 「で/copula:continuative」+「あ/verb stem:R」+「り/verb suffix:continuative」+「、/comma」 as 「で/copula:continuative」+「あ/verb stem:R」+「る/verb suffix:terminal」+「。/period」 and divides the sentence at that point. The division unit 112b receives morphologically analyzed sentence 4, 「このお店はイタリアンの老舗であり、横浜駅から徒歩１０分の場所に位置する。」 ("This shop is a long-established Italian restaurant and is located a 10-minute walk from Yokohama Station."; see FIG. 6). It searches the storage unit 103 using the morpheme sequence 「で/copula:continuative」+「あ/verb stem:R」+「り/verb suffix:continuative」+「、/comma」 as a key and obtains the matching division rule. Following this rule, it divides the sentence into sentence 4-1, 「このお店はイタリアンの老舗である。」 ("This shop is a long-established Italian restaurant."), and sentence 4-2, 「横浜駅から徒歩１０分の場所に位置する。」 ("It is located a 10-minute walk from Yokohama Station."), and outputs these as variations of converted sentences.
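The division rule in this example can be sketched in Python as a pattern-and-replacement lookup over a morpheme list. This is only a minimal illustration: the `(surface, tag)` tuple layout, the English tag names, the placeholder tag "x", and the names `PATTERN`, `REPLACEMENT`, and `split_sentence` are assumptions made for the sketch, not part of the described device.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (surface form, tag); tag names are hypothetical

# Hypothetical encoding of the single division rule from the example.
PATTERN: List[Morpheme] = [
    ("で", "copula:continuative"), ("あ", "verb stem:R"),
    ("り", "verb suffix:continuative"), ("、", "comma"),
]
REPLACEMENT: List[Morpheme] = [
    ("で", "copula:continuative"), ("あ", "verb stem:R"),
    ("る", "verb suffix:terminal"), ("。", "period"),
]

def split_sentence(morphemes: List[Morpheme]) -> List[List[Morpheme]]:
    """Rewrite every occurrence of PATTERN and cut the sentence after it."""
    n = len(PATTERN)
    sentences, current, i = [], [], 0
    while i < len(morphemes):
        if morphemes[i:i + n] == PATTERN:
            current.extend(REPLACEMENT)   # "であり、" becomes "である。"
            sentences.append(current)     # close the first simple sentence
            current = []
            i += n
        else:
            current.append(morphemes[i])
            i += 1
    if current:
        sentences.append(current)
    return sentences

def surface(sentence: List[Morpheme]) -> str:
    """Join surface forms back into a plain string."""
    return "".join(s for s, _ in sentence)

# Sentence 4, coarsely segmented for illustration (not a real analysis).
sentence4 = ([("このお店は", "x"), ("イタリアンの老舗", "x")] + PATTERN +
             [("横浜駅から徒歩１０分の場所に位置する", "x"), ("。", "period")])
parts = [surface(s) for s in split_sentence(sentence4)]
```

A sentence that contains no matching pattern comes back as a single-element list, so the caller can treat "no division" and "division" uniformly.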

According to this modification, one or more converted sentences are generated for each input sentence, the content score of each converted sentence and the connection scores between converted sentences are obtained, and a permutation of converted sentences that fits within the summary length is searched for. Consequently, local information about the connections between converted sentences can be used in addition to global information obtained from a corpus or the like, and a summary with high content coverage and readability can be generated. In particular, because complex and compound sentences can be divided into simple sentences that serve as variations of converted sentences, an easy-to-read (highly readable) summary can be generated.

In this modification, there is no need to add a constraint that excludes permutations among the variations of converted sentences generated from the same input sentence, because these variations are sentences with mutually different content.

[Modification 2]
Only the parts that differ from the first embodiment are described, with reference to FIGS. 7 and 8.

The conversion sentence generation unit 112 includes only the paraphrase unit 112c shown in FIG. 7. The storage unit 103 stores, as a paraphrase table, rules for paraphrasing an expression into a shorter expression with the same meaning.

The paraphrase unit 112c receives an input sentence via the input unit 131, refers to the paraphrase table in the storage unit 103, and paraphrases an expression in the input sentence into a shorter expression with the same meaning (s112c). It then transmits the paraphrased sentence to the feature element extraction unit 113 as a variation of a converted sentence.

For example, suppose the storage unit 103 stores, as a paraphrase table, a rule that paraphrases the expression 「徒歩Ｘ分の場所に位置する」 ("is located an X-minute walk away") into the shorter expression 「Ｘ分歩くと着く」 ("is reached by walking X minutes"), which has the same meaning. The paraphrase unit 112c receives sentence 4, 「このお店はイタリアンの老舗であり、横浜駅から徒歩１０分の場所に位置する。」, searches the storage unit 103 using the expression 「徒歩１０分の場所に位置する」 as a key, and obtains the above rule. Following this rule, it paraphrases sentence 4 as 「このお店はイタリアンの老舗であり、横浜駅から１０分歩くと着く。」 and transmits the paraphrased sentence to the feature element extraction unit 113 as a variation of a converted sentence. Two or more paraphrases may be prepared for a single input expression and stored in the paraphrase table. The paraphrase patterns need not be character string patterns; they may be, for example, morpheme patterns or dependency patterns, and are not specifically limited.
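A character-string paraphrase table of this kind can be sketched with regular expressions. The sketch below is an illustration under assumptions: `PARAPHRASE_TABLE` and `paraphrase` are hypothetical names, the table holds only the one rule from the example, and the example sentence uses ASCII digits for simplicity.

```python
import re
from typing import List

# Hypothetical paraphrase table: (pattern, shorter replacement).
# The X in the rule is modelled as a captured run of digits.
PARAPHRASE_TABLE = [
    (re.compile(r"徒歩(\d+)分の場所に位置する"), r"\1分歩くと着く"),
]

def paraphrase(sentence: str) -> List[str]:
    """Return one variation per matching rule that shortens the sentence."""
    variations = []
    for pattern, replacement in PARAPHRASE_TABLE:
        rewritten, count = pattern.subn(replacement, sentence)
        if count and len(rewritten) < len(sentence):
            variations.append(rewritten)
    return variations

sentence4 = "このお店はイタリアンの老舗であり、横浜駅から徒歩10分の場所に位置する。"
result = paraphrase(sentence4)
```

Returning a list rather than a single string leaves room for several rules to match the same sentence, each producing its own variation.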

According to this modification, a summary with high content coverage and readability can be generated, as in the first embodiment and Modification 1. In particular, because expressions are paraphrased into shorter expressions with the same meaning, a summary with the same content can be produced in fewer characters.

[Modification 3]
Only the parts that differ from the first embodiment are described, with reference to FIGS. 7 and 8.

The conversion sentence generation unit 112 includes the shortened sentence generation unit 112a, the division unit 112b, and the paraphrase unit 112c. In the first embodiment, Modification 1, and Modification 2, a single unit performed the conversion on its own; in this modification, the units perform their conversions in parallel. The shortened sentence generation unit 112a, the division unit 112b, and the paraphrase unit 112c obtain the first, second, and third variations of converted sentences, respectively (s112a, s112b, s112c), and transmit them to the feature element extraction unit 113. The processing of each unit is as described in the first embodiment, Modification 1, and Modification 2.
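The parallel arrangement can be sketched as follows: every unit receives the same input sentence, and the variations are pooled. The three transformer functions are toy English stand-ins invented for this sketch (they are not the shortening, division, or paraphrasing methods of the embodiments), and `generate_parallel` is a hypothetical name.

```python
from typing import Callable, List

Transformer = Callable[[str], List[str]]

def generate_parallel(sentence: str, units: List[Transformer]) -> List[str]:
    """Apply every unit to the same input and pool the distinct variations."""
    variations: List[str] = []
    for unit in units:
        for variation in unit(sentence):
            if variation not in variations:
                variations.append(variation)
    return variations

# Toy stand-ins for units 112a, 112b, and 112c (assumptions for illustration).
def shorten(s: str) -> List[str]:
    return [s.replace("very ", "")] if "very " in s else []

def divide(s: str) -> List[str]:
    if ";" not in s:
        return []
    return [part.strip() + "." for part in s.rstrip(".").split(";")]

def reword(s: str) -> List[str]:
    return [s.replace("is located at", "is at")] if "is located at" in s else []

source = "The shop is very old; it is located at the station."
pooled = generate_parallel(source, [shorten, divide, reword])
```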

This configuration provides the same effects as the first embodiment, Modification 1, and Modification 2. In addition, more variations of converted sentences can be obtained for each input sentence, and the important sentence permutation can be searched for over this larger set, so a summary with higher content coverage and readability can be generated. Beyond combining important sentence extraction with sentence shortening, combining important sentence extraction with paraphrasing each sentence into a shorter synonymous expression, or with dividing compound sentences, in place of sentence shortening yields a summary that is easier to read and covers the information in the source text.

The conversion sentence generation unit 112 may instead include any two of the shortened sentence generation unit 112a, the division unit 112b, and the paraphrase unit 112c.

[Modification 4]
Only the parts that differ from the first embodiment are described, with reference to FIGS. 19 and 20.

The conversion sentence generation unit 212 includes the shortened sentence generation unit 112a, the division unit 112b, and the paraphrase unit 112c. In the first embodiment, Modification 1, and Modification 2, a single unit performed the conversion on its own; in this modification, the units perform their conversions in series (s212). For example, the shortened sentence generation unit 112a obtains the first variations of converted sentences from the input sentence (s112a). The division unit 112b receives, instead of the input sentence, the first variations obtained by the shortened sentence generation unit 112a and uses them to obtain the second variations of converted sentences (s112b). The paraphrase unit 112c in turn receives, instead of the input sentence, the second variations and uses them to obtain the third variations of converted sentences (s112c). The third variations are then transmitted to the feature element extraction unit 113.
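The serial arrangement can be sketched as a pipeline in which each stage consumes the variations produced by the previous one. The stage functions are toy English stand-ins invented for this sketch, `generate_cascade` is a hypothetical name, and passing a sentence through unchanged when a stage produces nothing is an assumption that the text does not specify.

```python
from typing import Callable, List

Stage = Callable[[str], List[str]]

def generate_cascade(sentence: str, stages: List[Stage]) -> List[str]:
    """Feed the output variations of each stage into the next stage."""
    current = [sentence]
    for stage in stages:
        produced: List[str] = []
        for s in current:
            # Assumption: a stage that yields nothing passes s through as is.
            produced.extend(stage(s) or [s])
        current = produced
    return current

# Toy stand-ins for units 112a, 112b, and 112c (assumptions for illustration).
def shorten(s: str) -> List[str]:
    return [s.replace("very ", "")]

def divide(s: str) -> List[str]:
    return [part.strip().rstrip(".") + "." for part in s.split(";")]

def reword(s: str) -> List[str]:
    return [s.replace("is located at", "is at")]

source = "The shop is very old; it is located at the station."
final = generate_cascade(source, [shorten, divide, reword])
```

Because the stages are passed in as a list, reordering them or dropping one only changes the argument, which mirrors the freedom of connection order described for this modification.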

This configuration provides the same effects as the first embodiment, Modification 1, and Modification 2. In addition, the shortened sentence generation unit 112a can obtain many variations of converted sentences for each input sentence, the division unit 112b can obtain highly readable variations, and the paraphrase unit 112c allows a summary with the same content to be produced in fewer characters. Because the important sentence permutation can be searched for over these sentences, a summary with higher content coverage and readability can be generated.

The configuration may instead include any two of the shortened sentence generation unit 112a, the division unit 112b, and the paraphrase unit 112c, and the units may be connected in series in any order.

[Modification 5]
Only the parts that differ from the first embodiment are described. The shortened sentence generation unit 112a generates variations of shortened sentences using the method described in Reference 3. In Reference 3, the score of each subtree is defined as follows, based on the phrase coverage rate and the phrase connection probability, and the subtree rooted at the predicate with the maximum score is taken as the shortened sentence of the sentence to be summarized. In this modification, not only the subtree rooted at the predicate with the maximum score but the top N subtrees are obtained as variations of shortened sentences.

Here, X denotes the sentence to be summarized, G(X) the set of subtrees of sentence X, W a subtree, and node(W) the number of phrases that make up subtree W. P_imp(w_i) is the phrase coverage rate, given by

where X is the whole sentence, w_i and w_j are phrases, t_k is a word contained in phrase w_i, and t_l is a word contained in phrase w_j. Taking the phrase coverage rate into account raises the score of subtrees that contain many phrases with important words covering the content of the sentence. P_adj(w_i | w_(i-1)) is the phrase connection probability, given by

where f_i denotes the head of the content words of a phrase and f_(i-1) denotes the head of the function words of a phrase. The head of the content words of a phrase is the last word in the sequence of content words in the phrase. The head of the function words of a phrase is the last word in the sequence of function words in the phrase, excluding symbols such as punctuation marks. For a phrase with no function words, such as an adverbial phrase, it is the same word as the content-word head. The phrase connection probability represents how plausible it is for two phrases to be adjacent, and is obtained from a corpus by maximum likelihood estimation.
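The top-N selection that distinguishes this modification can be sketched independently of the score definition. In the sketch below the subtree scores are taken as given inputs (the numbers are made up for illustration), and `top_n_subtrees` is a hypothetical name; nothing here reproduces the scoring formula of Reference 3.

```python
import heapq
from typing import List, Tuple

def top_n_subtrees(scored: List[Tuple[float, str]], n: int) -> List[str]:
    """Return the n highest-scoring subtrees, best first."""
    best = heapq.nlargest(n, scored, key=lambda pair: pair[0])
    return [subtree for _, subtree in best]

# Made-up (score, shortened sentence) pairs for one input sentence.
candidates = [
    (0.42, "横浜駅から徒歩１０分の場所に位置する。"),
    (0.61, "このお店はイタリアンの老舗である。"),
    (0.17, "位置する。"),
]
variations = top_n_subtrees(candidates, 2)
```

With n = 1 this reduces to taking only the maximum-score subtree, which is the behavior attributed to Reference 3 in the text.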

This configuration provides the same effects as the first embodiment.

[Other variations]
The input text data need not have already undergone morphological analysis, named entity extraction, or dependency analysis. In that case, the text summarization device may be provided with a conventional morphological analysis unit, named entity extraction unit, dependency analysis unit, and so on, as necessary.

A feature element does not have to be a morpheme; it may be any unit from which a feature can be formed, such as a phrase.

The content score calculation unit 115 may use, as sentence elements, units different from the feature elements (for example, words or named entities). In that case, rather than taking the output of the feature element extraction unit 113 as its input, it extracts sentence elements from the input text data with its own sentence element extraction unit.

The sentence element scores and the weight parameters may be obtained by other methods, or values obtained in advance by another device may be stored in the storage unit 103.

The important sentence permutation search unit 119 may also be designed to reduce redundancy. According to Equations (1) and (4), if the same content word or information extraction result appears more than once in S, it is added to the content score each time. In general, however, it is undesirable for the same information to appear repeatedly in a summary. By adding each content word or information extraction result to the content score only once, the same information can be kept from appearing in the summary repeatedly. This also lowers the likelihood that several converted sentences generated from the same input sentence are included in the summary together.
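The redundancy-reduction idea can be sketched by scoring the set of distinct sentence elements rather than every occurrence. This is a simplified illustration: it assumes the content score is a plain sum of sentence element scores, and `content_score`, `element_scores`, and the example values are hypothetical.

```python
from typing import Dict, List

def content_score(sentences: List[List[str]],
                  element_scores: Dict[str, float],
                  count_once: bool = True) -> float:
    """Sum sentence element scores over a candidate summary.

    With count_once=True, an element shared by several sentences
    contributes to the score only once.
    """
    elements = [e for sentence in sentences for e in sentence]
    if count_once:
        elements = list(set(elements))
    return sum(element_scores.get(e, 0.0) for e in elements)

# Made-up sentence elements and scores for illustration.
scores = {"老舗": 2.0, "横浜駅": 1.0, "イタリアン": 1.5}
summary = [["老舗", "イタリアン"], ["老舗", "横浜駅"]]

with_repeats = content_score(summary, scores, count_once=False)
without_repeats = content_score(summary, scores)
```

Under this scheme a second converted sentence derived from the same input adds little to the total, since most of its elements are already counted.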

In the above, the connection score calculation unit 117 obtains the connection score w^T Φ(x, y) of a sequence of three or more sentences. Alternatively, the connection score calculation unit 117 may obtain and output only the two-sentence connection score Connect(s_j | s_i), and the important sentence permutation search unit 119 may obtain the connection score w^T Φ(x, y) of the sequence of three or more sentences.

The connection score calculation unit 117 obtains the connection score w^T Φ(x, y) of a sequence of three or more sentences from the two-sentence connection scores Connect(s_j | s_i). Alternatively, a connection score may be obtained from three or more sentences and then used to obtain the connection score of the sentence sequence. For example, Connect(s_n | s_i, s_(i+1), ..., s_(n-1)) gives the connection score for the case where sentence s_n follows sentences s_i, s_(i+1), ..., s_(n-1). In this case, the weight parameters and related settings are adjusted accordingly.

When beam search is used in calculating the weight parameters or in the important sentence permutation search unit 119, the beam width b may be set as appropriate in view of the computing performance of the text summarization device and similar factors (for example, b = 1 to 1000). The value may also change with the time step i; for example, b may be decreased as i increases. As i grows, the number of hypotheses that can be expanded from a single hypothesis decreases, so the amount of computation can be adjusted in this way. Furthermore, b need not be a constant; it may vary with the number Q of sentences in the source text, for example b = Q × 0.1.
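One way to combine the two suggestions in this paragraph, a width proportional to Q and a width that shrinks as i grows, is sketched below. The linear decay, the lower bound of 1, and the name `beam_width` are assumptions chosen for the sketch; the text itself fixes only b = Q × 0.1 as an example.

```python
def beam_width(num_sentences: int, step: int = 0,
               ratio: float = 0.1, decay: int = 1, minimum: int = 1) -> int:
    """Beam width b for time step i (here `step`).

    The starting width is ratio * Q, and the width shrinks by `decay`
    per step, never dropping below `minimum` (both are assumptions).
    """
    base = max(minimum, round(num_sentences * ratio))
    return max(minimum, base - decay * step)

# Width schedule for a 200-sentence source text over the first few steps.
schedule = [beam_width(200, step=i) for i in range(4)]
```

Setting `decay=0` gives a constant width that depends only on Q.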

Beam search and dynamic programming need not both be used when calculating the weight parameters or in the important sentence permutation search unit 119. Using only one of them can still improve efficiency, and other methods may be used to find a high score efficiently.

The summary length may be obtained by a summary length determination unit (not shown) and supplied as input to the important sentence permutation search unit 119. The summary length determination unit receives, via the input unit 131, the source text data or its size (in kilobytes) and determines the summary length according to that size. For example, the summary length may be determined automatically so that the summarized text data is between 5% and 20% of the size of the source text data.
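A summary length determination of this kind can be sketched as a fixed percentage of the source size. The function name `decide_summary_length`, the default of 10%, and the choice to measure size in bytes are assumptions made for the sketch.

```python
def decide_summary_length(source_size_bytes: int, percent: int = 10) -> int:
    """Return a summary length as a percentage of the source size.

    `percent` is restricted to the 5 to 20 range mentioned in the text.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    if not 5 <= percent <= 20:
        raise ValueError("percent must be between 5 and 20")
    return source_size_bytes * percent // 100

source_text = "あ" * 1000                       # made-up source text
size = len(source_text.encode("utf-8"))         # 3 bytes per character here
summary_length = decide_summary_length(size)
```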

<Text summarization device 200>
The text summarization device 200 according to the second embodiment is described with reference to FIGS. 1 and 21. It differs from the text summarization device 100 in the configuration of the connection score calculation unit 217 and in the data stored in the storage unit 203.

<Connection score calculation unit 217>
The connection score calculation unit 217 does not have the feature vector generation unit 117a and includes only the calculation unit 217b. It calculates the connection score using the method described in Mirella Lapata, "Probabilistic Text Structuring: Experiments with Sentence Ordering", In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics, 2003, pp. 545-552. In that case, the connection score can be defined, for example, as follows.

Here f_ik is the k-th feature element of sentence s_i, and f_jm is the m-th feature element of sentence s_j. These correspond to the feature elements extracted from sentences s_i and s_j in FIG. 11. p(f_jm | f_ik) is the probability that feature element f_jm appears given feature element f_ik. According to Equation (5), the connection score calculation unit 217 obtains, as the connection score of sentences s_i and s_j, the product of the conditional probabilities p(f_jm | f_ik) over the Cartesian product (s_i, s_j) of the feature elements of the two sentences, normalized by the cardinality of that Cartesian product. p(f_jm | f_ik) can be calculated, for example, as follows.

Here C(f_ik, f_jm) is the number of times, in adjacent sentences of the training data τ described above, that feature element f_ik appears in the preceding sentence and feature element f_jm appears in the following sentence. The denominator is the number of times feature element f_ik appears in the training data τ. When Equation (5) is used as the connection score, a weight parameter calculation unit (not shown), for example, counts these quantities in the training data τ, obtains each conditional probability by Equation (6), and stores the probabilities in the storage unit 203 as weight parameters. FIG. 21 shows an example of estimated conditional probabilities. In the feature column, the left side corresponds to feature element f_ik and the right side to feature element f_jm; the conditional probability column corresponds to p(f_jm | f_ik).
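The counting and scoring described for Equations (5) and (6) can be sketched as follows. The sketch reads Equation (6) as C(f_ik, f_jm) divided by the number of occurrences of f_ik, as the text states. For Equation (5) it assumes that "normalized by the cardinality" means dividing the product by the number of feature pairs; that reading, the zero probability for unseen pairs, and the names `estimate_probabilities` and `connect` are assumptions, and the training data is made up.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

Sentence = List[str]  # feature elements of one sentence (assumed distinct)

def estimate_probabilities(
        training: List[List[Sentence]]) -> Dict[Tuple[str, str], float]:
    """Estimate p(f_jm | f_ik) = C(f_ik, f_jm) / count(f_ik) from documents."""
    pair_counts: Counter = Counter()
    element_counts: Counter = Counter()
    for document in training:
        for sentence in document:
            element_counts.update(sentence)
        # Count feature pairs across each pair of adjacent sentences.
        for previous, following in zip(document, document[1:]):
            pair_counts.update(product(previous, following))
    return {pair: count / element_counts[pair[0]]
            for pair, count in pair_counts.items()}

def connect(s_i: Sentence, s_j: Sentence,
            probabilities: Dict[Tuple[str, str], float]) -> float:
    """Connection score of s_j following s_i under the assumed reading."""
    pairs = list(product(s_i, s_j))
    if not pairs:
        return 0.0
    score = 1.0
    for pair in pairs:
        score *= probabilities.get(pair, 0.0)  # unseen pairs get probability 0
    return score / len(pairs)

# Made-up training data: two documents of two sentences each.
training_data = [[["a"], ["b"]], [["a"], ["c"]]]
table = estimate_probabilities(training_data)
```

In practice some smoothing would be needed so that a single unseen pair does not force the whole score to zero; the text does not specify one, so none is applied here.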

This configuration provides the same effects as the first embodiment.

100, 200 Text summarization device
103, 203 Storage unit
112 Conversion sentence generation unit
112a Shortened sentence generation unit
112b Division unit
112c Paraphrase unit
113 Feature element extraction unit
115 Content score calculation unit
117, 217 Connection score calculation unit
119 Important sentence permutation search unit

Claims

A text summarization device that summarizes text data composed of one or more input sentences, comprising:
a storage unit that stores in advance a weight parameter for each feature, a feature being a combination of two feature elements that takes their order into account, and a sentence element score for each sentence element constituting a sentence;
a conversion sentence generation unit that converts the input sentences and generates one or more converted sentences for each input sentence;
a feature element extraction unit that extracts feature elements from the converted sentences;
a content score calculation unit that obtains a content score of each converted sentence using the sentence element scores of the sentence elements included in that converted sentence;
a connection score calculation unit that obtains a connection score of the converted sentences using the feature elements extracted by the feature element extraction unit and the weight parameters; and
an important sentence permutation search unit that searches for a permutation of converted sentences for which the sum of the content score and the connection score is the maximum value or an approximation of the maximum value.

The text summarization device according to claim 1, wherein
the storage unit further stores word connection probabilities as a language model, and
the conversion sentence generation unit further includes a shortened sentence generation unit that generates one or more shortened sentences for each input sentence using the input sentence.

The text summarization device according to claim 1 or 2, wherein
the storage unit further stores a division rule for dividing complex sentences and compound sentences into simple sentences, and
the conversion sentence generation unit further includes a division unit that refers to the division rule and divides the input sentence into simple sentences when the input sentence is a complex sentence or a compound sentence.

The text summarization device according to any one of claims 1 to 3, wherein
the storage unit further stores a paraphrase table consisting of rules for paraphrasing an expression into a shorter expression having the same meaning, and
the conversion sentence generation unit further includes a paraphrase unit that refers to the paraphrase table and paraphrases an expression in the input sentence into a shorter expression having the same meaning.

A text summarization method for summarizing text data composed of one or more input sentences, wherein
a weight parameter for each feature, a feature being a combination of two feature elements that takes their order into account, and a sentence element score for each sentence element constituting a sentence are stored in advance, the method comprising:
a conversion sentence generation step in which a conversion sentence generation unit converts the input sentences and generates one or more converted sentences for each input sentence;
a feature element extraction step in which a feature element extraction unit extracts feature elements from the converted sentences;
a content score calculation step in which a content score calculation unit obtains a content score of each converted sentence using the sentence element scores of the sentence elements included in that converted sentence;
a connection score calculation step in which a connection score calculation unit obtains a connection score of the converted sentences using the feature elements extracted in the feature element extraction step and the weight parameters; and
an important sentence permutation search step in which an important sentence permutation search unit searches for a permutation of converted sentences for which the sum of the content score and the connection score is the maximum value or an approximation of the maximum value.

The text summarization method according to claim 5, wherein
word connection probabilities are stored in advance as a language model, and
the conversion sentence generation step further includes a shortened sentence generation step in which a shortened sentence generation unit generates one or more shortened sentences for each input sentence using the input sentence.

The text summarization method according to claim 5 or 6, wherein
a division rule for dividing complex sentences and compound sentences into simple sentences is stored in advance, and
the conversion sentence generation step further includes a division step in which a division unit refers to the division rule and divides the input sentence into simple sentences when the input sentence is a complex sentence or a compound sentence.

The text summarization method according to any one of claims 5 to 7, wherein
a paraphrase table consisting of rules for paraphrasing an expression into a shorter expression having the same meaning is stored in advance, and
the conversion sentence generation step further includes a paraphrase step in which a paraphrase unit refers to the paraphrase table and paraphrases an expression in the input sentence into a shorter expression having the same meaning.

A text summarization program for causing a computer to function as the text summarization device according to claim 1.