JP2019121139A

JP2019121139A - Summarizing device, summarizing method, and summarizing program

Info

Publication number: JP2019121139A
Application number: JP2017255133A
Authority: JP
Inventors: Yuma Hayashi (林 佑磨); Miki Morioka (森岡 幹); Motonori Niso (二宗 素紀); Hiroaki Takatsu (高津 弘明)
Original assignee: Airev Co Ltd
Current assignee: Airev Co Ltd
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2019-07-22
Anticipated expiration: 2037-12-29
Also published as: JP7142435B2

Abstract

PROBLEM TO BE SOLVED: To provide a summarizing device that generates an easy-to-read summary.

SOLUTION: The summarizing device 10 includes: a document analysis unit 18 that analyzes a document and generates document data; an important sentence extraction unit 20 that extracts, from the document data analyzed by the document analysis unit 18, a plurality of sentences having high importance scores as important sentences; a compressed sentence generation unit 22 that compresses each of the extracted important sentences to generate a plurality of compressed sentences corresponding to each important sentence; and a summary generation unit 24 that selects, for each important sentence, the compressed sentence having the highest summary score from among the corresponding compressed sentences, and generates a summary.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a summarizing apparatus, a summarizing method, and a summarizing program.

Techniques for automatically summarizing documents such as news articles have long been known.

Patent Document 1 below discloses a document summarizing apparatus that automatically generates a summary from an original text. The apparatus takes as pre-compression candidate sentences the sentences contained in the original text, fused sentences each obtained by merging two of those sentences, and split sentences each obtained by dividing a sentence, and compresses these candidates to generate summary candidate sentences. It then selects, from the generated summary candidate sentences, those that satisfy a predetermined summary length and generates a summary.

Non-Patent Document 1 below discloses an extractive summarization model that extracts parts of a document, such as sentences, phrases, and words, and generates a summary based on word co-occurrence, importance, relevance, and the like. Because the expressions it outputs are extracted from the original document, the generated summary consists of words that appear in the original document, and a grammatical summary can be produced.

Non-Patent Document 2 below discloses a generative (abstractive) summarization model that uses a model trained on training data to generate summaries containing words that do not appear in the document. Because the training data for this model contains summaries made up of sentences of various lengths, the structural information of the summaries cannot easily be annotated. As a result, the generated summary can be grammatically unnatural or can contradict the document.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2017-151863

Non-Patent Document 1: Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, Bing Xiang: Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond, Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), pp. 280-290, Berlin, Germany, August 7-12, 2016.

Non-Patent Document 2: Rush, A. M., Chopra, S. and Weston, J.: A Neural Attention Model for Abstractive Sentence Summarization, Proceedings of EMNLP 2015, pp. 379-389 (2015).

With a conventional extractive summarization model, sentences that capture the subject of a news article of, for example, about 200 characters and that have high importance and relevance can be extracted with their grammatical structure intact. However, such a model does not properly perform rephrasing, which is one of the characteristic techniques of summarization, and therefore cannot produce an easy-to-read summary.

A generative summarization model, on the other hand, can generate summaries using words that do not appear in the document, as noted above, but many of the resulting summaries have improper grammatical structure or contradict the original text, so an easy-to-read summary cannot be obtained.

An object of the present invention is to provide a summarizing apparatus, a summarizing method, and a summarizing program that can generate an easy-to-read summary.

The present invention is a summarizing apparatus that generates a summary from a document, comprising: a document analysis unit that analyzes the document and generates document data; an important sentence extraction unit that extracts, from the document data analyzed by the document analysis unit, a plurality of sentences having high importance scores as important sentences; and a summary generation unit that generates a summary from the important sentences extracted by the important sentence extraction unit, wherein the summary generation unit includes a compression unit that compresses each of the plurality of important sentences to generate compressed sentences corresponding to that important sentence, and a selection unit that selects, from the plurality of compressed sentences, the compressed sentence having the highest summary score as the summary.

The summary generation unit may further include a compressed sentence analysis unit that analyzes the compressed sentences generated by the compression unit and generates compressed sentence data, and a score calculation unit that calculates the summary score based on the compressed sentences, the compressed sentence data, and the extracted feature amounts.

The important sentence extraction unit may also extract named entities contained in the important sentences.

The compression unit may compress the important sentences using a neural network, and the model of the neural network may be one trained in advance on a plurality of training sentences and a compressed sentence for each of those training sentences.

Another invention is a summarizing method for generating a summary from a document, comprising: a document analysis step of analyzing the document and generating document data; an important sentence extraction step of extracting, from the document data analyzed in the document analysis step, a plurality of sentences having high scores as important sentences; and a summary generation step of generating a summary from the important sentences extracted in the important sentence extraction step, wherein the summary generation step includes a compression step of compressing each of the plurality of important sentences to generate compressed sentences corresponding to that important sentence, and a selection step of selecting, from the plurality of compressed sentences generated in the compression step, the compressed sentence having the highest score as the summary.

Yet another invention is a program for causing a computer to function as each unit of the summarizing apparatus according to any one of claims 1 to 5.

The summarizing apparatus, summarizing method, and summarizing program of the present invention can generate an easy-to-read summary.

FIG. 1 is a diagram showing the configuration of the summarizing apparatus according to the present embodiment.
FIG. 2 is a diagram showing the configuration of the learning device according to the present embodiment.
FIG. 3 is a flowchart showing the flow of processing executed by the summarizing apparatus.
FIG. 4 is a flowchart showing the flow of processing executed by the document analysis unit.
FIG. 5 is a flowchart showing the flow of processing executed by the important sentence extraction unit.
FIG. 6 is a flowchart showing the flow of processing executed by the compressed sentence generation unit.
FIG. 7 is a flowchart showing the flow of processing executed by the summary generation unit.
FIG. 8 is a diagram showing the hardware configuration of the summarizing apparatus according to the present embodiment.

Embodiments of the summarizing apparatus, the summarizing method, and the summarizing program are described below with reference to the drawings.

[Configuration of the summarizing apparatus]
The configuration of the summarizing apparatus according to the embodiment of the present invention is described with reference to FIG. 1, which shows the configuration of the summarizing apparatus according to the present embodiment.

The summarizing apparatus 10 has an input unit 12 that receives information, an operation unit 14 that processes the information received by the input unit 12, and an output unit 16 that outputs the information processed by the operation unit 14.

The input unit 12 accepts the document to be summarized. The document contains a plurality of sentences, that is, text data.

The operation unit 14 has a document analysis unit 18, an important sentence extraction unit 20, a compressed sentence generation unit 22, and a summary generation unit 24. The operation unit 14 also has a storage unit 26 that stores the document received by the input unit 12.

The document analysis unit 18 analyzes the document stored in the storage unit 26 and generates document data. Specifically, it performs morphological analysis on the document and generates word-segmented document data. Morphological analysis divides each sentence in the document into morphemes, the smallest meaningful linguistic units, and attaches to each morpheme information such as its part of speech, reading, meaning, and base form (for inflected words only). The document analysis unit 18 then performs syntactic parsing on the word-segmented document data and generates parsed document data for each sentence.

The important sentence extraction unit 20 extracts, from the document data analyzed by the document analysis unit 18, a plurality of sentences having high importance scores as important sentences. Specifically, it extracts feature amounts from the sentence data and assigns an importance score based on those feature amounts. The importance score is assigned using a model generated by the learning device 40 described later. The feature amounts used to calculate the importance score include, for example, the position of the sentence within the document, the sentence length, word frequency (co-occurrence), word importance, relevance to the subject, sentence-to-sentence similarity, named entities, and information on the verbs in the sentence. For example, the importance score can be set to increase as word frequency, word importance, or relevance to the subject increases.

The important sentence extraction unit 20 can also extract important sentences based on preset constraints. A constraint is a condition, other than the importance score, that restricts which sentences may be extracted as important sentences; constraints include sentence length and named entities. A sentence that is verbose or too short is unsuitable for a summary, so the important sentence extraction unit 20 extracts a sentence as an important sentence only when its length falls within a predetermined range. Likewise, because named entities such as persons, times, places, and events are essential to a summary, the unit can extract a sentence only when it contains predetermined named entities. These constraints can be set freely, and combining them with extraction based on the importance score makes it possible to accurately extract important sentences that are suitable for a summary.
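As an illustration only, combining the constraints with the importance score might be sketched as follows. The `Sentence` record, the length bounds, and the entity type names are hypothetical; the source does not specify concrete values, and the importance score is assumed to have been computed already by the learned model.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    position: int          # index of the sentence within the document
    text: str
    importance: float      # importance score from the learned model (assumed given)
    entities: set = field(default_factory=set)  # named-entity types found in the sentence


def satisfies_constraints(s, min_len=10, max_len=80, required_entities=frozenset()):
    """Return True if the sentence passes the length and named-entity constraints."""
    if not (min_len <= len(s.text) <= max_len):
        return False  # too short or too long for a summary
    return set(required_entities) <= s.entities


def candidate_important_sentences(sentences, **constraints):
    """Keep only sentences that pass the constraints, highest importance first."""
    kept = [s for s in sentences if satisfies_constraints(s, **constraints)]
    return sorted(kept, key=lambda s: s.importance, reverse=True)
```

Filtering before ranking means a high-scoring sentence that violates a constraint never displaces a valid one; whether the apparatus applies the constraints before or after scoring is not stated in the source.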

The feature amount of the sentence data is a feature vector whose elements are the features contained in the sentence data, obtained by vectorizing feature information for a predetermined unit. The predetermined unit is a morpheme or an N-gram. A morpheme, as noted above, is the smallest meaningful linguistic unit. An N-gram is a unit consisting of N adjacent linguistic units. The feature information for the predetermined unit includes at least one of part of speech, reading, meaning, and base form (for inflected words only). The value of each vector element can be set freely, for example a binary value (0 or 1), an occurrence count, a TF-IDF (Term Frequency-Inverse Document Frequency) value, or an LDA (Latent Dirichlet Allocation) value. TF-IDF, for instance, allows word similarity to be scored and is therefore useful as an indicator of word importance. LDA allows the similarity of inferred topics to be scored and is therefore useful as an indicator of relevance to the subject or of sentence-to-sentence similarity.
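A minimal sketch of such feature vectors, assuming the sentences have already been split into tokens (morphemes). The weighting `tf * log(N / df)` is one common TF-IDF variant chosen here for illustration; the source does not state which formula is used, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (tuples of n adjacent tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(tokenized_sentences):
    """Return one sparse {term: weight} vector per sentence, weight = tf * log(N / df)."""
    n_sentences = len(tokenized_sentences)
    df = Counter()  # number of sentences in which each term occurs
    for tokens in tokenized_sentences:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized_sentences:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_sentences / df[term])
            for term, count in tf.items()
        })
    return vectors
```

To build N-gram features instead of morpheme features, pass `ngrams(tokens, n)` for each sentence in place of the raw token list. A term that occurs in every sentence receives weight 0, which matches the intuition that it does not distinguish one sentence from another.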

The important sentence extraction unit 20 is preset to extract N important sentences (N is an integer of 2 or more) from the document data. The value of N can be set freely. With this setting, the top N sentences are extracted in descending order of importance score. If the extracted important sentences include two sentences that are highly similar to each other, the important sentence extraction unit 20 can exclude the one with the lower importance score and instead extract the sentence ranked (N+1)th. This prevents the extracted important sentences from containing sentences with duplicate content, and also picks up a sentence that has a high importance score but would otherwise have been left out. Two important sentences may be judged similar when their feature amounts are within a predetermined range of each other, or when the proportion of matching words they contain is at or above a predetermined value. Although this embodiment describes the case of two highly similar important sentences, the invention is not limited to this configuration, and there may be k or more highly similar important sentences (k is an integer of 3 or more). In that case, the important sentence extraction unit 20 can exclude the k-1 sentences with the lower importance scores and extract sentences up to the one ranked (N+k)th.
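The top-N extraction with duplicate removal can be sketched as a greedy pass over the ranking. This is only an illustration: the Jaccard word overlap and the 0.8 threshold are assumptions standing in for the "proportion of matching words" and the "predetermined value", neither of which the source defines.

```python
def word_overlap(a, b):
    """Jaccard overlap between two token lists (an assumed similarity measure)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def extract_top_n(scored, n, threshold=0.8):
    """scored: list of (importance_score, tokens). Return up to n non-duplicate items."""
    selected = []
    for score, tokens in sorted(scored, key=lambda item: item[0], reverse=True):
        if any(word_overlap(tokens, kept) >= threshold for _, kept in selected):
            # A higher-scoring near-duplicate is already kept: drop this one,
            # so that the next-ranked sentence moves up to fill the slot.
            continue
        selected.append((score, tokens))
        if len(selected) == n:
            break
    return selected
```

Because the loop simply keeps walking down the ranking until N sentences are kept, it handles two similar sentences and larger groups of similar sentences in the same way.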

The compressed sentence generation unit 22 compresses each of the important sentences extracted by the important sentence extraction unit 20 and generates a plurality of compressed sentences for each important sentence. Specifically, it extracts the meaning of the important sentence and generates words from that meaning, producing compressed sentences whose words and syntax differ from those of the important sentence but whose meaning is the same or similar. A neural network, for example an encoder-decoder translation model or an RNN (Recurrent Neural Network), is used to extract the meaning of the important sentence and to generate words from it; the model is generated by the learning device 40 described later.

The compressed sentence generation unit 22 is preset to generate M compressed sentences (M is an integer of 2 or more) from one important sentence. The value of M can be set freely and is independent of N.

The summary generation unit 24 selects, for each important sentence, the compressed sentence with the highest summary score from the corresponding compressed sentences generated by the compressed sentence generation unit 22, and generates a summary. Specifically, the summary generation unit 24 has a compressed sentence analysis unit 28 that analyzes each compressed sentence and generates compressed sentence data, and a score calculation unit 30 that calculates a summary score based on feature amounts extracted from the compressed sentence data generated by the compressed sentence analysis unit 28.

The compressed sentence analysis unit 28 performs morphological analysis on each compressed sentence and generates word-segmented compressed sentence data. Morphological analysis divides the sentence into morphemes, the smallest meaningful linguistic units, and attaches to each morpheme information such as its part of speech, reading, meaning, and base form (for inflected words only). The compressed sentence analysis unit 28 then performs syntactic parsing on the word-segmented compressed sentence data and generates parsed compressed sentence data for each compressed sentence.

The score calculation unit 30 extracts feature amounts from the compressed sentence data and assigns a summary score based on them. The summary score is assigned using a model generated by the learning device 40 described later. The feature amounts used to calculate the summary score include, for example, the genre of the document, the sentence length, word frequency (co-occurrence), word importance, named entities, the verbs in the sentence, part-of-speech relations, and information on character sequences, word sequences, and part-of-speech sequences. A part-of-speech relation is, for example, the pairing of the part of speech of the subject with that of the predicate. The summary score may be set to increase the more correct the relationship between subject and predicate is.

Part-of-speech sequences can also be used for the summary score. Specifically, the more closely the sentence-final part-of-speech sequence pattern of a compressed sentence resembles the sentence-final pattern of the summaries used as training data, the higher the summary score assigned. Sentence-final patterns characteristic of summaries include noun-final sentences (taigen-dome), plain style, polite (desu/masu) style, and characteristic particles such as など ("and so on"). For example, suppose the important sentence is 「グレープフルーツやレモンなど、柑橘系の香りは甘い物食べたい時のイライラを軽減してくれます」 ("Citrus scents such as grapefruit and lemon ease the irritability you feel when you crave sweets"), and that three compressed sentences are generated from it: (1) 「柑橘系の香りは甘い物が食べたい時のイライラを軽減してくれます」, ending in the polite form; (2) 「柑橘系の香りは甘い物が食べたい時のイライラを軽減してくれる」, ending in the plain form; and (3) 「レモンなど、柑橘系の香りは甘い物が食べたい時のイライラを軽減してくれる」, which retains "such as lemon". Suppose also that the summaries in the training data include 「柑橘系の香りは、甘い物が食べたい時のイライラを軽減してくれる」. Based on the similarity of the sentence-final part-of-speech sequence patterns, the scores of the three compressed sentences are then 0.94, 1.00, and 0.94 in that order, and the second compressed sentence, whose sentence-final pattern is identical to that of the training data, receives the highest summary score. Scoring by part-of-speech sequence in this way is only one example; combining it with the other feature amounts listed above makes it possible to select the compressed sentence best suited to the summary.
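One way to picture the sentence-final pattern comparison is the sketch below, which compares the last few part-of-speech tags of a candidate with those of reference summaries. It is an assumption-laden illustration: the source does not say how the similarity values are computed, the tag names and the window size `k` are hypothetical, and `difflib.SequenceMatcher` is simply a convenient standard-library similarity measure, so the values it produces will not match the 0.94 and 1.00 quoted in the example.

```python
from difflib import SequenceMatcher


def tail_pos_similarity(candidate_pos, reference_pos, k=4):
    """Similarity in [0, 1] between the last k part-of-speech tags of two sentences."""
    return SequenceMatcher(None, candidate_pos[-k:], reference_pos[-k:]).ratio()


def best_tail_score(candidate_pos, reference_summaries_pos, k=4):
    """Score a candidate by its closest match among the reference summaries."""
    return max(tail_pos_similarity(candidate_pos, ref, k)
               for ref in reference_summaries_pos)
```

For instance, a candidate tagged `["noun", "particle", "verb", "aux"]` (polite ending) scores below 1.0 against a reference tagged `["noun", "particle", "verb"]` (plain ending), while a candidate with the same tags as the reference scores exactly 1.0.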

The feature amount of the compressed sentence data is a feature vector whose elements are the features contained in the compressed sentence data, obtained by vectorizing feature information for a predetermined unit. The predetermined units are morphemes and N-grams. A morpheme, as noted above, is the smallest meaningful linguistic unit. An N-gram is a unit consisting of N adjacent linguistic units. The feature information for the predetermined unit includes at least one of part of speech, reading, meaning, and base form (for inflected words only). The value of each vector element can be set freely, for example a binary value (0 or 1), an occurrence count, a TF-IDF (Term Frequency-Inverse Document Frequency) value, or an LDA (Latent Dirichlet Allocation) value. TF-IDF, for instance, allows word similarity to be scored and is therefore useful as an indicator of word importance.

The summary generation unit 24 selects, from the M compressed sentences, the one compressed sentence with the highest summary score. Because there are N important sentences, the summary generation unit 24 selects N compressed sentences. It then generates a summary from the selected compressed sentences. In doing so, the summary generation unit 24 can use the position of each important sentence in the document and its correspondence to a compressed sentence to arrange the selected compressed sentences in the order of the important sentences, and take the result as the summary.

The summary generation unit 24 can also select compressed sentences based on preset constraints. A constraint is a condition, other than the summary score, that restricts which compressed sentences may be selected; constraints include sentence length and named entities. A sentence that is verbose or too short is unsuitable for a summary, so the summary generation unit 24 selects a compressed sentence only when its length falls within a predetermined range. Likewise, because named entities such as persons, times, places, and events are essential to a summary, the summary generation unit 24 can select a compressed sentence only when it contains predetermined named entities. These constraints can be set freely, and combining them with selection based on the summary score makes it possible to accurately select compressed sentences that are suitable for a summary.
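The selection and ordering step can be sketched as follows, under stated assumptions: the input maps each important sentence's position in the document to its M scored candidates, the length bounds are hypothetical, and an important sentence with no candidate passing the constraints is simply skipped (the source does not say what happens in that case).

```python
def build_summary(candidates, min_len=5, max_len=60):
    """candidates: {position_of_important_sentence: [(summary_score, text), ...]}.

    Returns the best compressed sentence per important sentence,
    arranged in the original document order of the important sentences.
    """
    chosen = []
    for position in sorted(candidates):  # original document order
        valid = [(score, text) for score, text in candidates[position]
                 if min_len <= len(text) <= max_len]
        if not valid:
            continue  # assumption: no valid candidate, so nothing is emitted
        _, best_text = max(valid, key=lambda c: c[0])
        chosen.append(best_text)
    return chosen
```

With N important sentences and M candidates each, the result contains at most N sentences, one per important sentence.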

The output unit 16 outputs the generated summary. The output may be displayed as text on a monitor, or the summary may be transmitted as text data to an external environment.

With this configuration, the important sentence extraction unit 20 can extract sentences that capture the subject of the original document and have high importance and relevance. The compressed sentence generation unit 22 can generate sentences that use words not appearing in the document. Furthermore, the summary generation unit 24 removes compressed sentences whose grammatical structure is broken or that are grammatically inappropriate, so a summary that does not contradict the document can be generated. As a result, the summarizing apparatus 10 can produce an easy-to-read summary.
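The division of labor among the units can be pictured as a four-stage pipeline. The sketch below is purely structural: each stage is passed in as a plain function, and the names `analyze`, `extract_important`, `compress`, and `score` are hypothetical stand-ins for the document analysis unit 18, the important sentence extraction unit 20, the compressed sentence generation unit 22, and the scoring done in the summary generation unit 24.

```python
def summarize(document, analyze, extract_important, compress, score):
    """Run the four stages in order and return the list of summary sentences."""
    document_data = analyze(document)              # document analysis unit 18
    important = extract_important(document_data)   # important sentence extraction unit 20
    summary = []
    for sentence in important:
        candidates = compress(sentence)            # compressed sentence generation unit 22
        summary.append(max(candidates, key=score)) # summary generation unit 24
    return summary
```

Keeping the stages as separate callables mirrors the way the apparatus combines extraction (which preserves the subject of the document) with generation and scoring (which allow rephrasing while filtering out poorly formed candidates).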

Furthermore, before sentence compression, the compressed sentence generation unit 22 extracts the named entities contained in an important sentence and assigns an arbitrary label to each. The summary generation unit 24 then replaces each label contained in the selected compressed sentence with the corresponding named entity. Labeling named entities before compression and substituting the original named entities back once the summary has been formed prevents words from being inadvertently altered during compression.
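A minimal sketch of this label-and-restore step, assuming the named entities have already been detected and are given as a list of strings. The `<ENT0>`-style label format is a hypothetical choice; the source only says that an arbitrary label is assigned.

```python
def mask_entities(sentence, entities):
    """Replace each entity with a placeholder label; return the masked text and the mapping."""
    mapping = {}
    masked = sentence
    for i, entity in enumerate(entities):
        label = f"<ENT{i}>"  # hypothetical label format
        masked = masked.replace(entity, label)
        mapping[label] = entity
    return masked, mapping


def restore_entities(text, mapping):
    """Substitute the original named entities back for their labels."""
    for label, entity in mapping.items():
        text = text.replace(label, entity)
    return text
```

The masked sentence would be fed to the compression model, and `restore_entities` applied to the selected compressed sentence. This simple string replacement assumes that no entity is a substring of another; a real implementation would need to handle overlapping spans.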

Although the storage unit 26 has been described as storing the document received by the input unit 12, it is not limited to this. The storage unit 26 can also store parsed document data, important sentences, compressed sentences, summaries, importance scores and summary scores, the correspondence between named entities and labels, and the like.

[Configuration of the learning device]
Next, the configuration of the learning device according to the embodiment is described with reference to FIG. 2, which shows the configuration of the learning device according to the present embodiment.

The learning device 40 includes an input unit 42 to which information is input, a calculation unit 44 that performs computation on the information input to the input unit 42, and an output unit 46 that outputs the information computed by the calculation unit 44.

The input unit 42 receives a plurality of learning documents to be used for training. The input unit 42 also receives a summary for each learning document. Each summary contains a plurality of sentences.

The calculation unit 44 includes a learning document analysis unit 48, a feature extraction unit 50, a model learning unit 52, and a storage unit 54.

The learning document analysis unit 48 analyzes the learning documents to generate learning document data. Specifically, the learning document analysis unit 48 performs morphological analysis on each learning document to generate word-segmented learning document data. Through morphological analysis, each sentence in the document is divided into morphemes, the smallest meaningful linguistic units, and information such as part of speech, reading, meaning, and base form (for inflected words only) is attached to each morpheme. The learning document analysis unit 48 then performs syntactic parsing on the word-segmented learning document data and generates parsed learning document data for each sentence.

A learning document is, for example, a news article from an online media site; its text data is acquired and stored in the storage unit 54. The news articles cover politics, economy, sports, technology, weather, entertainment, and so on, and may be drawn from one of these categories or from several. The learning documents are not limited to news articles and may be other text data, such as papers or the contents of e-mails.

The feature extraction unit 50 extracts features from the parsed learning document data. The features form a feature vector whose elements are attributes of the learning document data; it is obtained by vectorizing, for machine learning, the feature information of the predetermined units contained in each sentence. The feature vector is stored in the storage unit 54. The predetermined unit is a morpheme or an N-gram. A morpheme is the smallest meaningful linguistic unit. An N-gram is a sequence of N adjacent linguistic units. The feature information of a predetermined unit includes at least one of part of speech, reading, meaning, and base form (for inflected words only). The value of each element is, for example, a binary value (0 or 1), an occurrence count, a TF-IDF (Term Frequency-Inverse Document Frequency) value, or an LDA (Latent Dirichlet Allocation) value.
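
As an illustration of N-gram units and TF-IDF element values, the following is a minimal sketch rather than the feature extractor of the embodiment. It assumes sentences are already split into tokens, treats each sentence as one "document" for the IDF term, and uses the plain unsmoothed form tf × log(total / df); the names `ngrams` and `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the sequences of n adjacent tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(tokenized_sentences, n=1):
    """Build one sparse TF-IDF vector (a dict) per sentence.

    Assumption for this sketch: each sentence counts as a document,
    and IDF is the unsmoothed log(total / document_frequency).
    """
    units = [ngrams(tokens, n) for tokens in tokenized_sentences]
    doc_freq = Counter(u for sent in units for u in set(sent))
    total = len(units)
    vectors = []
    for sent in units:
        counts = Counter(sent)
        vectors.append({
            u: (c / len(sent)) * math.log(total / doc_freq[u])
            for u, c in counts.items()
        })
    return vectors


vectors = tfidf_vectors([["the", "cat", "sat"], ["the", "dog", "ran"]])
bigrams = ngrams(["a", "b", "c"], 2)
```

A unit that occurs in every sentence (here "the") receives weight 0, while units specific to one sentence receive a positive weight.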

The feature extraction unit 50 is preferably designed so that the features yield reliable statistics, for example by using a dictionary or the like to treat synonyms and hypernyms/hyponyms as the same feature, or by applying a threshold to TF-IDF values. This reduces the dimensionality of the feature vector and makes it more robust to noise.
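
The two dimensionality-reduction ideas above, merging synonyms into one feature and thresholding weights, can be sketched as follows. The `canonical` dictionary, the `min_weight` parameter, and the function name `normalize_features` are assumptions made for illustration; the specification does not fix how the dictionary or the threshold is chosen.

```python
def normalize_features(vector, canonical, min_weight=0.0):
    """Merge features that map to the same canonical term, then drop weak ones.

    vector:     dict of feature -> weight (for example, TF-IDF values)
    canonical:  dict mapping a synonym or hyponym to its representative term
    min_weight: features whose merged weight is below this value are removed
    """
    merged = {}
    for term, weight in vector.items():
        key = canonical.get(term, term)
        merged[key] = merged.get(key, 0.0) + weight
    return {k: w for k, w in merged.items() if w >= min_weight}


reduced = normalize_features(
    {"car": 0.3, "automobile": 0.2, "the": 0.01},
    canonical={"automobile": "car"},
    min_weight=0.05,
)
```

Three input features collapse into a single "car" feature, which is the kind of reduction the paragraph describes.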

The model learning unit 52 takes the features as input and learns a model that estimates a summary. The summary for each learning document consists of a plurality of sentences. The model generated through learning by the model learning unit 52 is stored in the storage unit 54.

Model learning uses training data and a predetermined machine learning algorithm. The training data consists of learning documents and summaries, specifically the features extracted from each learning document by the feature extraction unit 50 and the summary corresponding to that learning document. Because the summaries were written by people from the learning documents, these correspondences are correct and suitable as training data. The features serve as the explanatory variables for learning the model, and the summary serves as the objective variable. With the features as input and the summary as output, the model learning unit 52 learns whether the objective variable can be explained by the explanatory variables, and can thereby learn and generate a model that can be analyzed quantitatively.

As the predetermined machine learning algorithm, linear regression, decision trees, logistic regression, neural networks, the k-means method, the nearest neighbor method, a support vector machine (SVM), or the like can be used. A neural network is used for the model used by the compressed sentence generation unit 22 of the present embodiment. An online linear classifier, which can learn from examples one at a time, can also be used as the machine learning algorithm. Such an algorithm is particularly useful when feedback from users is to be reflected in the model immediately after the summarizing device 10 has gone into operation. Online linear classifiers include the perceptron, CW (Confidence Weighted Online Learning), and AROW (Adaptive Regularization of Weight Vectors).
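
To show why an online linear classifier suits immediate feedback, here is a minimal sketch of the simplest one named above, the perceptron. It is not the model of the embodiment: the sparse-dict feature representation, the +1/-1 label convention, and the class name `Perceptron` are assumptions for illustration.

```python
class Perceptron:
    """Binary online linear classifier over sparse feature dicts."""

    def __init__(self):
        self.weights = {}

    def score(self, features):
        return sum(self.weights.get(name, 0.0) * value
                   for name, value in features.items())

    def predict(self, features):
        return 1 if self.score(features) > 0 else -1

    def update(self, features, label):
        """Learn from one example; label is +1 or -1.

        The weights change only when the current prediction is wrong,
        so each piece of feedback is absorbed in a single cheap step.
        """
        if self.predict(features) != label:
            for name, value in features.items():
                self.weights[name] = self.weights.get(name, 0.0) + label * value


model = Perceptron()
model.update({"a": 1.0}, 1)             # mistake: weight for "a" becomes 1.0
model.update({"b": 1.0}, -1)            # already predicted -1: no change
model.update({"a": 1.0, "b": 1.0}, -1)  # mistake: both weights decrease by 1.0
```

CW and AROW follow the same one-example-at-a-time pattern but additionally keep a confidence estimate per weight, which this sketch omits.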

The output unit 46 outputs the generated model. The model is used by the important sentence extraction unit 20, the compressed sentence generation unit 22, and the summary generation unit 24. The models for these units 20, 22, and 24 may be the same model, or may be models generated with different algorithms according to each use.

[Processing flow of summary generation]
Next, the flow of processing executed by the summarizing device 10 will be described with reference to FIG. 3.

First, in step S01, the document analysis unit 18 analyzes the document and generates document data.

Then, in step S02, the important sentence extraction unit 20 extracts, as important sentences, a plurality of sentences having high scores from the document data analyzed in the document analysis step.

Next, in step S03, the compressed sentence generation unit 22 compresses each of the important sentences extracted in the important sentence extraction step and generates a plurality of compressed sentences for each important sentence.

Finally, in step S04, the summary generation unit 24 selects, for each important sentence, the compressed sentence with the highest summary score from among the plurality of compressed sentences corresponding to that important sentence, generates a summary, and the processing ends.
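
Steps S01 to S04 can be pictured as a composition of four stages. The following is a minimal sketch that assumes each stage is supplied as a plain function; the name `summarize`, its parameters, and the toy stand-in functions in the usage example are all hypothetical and stand in for the trained models of the embodiment.

```python
def summarize(document, analyze, score_importance, compress, score_summary,
              n=3, m=5):
    """Run the four steps with caller-supplied stage functions.

    analyze(document)         -> list of sentences            (S01)
    score_importance(sentence) -> number                      (S02)
    compress(sentence, m)     -> list of m candidate strings  (S03)
    score_summary(candidate)  -> number                       (S04)
    """
    sentences = analyze(document)                                  # S01
    important = sorted(sentences, key=score_importance,
                       reverse=True)[:n]                           # S02
    summary = []
    for sentence in important:
        candidates = compress(sentence, m)                         # S03
        summary.append(max(candidates, key=score_summary))         # S04
    return summary


# Toy stand-ins, for illustration only.
result = summarize(
    "A b c d. E f. G h i.",
    analyze=lambda d: [s.strip() for s in d.split(".") if s.strip()],
    score_importance=len,
    compress=lambda s, m: [" ".join(s.split()[:k]) for k in range(1, m + 1)],
    score_summary=lambda c: -abs(len(c.split()) - 2),
    n=2,
    m=3,
)
```

Passing the stages in as functions mirrors the statement that units 20, 22, and 24 may each use a different model.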

Next, the flow of processing executed by the document analysis unit 18 will be described in detail with reference to FIG. 4.

First, in step S11, the document analysis unit 18 performs morphological analysis on the document stored in the storage unit 26 and generates word-segmented document data.

Then, in step S12, the document analysis unit 18 performs syntactic parsing on the word-segmented document data, generates parsed document data for each sentence, and ends this processing.

Next, the flow of processing executed by the important sentence extraction unit 20 will be described in detail with reference to FIG. 5. In this processing, it is set in advance that N important sentences (N is an integer of 2 or more) are extracted from the document data.

First, in step S21, the important sentence extraction unit 20 extracts features from the document data and assigns an importance score based on those features.

Then, in step S22, the important sentence extraction unit 20 extracts the top N important sentences in descending order of importance score. If the extracted important sentences include sentences that are highly similar to one another, the important sentence extraction unit 20 excludes the one with the lower importance score from the extraction targets and extracts the (N+1)-th ranked sentence in its place. When N important sentences have been extracted, this processing ends.
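
The top-N extraction with exclusion of near-duplicates can be sketched as a greedy pass over the ranked sentences. The specification does not say how similarity is measured; this sketch assumes Jaccard similarity over whitespace-separated words and a threshold of 0.6, and the names `jaccard` and `extract_important` are hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two sentences (an assumed similarity measure)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def extract_important(scored_sentences, n, threshold=0.6):
    """Pick up to n sentences by descending score, skipping near-duplicates.

    scored_sentences: list of (sentence, importance_score) pairs.
    A sentence too similar to one already kept is skipped, so the next
    ranked sentence takes its place.
    """
    selected = []
    ranked = sorted(scored_sentences, key=lambda p: p[1], reverse=True)
    for sentence, score in ranked:
        if len(selected) == n:
            break
        if all(jaccard(sentence, kept) < threshold for kept, _ in selected):
            selected.append((sentence, score))
    return selected


picked = extract_important(
    [
        ("the cat sat on the mat", 0.9),
        ("the cat sat on a mat", 0.8),
        ("dogs bark loudly", 0.5),
        ("rain fell", 0.1),
    ],
    n=2,
)
```

Because the pass runs in score order, the lower-scored member of any similar pair is always the one dropped.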

Next, the flow of processing executed by the compressed sentence generation unit 22 will be described in detail with reference to FIG. 6. In this processing, it is set in advance that M compressed sentences (M is an integer of 2 or more) are generated from one important sentence. The value M can be set arbitrarily and is independent of the value N described above.

First, in step S31, before sentence compression, the compressed sentence generation unit 22 extracts the named entities contained in the important sentence and assigns an arbitrary label to each of them.

Then, in step S32, the compressed sentence generation unit 22 analyzes the meaning of the important sentence, generates words from that meaning, and generates compressed sentences that have the same meaning as the important sentence while differing from it in wording and syntax, and this processing ends.

Next, the flow of processing executed by the summary generation unit 24 will be described in detail with reference to FIG. 7. The summary generation unit 24 includes a compressed sentence analysis unit 28 and a score calculation unit 30, and carries out this processing together with these units.

First, in step S41, the compressed sentence analysis unit 28 analyzes each compressed sentence and generates compressed sentence data. Specifically, the compressed sentence analysis unit 28 performs morphological analysis on each compressed sentence to generate word-segmented compressed sentence data. The compressed sentence analysis unit 28 then performs syntactic parsing on the word-segmented compressed sentence data and generates parsed compressed sentence data for each compressed sentence.

Then, in step S42, the score calculation unit 30 calculates a summary score based on the features extracted from the compressed sentence data generated by the compressed sentence analysis unit 28. Specifically, the score calculation unit 30 extracts features from the compressed sentence data and assigns a summary score based on those features.

Furthermore, in step S43, the summary generation unit 24 selects, from the M compressed sentences, the one compressed sentence with the highest summary score. When there are N important sentences, the summary generation unit 24 performs this processing N times and selects N compressed sentences.

Finally, in step S44, the summary generation unit 24 generates a summary from the selected compressed sentences. At this time, based on the position of each important sentence in the document as a whole and the correspondence between the important sentences and their compressed sentences, the summary generation unit 24 can arrange the selected compressed sentences in that order to form the summary.
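
Steps S43 and S44 amount to a per-group maximum followed by a sort on document position. The sketch below assumes the summary scores have already been computed in step S42 and that the selected sentences are simply joined with spaces; the data layout and the name `build_summary` are hypothetical.

```python
def build_summary(groups):
    """Select the best candidate per important sentence and order by position.

    groups: list of (position_in_document, candidates), where candidates is a
            list of (compressed_sentence, summary_score) pairs (M per group).
    """
    chosen = []
    for position, candidates in groups:
        best_sentence, _ = max(candidates, key=lambda c: c[1])   # S43
        chosen.append((position, best_sentence))
    chosen.sort(key=lambda p: p[0])                              # S44: document order
    return " ".join(sentence for _, sentence in chosen)


summary = build_summary([
    (4, [("Sales rose.", 0.7), ("Sales rose sharply last year.", 0.4)]),
    (1, [("The firm reported results.", 0.9), ("Firm reported.", 0.2)]),
])
```

Ordering by the position of the source sentence keeps the summary in the same narrative order as the document, even though extraction ranks sentences by importance.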

It suffices for the summarizing device 10 to have functions capable of executing the series of processes described above, and the present invention is not limited to the functional configuration shown in FIG. 1. The series of processes can be executed by hardware or by software. A single functional block may be implemented by hardware alone, by software alone, or by a combination of the two.

When the series of processes is executed by software, the program constituting the software is installed on a computer or the like from a network or a storage medium. The steps describing the program include not only processing performed in time series in the stated order, but also processing that is executed in parallel or individually without necessarily being performed in time series.

[Hardware configuration of the summarizing device]
Next, the hardware configuration of the summarizing device 10 according to the present embodiment will be described with reference to FIG. 8.

The summarizing device 10 has the configuration of a general computer and includes, for example, a CPU (Central Processing Unit) 100, a RAM (Random Access Memory) 102, a ROM (Read Only Memory) 104, a storage unit 106, a network I/F (Interface) unit 108, an input unit 110, a display unit 112, and a bus 114.

The CPU 100 is an arithmetic unit that realizes the control and functions of the entire device 10 by reading programs and data stored in the ROM 104, the storage unit 106, and the like into the RAM 102 and executing processing. The RAM 102 is a volatile memory used as a work area and the like for the CPU 100. The ROM 104 is a non-volatile memory that stores, for example, the BIOS (Basic Input/Output System) executed when the summarizing device 10 starts up, and various settings.

The storage unit 106 is a storage device, such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), that stores an OS (Operating System), various application programs, and the like.

The network I/F unit 108 is a communication interface for connecting the summarizing device 10 to a network and communicating with an external information terminal or the like (not shown).

The input unit 110 is an input device such as a keyboard or a mouse, and is used by the operator of the summarizing device 10 to input various operation signals to the summarizing device 10. The display unit 112 is a display device such as a display, and displays the results of processing by the summarizing device 10 and the like. The input unit 110, the display unit 112, and the like may be connected to the summarizing device 10 and used only when needed.

The bus 114 is connected in common to each of the above components and carries, for example, address signals, data signals, and various control signals.

The summarizing device 10 of the present embodiment is not limited to the above configuration, and its hardware configuration may be realized by separate computers. When the functions of the summarizing device 10 are provided by an application installed on an information terminal, part of the hardware configuration of the summarizing device 10 of the present embodiment may be realized by the hardware of that information terminal.

Description of reference numerals: 10 summarizing device, 12 input unit, 14 calculation unit, 16 output unit, 18 document analysis unit, 20 important sentence extraction unit, 22 compressed sentence generation unit, 24 summary generation unit, 26 storage unit, 28 compressed sentence analysis unit, 30 score calculation unit, 40 learning device, 42 input unit, 44 calculation unit, 46 output unit, 48 learning document analysis unit, 50 feature extraction unit, 52 model learning unit, 54 storage unit.

Claims

A summarizing device that generates a summary from a document, comprising:
a document analysis unit that analyzes a document and generates document data;
an important sentence extraction unit that extracts, as important sentences, a plurality of sentences having high importance scores from the document data analyzed by the document analysis unit;
a compressed sentence generation unit that compresses each of the important sentences extracted by the important sentence extraction unit to generate a plurality of compressed sentences corresponding to each important sentence; and
a summary generation unit that selects, for each important sentence, the compressed sentence with the highest summary score from among the plurality of compressed sentences corresponding to that important sentence, and generates a summary.

The summarizing device according to claim 1, wherein
the summary generation unit comprises:
a compressed sentence analysis unit that analyzes each of the compressed sentences and generates compressed sentence data; and
a score calculation unit that calculates the summary score based on features extracted from the compressed sentence data.

The summarizing device according to claim 1 or 2, wherein
the compressed sentence generation unit, before the sentence compression, extracts named entities contained in the important sentence and assigns an arbitrary label to each of them, and
the summary generation unit replaces each arbitrary label contained in the selected compressed sentence with the corresponding named entity.

The summarizing device according to any one of claims 1 to 3, wherein
the compressed sentence generation unit compresses the important sentences using a neural network, and
the learning model of the neural network has been trained in advance from a plurality of learning documents and a summary for each of the plurality of learning documents.

A summarizing method for generating a summary from a document, comprising:
a document analysis step of analyzing a document and generating document data;
an important sentence extraction step of extracting, as important sentences, a plurality of sentences having high scores from the document data analyzed in the document analysis step;
a compressed sentence generation step of compressing each of the important sentences extracted in the important sentence extraction step to generate a plurality of compressed sentences corresponding to each important sentence; and
a summary generation step of selecting, for each important sentence, the compressed sentence with the highest summary score from among the plurality of compressed sentences corresponding to that important sentence, and generating a summary.

A program for causing a computer to function as each unit of the summarizing device according to any one of claims 1 to 5.