JP7288293B2

JP7288293B2 - Summary generation device and summary generation method

Info

Publication number: JP7288293B2
Application number: JP2018162525A
Authority: JP
Inventors: 新司飯塚; 秀彰宮内; 毅 ▲高▼橋
Original assignee: 株式会社日立ソリューションズ東日本
Priority date: 2018-08-31
Filing date: 2018-08-31
Publication date: 2023-06-07
Anticipated expiration: 2038-08-31
Also published as: JP2020035272A

Description

The present invention relates to summary generation technology.

For example, in customer service operations at call centers, the speech recognition text of a call, produced by a speech recognition system, is used for data analysis to improve response quality, for operators' registration of call records in a system, and so on.

However, the speech recognition text of a call contains many hesitations and utterances unrelated to the main content, so reading it and grasping the content takes time and effort.

Therefore, there is an increasing need for summary generation technology that summarizes speech recognition text into sentences that are easy for people to read.

Conversations with customers in a call center help service proceed in an order such as greetings, a question about a problem, and the answer to the question. A summary of a call center call should cover both topics in the conversation: the utterances concerning the question and the utterances concerning the answer.

Patent Document 1 below describes an extractive summarization technique that uses distributed representation, a technique for converting words into numerical vectors (see Non-Patent Document 1 for details), to calculate the similarity between sentences and documents, and extracts the sentences to include in the summary based on that similarity.

Patent Document 1: JP 2016-207141 A

Non-Patent Document 1: T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, pp. 3111-3119 (2013).

However, the technique of Patent Document 1 extracts the summary only from the document acquired by the acquisition unit. As a result, formulaic expressions shared with other calls, such as "Thank you for your help," are extracted as they are, and the accuracy of summary extraction may be low compared with the expected extraction result (first problem).

In addition, the technique of Patent Document 1 does not consider how the content of a call changes over time, so it may fail to generate a summary that takes the multiple topics in a call into account; for example, utterances concerning the question are not extracted and only utterances concerning the answer are extracted (second problem).
An object of the present invention is to solve the above problems and improve the accuracy of summary extraction.

In the present invention, for example, unnecessary sentence removal by an automatic classification method using distributed representation and extractive summarization are executed by the following steps 1) to 4).
1) Morphological analysis
2) Removal of unnecessary words such as fillers
3) Removal of unnecessary sentences by an automatic classification method using distributed representation (the means for solving the first problem)
4) Extractive summarization
In step 4), it is preferable to apply the following method (hereinafter referred to as the "sliding window method"): a fixed number of sentences are cut out of the document to be summarized with a window in order of appearance, the document inside the window is summarized by a conventional extractive summarization technique using distributed representation, and the window is slid one sentence at a time to generate a summary of the entire document (in particular, the means for solving the second problem).

In applying the sliding window method, the window size, which is the maximum number of sentences included in a window, is made small enough that the window can be limited to a single topic, and the document inside the window is summarized. The summary result extracted from the document inside the window then concerns that topic.

This is repeated while shifting the window position one sentence at a time, and the summary results extracted from the windows are merged, with duplicate sentences removed, to generate a summary of the entire document. The summary of the entire document thereby includes a summary result for each topic.
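The sliding window method described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `sliding_window_summary` and the callback `pick` (which stands in for the per-window extractive summarizer and returns positions inside the window) are assumptions made for this example.

```python
from typing import Callable, List


def sliding_window_summary(
    sentences: List[str],
    window_size: int,
    pick: Callable[[List[str]], List[int]],
) -> List[str]:
    """Summarize every window and merge the results without duplicates.

    `pick` is a hypothetical per-window summarizer: it receives the
    sentences of one window and returns the offsets it selects.
    """
    if not sentences:
        return []
    chosen = set()  # document-level indices; a set removes duplicates
    last_start = max(len(sentences) - window_size, 0)
    for start in range(last_start + 1):  # slide one sentence at a time
        window = sentences[start:start + window_size]
        for offset in pick(window):
            chosen.add(start + offset)
    # Merge the window summaries in order of appearance.
    return [sentences[i] for i in sorted(chosen)]


def pick_longest(window: List[str]) -> List[int]:
    """Toy summarizer (assumption): select the longest sentence."""
    return [max(range(len(window)), key=lambda i: len(window[i]))]
```

Working with sentence positions rather than sentence text keeps two identical utterances at different points of the call distinct, while a sentence selected by several overlapping windows still appears only once.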

According to one aspect of the present invention, there is provided a summary generation device that extracts sentences from a document containing one or more sentences to generate a summary of the document, the device comprising: a word distributed representation information storage unit in which words and word distributed representations, each representing a word as a multidimensional real-valued vector, are registered; an unnecessary sentence determination training data information storage unit in which sentences, sentence labels each stating whether the sentence is necessary or unnecessary, and sentence distributed representations, each being the distributed representation of the sentence calculated based on the word distributed representation information, are registered; a summary target document acquisition unit that acquires a summary target document, which is the document to be summarized; an unnecessary sentence removal processing unit that, for each sentence contained in the summary target document, calculates the sentence distributed representation of the sentence based on the word distributed representation information stored in the word distributed representation information storage unit, determines by an automatic classification method, from the sentence distributed representation of the sentence and based on the sentence labels and sentence distributed representations of the unnecessary sentence determination training data information registered in the unnecessary sentence determination training data information storage unit, whether the sentence is an unnecessary sentence, and generates a document with unnecessary sentences removed by removing the sentences determined to be unnecessary from the summary target document; and a summary generation unit that generates a summary of the summary target document by extracting sentences from the document with unnecessary sentences removed by an extractive summarization method.

In the unnecessary sentence removal processing unit, the automatic classification method preferably calculates the cosine similarity between the sentence distributed representation of the sentence and each sentence distributed representation registered in the unnecessary sentence determination training data information storage unit whose sentence label is "unnecessary", and determines that the sentence is an unnecessary sentence if at least one of the cosine similarity values is greater than a pre-registered threshold.
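A minimal sketch of this threshold rule is shown below. The function names and the convention of treating a zero vector as having similarity 0 are assumptions made for this illustration; the vectors would in practice come from the training data storage unit.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_unnecessary(
    sentence_vec: Sequence[float],
    unnecessary_vecs: List[Sequence[float]],
    threshold: float,
) -> bool:
    """True if the sentence is closer than `threshold` to at least one
    training sentence labeled "unnecessary"."""
    return any(
        cosine_similarity(sentence_vec, t) > threshold for t in unnecessary_vecs
    )
```

Because a single match is enough, adding one more formulaic greeting to the training data immediately filters sentences similar to it, without retraining a model.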

Alternatively, in the unnecessary sentence removal processing unit, the automatic classification method is preferably any one of the automatic classification methods based on supervised machine learning, including the k-nearest neighbor method, neural networks, and support vector machines, that use the sentence labels and sentence distributed representations registered in the unnecessary sentence determination training data information storage unit as training data.
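As one illustration of the supervised alternative, a k-nearest neighbor classifier over sentence vectors could look like the sketch below. The use of Euclidean distance, the default `k=3`, and the label strings are assumptions of this example; the text does not fix them.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def knn_label(
    sentence_vec: Sequence[float],
    training: List[Tuple[Sequence[float], str]],
    k: int = 3,
) -> str:
    """Return the majority label among the k training vectors nearest to
    `sentence_vec`. `training` holds (sentence vector, sentence label) pairs."""
    nearest = sorted(
        training, key=lambda pair: math.dist(sentence_vec, pair[0])
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With an odd `k` and two labels, the vote cannot tie as long as at least `k` training pairs are available.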

The extractive summarization method preferably calculates, for an input document, which is the document input to the extractive summarization method, the sentence distributed representations of the sentences contained in the input document based on the word distributed representation information, and extracts the sentences to include in the summary of the input document based on the importance of each sentence, calculated from the sentence distributed representations.

Specifically, the extractive summarization method preferably: calculates, for each word contained in the input document, an appearance frequency index, which is a real value calculated based on the appearance frequency of the word in the input document; morphologically analyzes each sentence contained in the input document with the morphological analysis unit to segment it into words; removes from the sentence the unnecessary words, that is, the words determined to be unnecessary by the unnecessary word removal processing unit; obtains, for each word contained in the sentence after unnecessary word removal, the word distributed representation of the word by referring to the word distributed representation information; calculates a weighted word distributed representation by multiplying the word distributed representation by the appearance frequency index of the word; calculates the sentence distributed representation by combining the weighted word distributed representations; calculates the document distributed representation of the input document by combining the sentence distributed representations; and extracts the sentences to include in the summary of the input document based on the importance calculated as the cosine similarity between each sentence distributed representation and the document distributed representation.
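The importance calculation can be sketched as follows. This is an illustration under stated assumptions: "combining" is implemented as a plain vector sum, words missing from the table are skipped, and the names `sentence_importance`, `word_vectors`, and `freq_index` are hypothetical. The input sentences are assumed to be already segmented into words with unnecessary words removed.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sentence_importance(
    tokenized_sentences: List[List[str]],
    word_vectors: Dict[str, Sequence[float]],
    freq_index: Callable[[str], float],
    dim: int,
) -> List[float]:
    """Importance of each sentence: cosine similarity between its
    sentence vector and the document vector."""
    sentence_vecs = []
    for words in tokenized_sentences:
        vec = [0.0] * dim
        for word in words:
            if word not in word_vectors:
                continue  # assumption: skip words without a vector
            scale = freq_index(word)  # appearance frequency index
            for i, x in enumerate(word_vectors[word]):
                vec[i] += scale * x  # sum of weighted word vectors
        sentence_vecs.append(vec)
    # Document vector: sum of the sentence vectors (assumption).
    doc_vec = [sum(column) for column in zip(*sentence_vecs)]
    return [cosine(vec, doc_vec) for vec in sentence_vecs]
```

Sentences whose vectors point in the same direction as the document as a whole score close to 1 and are the candidates for extraction.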

The appearance frequency index of a word should be an index that is calculated, for each word in the input document, based on the appearance frequency of the word in the input document, and that is required to be a positive real value that becomes smaller as the appearance frequency becomes larger.

Multiplying the distributed representation by such an index lowers the weight of words with a high appearance frequency.
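One index that meets these requirements (positive, and smaller for more frequent words) is a / (a + p(w)), where p(w) is the relative frequency of the word in the input document and a is a small positive constant. This particular formula and the default `a=0.1` are assumptions chosen for illustration; the text only states the requirements.

```python
from collections import Counter
from typing import Callable, List


def make_freq_index(
    tokenized_sentences: List[List[str]], a: float = 0.1
) -> Callable[[str], float]:
    """Build an appearance frequency index for one input document.

    The returned function maps a word to a / (a + p(word)), which is
    always positive and decreases as the word becomes more frequent.
    """
    counts = Counter(w for sentence in tokenized_sentences for w in sentence)
    total = sum(counts.values())

    def freq_index(word: str) -> float:
        p = counts[word] / total if total else 0.0
        return a / (a + p)

    return freq_index
```

A word that appears in almost every sentence of the call therefore contributes little to the sentence vectors, while a rare word keeps a weight close to 1.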

Preferably, the device further comprises word weighting information in which words and word weights, which are non-negative real values, are registered, and the extractive summarization method obtains, for each word contained in the sentence after unnecessary word removal, the weight of the word by referring to the word weighting information, and calculates the weighted word distributed representation by multiplying the word distributed representation by the weight of the word and the appearance frequency index of the word.
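The combined weighting can be sketched in a few lines. The function name and the choice of a default weight of 1.0 for words absent from the weighting information are assumptions of this example.

```python
from typing import Callable, Dict, List, Sequence


def weighted_word_vector(
    word: str,
    word_vectors: Dict[str, Sequence[float]],
    word_weights: Dict[str, float],
    freq_index: Callable[[str], float],
    default_weight: float = 1.0,  # assumption: unregistered words keep weight 1
) -> List[float]:
    """Word vector scaled by (registered word weight) x (frequency index)."""
    scale = word_weights.get(word, default_weight) * freq_index(word)
    return [scale * x for x in word_vectors[word]]
```

A weight of 0 removes a word's contribution entirely, while a large weight such as 10.0 lets a domain keyword dominate the sentence vector.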

In the extractive summarization method, the unnecessary word removal processing unit preferably determines as unnecessary either or both of the words whose part of speech, as determined by the morphological analysis unit, is filler and the words whose part of speech is interjection.
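A sketch of this filtering step is given below. It assumes the morphological analyzer has already produced (word, part of speech) pairs; the tag strings "filler" and "interjection" and the function name are hypothetical and depend on the analyzer and dictionary actually used.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical part-of-speech tags treated as unnecessary.
UNNECESSARY_POS: Set[str] = {"filler", "interjection"}


def remove_unnecessary_words(
    tagged_words: Iterable[Tuple[str, str]],
    unnecessary_pos: Set[str] = UNNECESSARY_POS,
) -> List[str]:
    """Keep only the words whose part of speech is not marked unnecessary."""
    return [word for word, pos in tagged_words if pos not in unnecessary_pos]
```

Passing `{"filler"}` or `{"interjection"}` as `unnecessary_pos` covers the cases where only one of the two categories is removed.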

Preferably, the extractive summarization method generates, for an input document, which is the document input to the extractive summarization method, one or more windows of the input document, each window being a document consisting of some or all of the sentences that appear consecutively in the input document; the generated windows satisfy the condition that every sentence contained in the input document is contained in at least one of the windows; for each generated window, a summary of the window is generated by extracting sentences with the extractive summarization method described above; and the summary of the input document is generated by merging the window summaries and removing duplicate sentences.

Preferably, the device further comprises a summary parameter setting unit that sets, from the input device, the window size, which is the maximum number of sentences included in a window, and in the extractive summarization method the generated windows further satisfy the following conditions: the number of sentences contained in each generated window is no greater than the window size; and, for a first sentence contained in the input document and a second sentence that appears immediately after the first sentence in the input document, if the first sentence is the last sentence in order of appearance in at least one of the windows, then the second sentence is also the last sentence in order of appearance in at least one of the windows.
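The window conditions can be made concrete with a small generator and checker. This is a sketch under assumptions: windows are represented as lists of sentence indices, `make_windows` produces one particular family of windows (full-size windows shifted by one sentence), and both function names are hypothetical.

```python
from typing import List


def make_windows(n_sentences: int, window_size: int) -> List[List[int]]:
    """Windows of consecutive sentence indices, shifted one sentence at a time."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if n_sentences == 0:
        return []
    last_start = max(n_sentences - window_size, 0)
    return [
        list(range(start, min(start + window_size, n_sentences)))
        for start in range(last_start + 1)
    ]


def satisfies_conditions(
    windows: List[List[int]], n_sentences: int, window_size: int
) -> bool:
    """Check the three window conditions stated in the text."""
    covered = {i for window in windows for i in window}
    ends = {window[-1] for window in windows if window}
    # 1) every sentence is contained in at least one window
    covers_all = covered == set(range(n_sentences))
    # 2) no window exceeds the window size
    within_size = all(len(window) <= window_size for window in windows)
    # 3) if a sentence ends some window, the next sentence also ends some window
    ends_propagate = all(i + 1 in ends for i in ends if i + 1 < n_sentences)
    return covers_all and within_size and ends_propagate
```

The third condition rules out partitions into disjoint blocks: once a window ends at some sentence, there must be a window ending at each later sentence, which is what sliding by one sentence provides.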

The device is further characterized in that it comprises a summary parameter setting unit that sets, from the input device, a target number of sentences to extract into the summary, a continuation condition for the summarization process, and a termination condition for the summarization process, and in that the extractive summarization method, for an input document, which is the document input to the extractive summarization method: initializes an output document by assigning the input document to it; executes an update process that generates a summary of the output document by applying the extractive summarization method described above to the output document and updates the output document by assigning the generated summary to it; repeats the update process if the continuation condition is satisfied, or if the termination condition is not satisfied and the number of sentences contained in the output document is greater than the target number; and otherwise outputs the output document as the summary of the input document.
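The repeated update process can be sketched as a loop. The text does not say what the continuation and termination conditions look at, so this example assumes they are callables receiving the current output document and the number of updates performed so far; all names are hypothetical.

```python
from typing import Callable, List

Condition = Callable[[List[str], int], bool]


def iterative_summary(
    document: List[str],
    summarize: Callable[[List[str]], List[str]],
    target_count: int,
    should_continue: Condition,
    should_stop: Condition,
) -> List[str]:
    """Apply `summarize` repeatedly until the loop condition fails."""
    output = list(document)  # initialize the output with the input document
    iteration = 0
    # Repeat while the continuation condition holds, or while the
    # termination condition does not hold and the output is still longer
    # than the target number of sentences.
    while should_continue(output, iteration) or (
        not should_stop(output, iteration) and len(output) > target_count
    ):
        output = summarize(output)  # update process
        iteration += 1
    return output
```

Note that if `summarize` stops shortening the document, only the termination condition ends the loop, so in practice that condition would typically bound the number of iterations.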

According to another aspect of the present invention, there is provided a summary generation method that extracts sentences by computer processing from a document containing one or more sentences to generate a summary of the document, in which a computer executes: a summary target document acquisition step of acquiring a summary target document, which is the document to be summarized; an unnecessary sentence removal processing step of a) calculating, for each sentence contained in the summary target document, the sentence distributed representation of the sentence based on word distributed representation information in which words and word distributed representations, each representing a word as a multidimensional real-valued vector, are registered, b) determining by an automatic classification method whether the sentence is an unnecessary sentence, from the sentence labels and sentence distributed representations contained in unnecessary sentence determination training data information in which sentences, sentence labels each stating whether the sentence is necessary or unnecessary, and sentence distributed representations, each being the distributed representation of the sentence calculated based on the word distributed representation information, are registered, and c) generating a document with unnecessary sentences removed by removing the sentences determined to be unnecessary from the summary target document; and d) a summary generation step of generating a summary of the summary target document by extracting sentences from the document with unnecessary sentences removed by an extractive summarization method.

According to the present invention, formulaic expressions shared with other calls can be removed in advance by unnecessary sentence determination, so the accuracy of summary extraction can be improved compared with conventional techniques.
In addition, by applying the sliding window method, the important sentences of each topic in the document are included in the summary result of one of the windows, so merging those results into an overall summary produces a summary that takes multiple topics into account.

FIG. 1 is a functional block diagram showing a configuration example of a summary generation device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing an example of overall processing by the summary generation device according to the embodiment.
FIG. 3 is a flowchart showing a processing example of the word distributed representation learning unit.
FIG. 4 is a flowchart showing a processing example of the unnecessary sentence determination training data generation unit.
FIG. 5 is a flowchart showing a processing example of the unnecessary sentence removal processing unit.
FIG. 6 is a flowchart showing a processing example in which the sliding window method is applied in the summary generation unit.
FIG. 7 is a flowchart showing a processing example of step S6-8 of FIG. 6.
FIG. 8 is a diagram showing an example of a window in the sliding window method.
FIG. 9 is a diagram showing a configuration example of the word dictionary table.
FIG. 10 is a diagram showing a configuration example of the word weighting table.
FIG. 11 is a diagram showing a configuration example of the distributed representation learning corpus table.
FIG. 12 is a diagram showing a configuration example of the word distributed representation table.
FIG. 13 is a diagram showing a configuration example of the unnecessary sentence determination training data table.
FIG. 14 is a diagram showing a configuration example of the summary target document table.
FIG. 15 is a diagram showing a configuration example of the preprocessing result table.
FIG. 16 is a diagram showing a configuration example of the summary result table.

In the following, a summary generation technology for businesses that provide help services, such as call centers, is described as an example, but the present invention is applicable to summary generation technology for various other kinds of work.

In this specification, the sliding window method refers to a method in which a fixed number of sentences are cut out of the document to be summarized with a window in order of appearance, the document inside the window is summarized by a conventional extractive summarization technique using distributed representation, and the window is slid one sentence at a time to generate a summary of the entire document.

In this specification, the various types of information shown in FIG. 1 and FIGS. 9 to 16 are illustrated in table form. This information is typically stored, for example, in the storage units (or storage areas) of the auxiliary storage device of FIG. 1 that store the respective data. Also in this specification, the distributed representation of a word means an embedding of the word into a vector space learned by, for example, word2vec, the technique of Non-Patent Document 1. In the following, the vector associated with a word by that embedding is itself also referred to as the distributed representation of the word. The intent is to represent words as vectors of roughly several hundred dimensions so that machine learning can be applied to natural language processing more easily.

A summary generation technique according to an embodiment of the present invention is described in detail below with reference to the drawings.

FIG. 1 is a functional block diagram showing a configuration example of the summary generation device according to this embodiment. As shown in FIG. 1, the summary generation device A according to this embodiment has an auxiliary storage device (the storage units) 1, a main storage device 2, an input device 3, an output device 4, and a central processing unit (CPU) 5. Although FIG. 1 shows a configuration example in which all the components are provided in one device, various other forms are of course included, such as one in which the auxiliary storage device is provided remotely in another device.

The auxiliary storage device (the storage units) 1 holds a word dictionary table 1-1, a word weighting table 1-2, a distributed representation learning corpus table 1-3, a word distributed representation table (information) 1-4, an unnecessary sentence determination training data table (information) 1-5, a summary target document table (information) 1-6, a preprocessing result table 1-7, and a summary result table 1-8.

The main storage device 2 stores, for example, programs that cause the CPU to function as a morphological analysis unit 2-1, an unnecessary word removal processing unit 2-2, a word distributed representation learning unit 2-3, an unnecessary sentence determination training data generation unit 2-4, an unnecessary sentence removal processing unit 2-5, a summary target document acquisition unit 2-6, and a summary generation unit 2-7, and the programs are configured so that the CPU functions as each of these functional units.

The input device 3 includes a microphone for acquiring speech, a mouse, a keyboard, and the like, and the output device 4 includes a display, a speaker, and the like for outputting the summary.

The following drawings are flowcharts showing examples of the flow of processing by the processing units 2-1 to 2-7. FIG. 2 is a flowchart showing an example of overall processing by the summary generation device A according to this embodiment. FIG. 3 is a flowchart showing a processing example of the word distributed representation learning unit 2-3. FIG. 4 is a flowchart showing a processing example of the unnecessary sentence determination training data generation unit 2-4. FIG. 5 is a flowchart showing a processing example of the unnecessary sentence removal processing unit 2-5. FIG. 6 is a flowchart showing a processing example in which the sliding window method is applied in the summary generation unit 2-7. FIG. 7 is a flowchart showing a processing example of step S6-8 of FIG. 6.
FIG. 8 is a diagram showing an example of a window.

FIGS. 9 to 16 are diagrams each showing a configuration example of a table in the auxiliary storage device 1 in which various information is stored. FIG. 9 is a diagram showing a configuration example of the word dictionary table 1-1. The word dictionary table 1-1 is the dictionary that the morphological analysis unit 2-1 refers to in order to determine parts of speech during morphological analysis. As shown in FIG. 9, the part of speech is recorded for each word, and in particular the table states explicitly whether the word is a filler, which is likely to be an unnecessary word. FIG. 10 is a diagram showing a configuration example of the word weighting table 1-2, in which a word weight is assigned to each word. Values such as 0 and 0.5 are examples of small weights, and values such as 10.0 are examples of large weights.

FIG. 11 is a diagram showing a configuration example of the distributed representation learning corpus table 1-3. FIG. 12 is a diagram showing a configuration example of the word distributed representation table 1-4. FIG. 13 is a diagram showing a configuration example of the unnecessary sentence determination training data table 1-5. As shown in FIG. 13, the unnecessary sentence determination training data table holds, for each training data ID, the training data sentence, the sentence label stating whether the sentence is necessary or unnecessary, and vector values 1 to 200 (as an example). FIG. 14 is a diagram showing a configuration example of the summary target document table 1-6. FIG. 15 is a diagram showing a configuration example of the preprocessing result table 1-7. As shown in FIG. 15, the preprocessing result table stores, for each document ID, the sentence IDs of the sentences it contains, their word segmentation, and the unnecessary sentence determination results. FIG. 16 is a diagram showing a configuration example of the summary result table 1-8.

Details of the processing according to this embodiment are described below with reference to the flowcharts.
As shown in FIG. 2, in the overall summary generation process, when processing starts (START), word distributed representations are learned in step S1. This process is described in detail with reference to FIG. 3 below. Next, in step S2, unnecessary sentence determination training data is generated. This process is described in detail with reference to FIG. 4 below. Next, in step S3, it is determined whether processing has been completed for all summary target documents. If the result of step S3 is Yes, the process ends (END). If the result of step S3 is No, the process proceeds to step S4, where the summary target document acquisition unit 2-6 acquires, based on the document ID, one summary target document D that has not yet been processed from the summary target document table 1-6 (FIG. 14). A document is a sequence of sentences. In this embodiment, a document is the sequence obtained by arranging all or some of the sentences registered in the summary target document table 1-6 with the same document ID in ascending order of sentence ID. Next, in step S5, unnecessary sentences are removed from the summary target document D. The processing of step S5 is described in detail with reference to FIG. 5. In step S6, a summary S of the summary target document D is generated. The processing of step S6 is described in detail with reference to FIGS. 6 and 7. Next, in step S7, the summary S is stored in the summary result table 1-8 and the process returns to step S3. When all processing is finally complete, the process ends (END).

Through the above processing, the summary results are stored in the summary result table 1-8, as illustrated in FIG. 16. The summary result table 1-8 includes a document ID, a sentence ID, and the content of the sentence for each sentence ID.

FIG. 3 is a diagram showing a detailed processing example of the word distributed representation learning process (step S1) of FIG. 2. As shown in FIG. 3, in step S1-1, the morphological analysis unit 2-1 morphologically analyzes each sentence in the distributed representation learning corpus table 1-3 and segments it into words. As shown in FIG. 11, the distributed representation learning corpus table 1-3 holds a corpus of accumulated natural-language sentences for use in natural language processing, and includes a corpus ID, a sentence ID, and the content of each sentence.

Next, in step S1-2, distributed representations of words are learned using an existing learning method such as word2vec, the technique of Non-Patent Document 1, taking as input all the sentences segmented into words by the morphological analysis unit 2-1. The learned representations serve as reference data for the distributed representations of words.

Next, in step S1-3, the word distributed representations learned above are stored in the word distributed representation table 1-4 so that they can be referenced later. The word distributed representation learning process (step S1) then ends (RETURN). As shown in FIG. 12, the word distributed representation table 1-4 holds each word together with the vector values of its distributed representation, for example a 200-dimensional vector. The vector values are real numbers that may be positive or negative. Words with similar meanings have similar vector values.

FIG. 4 is a diagram showing a detailed processing example of the unnecessary sentence judgment training data generation process (step S2) of FIG. 2. First, in step S2-1, it is determined whether all rows of the unnecessary sentence judgment training data table 1-5 have been processed. If Yes, the process ends (RETURN).

If No, in step S2-2, one unprocessed training data sentence $s_i$ is acquired from the unnecessary sentence judgment training data table 1-5. As shown in FIG. 13, the unnecessary sentence judgment training data table 1-5 stores, for each training data ID, a training data sentence (one sentence) and a sentence label indicating whether the sentence is necessary or unnecessary. Vector values are stored in the table as the processing proceeds.

In step S2-3, the morphological analysis unit 2-1 morphologically analyzes the sentence $s_i$ and segments it into words. In step S2-4, the unnecessary word removal processing unit 2-2 removes unnecessary words from the sentence $s_i$ by referring to the part-of-speech information obtained in the morphological analysis. As an example, the unnecessary word removal processing unit 2-2 treats as unnecessary the words that the morphological analysis unit 2-1 has determined to be fillers, the words it has determined to be interjections, or both. The part-of-speech determination by the morphological analysis unit 2-1 uses the part-of-speech information registered in the word dictionary table 1-1 shown in FIG. 9.
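The unnecessary word removal of step S2-4 can be illustrated with a minimal sketch. It assumes that the morphological analysis has already produced (surface form, part of speech) pairs; the function name `remove_unnecessary_words` and the part-of-speech labels `"filler"` and `"interjection"` are hypothetical and are not taken from the word dictionary table 1-1.

```python
from typing import List, Tuple

# Assumption: hypothetical part-of-speech labels treated as unnecessary.
UNNECESSARY_POS = {"filler", "interjection"}


def remove_unnecessary_words(tokens: List[Tuple[str, str]]) -> List[str]:
    """Keep the surface forms whose part of speech is not in UNNECESSARY_POS."""
    return [surface for surface, pos in tokens if pos not in UNNECESSARY_POS]


# Toy input standing in for the output of a morphological analyzer.
tokens = [("uh", "filler"), ("card", "noun"), ("lost", "verb"), ("oh", "interjection")]
kept = remove_unnecessary_words(tokens)
```

In this toy input, only the noun and the verb remain after filtering.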

In step S2-5, the sentence distributed representation $x_i = \sum_{w \in s_i} x_w$ of the sentence $s_i$ is calculated by referring to the word distributed representation $x_w$ of each word $w$ registered in the word distributed representation table 1-4. Here, the word distributed representation $x_w$ is the vector whose components are vector values 1 to 200 (an example) registered in the word distributed representation table 1-4. The notation $w \in s_i$ indicates that the word $w$ appears in the sentence $s_i$, and the $\sum$ above denotes the sum of the word distributed representations $x_w$ over the words $w$ appearing in the sentence $s_i$. Next, in step S2-6, the sentence distributed representation $x_i$ is registered in the unnecessary sentence judgment training data table 1-5, and the process returns to step S2-1.
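As an illustration of the sum in step S2-5, the following sketch adds up the word vectors of one sentence using plain Python lists. The function name `sentence_vector`, the toy three-dimensional vectors, and the choice to skip words that are missing from the table are assumptions made for this example only.

```python
from typing import Dict, List, Sequence


def sentence_vector(words: Sequence[str],
                    word_vectors: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Sum the distributed representations of the words in one sentence."""
    total = [0.0] * dim
    for w in words:
        vec = word_vectors.get(w)
        if vec is None:  # assumption: words absent from the table are skipped
            continue
        for j, value in enumerate(vec):
            total[j] += value
    return total


# Toy table standing in for the word distributed representation table 1-4.
toy_vectors = {"card": [1.0, 0.0, 2.0], "lost": [0.5, -1.0, 0.0]}
x_i = sentence_vector(["card", "lost", "unknown"], toy_vectors, dim=3)
```

Because the representation is a plain sum, longer sentences tend to have larger norms; the cosine similarity used in the later steps is insensitive to that scale.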

FIG. 5 is a flowchart showing an example of the detailed flow of step S5 in FIG. 2, that is, an example of the flow of the unnecessary sentence removal process. The unnecessary sentence removal process of step S5 takes as input the unprocessed summary target document D acquired in step S4.

First, in step S5-1, it is determined whether all sentences of the summary target document D have been processed. If No, the process continues, and in step S5-2 one unprocessed sentence $s_i$ is acquired from the summary target document D. Next, in step S5-3, the morphological analysis unit 2-1 morphologically analyzes the sentence $s_i$ and segments it into words. Next, in step S5-4, the unnecessary word removal processing unit 2-2 removes unnecessary words from the sentence $s_i$.

Next, in step S5-5, the sentence $s_i$ after unnecessary word removal is registered in the word segmentation field of the preprocessing result table 1-7. Next, in step S5-6, the sentence distributed representation $x_i = \sum_{w \in s_i} x_w$ of the sentence $s_i$ is calculated by referring to the word distributed representation $x_w$ of each word $w$ registered in the word distributed representation table 1-4.

Note that the overall flow of steps S5-3 to S5-6 in FIG. 5 is the same as that of steps S2-3 to S2-5 in FIG. 4 described above.

Next, in step S5-7, whether the sentence $s_i$ is an unnecessary sentence is determined by an automatic classification method. The training data are the pairs of a sentence label and a sentence distributed representation (a vector whose components are vector values 1 to 200, as an example) registered in the unnecessary sentence judgment training data table 1-5, and the input is the sentence distributed representation $x_i$ calculated above.

It is preferable to use one of the following methods for determining unnecessary sentences by the automatic classification method.
(a) Similar sentence search by cosine similarity
Among the sentence distributed representations registered in the unnecessary sentence judgment training data table 1-5, the cosine similarity between each sentence distributed representation whose sentence label is "unnecessary" and the sentence distributed representation $x_i$ of the sentence is calculated. If at least one of these cosine similarities is greater than a pre-registered threshold, the sentence $s_i$ is determined to be an unnecessary sentence.

(b) Determination of unnecessary sentences by supervised machine learning
Whether the sentence $s_i$ is an unnecessary sentence is determined by any one of the automatic classification methods based on supervised machine learning, including the k-nearest neighbor method, neural networks, and support vector machines, with the sentence labels and sentence distributed representations registered in the unnecessary sentence judgment training data table 1-5 used as training data.
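Method (a) can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the label string `"unnecessary"`, the toy two-dimensional vectors, and the convention that a zero vector has similarity 0 are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def is_unnecessary(x_i: Sequence[float],
                   training: List[Tuple[str, List[float]]],
                   threshold: float) -> bool:
    """True if any training vector labeled "unnecessary" exceeds the threshold."""
    return any(
        label == "unnecessary" and cosine_similarity(x_i, vec) > threshold
        for label, vec in training
    )


# Toy rows standing in for the unnecessary sentence judgment training data table 1-5.
training = [("unnecessary", [1.0, 0.0]), ("necessary", [0.0, 1.0])]
close_to_unnecessary = is_unnecessary([0.9, 0.1], training, threshold=0.8)
far_from_unnecessary = is_unnecessary([0.1, 0.9], training, threshold=0.8)
```

Only the rows labeled "unnecessary" are compared, so a sentence that resembles a "necessary" training sentence is never removed by this rule.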

Next, in step S5-8, the result of the unnecessary sentence determination is registered in the preprocessing result table 1-7. The process returns from step S5-8 to step S5-1 and continues until the result of step S5-1 is Yes. If Yes in step S5-1, in step S5-9 the preprocessing result table 1-7 is referenced and the sentences determined to be unnecessary are removed from the summary target document D. The unnecessary sentence removal process (step S5) then ends (RETURN).

The unnecessary sentence removal process described above determines, by an automatic classification method, whether each sentence is an unnecessary sentence from the sentence distributed representation of the sentence, based on the sentence labels and sentence distributed representations of the unnecessary sentence judgment training data registered in the unnecessary sentence judgment training data table 1-5, and removes the sentences determined to be unnecessary from the summary target document. As a result, a document from which unnecessary sentences have been removed is generated.

FIG. 6 is a flowchart showing an example of the detailed flow of step S6 in FIG. 2, that is, an example of the flow of the process of generating the summary S of the summary target document D by the extractive summarization method. Here, the summary of the summary target document is generated by applying the above sliding window method recursively (hereinafter referred to as the "recursive sliding window method").

First, in step S6-1, the IDF value $\mathrm{idf}_w$ of each word $w$ contained in the input document D, from which unnecessary sentences have been removed, is calculated. Here, the IDF value $\mathrm{idf}_w$ of the word $w$ is an index of the appearance frequency of the word $w$ in the document D. Using the number of sentences $|D|$ contained in the document D and the number of sentences $|\{s \in D : w \in s\}|$ that are contained in the document D and contain the word $w$, it is calculated as $\mathrm{idf}_w = \log(|D| / |\{s \in D : w \in s\}|)$. The IDF of the word $w$ is a positive real value that becomes smaller as the appearance frequency of the word $w$ in the document D becomes larger. Next, in step S6-2, the output document S is initialized to the input document D, and the integer $r$ (the recursion count) is initialized to 0. Next, in step S6-3, the number of sentences $N$ in the output document S is calculated. Next, in step S6-4, it is determined whether $r < R_{\min}$ or ($r < R_{\max}$ and $N > M$).
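The IDF calculation of step S6-1 can be sketched as below, treating each sentence as a list of words. The function name `compute_idf` and the toy document are assumptions for illustration; the natural logarithm is used here because the text does not specify a base.

```python
import math
from typing import Dict, List


def compute_idf(document: List[List[str]]) -> Dict[str, float]:
    """idf_w = log(|D| / number of sentences in D that contain w)."""
    n_sentences = len(document)
    sentence_count: Dict[str, int] = {}
    for sentence in document:
        for w in set(sentence):  # count each word at most once per sentence
            sentence_count[w] = sentence_count.get(w, 0) + 1
    return {w: math.log(n_sentences / c) for w, c in sentence_count.items()}


# Toy document D: four sentences already segmented into words.
doc = [["card", "lost"], ["card", "reissue"], ["thanks"], ["card"]]
idf = compute_idf(doc)
```

In this toy document, "card" appears in three of the four sentences and therefore receives a smaller IDF than "lost", which appears in only one.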

Here, the minimum number of applications $R_{\min}$ is a parameter indicating that the sliding window method is applied recursively at least that many times. The maximum number of applications $R_{\max}$ means that the number of recursive applications of the sliding window method does not exceed that number. They represent the continuation condition and the termination condition of the summarization process, respectively. $M$ is the target number of sentences to be extracted into the summary S. For example, the device may further include a summary parameter setting unit that sets, from the input device 3 (mouse, keyboard, etc.), the target number of sentences $M$ to be included in the summary S, the minimum number of applications $R_{\min}$ as the continuation condition of the summarization process, and the maximum number of applications $R_{\max}$ as the termination condition of the summarization process.

The subsequent steps S6-5 to S6-10 apply the sliding window method once to the output document S.

In step S6-5, the sentences of the output document S are denoted $s_1, s_2, \ldots, s_N$ in order of appearance. Next, in step S6-6, the set $S^*$ is initialized to the empty set and the integer $k$ is initialized to $(1 - T_r)$. The integer $k$ represents the window position. $T_r$ is the window position offset, a non-negative integer. $T_r$ may take a different value for each recursion count $r$.

Next, in step S6-7, it is determined whether $k \le N - L_r + 1$. If Yes, in step S6-8, the summary $S_k$ of the window $W_k = \{s_i : k \le i < k + L_r\}$ is added to the set $S^*$. Here, $L_r$ is the window size, a positive integer representing the maximum number of sentences included in a window $W_k$. $L_r$ may take a different value for each recursion count $r$. For example, the device may further include a summary parameter setting unit that sets the window size $L_r$ from the input device 3 (mouse, keyboard, etc.). Details of step S6-8 are described later. The process then proceeds to step S6-9, sets $k \leftarrow k + 1$, and returns to step S6-7. If No in step S6-7, the process proceeds to step S6-10, where the output document S is updated to $S^*$ and $r$ to $r + 1$. That is, the sliding window method is applied once to the output document S, the summary $S^*$ of the output document S is calculated, and the output document S is updated with the summary $S^*$. The process then returns to step S6-3. If the determination in step S6-4 is No, the summary S is output in step S6-11. The process of generating the summary S of the summary target document D (step S6) then ends (RETURN).

In the above sliding window processing, the generated windows further satisfy the following conditions: the number of sentences in every generated window is at most the window size; and, for a first sentence contained in the input document and a second sentence that appears immediately after the first sentence in the input document, if the first sentence is the last sentence in order of appearance in at least one window (for example, window $W_k$), then the second sentence is also the last sentence in order of appearance in at least one window (for example, window $W_{k+1}$).

In the above recursive sliding window processing, the output document is first initialized by assigning the input document to it. An update process is then executed that generates a summary of the output document by applying the sliding window method to it and updates the output document by assigning the generated summary to it.

If the continuation condition of the summarization process is satisfied, or if the termination condition of the summarization process is not satisfied and the number of sentences in the output document is greater than the target number of sentences to be extracted, the update process is repeated. Otherwise, the output document is output as the summary of the input document.
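The control flow of the recursive sliding window method described above can be sketched as a simple loop. The function name `recursive_sliding_window` and the callable `summarize_once`, which stands in for one application of the sliding window method, are hypothetical; the toy summarizer that keeps every other sentence is used only to make the loop observable.

```python
from typing import Callable, List


def recursive_sliding_window(document: List[str],
                             summarize_once: Callable[[List[str], int], List[str]],
                             r_min: int, r_max: int, target: int) -> List[str]:
    """Repeat the update while r < R_min or (r < R_max and N > M)."""
    output = list(document)  # initialize the output document with the input
    r = 0                    # recursion count
    while r < r_min or (r < r_max and len(output) > target):
        output = summarize_once(output, r)  # one sliding window application
        r += 1
    return output


# Toy stand-in for one pass: keep every other sentence (assumption, not the real method).
def keep_every_other(sentences: List[str], r: int) -> List[str]:
    return sentences[::2]


sentences = [f"s{i}" for i in range(1, 17)]  # s1 .. s16
summary = recursive_sliding_window(sentences, keep_every_other,
                                   r_min=1, r_max=5, target=3)
capped = recursive_sliding_window(sentences, keep_every_other,
                                  r_min=0, r_max=1, target=3)
```

With these toy settings the first call stops after three passes, once the number of sentences has dropped to the target or below, while the second call stops after one pass because the maximum number of applications is reached first.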

FIG. 7 is a diagram showing an example of the detailed processing flow of step S6-8 in FIG. 6. In step S6-8, first, in step S6-8-1, for a sentence $s_i \in W_k$ included in the window $W_k$, the word segmentation of the sentence $s_i$ is acquired by referring to the preprocessing result table 1-7. In step S6-8-2, for each word $w \in s_i$ of the sentence $s_i$, the word distributed representation $x_w$ of the word $w$ registered in the word distributed representation table 1-4 is acquired. In step S6-8-3, for the sentence $s_i \in W_k$, the sentence distributed representation $x_i = \sum_{w \in s_i} \rho_w \, \mathrm{idf}_w \, x_w$ of the sentence $s_i$ is calculated, using as weights the IDF value $\mathrm{idf}_w$ of the word $w$ calculated in step S6-1 and the weight $\rho_w$ of the word $w$ registered in the word weighting table 1-2. This is the word weighting process.

In the above processing, the extractive summarization method obtains, for each word contained in a sentence from which unnecessary words have been removed, the weight of the word by referring to the word weighting table 1-2, and calculates a weighted word distributed representation by multiplying the word distributed representation by the weight of the word and by the IDF, the index of the appearance frequency of the word.

Next, in step S6-8-4, it is determined whether all sentences $s_i \in W_k$ included in the window $W_k$ have been processed. If No, the process returns to step S6-8-1. If Yes, the process proceeds to step S6-8-5, where the distributed representation $x_{W_k}$ of the window $W_k$ is calculated. The distributed representation of the window $W_k$ is the sum of the sentence distributed representations $x_i$ of the sentences $s_i$ included in the window $W_k$, calculated as $x_{W_k} = \sum_{s_i \in W_k} x_i$. Next, in step S6-8-6, for each sentence $s_i \in W_k$, the importance $v_i$ of the sentence $s_i$ is calculated as the cosine similarity between the distributed representation $x_{W_k}$ of the window $W_k$ and the sentence distributed representation $x_i$ of the sentence $s_i$, that is, $v_i = (x_{W_k} \cdot x_i) / (\|x_{W_k}\| \, \|x_i\|)$. Next, in step S6-8-7, the top $m_r$ sentences in terms of importance $v_i$ are extracted from $W_k$ and taken as the summary $S_k$ of the window $W_k$. Here, $m_r$ is an integer representing the number of sentences to be included in the summary of a window, and is at least 1 and at most the window size $L_r$. $m_r$ may take a different value for each recursion count $r$. Further, in step S6-8-8, $S_k$ is merged into $S^*$ and duplicate sentences are removed from $S^*$. Step S6-8 then ends (RETURN).
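Steps S6-8-3 to S6-8-7 can be sketched for a single window as follows. This is an illustrative sketch under assumptions: the function names, the default weight of 1.0 for words missing from the weight table, the score of 0 for zero vectors, and the choice to return the selected positions in order of appearance are not specified in the text.

```python
import math
from typing import Dict, List

Vector = List[float]


def weighted_sentence_vector(words: List[str], word_vectors: Dict[str, Vector],
                             idf: Dict[str, float], rho: Dict[str, float],
                             dim: int) -> Vector:
    """x_i = sum over w in s_i of rho_w * idf_w * x_w (step S6-8-3)."""
    total = [0.0] * dim
    for w in words:
        if w not in word_vectors:  # assumption: unknown words are skipped
            continue
        scale = rho.get(w, 1.0) * idf.get(w, 0.0)  # assumption: default weight 1.0
        for j, value in enumerate(word_vectors[w]):
            total[j] += scale * value
    return total


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0  # assumption: zero vectors score 0


def summarize_window(window: List[List[str]], word_vectors: Dict[str, Vector],
                     idf: Dict[str, float], rho: Dict[str, float],
                     dim: int, m: int) -> List[int]:
    """Return positions (within the window) of the m most important sentences."""
    vectors = [weighted_sentence_vector(s, word_vectors, idf, rho, dim) for s in window]
    window_vec = [sum(col) for col in zip(*vectors)]       # step S6-8-5
    scores = [cosine(window_vec, v) for v in vectors]      # step S6-8-6
    top = sorted(range(len(window)), key=lambda i: scores[i], reverse=True)[:m]
    return sorted(top)  # assumption: report in order of appearance


# Toy data: two sentences about a card and one off-topic sentence.
toy_vectors = {"card": [1.0, 0.0], "lost": [0.8, 0.2],
               "reissue": [0.9, 0.1], "weather": [0.0, 1.0]}
toy_idf = {w: 1.0 for w in toy_vectors}
window = [["card", "lost"], ["weather"], ["card", "reissue"]]
picked = summarize_window(window, toy_vectors, toy_idf, rho={}, dim=2, m=2)
```

In the toy window, the two card-related sentences point in nearly the same direction as the window vector and are selected, while the off-topic sentence is left out.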

When the sliding window method is used in this way, the important sentences of each topic in a call are included in the summary of one of the windows, so merging these window summaries into the summary of the entire document yields a summary that takes multiple topics into account. In addition, the recursive sliding window method allows the target number of sentences to be included in the summary to be specified, so the summary result can be adjusted to a desired summarization rate.

Note that whether to use the recursive sliding window method of FIG. 7 as the extractive summarization method in step S6 of FIG. 2 can be decided as needed.

FIG. 8 is a diagram showing examples of windows. Here, the window size is $L_r = 4$, the window position offset is $T_r = 2$, and the number of sentences in the output document S is $N$. The subscript $k$ of a window $W_k$ represents the window position.

The windows are described in order from the top.
1) Window $W_{-1}$ contains only two sentences, $s_1$ and $s_2$, of the output document S consisting of the sentences $s_i$.
2) Window $W_0$ contains only three sentences, $s_1$, $s_2$, and $s_3$, of the output document S.
3) Window $W_1$ contains four sentences, $s_1$, $s_2$, $s_3$, and $s_4$, of the output document S.
4) Window $W_2$ contains four sentences, $s_2$, $s_3$, $s_4$, and $s_5$, of the output document S.
N-4) Window $W_{N-4}$ contains four sentences, $s_{N-4}$, $s_{N-3}$, $s_{N-2}$, and $s_{N-1}$, of the output document S.
N-3) Window $W_{N-3}$ contains four sentences, $s_{N-3}$, $s_{N-2}$, $s_{N-1}$, and $s_N$, of the output document S.
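The windows listed above can be reproduced with a short sketch that enumerates the sentence positions in each window. The function name `generate_windows` is hypothetical, and clipping the positions to the range 1 to N is an interpretation of FIG. 8 as described in the text; the value N = 6 is chosen only to keep the example small.

```python
from typing import Dict, List


def generate_windows(n_sentences: int, window_size: int, offset: int) -> Dict[int, List[int]]:
    """Map each window position k to the 1-based sentence positions it contains."""
    windows: Dict[int, List[int]] = {}
    k = 1 - offset                                  # initial window position (step S6-6)
    while k <= n_sentences - window_size + 1:       # loop condition of step S6-7
        # W_k = {s_i : k <= i < k + L}, clipped to sentences that exist.
        windows[k] = [i for i in range(k, k + window_size) if 1 <= i <= n_sentences]
        k += 1
    return windows


# Window size 4 and offset 2 as in FIG. 8, with N = 6 sentences for brevity.
windows = generate_windows(n_sentences=6, window_size=4, offset=2)
```

With these settings the first two windows are shorter than the window size, which is how the first sentence comes to appear in more windows than it would with an offset of 0.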

Here, in 1) and 2) at the start of the processing, the window position is set to a value of 0 or less, such as -1 or 0, so that $s_1$, the first sentence of the output document S, is more likely to be included in the summary. By specifying a positive value such as 2 for the window position offset $T_r$, the window position can be made 0 or less, as in the above example.

As described above, according to the present embodiment, registering formulaic expressions as unnecessary sentences in the unnecessary sentence judgment training data table allows formulaic expressions shared with other calls to be removed before the summarization process, so the extraction accuracy of the summary can be improved compared with the conventional technique.

In addition, in the extractive summarization method (sliding window method) according to the present embodiment, the important sentences of each topic in a call are included in the summary of one of the windows, so merging these window summaries into the summary of the entire document yields a summary that takes multiple topics into account.

In addition, the recursive sliding window method allows the summary to be adjusted to a desired summarization rate.

The above processing and control can be implemented by software processing on a CPU (Central Processing Unit) or GPU (Graphics Processing Unit), or by hardware processing on an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array).

In the above embodiment, the illustrated configurations and the like are not limiting and can be changed as appropriate within the range in which the effects of the present invention are obtained. Other modifications can also be made as appropriate without departing from the scope of the object of the present invention.

Each component of the present invention can be freely selected or omitted, and an invention having a configuration so selected is also included in the present invention.

A program for implementing the functions described in the present embodiment may be recorded on a computer-readable recording medium, and the processing of each unit may be performed by having a computer system read and execute the program recorded on the recording medium. The "computer system" here includes an OS and hardware such as peripheral devices.

If a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment).

The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" also includes a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as volatile memory inside the computer system serving as the server or client in that case. The program may implement only some of the functions described above, or may implement those functions in combination with a program already recorded in the computer system. At least some of the functions may be implemented in hardware such as an integrated circuit.

The present invention is applicable to summary generation devices.

A: Summary generation device
1: Auxiliary storage device (storage units)
1-1: Word dictionary table
1-2: Word weighting table
1-3: Distributed representation learning corpus table
1-4: Word distributed representation table (information)
1-5: Unnecessary sentence judgment training data table (information)
1-6: Summary target document table (information)
1-7: Preprocessing result table
1-8: Summary result table
2: Main storage device
2-1: Morphological analysis unit
2-2: Unnecessary word removal processing unit
2-3: Word distributed representation learning unit
2-4: Unnecessary sentence judgment training data generation unit
2-5: Unnecessary sentence removal processing unit
2-6: Summary target document acquisition unit
2-7: Summary generation unit
3: Input device
4: Output device
5: Central processing unit (CPU)

Claims

A summary generation device for extracting sentences from a document containing one or more sentences and generating a summary of the document,
a word distributed representation information storage unit in which words and word distributed representations representing the words by multidimensional real-valued vectors are registered;
A training data sentence, a sentence label that is information indicating whether or not the training data sentence is required, and a sentence distributed representation that is a distributed representation of the training data sentence calculated based on word distributed representation information are registered. an unnecessary sentence judgment training data information storage unit;
a summary target document obtaining unit for obtaining a summary target document, which is a document to be summarized;
For the sentences included in the summary target document,
calculating a sentence distributed representation of the sentence based on the word distributed representation information stored in the word distributed representation information storage unit;
determining, by an automatic classification method, from the sentence distributed representation of the sentence, whether the sentence is an unnecessary sentence, based on the sentence label and the sentence distributed representation of the unnecessary sentence judgment training data information registered in the unnecessary sentence judgment training data information storage unit;
an unnecessary sentence removal processing unit for generating an unnecessary sentence-removed document by removing sentences determined to be unnecessary sentences from the document to be summarized;
From the unnecessary sentence-removed document,
a summary generation unit that generates a summary of the document to be summarized by extracting sentences by a selective summarization technique and generating a summary;
The extractive summarization method includes:
For an input document that is a document input to the extractive summarization method,
generating one or more windows of the input document, which are documents consisting of part or all of sentences that appear consecutively in the input document;
the generated windows satisfy the condition that any sentence contained in the input document is contained in at least one of the windows;
generating, for each of the generated windows, a summary of the window by extracting sentences by an extractive summarization method using distributed representation;
generating a summary of the input document by merging the window summaries and removing duplicate sentences;
The summary generation device is characterized in that the extractive summarization method using the distributed representation
calculates, based on the word distributed representation information, a sentence distributed representation of each sentence included in the window, and extracts the sentences to be included in the summary of the window based on the importance of each sentence included in the window, the importance being calculated based on the sentence distributed representation.
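
As an illustrative aid only (not part of the claim), the window-based flow recited above can be sketched in Python. The fixed `stride`, the function names, and the `pick` callback that stands in for the per-window extractive summarizer are all assumptions made for this sketch; the claim itself only requires that each window consist of consecutive sentences and that the windows together cover every sentence.

```python
from typing import Callable, Iterable, List, Sequence


def make_windows(n_sentences: int, window_size: int, stride: int) -> List[range]:
    """Return index ranges of consecutive sentences that cover every sentence.

    `stride` is a hypothetical parameter; stride <= window_size leaves no gaps.
    """
    if window_size < 1 or not 1 <= stride <= window_size:
        raise ValueError("need 1 <= stride <= window_size")
    if n_sentences == 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + window_size, n_sentences)
        windows.append(range(start, end))
        if end >= n_sentences:
            break
        start += stride
    return windows


def summarize_by_windows(
    sentences: Sequence[str],
    window_size: int,
    stride: int,
    pick: Callable[[List[str]], Iterable[int]],
) -> List[str]:
    """Summarize each window with `pick`, then merge and drop duplicates.

    `pick` receives the sentences of one window and returns the positions
    (within that window) of the sentences to keep.
    """
    chosen = set()  # document-level indices; a set removes duplicates
    for window in make_windows(len(sentences), window_size, stride):
        window_sentences = [sentences[i] for i in window]
        for local_index in pick(window_sentences):
            chosen.add(window[local_index])
    # Merge the window summaries in the original order of appearance.
    return [sentences[i] for i in sorted(chosen)]
```

Any `stride` no larger than `window_size` leaves no gap between neighboring windows, which is how this sketch meets the coverage condition; overlapping windows may select the same sentence twice, and the index set discards the repeat.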

The summary generation device according to claim 1, wherein, in the unnecessary sentence removal processing unit,
the automatic classification method includes:
calculating a cosine similarity between the sentence distributed representation of the sentence and each sentence distributed representation, among the sentence distributed representations registered in the unnecessary sentence judgment training data information storage unit, whose sentence label indicates that the training data sentence is not required; and
determining that the sentence is an unnecessary sentence if at least one value of the cosine similarity is greater than a pre-registered threshold value.
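
For illustration only, the cosine-similarity test of this claim might look like the following Python sketch. The label strings `"unnecessary"` and `"necessary"`, the list-of-pairs layout of the training data, and the function names are assumptions of the sketch and are not taken from the specification.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_unnecessary(
    sentence_vec: Sequence[float],
    training: List[Tuple[Sequence[float], str]],
    threshold: float,
) -> bool:
    """True if the sentence is closer than `threshold` to ANY training
    sentence labeled "unnecessary" (hypothetical label value)."""
    return any(
        cosine(sentence_vec, vec) > threshold
        for vec, label in training
        if label == "unnecessary"
    )
```

Only training sentences labeled as not required take part in the comparison, so a sentence that resembles a required training sentence is never removed by this test alone.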

The summary generation device according to claim 1, wherein, in the unnecessary sentence removal processing unit,
the automatic classification method is any automatic classification method based on supervised machine learning, including the k-nearest neighbor method, a neural network, and a support vector machine, that uses the sentence labels and the sentence distributed representations registered in the unnecessary sentence judgment training data information storage unit as training data.
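
As one hedged example of the supervised methods named in this claim, a k-nearest neighbor classifier over sentence distributed representations can be sketched as below. The Euclidean distance, the default `k=3`, the label strings, and the function name are choices made for this sketch; the claim does not fix any of them.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def knn_label(
    sentence_vec: Sequence[float],
    training: List[Tuple[Sequence[float], str]],
    k: int = 3,
) -> str:
    """Return the majority label among the k training vectors nearest to
    `sentence_vec` (Euclidean distance, an assumption of this sketch)."""
    if not training:
        raise ValueError("training data must not be empty")
    nearest = sorted(training, key=lambda item: math.dist(sentence_vec, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A sentence whose predicted label is the "unnecessary" label would then be removed before summarization; an odd `k` avoids tied votes when there are only two labels.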

The summary generation device according to claim 1, wherein the extractive summarization method using the distributed representation includes:
calculating, for each word contained in the input document, an appearance frequency index, which is a real value calculated based on the appearance frequency of the word in the input document;
for each sentence contained in the input document,
morphologically analyzing the sentence with a morphological analysis unit to divide it into words;
removing from the sentence unnecessary words, which are words determined to be unnecessary by an unnecessary word removal processing unit;
for each word contained in the sentence from which the unnecessary words have been removed,
obtaining a word distributed representation of the word by referring to the word distributed representation information;
calculating a weighted word distributed representation by multiplying the word distributed representation by the appearance frequency index of the word;
calculating the sentence distributed representation by synthesizing the weighted word distributed representations;
calculating a document distributed representation of the input document by synthesizing the sentence distributed representations; and
extracting the sentences to be included in the summary of the input document based on the importance calculated as the cosine similarity between the sentence distributed representation and the document distributed representation.
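
The scoring steps of this claim can be illustrated with the Python sketch below, under several stated assumptions: sentences arrive already tokenized with unnecessary words removed, "synthesizing" is read as element-wise summation, and the appearance frequency index is taken to be a / (a + relative frequency) with a hypothetical constant `a`. None of these choices is prescribed by the claim.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sentence_importance(
    tokenized: List[List[str]],
    word_vecs: Dict[str, List[float]],
    a: float = 1e-3,
) -> List[float]:
    """Importance of each sentence = cosine(sentence vector, document vector)."""
    counts = Counter(word for sent in tokenized for word in sent)
    total = sum(counts.values())
    dim = len(next(iter(word_vecs.values())))
    sent_vecs = []
    for sent in tokenized:
        vec = [0.0] * dim
        for word in sent:
            if word not in word_vecs:
                continue  # out-of-vocabulary words are skipped in this sketch
            # Assumed frequency index: smaller for more frequent words.
            index = a / (a + counts[word] / total)
            vec = [x + index * y for x, y in zip(vec, word_vecs[word])]
        sent_vecs.append(vec)
    # Document vector: sum of the sentence vectors (assumed synthesis).
    doc_vec = [sum(column) for column in zip(*sent_vecs)]
    return [cosine(vec, doc_vec) for vec in sent_vecs]
```

The sentences with the highest importance values would then be extracted; how many to keep is a parameter outside this sketch.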

The summary generation device according to claim 4, wherein the appearance frequency index of the word is
a real value calculated based on the appearance frequency of the word in the input document, and takes a smaller value as the appearance frequency becomes higher.

The summary generation device according to claim 4, further comprising word weight information in which words and word weights, which are non-negative real values, are registered,
wherein the extractive summarization method using the distributed representation,
for each word contained in the sentence from which the unnecessary words have been removed,
obtains the word weight of the word by referring to the word weight information, and calculates the weighted word distributed representation by multiplying the word distributed representation by the word weight and the appearance frequency index of the word.

The summary generation device according to any one of claims 4 to 6, wherein, in the extractive summarization method using the distributed representation,
the unnecessary word removal processing unit determines that one or both of fillers and interjections are unnecessary words, based on the result of the part-of-speech determination of words by the morphological analysis unit.
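
A minimal sketch of this part-of-speech filter follows, assuming the morphological analysis unit has already produced (surface form, part of speech) pairs. The tag names `"filler"` and `"interjection"` and the function name are hypothetical; a real analyzer would use its own tag set.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical part-of-speech tags treated as unnecessary.
UNNECESSARY_POS: Set[str] = {"filler", "interjection"}


def remove_unnecessary_words(
    tagged_words: Iterable[Tuple[str, str]],
    unnecessary_pos: Set[str] = UNNECESSARY_POS,
) -> List[str]:
    """Keep the surface forms whose part of speech is not in `unnecessary_pos`."""
    return [surface for surface, pos in tagged_words if pos not in unnecessary_pos]
```

Passing a set containing only one of the two tags covers the "one or both" wording of the claim.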

The summary generation device according to claim 1, further comprising a summary parameter setting unit for setting, from an input device, a window size, which is the maximum number of sentences to be included in a window,
wherein, in the extractive summarization method, the generated windows further satisfy the conditions that:
the number of sentences included in each generated window is equal to or less than the window size; and
for a first sentence, which is a sentence included in the input document, and a second sentence, which is the sentence that appears next after the first sentence in the input document, if the first sentence is the last sentence in order of appearance in at least one of the windows, then the second sentence is the last sentence in order of appearance in at least one other of the windows.
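
Reading the condition literally as translated, a window that slides forward one sentence at a time satisfies it: every window holds at most the window size, and whenever a sentence closes one window, the next sentence closes the following window. The Python sketch below shows that reading; the function name and the one-sentence step are assumptions of this sketch, not statements of the specification.

```python
from typing import List, Sequence


def sliding_windows(sentences: Sequence[str], window_size: int) -> List[List[str]]:
    """Windows of at most `window_size` consecutive sentences, advanced by
    one sentence per step (assumed reading of the claimed condition)."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    n = len(sentences)
    if n == 0:
        return []
    if n <= window_size:
        return [list(sentences)]  # a single window covers a short document
    return [list(sentences[i:i + window_size]) for i in range(n - window_size + 1)]
```

With five sentences and a window size of three, the windows end at the third, fourth, and fifth sentences in turn, so every sentence is covered and no window exceeds the window size.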

The summary generation device according to claim 1, further comprising a summary parameter setting unit for setting, from an input device, a target number of sentences to be extracted for the summary, a condition for continuing the summarization process, and a condition for terminating the summarization process,
wherein the extractive summarization method includes:
for an input document that is a document input to the extractive summarization method,
initializing an output document by substituting the input document;
performing, on the output document, an update process of
generating a summary of the output document by applying the extractive summarization method, and
updating the output document by substituting the generated summary of the output document;
repeating the update process if the condition for continuing the summarization process is satisfied, or the condition for terminating the summarization process is not satisfied, and the number of sentences included in the output document is greater than the target number of sentences to be extracted; and
otherwise, outputting the output document as the summary of the input document.
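
The repeated update of this claim can be sketched as a short loop. In the Python sketch below, `summarize_once` stands in for one pass of the extractive summarization method, and a hypothetical round limit `max_rounds` plays the role of the termination condition; both names are assumptions introduced for illustration.

```python
from typing import Callable, List, Sequence


def iterative_summary(
    sentences: Sequence[str],
    summarize_once: Callable[[List[str]], List[str]],
    target: int,
    max_rounds: int = 10,
) -> List[str]:
    """Re-summarize the output document until it has at most `target`
    sentences or `max_rounds` passes have run (assumed termination rule)."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    output = list(sentences)  # initialize the output document with the input
    for _ in range(max_rounds):
        output = summarize_once(output)  # update process: summarize, substitute
        if len(output) <= target:
            break  # target reached, stop repeating
    return output
```

As in the claim, the update process runs at least once before the conditions are checked, so even an input that is already within the target is summarized one time.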

A summary generating method for extracting sentences from a document containing one or more sentences by computer processing and generating a summary of the document, comprising:
a summary target document obtaining step in which a computer obtains a summary target document, which is a document to be summarized;
an unnecessary sentence removal processing step in which the computer
a) calculates, for each sentence included in the summary target document, a sentence distributed representation of the sentence based on the word distributed representation information of a word distributed representation information storage unit in which words and word distributed representations representing the words by multidimensional real-valued vectors are registered,
b) determines whether the sentence is an unnecessary sentence by applying an automatic classification method to the sentence distributed representation of the sentence, based on the sentence label and the sentence distributed representation included in unnecessary sentence judgment training data information stored in an unnecessary sentence judgment training data information storage unit, the unnecessary sentence judgment training data information including a training data sentence, a sentence label, which is information indicating whether or not the training data sentence is required, and a sentence distributed representation, which is a distributed representation of the training data sentence calculated based on the word distributed representation information, and
c) generates an unnecessary sentence-removed document by removing the sentences determined to be unnecessary sentences from the summary target document; and
a summary generation step in which the computer d) generates a summary of the summary target document by extracting sentences from the unnecessary sentence-removed document by an extractive summarization method,
wherein:
In the extractive summarization method, the computer performs:
For an input document that is a document input to the extractive summarization method,
generating one or more windows of the input document, which are documents consisting of part or all of sentences that appear consecutively in the input document;
the generated windows satisfy the condition that any sentence contained in the input document is contained in at least one of the windows;
generating, for each of the generated windows, a summary of the window by extracting sentences by an extractive summarization method using distributed representation;
generating a summary of the input document by merging the window summaries and removing duplicate sentences;
The summary generating method is characterized in that, in the extractive summarization method using the distributed representation, the computer
calculates, based on the word distributed representation information, a sentence distributed representation of each sentence included in the window, and extracts the sentences to be included in the summary of the window based on the importance of each sentence included in the window, the importance being calculated based on the sentence distributed representation.