JP2008152605A - Presentation analysis device and presentation viewing system

Info

Publication number: JP2008152605A
Application number: JP2006340825A
Authority: JP
Inventors: Seiichi Nakagawa; 聖一中川; Norihide Kitaoka; 教英北岡; Shingo Togashi; 慎吾富樫; Masaru Yamaguchi; 優山口
Original assignee: Toyohashi University of Technology NUC
Current assignee: Toyohashi University of Technology NUC
Priority date: 2006-12-19
Filing date: 2006-12-19
Publication date: 2008-07-03

Abstract

PROBLEM TO BE SOLVED: To provide a system enabling a user to efficiently understand the content of a presentation.

SOLUTION: A processor 6 comprises a data analysis part 18 that analyzes presentation data; a feature extraction part 20 that analyzes voice information of the presentation; a voice recognition part 22; and a voice shaping part 24. A summarization part 26 extracts summary sentences based on information from the feature extraction part 20 and the voice shaping part 24. An indexing part 30 generates an index (keywords) based on information from the data analysis part 18 and the voice shaping part 24.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a system for making presentation information that includes audio and video easy to use, and more particularly to a system that structures presentation information to make it easier to use.

Lectures, classes, and various other presentations have conventionally been recorded as multimedia information including audio and video, and the recorded multimedia information used at a later date. When such multimedia information is used later, playing back the recording as it is also plays back parts the user does not need, so various techniques for improving user convenience have been studied. Examples include the following patent documents.
JP 2002-304420 A; JP 2002-351893 A; JP 2002-8052 A

Patent Document 1 divides the lecture content by session or by slide, so that content related to a specific session or slide desired by the user can be played back instantly. Patent Document 2 creates a search index from the speech data of the lecture; by selecting an index entry, the user can instantly locate and play back the desired lecture content. Patent Document 3 addresses the case where a lecturer operates a presentation device while speaking: the lecture content is divided at the operation timings, a priority is set for each divided piece of information, and the lecture is played back in a form that is easy for the user to follow.

Each of the above patent documents extracts, from a lecture, information the user is likely to want and uses it to improve user convenience, but each has the following drawbacks. In Patent Document 1, because the lecture content is divided by session or by slide, playback starts from the beginning of the session or slide, and it can take time before the content the user wants is reached. In Patent Document 2, because the search index is created from speech data alone, recognition errors occur at the speech recognition stage and speech uttered in scenes unrelated to what the user wants is also indexed, so the result is not accurate enough for use as a search index. In Patent Document 3, because the lecture content is played back according to assigned priorities, the content the user wants cannot always be played back.

Systems are already commercially available that link lecture content, including video and audio, with the presentation material used by the speaker, so that the lecture content corresponding to a desired piece of presentation material can be played back instantly. With such a system, however, playback can start only from the beginning of the lecture content corresponding to a slide, so it takes time before the content the user wants is reached. In addition, such a system can index only by the heading (title) of each slide, so the index entry the user wants may not exist, and convenience needed further improvement.

The present invention has been made to solve the above problems. Its object is to improve user convenience by creating a search index from both the speech information and the text information of a lecture and linking the index to the lecture content. A further object is to generate a summary in the same manner and thereby improve user convenience.

The present invention was devised to achieve the above object. The invention according to claim 1 is a presentation analysis device comprising: a voice information analysis unit that analyzes the audio information of presentation information including video and audio; a material analysis unit that analyzes the slide material used in the presentation; and a summary sentence generation unit that generates a summary of the audio information of the presentation from the analysis results of the voice information analysis unit and the material analysis unit. With this configuration, the summary is generated from both the audio information of the presentation and the slide material, so it conveys the content of the presentation more clearly than a summary generated from either one alone. Here, a presentation means any of various lectures, classes, or talks, and presentation information including video and audio means information obtained by recording a presentation with a video recording device (video camera) or the like. The slide material used in a presentation is material created with presentation software, and is normally shown to the audience during a lecture using a personal computer and a projector.
The voice information analysis unit performs voice recognition, voice shaping, and the like on the audio information of the presentation, and recognizes and analyzes each sentence of the presentation. The slide material includes slide titles, headings, items, and the like, and may also include figures and tables. The material analysis unit analyzes at least one of these kinds of information contained in the slides.

The summary sentence generation unit may be any unit that generates a summary from the text of the whole presentation. For example, it may extract sentences suitable for the summary from the sentences analyzed by the voice information analysis unit, or it may summarize each sentence of the presentation and assemble the summarized sentences into a summary.

The present invention can also be configured as a presentation analysis device in which the summary sentence generation unit extracts, as the summary, those sentences recognized by the voice information analysis unit that contain a noun included in a slide title. Slide titles usually contain important words that indicate the content of the presentation, so adopting sentences that contain words from the title yields a summary that reflects the content of the presentation. Besides nouns, the parts of speech included in a title may also be verbs, adjectives, adjectival nouns, and the like.

The present invention can also be configured as a presentation analysis device in which the summary sentence generation unit extracts, as the summary, those sentences recognized by the voice information analysis unit that contain a noun appearing multiple times in the slides. Words that appear in the slides are usually important words that indicate the content of the presentation, so adopting sentences that contain them yields a summary that reflects the content of the presentation. A word that appears at least once in the slides is treated as important; a word that appears multiple times can be treated as more important, and the more often a word appears in the slides, the more important it can be considered.

The present invention can also be configured as a presentation viewing system comprising a display unit that displays the presentation information, the slide material, and the summary generated by the summary sentence generation unit. Because this system displays the presentation information, the slide information, and the summary at the same time, the user can understand the content of the presentation efficiently. In addition to being displayed, the summary can also be played back through an audio output device (speaker), which lets the user grasp the content of the presentation efficiently through both video and audio.

The present invention can also be configured as a presentation analysis device comprising: a voice information analysis unit that analyzes the audio information of presentation information including video and audio; a material analysis unit that analyzes the slide material used in the presentation; and a keyword generation unit that generates keywords for the presentation from the analysis results of the voice information analysis unit and the material analysis unit. Because the keywords are generated from both analysis results, they reflect the content of the presentation better than keywords generated from either one alone. By looking at the keywords, the user can understand the content of the presentation efficiently.

The present invention can also be configured as a presentation viewing system comprising a display unit that displays the presentation information, the slide material, and the keywords generated by the keyword generation unit. Because this system displays the presentation information, the slide information, and the keywords at the same time, the user can understand the content of the presentation efficiently.

The present invention can also be configured as a presentation viewing system that further comprises a voice-corresponding section extraction unit that extracts, from the presentation information, the sentences in which a keyword generated by the keyword generation unit is uttered, wherein, when a keyword displayed on the display unit is selected, the display unit displays the scene in which the sentence extracted by the voice-corresponding section extraction unit is uttered. With this configuration, the scene in which a sentence containing the selected keyword is uttered is displayed immediately, so the user can view the presentation efficiently. In addition to the display unit showing the scene, the extracted sentence can be played back through an audio output device (speaker), so the user can view the presentation efficiently through both video and audio. Associating a selected keyword with the audio (video) section that contains it amounts to indexing the keyword against the audio (video) information, so the term "index" is also used below.

The present invention can also be configured as a presentation viewing system in which the display unit displays the keywords generated by the keyword generation unit overlaid on the slide material in a recognizable manner. Because the generated keywords are displayed on the slide, the user can understand the content of the presentation efficiently. Displaying a keyword in a recognizable manner may mean, for example, underlining the keyword on the slide or making it blink. In other words, it means displaying the generated keyword so that it is distinguished from the other terms on the slide; the manner is not limited to these examples and may be any manner that makes the keyword recognizable.

The present invention can also be configured as a presentation viewing system comprising: a voice information analysis unit that analyzes the audio information of presentation information including video and audio; a keyword generation unit that generates keywords for the presentation from the analysis result of the voice information analysis unit; a voice-corresponding section extraction unit that extracts, from the presentation information, the sentences in which a keyword generated by the keyword generation unit is uttered; and a display unit that displays the presentation information and the keywords, wherein, when a keyword displayed on the display unit is selected, the display unit displays the scene in which the sentence extracted by the voice-corresponding section extraction unit is uttered. With this configuration, when a keyword derived from the audio information is selected, the scene in which a sentence containing that keyword is spoken can be displayed immediately, so the user can view the presentation efficiently. Because this configuration does not analyze the presentation material, the overall system can be kept simple. In addition to the display unit showing the scene, the extracted sentence can be played back through an audio output device (speaker), so the user can view the presentation efficiently through both video and audio.

Because the present invention generates keywords or a summary from the presentation material and the audio information, the user can understand the content of the presentation efficiently by referring to them.

Embodiments of the present invention are described in detail below. FIG. 1 is a system diagram showing the overall configuration of a system according to a first embodiment to which the present invention is applied. The presentation viewing system 2 broadly consists of a main storage device 4, a processing device 6, a user input device 8, and an output device 10.

The main storage device 4 consists of a presentation material storage unit 12, which stores the speaker's presentation material, and a video/audio storage unit 14, which stores a recording of the speaker's lecture as video and audio data. The presentation material storage unit 12 stores presentation material data created with commercially available presentation software. A video/audio input device 16 is connected to the video/audio storage unit 14, and data input from the video/audio input device 16 is stored in the video/audio storage unit 14. In this embodiment the main storage device 4 is normally the hard disk, RAM, or the like of a notebook (laptop) personal computer, but it may instead be an external storage device (any of various storage media) holding the presentation material data and the video/audio data.

Next, the processing device 6 of this embodiment is described with reference to FIG. 2. The processing device 6 exchanges data, commands, and the like with the main storage device 4, the user input device 8, and the output device 10. It analyzes the data stored in the main storage device 4 and extracts keywords and other information. The data stored in the presentation material storage unit 12 is sent to the material analysis unit 18, which converts the presentation material into text data and analyzes it using natural language processing techniques. The text data is segmented into word units (by part of speech). This segmentation is performed separately for the title, the headings, and the body text of each slide, and the material analysis unit 18 stores each word together with whether it came from a slide title, a slide heading, or another part of the slide. For figures and tables in the slides, the material analysis unit 18 extracts the relationships between text items using the general structure of the figure or table format.

The processing device 6 also includes a feature extraction unit 20. The feature extraction unit 20 extracts the audio data from the data stored in the video/audio storage unit 14, analyzes it, and generates speech features covering prosodic information and surface linguistic information. These speech features are sent to the voice recognition unit 22, which uses them to convert the speech in the audio data stored in the video/audio storage unit 14 into a word sequence (character string).

The word sequence produced by the voice recognition unit 22 is sent to the voice shaping unit 24. The voice shaping unit 24 removes from the word sequence words and speech segments that do not contribute to understanding the content of the presentation, such as interjections like "ano", "maa", and "eeto" (filled pauses, or fillers), restarts, and hesitations.
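The shaping step can be illustrated with a minimal sketch. The filler list below contains only the three examples named in the text, and the function names `remove_fillers` and `collapse_repeats` are hypothetical; collapsing immediate repetitions is only a crude stand-in for detecting restarts, which the source does not describe in detail.

```python
# Filler tokens: only the three examples given in the text (assumed, not exhaustive).
FILLERS = {"あのー", "まー", "えーと"}


def remove_fillers(words):
    """Drop filler tokens from a recognized word sequence."""
    return [w for w in words if w not in FILLERS]


def collapse_repeats(words):
    """Collapse immediate word repetitions (a crude proxy for restarts)."""
    cleaned = []
    for w in words:
        if not cleaned or cleaned[-1] != w:
            cleaned.append(w)
    return cleaned
```

A real system would work on the recognizer's output with timing information so that the corresponding speech segments can be dropped as well.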

The data obtained by the material analysis unit 18 and the voice shaping unit 24 is sent to the summarization unit 26. The feature extraction unit 20 also analyzes information such as the pitch pattern and power pattern of the speech, pause lengths, and utterance lengths, and sends it to the summarization unit 26. The summarization unit 26 sets the importance of each sentence according to the following features and selects the sentences to extract as the summary.

tf: The tf (term frequency, i.e. the frequency of occurrence of a word) of the nouns in each sentence is calculated, the noun scores are summed for each sentence, and the number of sentences corresponding to the summarization rate specified by the user is extracted.
Frequent words: Words are chosen in descending order of frequency until the number of sentences containing two or more of the chosen words reaches the overall summarization rate, and those sentences are extracted.
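As a rough illustration of the tf criterion, the sketch below scores each sentence by the summed frequency of its nouns and keeps the top fraction. It assumes that each sentence has already been reduced to a list of nouns, that frequencies are counted over the whole talk, and that the sentence count is rounded up; none of these details are specified in the text, and `tf_summary` is a hypothetical name.

```python
import math
from collections import Counter


def tf_summary(sentences, ratio):
    """Return indices of the sentences selected by summed noun frequency.

    sentences: list of noun lists, one list per sentence.
    ratio: summarization rate, e.g. 0.25 for 25 percent.
    """
    # Term frequency of every noun over the whole talk (assumption).
    tf = Counter(noun for sentence in sentences for noun in sentence)
    scores = [sum(tf[noun] for noun in sentence) for sentence in sentences]
    k = max(1, math.ceil(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # restore the original sentence order
```

Returning indices rather than text keeps the result usable for locating the matching audio sections later.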

Slide-Title: Sentences containing a noun that appears in a slide title of the presentation material are extracted as important sentences.
Slide-tf: Nouns that appear three or more times in the slides of the presentation material are treated as frequent words, and sentences containing a frequent word at least once are extracted as important sentences.
F0: Sentences are extracted in descending order of average fundamental frequency per sentence until the set corresponds to the summarization rate (used in place of Slide-tf when slide information is not used).
Power: Sentences are extracted in descending order of average power per sentence until the set corresponds to the summarization rate (used in place of Slide-Title when slide information is not used).
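The two slide-based criteria can be sketched as simple set lookups. The example assumes the spoken sentences and the slide text have already been segmented into nouns; the function names and the `min_count` parameter are hypothetical, with the default of 3 taken from the "three or more times" rule above.

```python
from collections import Counter


def slide_title_sentences(sentences, title_nouns):
    """Indices of sentences containing at least one noun from a slide title."""
    title = set(title_nouns)
    return [i for i, s in enumerate(sentences) if title & set(s)]


def slide_tf_sentences(sentences, slide_nouns, min_count=3):
    """Indices of sentences containing a noun that occurs min_count+ times in the slides."""
    counts = Counter(slide_nouns)
    frequent = {noun for noun, c in counts.items() if c >= min_count}
    return [i for i, s in enumerate(sentences) if frequent & set(s)]
```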

Utterance duration: Sentences are extracted in descending order of utterance duration until the set corresponds to the summarization rate.
Speaking rate: Sentences are extracted in descending order of speaking rate until the set corresponds to the summarization rate, and sentences with a low speaking rate are treated as unimportant and rejected. For example, when the summarization rate is 25 percent, the slowest 15 percent of sentences are picked out and rejected from the summary. This criterion may likewise reject sentences with a short utterance length as unimportant. The 15 percent rejection rate for a 25 percent summarization rate was determined empirically.
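The rejection of slowly spoken sentences might look like the following sketch. The unit of the speaking rate, the rounding of the rejected count (rounded down here), and the name `reject_slowest` are assumptions made for illustration; only the 15 percent figure comes from the text.

```python
import math


def reject_slowest(rates, reject_ratio=0.15):
    """Return the indices of the slowest sentences, to be excluded from the summary.

    rates: speaking rate per sentence (unit assumed, e.g. morae per second).
    """
    n_reject = math.floor(len(rates) * reject_ratio)  # rounding down is an assumption
    slowest_first = sorted(range(len(rates)), key=lambda i: rates[i])
    return set(slowest_first[:n_reject])
```

The same function could be applied to utterance lengths to reject the shortest sentences instead.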

When the features are combined, a weight is set for each of the above criteria, a weighted sum is computed for each extracted sentence, and sentences are extracted in descending order of weighted sum until the set corresponds to the summarization rate set by the user. Various methods can be used to extract important sentences. For example, with an SVM (support vector machine), the set of feature values of each sentence is used as input and the task is solved as a binary classification problem of whether or not the sentence is important. A concrete summary extraction method is described below.

The result ScoreF_κ of judging the importance of the i-th sentence S_i by indicator feature F_κ is defined as in Equation 1.
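The equation itself is not reproduced in this text. Since the rejection score is described as taking the value 1 or 0 in the same way as Equation 1, the importance score is presumably an indicator function. The following is a reconstruction under that assumption, not the published formula:

```latex
\mathrm{Score}_{F_\kappa}(S_i) =
\begin{cases}
1 & \text{if } S_i \text{ is judged important by feature } F_\kappa \\
0 & \text{otherwise}
\end{cases}
```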

Like Equation 1, the rejection score ScoreD_p of sentence S_i for a rejection feature D_p is set to 1 if the sentence qualifies for rejection and to 0 otherwise. Scores are then determined for every sentence under each criterion, and the final sentence score is given by Equation 2.
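Equation 2 is likewise not reproduced in this text. Given that the following paragraph describes α_κ and β_p as the contributions of the features F_κ and D_p, a weighted sum of the following form would be consistent with the description. This is a reconstruction under that assumption, not the published formula:

```latex
\mathrm{Score}(S_i) =
\sum_{\kappa} \alpha_\kappa \, \mathrm{Score}_{F_\kappa}(S_i)
+ \sum_{p} \beta_p \, \mathrm{Score}_{D_p}(S_i)
```

Under this reading, a contribution β_p of −∞ removes any sentence flagged by D_p from consideration, while a sentence that is not flagged (rejection score 0) is taken to receive no penalty.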

Here, α_κ and β_p are the contributions of feature F_κ and feature D_p, respectively. In the actual experiments, α_κ was varied from 0 to 0.6 in steps of 0.2 and β_p was set to either 0 or −∞, and these values were tried in combination. The contributions were estimated by comparing the results with human-made summaries and adopting the combination that gave the highest κ statistic (κ value).
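A minimal sketch of how such a combined score could be computed and used for selection is shown below. It assumes the final score is a weighted sum of 0/1 feature scores, which is an interpretation of the text rather than the published formula; `final_scores` and `select` are hypothetical names. The penalty is added only when a sentence is flagged, which avoids the undefined product of −∞ and 0.

```python
import math


def final_scores(f_scores, d_scores, alphas, betas):
    """Combine 0/1 importance and rejection scores into one score per sentence.

    f_scores[k][i]: 1 if sentence i is important under feature k, else 0.
    d_scores[p][i]: 1 if sentence i is rejected under feature p, else 0.
    """
    n = len(f_scores[0])
    totals = []
    for i in range(n):
        total = sum(a * f[i] for a, f in zip(alphas, f_scores))
        for b, d in zip(betas, d_scores):
            if d[i]:  # add the penalty only for flagged sentences
                total += b
        totals.append(total)
    return totals


def select(totals, ratio):
    """Pick the top-scoring sentences, skipping any with a score of -inf."""
    k = math.ceil(len(totals) * ratio)
    ranked = sorted(
        (i for i, t in enumerate(totals) if t != -math.inf),
        key=lambda i: totals[i],
        reverse=True,
    )
    return sorted(ranked[:k])
```

Searching the grid of contribution values and comparing each result against human summaries would sit on top of these two functions.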

The sentences extracted by the summarization unit 26 are sent to the summary sentence generation unit 28, which joins them to generate the summary.

The word sequence shaped by the voice shaping unit 24 and the word sequence analyzed by the material analysis unit 18 are sent to the indexing unit 30. The indexing unit 30 extracts important words from the information supplied by the voice recognition unit 22, the voice shaping unit 24, and the material analysis unit 18, and stores the extracted important words (keywords) as the index. It also stores the titles and headings of the presentation material as index entries, using the information from the material analysis unit 18.

The index stored by the indexing unit 30 is sent to the voice-corresponding section extraction unit 32. The voice-corresponding section extraction unit 32 extracts from the audio data the speech section that explains each index entry obtained by the indexing unit 30, and identifies the sentence containing that speech section.
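One way to picture the link between index entries and speech sections is a lookup table from keyword to time spans. The sketch below assumes each recognized sentence carries start and end times in seconds and a word list; the `Sentence` type and `build_index` function are hypothetical, and simple word matching stands in for whatever criterion the actual unit uses to decide that a section "explains" a keyword.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    start: float  # seconds from the beginning of the recording
    end: float
    words: list


def build_index(sentences, keywords):
    """Map each keyword to the (start, end) spans of the sentences that contain it."""
    index = {kw: [] for kw in keywords}
    for s in sentences:
        for kw in keywords:
            if kw in s.words:
                index[kw].append((s.start, s.end))
    return index
```

Selecting a keyword in the viewer would then amount to seeking the recording to the first span stored for that keyword.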

The playback unit 34 plays back the video/audio data based on the output of the summary sentence generation unit 28 and the voice-corresponding section extraction unit 32, or on input from the user. For example, the playback unit 34 reassembles the video/audio data according to the summary generated by the summary sentence generation unit 28 and outputs it to the output device 10 in the order of the summary. When the user selects an index entry generated by the indexing unit 30, the playback unit 34 outputs the video/audio section corresponding to the selected entry to the output device 10.
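Reassembling the recording for summary playback can be sketched as merging the time spans of the selected sentences. The example assumes the spans are listed in chronological order and that adjacent summary sentences should play as one continuous segment; `playback_segments` is a hypothetical name, and the source does not say how segments are joined.

```python
def playback_segments(spans, selected):
    """Merge the time spans of the selected sentences into playback ranges.

    spans: (start, end) in seconds for every sentence, in chronological order.
    selected: indices of the sentences chosen for the summary.
    """
    merged = []
    for i in sorted(selected):
        start, end = spans[i]
        if merged and start <= merged[-1][1]:
            # Touching or overlapping the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```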

The information extracted by the units described above, that is, the material analysis unit 18 through the voice corresponding section extraction unit 32, is saved on a storage medium in the processing device 6. The processing described above therefore needs to be performed only once and does not have to be repeated each time the user views the presentation.

Next, the output device 10 will be described with reference to FIG. 3. The output device 10 includes a video output unit 36, an audio output unit 38, a slide list output unit 40, a keyword list output unit 42, a slide output unit 44, and a summary sentence output unit 46. The video output unit 36 outputs the video information stored in the video/audio storage unit 14 based on a signal from the playback unit 34; the video is normally output to the display device of a personal computer. The audio output unit 38 outputs the audio information stored in the video/audio storage unit 14 based on a signal from the playback unit 34; the audio is normally output from a speaker.

The slide list output unit 40 displays, as a list, the slide titles extracted by the material analysis unit 18, and outputs the list to the display device in the same way as the video output unit 36. The keyword list output unit 42 displays the keywords extracted by the indexing unit 30 as a list on the display device. The slide output unit 44 outputs the slides stored in the presentation material storage unit 12; keywords in a slide that were extracted by the indexing unit 30 are shown underlined in the slide on the display device. The summary sentence output unit 46 displays the summary sentences generated by the summary sentence generation unit 28 on the display.

Next, the user input device 8 will be described in detail with reference to FIG. 4. The user input device 8 includes a summary-related input unit 48, a slide selection unit 50, a slide screen keyword selection unit 52, and a keyword selection unit 54. Through the summary-related input unit 48, the user enters the "summary speed", which determines the speed of the summary; the "summary rate", which determines the ratio of the summary to the entire text; and the "summary sentence display setting", which determines whether summary sentences are displayed by the summary sentence output unit 46 of the output device 10. The playback unit 34 determines the summary speed and other settings based on the information entered by the user.
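One way a "summary rate" could be applied is sketched below. The sketch rests on assumptions that the text does not state: every sentence already has an importance score, and the rate is measured as a fraction of the sentence count (it could equally be measured in time or characters). The function name `select_summary` is hypothetical.

```python
def select_summary(scores, summary_rate):
    """Keep the highest-scoring sentences until summary_rate of all sentences
    is reached, and return their indices in original (spoken) order."""
    n_keep = max(1, round(len(scores) * summary_rate))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Hypothetical importance scores for eight sentences.
scores = [0.2, 0.9, 0.1, 0.7, 0.4, 0.3, 0.8, 0.05]
chosen = select_summary(scores, 0.25)  # 25 percent, the rate used in the evaluation
```

Re-sorting the chosen indices keeps the summary in the order in which the sentences were spoken, which is what playback of a reconstructed summary requires.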

The slide selection unit 50 outputs to the processing device 6 the information on the slide that the user selected from the slide titles displayed by the slide list output unit 40. The slide screen keyword selection unit 52 outputs to the processing device 6 the information on the word that the user selected from the words in the slide displayed by the slide output unit 44.

The keyword selection unit 54 outputs to the processing device 6 the information on the keyword that the user selected from the keywords displayed by the keyword list output unit 42.

Next, an example of the operation screen of this embodiment will be described with reference to FIG. 5. FIG. 5 shows the information output from the output device 10 as it appears on the display device. In FIG. 5, a presentation video display unit 56 is displayed in the upper left of the display device, driven by the signal from the video output unit 36. Operation controls such as a play button, a stop button, and a slider indicating the playback position are displayed near the presentation video display unit 56. A summary information display unit 58, associated with the summary-related input unit 48, is displayed in the middle left of the display device. This area shows the summary playback speed, the summary rate setting, and the summary sentence display setting; input to the summary-related input unit 48 is made by clicking in the summary information display unit 58. Below the summary information display unit 58, a slide list display unit 60 is displayed using the data from the slide list output unit 40. When the user selects one slide title in the slide list display unit 60, the slide selection unit 50 transmits that selection to the processing device 6. The slide list display unit 60 has a scroll bar so that the user can see slide titles that do not fit in the slide list display unit 60 at one time.

A keyword list display unit 62 is displayed in the lower left of the display device, using the data from the keyword list output unit 42. When one of the keywords displayed in the keyword list display unit 62 is selected, the keyword selection unit 54 transmits the selection to the processing device 6. The keyword list display unit 62 has a scroll bar so that the user can see keywords that do not fit in the keyword list display unit 62 at one time. The keyword list display unit 62 shows the keywords extracted by the indexing unit 30 in chronological order, but the display order is not limited to this; the keywords may instead be displayed in Japanese syllabary (a-i-u-e-o) order or in order of importance.

A slide screen display unit 64 is displayed in the upper right of the display device. In the slide screen display unit 64, the slide screen is displayed by the slide output unit 44, with the keywords generated by the indexing unit 30 underlined on the slide. The user selects an underlined keyword on the slide through the slide screen keyword selection unit 52. The slide screen display unit 64 has a button for going back one slide and a button for advancing one slide. A summary sentence display unit 66 is displayed in the lower right of the display device. The summary sentence display unit 66 is displayed using the output of the summary sentence output unit 46, and the summary sentences scroll in step with the playback speed of the presentation video display unit 56. The summary sentence display unit 66 has a scroll bar for viewing summary sentences that have scrolled out of view.

Next, the operation of this embodiment will be described. The video and audio information of the presentation is converted into data and stored in the video/audio storage unit 14 of the storage device 4. The data of the presentation material used in the presentation is stored in the presentation material storage unit 12 of the storage device 4. The processing device 6 then analyzes the data stored in the video/audio storage unit 14 and the presentation material storage unit 12.

In its analysis, the processing device 6 derives summary sentences and indexes (keywords). For the summary, the audio data is analyzed by the feature extraction unit 20, the voice recognition unit 22, and the voice shaping unit 24, and the presentation material is analyzed by the material analysis unit 18. Based on these results, the summarization unit 26 extracts from the full text the sentences to be used in the summary, and the summary sentence generation unit 28 generates the summary sentences.

For the indexes, the audio data is likewise analyzed by the feature extraction unit 20, the voice recognition unit 22, and the voice shaping unit 24, and the presentation material is analyzed by the material analysis unit 18. Indexes are thus generated from both the audio data and the presentation material. For each index generated by the indexing unit 30, the part (time) of the audio data in which it appears is determined, and the voice corresponding section extraction unit 32 extracts the sentence containing that index from the full text.

When the user starts the presentation viewing system 2 after the above analysis and storage have been completed, the screen shown in FIG. 5 appears on the display device. After selecting the summary rate and other settings in the summary information display unit 58, the user clicks the play button near the presentation video display unit 56. The video in the presentation video display unit 56 is then played back, and the slide screen display unit 64 switches slides as the presentation progresses. The summary sentence display unit 66 displays the extracted summary sentences one after another as the presentation progresses.

When the user selects one of the keywords in the keyword list display unit 62, the presentation video display unit 56 switches to the scene in which the sentence containing the selected keyword is spoken, and the presentation video is played back from the beginning of that sentence. The slide screen display unit 64 displays the slide corresponding to the scene shown in the presentation video display unit 56. The summary sentence display unit 66 displays the summary sentences in order, starting from the sentence that contains the selected keyword.

When the user selects an underlined keyword in the slide screen display unit 64, the scene in which the sentence containing the selected keyword is spoken is displayed in the presentation video display unit 56. The summary sentence display unit 66 displays the summary sentences in order, starting from the sentence that contains the selected keyword.

When the user selects one of the slide titles in the slide list display unit 60, the presentation video display unit 56 plays back the video from the beginning of the part in which the selected slide is discussed. The slide screen display unit 64 displays the selected slide, and the summary sentence display unit 66 displays the summary sentences related to the selected slide in order from the first.

The summary rate and other settings in the summary information display unit 58 can be selected before the presentation is played back, but they can also be changed during playback. When the user operates the slider that indicates the playback position of the presentation video display unit 56, the slide screen display unit 64 and the summary sentence display unit 66 switch according to the new playback position.
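The synchronization between the playback position and the other panes can be sketched with a binary search over start times. This is only an illustration under the assumption that the start time of each slide and of each summary sentence is known and sorted; `active_index` and the sample times are hypothetical.

```python
from bisect import bisect_right

def active_index(start_times, position):
    """Index of the last item whose start time is <= position
    (falls back to 0 when the position precedes the first item)."""
    return max(0, bisect_right(start_times, position) - 1)

# Hypothetical start times in seconds, sorted ascending.
slide_starts = [0.0, 95.0, 240.0, 410.0]
summary_starts = [12.0, 130.0, 250.0, 300.0, 420.0]

position = 260.0  # playback position reported by the slider
slide_idx = active_index(slide_starts, position)
summary_idx = active_index(summary_starts, position)
```

The same lookup serves both panes, so moving the slider needs only the new position and the two precomputed lists of start times.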

Next, the effect of the presentation viewing system 2 of this embodiment was verified. The evaluation of the summary sentences generated by the presentation viewing system 2 is described first.

(Comparison with human summaries)
FIG. 6 shows the summarization results for the lectures of speaker SN (average of lectures 1-1 and 1-2) and speaker NK (average of lectures 3-1 and 3-2), obtained with each individual feature and with the combination of features (summary rate: 25 percent). In FIG. 6, "trn" denotes summaries created from manual transcriptions, and "asr" denotes summaries created from speech recognition results.

FIG. 7 shows the features and contribution weights used in this comparison. For each feature, FIG. 7 gives the contribution for "text" and for "speech input", both with and without slide information. Here "text" is the contribution used for "trn" (manual transcription) in FIG. 6, and "speech input" is the contribution used for "asr" in FIG. 6.

The graph in FIG. 6 shows that, among the individual features, summarization by frequent words and summarization by utterance duration came closest to the human summaries. Summarization by frequent words is essentially the same as tf, but tf is easily influenced by sentence length, whereas the frequent-word method extracts, with equal rank and regardless of sentence length, every sentence in which the designated words appear two or more times; this is thought to be why it was effective. Because summarization by utterance duration extracts sentences with long utterance times, its summary rate measured in time was 44 percent for speaker SN and 50 percent for speaker NK.
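The frequent-word criterion described above can be sketched as follows. This is one hedged reading of the text: it assumes sentences are already tokenized, that the set of designated frequent words is given, and that occurrences are counted across the whole set. The name `frequent_word_sentences` is hypothetical.

```python
def frequent_word_sentences(tokenized_sentences, frequent_words, min_count=2):
    """Return indices of sentences in which the designated words occur at least
    min_count times. Sentence length plays no role, and every selected sentence
    has equal rank (no per-sentence score is produced)."""
    selected = []
    for i, tokens in enumerate(tokenized_sentences):
        hits = sum(1 for t in tokens if t in frequent_words)
        if hits >= min_count:
            selected.append(i)
    return selected

sentences = [
    ["speech", "recognition", "uses", "speech", "features"],
    ["this", "is", "an", "aside"],
    ["a", "very", "long", "sentence", "that", "mentions", "speech",
     "once", "and", "nothing", "else"],
    ["summary", "of", "speech"],
]
picked = frequent_word_sentences(sentences, {"speech", "summary"})
```

The long third sentence is not selected even though it contains a designated word, which illustrates how a fixed occurrence threshold avoids the dependence on sentence length that a raw tf score has.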

With the combination of features, the summary of speaker NK's lectures based on text had a κ value of 0.451 (F value: 0.583), and the summary based on speech input had a κ value of 0.458 (F value: 0.588). Neither differs greatly from the κ value of 0.490 (F value: 0.593) obtained for the human summaries. This is thought to be because the feature-combination summary uses both prosodic information and surface linguistic information as features, and can therefore produce a summary closer to a human summary than one that uses surface linguistic information alone. The result for speaker NK is also thought to reflect the fact that, in those lectures, the lecture content and the digressions were clearly distinguishable. For speaker SN's lectures, by contrast, the difference between the text-based and speech-input summaries was somewhat larger (κ of 0.348 versus 0.319; F of 0.518 versus 0.497), and both differed considerably from the human summaries' κ value of 0.477 (F value: 0.539).
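For readers unfamiliar with the two measures, the sketch below computes them for a pair of sentence selections. It assumes that the κ value is Cohen's kappa and that the F value is the harmonic mean of precision and recall; the text does not spell out either definition, and the sample selections are invented for illustration.

```python
def kappa_and_f(system, human):
    """system, human: lists of 0/1 flags, one per sentence (1 = selected).
    Assumes both selections are non-empty and share at least one sentence."""
    n = len(system)
    both = sum(1 for s, h in zip(system, human) if s == 1 and h == 1)
    agree = sum(1 for s, h in zip(system, human) if s == h)
    p_observed = agree / n
    p_sys, p_hum = sum(system) / n, sum(human) / n
    # Agreement expected by chance if the two selections were independent.
    p_expected = p_sys * p_hum + (1 - p_sys) * (1 - p_hum)
    kappa = (p_observed - p_expected) / (1 - p_expected)
    precision = both / sum(system)
    recall = both / sum(human)
    f_value = 2 * precision * recall / (precision + recall)
    return kappa, f_value

# Hypothetical selections over eight sentences.
system = [1, 0, 0, 1, 0, 0, 0, 1]
human = [1, 0, 0, 0, 0, 1, 0, 1]
kappa, f_value = kappa_and_f(system, human)
```

Kappa discounts the agreement that two selections at the same summary rate would reach by chance, which is why it is lower than the raw agreement rate.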

(Evaluation of summaries using slide information)
Next, summaries generated with slide information are compared with summaries generated without it. For summaries using slide information (speech input), the κ value was 0.319 (F value: 0.518) for speaker SN and 0.458 (F value: 0.588) for speaker NK. For summaries not using slide information (speech input), the κ value was 0.273 (F value: 0.463) for speaker SN and 0.425 (F value: 0.563) for speaker NK. The summaries that used slide information were thus closer to the human summaries.

(Evaluation of summaries by 10 subjects)
Ten subjects evaluated the summary speech. For three lectures, summary speech based on human summarization (important sentence extraction) was compared with summary speech based on summaries generated by this system (automatic summarization), and a total of 30 responses were obtained from the 10 subjects.

"Question 1": Which summary speech made the lecture content easier to grasp?
Summary speech based on human summarization is better: 17
Neither: 9
Summary speech based on automatic summarization is better: 4
"Question 2": In which summary speech did the connection and flow of the sentences sound more natural?
Summary speech based on human summarization is better: 16
Neither: 4
Summary speech based on automatic summarization is better: 10
These results indicate that the summary speech based on summaries generated by this system comes close to the summary speech based on human summaries.

(Evaluation of the index function)
Next, the evaluation of the index function is described. In the indexing unit 30, the tf·idf score of each word in a slide was calculated, and words whose score was equal to or greater than the average were taken as the keywords of that slide. For idf, lecture data included in the CSJ (Corpus of Spontaneous Japanese, 2004 edition) was used (themes: speech processing and hearing; lectures by 264 male speakers), and only the nouns of the transcribed text to be matched were used. The slides evaluated were four slides in which the keywords appear in chronological order, that is, slides composed of sentences or bulleted text. DP matching (matching by dynamic programming) was used for the alignment.
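The keyword rule described above (tf·idf score at or above the average) can be sketched as follows. The exact tf and idf formulas are not given in the text, so the smoothed logarithmic idf used here is an assumption, as are the document frequencies, the corpus size, and the name `slide_keywords`.

```python
import math
from collections import Counter

def slide_keywords(slide_nouns, doc_freq, n_docs):
    """Score each noun in a slide by tf * idf and keep those whose score is
    at or above the mean score for that slide."""
    tf = Counter(slide_nouns)
    # Assumed idf form: log(N / (1 + df)); the patent does not specify one.
    scores = {w: tf[w] * math.log(n_docs / (1 + doc_freq.get(w, 0))) for w in tf}
    mean_score = sum(scores.values()) / len(scores)
    return [w for w in tf if scores[w] >= mean_score]

# Hypothetical nouns from one slide and hypothetical corpus statistics.
slide_nouns = ["prosody", "summary", "prosody", "method", "result"]
doc_freq = {"prosody": 4, "summary": 5, "method": 200, "result": 180}
keywords = slide_keywords(slide_nouns, doc_freq, n_docs=1000)
```

Thresholding at the per-slide mean adapts the number of keywords to each slide, so no fixed keyword count has to be chosen in advance.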

Nine subjects spent about 15 minutes trying out lecture material equipped with the index function, viewing its slides and a few minutes of the lecture, and then answered the two questions below. The subjects were fourth-year undergraduate and first-year master's students in the information engineering department of the university to which the inventors belong.

"Question 1": Did you find lecture material with the index function convenient?
Very inconvenient: 0
Inconvenient: 0
Neither: 1
Convenient: 5
Very convenient: 3
"Question 2": Which did you find more convenient, the index provided by links shown in the slides or the index shown in chronological order from the speech recognition results?
The index by keywords in the slides is far better: 2
The index by keywords in the slides is somewhat better: 3
Neither: 3
The index from the speech recognition results is somewhat better: 1
The index from the speech recognition results is far better: 0

In the results for Question 1, the great majority of subjects found lecture material with the index function convenient, which suggests that the system enables effective learning. In the results for Question 2, the index by keywords in the slides was preferred; this is thought to be because the keywords extracted from the speech recognition results were listed in chronological order. Keyword extraction methods that could replace tf·idf, and the way the keyword list is displayed, are considered to need improvement.

As described above, in the first embodiment of the present invention, indexes (keywords) are extracted from both the information in the presentation material and the audio information of the presentation, so more appropriate indexes can be extracted than when indexes are extracted from the audio information alone. As a result, the user can select appropriate indexes when using the system.

In this embodiment, selecting an index (keyword) immediately plays back the scene in which the sentence containing the selected keyword is uttered, so the user can view the desired content in a short time. In addition, words on the displayed slide are underlined as indexes, and when the user selects an underlined word, the scene in which the sentence containing the selected keyword is uttered is played back immediately, so the user can reach the desired content from the slide material in a short time.

In this embodiment, the sentences used for the summary are extracted from both the information in the presentation material and the audio information of the presentation, so more appropriate summary sentences can be generated than when the sentences are extracted from the audio information alone. The user can therefore understand the content of the presentation efficiently by referring to the summary sentences.

In this embodiment, the features used for summary extraction combine prosodic information with surface linguistic information, and the sentences used for the summary are extracted by also taking the information in the presentation material into account. Appropriate summary sentences can therefore be generated, and the user can understand the content of the presentation efficiently. Furthermore, as shown in FIG. 5, the presentation video, slide list, keywords, slide, and summary sentences are displayed together, which also helps the user understand the content of the presentation efficiently.

In this embodiment, sentences containing nouns that appear in the titles of the slide material are extracted for the summary, so a summary that reflects the content of the presentation can be generated. This is because slide titles usually contain important nouns that indicate the content of the presentation. In this embodiment, sentences that appear multiple times in the slides are also extracted for the summary, which likewise yields a summary that reflects the content of the presentation. In addition, because the keywords on the slides are underlined, the user can understand the content of the presentation efficiently.
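Extraction of sentences that contain slide-title nouns can be sketched as below. The sketch assumes that part-of-speech tagging has already reduced the slide titles to a list of nouns and that the recognized sentences are tokenized; `sentences_with_title_nouns` and the sample data are hypothetical.

```python
def sentences_with_title_nouns(tokenized_sentences, title_nouns):
    """Return indices of sentences that contain at least one slide-title noun."""
    title_nouns = set(title_nouns)
    return [i for i, tokens in enumerate(tokenized_sentences)
            if title_nouns.intersection(tokens)]

# Hypothetical nouns taken from slide titles, and recognized sentences.
title_nouns = ["indexing", "evaluation"]
recognized = [
    ["today", "we", "discuss", "indexing"],
    ["the", "weather", "was", "nice"],
    ["the", "evaluation", "used", "ten", "subjects"],
]
summary_candidates = sentences_with_title_nouns(recognized, title_nouns)
```

In the described system this criterion is one feature among several, so its output would be weighted together with the prosodic and other linguistic features rather than used alone.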

Next, a second embodiment of the present invention will be described. In the first embodiment, summary sentences and indexes (keywords) are extracted using the slides of the presentation material; the present embodiment presents summary sentences and indexes (keywords) to the user when no presentation material exists. The second embodiment can use the same system as the first embodiment, except that it lacks the parts related to the presentation material (for example, the material analysis unit 18 of the processing device 6), so a detailed description is omitted.

FIG. 8 shows the operation screen of the presentation viewing system 2 of the second embodiment. In FIG. 8, a presentation video display unit 68, a summary information display unit 70, a keyword list display unit 72, and a summary sentence display unit 74 are displayed on the display device. Because the second embodiment is a system for viewing lectures and the like that have no presentation material, no slide screen is displayed. In the second embodiment, the keywords displayed in the keyword list display unit 72 are extracted from the speech input alone, and the summary sentences displayed in the summary sentence display unit 74 are extracted using the "slide information not used" weights in the table shown in FIG. 7.

In the presentation viewing system 2 of the second embodiment as well, summary sentences and indexes (keywords) derived from the speech input are displayed, so the user can understand the content of the presentation efficiently. Moreover, selecting an index immediately plays back the sentence containing that index, so the user can view the presentation efficiently.

FIG. 1 is an overall view of the system of the first embodiment of the present invention.
FIG. 2 is a diagram illustrating the processing device 6 of the first embodiment of the present invention.
FIG. 3 is a diagram illustrating the output device 10 of the first embodiment of the present invention.
FIG. 4 is a diagram illustrating the user input device 8 of the first embodiment of the present invention.
FIG. 5 is a diagram showing the operation screen of the first embodiment of the present invention.
FIG. 6 is a diagram showing the evaluation results for the summary sentences generated by the system of the first embodiment of the present invention.
FIG. 7 is a table showing the values of the features used in the system of the first embodiment of the present invention.
FIG. 8 is a diagram showing the operation screen of the second embodiment of the present invention.

Explanation of symbols

2 Presentation viewing system
4 Main storage device
6 Processing device
8 User input device
10 Output device
12 Presentation material storage unit
14 Video/audio storage unit
18 Material analysis unit
20 Feature extraction unit
22 Voice recognition unit
24 Voice shaping unit
26 Summarization unit
28 Summary sentence generation unit
30 Indexing unit
32 Voice corresponding section extraction unit
34 Playback unit

Claims

1. A presentation analysis apparatus comprising:
an audio information analysis unit that analyzes audio information of presentation information including video and audio;
a material analysis unit that analyzes slide material used in the presentation; and
a summary sentence generation unit that generates summary sentences of the audio information of the presentation from the analysis result of the audio information analysis unit and the analysis result of the material analysis unit.

2. The presentation analysis apparatus according to claim 1, wherein the summary sentence generation unit extracts, as a summary, those sentences recognized by the audio information analysis unit in which a noun included in the title of the slide appears.

3. The presentation analysis apparatus according to claim 1 or 2, wherein the summary sentence generation unit extracts, as a summary, those sentences recognized by the audio information analysis unit in which a noun that appears at least once in the slide appears.

4. A presentation viewing system comprising the presentation analysis apparatus according to any one of claims 1 to 3 and a display unit that displays the presentation information, the slide material, and the summary sentences generated by the summary sentence generation unit.

5. A presentation analysis apparatus comprising:
an audio information analysis unit that analyzes audio information of presentation information including video and audio;
a material analysis unit that analyzes slide material used in the presentation; and
a keyword generation unit that generates keywords for the presentation based on the analysis result of the audio information analysis unit and the analysis result of the material analysis unit.

6. A presentation viewing system comprising the presentation analysis apparatus according to claim 5 and a display unit that displays the presentation information, the slide material, and the keywords generated by the keyword generation unit.

7. The presentation viewing system according to claim 6, further comprising a speech-corresponding section extraction unit that extracts, from the presentation information, a sentence in which a keyword generated by the keyword generation unit is uttered, wherein, when a keyword displayed on the display unit is selected, the display unit displays the scene in which the sentence extracted by the speech-corresponding section extraction unit is uttered.

8. The presentation viewing system according to claim 6 or 7, wherein the display unit displays the keywords generated by the keyword generation unit so that they are recognizable on the slide material.

9. A presentation viewing system comprising:
an audio information analysis unit that analyzes audio information of presentation information including video and audio;
a keyword generation unit that generates keywords for the presentation from the analysis result of the audio information analysis unit;
a speech-corresponding section extraction unit that extracts, from the presentation information, a sentence in which a keyword generated by the keyword generation unit is uttered; and
a display unit that displays the presentation information and the keywords,
wherein, when a keyword displayed on the display unit is selected, the display unit displays the scene in which the sentence extracted by the speech-corresponding section extraction unit is uttered.
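
As an illustration only, the noun-based extraction criteria recited in the claims (sentences containing a noun from the slide title, or a noun that appears in the slide) can be sketched as below. The sketch assumes that the recognized sentences are already tokenized and that the noun sets have been obtained beforehand, for example with a morphological analyzer; the function name `extract_summary` and its parameters are hypothetical, and the sketch does not cover any other features the system may combine with these criteria.

```python
def extract_summary(sentences, title_nouns, slide_nouns=frozenset()):
    """Keep the sentences that contain at least one target noun.

    sentences   : list of token lists (one list per recognized sentence)
    title_nouns : nouns taken from the slide title (assumed precomputed)
    slide_nouns : nouns appearing anywhere in the slide (assumed precomputed)
    """
    targets = set(title_nouns) | set(slide_nouns)
    return [tokens for tokens in sentences if targets.intersection(tokens)]


# Illustrative data only; not taken from the patent.
recognized = [
    ["we", "propose", "a", "summarization", "method"],
    ["thank", "you", "for", "coming"],
    ["the", "method", "uses", "slide", "text"],
]

# Title nouns only: one sentence is kept.
by_title = extract_summary(recognized, {"summarization"})

# Title nouns plus slide nouns: two sentences are kept.
by_slide = extract_summary(recognized, {"summarization"}, {"slide", "text"})

print(len(by_title), len(by_slide))  # 1 2
```

Sentence order is preserved, so the extracted summary follows the order in which the sentences were spoken.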