JP3803302B2

JP3803302B2 - Video summarization device

Info

Publication number: JP3803302B2
Application number: JP2002060844A
Authority: JP
Inventors: 浩太 日髙; 信弥 中嶌; 理 水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-03-06
Filing date: 2002-03-06
Publication date: 2006-08-02
Anticipated expiration: 2022-03-06
Also published as: JP2003259311A

Abstract

PROBLEM TO BE SOLVED: To provide a video playback method that compresses the contents of a recorded program at an arbitrary summary rate and plays back the resulting summary.

SOLUTION: The video playback method uses an audio processing method that determines the emphasized state from feature quantities obtained by analyzing the audio signal frame by frame. A codebook stores each feature quantity in association with its appearance probability in the emphasized state and its appearance probability in the calm state. For the audio signal to be summarized, which is recorded on a recording medium, the method looks up the emphasized-state and calm-state appearance probabilities corresponding to the feature quantities of each frame, calculates the probability of being in the emphasized state from the former and the probability of being in the calm state from the latter, and provisionally treats each audio signal section whose ratio of emphasized-state probability to calm-state probability exceeds a prescribed coefficient as a summary section. It then calculates the total duration of the summary sections, determines the summary sections for which this total is approximately the prescribed summary time, reads the determined summary sections from the recording medium, and plays them back.

COPYRIGHT: (C)2003,JPO

Description

【０００１】
【発明の属する技術分野】
この発明は記録媒体に記録されている各種の音声付映像を要約して再生する映像再生方法、映像再生装置及び映像再生プログラムに関する。
【０００２】
【従来の技術】
従来より各種の要約方法が提案されている。その一つとして連続する複数フレームからなる区間動画像を動画全体の各ブロックから抽出し、抽出した各ブロックの区間動画像をつなぎ合わせてダイジェスト画像とする装置があった。例えば、日本国特開平８−９３１０号公報、日本国特開平３−９０９６８号公報、日本国特開平６−１６５００９号公報などに示されている。
また、オーディオセグメントの時間圧縮方法として、ポーズ圧縮の割合を精密に制御し、了解性の高いダイジェストを作成する方法があった。例えば、日本国特開２００１−１５４７００公報などに示されている。
【０００３】
また、テロップや音情報を使って、当該番組映像の特徴となる場面やシーンを抽出してダイジェスト映像とするシステムがあった。例えば、日本国特開２００１−２３０６２公報などに示されている。
【０００４】
【発明が解決しようとする課題】
コンテンツを任意の時間で要約、もしくはダイジェストを生成するには、コンテンツを構成する各シーンの優先順位をあらかじめ求めておく必要がある。日本国特開平８−９３１０号公報、日本国特開平３−９０９６８号公報、日本国特開平６−１６５００９号公報では、ユーザが重要と思うシーンをジョイスティック等のポインティングデバイスや、複数のボタンを用いて入力し、ダイジェスト優先度情報を付与している。利用者にとってダイジェスト生成のための負担が大きい。
【０００５】
また、日本国特開２００１−１５４７００公報では、ポーズ圧縮によって、ダイジェストを生成しているが、コンテンツの大半が通例ポーズでない区間で占められている以上、単にポーズを除去するだけでは要約再生時間を元のコンテンツ再生時間の１／１０以上といった高い圧縮率でコンテンツを圧縮することは非現実的である。
また、日本国特開２０００−２３０６２公報では、ダイジェスト映像生成方法として、音情報の音量値だけを手がかりに特定された要約区間は必ずしも重要な区間とはいえない。何故ならば要点を強調して話す場合、必ずしも音量を大きく話すとは限らないからである。また、テロップ情報を用いる場合、テロップが存在しないコンテンツのダイジェストの生成や、テロップが出現しない区間ではダイジェストを生成することは不可能である。
【０００６】
また、生放送など実時間映像付音声信号の配信を受け、再生しているとき、離席等により当該番組を視聴できなかった場合に、その前半の部分を録画を続けながら要約して視ることができると、番組前半の筋書きを理解した上で後続する映像を視聴できることが期待される。
この発明の目的は記録媒体に格納した映像を任意の時間に圧縮して再生することができる映像再生方法及び映像再生装置、映像再生プログラムを提案しようとするものである。
【０００７】
【課題を解決するための手段】
この発明では、実時間映像信号と音声信号を再生時刻と対応付けて記憶し、要約開始時刻を入力し、要約区間の総延長時間である要約時間又は要約区間の総延長時間の全要約対象区間の比である要約率を入力し、
前記要約時間又は要約率で前記要約開始時刻から要約終了時刻として現在までの要約対象区間における音声信号について強調状態と判定された区間を要約区間と判定し、
前記要約区間の音声信号と映像信号を再生する映像再生方法を提案する。
【０００８】
この発明では更に、前記要約区間の音声信号と映像信号の再生終了の時刻を新たな要約区間の終了時刻とし、前記要約区間の再生終了時刻を新たな要約区間再生開始時刻とする前記要約区間の決定及び当該要約区間の音声信号と映像信号の再生を反復する映像再生方法を提案する。
この発明では更に、前記要約率ｒ（ｒは０＜ｒ＜１となる実数）をｒ／（１＋ｒ）と調整し、当該調整された要約率をもって要約区間を判定する映像再生方法を提案する。
この発明では更に、少なくとも基本周波数又はピッチ周期、パワー、動的特徴量の時間変化特性、又はこれらのフレーム間差分を含む特徴量と強調状態での出現確率と平静状態での出現確率を対応して格納した符号帳を用い、
前記音声信号をフレーム毎に分析した前記特徴量に対応する強調状態での出現確率を求め、
前記音声信号をフレーム毎に分析した前記特徴量に対応する平静状態での出現確率を求め、
前記強調状態での出現確率に基づいて強調状態となる確率を算出し、
前記平静状態での出現確率に基づいて平静状態となる確率を算出し、
前記強調状態となる確率の前記平静状態となる確率に対する確率比を音声信号区間毎に算出し、
前記確率比に対応する音声信号区間の時間を降順に累積して要約時間を算出し、
前記要約時間の全要約対象区間に対する比である要約率が前記入力された要約率となる音声信号区間を前記要約区間と決定する映像再生方法を提案する。
【０００９】
この発明では更に、少なくとも基本周波数又はピッチ周期、パワー、動的特徴量の時間変化特性、又はこれらのフレーム間差分を含む特徴量と強調状態での出現確率と平静状態での出現確率を対応して格納した符号帳を用い、
前記音声信号をフレーム毎に分析した前記特徴量に対応する強調状態での出現確率を求め、
前記音声信号をフレーム毎に分析した前記特徴量に対応する平静状態での出現確率を求め、
前記強調状態での出現確率に基づいて強調状態となる確率を算出し、
前記平静状態での出現確率に基づいて平静状態となる確率を算出し、
前記強調状態となる確率の前記平静状態となる確率に対する確率比が所定の係数より大きい音声信号区間を要約区間と仮判定し、
要約区間の時間の総和、又は要約率として前記音声信号全区間の時間の前記要約区間の時間の総和に対する比率を算出し、
前記要約区間の時間の総和が略所定の要約時間に、又は前記要約率が略所定の要約率となる前記所定の係数を算出して要約区間を決定する映像再生方法を提案する。
【００１０】
この発明では更に、前記音声信号をフレーム毎に無音区間か否か、有声区間か否か判定し、
所定フレーム数以上の無音区間で囲まれ、有声区間を含む部分を音声小段落と判定し、
音声小段落に含まれる有声区間の平均パワーが該音声小段落内の平均パワーの所定の定数倍より小さい音声小段落を末尾とする音声小段落群を音声段落と判定し、
前記音声信号区間は音声段落毎に定められたものであり、
前記要約時間を音声段落毎に累積して求める映像再生方法を提案する。
【００１１】
この発明では更に、実時間映像信号と音声信号を再生時刻と対応付けて記憶する記憶手段と、
要約開始時刻を入力する要約開始時刻入力手段と、
要約区間の総延長時間である要約時間又は要約区間の総延長時間の全要約対象区間に対する比である要約率で定められる要約条件を入力する要約条件入力手段と、
前記要約条件に従って、前記要約開始時刻から要約終了時刻として現在までの要約対象区間における音声信号について強調状態と判定された区間を要約区間と判定する要約区間決定手段と、
前記要約区間決定部で決定した要約区間の音声信号と映像信号を再生する再生手段とを有する映像再生装置を提案する。
【００１２】
この発明では更に、前記映像再生装置において、前記要約区間決定手段は、
少なくとも基本周波数又はピッチ周期、パワー、動的特徴量の時間変化特性、又はこれらのフレーム間差分を含む特徴量と強調状態での出現確率と平静状態での出現確率を対応して格納した符号帳と、
前記符号帳を用いて前記音声信号をフレーム毎に分析した前記特徴量に対応する強調状態での出現確率を求め、
前記強調状態での出現確率に基づいて強調状態となる確率を算出する強調状態確率計算部と、
前記符号帳を用いて前記音声信号をフレーム毎に分析した前記特徴量に対応する平静状態での出現確率を求め、前記平静状態での出現確率に基づいて平静状態となる確率を算出する平静状態確率計算部と、
前記強調状態となる確率の前記平静状態となる確率に対する確率比を音声信号区間毎に算出し、
前記確率比に対応する音声信号区間の時間を降順に累積して要約時間を算出し、要約区間を仮決定する要約区間仮決定部と、
前記要約区間の全要約対象区間に対する比が前記要約率を満たす音声信号区間を前記要約区間と決定する要約区間決定部とを有する映像再生装置を提案する。
【００１３】
この発明では更に、前記映像再生装置において、前記要約区間決定手段は、
少なくとも基本周波数又はピッチ周期、パワー、動的特徴量の時間変化特性、又はこれらのフレーム間差分を含む特徴量と強調状態での出現確率と平静状態での出現確率を対応して格納した符号帳と、
この符号帳を用いて前記音声符号をフレーム毎に分析した前記特徴量に対応する強調状態での出現確率と平静状態での出現確率を求め、
前記強調状態での出現確率に基づいて強調状態となる確率を算出する強調状態確率計算部と、
前記平静状態での出現確率に基づいて平静状態となる確率を算出する平静状態確率計算部と、
前記強調状態となる確率の前記平静状態となる確率に対する確率比が所定の係数より大きい音声信号区間を要約区間と仮判定する要約区間仮判定部と、
要約区間の時間の総和が略所定の要約時間に、又は前記要約率が略所定の要約率となる前記所定の係数を算出して各チャネル毎又は各発話者毎の要約区間を決定する要約区間決定部とを有する映像再生装置。
【００１４】
この発明では更に、コンピュータが解読可能な符号によって記述され、前記記載の映像再生方法の何れかを実行させる映像再生プログラムを提案する。
作用
この発明の映像再生方法によれば記録媒体に記録されている音声の強調状態となる確率が高い音声区間を要約区間として抽出するから、コンテンツの内容で重要な部分を抜き出し、重要な部分をつなぎ合せて要約音声及び要約された映像情報を得ることができる。この結果、要約時間を短時間に圧縮したとしても、そのコンテンツの内容をよく理解することができる。
【００１５】
また、記録媒体として短時間に多量のデータを書き込み及び読み出すことができる記録媒体を用いることから、録画を続けながら、他のコンテンツ又は録画中のコンテンツの前半部分を読み出すことができる。このために、録画を続けながら、その録画中の番組の録画部分を要約し、要約情報を再生することができる。この結果、録画中の映像と、要約を伝える映像を例えば親画面と子画面と異なる表示手段にそれぞれに表示することにより、現在の放送内容と過去の放送内容の双方を視ることができ、要約情報の再生が終了した時点では番組の前半の部を理解した状態で続きを視聴することができる利点が得られる。
【００１６】
この発明の特徴とする点は、コンテンツ要約再生時にユーザからの要求に従って、どのような要約率（圧縮率）にでもコンテンツを要約することができる要約方法を用いる点にある。
この特徴とする要約方法は、先願である特願２００１−２４１２７８で本出願人が提案した、任意の音声小段落の発話状態を判定し、強調状態となる確率が平静状態となる確率よりも大きければ、その音声小段落を強調状態にあると判定し、その音声小段落を含む音声段落を要約区間として抽出する音声強調状態判定方法及び音声要約方法を利用して実現することができる。
【００１７】
【発明の実施の形態】
ここで、この発明で用いられる音声小段落抽出方法、音声段落抽出方法、各音声小段落毎に強調状態となる確率及び平静状態となる確率を求める方法について、説明する。
図５に先に提案した音声要約方法の実施形態の基本手順を示す。ステップＳ１で入力音声信号を分析して音声特徴量を求める。ステップＳ２で、入力音声信号の音声小段落と、複数の音声小段落から構成される音声段落を抽出する。ステップＳ３で各音声小段落を構成するフレームが平静状態か、強調状態か発話状態を判定する。この判定に基づきステップＳ４で要約音声を作成し、要約音声を得る。
【００１８】
以下に、自然な話し言葉や会話音声を、要約に適用する場合の実施例を述べる。音声特徴量は、スペクトル情報等に比べて、雑音環境下でも安定して得られ、かつ話者に依存し難いものを用いる。入力音声信号から音声特徴量として基本周波数（ｆ０）、パワー（ｐ）、音声の動的特徴量の時間変化特性（ｄ）、ポーズ時間長（無音区間）（ｐｓ）を抽出する。これらの音声特徴量の抽出法は、例えば、「音響・音響工学」（古井貞煕、近代科学社、１９９８）、「音声符号化」（守谷健弘、電子情報通信学会、１９９８）、「ディジタル音声処理」（古井貞煕、東海大学出版会、１９８５）、「複合正弦波モデルに基づく音声分析アルゴリズムに関する研究」（嵯峨山茂樹、博士論文、１９９８）などに述べられている。音声の動的特徴量の時間変化は発話速度の尺度となるパラメータであり特許第２９７６９９８号に記載のものを用いてもよい。即ち、動的変化量としてスペクトル包絡を反映するＬＰＣスペクトラム係数の時間変化特性を求め、その時間変化をもとに発話速度係数が求められるものである。より具体的にはフレーム毎にＬＰＣスペクトラム係数Ｃ１（ｔ）、…Ｃｋ（ｔ）を抽出して次式のような動的特徴量ｄ（ダイナミックメジャー）を求める。
$$d(t) = \sum_{i=1}^{k} \left[ \frac{\sum_{f=t-f_0}^{t+f_0} f \times C_i(t)}{\sum_{f=t-f_0}^{t+f_0} f^2} \right]^2$$
ここで、ｆ０は前後の音声区間フレーム数（必ずしも整数個のフレームでなくとも一定の時間区間でもよい）、ｋはＬＰＣスペクトラムの次数、ｉ＝１、２、…ｋである。発話速度の係数として動的特徴量の変化の極大点の単位時間当たりの個数、もしくは単位時間当たりの変化率が用いられる。
【００１９】
実施例では例えば１００ｍｓを１フレームとし、シフトを５０ｍｓとする。１フレーム毎の平均の基本周波数を求める（ｆ０´）。パワーについても同様に１フレーム毎の平均パワー（ｐ´）を求める。更に現フレームのｆ０´と±ｉフレーム前後のｆ０´との差分をとり、±Δｆ０´ｉ（Δ成分）とする。パワーについても同様に現フレームのｐ´と±ｉフレーム前後のｐ´との差分±Δｐ´ｉ（Δ成分）を求める。ｆ０´、±Δｆ０´ｉ、ｐ´、±Δｐ´ｉを規格化する。この規格化では例えばｆ０´、±Δｆ０´ｉをそれぞれ、音声波形全体の平均基本周波数で割る。これら規格化された値をｆ０″、±Δｆ０″ｉと表す。ｐ´、±Δｐ´ｉについても同様に、発話状態判定の対象とする音声波形全体の平均パワーで割り、規格化する。規格化するにあたり、後述する音声小段落、音声段落ごとの平均パワーで割ってもよい。これら規格化された値をｐ″、±Δｐ″ｉと表す。ｉの値は例えばｉ＝４とする。現フレームの前後±Ｔ１ｍｓの、ダイナミックメジャーのピーク本数、即ち動的特徴量の変化の極大点の個数を数える（ｄｐ）。これと、現フレームの開始時刻の、Ｔ２ｍｓ前の時刻を区間に含むフレームのｄｐとのΔ成分（−Δｄｐ）を求める。前記±Ｔ１ｍｓのｄｐと、現フレームの終了時刻の、Ｔ３ｍｓ後の時刻を区間に含むフレームのｄｐとのΔ成分（＋Δｄｐ）を求める。これら、Ｔ１、Ｔ２、Ｔ３の値は例えばＴ１＝Ｔ２＝Ｔ３＝４５０ｍｓとする。フレームの前後の無音区間の時間長を±ｐｓとする。ステップＳ１ではこれら音声特徴パラメータの各値をフレーム毎に抽出する。
【００２０】
ステップＳ２における入力音声の音声小段落と、音声段落を抽出する方法の例を図６に示す。ここで音声小段落を発話状態判定を行う単位とする。ステップＳ２０１で、入力音声信号の無音区間と有声区間を抽出する。無音区間は例えばフレーム毎のパワーが所定のパワー値以下であれば無音区間と判定し、有声区間は、例えばフレーム毎の相関関数が所定の相関関数値以上であれば有声区間と判定する。有声／無声の決定は、周期性／非周期性の特徴と同一視することにより、自己相関関数や変形相関関数のピーク値で行うことが多い。入力信号の短時間スペクトルからスペクトル包絡を除去した予測残差の自己相関関数が変形相関関数であり、変形相関関数のピークが所定の閾値より大きいか否かによって有声／無声の判定を行い、又そのピークを与える遅延時間によってピッチ周期１／ｆ０（基本周波数ｆ０）の抽出を行う。これらの区間の抽出法の詳細は、例えば、「ディジタル音声処理」（古井貞煕、東海大学出版会、１９８５）などに述べられている。ここでは音声信号から各音声特徴量をフレーム毎に分析することについて述べたが、既に符号化等により分析された係数もしくは符号に対応する特徴量を符号化に用いる符号帳から読み出して用いてもよい。
【００２１】
ステップＳ２０２で、有声区間を囲む無音区間の時間がそれぞれｔ秒以上になるとき、その無音区間で囲まれた有声区間を含む部分を音声小段落とする。このｔは例えばｔ＝４００ｍｓとする。ステップＳ２０３で、この音声小段落内の好ましくは後半部の、有声区間の平均パワーと、その音声小段落の平均のパワーの値ＢAの定数β倍とを比較し、前者の方が小さい場合はその音声小段落を末尾音声小段落とし、直前の末尾音声小段落後の音声小段落から現に検出した末尾音声小段落までを音声段落として決定する。
図７に、有声区間、音声小段落、音声段落を模式的に示す。音声小段落を前記の、有声区間を囲む無音区間の時間がｔ秒以上の条件で、抽出する。図７では、音声小段落ｊ−１、ｊ、ｊ＋１について示している。ここで音声小段落ｊは、ｎ個の有声区間から構成され、平均パワーをＰｊとする。有声区間の典型的な例として、音声小段落ｊに含まれる、有声区間ｖの平均パワーはｐｖである。音声段落ｋは、音声小段落ｊと音声小段落を構成する後半部分の有声区間のパワーから抽出する。ｉ＝ｎ−αからｎまでの有声区間の平均パワーｐｉの平均が音声小段落ｊの平均パワーＰｊのβ倍より小さいとき、即ち、
$$\frac{\sum_{i=n-\alpha}^{n} p_i}{\alpha + 1} < \beta P_j \qquad \text{式（１）}$$
を満たす時、音声小段落ｊが音声段落ｋの末尾音声小段落であるとする。ただし、Σはｉ＝ｎ−αからｎまでである。式（１）のα、βは定数であり、これらを操作して、音声段落を抽出する。実施例では、αは３、βは０．８とした。このようにして末尾音声小段落を区切りとして隣接する末尾音声小段落間の音声小段落群を音声段落と判定できる。
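The test in equation (1) and the grouping of sub-paragraphs into paragraphs can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the function names, the list-based input layout, and the handling of a sub-paragraph with fewer than α+1 voiced sections (the mean is taken over the sections that exist) are assumptions introduced here.

```python
from typing import List, Sequence, Tuple


def is_final_subparagraph(voiced_powers: Sequence[float], block_power: float,
                          alpha: int = 3, beta: float = 0.8) -> bool:
    """Equation (1): mean power of the last alpha+1 voiced sections < beta * P_j.

    Assumption: if fewer than alpha+1 voiced sections exist, average what is there.
    """
    tail = voiced_powers[-(alpha + 1):]
    return sum(tail) / len(tail) < beta * block_power


def group_into_paragraphs(subparagraphs: Sequence[Tuple[Sequence[float], float]],
                          alpha: int = 3, beta: float = 0.8) -> List[List[int]]:
    """Group sub-paragraph indices; each group ends at a final sub-paragraph.

    subparagraphs: (voiced-section mean powers, sub-paragraph mean power P_j).
    """
    paragraphs: List[List[int]] = []
    current: List[int] = []
    for index, (voiced, p_j) in enumerate(subparagraphs):
        current.append(index)
        if is_final_subparagraph(voiced, p_j, alpha, beta):
            paragraphs.append(current)
            current = []
    if current:  # trailing sub-paragraphs with no detected ending
        paragraphs.append(current)
    return paragraphs


example = [
    ([1.0, 1.0, 1.0, 1.0], 1.0),       # power stays level: not an ending
    ([1.0, 0.5, 0.5, 0.5, 0.5], 1.0),  # power drops at the end: an ending
    ([2.0], 1.0),
]
print(group_into_paragraphs(example))
```

With α = 3 and β = 0.8 as in the embodiment, the second sub-paragraph closes the first paragraph because its last four voiced sections average 0.5, which is below 0.8 × P_j.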
【００２２】
図５中のステップＳ３における音声小段落発話状態判定方法の例を図８に示す。ステップＳ３０１で、入力音声小段落の音声特徴量をベクトル量子化する。このために、あらかじめ少なくとも２つの量子化音声特徴量（コード）が格納された符号帳（コードブック）を作成しておく。ここでコードブックに蓄えられた音声特徴量と入力音声もしくは既に分析して得られた音声の音声特徴量との照合をとり、コードブックの中から音声特徴量間の歪（距離）を最小にする量子化音声特徴量を特定することが常套である。
図９に、このコードブックの作成法の例を示す。多数の学習用音声を被験者が聴取し、発話状態が平静状態であるものと、強調状態であるものをラベリングする（Ｓ５０１）。
【００２３】
例えば、被験者が発話の中で強調状態とする理由として、
（ａ）声が大きく、名詞や接続詞を伸ばすように発話する
（ｂ）話し始めを伸ばして話題変更を主張、意見を集約するように声を大きくする
（ｃ）声を大きく高くして重要な名詞等を強調する時
（ｄ）高音であるが声はそれほど大きくない
（ｅ）苦笑いしながら、焦りから本音をごまかすような時
（ｆ）周囲に同意を求める、あるいは問いかけるように、語尾が高音になるとき
（ｇ）ゆっくりと力強く、念を押すように、語尾の声が大きくなる時
（ｈ）声が大きく高く、割り込んで発話するという主張、相手より大きな声で
（ｉ）大きな声では憚られるような本音や秘密を発言する場合や、普段、声の大きい人にとっての重要なことを発話するような時（例えば声が小さくボソボソ、ヒソヒソという口調）を挙げた。この例では、平静状態とは、前記の（ａ）〜（ｉ）のいずれでもなく、発話が平静であると被験者が感じたものとした。
【００２４】
尚、上述では強調状態と判定する対象を発話であるものとして説明したが、音楽でも強調状態を特定することができる。ここでは音声付の楽曲において、音声から強調状態を特定しようとした場合に、強調と感じる理由として、
（ａ）声が大きく、かつ声が高い
（ｂ）声が力強い
（ｃ）声が高く、かつアクセントが強い
（ｄ）声が高く、声質が変化する
（ｅ）声を伸長させ、かつ声が大きい
（ｆ）声が大きく、かつ、声が高く、アクセントが強い
（ｇ）声が大きく、かつ、声が高く、叫んでいる
（ｈ）声が高く、アクセントが変化する
（ｉ）声を伸長させ、かつ、声が大きく、語尾が高い
（ｊ）声が高く、かつ、声を伸長させる
（ｋ）声を伸長させ、かつ、叫び、声が高い
（ｌ）語尾上がり力強い
（ｍ）ゆっくり強め
（ｎ）曲調が不規則
（ｏ）曲調が不規則、かつ、声が高い
また、音声を含まない楽器演奏のみの楽曲でも強調状態を特定することができる。その強調と感じる理由として、
（ａ）強調部分全体のパワー増大
（ｂ）音の高低差が大きい
（ｃ）パワーが増大する
（ｄ）楽器の数が変化する
（ｅ）曲調、テンポが変化する
等である。
【００２５】
これらを基にコードブックを作成しておくことにより、発話に限らず音楽の要約も行うことができることになる。
平静状態と強調状態の各ラベル区間について、図５中のステップＳ１と同様に、音声特徴量を抽出し（Ｓ５０２）、パラメータを選択する（Ｓ５０３）。平静状態と強調状態のラベル区間の、前記パラメータを用いて、ＬＢＧアルゴリズムでコードブックを作成する（Ｓ５０４）。ＬＢＧアルゴリズムについては、例えば、（Ｙ．Ｌｉｎｄｅ，Ａ．ＢｕｚｏａｎｄＲ．Ｍ．Ｇｒａｙ，“Ａｎａｌｇｏｒｉｔｈｍｆｏｒｖｅｃｔｏｒｑｕａｎｔｉｚｅｒｄｅｓｉｇｎ，”ＩＥＥＥＴｒａｎｓ．Ｃｏｍｍｕｎ．，ｖｏｌ．Ｃｏｍ−２８，ｐｐ．８４−９５，１９８０）がある。コードブックサイズは２のｎ乗個に可変である。このコードブック作成は音声小段落で又はこれより長い適当な区間毎あるいは学習音声全体の音声特徴量で規格化した音声特徴量を用いることが好ましい。
【００２６】
図８中のステップＳ３０１で、このコードブックを用いて、入力音声小段落の音声特徴量を、各音声特徴量について規格化し、その規格化された音声特徴量をフレーム毎に照合もしくはベクトル量子化し、フレーム毎にコード（量子化された音声特徴量）を得る。この際の入力音声信号より抽出する音声特徴量は前記のコードブック作成に用いたパラメータと同じである。
強調状態が含まれる音声小段落を特定するために、音声小段落でのコードを用いて、発話状態の尤度（らしさ）を、平静状態と強調状態について求める。このために、あらかじめ、任意のコード（量子化音声特徴量）の出現確率を、平静状態の場合と、強調状態の場合について求めておき、この出現確率とそのコードとを組としてコードブックに格納しておく、以下にこの出現確率の求め方の例を述べる。前記のコードブック作成に用いた学習音声中のラベルが与えられた１つの区間（ラベル区間）の音声特徴量のコード（フレーム毎に得られる）が、時系列でＣｉ、Ｃｊ、Ｃｋ、…Ｃｎであるとき、ラベル区間αが強調状態となる確率をＰα（ｅ）、平静状態となる確率をＰα（ｎ）とし、
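$$P_\alpha(e) = P_{emp}(C_i)\,P_{emp}(C_j \mid C_i)\cdots P_{emp}(C_n \mid C_i \cdots C_{n-1}) = P_{emp}(C_i)\prod_{x} P_{emp}(C_x \mid C_i \cdots C_{x-1})$$
$$P_\alpha(n) = P_{nrm}(C_i)\,P_{nrm}(C_j \mid C_i)\cdots P_{nrm}(C_n \mid C_i \cdots C_{n-1}) = P_{nrm}(C_i)\prod_{x} P_{nrm}(C_x \mid C_i \cdots C_{x-1})$$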
となる。ただし、Ｐｅｍｐ（Ｃｘ｜Ｃｉ…Ｃｘ−１）はコード列Ｃｉ…Ｃｘ−１の次にＣｘが強調状態となる条件付確率、Ｐｎｒｍ（Ｃｘ｜Ｃｉ…Ｃｘ−１）は同様にＣｉ…Ｃｘ−１に対しＣｘが平静状態となる確率である。ただし、Πはｘ＝ｉ＋１からｎまでの積である。またＰｅｍｐ（Ｃｉ）は学習音声についてフレームで量子化し、これらコード中のＣｉが強調状態とラベリングされた部分に存在した個数を計数し、その計数値を全学習音声の全コード数（フレーム数）で割り算した値であり、Ｐｎｒｍ（Ｃｉ）はＣｉが平静状態とラベリングされた部分に存在した個数を全コード数で割り算した値である。
【００２７】
このラベル区間αの各状態確率を簡単にするために、この例ではＮ−ｇｒａｍモデル（Ｎ＜ｎ）を用いて、
Ｐα（ｅ）＝Ｐｅｍｐ（Ｃｎ｜Ｃｎ−Ｎ＋１…Ｃｎ−１）
Ｐα（ｎ）＝Ｐｎｒｍ（Ｃｎ｜Ｃｎ−Ｎ＋１…Ｃｎ−１）
とする。つまりＣｎよりＮ−１個の過去のコード列Ｃｎ−Ｎ＋１…Ｃｎ−１の次にＣｎが強調状態として得られる確率をＰα（ｅ）とし、同様にＮ−ｇｒａｍの確率値をより低次のＭ−ｇｒａｍ（Ｎ≧Ｍ）の確率値と線形に補間する線形補間法を適応することが好ましい。例えばＣｎよりＮ−１個の過去のコード列Ｃｎ−Ｎ＋１…Ｃｎ−１の次にＣｎが平静状態として得られる確率をＰα（ｎ）とする。このようなＰα（ｅ）、Ｐα（ｎ）の条件付確率をラベリングされた学習音声の量子化コード列から全てを求めるが、入力音声信号の音声特徴量の量子化したコード列と対応するものが学習音声から得られていない場合もある。そのため、高次（即ちコード列の長い）の条件付確率を単独出現確率とより低次の条件付出現確率とを補間して求める。例えばＮ＝３のｔｒｉｇｒａｍ、Ｎ＝２のｂｉｇｒａｍ、Ｎ＝１のｕｎｉｇｒａｍを用いて線形補間法を施す。Ｎ−ｇｒａｍ、線形補間法、ｔｒｉｇｒａｍについては、例えば、「音声言語処理」（北研二、中村哲、永田昌明、森北出版、１９９６、２９頁）などに述べられている。即ち、
Ｎ＝３（ｔｒｉｇｒａｍ）：Ｐｅｍｐ（Ｃｎ｜Ｃｎ−２Ｃｎ−１）、Ｐｎｒｍ（Ｃｎ｜Ｃｎ−２Ｃｎ−１）
Ｎ＝２（ｂｉｇｒａｍ）：Ｐｅｍｐ（Ｃｎ｜Ｃｎ−１）、Ｐｎｒｍ（Ｃｎ｜Ｃｎ−１）
Ｎ＝１（ｕｎｉｇｒａｍ）：Ｐｅｍｐ（Ｃｎ）、Ｐｎｒｍ（Ｃｎ）
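であり、これら３つの強調状態でのＣｎの出現確率、また３つの平静状態でのＣｎの出現確率をそれぞれ用いて次式により、Ｐｅｍｐ（Ｃｎ｜Ｃｎ−２Ｃｎ−１）、Ｐｎｒｍ（Ｃｎ｜Ｃｎ−２Ｃｎ−１）を計算することにする。
$$P_{emp}(C_n \mid C_{n-2}C_{n-1}) = \lambda_{emp1}P_{emp}(C_n \mid C_{n-2}C_{n-1}) + \lambda_{emp2}P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3}P_{emp}(C_n) \qquad \text{式（２）}$$
$$P_{nrm}(C_n \mid C_{n-2}C_{n-1}) = \lambda_{nrm1}P_{nrm}(C_n \mid C_{n-2}C_{n-1}) + \lambda_{nrm2}P_{nrm}(C_n \mid C_{n-1}) + \lambda_{nrm3}P_{nrm}(C_n) \qquad \text{式（３）}$$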
Ｔｒｉｇｒａｍの学習データをＮとしたとき、すなわち、コードが時系列でＣ１、Ｃ２、．．．ＣＮが得られたとき、λｅｍｐ１、λｅｍｐ２、λｅｍｐ３の再推定式は前出の参考文献「音声言語処理」より次のようになる。
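$$\lambda_{emp1} = \frac{1}{N}\sum_{n=1}^{N} \frac{\lambda_{emp1}P_{emp}(C_n \mid C_{n-2}C_{n-1})}{\lambda_{emp1}P_{emp}(C_n \mid C_{n-2}C_{n-1}) + \lambda_{emp2}P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3}P_{emp}(C_n)}$$
$$\lambda_{emp2} = \frac{1}{N}\sum_{n=1}^{N} \frac{\lambda_{emp2}P_{emp}(C_n \mid C_{n-1})}{\lambda_{emp1}P_{emp}(C_n \mid C_{n-2}C_{n-1}) + \lambda_{emp2}P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3}P_{emp}(C_n)}$$
$$\lambda_{emp3} = \frac{1}{N}\sum_{n=1}^{N} \frac{\lambda_{emp3}P_{emp}(C_n)}{\lambda_{emp1}P_{emp}(C_n \mid C_{n-2}C_{n-1}) + \lambda_{emp2}P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3}P_{emp}(C_n)}$$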
ただし、Σはｎ＝１からＮまでの和である。以下同様にしてλｎｒｍ１、λｎｒｍ２、λｎｒｍ３も求められる。
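The re-estimation of the three interpolation weights can be sketched as a fixed-point iteration. The sketch below is an assumption-laden illustration rather than the patent's procedure: the function name, the uniform starting weights, and the fixed iteration count are all introduced here, and the input is assumed to be a precomputed list of (trigram, bigram, unigram) probabilities, one triple per training position n.

```python
from typing import List, Sequence, Tuple


def reestimate_lambdas(prob_triples: Sequence[Tuple[float, float, float]],
                       iterations: int = 20) -> Tuple[float, float, float]:
    """Iterate the three re-estimation formulas for lambda_1, lambda_2, lambda_3.

    prob_triples[n] = (P(Cn | Cn-2 Cn-1), P(Cn | Cn-1), P(Cn)) for one state.
    Assumption: at least one of the three probabilities is positive at every
    position, so the denominator never vanishes.
    """
    l1 = l2 = l3 = 1.0 / 3.0  # hypothetical starting point
    count = len(prob_triples)
    for _ in range(iterations):
        s1 = s2 = s3 = 0.0
        for p_tri, p_bi, p_uni in prob_triples:
            denom = l1 * p_tri + l2 * p_bi + l3 * p_uni
            s1 += l1 * p_tri / denom
            s2 += l2 * p_bi / denom
            s3 += l3 * p_uni / denom
        l1, l2, l3 = s1 / count, s2 / count, s3 / count
    return l1, l2, l3


training: List[Tuple[float, float, float]] = [
    (0.5, 0.3, 0.1),
    (0.0, 0.2, 0.1),   # trigram unseen in training: weight shifts to lower orders
    (0.4, 0.4, 0.2),
]
weights = reestimate_lambdas(training)
print(weights, sum(weights))
```

Each position contributes exactly 1 to the sum of the three numerators, so the weights continue to sum to 1 after every pass. The same routine would be run separately on the emphasized-state and calm-state probabilities.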
【００２８】
この例では、ラベル区間αがフレーム数Ｎαで得たコードがＣｉ１、Ｃｉ２、…、ＣｉＮαのとき、このラベル区間αが強調状態となる確率Ｐα（ｅ）、平静状態となる確率Ｐα（ｎ）は、
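$$P_\alpha(e) = P_{emp}(C_{i_3} \mid C_{i_1}C_{i_2}) \cdots P_{emp}(C_{i_{N_\alpha}} \mid C_{i_{N_\alpha-2}}C_{i_{N_\alpha-1}}) \qquad \text{式（４）}$$
$$P_\alpha(n) = P_{nrm}(C_{i_3} \mid C_{i_1}C_{i_2}) \cdots P_{nrm}(C_{i_{N_\alpha}} \mid C_{i_{N_\alpha-2}}C_{i_{N_\alpha-1}}) \qquad \text{式（５）}$$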
となる。この計算ができるように前記のｔｒｉｇｒａｍ、ｕｎｉｇｒａｍ、ｂｉｇｒａｍを任意のコードについて求めてコードブックに格納しておく。つまりコードブックには各コードの音声特徴量とその強調状態での出現確率とこの例では平静状態での出現確率との組が格納され、その強調状態での出現確率は、その音声特徴量が過去のフレームでの音声特徴量と無関係に強調状態で出現する確率（ｕｎｉｇｒａｍ：単独出現確率と記す）のみ、又はこれと、過去のフレームでの音声特徴量から現在のフレームの音声特徴量に至るフレーム単位の音声特徴量列毎に、その音声特徴量が強調状態で出現する条件付確率との組合せの何れかであり、平静状態での出現確率も同様に、その音声特徴量が過去のフレームでの音声特徴量と無関係に平静状態で出現する確率（ｕｎｉｇｒａｍ：単独出現確率と記す）のみ、又はこれと、過去のフレームでの音声特徴量から現在のフレームの音声特徴量に至るフレーム単位の音声特徴量列毎にその音声特徴量が平静状態で出現する条件付確率と組合せの何れかである。
【００２９】
例えば図１０に示すようにコードブックには各コードＣ１、Ｃ２、…毎にその音声特徴量と、その単独出現確率が強調状態、平静状態について、また条件付確率が強調状態、平静状態についてそれぞれ組として格納されている。
図８中のステップＳ３０２では、入力音声小段落の全フレームのコードについてのそのコードブックに格納されている前記確率から、発話状態の尤度を、平静状態と強調状態について求める。図１１に実施例の模式図を示す。時刻ｔから始まる音声小段落のうち、第４フレームまでを①〜④で示している。前記のように、ここでは、フレーム長は１００ｍｓ、フレームシフトを５０ｍｓとフレーム長の方を長くした。①フレーム番号ｆ、時刻ｔ〜ｔ＋１００でコードＣｉが、②フレーム番号ｆ＋１、時刻ｔ＋５０〜ｔ＋１５０でコードＣｊが、③フレーム番号ｆ＋２、時刻ｔ＋１００〜ｔ＋２００でコードＣｋが、④フレーム番号ｆ＋３、時刻ｔ＋１５０〜ｔ＋２５０でコードＣｌが得られ、つまりフレーム順にコードがＣｉ、Ｃｊ、Ｃｋ、Ｃｌであるとき、フレーム番号ｆ＋２以上のフレームでｔｒｉｇｒａｍが計算できる。音声小段落ｓが強調状態となる確率をＰｓ（ｅ）、平静状態となる確率をＰｓ（ｎ）とすると第４フレームまでの確率はそれぞれ、
$$P_s(e) = P_{emp}(C_k \mid C_iC_j)\,P_{emp}(C_l \mid C_jC_k) \qquad \text{式（６）}$$
$$P_s(n) = P_{nrm}(C_k \mid C_iC_j)\,P_{nrm}(C_l \mid C_jC_k) \qquad \text{式（７）}$$
となる。ただし、この例では、コードブックからＣｋ、Ｃｌの強調状態及び平静状態の各単独出現確率を求め、またＣｊの次にＣｋが強調状態及び平静状態で各出現する条件付確率、更にＣｋがＣｉ、Ｃｊの次に、ＣｌがＣｊ、Ｃｋの次にそれぞれ強調状態及び平静状態でそれぞれ出現する条件付確率をコードブックから求めると、以下のようになる。
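$$P_{emp}(C_k \mid C_iC_j) = \lambda_{emp1}P_{emp}(C_k \mid C_iC_j) + \lambda_{emp2}P_{emp}(C_k \mid C_j) + \lambda_{emp3}P_{emp}(C_k) \qquad \text{式（８）}$$
$$P_{emp}(C_l \mid C_jC_k) = \lambda_{emp1}P_{emp}(C_l \mid C_jC_k) + \lambda_{emp2}P_{emp}(C_l \mid C_k) + \lambda_{emp3}P_{emp}(C_l) \qquad \text{式（９）}$$
$$P_{nrm}(C_k \mid C_iC_j) = \lambda_{nrm1}P_{nrm}(C_k \mid C_iC_j) + \lambda_{nrm2}P_{nrm}(C_k \mid C_j) + \lambda_{nrm3}P_{nrm}(C_k) \qquad \text{式（１０）}$$
$$P_{nrm}(C_l \mid C_jC_k) = \lambda_{nrm1}P_{nrm}(C_l \mid C_jC_k) + \lambda_{nrm2}P_{nrm}(C_l \mid C_k) + \lambda_{nrm3}P_{nrm}(C_l) \qquad \text{式（１１）}$$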
上記（８）〜（１１）式を用いて（６）式と（７）式で示される第４フレームまでの強調状態となる確率Ｐｓ（ｅ）と、平静状態となる確率Ｐｓ（ｎ）が求まる。ここで、Ｐｅｍｐ（Ｃｋ｜ＣｉＣｊ）、Ｐｎｒｍ（Ｃｋ｜ＣｉＣｊ）はフレーム番号ｆ＋２において計算できる。
【００３０】
この例では、音声小段落ｓがフレーム数Ｎｓで得たコードがＣｉ１、Ｃｉ２、…、ＣｉＮｓのとき、この音声小段落ｓが強調状態になる確率Ｐｓ（ｅ）と平静状態になる確率Ｐｓ（ｎ）を次式により計算する。
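$$P_s(e) = P_{emp}(C_{i_3} \mid C_{i_1}C_{i_2}) \cdots P_{emp}(C_{i_{N_s}} \mid C_{i_{N_s-2}}C_{i_{N_s-1}})$$
$$P_s(n) = P_{nrm}(C_{i_3} \mid C_{i_1}C_{i_2}) \cdots P_{nrm}(C_{i_{N_s}} \mid C_{i_{N_s-2}}C_{i_{N_s-1}})$$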
この例ではこれらの確率が、Ｐｓ（ｅ）＞Ｐｓ（ｎ）であれば、その音声小段落Ｓは強調状態、Ｐｓ（ｎ）＞Ｐｓ（ｅ）であれば平静状態とする。
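The sub-paragraph decision described above (interpolated trigram probability per frame, product over frames, then comparison of the two states) can be sketched as follows. The dictionary-based codebook layout, the function names, and the use of log-probabilities are assumptions made for this sketch; the text itself multiplies the probabilities directly, and logs are used here only to avoid numerical underflow on long sub-paragraphs.

```python
import math
from typing import Dict, Sequence, Tuple

Lambdas = Tuple[float, float, float]
# Hypothetical per-state model: (trigram table, bigram table, unigram table, lambdas)
StateModel = Tuple[Dict[tuple, float], Dict[tuple, float], Dict[str, float], Lambdas]


def interpolated(p_tri: float, p_bi: float, p_uni: float, lambdas: Lambdas) -> float:
    """Linear interpolation in the form of equations (8) to (11)."""
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni


def log_state_probability(codes: Sequence[str], model: StateModel) -> float:
    """Log of the product of interpolated trigram probabilities from frame 3 on."""
    tri, bi, uni, lambdas = model
    total = 0.0
    for a, b, c in zip(codes, codes[1:], codes[2:]):
        p = interpolated(tri.get((a, b, c), 0.0), bi.get((b, c), 0.0),
                         uni.get(c, 0.0), lambdas)
        total += math.log(p) if p > 0.0 else float("-inf")
    return total


def is_emphasized(codes: Sequence[str], emp: StateModel, nrm: StateModel) -> bool:
    """Emphasized if Ps(e) > Ps(n); otherwise treated as calm."""
    return log_state_probability(codes, emp) > log_state_probability(codes, nrm)


codes = ["Ci", "Cj", "Ck", "Cl"]
emp_model: StateModel = (
    {("Ci", "Cj", "Ck"): 0.5, ("Cj", "Ck", "Cl"): 0.6},
    {("Cj", "Ck"): 0.4, ("Ck", "Cl"): 0.3},
    {"Ck": 0.2, "Cl": 0.1},
    (0.5, 0.3, 0.2),
)
nrm_model: StateModel = ({}, {}, {"Ck": 0.1, "Cl": 0.1}, (0.5, 0.3, 0.2))
print(is_emphasized(codes, emp_model, nrm_model))
```

The probability values in `emp_model` and `nrm_model` are made up for illustration. Missing trigram or bigram entries fall back to 0 and the lower-order terms carry the estimate, which is the purpose of the interpolation.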
【００３１】
図１２は以上説明した音声小段落抽出方法、音声段落抽出方法、各音声小段落毎に強調状態となる確率及び平静状態となる確率を求める方法を用いた音声強調状態判定装置及び音声要約装置の実施形態を示す。
入力部１１に音声強調状態が判定されるべき、又は音声の要約区間を決定されるべき入力音声（入力音声信号）が入力される。入力部１１には必要に応じて入力音声信号をデジタル信号に変換する機能も含まれる。デジタル化された音声信号は必要に応じて記憶部１２に格納される。音声特徴量抽出部１３で前述した音声特徴量がフレーム毎に抽出される。抽出した音声特徴量は必要に応じて、音声特徴量の平均値で規格化され、量子化部１４で各フレームの音声特徴量がコードブック１５を参照して量子化され、量子化された音声特徴量は強調確率計算部１６と平静確率計算部１７に送り込まれる。コードブック１５は例えば図１０に示したようなものである。
【００３２】
強調確率計算部１６によりその量子化された音声特徴量の強調状態での出現確率が、コードブック１５に格納されている対応する確率を用いて、例えば式（８）又は（９）により計算される。同様に平静確率計算部１７により、前記量子化された音声特徴量の平静状態での出現確率がコードブック１５に格納されている対応する確率を用いて、例えば式（１０）又は（１１）により計算される。強調確率計算部１６及び平静確率計算部１７で各フレーム毎に算出された強調状態での出現率と平静状態での出現確率及び各フレームの音声特徴量は各フレームに付与したフレーム番号と共に記憶部12に格納する。
【００３３】
これら各部の制御は制御部１９の制御のもとに順次行われる。
音声要約装置の実施形態は、図１２中に実線ブロックに対し、破線ブロックが付加される。つまり記憶部１２に格納されている各フレームの音声特徴量が無音区間判定部２１と有音区間判定部２２に送り込まれ、無音区間判定部２１により各フレーム毎に無音区間か否かが判定され、また有音区間判定部２２により各フレーム毎に有声区間か否かが判定される。これらの無音区間判定結果と有音区間判定結果が音声小段落判定部２３に入力される。音声小段落判定部２３はこれら無音区間判定、有声区間判定に基づき、先の方法の実施形態で説明したように所定フレーム数を連続する無音区間に囲まれた有声区間を含む部分が音声小段落と判定する。音声小段落判定部２３の判定結果は記憶部１２に書き込まれ、記憶部１２に格納されている音声データ列に付記され、無音区間で囲まれたフレーム群に音声小段落番号列を付与する。これと共に音声小段落判定部２３の判定結果は末尾音声小段落判定部２４に入力される。
【００３４】
末尾音声小段落判定部２４では、例えば図７を参照して説明した手法により末尾音声小段落が検出され、末尾音声小段落判定結果が音声段落判定部２５に入力され、音声段落判定部２５により２つの末尾音声小段落間の複数の音声小段落を含む部分を音声段落と判定する。この音声段落判定結果も記憶部１２に書き込まれ、記憶部１２に記憶している音声小段落番号列に音声段落列番号を付与する。音声要約装置として動作する場合、強調確率計算部１６及び平静確率計算部１７では記憶部１２から各音声小段落を構成する各フレームの強調確率と平静確率を読み出し、各音声小段落毎の確率が例えば式（８）及び式（１０）により計算される。強調状態判定部１８ではこの音声小段落毎の確率計算値を比較して、その音声小段落が強調状態か否かを判定し、要約区間取出し部２６では音声段落中の１つの音声小段落でも強調状態と判定されたものがあればその音声小段落を含む音声段落を取り出す。各部の制御は制御部１９により行われる。
【００３５】
以上により音声で構成される音声波形を音声小段落及び音声段落に分離する方法及び各音声小段落毎に強調状態となる確率及び平静状態となる確率を算出できることが理解できよう。
次に上述した各方法を利用して要約率を自由に設定し、変更することができる音声処理方法、音声処理装置に関わる実施の形態を説明する。
図１３にその音声処理方法の実施の形態の基本手順を示す。この実施例ではステップＳ１１で音声強調確率算出処理を実行し、音声小段落の強調確率及び平静確率を求める。
【００３６】
ステップＳ１２では要約条件入力処理を実行する。この要約条件入力ステップＳ１２では例えば利用者に要約時間又は要約率或は圧縮率の入力を促す情報を提供し、要約時間又は要約率或は圧縮率を入力させる。尚、予め設定された複数の要約時間又は要約率、圧縮率の中から一つを選択する入力方法を採ることもできる。
ステップＳ１３では抽出条件の変更を繰り返す動作を実行し、ステップＳ１２の要約条件入力ステップＳ１２で入力された要約時間又は要約率、圧縮率を満たす抽出条件を決定する。
【００３７】
ステップＳ１４で要約抽出ステップを実行する。この要約抽出ステップＳ１４では抽出条件変更ステップＳ１３で決定した抽出条件を用いて採用すべき音声段落を決定し、この採用すべき音声段落の総延長時間を計算する。
ステップＳ１５では要約再生処理を実行し、要約抽出ステップＳ１４で抽出した音声段落列を再生する。
図１４は図１３に示した音声強調確率算出ステップの詳細を示す。
ステップＳ１０１で要約対象とする音声波形列を音声小段落に分離する。
ステップＳ１０２ではステップＳ１０１で分離した音声小段落列から音声段落を抽出する。音声段落とは図７で説明したように、１つ以上の音声小段落で構成され、意味を理解できる単位である。
【００３８】
ステップＳ１０３及びステップＳ１０４でステップＳ１０１で抽出した音声小段落毎に図１０で説明したコードブックと前記した式（８）、（１０）等を利用して各音声小段落が強調状態となる確率（以下強調確率と称す）Ｐｓ（ｅ）と、平静状態となる確率（以下平静確率と称す）Ｐｓ（ｎ）とを求める。
ステップＳ１０５ではステップＳ１０３及びＳ１０４において各音声小段落毎に求めた強調確率Ｐｓ（ｅ）と平静確率Ｐｓ（ｎ）などを各音声小段落毎に仕分けして記憶手段に音声強調確率テーブルとして格納する。
図１５に記憶手段に格納した音声強調確率テーブルの一例を示す。図１５に示すＦ１、Ｆ２、Ｆ３…は音声小段落毎に求めた音声小段落強調確率Ｐｓ（ｅ）と、音声小段落平静確率Ｐｓ（ｎ）を記録した小段落確率記憶部を示す。これらの小段落確率記憶部Ｆ１、Ｆ２、Ｆ３…には各音声小段落Ｓに付された音声小段落番号ｉと、開始時刻（要約対象となる音声データ列の先頭から計時した時刻）終了時刻、音声小段落強調確率、音声小段落平静確率、各音声小段落を構成するフレーム数ｆｎ等が格納される。
【００３９】
要約条件入力ステップＳ１２で入力する条件としては要約すべきコンテンツの全長を１／Ｘ（Ｘは正の整数）の時間に要約することを示す要約率ｒ（請求の範囲記載の要約率の逆数ｒ＝１／Ｘを指す）、あるいは要約時間ｔを入力する。
この要約条件の設定に対し、抽出条件変更ステップＳ１３では初期値として重み係数ＷをＷ＝１に設定し、この重み係数を要約抽出ステップＳ１４に入力する。
要約抽出ステップＳ１４は重み係数Ｗ＝１として音声強調確率テーブルから各音声小段落毎に格納されている強調確率Ｐｓ（ｅ）と平静確率Ｐｓ（ｎ）とを比較し、
Ｗ・Ｐｓ（ｅ）＞Ｐｓ（ｎ）
の関係にある音声小段落を抽出すると共に、更にこの抽出した音声小段落を一つでも含む音声段落を抽出し、抽出した音声段落列の総延長時間ＭＴ（分）を求める。
【００４０】
抽出した音声段落列の総延長時間ＭＴ（分）と要約条件で決めた所定の要約時間ＹＴ（分）とを比較する。ここでＭＴ≒ＹＴ（ＹＴに対するＭＴの誤差が例えば±数％程度の範囲）であればそのまま採用した音声段落列を要約音声として再生する。
要約条件で設定した要約時間ＹＴに対するコンテンツの要約した総延長時間ＭＴとの誤差値が規定より大きく、その関係がＭＴ＞ＹＴであれば抽出した音声段落列の総延長時間ＭＴ（分）が、要約条件で定めた要約時間ＹＴ（分）より長いと判定し、図１３に示した抽出条件変更ステップＳ１３を再実行させる。抽出条件変更ステップＳ１３では重み係数がＷ＝１で抽出した音声段落列の総延長時間ＭＴ（分）が要約条件で定めた要約時間ＹＴ（分）より「長い」とする判定結果を受けて強調確率Ｐｓ（ｅ）に現在値より小さい重み付け係数Ｗ（請求項１記載の所定の係数の場合は現在値よりも大きくする）を乗算Ｗ・Ｐｓ（ｅ）して重み付けを施す。重み係数Ｗとしては例えばＷ＝１−０．００１×Ｋ（Ｋはループ回数）で求める。
【００４１】
つまり、音声強調確率テーブルから読み出した音声段落列の全ての音声小段落で求められている強調確率Ｐｓ（ｅ）の配列に１回目のループではＷ＝１−０．００１×１で決まる重み係数Ｗ＝０．９９９を乗算し、重み付けを施す。この重み付けされた全ての各音声小段落の強調確率Ｗ・Ｐｓ（ｅ）と各音声小段落の平静確率Ｐｓ（ｎ）とを比較し、Ｗ・Ｐｓ（ｅ）＞Ｐｓ（ｎ）の関係にある音声小段落を抽出する。
この抽出結果に従って要約抽出ステップＳ１４では抽出された音声小段落を含む音声段落を抽出し、要約音声段落列を再び求める。これと共に、この要約音声段落列の総延長時間ＭＴ（分）を算出し、この総延長時間ＭＴ（分）と要約条件で定められる要約時間ＹＴ（分）とを比較する。比較の結果がＭＴ≒ＹＴであれば、その音声段落列を要約音声と決定し、再生する。
【００４２】
１回目の重み付け処理の結果が依然としてＭＴ＞ＹＴであれば抽出条件変更ステップを、２回目のループとして実行させる。このとき重み係数ＷはＷ＝１−０．００１×２で求める。全ての強調確率Ｐｓ（ｅ）にＷ＝０．９９８の重み付けを施す。
このように、ループの実行を繰り返す毎にこの例では重み係数Ｗの値を徐々に小さくするように抽出条件を変更していくことによりＷＰｓ（ｅ）＞Ｐｓ（ｎ）の条件を満たす音声小段落の数を漸次減らすことができる。これにより要約条件を満たすＭＴ≒ＹＴの状態を検出することができる。
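The loop over the weight W can be sketched as below. This is a simplified sketch under stated assumptions: the function name, the argument layout (one probability pair per sub-paragraph, a sub-paragraph-to-paragraph index, and a duration per paragraph), the 5 % tolerance, and the loop cap are all introduced here. It also covers only the MT > YT case; when the total falls below the target it simply stops, whereas the text handles that case by weighting the calm probability instead.

```python
from typing import List, Sequence, Tuple


def select_by_weight(probs: Sequence[Tuple[float, float]],
                     paragraph_of: Sequence[int],
                     paragraph_len: Sequence[float],
                     target_time: float,
                     tolerance: float = 0.05,
                     step: float = 0.001,
                     max_loops: int = 999) -> Tuple[float, List[int], float]:
    """Lower W = 1 - step*K until the extracted paragraphs total about target_time.

    probs[i] = (Ps(e), Ps(n)) of sub-paragraph i; paragraph_of[i] is the index of
    the paragraph containing it; paragraph_len[k] is the duration of paragraph k.
    Returns (final W, chosen paragraph indices, total duration MT).
    """
    weight, chosen, total = 1.0, [], 0.0
    for k in range(max_loops + 1):
        weight = 1.0 - step * k
        picked = {paragraph_of[i]
                  for i, (p_emp, p_nrm) in enumerate(probs)
                  if weight * p_emp > p_nrm}
        chosen = sorted(picked)
        total = sum(paragraph_len[p] for p in chosen)
        if abs(total - target_time) <= tolerance * target_time:
            break  # MT is approximately YT
        if total < target_time:
            break  # undershoot: not handled in this sketch
    return weight, chosen, total


probs = [(0.9, 0.1), (0.5, 0.4999), (0.2, 0.8)]
result = select_by_weight(probs, paragraph_of=[0, 1, 2],
                          paragraph_len=[10.0, 10.0, 10.0], target_time=10.0)
print(result)
```

In the example, W = 1 extracts two paragraphs (20 time units), and the first reduction to W = 0.999 drops the borderline sub-paragraph so that the total matches the target.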
【００４３】
尚、上述では要約時間ＭＴの収束条件としてＭＴ≒ＹＴとしたが、厳密にＭＴ＝ＹＴに収束させることもできる。この場合には要約条件に例えば５秒不足している場合、あと１つの音声段落を加えると１０秒超過してしまうが、音声段落から５秒のみ再生することで利用者の要約条件に一致させることができる。また、この５秒は強調と判定された音声小段落の付近の５秒でもよいし、音声段落の先頭から５秒でもよい。
また、上述した初期状態でＭＴ＜ＹＴと判定された場合は重み係数Ｗを現在値よりも小さく例えばＷ＝１−０．００１×Ｋとして求め、この重み係数Ｗを平静確率Ｐｓ（ｎ）の配列に乗算し、平静確率Ｐｓ（ｎ）に重み付けを施せばよい。また、他の方法としては初期状態でＭＴ＞ＹＴと判定された場合に重み係数を現在値より大きくＷ＝１＋０．００１×Ｋとし、この重み係数Ｗを平静確率Ｐｓ（ｎ）の配列に乗算してもよい。
【００４４】
また、要約再生ステップＳ１５では要約抽出ステップＳ１４で抽出した音声段落列を再生するものとして説明したが、音声付の画像情報の場合、要約音声として抽出した音声段落に対応した画像情報を切り出してつなぎ合わせ、音声と共に再生することによりテレビ放送の要約、あるいは映画の要約等を行うことができる。
また、上述では音声強調確率テーブルに格納した各音声小段落毎に求めた強調確率又は平静確率のいずれか一方に直接重み係数Ｗを乗算して重み付けを施すことを説明したが、強調状態を精度良く検出するためには重み係数Ｗを各音声小段落を構成するフレームの数Ｆ乗して$W^F$として重み付けを行うことが望ましい。
【００４５】
つまり、式（８）及び式（１０）で算出する条件付の強調確率Ｐｓ（ｅ）は各フレーム毎に求めた強調状態となる確率の積を求めている。また平静状態となる確率Ｐｓ（ｎ）も各フレーム毎に算出した平静状態となる確率の積を求めている。従って、例えば強調確率Ｐｓ（ｅ）に重み付けを施すには各フレーム毎に求めた強調状態となる確率毎に重み付け係数Ｗを乗算すれば正しい重み付けを施したことになる。この場合には音声小段落を構成するフレーム数をＦとすれば重み係数は$W^F$となる。
この結果、フレームの数Ｆに応じて重み付けの影響が増減され、フレーム数の多い音声小段落ほど、つまり延長時間が長い音声小段落程大きい重みが付されることになる。
【００４６】
但し、単に強調状態を判定するための抽出条件を変更すればよいのであれば各フレーム毎に求めた強調状態となる確率の積又は平静状態となる確率の積に重み係数Ｗを乗算するだけでも抽出条件の変更を行うことができる。従って、必ずしも重み付け係数を$W^F$とする必要はない。
また、上述では抽出条件の変更手段として音声小段落毎に求めた強調確率Ｐｓ（ｅ）又は平静確率Ｐｓ（ｎ）に重み付けを施してＰｓ（ｅ）＞Ｐｓ（ｎ）を満たす音声小段落の数を変化させる方法を採ったが、他の方法として全ての音声小段落の強調確率Ｐｓ（ｅ）と平静確率Ｐｓ（ｎ）に関してその確率比Ｐｓ（ｅ）／Ｐｓ（ｎ）を演算し、この確率比の降順に対応する音声信号区間（音声小段落）を累積して要約区間の和を算出し、要約区間の時間の総和が、略所定の要約時間に合致する場合、そのときの音声信号区間を要約区間と決定して要約音声を編成する方法も考えられる。
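The alternative described above, ranking sections by the probability ratio Ps(e)/Ps(n) and accumulating their durations in descending order, can be sketched as follows. The tuple layout, the function name, and the rule of stopping at the first section that would exceed the target are assumptions for this sketch; the calm probability is assumed to be strictly positive.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float, float, float]  # (start, end, Ps(e), Ps(n))


def select_by_ratio(segments: Sequence[Segment],
                    target_time: float) -> Tuple[List[Segment], float]:
    """Accumulate sections in descending order of Ps(e)/Ps(n) up to target_time.

    Returns the chosen sections in playback order and their total duration.
    """
    ranked = sorted(segments, key=lambda s: s[2] / s[3], reverse=True)
    chosen: List[Segment] = []
    total = 0.0
    for seg in ranked:
        duration = seg[1] - seg[0]
        if total + duration > target_time:
            break  # the next-ranked section would overshoot the summary time
        chosen.append(seg)
        total += duration
    return sorted(chosen), total


segments = [
    (0, 10, 0.2, 0.8),   # ratio 0.25
    (10, 15, 0.9, 0.1),  # ratio 9
    (15, 25, 0.6, 0.3),  # ratio 2
    (25, 30, 0.5, 0.5),  # ratio 1
]
summary, summary_time = select_by_ratio(segments, target_time=16)
print(summary, summary_time)
```

Because the ranking is computed once, changing the requested summary time only moves the cut-off point in the sorted list; no repeated re-weighting is needed.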
【００４７】
この場合、編成した要約音声の総延長時間が要約条件で設定した要約時間に対して過不足が生じた場合には、強調状態にあると判定するための確率比Ｐｓ（ｅ）／Ｐｓ（ｎ）の値を選択する閾値を変更すれば抽出条件を変更することができる。この抽出条件変更方法を採る場合には要約条件を満たす要約音声を編成するまでの処理を簡素化することができる利点が得られる。
上述では各音声小段落毎に求める強調確率Ｐｓ（ｅ）と平静確率Ｐｓ（ｎ）を各フレーム毎に算出した強調状態となる確率の積及び平静状態となる確率の積で算出するものとして説明したが、他の方法として各フレーム毎に求めた強調状態となる確率の平均値を求め、この平均値をその音声小段落の強調確率Ｐｓ（ｅ）及び平静確率Ｐｓ（ｎ）として用いることもできる。
【００４８】
従って、この強調確率Ｐｓ（ｅ）及び平静確率Ｐｓ（ｎ）の算出方法を採る場合には重み付けに用いる重み付け係数Ｗはそのまま強調確率Ｐｓ（ｅ）又は平静確率Ｐｓ（ｎ）に乗算すればよい。
図１６を用いて要約率を自由に設定することができる音声処理装置の実施例を示す。この実施例では図１２に示した音声強調状態要約装置の構成に要約条件入力部３１と、音声強調確率テーブル３２と、強調小段落抽出部３３と、抽出条件変更部３４と、要約区間仮判定部３５と、この要約区間仮判定部３５の内部に要約音声の総延長時間を求める総延長時間算出部３５Ａと、この総延長時間算出部３５Ａが算出した要約音声の総延長時間が要約条件入力部３１で入力した要約時間の設定の範囲に入っているか否かを判定する要約区間決定部３５Ｂと、要約条件に合致した要約音声を保存し、再生する要約音声保存・再生部３５Ｃを設けた構成とした点を特徴とするものである。
【００４９】
入力音声は図１１で説明したように、フレーム毎に音声特徴量が求められ、この音声特徴量に従って強調確率計算部１６と平静確率計算部１７でフレーム毎に強調確率と、平静確率とを算出し、これら強調確率と平静確率を各フレームに付与したフレーム番号と共に記憶部１２に格納する。更に、このフレーム列番号に音声小段落判定部で判定した音声小段落列に付与した音声小段落列番号が付記され、各フレーム及び音声小段落にアドレスが付与される。
この実施例に示した音声処理装置では強調確率算出部１６と平静確率算出部１７は記憶部１２に格納している各フレームの強調確率と平静確率を読み出し、この強調確率及び平静確率から各音声小段落毎に強調確率Ｐｓ（ｅ）と平静確率Ｐｓ（ｎ）とを求め、これら強調確率Ｐｓ（ｅ）と平静確率Ｐｓ（ｎ）を音声強調テーブル３２に格納する。
【００５０】
音声強調テーブル３２には各種のコンテンツの音声波形の音声小段落毎に求めた強調確率と平静確率とが格納され、いつでも利用者の要求に応じて要約が実行できる体制が整えられている。利用者は要約条件入力部３１に要約条件を入力する。ここで言う要約条件とは要約したいコンテンツの名称と、そのコンテンツの全長時間に対する要約率を指す。要約率としてはコンテンツの全長を１／１０に要約するか、或は時間で１０分に要約するなどの入力方法が考えられる。ここで例えば１／１０と入力した場合は要約時間算出部３１Ａはコンテンツの全長時間を１／１０した時間を算出し、その算出した要約時間を要約区間仮判定部３５の要約区間決定部３５Ｂに送り込む。
【００５１】
要約条件入力部３１に要約条件が入力されたことを受けて制御部１９は要約音声の生成動作を開始する。その開始の作業としては音声強調テーブル３２から利用者が希望したコンテンツの強調確率と平静確率を読み出す。読み出された強調確率と平静確率を強調小段落抽出部３３に送り込み、強調状態にあると判定される音声小段落番号を抽出する。
強調状態にある音声区間を抽出するための条件を変更する方法としては上述した強調確率Ｐｓ（ｅ）又は平静確率Ｐｓ（ｎ）に確率比の逆数となる重み付け係数Ｗを乗算しＷ・Ｐｓ（ｅ）＞Ｐｓ（ｎ）の関係にある音声小段落を抽出し、音声小段落を含む音声段落により要約音声を得る方法と、確率比Ｐｓ（ｅ）／Ｐｓ（ｎ）を算出し、この確率比を降順に累算して要約時間を得る方法とを用いることができる。
【００５２】
抽出条件の初期値としては重み付けにより抽出条件を変更する場合には重み付け係数ＷをＷ＝１として初期値とすることが考えられる。また、各音声小段落毎に求めた強調確率Ｐｓ（ｅ）と平静確率Ｐｓ（ｎ）の確率比Ｐｓ（ｅ）／Ｐｓ（ｎ）の値に応じて強調状態と判定する場合は初期値としてその比の値が例えばＰｓ（ｅ）／Ｐｓ（ｎ）≧１である場合を強調状態と判定することが考えられる。この初期設定状態で強調状態と判定された音声小段落番号と開始時刻、終了時刻を表わすデータを強調小段落抽出部３３から要約区間仮判定部３５に送り込む。要約区間仮判定部３５では強調状態と判定した小段落番号を含む音声段落を記憶部１２に格納している音声段落列から検索し、抽出する。抽出した音声段落列の総延長時間を総延長時間算出部３５Ａで算出し、その総延長時間と要約条件で入力された要約時間とを要約区間決定部３５Ｂで比較する。比較の結果が要約条件を満たしていれば、その音声段落列を要約音声保存・再生部３５Ｃで保存し、再生する。この再生動作は強調小段落抽出部３３で強調状態と判定された音声小段落の番号から音声段落を抽出し、その音声段落の開始時刻と終了時刻の指定により各コンテンツの音声データ或は映像データを読み出して要約音声及び要約映像データとして送出する。
【００５３】
要約区間決定部３５Ｂで要約条件を満たしていないと判定した場合は、要約区間決定部３５Ｂから抽出条件変更部３４に抽出条件の変更指令を出力し、抽出条件変更部３４に抽出条件の変更を行わせる。抽出条件変更部３４は抽出条件の変更を行い、その抽出条件を強調小段落抽出部３３に入力する。強調小段落抽出部３３は抽出条件変更部３４から入力された抽出条件に従って再び音声強調確率テーブル３２に格納されている各音声小段落の強調確率と平静確率との比較判定を行う。
強調小段落抽出部３３の抽出結果は再び要約区間仮判定部３５に送り込まれ、強調状態と判定された音声小段落を含む音声段落の抽出を行わせる。この抽出された音声段落の総延長時間を算出し、その算出結果が要約条件を満たすか否かを要約区間決定部３５Ｂで行う。この動作が要約条件を満たすまで繰り返され、要約条件が満たされた音声段落列が要約音声及び要約映像データとして記憶部１２から読み出されユーザ端末に配信される。
【００５４】
尚、上述では要約区間の開始時刻及び終了時刻を要約区間と判定した音声段落列の開始時刻及び終了時刻として取り出すことを説明したが、映像付のコンテンツの場合は要約区間と判定した音声段落列の開始時刻と終了時刻に接近した映像信号のカット点を検出し、このカット点（例えば特開平８−３２９２４号公報記載のように検出した画面の切替わりに発生する信号を利用する）の時刻で要約区間の開始時刻及び終了時刻を規定する方法も考えられる。このように映像信号のカット点を要約区間の開始時刻及び終了時刻に利用した場合は、要約区間の切替わりが画像の切替わりに同期するため、視覚上で視認性が高まり要約の理解度を向上できる利点が得られる。
【００５５】
上述したように、この発明で用いる音声要約方法は音声波形の強調状態となる確率が高い音声区間を要約区間として抽出するから、この音声区間と同一の開始時間と終了時間で抽出される映像情報もコンテンツの内容で重要な映像部分である場合が多い。この結果、この発明で用いる要約方法によればコンテンツの内容を的確に視聴者に伝えることができる要約情報を得ることができる利点が得られる。
図１にこの発明による映像再生方法の処理手順の一例を示す。
この発明では放送中の実時間映像信号と音声信号を再生時刻と対応付けて録画し、この録画状態を維持しながら、録画されている部分を要約再生しようとするものである。以下この状況を追いつき再生と称することにする。
【００５６】
ステップＳ１１１では追いつき再生対象開始時刻又は追いつき再生対象開始映像の特定を行う。この特定のためには例えば視聴者が番組視聴中に一時離席する場合に、押ボタン操作により離席時刻の指定を行う。又は室のドアにセンサが設けられ、ドアの開閉に伴って視聴者が室外に退出したことを検知して離席時刻を指定する。又は離席とは関係なく、既に録画されている番組の一部を早送り再生し、その映像から視聴者が任意に追いつき再生の開始位置を特定するなどの方法が考えられる。
ステップＳ１１２では要約条件（要約時間又は要約率）の入力を行う。この入力は離席した視聴者が席に戻った時点で行われる。つまり、離席中の時間が例えば３０分間であった場合に、視聴者はその３０分間に放送された内容をどの程度に圧縮して視聴するかを視聴者の考えに従って要約条件を入力する。その他の入力方法としては視聴者からの入力が無かった場合に予め定められたデフォルト値例えば３分間を使用するか、又は幾つかの候補を用意しておき、これらを表示し、視聴者が指示選択して入力する方法が考えられる。
【００５７】
また、例えば予約録画により自動的に録画が開始されている状態で視聴者が帰宅した場合、予約録画により録画開始時刻が既知であるため、要約再生開始の指示により要約終了時刻が決定される。ここで例えば要約条件が予めデフォルト値等により決定されていれば、その要約条件に従って録画開始時刻から要約終了時刻までが要約される。
ステップＳ１１３で追いつき再生開始を指定する。この追いつき再生開始の指定により、要約対象区間の終了点（要約終了時刻）が指定される。入力方法としては視聴者が追いつき再生開始を指示する押ボタン操作によりその操作時刻を指定するか、又は室のドアに設けた開閉センサにより視聴者の入室を検出し、入室時刻を持って追いつき再生開始の指定を行ってもよい。
【００５８】
ステップＳ１１４で現在放送中の映像再生を停止する。
ステップＳ１１５で要約処理、要約映像・音声の再生を行う。要約処理はステップＳ１１１で定めた追いつき再生対象開始時刻又は追いつき再生対象開始映像からステップＳ１１３で入力した要約終了時刻までの音声信号について指定された要約条件に従って、要約区間を特定し、その要約区間の音声信号と、要約区間と同期する映像信号を再生する。
ステップＳ１１６で要約区間の再生が終了する。
ステップＳ１１７で放送中の映像再生を再開する。
【００５９】
図２に上述した追いつき再生を実行するこの発明の映像再生装置１００の一例を示す。この発明による映像再生装置１００は、記録部１０１と、音声分離部１０２と、音声要約部１０３と、要約区間読出部１０４と、モード切替部１０５とによって構成することができる。
記録部１０１は、例えばハードディスク或は半導体メモリ、ＤＶＤ−ＲＯＭ等のように高速で書込および読出を実行することができる記録再生手段を用いる。高速で書込及び読出を実行できることにより、現在放送中の番組を録画しながら、既に記録されている部分を再生することができる。入力信号Ｓ１はテレビチューナー等から入力され、アナログ信号でもデジタル信号でもどちらでもよい。但し、記録部１０１の記録はデジタル信号で行われる。
【００６０】
音声分離部１０２は指定された要約対象区間の映像信号から音声を分離し、この音声信号を音声要約部１０３に入力する。音声要約部１０３はこの音声信号を用いて強調部分を抽出し、要約区間を特定する。
尚、音声要約部１０３は録画中は常に音声信号を分析し、録画している番組毎に図１５に示した音声強調テーブルを作成し記憶部に格納する。従って、番組の放映中に途中から録画部分の要約再生を行う場合は音声強調テーブルを用いて要約が行われる。また後日、録画された番組の要約を視聴する場合も音声強調確率テーブルを用いて要約が行われる。
【００６１】
要約区間読出部１０４は音声要約部１０３で特定された要約区間に従って記録部１０１から音声付映像信号を読み出し、モード切替部１０５に出力する。モード切替部１０５は要約区間読出部１０４が読み出した音声付映像信号を要約映像信号として出力し、視聴者の視聴に提供される。
モード切替部１０５は要約映像を出力するモードａの他に、記録部１０１から読み出した映像信号を出力する再生モードｂと、入力信号Ｓ１を直接視聴に提供するモードｃとに切替えられ、各種の形態で利用が可能とされる。
ところで、上述した追いつき再生方法には追いつき再生の実行期間中に放送された映像は要約対象区間に含まれないため、視聴者にはその映像を視聴することができない不都合が生じる。
【００６２】
このため、この発明ではこの欠点を解消するために、以下の実施例を提案する。つまり、要約区間の再生が終了後に、以前の再生開始時刻を新たな要約対象開始時刻とし、要約区間再生終了時刻を新たな要約終了時刻として要約処理及び要約映像・音声の再生処理を繰り返す。以前の再生開始時刻と要約区間再生終了時刻の間の時間が所定時間（例えば５〜１０秒）以下となる場合に繰り返しを終了する。
この場合は指定された要約率もしくは要約時間よりも長く要約区間が再生されるという問題が生じる。例えば要約対象区間の時間をＴ_Aとして要約率ｒ（０＜ｒ＜１、ｒ＝要約時間（要約区間の時間の総延長）／要約対象区間の時間）で要約すると、１回目の要約における要約時間Ｔ₁はＴ_Aｒとなる。２回目の要約における要約時間は１回目のものについて更に要約率で要約するのでＴ_Aｒ²となる。この処理が順次繰り返されるので繰り返し終了までに要する時間はＴ_Aｒ／（１−ｒ）となる。
【００６３】
ここで指定された要約率ｒをｒ／（１＋ｒ）と調整し、この調整された要約率をもって要約を行う。その場合、繰り返し終了までに要する時間はＴ_Aｒとなり指定された要約率に適した要約時間となる。同様に要約時間Ｔ₁が指定されたときでも要約対象区間の時間Ｔ_Aが与えられていれば、指定された要約率ｒはＴ₁／Ｔ_Aであるので、この要約率をＴ₁／（Ｔ_A＋Ｔ₁）としても１回目の要約時間をＴ_AＴ₁／（Ｔ_A＋Ｔ₁）と調整してもよい。
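The effect of adjusting the rate from r to r/(1+r) can be checked numerically. The sketch below simulates the repeated catch-up passes, where each pass summarizes what was broadcast during the previous pass, and compares the totals with the closed forms T_A·r/(1−r) and T_A·r. The function names and the 5-second stopping threshold are assumptions chosen for illustration (the text gives 5 to 10 seconds as an example).

```python
def catch_up_total(target_len: float, rate: float, stop_below: float = 5.0) -> float:
    """Total playback time when summarizing is repeated on each newly missed part.

    target_len: duration T_A of the first section to be summarized (seconds).
    rate: summary rate r, with 0 < r < 1.
    """
    total = 0.0
    remaining = target_len
    while remaining > stop_below:
        playback = remaining * rate  # duration of this summary pass
        total += playback
        remaining = playback         # broadcast that went by during this pass
    return total


def adjusted_rate(rate: float) -> float:
    """Adjusted rate r / (1 + r), so the geometric series sums to about T_A * r."""
    return rate / (1.0 + rate)


T_A, r = 1800.0, 0.1
plain = catch_up_total(T_A, r)                    # close to T_A * r / (1 - r) = 200
adjusted = catch_up_total(T_A, adjusted_rate(r))  # close to T_A * r = 180
print(plain, adjusted)
```

The small shortfall against the closed forms comes from ending the repetition once the remaining section is shorter than the threshold. Algebraically, with r' = r/(1+r), r'/(1−r') = r, which is why the adjusted total equals T_A·r.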
図３は上記不都合を解消するための他の実施例を示す。この実施例では入力信号Ｓ１をそのまま出力させ、表示器の親画面２００（図４参照）に現在放送中の映像を表示させると共に、モード切替部１０５に子画面化処理部１０６を設け、子画面化処理した要約映像・音声を入力信号Ｓ１に重畳させて出力し、この要約映像を子画面２０１（図４参照）に表示させる混成モードｄを設けた実施例を示す。
【００６４】
この実施例によれば、視聴者は親画面２００に表示される放映中の番組の内容を視聴しながら、過去に放映された番組の内容を要約して子画面２０１で視聴することができる。この結果、要約情報を視聴している間に放映されている番組の内容は親画面２００から受け取ることができるから、要約情報が全て再生された時点では番組の内容は前半部分から現在放映中の時点までの内容をほぼ切れ目なく視聴者に理解させることができる。
以上説明したこの発明による映像再生方法はコンピュータに映像再生プログラムを実行させて実現される。この場合、映像再生プログラムを通信回線を介してダウンロードしたり、ＣＤ−ＲＯＭや磁気ディスク等の記録媒体に格納させてコンピュータにインストールし、コンピュータ内のＣＰＵ等の処理装置で本発明の方法を実行させる。
【００６５】
【発明の効果】
以上説明したように、この発明によれば録画されている番組を任意の圧縮率に圧縮して要約し、この要約された要約情報を再生することができる。従って録画されている多数の番組の内容を短時間に見分けることができ、例えば眞に視聴したい番組を探す作業を短時間に済ませることができる利点が得られる。
更に、最初から視聴することができなかった番組が放映されている状況でも、その前半部分を要約して視聴することができるため、途中から視聴を始めた番組でも、番組の全体を把握して視聴を続けることができる。
【図面の簡単な説明】
【図１】この発明による映像再生方法を実行する場合の手順の一例を説明するためのフローチャート。
【図２】この発明による映像再生装置の一実施例を説明するためのブロック図。
【図３】この発明による映像再生装置の変形実施例を説明するためのブロック図。
【図４】この発明による映像再生装置の特徴を説明するための表示画面の一例を説明するための図。
【図５】先に提案した音声要約方法を説明するためのフローチャート。
【図６】先に提案した音声段落の抽出方法を説明するためのフローチャート。
【図７】音声段落と音声小段落の関係を説明するための図。
【図８】図５に示したステップＳ２における入力音声小段落の発話状態を判定する方法の例を示すフローチャート。
【図９】先に提案した音声要約方法に用いられるコードブックを作成する手順の例を示すフローチャート。
【図１０】この発明において用いられるコードブックの記憶例を示す図。
【図１１】発話状態尤度計算を説明するための波形図。
【図１２】先に提案した音声強調状態判定装置及び音声要約装置の一実施例を説明するためのブロック図。
【図１３】要約率を自由に変更することができる要約方法を説明するためのフローチャート。
【図１４】音声の要約に用いる音声小段落の抽出動作と各音声小段落の強調確率算出動作、音声小段落平静確率抽出動作を説明するためのフローチャート。
【図１５】音声要約装置に用いる音声強調確率テーブルの構成を説明するための図。
【図１６】要約率を自由に変更することができる音声要約装置の一例を説明するためのブロック図。
【符号の説明】
１００ 映像再生装置
１０１ 記録部
１０２ 音声分離部
１０３ 音声要約部
１０４ 要約区間読出部
１０５ モード切替部
１０６ 子画面化処理部
[0001]
[Technical field of the invention]
The present invention relates to a video playback method, a video playback apparatus, and a video playback program for summarizing and playing back various kinds of video with audio recorded on a recording medium.
[0002]
[Prior art]
Various summarization methods have been proposed in the past. One is an apparatus that extracts, from each block of the whole moving image, a segment consisting of several consecutive frames, and joins the extracted segments into a digest image. Examples appear in Japanese Unexamined Patent Publication Nos. 8-9310, 3-90968, and 6-165009.
Further, as a method for temporal compression of audio segments, there has been a method for precisely controlling the ratio of pause compression and creating a digest with high intelligibility. For example, it is shown in Japanese Unexamined Patent Publication No. 2001-154700.
[0003]
There has also been a system that uses telops (on-screen captions) and sound information to extract the characteristic scenes of a program and assemble them into a digest video, as shown, for example, in Japanese Unexamined Patent Publication No. 2001-23062.
[0004]
[Problems to be solved by the invention]
To summarize content to an arbitrary duration, or to generate a digest of it, the priority of each scene making up the content must be determined in advance. In Japanese Unexamined Patent Publication Nos. 8-9310, 3-90968, and 6-165009, the user marks the scenes he or she considers important with a pointing device such as a joystick or with a set of buttons, and digest priority information is attached accordingly. This places a heavy burden on the user.
[0005]
In Japanese Unexamined Patent Publication No. 2001-154700, a digest is generated by pause compression. However, because most content consists largely of sections that are not pauses, removing pauses alone cannot realistically achieve a high compression rate, such as reducing the summary playback time to one tenth of the original playback time or less.
In the digest video generation method of Japanese Unexamined Patent Publication No. 2000-23062, a summary section identified from the volume of the sound information alone is not necessarily an important section, because a speaker emphasizing a key point does not always speak loudly. Furthermore, when telop information is used, no digest can be generated for content that contains no telops or for sections in which no telop appears.
[0006]
In addition, suppose a viewer is receiving and playing back a real-time audio signal with video, such as a live broadcast, and misses part of the program, for example by leaving the room. If the missed first part could be summarized and viewed while recording continues, the viewer could be expected to watch the rest of the program with an understanding of what happened in the first half.
An object of the present invention is to propose a video playback method, a video playback apparatus, and a video playback program that can play back video stored on a recording medium compressed to an arbitrary duration.
[0007]
[Means for Solving the Problems]
In this invention, a real-time video signal and audio signal are stored in association with their playback time; a summary start time is input; and either a summary time, which is the total duration of the summary sections, or a summary rate, which is the ratio of the total duration of the summary sections to the whole section to be summarized, is input.
Within the section to be summarized, which runs from the summary start time to the current time, the sections of the speech signal determined to be in the emphasized state are determined as summary sections in accordance with the summary time or the summary rate.
A video playback method that plays back the audio signal and video signal of these summary sections is proposed.
[0008]
The present invention further proposes a video playback method in which the time at which playback of the summary sections ends is taken as the new end time of the section to be summarized, the previous end time is taken as the new start time, and the determination of summary sections and the playback of their audio and video signals are repeated.
The present invention further proposes a video playback method in which the summary rate r (r is a real number satisfying 0 < r < 1) is adjusted to r / (1 + r), and the summary sections are determined based on the adjusted summary rate.
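One way to read the adjustment r / (1 + r) is as follows. While a summary is being played, recording continues, and summarizing the newly recorded content in turn adds playback time. The sketch below is an illustration under assumptions that are not stated in the specification: recording proceeds in real time, and each round summarizes exactly the content recorded during the previous round's playback. The function names `adjusted_rate` and `total_catch_up_time` are hypothetical.

```python
def adjusted_rate(r: float) -> float:
    """Adjusted summary rate r / (1 + r) for 0 < r < 1."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must satisfy 0 < r < 1")
    return r / (1.0 + r)


def total_catch_up_time(target_len: float, rate: float, rounds: int = 200) -> float:
    """Total playback time when every round summarizes the content that was
    recorded while the previous round's summary was playing (assumption)."""
    total = 0.0
    pending = target_len            # content still waiting to be summarized
    for _ in range(rounds):
        played = pending * rate     # playback time of this round's summary
        total += played
        pending = played            # new content recorded in the meantime
    return total


r = 0.25
recorded = 3600.0  # seconds recorded so far
print(total_catch_up_time(recorded, r))                 # exceeds r * recorded
print(total_catch_up_time(recorded, adjusted_rate(r)))  # close to r * recorded
```

Under these assumptions the total playback time is a geometric series, so using r / (1 + r) in each round makes the overall catch-up time come out at roughly r times the initially recorded length.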
The present invention further proposes a video playback method that uses a codebook storing, in association with one another, feature quantities that include at least the fundamental frequency or pitch period, the power, the temporal variation characteristic of a dynamic feature quantity, or inter-frame differences of these, together with their appearance probability in the emphasized state and their appearance probability in the calm state, the method comprising:
obtaining the appearance probability in the emphasized state corresponding to the feature quantity obtained by analyzing the speech signal frame by frame;
obtaining the appearance probability in the calm state corresponding to the feature quantity obtained by analyzing the speech signal frame by frame;
calculating the probability of being in the emphasized state based on the appearance probability in the emphasized state;
calculating the probability of being in the calm state based on the appearance probability in the calm state;
calculating, for each speech signal section, the probability ratio of the probability of being in the emphasized state to the probability of being in the calm state;
accumulating the durations of the speech signal sections in descending order of probability ratio to calculate the summary time; and
determining as the summary sections the speech signal sections for which the summary rate, that is, the ratio of the summary time to the whole section to be summarized, equals the input summary rate.
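The ranking step described above can be sketched as follows. This is a minimal illustration, not the specification's implementation: the `Section` tuple layout, the function name `select_by_ranking`, and the stopping rule (stop as soon as the accumulated rate reaches the target) are assumptions.

```python
from typing import List, Tuple

# (start_s, end_s, probability_ratio), where probability_ratio is the
# probability of the emphasized state divided by that of the calm state.
Section = Tuple[float, float, float]


def select_by_ranking(sections: List[Section], total_len: float,
                      target_rate: float) -> List[Section]:
    """Accumulate section durations in descending order of probability
    ratio until the summary rate reaches target_rate."""
    ranked = sorted(sections, key=lambda s: s[2], reverse=True)
    chosen: List[Section] = []
    accumulated = 0.0
    for start, end, ratio in ranked:
        if accumulated / total_len >= target_rate:
            break
        chosen.append((start, end, ratio))
        accumulated += end - start
    return sorted(chosen)  # restore playback (time) order


sections = [(0.0, 10.0, 0.5), (10.0, 30.0, 2.0), (30.0, 40.0, 1.5), (40.0, 60.0, 0.9)]
print(select_by_ranking(sections, total_len=60.0, target_rate=0.5))
```

Because sections have different lengths, the achieved rate can overshoot the target by up to one section; a real implementation would need a policy for that case.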
[0009]
The present invention further proposes a video playback method that uses a codebook storing, in association with one another, feature quantities that include at least the fundamental frequency or pitch period, the power, the temporal variation characteristic of a dynamic feature quantity, or inter-frame differences of these, together with their appearance probability in the emphasized state and their appearance probability in the calm state, the method comprising:
obtaining the appearance probability in the emphasized state corresponding to the feature quantity obtained by analyzing the speech signal frame by frame;
obtaining the appearance probability in the calm state corresponding to the feature quantity obtained by analyzing the speech signal frame by frame;
calculating the probability of being in the emphasized state based on the appearance probability in the emphasized state;
calculating the probability of being in the calm state based on the appearance probability in the calm state;
provisionally determining as a summary section each speech signal section in which the probability ratio of the probability of being in the emphasized state to the probability of being in the calm state is larger than a predetermined coefficient;
calculating the total duration of the summary sections, or the summary rate as the ratio of that total duration to the duration of the entire speech signal; and
determining the summary sections by finding the predetermined coefficient at which the total duration of the summary sections is approximately the predetermined summary time, or at which the summary rate is approximately the predetermined summary rate.
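The threshold-based variant above can be sketched as follows. This is a hedged illustration: it scans a small set of candidate coefficients exhaustively and keeps the one whose total time is closest to the target, whereas an actual implementation might adjust the coefficient iteratively. The names `total_time_above` and `find_coefficient` and the tuple layout are assumptions.

```python
from typing import List, Tuple

Section = Tuple[float, float, float]  # (start_s, end_s, probability_ratio)


def total_time_above(sections: List[Section], k: float) -> float:
    """Total duration of the sections whose probability ratio exceeds k."""
    return sum(end - start for start, end, ratio in sections if ratio > k)


def find_coefficient(sections: List[Section],
                     target_time: float) -> Tuple[float, List[Section]]:
    """Pick the coefficient k whose provisional summary time is closest to
    target_time. Candidates are 0.0 and each observed ratio (assumption)."""
    candidates = [0.0] + sorted({ratio for _, _, ratio in sections})
    best_k = min(candidates,
                 key=lambda k: abs(total_time_above(sections, k) - target_time))
    return best_k, [s for s in sections if s[2] > best_k]


sections = [(0.0, 10.0, 0.5), (10.0, 30.0, 2.0), (30.0, 40.0, 1.5), (40.0, 60.0, 0.9)]
k, summary = find_coefficient(sections, target_time=30.0)
print(k, summary)
```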
[0010]
The present invention further proposes a video playback method in which
each frame of the audio signal is judged to be a silent section or not, and a voiced section or not;
a portion that contains a voiced section and is surrounded by silent sections of at least a predetermined number of frames is determined to be an audio sub-paragraph;
a group of audio sub-paragraphs whose last member is an audio sub-paragraph in which the average power at its end is smaller than a predetermined constant times the average power of that audio sub-paragraph is determined to be an audio paragraph;
the audio signal sections are determined in units of audio paragraphs; and
the summary time is accumulated for each audio paragraph.
[0011]
The present invention further proposes a video playback apparatus comprising:
storage means for storing a real-time video signal and audio signal in association with their playback time;
summary start time input means for inputting a summary start time;
summary condition input means for inputting a summary condition defined by a summary time, which is the total duration of the summary sections, or by a summary rate, which is the ratio of the total duration of the summary sections to the whole section to be summarized;
a summary section determination unit that, in accordance with the summary condition, determines as summary sections the sections determined to be in the emphasized state within the speech signal of the section to be summarized, which runs from the summary start time to the current time; and
playback means for playing back the audio signal and video signal of the summary sections determined by the summary section determination unit.
[0012]
The present invention further proposes the video playback apparatus in which the summary section determination means comprises:
a codebook that stores, in association with one another, feature quantities including at least the fundamental frequency or pitch period, the power, the temporal variation characteristic of a dynamic feature quantity, or inter-frame differences of these, together with their appearance probability in the emphasized state and their appearance probability in the calm state;
an emphasized-state probability calculation unit that uses the codebook to obtain the appearance probability in the emphasized state corresponding to the feature quantity obtained by analyzing the speech signal frame by frame,
and calculates the probability of being in the emphasized state based on that appearance probability;
a calm-state probability calculation unit that uses the codebook to obtain the appearance probability in the calm state corresponding to the feature quantity obtained by analyzing the speech signal frame by frame, and calculates the probability of being in the calm state based on that appearance probability;
a summary section provisional determination unit that calculates, for each speech signal section, the probability ratio of the probability of being in the emphasized state to the probability of being in the calm state,
accumulates the durations of the speech signal sections in descending order of probability ratio to calculate the summary time, and provisionally determines the summary sections; and
a summary section determination unit that determines as the summary sections the speech signal sections for which the ratio of the summary sections to the whole section to be summarized satisfies the summary rate.
[0013]
The present invention further proposes the video playback apparatus in which the summary section determination means comprises:
a codebook that stores, in association with one another, feature quantities including at least the fundamental frequency or pitch period, the power, the temporal variation characteristic of a dynamic feature quantity, or inter-frame differences of these, together with their appearance probability in the emphasized state and their appearance probability in the calm state;
means for obtaining, using the codebook, the appearance probability in the emphasized state and the appearance probability in the calm state corresponding to the feature quantity obtained by analyzing the speech code frame by frame;
an emphasized-state probability calculation unit that calculates the probability of being in the emphasized state based on the appearance probability in the emphasized state;
a calm-state probability calculation unit that calculates the probability of being in the calm state based on the appearance probability in the calm state;
a summary section provisional determination unit that provisionally determines as a summary section each speech signal section in which the probability ratio of the probability of being in the emphasized state to the probability of being in the calm state is larger than a predetermined coefficient; and
a summary section determination unit that determines the summary sections for each channel or each speaker by finding the predetermined coefficient at which the total duration of the summary sections is approximately the predetermined summary time, or at which the summary rate is approximately the predetermined summary rate.
[0014]
The present invention further proposes a video playback program that is written in computer-readable code and causes a computer to execute any of the video playback methods described above.
[Action]
According to the video playback method of the present invention, speech sections with a high probability of being in the emphasized state are extracted as summary sections from the audio recorded on the recording medium. The important parts of the content are thus extracted, and joining these parts yields summarized audio and summarized video information. As a result, the content can be understood well even when it is compressed to a short summary time.
[0015]
Further, since a recording medium capable of writing and reading a large amount of data in a short time is used as the recording medium, other contents or the first half of the content being recorded can be read while continuing the recording. For this reason, while recording continues, the recorded part of the program being recorded can be summarized and the summary information can be reproduced. As a result, it is possible to view both the current broadcast content and the past broadcast content by displaying the video being recorded and the video that conveys the summary on different display means, for example, the main screen and the sub screen. At the time when the reproduction of the summary information is finished, there is an advantage that the continuation can be viewed with understanding the first half of the program.
[0016]
A feature of the present invention is that a summarization method capable of summarizing content at any summarization rate (compression rate) according to a request from a user at the time of content summarization reproduction is used.
A summarization method with this feature can be realized by using the speech emphasized-state determination method and the speech summarization method proposed by the present applicant in the prior application, Japanese Patent Application No. 2001-241278. These methods determine the utterance state of an arbitrary audio sub-paragraph, judge the audio sub-paragraph to be in the emphasized state if its probability of being in the emphasized state is larger than its probability of being in the calm state, and extract the audio paragraph containing that audio sub-paragraph as a summary section.
[0017]
[Detailed Description of the Invention]
Here, an audio sub-paragraph extraction method, an audio paragraph extraction method, and a method for obtaining the probability of being in an emphasized state and the probability of being in a calm state for each audio sub-paragraph will be described.
FIG. 5 shows the basic procedure of an embodiment of the previously proposed speech summarization method. In step S1, the input speech signal is analyzed to obtain speech feature quantities. In step S2, the audio sub-paragraphs of the input speech signal and the audio paragraphs, each consisting of a plurality of audio sub-paragraphs, are extracted. In step S3, the utterance state of the frames constituting each audio sub-paragraph is determined to be either calm or emphasized. Based on this determination, summary speech is created in step S4.
[0018]
In the following, an embodiment applied to summarizing natural spoken language and conversational speech is described. The speech feature quantities used are ones that, compared with spectral information and the like, can be obtained stably even in noisy environments and depend little on the speaker. The fundamental frequency (f0), the power (p), the temporal variation characteristic of the dynamic feature quantity of speech (d), and the pause duration (silent section) (ps) are extracted from the input speech signal as speech feature quantities. Methods for extracting these speech feature quantities are described in, for example, "Acoustics and Speech Engineering" (Sadaaki Furui, Modern Science, 1998), "Speech Coding" (Takehiro Moriya, IEICE, 1998), "Digital Speech Processing" (Sadaaki Furui, Tokai University Press, 1985), and "Study on speech analysis algorithm based on composite sine wave model" (Shigeki Hatakeyama, PhD thesis, 1998). The temporal variation of the dynamic feature quantity of speech is a parameter that serves as a measure of speech rate, and the one described in Japanese Patent No. 2976998 may be used. That is, the temporal variation characteristic of the LPC spectrum coefficients, which reflect the spectral envelope, is obtained as the dynamic variation, and a speech rate coefficient is obtained from that temporal variation. More specifically, LPC spectrum coefficients C1(t), ..., Ck(t) are extracted for each frame, and the dynamic feature quantity d (dynamic measure) is obtained by the following equation.
$$d(t) = \sum_{i=1}^{k} \left[ \frac{\sum_{f=t-f_0}^{t+f_0} f \times C_i(t)}{\sum_{f=t-f_0}^{t+f_0} f^2} \right]^2$$
Here, f0 is the number of frames of the speech section before and after the current frame (it need not correspond to an integer number of frames and may be a fixed time interval), k is the order of the LPC spectrum coefficients, and i = 1, 2, ..., k. As the speech rate coefficient, the number of local maxima of the dynamic feature quantity per unit time, or its rate of change per unit time, is used.
[0019]
In the embodiment, for example, one frame is 100 ms long and the frame shift is 50 ms. The average fundamental frequency of each frame (f0′) is obtained. Likewise, the average power of each frame (p′) is obtained. Further, the differences between f0′ of the current frame and f0′ of the frames i frames before and after it are taken as ±Δf0′i (Δ components). Similarly for power, the differences ±Δp′i (Δ components) between p′ of the current frame and p′ of the frames i frames before and after it are obtained. f0′, ±Δf0′i, p′, and ±Δp′i are then normalized. In this normalization, for example, f0′ and ±Δf0′i are each divided by the average fundamental frequency of the entire speech waveform. The normalized values are written f0″ and ±Δf0″i. Similarly, p′ and ±Δp′i are normalized by dividing them by the average power of the entire speech waveform subject to utterance-state determination. The normalization may instead divide by the average power of each audio sub-paragraph or audio paragraph described later. The normalized values are written p″ and ±Δp″i. The value of i is, for example, i = 4. The number of dynamic measure peaks within ±T1 ms of the current frame, that is, the number dp of local maxima of the dynamic feature quantity, is counted. The Δ component (−Δdp) between this dp and the dp of the frame containing the time T2 ms before the start time of the current frame is obtained, as is the Δ component (+Δdp) between this dp and the dp of the frame containing the time T3 ms after the end time of the current frame. The values of T1, T2, and T3 are, for example, T1 = T2 = T3 = 450 ms. The durations of the silent sections before and after the frame are taken as ±ps. In step S1, the values of these speech feature parameters are extracted for each frame.
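The Δ components and the normalization described above can be sketched for a single per-frame series (f0′ or p′) as follows. This is a minimal illustration under assumptions the text does not fix: the sign convention of the Δ components, the clamping of indices at the ends of the series, and the function name `normalized_with_deltas` are all hypothetical.

```python
from typing import List, Tuple


def normalized_with_deltas(values: List[float],
                           i: int = 4) -> List[Tuple[float, float, float]]:
    """For each frame return (value, -delta, +delta), each divided by the
    mean of the whole series.

    -delta: current frame minus the frame i frames earlier (assumed sign).
    +delta: frame i frames later minus the current frame (assumed sign).
    Indices beyond either end are clamped to the first or last frame.
    """
    mean = sum(values) / len(values)
    last = len(values) - 1
    features = []
    for t, v in enumerate(values):
        before = values[max(t - i, 0)]
        after = values[min(t + i, last)]
        features.append((v / mean, (v - before) / mean, (after - v) / mean))
    return features


# Per-frame average fundamental frequency in Hz (made-up values).
f0_frames = [100.0, 110.0, 120.0, 130.0]
for row in normalized_with_deltas(f0_frames, i=1):
    print(row)
```

The same function can be applied to the per-frame average power; dividing by a sub-paragraph or paragraph mean instead of the global mean only changes which slice of frames is passed in.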
[0020]
FIG. 6 shows an example of the method for extracting the audio sub-paragraphs and audio paragraphs of the input speech in step S2. Here, the audio sub-paragraph is the unit on which utterance-state determination is performed. In step S201, the silent sections and voiced sections of the input speech signal are extracted. For example, a frame is judged to be a silent section if its power is at or below a predetermined power value, and a voiced section if its correlation function is at or above a predetermined correlation function value. Voiced/unvoiced determination is often equated with periodic/aperiodic determination and performed using the peak value of the autocorrelation function or the modified correlation function. The modified correlation function is the autocorrelation function of the prediction residual obtained by removing the spectral envelope from the short-time spectrum of the input signal. Voiced/unvoiced is determined by whether the peak of the modified correlation function exceeds a predetermined threshold, and the pitch period 1/f0 (fundamental frequency f0) is extracted from the delay time that gives the peak. Details of the methods for extracting these sections are described in, for example, "Digital Speech Processing" (Sadaaki Furui, Tokai University Press, 1985). In this example, each speech feature quantity is obtained by analyzing the speech signal frame by frame. However, for speech that has already been analyzed, for example by encoding, the feature quantities corresponding to the coefficients or codes may instead be read from the codebook used for the encoding.
[0021]
In step S202, when the silent sections surrounding voiced sections last t seconds or longer, the portion containing the voiced sections enclosed by those silent sections is taken as an audio sub-paragraph. Here t is, for example, t = 400 ms. In step S203, the average power of the voiced sections in the audio sub-paragraph, preferably those in its latter half, is compared with a constant β times the average power value BA of that audio sub-paragraph. If the former is smaller, the audio sub-paragraph is taken as a final audio sub-paragraph, and the span from the audio sub-paragraph that follows the previous final audio sub-paragraph up to the currently detected final audio sub-paragraph is determined to be an audio paragraph.
FIG. 7 schematically shows voiced sections, audio sub-paragraphs, and audio paragraphs. Audio sub-paragraphs are extracted under the condition that the silent sections surrounding the voiced sections last t seconds or longer. FIG. 7 shows audio sub-paragraphs j−1, j, and j+1. Here, audio sub-paragraph j consists of n voiced sections, and its average power is Pj. As a representative voiced section, voiced section v contained in audio sub-paragraph j has average power pv. Audio paragraph k is extracted from audio sub-paragraph j and the power of the voiced sections that make up its latter half. When the mean of the average powers pi of the voiced sections i = n−α to n is smaller than β times the average power Pj of audio sub-paragraph j, that is, when
$$\frac{1}{\alpha + 1} \sum_{i=n-\alpha}^{n} p_i < \beta P_j \qquad \text{Equation (1)}$$
holds, audio sub-paragraph j is taken to be the final audio sub-paragraph of audio paragraph k. In Equation (1), α and β are constants, and audio paragraphs are extracted by adjusting them. In the embodiment, α is 3 and β is 0.8. In this way, with the final audio sub-paragraphs as delimiters, the group of audio sub-paragraphs between adjacent final audio sub-paragraphs can be determined to be an audio paragraph.
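Steps S202 and S203 and Equation (1) can be sketched as follows. This is a simplified illustration, not the specification's implementation. It starts from already-detected voiced sections, takes Pj as the unweighted mean of the voiced-section powers, and averages over fewer than α + 1 sections when a sub-paragraph is short; all three are assumptions, as are the function names.

```python
from typing import List, Tuple

Voiced = Tuple[float, float, float]  # (start_s, end_s, average_power)


def split_sub_paragraphs(voiced: List[Voiced], t: float = 0.4) -> List[List[Voiced]]:
    """Start a new sub-paragraph whenever the silent gap before a voiced
    section is at least t seconds (step S202)."""
    subs: List[List[Voiced]] = []
    for section in voiced:
        if subs and section[0] - subs[-1][-1][1] < t:
            subs[-1].append(section)
        else:
            subs.append([section])
    return subs


def is_final_sub_paragraph(sub: List[Voiced], alpha: int = 3, beta: float = 0.8) -> bool:
    """Equation (1): mean power of the last alpha + 1 voiced sections is
    below beta times the sub-paragraph's average power."""
    powers = [p for _, _, p in sub]
    p_j = sum(powers) / len(powers)   # simplified average power Pj
    tail = powers[-(alpha + 1):]
    return sum(tail) / len(tail) < beta * p_j


def group_paragraphs(subs: List[List[Voiced]], alpha: int = 3,
                     beta: float = 0.8) -> List[List[List[Voiced]]]:
    """Close an audio paragraph at every final sub-paragraph (step S203)."""
    paragraphs, current = [], []
    for sub in subs:
        current.append(sub)
        if is_final_sub_paragraph(sub, alpha, beta):
            paragraphs.append(current)
            current = []
    if current:                       # trailing sub-paragraphs, if any
        paragraphs.append(current)
    return paragraphs


voiced = [(0.0, 0.3, 10.0), (0.4, 0.7, 10.0), (0.8, 1.1, 10.0), (1.2, 1.5, 1.0),
          (1.6, 1.9, 1.0), (2.0, 2.3, 1.0), (2.4, 2.7, 1.0),
          (3.5, 3.8, 5.0), (3.9, 4.2, 5.0)]
subs = split_sub_paragraphs(voiced)
print(len(subs), [len(p) for p in group_paragraphs(subs)])
```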
[0022]
FIG. 8 shows an example of the method for determining the utterance state of an audio sub-paragraph in step S3. In step S301, the speech feature quantities of the input audio sub-paragraph are vector-quantized. For this purpose, a codebook storing at least two quantized speech feature quantities (codes) is created in advance. The speech feature quantities stored in the codebook are compared with the speech feature quantity of the input speech, or of speech that has already been analyzed, and the quantized speech feature quantity in the codebook that minimizes the distortion (distance) between them is identified, as is customary in vector quantization.
FIG. 9 shows an example of a method for creating this codebook. Subjects listen to a large amount of learning speech, and the sections whose utterance state is calm and those whose utterance state is emphasized are labeled (S501).
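The minimum-distortion lookup described above can be sketched as follows. The squared Euclidean distance is used as the distortion measure, which is an assumption; the function name `quantize` and the example codebook values are made up for illustration.

```python
from typing import List, Sequence


def quantize(feature: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Return the index of the code vector with the smallest squared
    Euclidean distance to the given feature vector."""
    def distortion(code: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(feature, code))
    return min(range(len(codebook)), key=lambda c: distortion(codebook[c]))


# Hypothetical 2-dimensional codebook, e.g. (normalized f0, normalized power).
codebook = [(0.8, 0.7), (1.0, 1.0), (1.3, 1.6)]
print(quantize((1.2, 1.5), codebook))
```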
[0023]
For example, the subjects gave the following reasons for judging an utterance to be emphasized:
(A) The voice is loud, and nouns and conjunctions are drawn out.
(B) The beginning of the utterance is drawn out to assert a change of topic, or the voice is raised to gather opinions.
(C) The voice is raised to emphasize important nouns.
(D) The voice is high-pitched but not very loud.
(E) The speaker laughs to disguise his or her true intentions.
(F) The pitch rises at the end of the utterance to seek agreement or ask a question.
(G) The end of the utterance is loud, for added force.
(H) The voice is loud, and the speaker insists on speaking and talks louder than the other party.
(I) The speaker reveals true feelings or secrets that could not be said in a loud voice, or speaks quietly about something important that would normally be said loudly (for example, in a low, whispering voice).
In this example, the calm state is a state that matches none of (A) to (I) above and in which the subject felt the utterance to be calm.
[0024]
In the above description, the object judged to be in the emphasized state is an utterance, but the emphasized state can also be identified in music. When the emphasized state is identified from the vocals of a song that contains vocals, the reasons for perceiving emphasis include the following:
(A) Loud voice and loud voice
(B) Strong voice
(C) High voice and strong accent
(D) Voice is high and voice quality changes
(E) Extend voice and loud
(F) Loud voice, loud voice, strong accent
(G) Loud voice, loud voice, screaming
(H) Voice is loud and accent changes
(I) Extend voice, loud voice, high ending
(J) The voice is high and the voice is extended
(K) Extend voice, scream, high voice
(L) Strong ending
(M) Slowly strengthen
(N) The tone is irregular
(O) The tone is irregular and the voice is high
In addition, the emphasized state can be identified even in a musical performance that contains no vocals. The reasons for perceiving emphasis in that case include the following:
(A) Increased power of the whole emphasized part
(B) Large pitch difference
(C) Power increases
(D) Number of instruments changes
(E) The tune and tempo change
Etc.
[0025]
By creating a code book based on these, it is possible to summarize not only utterances but also music.
For each label section of the calm state and the emphasized state, speech feature quantities are extracted (S502) and parameters are selected (S503), as in step S1. A codebook is created by the LBG algorithm using the parameters of the calm-state and emphasized-state label sections (S504). The LBG algorithm is described in, for example, Y. Linde, A. Buzo and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. COM-28, pp. 84-95, 1980. The codebook size can be varied as a power of two (2^n). The codebook is preferably created using speech feature quantities normalized by, for example, the average over each audio sub-paragraph, over a suitable section longer than this, or over the entire learning speech.
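A compact sketch of codebook training in the style of the LBG algorithm is given below. It is a simplified illustration, not the procedure of the cited paper in full: the splitting perturbation is additive, each refinement runs for a fixed number of iterations instead of testing a distortion threshold, and empty cells keep their previous code vector. These choices and the function names are assumptions.

```python
from typing import List


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest(vector: List[float], codebook: List[List[float]]) -> int:
    """Index of the code vector with the smallest squared distance."""
    return min(range(len(codebook)),
               key=lambda c: sum((x - y) ** 2 for x, y in zip(vector, codebook[c])))


def lbg(data: List[List[float]], size: int, eps: float = 0.01,
        iterations: int = 20) -> List[List[float]]:
    """Grow a codebook to `size` (a power of two) by repeated splitting
    followed by nearest-neighbour / centroid refinement."""
    if size < 1 or size & (size - 1):
        raise ValueError("size must be a power of two")
    codebook = [centroid(data)]
    while len(codebook) < size:
        # Split every code vector into two slightly perturbed copies.
        codebook = [[x + s for x in code] for code in codebook for s in (eps, -eps)]
        for _ in range(iterations):
            cells: List[List[List[float]]] = [[] for _ in codebook]
            for vector in data:
                cells[nearest(vector, codebook)].append(vector)
            codebook = [centroid(cell) if cell else code
                        for cell, code in zip(cells, codebook)]
    return codebook


data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
print(sorted(lbg(data, size=2)))
```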
[0026]
In step S301 in FIG. 8, using this codebook, the speech feature amount of the input speech sub-paragraph is normalized for each speech feature amount, and the normalized speech feature amount is collated or vector quantized for each frame. A code (quantized speech feature value) is obtained for each frame. The voice feature amount extracted from the input voice signal at this time is the same as the parameter used for the code book creation.
To identify audio sub-paragraphs that contain the emphasized state, the likelihood of each utterance state, calm and emphasized, is obtained using the codes in the audio sub-paragraph. For this purpose, the appearance probability of every code (quantized speech feature quantity) is obtained in advance for the calm state and for the emphasized state, and each code is stored in the codebook together with its appearance probabilities as a set. An example of how the appearance probabilities are obtained follows. Let the speech feature codes (one per frame) in one labeled section (label section) of the learning speech used to create the codebook be Ci, Cj, Ck, ..., Cn in time series. Then the probability Pα(e) that label section α is in the emphasized state and the probability Pα(n) that it is in the calm state are
$$P_\alpha(e) = P_{emp}(C_i)\, P_{emp}(C_j \mid C_i) \cdots P_{emp}(C_n \mid C_i \cdots C_{n-1}) = P_{emp}(C_i) \prod P_{emp}(C_x \mid C_i \cdots C_{x-1})$$
$$P_\alpha(n) = P_{nrm}(C_i)\, P_{nrm}(C_j \mid C_i) \cdots P_{nrm}(C_n \mid C_i \cdots C_{n-1}) = P_{nrm}(C_i) \prod P_{nrm}(C_x \mid C_i \cdots C_{x-1})$$
Here, Pemp(Cx | Ci ... Cx-1) is the conditional probability that Cx appears in the emphasized state following the code string Ci ... Cx-1, and Pnrm(Cx | Ci ... Cx-1) is likewise the conditional probability that Cx appears in the calm state following Ci ... Cx-1. Π denotes the product over x = i+1 to n. Pemp(Ci) is obtained by quantizing the learning speech frame by frame, counting the occurrences of Ci in the portions labeled as emphasized, and dividing that count by the total number of codes (number of frames) in all the learning speech. Pnrm(Ci) is obtained by dividing the number of occurrences of Ci in the portions labeled as calm by the total number of codes.
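The counting rule for the single appearance probabilities Pemp(Ci) and Pnrm(Ci) can be sketched as follows. The input layout (one `(code, label)` pair per frame with labels `"emp"` and `"nrm"`) and the function name are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def unigram_probabilities(frames: List[Tuple[int, str]]) -> Dict[int, Tuple[float, float]]:
    """Map each code to (Pemp, Pnrm): its count in emphasized-labeled
    (resp. calm-labeled) frames divided by the total number of frames."""
    total = len(frames)
    emphasized = Counter(code for code, label in frames if label == "emp")
    calm = Counter(code for code, label in frames if label == "nrm")
    codes = {code for code, _ in frames}
    return {code: (emphasized[code] / total, calm[code] / total) for code in codes}


# Quantized, labeled learning speech (made-up codes and labels).
frames = [(0, "emp"), (0, "emp"), (1, "emp"), (0, "nrm"),
          (1, "nrm"), (1, "nrm"), (1, "nrm"), (2, "nrm")]
print(unigram_probabilities(frames))
```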
[0027]
To simplify the state probabilities of label section α, this example uses an N-gram model (N < n):
$$P_\alpha(e) = P_{emp}(C_n \mid C_{n-N+1} \cdots C_{n-1})$$
$$P_\alpha(n) = P_{nrm}(C_n \mid C_{n-N+1} \cdots C_{n-1})$$
That is, Pα(e) is the probability that Cn appears in the emphasized state following the N−1 preceding codes Cn−N+1 ... Cn−1, and Pα(n) is likewise the probability that Cn appears in the calm state following the N−1 preceding codes Cn−N+1 ... Cn−1. It is preferable to apply linear interpolation, in which the N-gram probability value is linearly interpolated with lower-order M-gram probability values (N ≧ M). The conditional probabilities of Pα(e) and Pα(n) are all obtained from the quantized code sequences of the labeled learning speech, so the conditional probability corresponding to a quantized code sequence of the speech feature quantities of the input speech signal may not be obtainable from the learning speech. Therefore, a higher-order conditional probability (that is, one for a longer code string) is obtained by interpolation using the single appearance probability and lower-order conditional appearance probabilities. For example, linear interpolation is performed using the trigram with N = 3, the bigram with N = 2, and the unigram with N = 1. N-grams, linear interpolation, and trigrams are described in, for example, "Spoken Language Processing" (Ken Kenji, Satoshi Nakamura, Masaaki Nagata, Morikita Publishing, 1996, p. 29). That is,
N = 3 (trigram): Pemp (Cn | Cn-2Cn-1), Pnrm (Cn | Cn-2Cn-1)
N = 2 (bigram): Pemp (Cn | Cn-1), Pnrm (Cn | Cn-1)
N = 1 (unigram): Pemp (Cn), Pnrm (Cn)
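Using these three appearance probabilities of Cn in the emphasized state and the three in the calm state, Pemp(Cn | Cn-2Cn-1) and Pnrm(Cn | Cn-2Cn-1) are calculated by the following equations.
$$P_{emp}(C_n \mid C_{n-2} C_{n-1}) = \lambda_{emp1} P_{emp}(C_n \mid C_{n-2} C_{n-1}) + \lambda_{emp2} P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3} P_{emp}(C_n) \qquad \text{Equation (2)}$$
$$P_{nrm}(C_n \mid C_{n-2} C_{n-1}) = \lambda_{nrm1} P_{nrm}(C_n \mid C_{n-2} C_{n-1}) + \lambda_{nrm2} P_{nrm}(C_n \mid C_{n-1}) + \lambda_{nrm3} P_{nrm}(C_n) \qquad \text{Equation (3)}$$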
When the number of trigram training data is N, that is, when the codes C1, C2, ..., CN are obtained in time series, the re-estimation equations for λemp1, λemp2, and λemp3 are as follows, according to the above-mentioned reference "Spoken Language Processing".
$$\lambda_{emp1} = \frac{1}{N} \sum \frac{\lambda_{emp1} P_{emp}(C_n \mid C_{n-2} C_{n-1})}{\lambda_{emp1} P_{emp}(C_n \mid C_{n-2} C_{n-1}) + \lambda_{emp2} P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3} P_{emp}(C_n)}$$
$$\lambda_{emp2} = \frac{1}{N} \sum \frac{\lambda_{emp2} P_{emp}(C_n \mid C_{n-1})}{\lambda_{emp1} P_{emp}(C_n \mid C_{n-2} C_{n-1}) + \lambda_{emp2} P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3} P_{emp}(C_n)}$$
$$\lambda_{emp3} = \frac{1}{N} \sum \frac{\lambda_{emp3} P_{emp}(C_n)}{\lambda_{emp1} P_{emp}(C_n \mid C_{n-2} C_{n-1}) + \lambda_{emp2} P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3} P_{emp}(C_n)}$$
Here, Σ denotes the sum over n = 1 to N. λnrm1, λnrm2, and λnrm3 are obtained in the same way.
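One pass of the re-estimation equations above can be sketched as follows. The sketch assumes the trigram, bigram, and unigram probabilities have already been looked up for every training position n and are passed in as triples; the function name `reestimate` and the handling of positions whose denominator is zero (they are skipped) are assumptions.

```python
from typing import List, Sequence, Tuple


def reestimate(lambdas: Sequence[float],
               samples: List[Tuple[float, float, float]]) -> List[float]:
    """One re-estimation pass for the interpolation weights.

    lambdas: current (lambda1, lambda2, lambda3) for trigram, bigram, unigram.
    samples: for each position n, (P(Cn|Cn-2 Cn-1), P(Cn|Cn-1), P(Cn)).
    """
    n_samples = len(samples)
    sums = [0.0, 0.0, 0.0]
    for probs in samples:
        denom = sum(lam * p for lam, p in zip(lambdas, probs))
        if denom == 0.0:
            continue  # no model explains this position; skipped (assumption)
        for m in range(3):
            sums[m] += lambdas[m] * probs[m] / denom
    return [s / n_samples for s in sums]


samples = [(0.5, 0.3, 0.1), (0.0, 0.2, 0.1), (0.0, 0.0, 0.05)]
lambdas = [1 / 3, 1 / 3, 1 / 3]
for _ in range(10):                 # iterate until the weights settle
    lambdas = reestimate(lambdas, samples)
print(lambdas)
```

When every denominator is positive, the three re-estimated weights sum to 1, so repeated passes keep the weights a valid interpolation.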
[0028]
In this example, when the codes obtained for label section α with Nα frames are Ci1, Ci2, ..., CiNα, the probability Pα(e) that label section α is in the emphasized state and the probability Pα(n) that it is in the calm state are
$$P_\alpha(e) = P_{emp}(C_{i_3} \mid C_{i_1} C_{i_2}) \cdots P_{emp}(C_{i_{N_\alpha}} \mid C_{i_{N_\alpha - 2}} C_{i_{N_\alpha - 1}}) \qquad \text{Equation (4)}$$
$$P_\alpha(n) = P_{nrm}(C_{i_3} \mid C_{i_1} C_{i_2}) \cdots P_{nrm}(C_{i_{N_\alpha}} \mid C_{i_{N_\alpha - 2}} C_{i_{N_\alpha - 1}}) \qquad \text{Equation (5)}$$
The trigram, bigram, and unigram are obtained for every code and stored in the codebook so that this calculation can be performed. In other words, the codebook stores, as a set, the speech feature quantity of each code, its appearance probability in the emphasized state, and, in this example, its appearance probability in the calm state. The appearance probability in the emphasized state is either only the probability that the speech feature quantity appears in the emphasized state regardless of the speech feature quantities in past frames (the unigram, referred to here as the single appearance probability), or a combination of this with the conditional probability that the speech feature quantity appears in the emphasized state following each frame-by-frame sequence of speech feature quantities from past frames up to the current frame. The appearance probability in the calm state is likewise either only the probability that the speech feature quantity appears in the calm state regardless of the speech feature quantities in past frames (the unigram, or single appearance probability), or a combination of this with the conditional probability that the speech feature quantity appears in the calm state following each frame-by-frame sequence of speech feature quantities from past frames up to the current frame.
[0029]
For example, as shown in FIG. 10, in the code book, for each code C1, C2,..., The voice feature amount and the single appearance probability are emphasized and calm, and the conditional probability is emphasized and calm. Stored as a set.
In step S302 in FIG. 8, the likelihoods of the calm state and the emphasized state are obtained from the probabilities stored in the code book for the codes of all frames of the input speech sub-paragraph. FIG. 11 shows a schematic diagram of this embodiment. Among the frames of the audio sub-paragraph starting at time t, the first to fourth frames are denoted (1) to (4). As described above, the frame length is 100 ms and the frame shift is 50 ms, so the frame length is longer than the frame shift. Suppose that (1) code Ci is obtained for frame number f, time t to t + 100, (2) code Cj for frame number f + 1, time t + 50 to t + 150, (3) code Ck for frame number f + 2, time t + 100 to t + 200, and (4) code Cl for frame number f + 3, time t + 150 to t + 250. That is, when the codes are Ci, Cj, Ck, Cl in frame order, the trigram can be calculated for frame number f + 2 and later. If the probability that the voice sub-paragraph s is in an emphasized state is Ps(e) and the probability that it is in a calm state is Ps(n), the probabilities up to the fourth frame are
$$P_s(e) = P_{\mathrm{emp}}(C_k \mid C_i C_j)\, P_{\mathrm{emp}}(C_l \mid C_j C_k) \tag{6}$$
$$P_s(n) = P_{\mathrm{nrm}}(C_k \mid C_i C_j)\, P_{\mathrm{nrm}}(C_l \mid C_j C_k) \tag{7}$$
In this example, the single appearance probabilities of Ck and Cl in the emphasized state and the calm state, the conditional probabilities that Ck appears after Cj and that Cl appears after Ck in each state, and the conditional probabilities that Ck appears after Ci Cj and that Cl appears after Cj Ck in each state are read from the code book and combined as follows.
$$P_{\mathrm{emp}}(C_k \mid C_i C_j) = \lambda_{\mathrm{emp}1} P_{\mathrm{emp}}(C_k \mid C_i C_j) + \lambda_{\mathrm{emp}2} P_{\mathrm{emp}}(C_k \mid C_j) + \lambda_{\mathrm{emp}3} P_{\mathrm{emp}}(C_k) \tag{8}$$
$$P_{\mathrm{emp}}(C_l \mid C_j C_k) = \lambda_{\mathrm{emp}1} P_{\mathrm{emp}}(C_l \mid C_j C_k) + \lambda_{\mathrm{emp}2} P_{\mathrm{emp}}(C_l \mid C_k) + \lambda_{\mathrm{emp}3} P_{\mathrm{emp}}(C_l) \tag{9}$$
$$P_{\mathrm{nrm}}(C_k \mid C_i C_j) = \lambda_{\mathrm{nrm}1} P_{\mathrm{nrm}}(C_k \mid C_i C_j) + \lambda_{\mathrm{nrm}2} P_{\mathrm{nrm}}(C_k \mid C_j) + \lambda_{\mathrm{nrm}3} P_{\mathrm{nrm}}(C_k) \tag{10}$$
$$P_{\mathrm{nrm}}(C_l \mid C_j C_k) = \lambda_{\mathrm{nrm}1} P_{\mathrm{nrm}}(C_l \mid C_j C_k) + \lambda_{\mathrm{nrm}2} P_{\mathrm{nrm}}(C_l \mid C_k) + \lambda_{\mathrm{nrm}3} P_{\mathrm{nrm}}(C_l) \tag{11}$$
Using equations (8) to (11), the probability Ps(e) of being in an emphasized state and the probability Ps(n) of being in a calm state up to the fourth frame, given by equations (6) and (7), are obtained. Here, Pemp(Ck | CiCj) and Pnrm(Ck | CiCj) can be calculated at frame number f + 2.
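As a minimal sketch of equations (8) to (11), the following Python function mixes the trigram, bigram, and unigram probabilities read from a code book for one state (emphasized or calm). The dictionary layout, the function name `interpolated_prob`, the example weights, and the fallback of 0.0 for entries missing from the table are assumptions made for illustration; they are not taken from the embodiment.

```python
def interpolated_prob(table, lambdas, c_prev2, c_prev1, c_cur):
    """Weighted sum of trigram, bigram and unigram probabilities.

    table:   hypothetical code book for ONE state, laid out as
             {"trigram": {(a, b, c): p}, "bigram": {(b, c): p}, "unigram": {c: p}}
    lambdas: (lambda1, lambda2, lambda3) for that state
    """
    lam1, lam2, lam3 = lambdas
    # Missing entries fall back to 0.0 (an assumption of this sketch).
    tri = table["trigram"].get((c_prev2, c_prev1, c_cur), 0.0)
    bi = table["bigram"].get((c_prev1, c_cur), 0.0)
    uni = table["unigram"].get(c_cur, 0.0)
    return lam1 * tri + lam2 * bi + lam3 * uni


# Example with made-up probabilities for the emphasized state.
emp_table = {
    "trigram": {("Ci", "Cj", "Ck"): 0.5},
    "bigram": {("Cj", "Ck"): 0.25},
    "unigram": {"Ck": 0.125},
}
p = interpolated_prob(emp_table, (0.5, 0.25, 0.25), "Ci", "Cj", "Ck")
```

The same function would be called with the calm-state table and the calm-state weights to obtain the values of equations (10) and (11).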
[0030]
In this example, when the codes obtained for the audio sub-paragraph s with Ns frames are Ci1, Ci2, ..., CiNs, the probability Ps(e) that the audio sub-paragraph s is in an emphasized state and the probability Ps(n) that it is in a calm state are calculated by the following equations.
$$P_s(e) = P_{\mathrm{emp}}(C_{i3} \mid C_{i1} C_{i2}) \cdots P_{\mathrm{emp}}(C_{iN_s} \mid C_{i(N_s-2)} C_{i(N_s-1)})$$
$$P_s(n) = P_{\mathrm{nrm}}(C_{i3} \mid C_{i1} C_{i2}) \cdots P_{\mathrm{nrm}}(C_{iN_s} \mid C_{i(N_s-2)} C_{i(N_s-1)})$$
In this example, if Ps(e) > Ps(n), the audio sub-paragraph s is determined to be in an emphasized state, and if Ps(n) > Ps(e), it is determined to be in a calm state.
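The product over frames and the comparison Ps(e) > Ps(n) can be sketched as follows. The callable `cond_prob` is a hypothetical stand-in for the interpolated probabilities of equations (8) to (11), and the function names are introduced only for this illustration.

```python
from math import prod


def subparagraph_probability(codes, cond_prob):
    """Product of conditional probabilities from the third frame onward.

    codes:     frame codes of one audio sub-paragraph, in frame order
    cond_prob: hypothetical callable (c_prev2, c_prev1, c_cur) -> probability
    """
    triples = zip(codes, codes[1:], codes[2:])
    # With fewer than three frames the product is empty and evaluates to 1.
    return prod(cond_prob(a, b, c) for a, b, c in triples)


def is_emphasized(ps_e, ps_n):
    """Decision rule of this example: emphasized when Ps(e) > Ps(n)."""
    return ps_e > ps_n


# Example with constant made-up probabilities for four frames.
codes = ["Ci", "Cj", "Ck", "Cl"]
ps_e = subparagraph_probability(codes, lambda a, b, c: 0.5)
ps_n = subparagraph_probability(codes, lambda a, b, c: 0.25)
```

For long sub-paragraphs a product of many small probabilities can underflow; summing logarithms instead is a common alternative, although the text above does not state which form is used.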
[0031]
FIG. 12 shows an embodiment of a speech emphasis state determination apparatus and a speech summarization apparatus that use the speech sub-paragraph extraction method, the speech paragraph extraction method, and the method for obtaining, for each speech sub-paragraph, the probability of being in an emphasized state and the probability of being in a calm state.
An input speech (input speech signal) whose emphasis state or summary section is to be determined is input to the input unit 11. The input unit 11 includes a function of converting the input audio signal into a digital signal as necessary. The digitized audio signal is stored in the storage unit 12 as necessary. The voice feature quantity extraction unit 13 extracts the above-described voice feature quantity for each frame. The extracted speech feature value is normalized by the average value of the speech feature value as necessary, the speech feature value of each frame is quantized by the quantization unit 14 with reference to the code book 15, and the quantized speech feature amount is sent to the emphasis probability calculation unit 16 and the calm probability calculation unit 17. The code book 15 is, for example, as shown in FIG.
[0032]
The emphasis probability calculation unit 16 calculates the appearance probability of the quantized speech feature amount in the emphasized state from the corresponding probabilities stored in the code book 15, for example by equation (8) or (9). Similarly, the calm probability calculation unit 17 calculates the appearance probability of the quantized speech feature amount in the calm state from the corresponding probabilities stored in the code book 15, for example by equation (10) or (11). The appearance probability in the emphasized state and the appearance probability in the calm state calculated for each frame by the emphasis probability calculation unit 16 and the calm probability calculation unit 17, together with the speech feature amount of each frame, are stored in the storage unit 12 along with the frame number assigned to each frame.
[0033]
These units are controlled in sequence by the control unit 19.
In the embodiment of the speech summarization apparatus, the broken line blocks are added to the solid line blocks in FIG. That is, the audio feature amount of each frame stored in the storage unit 12 is sent to the silent section determination unit 21 and the voiced section determination unit 22. The silent section determination unit 21 determines whether each frame is a silent section, and the voiced section determination unit 22 determines whether each frame is a voiced section. These silent section determination results and voiced section determination results are input to the audio sub-paragraph determination unit 23. Based on these determination results, the audio sub-paragraph determination unit 23 determines, as an audio sub-paragraph, a portion including a voiced section surrounded by silent sections of a predetermined number of frames, as described in the previous method embodiment. The determination result of the audio sub-paragraph determination unit 23 is written in the storage unit 12 and added to the audio data string stored there, and an audio sub-paragraph number is assigned to each frame group surrounded by silent sections. At the same time, the determination result of the audio sub-paragraph determination unit 23 is input to the end audio sub-paragraph determination unit 24.
[0034]
In the end audio sub-paragraph determination unit 24, the end audio sub-paragraph is detected, for example, by the method described with reference to FIG. 7, and the end audio sub-paragraph determination result is input to the audio paragraph determination unit 25. A portion including a plurality of audio sub-paragraphs between two end audio sub-paragraphs is determined as an audio paragraph. The voice paragraph determination result is also written in the storage unit 12, and a voice paragraph number is assigned to the voice sub-paragraph number string stored in the storage unit 12. When the apparatus operates as a speech summarization device, the emphasis probability calculation unit 16 and the calm probability calculation unit 17 read from the storage unit 12 the emphasis probability and the calm probability of each frame constituting each speech sub-paragraph, and calculate the probabilities for each speech sub-paragraph, for example by equation (8) and equation (10). The emphasis state determination unit 18 compares the calculated probabilities for each audio sub-paragraph to determine whether the audio sub-paragraph is in an emphasized state. If even one audio sub-paragraph in an audio paragraph is determined to be in an emphasized state, the summary section extraction unit 26 extracts the audio paragraph including that audio sub-paragraph. Each unit is controlled by the control unit 19.
[0035]
It will be understood from the above that the speech waveform composed of speech can be separated into speech sub-paragraphs and speech paragraphs, and the probability of being in an emphasized state and the probability of being in a calm state can be calculated for each speech sub-paragraph.
Next, an embodiment related to a speech processing method and speech processing apparatus that can freely set and change the summarization rate using each of the above-described methods will be described.
FIG. 13 shows a basic procedure of the embodiment of the voice processing method. In this embodiment, the speech emphasis probability calculation process is executed in step S11, and the emphasis probability and calm probability of each speech sub-paragraph are obtained.
[0036]
In step S12, the summary condition input step is executed. In this summary condition input step S12, for example, information prompting the user to input the summary time, summary rate, or compression rate is presented, and the summary time, summary rate, or compression rate is input. Note that an input method of selecting one of a plurality of preset summary times, summary rates, or compression rates may also be employed.
In step S13, the extraction condition is changed repeatedly, and an extraction condition that satisfies the summary time, summary rate, or compression rate input in the summary condition input step S12 is determined.
[0037]
In step S14, a summary extraction step is executed. In this summary extraction step S14, a speech paragraph to be adopted is determined using the extraction conditions determined in the extraction condition change step S13, and the total extension time of the speech paragraph to be adopted is calculated.
In step S15, summary reproduction processing is executed, and the speech paragraph string extracted in summary extraction step S14 is reproduced.
FIG. 14 shows details of the speech enhancement probability calculation step shown in FIG.
In step S101, the speech waveform sequence to be summarized is separated into speech sub-paragraphs.
In step S102, audio paragraphs are extracted from the audio sub-paragraph sequence separated in step S101. As described with reference to FIG. 7, an audio paragraph is a unit that consists of one or more audio sub-paragraphs and whose meaning can be understood.
[0038]
In steps S103 and S104, for each audio sub-paragraph extracted in step S101, the probability Ps(e) that the audio sub-paragraph is in an emphasized state (hereinafter referred to as the emphasis probability) and the probability Ps(n) that it is in a calm state (hereinafter referred to as the calm probability) are obtained using the code book described with reference to FIG. 10 and the above-described equations (8) and (10).
In step S105, the emphasis probability Ps(e) and the calm probability Ps(n) obtained in steps S103 and S104 are organized by voice sub-paragraph and stored in the storage means as a speech emphasis probability table.
FIG. 15 shows an example of the speech emphasis probability table stored in the storage means. F1, F2, F3, ... in FIG. 15 denote sub-paragraph probability storage units, each recording the emphasis probability Ps(e) and the calm probability Ps(n) obtained for one audio sub-paragraph. Each of these sub-paragraph probability storage units F1, F2, F3, ... stores the audio sub-paragraph number i assigned to the audio sub-paragraph s, its start time (measured from the beginning of the audio data string to be summarized), its end time, its emphasis probability, its calm probability, the number of frames fn constituting the audio sub-paragraph, and the like.
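One entry of such a table could be represented as in the following sketch. The class name, the field names, the use of seconds as the time unit, and the `duration` helper are assumptions introduced for illustration; the text above only lists which items are stored.

```python
from dataclasses import dataclass


@dataclass
class SubParagraphRecord:
    """One sub-paragraph probability storage unit (F1, F2, F3, ...)."""
    number: int           # audio sub-paragraph number i
    start_time: float     # measured from the beginning of the audio data (seconds assumed)
    end_time: float       # same time base as start_time
    emphasis_prob: float  # Ps(e)
    calm_prob: float      # Ps(n)
    frame_count: int      # number of frames fn in the sub-paragraph

    @property
    def duration(self) -> float:
        """Length of the sub-paragraph, derived from its start and end times."""
        return self.end_time - self.start_time


# Example entry with made-up values.
record = SubParagraphRecord(number=1, start_time=1.5, end_time=4.0,
                            emphasis_prob=0.02, calm_prob=0.01, frame_count=50)
```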
[0039]
The condition input in the summary condition input step S12 is either the summarization ratio r = 1/X (X is a positive integer; r is the reciprocal of the summarization ratio described in the claims), which indicates that the content is to be summarized to 1/X of its total length, or the summary time t.
In response to the setting of the summary condition, in the extraction condition changing step S13, the weighting factor W is set to W = 1 as an initial value, and this weighting factor is input to the summary extracting step S14.
In the summary extraction step S14, with the weighting factor W = 1, the emphasis probability Ps(e) and the calm probability Ps(n) stored for each speech sub-paragraph in the speech emphasis probability table are compared, and the speech sub-paragraphs that satisfy
W · Ps(e) > Ps(n)
are extracted. Speech paragraphs including at least one extracted speech sub-paragraph are then extracted, and the total extension time MT (minutes) of the extracted speech paragraph string is obtained.
[0040]
The total extension time MT (minutes) of the extracted speech paragraph string is compared with the predetermined summary time YT (minutes) determined by the summary condition. If MT ≈ YT (the error of MT relative to YT is within, for example, about ± several percent), the adopted audio paragraph string is reproduced as summary audio.
If the error of the total extension time MT relative to the summary time YT set in the summary condition is larger than the specified value and MT > YT, it is determined that the total extension time MT (minutes) of the extracted audio paragraph string is longer than the summary time YT (minutes) determined by the summary condition, and the extraction condition changing step S13 shown in FIG. 13 is executed again. In the extraction condition changing step S13, in response to the determination that the total extension time MT (minutes) of the speech paragraph string extracted with the weighting factor W = 1 is longer than the summary time YT (minutes) set in the summary condition, the emphasis probability Ps(e) is weighted by multiplying it by a weighting coefficient W smaller than the current value (in the case of the predetermined coefficient according to claim 1, larger than the current value) to give W · Ps(e). The weighting factor W is obtained, for example, by W = 1 − 0.001 × K (K is the number of loops).
[0041]
That is, in the first loop, the array of emphasis probabilities Ps(e) obtained for all speech sub-paragraphs of the speech paragraph string read from the speech emphasis probability table is weighted by multiplying it by the weighting coefficient W = 0.999 determined by W = 1 − 0.001 × 1. The weighted emphasis probabilities W · Ps(e) of all the audio sub-paragraphs are compared with the calm probabilities Ps(n) of the respective audio sub-paragraphs, and the audio sub-paragraphs for which W · Ps(e) > Ps(n) holds are extracted.
In the summary extraction step S14 according to this extraction result, a speech paragraph including the extracted speech sub-paragraph is extracted, and a summary speech paragraph string is obtained again. At the same time, the total extension time MT (minutes) of the summary speech paragraph string is calculated, and the total extension time MT (minutes) is compared with the summary time YT (minutes) determined by the summary conditions. If the comparison result is MT≈YT, the speech paragraph string is determined as summary speech and reproduced.
[0042]
If the result of the first weighting process is still MT > YT, the extraction condition changing step is executed as a second loop. At this time, the weighting factor W is obtained as W = 1 − 0.001 × 2, and all emphasis probabilities Ps(e) are weighted with W = 0.998.
Thus, in this example, by changing the extraction condition so that the value of the weighting factor W is gradually reduced each time the loop is repeated, the number of audio sub-paragraphs that satisfy the condition W · Ps(e) > Ps(n) can be gradually reduced. As a result, a state of MT ≈ YT that satisfies the summary condition can be detected.
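The loop for the case MT > YT can be sketched as follows. This is a simplified illustration under stated assumptions: the data layout (a list of dictionaries with `duration` and `subs`), the function name, the 5 percent tolerance, the loop limit, and the stopping rule (stop as soon as MT no longer exceeds YT by more than the tolerance) are all choices made for this sketch and are not specified in the text above.

```python
def summarize_by_weight(paragraphs, target_time, tolerance=0.05,
                        step=0.001, max_loops=1000):
    """Lower W = 1 - step * K until the extracted time MT fits the target YT.

    paragraphs: list of {"duration": float, "subs": [(ps_e, ps_n), ...]},
                one entry per audio paragraph (hypothetical layout)
    Returns (W, indices of extracted paragraphs, MT).
    """
    w, selected, total = 1.0, [], 0.0
    for k in range(max_loops):
        w = 1.0 - step * k
        # A paragraph is extracted if at least one of its sub-paragraphs
        # satisfies W * Ps(e) > Ps(n).
        selected = [idx for idx, para in enumerate(paragraphs)
                    if any(w * ps_e > ps_n for ps_e, ps_n in para["subs"])]
        total = sum(paragraphs[idx]["duration"] for idx in selected)
        if total <= target_time * (1.0 + tolerance):
            break
    return w, selected, total


# Example with made-up probabilities and durations.
example = [
    {"duration": 10.0, "subs": [(0.5, 0.4)]},
    {"duration": 20.0, "subs": [(0.5, 0.4992)]},
    {"duration": 30.0, "subs": [(0.1, 0.5)]},
]
w, chosen, mt = summarize_by_weight(example, target_time=10.0)
```

In this example the second paragraph is dropped at the second loop (W = 0.998), leaving only the first paragraph in the summary. The sketch does not cover the case MT < YT.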
[0043]
In the above description, MT ≈ YT is used as the convergence condition of the summary time MT, but it is also possible to converge strictly to MT = YT. In this case, for example, when the summary is 5 seconds short of the summary condition and adding one more audio paragraph would overshoot it by 10 seconds, playing only 5 seconds of that audio paragraph makes the summary match the user's summary condition. These 5 seconds may be the 5 seconds around the audio sub-paragraph determined to be emphasized, or the first 5 seconds of the audio paragraph.
Further, when it is determined that MT < YT in the initial state described above, the weighting factor W is made smaller than the current value, for example W = 1 − 0.001 × K, and the array of calm probabilities Ps(n) is multiplied by this weighting factor W so that the calm probability Ps(n) is weighted. As another method, when it is determined that MT > YT in the initial state, the weighting coefficient may be made larger than the current value, W = 1 + 0.001 × K, and the array of calm probabilities Ps(n) may be multiplied by this weighting coefficient W.
[0044]
In the summary playback step S15, the speech paragraph string extracted in the summary extraction step S14 has been described as being played back. In the case of image information with speech, however, the image information corresponding to the speech paragraphs extracted as summary speech can be cut out, connected, and played back together with the sound, which makes it possible to summarize a television broadcast or a movie.
In the above description, weighting is performed by directly multiplying either the emphasis probability or the calm probability obtained for each speech sub-paragraph and stored in the speech emphasis probability table by the weighting factor W. To detect the emphasized state more accurately, however, it is desirable to weight with W^F, that is, the weighting coefficient W raised to the power of the number F of frames constituting each audio sub-paragraph.
[0045]
That is, the emphasis probability Ps(e) calculated using equations (8) and (10) is the product of the emphasized-state probabilities obtained for each frame. Likewise, the probability Ps(n) of being in a calm state is obtained as the product of the calm-state probabilities calculated for each frame. Accordingly, to weight the emphasis probability Ps(e), for example, each emphasized-state probability obtained for a frame is multiplied by the weighting coefficient W. In this case, if the number of frames constituting the audio sub-paragraph is F, the overall weight is W^F.
As a result, the influence of weighting is increased / decreased according to the number F of frames, and an audio sub-paragraph with a larger number of frames, that is, an audio sub-paragraph with a longer extension time is given a higher weight.
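The equivalence between weighting every frame and weighting the product by the F-th power of W can be checked with a short sketch. The function names and the sample probabilities are hypothetical and serve only as illustration.

```python
import math


def weight_per_frame(frame_probs, w):
    """Multiply every per-frame probability by w, then take the product."""
    return math.prod(w * p for p in frame_probs)


def weight_by_power(frame_probs, w):
    """Equivalent form: w raised to the number of frames F, times the product."""
    return (w ** len(frame_probs)) * math.prod(frame_probs)


# Made-up per-frame probabilities for a sub-paragraph with F = 4 frames.
probs = [0.5, 0.25, 0.5, 0.125]
a = weight_per_frame(probs, 0.999)
b = weight_by_power(probs, 0.999)

# The same W attenuates a long sub-paragraph more than a short one.
short_weight = 0.999 ** 20    # about 0.980
long_weight = 0.999 ** 100    # about 0.905
```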
[0046]
However, if it is only necessary to change the extraction condition for determining the emphasized state, the extraction condition can also be changed simply by multiplying the product of the emphasized-state probabilities obtained for each frame, or the product of the calm-state probabilities, by the weighting factor W. Therefore, the weight does not necessarily have to be W^F.
Further, the above description adopts, as the extraction condition changing means, a method of weighting the emphasis probability Ps(e) or the calm probability Ps(n) obtained for each audio sub-paragraph and thereby changing the number of audio sub-paragraphs that satisfy Ps(e) > Ps(n). As another method, the probability ratio Ps(e)/Ps(n) may be calculated from the emphasis probability Ps(e) and the calm probability Ps(n) of every audio sub-paragraph, and the durations of the speech signal sections (speech sub-paragraphs) may be accumulated in descending order of the probability ratio to calculate the total time of the summary sections. When this total time approximately matches the predetermined summary time, the speech signal sections accumulated up to that point are determined to be the summary sections, and the summary speech is organized from them.
[0047]
In this case, when the total extension time of the organized summary speech exceeds or falls short of the summary time set in the summary condition, the extraction condition can be changed by changing the threshold of the probability ratio Ps(e)/Ps(n) above which a section is determined to be in an emphasized state. This extraction condition changing method has the advantage of simplifying the processing required to organize summary speech that satisfies the summary condition.
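The ratio-based method can be sketched as follows. The tuple layout, the function name, the assumption that Ps(n) is greater than zero, and the stopping rule (stop as soon as the accumulated time reaches the target) are simplifications chosen for this illustration.

```python
def summarize_by_ratio(subparagraphs, target_time):
    """Accumulate sub-paragraphs in descending order of Ps(e) / Ps(n).

    subparagraphs: list of (ps_e, ps_n, duration) with ps_n > 0 (assumed)
    Returns (indices of chosen sub-paragraphs in time order, accumulated time).
    """
    ranked = sorted(range(len(subparagraphs)),
                    key=lambda i: subparagraphs[i][0] / subparagraphs[i][1],
                    reverse=True)
    chosen, total = [], 0.0
    for i in ranked:
        if total >= target_time:
            break
        chosen.append(i)
        total += subparagraphs[i][2]
    # Restore the original order so the summary plays chronologically.
    return sorted(chosen), total


# Example with made-up values: (Ps(e), Ps(n), duration).
subs = [(0.2, 0.1, 5.0), (0.9, 0.1, 4.0), (0.1, 0.2, 6.0), (0.3, 0.1, 3.0)]
picked, summary_length = summarize_by_ratio(subs, target_time=7.0)
```

Because the ranking is computed once, changing the summary time only changes where the accumulation stops, which is consistent with the simplification noted above.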
In the above description, the emphasis probability Ps(e) and the calm probability Ps(n) of each voice sub-paragraph are calculated as the product of the emphasized-state probabilities and the product of the calm-state probabilities calculated for each frame. As another method, the average of the emphasized-state probabilities obtained for each frame and the average of the calm-state probabilities may be obtained and used as the emphasis probability Ps(e) and the calm probability Ps(n) of the speech sub-paragraph.
[0048]
Therefore, when this calculation method for the emphasis probability Ps(e) and the calm probability Ps(n) is adopted, the emphasis probability Ps(e) or the calm probability Ps(n) is simply multiplied directly by the weighting coefficient W.
FIG. 16 shows an embodiment of a speech processing apparatus in which the summary rate can be set freely. This embodiment is characterized in that the configuration of the speech emphasis state summarization apparatus shown in FIG. 12 is additionally provided with a summary condition input unit 31, a speech emphasis probability table 32, an emphasized sub-paragraph extraction unit 33, an extraction condition change unit 34, and a summary section provisional determination unit 35, and in that the summary section provisional determination unit 35 is provided with a total extension time calculation unit 35A that obtains the total extension time of the summary speech, a summary section determination unit 35B that determines whether the total extension time of the summary speech calculated by the total extension time calculation unit 35A is within the set range relative to the summary time input via the summary condition input unit 31, and a summary speech storage/reproduction unit 35C that stores and reproduces the summary speech that meets the summary condition.
[0049]
As described with reference to FIG. 11, a speech feature amount is obtained for each frame of the input speech, and the enhancement probability calculation unit 16 and the calm probability calculation unit 17 calculate the enhancement probability and the calm probability for each frame according to the speech feature amount. The emphasis probability and the calm probability are stored in the storage unit 12 together with the frame number assigned to each frame. Further, the audio sub-paragraph sequence number assigned to the audio sub-paragraph sequence determined by the audio sub-paragraph determination unit is added to the frame sequence number, and an address is assigned to each frame and audio sub-paragraph.
In the speech processing apparatus shown in this embodiment, the emphasis probability calculation unit 16 and the calm probability calculation unit 17 read the emphasis probability and the calm probability of each frame stored in the storage unit 12, obtain from them the emphasis probability Ps(e) and the calm probability Ps(n) for each speech sub-paragraph, and store these in the speech emphasis probability table 32.
[0050]
The speech emphasis probability table 32 stores the emphasis probabilities and calm probabilities obtained for each audio sub-paragraph of the speech waveforms of various contents, so that summarization can be performed at any time in response to a user's request. The user inputs the summary condition to the summary condition input unit 31. The summary condition here refers to the name of the content to be summarized and the summarization rate relative to the total time of the content. The summarization rate may be input, for example, as an instruction to summarize the content to 1/10 of its total length, or as an instruction to summarize it to 10 minutes. For example, when 1/10 is input, the summary time calculation unit 31A calculates the time obtained by reducing the total length of the content to 1/10, and sends the calculated summary time to the summary section determination unit 35B of the summary section provisional determination unit 35.
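The conversion performed by the summary time calculation unit 31A can be sketched as below. The function name, the parameter names, and the choice of seconds as the unit are assumptions for illustration only.

```python
def summary_time(total_length, rate_denominator=None, requested_time=None):
    """Return the target summary time YT in the same unit as total_length.

    rate_denominator: X when the user asks for 1/X of the total length
    requested_time:   an explicit summary time, if the user gives one
    """
    if requested_time is not None:
        return requested_time
    if rate_denominator is not None:
        return total_length / rate_denominator
    raise ValueError("either a rate or a time must be given")


# A 60-minute program (3600 s) summarized to 1/10, or to 10 minutes.
by_rate = summary_time(3600.0, rate_denominator=10)
by_time = summary_time(3600.0, requested_time=600.0)
```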
[0051]
In response to the summary condition being input to the summary condition input unit 31, the control unit 19 starts the operation of generating the summary speech. As the first operation, the emphasis probabilities and calm probabilities of the content desired by the user are read from the speech emphasis probability table 32. The read emphasis probabilities and calm probabilities are sent to the emphasized sub-paragraph extraction unit 33, and the numbers of the audio sub-paragraphs determined to be in the emphasized state are extracted.
There are two methods of changing the condition for extracting speech sections in the emphasized state. In one, the above-described emphasis probability Ps(e) or calm probability Ps(n) is multiplied by a weighting coefficient W that is the reciprocal of the probability ratio, the audio sub-paragraphs satisfying W · Ps(e) > Ps(n) are extracted, and the summary audio is obtained from the audio paragraphs that include these audio sub-paragraphs. In the other, the probability ratio Ps(e)/Ps(n) is calculated and the sections are accumulated in descending order of this ratio to obtain the summary time.
[0052]
As the initial value of the extraction condition, when the extraction condition is changed by weighting, the weighting coefficient W may be set to W = 1. When the emphasized state is determined from the value of the probability ratio Ps(e)/Ps(n) of the emphasis probability Ps(e) and the calm probability Ps(n) obtained for each voice sub-paragraph, the initial setting may be, for example, to determine a sub-paragraph with Ps(e)/Ps(n) ≧ 1 to be in the emphasized state. Data representing the number, start time, and end time of each voice sub-paragraph determined to be in the emphasized state under the initial setting are sent from the emphasized sub-paragraph extraction unit 33 to the summary section provisional determination unit 35. The summary section provisional determination unit 35 searches the speech paragraph string stored in the storage unit 12 and extracts the speech paragraphs that include the sub-paragraph numbers determined to be in the emphasized state. The total extension time of the extracted speech paragraph string is calculated by the total extension time calculation unit 35A, and the summary section determination unit 35B compares this total extension time with the summary time input as the summary condition. If the comparison result satisfies the summary condition, the speech paragraph string is stored in the summary speech storage/reproduction unit 35C and played back. In this reproduction operation, the audio paragraphs are identified from the numbers of the audio sub-paragraphs determined to be in the emphasized state by the emphasized sub-paragraph extraction unit 33, and the audio data or video data of each content, designated by the start time and end time of each audio paragraph, is output as summary audio and summary video data.
[0053]
When the summary section determination unit 35B determines that the summary condition is not satisfied, it outputs an extraction condition change instruction to the extraction condition change unit 34 and causes the extraction condition change unit 34 to change the extraction condition. The extraction condition change unit 34 changes the extraction condition and inputs it to the emphasized sub-paragraph extraction unit 33. In accordance with the extraction condition input from the extraction condition change unit 34, the emphasized sub-paragraph extraction unit 33 again compares the emphasis probability and the calm probability of each audio sub-paragraph stored in the speech emphasis probability table 32.
The extraction result of the emphasized sub-paragraph extraction unit 33 is sent again to the summary section provisional determination unit 35, which extracts the audio paragraphs including the audio sub-paragraphs determined to be in the emphasized state. The total extension time of the extracted speech paragraphs is calculated, and the summary section determination unit 35B determines whether the calculation result satisfies the summary condition. This operation is repeated until the summary condition is satisfied, and the audio paragraph string that satisfies the summary condition is read from the storage unit 12 as summary audio and summary video data and distributed to the user terminal.
[0054]
In the above description, the start time and end time of the audio paragraph string determined as the summary section are used as the start time and end time of the summary section. In the case of content with video, however, another conceivable method is to detect the cut points of the video signal close to the start time and end time of the audio paragraph string determined as the summary section, and to define the start time and end time of the summary section by the times of these cut points (detected, for example, using a signal generated when screen switching is detected, as described in JP-A-8-32924). When the cut points of the video signal are used as the start time and end time of the summary section in this way, the switching of summary sections is synchronized with the switching of images, which has the advantage of improving viewability and making the summary easier to understand.
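Aligning a summary section with video cut points can be sketched as follows. The text only says that cut points close to the start and end times are used; choosing the nearest cut point on either side, and the function names and sample times, are assumptions of this sketch. How the cut points themselves are detected is outside its scope.

```python
def snap_to_cut(time, cut_points):
    """Return the cut point nearest to time, or time itself if none exist."""
    if not cut_points:
        return time
    return min(cut_points, key=lambda c: abs(c - time))


def snap_section(start, end, cut_points):
    """Move both ends of a summary section to their nearest cut points."""
    return snap_to_cut(start, cut_points), snap_to_cut(end, cut_points)


# Made-up cut point times (seconds) and one summary section.
cuts = [0.0, 12.0, 30.5, 47.0]
section = snap_section(13.2, 45.0, cuts)
```

A practical version would also need to handle the case in which both ends snap to the same cut point, which this sketch does not address.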
[0055]
As described above, the speech summarization method used in the present invention extracts, as summary sections, speech sections that have a high probability of being in an emphasized state, so the video information extracted with the same start time and end time as these speech sections is in many cases an important part of the content. As a result, the summarization method used in the present invention has the advantage of yielding summary information that properly conveys the content to the viewer.
FIG. 1 shows an example of the processing procedure of the video reproduction method according to the present invention.
In the present invention, a real-time video signal and an audio signal being broadcast are recorded in association with a reproduction time, and the recorded portion is summarized and reproduced while maintaining the recording state. Hereinafter, this situation is referred to as catch-up reproduction.
[0056]
In step S111, the catch-up playback target start time or the catch-up playback target start video is specified. For this specification, for example, when the viewer leaves the room temporarily while watching the program, the time for leaving the seat is designated by a push button operation. Alternatively, a sensor is provided on the door of the room, and the absence time is designated by detecting that the viewer has left the room as the door is opened and closed. Alternatively, a method may be considered in which a part of a program that has already been recorded is fast-forwarded and the viewer arbitrarily identifies the start position of catch-up playback from the video regardless of the absence.
In step S112, a summary condition (summary time or summary rate) is input. This input is performed when the viewer who has left the seat returns. In other words, when the viewer has been away for, for example, 30 minutes, the viewer inputs a summary condition reflecting how far the viewer wants the content broadcast during those 30 minutes to be compressed for viewing. As other input methods, a predetermined default value, for example 3 minutes, may be used when there is no input from the viewer, or several candidates may be prepared and displayed so that the viewer selects one of them.
[0057]
Further, for example, when the viewer returns home while recording that was started automatically by reserved recording is in progress, the recording start time is known from the reservation, and the summary end time is determined by the summary playback start instruction. In this case, if the summary condition has been determined in advance, for example by a default value, the section from the recording start time to the summary end time is summarized according to that condition.
In step S113, the start of catch-up reproduction is designated. Designating the start of catch-up reproduction also designates the end point (summary end time) of the section to be summarized. As an input method, the viewer may designate this time by a push-button operation instructing the start of catch-up reproduction, or the viewer's entry may be detected by an open/close sensor provided on the door of the room, with the entry time taken as the start of catch-up reproduction.
[0058]
In step S114, playback of the currently broadcast video is stopped.
In step S115, summary processing and summary video/audio reproduction are performed. The summary processing specifies summary sections, in accordance with the specified summary condition, for the audio signal from the catch-up playback target start time (or the catch-up playback target start video) determined in step S111 to the summary end time input in step S113, and the audio signal and the video signal synchronized with the summary sections are reproduced.
In step S116, the playback of the summary section ends.
In step S117, reproduction of the video currently being broadcast is resumed.
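The inputs gathered in steps S111 to S113 can be illustrated with a minimal Python sketch. It only shows how a start time, a playback request time, and an optional summary condition could be turned into a section to be summarized and a summary time. The names `CatchUpPlan` and `plan_catch_up`, the use of seconds, and the clamping of the summary time to the section length are assumptions made for illustration and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_SUMMARY_TIME_S = 180.0  # the 3-minute default mentioned for step S112


@dataclass
class CatchUpPlan:
    target_start_s: float   # S111: start of the section to be summarized
    target_end_s: float     # S113: time at which catch-up playback is requested
    summary_time_s: float   # total playback time allotted to the summary


def plan_catch_up(target_start_s: float,
                  playback_request_s: float,
                  summary_time_s: Optional[float] = None,
                  summary_rate: Optional[float] = None) -> CatchUpPlan:
    """Turn the inputs of steps S111 to S113 into a summary target section."""
    if playback_request_s <= target_start_s:
        raise ValueError("playback must be requested after the target start time")
    duration = playback_request_s - target_start_s
    if summary_rate is not None:
        if not 0.0 < summary_rate < 1.0:
            raise ValueError("summary rate must satisfy 0 < r < 1")
        summary_time_s = duration * summary_rate
    elif summary_time_s is None:
        summary_time_s = DEFAULT_SUMMARY_TIME_S  # no input from the viewer
    # Assumption: a summary is never longer than the section it summarizes.
    summary_time_s = min(summary_time_s, duration)
    return CatchUpPlan(target_start_s, playback_request_s, summary_time_s)
```

For a viewer who was away for 30 minutes and enters no condition, `plan_catch_up(0.0, 1800.0)` yields a plan with the default summary time of 180 seconds.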
[0059]
FIG. 2 shows an example of the video reproduction apparatus 100 of the present invention that executes the above-described catch-up reproduction. The video reproduction apparatus 100 according to the present invention can be composed of a recording unit 101, an audio separation unit 102, an audio summarization unit 103, a summary section reading unit 104, and a mode switching unit 105.
The recording unit 101 uses recording/reproducing means capable of writing and reading at high speed, such as a hard disk, a semiconductor memory, a DVD-ROM, or the like. Since writing and reading can be executed at high speed, the already recorded part can be reproduced while the program currently being broadcast is being recorded. The input signal S1 is input from a TV tuner or the like and may be either an analog signal or a digital signal. Recording in the recording unit 101, however, is performed using a digital signal.
[0060]
The audio separation unit 102 separates the audio signal from the video signal with audio in the designated section to be summarized and inputs this audio signal to the audio summarization unit 103. The audio summarization unit 103 uses this audio signal to extract emphasized portions and specify the summary sections.
Note that the audio summarization unit 103 continuously analyzes the audio signal during recording, creates the speech enhancement probability table shown in FIG. 15 for each program being recorded, and stores it in a storage unit. Accordingly, when summary playback of the already recorded portion is performed partway through a program that is still being broadcast, the summary is produced using this speech enhancement probability table. The same table is also used when a summary of a recorded program is viewed at a later date.
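How a stored per-program table might be used to pick summary sections can be sketched as follows. The abstract describes selecting audio periods whose ratio of emphasized-state probability to calm-state probability exceeds a coefficient, and choosing the selection whose total time is close to the requested summary time. The sketch below ranks sections by that ratio and keeps the top-ranked group whose total duration is nearest the target, which corresponds to lowering the coefficient step by step. The `TableEntry` fields and the function name are hypothetical, and the actual layout of the table in FIG. 15 is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableEntry:
    start_s: float      # start time of the audio section
    end_s: float        # end time of the audio section
    p_emphasis: float   # probability of the emphasized state
    p_calm: float       # probability of the calm state (assumed > 0)


def select_summary_sections(table: List[TableEntry],
                            target_time_s: float) -> List[TableEntry]:
    """Pick the sections with the highest emphasis/calm ratio whose total
    duration is closest to target_time_s."""
    ranked = sorted(table, key=lambda e: e.p_emphasis / e.p_calm, reverse=True)
    best_count, best_error = 0, abs(target_time_s)
    total = 0.0
    for count, entry in enumerate(ranked, start=1):
        total += entry.end_s - entry.start_s
        error = abs(total - target_time_s)
        if error < best_error:
            best_count, best_error = count, error
    # Return the chosen sections in broadcast order for playback.
    return sorted(ranked[:best_count], key=lambda e: e.start_s)
```

Because the table is built once during recording, changing the summary time or rate only requires rerunning this selection, not reanalyzing the audio.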
[0061]
The summary section reading unit 104 reads the video signal with audio from the recording unit 101 according to the summary sections specified by the audio summarization unit 103 and outputs it to the mode switching unit 105. The mode switching unit 105 outputs the video signal with audio read by the summary section reading unit 104 as a summary video signal, which is provided to the viewer for viewing.
In addition to mode a, which outputs the summary video, the mode switching unit 105 can be switched to a playback mode b, which outputs the video signal read from the recording unit 101, and a mode c, which provides the input signal S1 directly for viewing.
In the catch-up playback method described above, however, the video broadcast while catch-up playback is being executed is not included in the section to be summarized, so there is the disadvantage that the viewer cannot view that video.
[0062]
Therefore, the present invention proposes the following embodiment in order to eliminate this drawback. After playback of the summary section is completed, the summary processing and the summary video/audio playback processing are repeated, with the previous playback start time as the new summary target start time and the playback end time of the summary section as the new summary end time. The repetition ends when the time between the previous playback start time and the playback end time of the summary section is equal to or less than a predetermined time (for example, 5 to 10 seconds).
In this case, a problem arises in that the summary sections are played back for longer than the specified summary rate or summary time. For example, let T_A be the duration of the section to be summarized and r the summary rate (0 < r < 1, r = summary time (total duration of the summary sections) / duration of the section to be summarized). The summary time T₁ of the first summarization is T_A·r. In the second summarization, the first summary time is itself summarized at the summary rate r, so the summary time is T_A·r². Since this process is repeated in sequence, the time required until the repetition ends is T_A·r / (1 − r).
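The closed form follows from a geometric series. Writing T_k for the summary time of the k-th pass (a symbol introduced here for illustration) and ignoring the early termination at the 5 to 10 second threshold, so that the sum is a limit that the actual total approaches from below:

```latex
T_k = T_A\, r^{k}, \qquad
\sum_{k=1}^{\infty} T_k \;=\; T_A \sum_{k=1}^{\infty} r^{k} \;=\; \frac{T_A\, r}{1 - r},
\qquad 0 < r < 1 .
```

For example, with T_A = 30 minutes and r = 0.1, the total playback time is 30 × 0.1 / 0.9 ≈ 3.33 minutes, slightly more than the 3 minutes implied by the specified rate.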
[0063]
Therefore, the specified summary rate r is adjusted to r / (1 + r), and summarization is performed at the adjusted summary rate. In that case, the time required until the repetition ends is T_A·r, which is the summary time that matches the specified summary rate. Similarly, when a summary time T₁ is specified, the duration T_A of the section to be summarized is known, so the specified summary rate r is T₁ / T_A. This summary rate is therefore adjusted to T₁ / (T_A + T₁), and the first summary time is set to T_A·T₁ / (T_A + T₁).
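The effect of the adjustment can be checked numerically with a short Python sketch. It sums the summary times of the repeated passes, where each pass summarizes what was broadcast during the previous pass, and compares the unadjusted rate r with the adjusted rate r / (1 + r). The function names, the `stop_below_s` threshold parameter, and the `max_passes` guard are assumptions made for illustration.

```python
def total_playback_time(target_duration_s: float, rate: float,
                        stop_below_s: float = 0.0,
                        max_passes: int = 1000) -> float:
    """Sum the summary times of repeated passes, where each pass
    summarizes the section broadcast during the previous pass."""
    total = 0.0
    section = target_duration_s
    for _ in range(max_passes):
        if section <= stop_below_s:
            break                    # remaining section is short enough: stop
        summary = section * rate     # playback time of this pass
        total += summary
        section = summary            # next target: what aired in the meantime
    return total


def adjusted_rate(rate: float) -> float:
    """Rate adjustment r -> r / (1 + r) described in the text."""
    return rate / (1.0 + rate)
```

With T_A = 1800 seconds and r = 0.1, `total_playback_time(1800, 0.1)` comes to about 200 seconds, whereas `total_playback_time(1800, adjusted_rate(0.1))` comes to about 180 seconds, which is T_A·r. Setting `stop_below_s` to a value such as 10 ends the repetition early and makes the total slightly shorter.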
FIG. 3 shows another embodiment for solving the above drawback. In this embodiment, the input signal S1 is output as it is, so that the video currently being broadcast is displayed on the main screen 200 of the display (see FIG. 4). In addition, a sub-screen conversion processing unit 106 is provided in the mode switching unit 105, and a mixed mode d is provided in which the summary video/audio that has undergone sub-screen conversion is superimposed on the input signal S1 and output, so that the summary video is displayed on the sub-screen 201 (see FIG. 4).
[0064]
According to this embodiment, the viewer can watch a summary of the previously broadcast part of the program on the sub-screen 201 while watching the program currently being broadcast on the main screen 200. Because the viewer continues to receive the content being broadcast from the main screen 200 while viewing the summary, by the time the whole summary has been played back the viewer can follow the program almost seamlessly from its first half up to the content currently being broadcast.
The video playback method according to the present invention described above is realized by causing a computer to execute a video playback program. In this case, the video playback program is downloaded via a communication line, or stored on a recording medium such as a CD-ROM or a magnetic disk, and installed in the computer, and the method of the present invention is executed by a processing device such as a CPU in the computer.
[0065]
[Effect of the invention]
As described above, according to the present invention, a recorded program can be summarized by compressing it at an arbitrary compression rate, and the summary can be reproduced. The contents of a large number of recorded programs can therefore be grasped in a short time, which has the advantage that, for example, searching for a program to watch can be completed quickly.
In addition, even when a viewer joins partway through a program that is being broadcast and could not watch it from the beginning, the viewer can watch a summary of the first half of the program and then continue watching the rest.
[Brief description of the drawings]
FIG. 1 is a flowchart for explaining an example of a procedure when a video reproduction method according to the present invention is executed.
FIG. 2 is a block diagram for explaining an embodiment of a video reproduction apparatus according to the present invention.
FIG. 3 is a block diagram for explaining a modified embodiment of the video reproduction apparatus according to the present invention.
FIG. 4 is a diagram for explaining an example of a display screen for explaining features of the video reproduction apparatus according to the present invention.
FIG. 5 is a flowchart for explaining the previously proposed speech summarization method.
FIG. 6 is a flowchart for explaining a previously proposed method for extracting an audio paragraph.
FIG. 7 is a diagram for explaining the relationship between an audio paragraph and an audio sub-paragraph.
FIG. 8 is a flowchart showing an example of a method for determining an utterance state of an input speech sub-paragraph in step S2 shown in FIG.
FIG. 9 is a flowchart showing an example of a procedure for creating a code book used in the previously proposed speech summarization method.
FIG. 10 is a diagram showing a storage example of a code book used in the present invention.
FIG. 11 is a waveform diagram for explaining speech state likelihood calculation.
FIG. 12 is a block diagram for explaining one embodiment of the previously proposed speech enhancement state determination device and speech summarization device.
FIG. 13 is a flowchart for explaining a summarization method capable of freely changing the summarization rate.
FIG. 14 is a flowchart for explaining a voice sub-paragraph extraction operation, a voice sub-paragraph enhancement probability calculation operation, and a voice sub-paragraph calm probability extraction operation used for voice summarization.
FIG. 15 is a diagram for explaining the configuration of a speech enhancement probability table used in the speech summarization device.
FIG. 16 is a block diagram for explaining an example of a voice summarizing apparatus that can freely change the summarization rate.
[Explanation of symbols]
100 Video reproduction apparatus
101 Recording unit
102 Audio separation unit
103 Audio summarization unit
104 Summary section reading unit
105 Mode switching unit
106 Sub-screen conversion processing unit

Claims

A video summarization device comprising:
storage means for storing a real-time video signal and audio signal in association with time;
summary start time input means for inputting the summary start time of a section to be summarized;
summary condition input means for inputting a summary condition determined by a summary rate;
summary playback start instruction means for inputting a summary playback start instruction;
summary section determination means for determining, with the time at which the summary playback start instruction is input as the summary end time of the section to be summarized, a summary section from among sections determined to be in an emphasized state for the audio signal in the section to be summarized from the summary start time to the summary end time, such that the ratio of the total duration of the summary section to the duration of the section to be summarized equals the summary rate;
output means for outputting the audio signal and video signal of the summary section determined by the summary section determination means to reproduction means for reproducing them;
summary condition setting means for setting the playback start time of the summary section as a new summary start time and the playback end time of the summary section as a new summary end time; and
repeating means for repeating the setting of the summary start time and summary end time by the summary condition setting means, the determination of the summary section by the summary section determination means, and the output of the audio signal and video signal of the summary section to the reproduction means by the output means.

A video summarization device comprising:
storage means for storing a real-time video signal and audio signal in association with time;
summary start time input means for inputting the summary start time of a section to be summarized;
summary condition input means for inputting a summary condition determined by a summary time;
summary playback start instruction means for inputting a summary playback start instruction;
summary section determination means for determining, with the time at which the summary playback start instruction is input as the summary end time of the section to be summarized, a summary section from among sections determined to be in an emphasized state for the audio signal in the section to be summarized from the summary start time to the summary end time, such that the total duration of the summary section equals the summary time;
output means for outputting the audio signal and video signal of the summary section determined by the summary section determination means to reproduction means for reproducing them;
summary condition setting means for taking the ratio of the summary time to the duration of the section to be summarized as a summary rate, setting the playback start time of the summary section as a new summary start time and the playback end time of the summary section as a new summary end time, and obtaining a new summary time such that the ratio of the new summary time to the duration of the new section to be summarized equals the summary rate; and
repeating means for repeating the setting of the summary time by the summary condition setting means, the determination of the summary section by the summary section determination means, and the output of the audio signal and video signal of the summary section to the reproduction means by the output means.

The video summarization device according to claim 1,
wherein the summary section determination means adjusts the input summary rate r (r is a real number satisfying 0 < r < 1) to r / (1 + r) and, in determining each summary section, determines the summary section from among the sections determined to be in an emphasized state such that the ratio of the total duration of the summary section to the duration of the section to be summarized equals the adjusted summary rate.

The video summarization device according to claim 2,
wherein, in determining the first summary section, the summary section determination means adjusts the summary rate to T₁ / (T_A + T₁), based on the duration T_A of the section to be summarized defined by the input summary start time and on the input summary time T₁, adjusts the summary time such that the ratio of the total duration of the summary section to the duration of the section to be summarized equals the adjusted summary rate, and determines the summary section from among the sections determined to be in an emphasized state for the audio signal in the section to be summarized from the summary start time to the current summary end time, such that the total duration of the summary section equals the adjusted summary time,
in determining each repeated summary section, the summary condition setting means obtains a new summary time such that the ratio of the new summary time to the duration of the new section to be summarized equals the summary rate, and
the summary section determination means determines the summary section such that the total duration of the summary section equals the new summary time.