JP2003259311A

JP2003259311A - Video reproducing method, video reproducing apparatus, and video reproducing program

Info

Publication number: JP2003259311A
Application number: JP2002060844A
Authority: JP
Inventors: Kota Hidaka (日髙 浩太); Shinya Nakajima (中嶌 信弥); Osamu Mizuno (水野 理)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-03-06
Filing date: 2002-03-06
Publication date: 2003-09-12
Anticipated expiration: 2022-03-06
Also published as: JP3803302B2

Abstract

PROBLEM TO BE SOLVED: To provide a video reproducing method that compresses the content of a recorded program at an arbitrary summarization rate and reproduces the resulting summary.

SOLUTION: The video reproducing method employs a speech processing method that judges the emphasized state from feature quantities obtained by analyzing the audio signal frame by frame. It uses a codebook that stores, in association with each other, the appearance probability in the emphasized state and the appearance probability in the calm state, and obtains from it the emphasized-state and calm-state appearance probabilities corresponding to the feature quantities of each analyzed frame. For the audio signal to be summarized, which is recorded on a recording medium, it calculates the probability of being in the emphasized state from the emphasized-state appearance probabilities and the probability of being in the calm state from the calm-state appearance probabilities. It takes as summary sections those audio signal sections in which the ratio of the emphasized-state probability to the calm-state probability exceeds a prescribed coefficient, calculates the total duration of the summary sections, settles on the summary sections whose total duration is approximately the prescribed summary time, and reads those sections from the recording medium and reproduces them.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a video reproducing method, a video reproducing apparatus, and a video reproducing program that summarize and reproduce various kinds of video with audio recorded on a recording medium.

【０００２】[0002]

[Description of the Related Art] Various summarization methods have been proposed. One is an apparatus that extracts, from each block of an entire moving image, a section moving image consisting of a plurality of consecutive frames, and joins the section moving images extracted from the blocks into a digest image; see, for example, Japanese Patent Laid-Open Nos. H8-9310, H3-90968, and H6-165009. Another is a time-compression method for audio segments that precisely controls the pause compression ratio to create a highly intelligible digest; see, for example, Japanese Patent Laid-Open No. 2001-154700.

[0003] There has also been a system that uses telops (on-screen captions) and sound information to extract the scenes that characterize a program video and compile them into a digest video; see, for example, Japanese Patent Laid-Open No. 2001-23062.

【０００４】[0004]

[Problems to Be Solved by the Invention] To summarize content into an arbitrary duration, or to generate a digest of it, the priority of each scene making up the content must be determined in advance. In Japanese Patent Laid-Open Nos. H8-9310, H3-90968, and H6-165009, the user marks the scenes he or she considers important with a pointing device such as a joystick or with a plurality of buttons, thereby attaching digest priority information. This places a heavy burden on the user who wants a digest.

[0005] Japanese Patent Laid-Open No. 2001-154700 generates a digest by pause compression. However, since most content consists largely of non-pause sections, merely removing pauses cannot realistically compress content at a high compression rate, such as bringing the summary playback time down to 1/10 of the original playback time or beyond. In Japanese Patent Laid-Open No. 2000-23062, the summary sections of the digest video are identified from the volume of the sound information alone, and such sections are not necessarily important ones, because a speaker who emphasizes a key point does not always speak louder. When telop information is used, no digest can be generated for content that contains no telops, or for sections in which no telop appears.

[0006] When a viewer is receiving and reproducing a real-time audio signal with video, such as a live broadcast, and misses part of the program, for example by leaving the seat, it would help if the missed first part could be viewed in summary while recording continues. The viewer could then watch the remainder with an understanding of how the first part of the program unfolded. An object of the present invention is to propose a video reproducing method, a video reproducing apparatus, and a video reproducing program capable of compressing video stored on a recording medium to an arbitrary duration and reproducing it.

【０００７】[0007]

[Means for Solving the Problems] The present invention proposes a video reproducing method that: stores a real-time video signal and audio signal in association with reproduction time; receives a summary start time as input; receives as input either a summary time, which is the total duration of the summary sections, or a summarization rate, which is the ratio of the total duration of the summary sections to the whole section to be summarized; determines as summary sections, in accordance with the summary time or summarization rate, those sections judged to be in the emphasized state in the audio signal of the section to be summarized, which runs from the summary start time to the present time taken as the summary end time; and reproduces the audio signal and video signal of the summary sections.

[0008] The present invention further proposes a video reproducing method that sets the time at which reproduction of the audio and video signals of the summary sections ends as the end time of a new summary section, sets the reproduction end time of the summary sections as the new summary-section reproduction start time, and repeats the determination of summary sections and the reproduction of their audio and video signals.

The present invention further proposes a video reproducing method in which the summarization rate r (a real number with 0 < r < 1) is adjusted to r / (1 + r), and the summary sections are determined using the adjusted summarization rate.

The present invention further proposes a video reproducing method that uses a codebook storing, in association with one another, feature quantities including at least the fundamental frequency or pitch period, power, the temporal variation characteristic of a dynamic feature quantity, or inter-frame differences of these, together with their appearance probability in the emphasized state and their appearance probability in the calm state. The method obtains the emphasized-state appearance probability corresponding to the feature quantities obtained by analyzing the audio signal frame by frame, and likewise obtains the corresponding calm-state appearance probability. It calculates the probability of being in the emphasized state from the emphasized-state appearance probabilities and the probability of being in the calm state from the calm-state appearance probabilities. For each audio signal section it calculates the ratio of the emphasized-state probability to the calm-state probability, accumulates the durations of the audio signal sections in descending order of that probability ratio to calculate the summary time, and determines as the summary sections those audio signal sections for which the summarization rate, that is, the ratio of the summary time to the whole section to be summarized, equals the input summarization rate.
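The ranking step in the last proposal of paragraph [0008] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `select_summary_sections`, the `(duration, probability_ratio)` tuple layout, and the choice to stop at the first section that would overshoot the budget are all assumptions made for the example.

```python
def select_summary_sections(sections, target_rate):
    """Pick sections in descending probability-ratio order until the budget is used.

    sections: list of (duration_seconds, probability_ratio) in playback order
              (hypothetical layout for this sketch).
    target_rate: desired summary duration divided by total duration, 0 < rate < 1.
    Returns (indices of chosen sections in playback order, summary duration).
    """
    total = sum(duration for duration, _ in sections)
    budget = target_rate * total

    # Rank section indices by probability ratio, highest first.
    ranked = sorted(range(len(sections)),
                    key=lambda i: sections[i][1],
                    reverse=True)

    chosen, used = [], 0.0
    for i in ranked:
        duration = sections[i][0]
        if used + duration > budget:
            break  # assumption: stop at the first section that would overshoot
        chosen.append(i)
        used += duration

    # Restore playback order so the summary plays chronologically.
    return sorted(chosen), used


sections = [(10, 0.5), (20, 3.0), (30, 1.2), (40, 2.0)]
indices, summary_time = select_summary_sections(sections, 0.65)
print(indices, summary_time)
```

Stopping at the first overshoot keeps the selection strictly in rank order; skipping the oversized section and trying shorter ones would fill the budget more tightly at the cost of admitting lower-ranked sections.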

[0009] The present invention further proposes a video reproducing method that uses a codebook storing, in association with one another, feature quantities including at least the fundamental frequency or pitch period, power, the temporal variation characteristic of a dynamic feature quantity, or inter-frame differences of these, together with their appearance probability in the emphasized state and their appearance probability in the calm state. The method obtains the emphasized-state appearance probability and the calm-state appearance probability corresponding to the feature quantities obtained by analyzing the audio signal frame by frame, calculates the probability of being in the emphasized state from the former and the probability of being in the calm state from the latter, and provisionally judges as summary sections those audio signal sections whose ratio of emphasized-state probability to calm-state probability exceeds a predetermined coefficient. It then calculates the total duration of the summary sections, or, as the summarization rate, the ratio of the duration of the entire audio signal to the total duration of the summary sections, and determines the summary sections by calculating the predetermined coefficient at which the total duration of the summary sections approximately equals the predetermined summary time, or the summarization rate approximately equals the predetermined summarization rate.
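Paragraph [0009] leaves open how the coefficient is found. One plausible approach, shown here only as a sketch, is a bisection search over the threshold: raising the coefficient can only shrink the set of sections above it, so the total selected duration is monotone in the coefficient. The function names and the `(duration, probability_ratio)` layout are assumptions for the example.

```python
def total_duration_above(sections, coefficient):
    """Total duration of sections whose probability ratio exceeds the coefficient."""
    return sum(duration for duration, ratio in sections if ratio > coefficient)


def find_coefficient(sections, target_time, iterations=50):
    """Bisection for the smallest coefficient whose selection fits target_time.

    sections: list of (duration_seconds, probability_ratio), hypothetical layout.
    The search interval starts at [0, max ratio]; nothing exceeds the upper
    end, so the upper end always satisfies the target.
    """
    low = 0.0
    high = max(ratio for _, ratio in sections)
    for _ in range(iterations):
        middle = (low + high) / 2.0
        if total_duration_above(sections, middle) > target_time:
            low = middle   # too much selected: raise the threshold
        else:
            high = middle  # fits: try a lower threshold
    return high


sections = [(10, 0.5), (20, 3.0), (30, 1.2), (40, 2.0)]
k = find_coefficient(sections, target_time=60)
print(k, total_duration_above(sections, k))
```

Because the selected duration changes in steps, the result only approximates the target, which matches the "approximately equals" wording of the paragraph.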

[0010] The present invention further proposes a video reproducing method that determines, for each frame of the audio signal, whether the frame is silent and whether it is voiced; judges a portion that contains voiced sections and is bounded by silent sections of at least a predetermined number of frames to be a speech sub-paragraph; and judges to be a speech paragraph a group of speech sub-paragraphs that ends with a speech sub-paragraph in which the average power of the voiced sections it contains is smaller than a predetermined constant multiple of the average power within that speech sub-paragraph. The audio signal sections are defined per speech paragraph, and the summary time is obtained by accumulating durations speech paragraph by speech paragraph.
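The sub-paragraph rule of paragraph [0010] can be sketched as a single pass over per-frame silence flags. This is a simplified illustration under stated assumptions: every non-silent frame is treated as speech, a trailing stretch is closed at the end of the input even without a long silence after it, and no leading silence is required before the first stretch. The name `split_sub_paragraphs` is hypothetical.

```python
def split_sub_paragraphs(silent, min_silent_frames):
    """Split a frame sequence into stretches separated by long silences.

    silent: list of booleans, True where the frame is silent.
    min_silent_frames: a silent run of at least this many frames ends a stretch.
    Returns a list of (start, end) frame indices, end exclusive.
    """
    spans = []
    start = None        # first non-silent frame of the open stretch
    last_sound = None   # most recent non-silent frame
    silent_run = 0      # length of the current run of silent frames

    for index, is_silent in enumerate(silent):
        if is_silent:
            silent_run += 1
            if start is not None and silent_run >= min_silent_frames:
                spans.append((start, last_sound + 1))
                start = None
        else:
            if start is None:
                start = index
            last_sound = index
            silent_run = 0

    if start is not None:  # assumption: close the final stretch at end of input
        spans.append((start, last_sound + 1))
    return spans


T, F = True, False
flags = [T, T, F, F, T, F, T, T, T, F, F, T, T, T]
print(split_sub_paragraphs(flags, 3))
```

Short silences (here the single silent frame at index 4) stay inside a sub-paragraph; only runs reaching the minimum length act as boundaries.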

[0011] The present invention further proposes a video reproducing apparatus comprising: storage means for storing a real-time video signal and audio signal in association with reproduction time; summary start time input means for inputting a summary start time; summary condition input means for inputting a summary condition defined by a summary time, which is the total duration of the summary sections, or by a summarization rate, which is the ratio of the total duration of the summary sections to the whole section to be summarized; summary section determining means for determining as summary sections, in accordance with the summary condition, those sections judged to be in the emphasized state in the audio signal of the section to be summarized, which runs from the summary start time to the present time taken as the summary end time; and reproducing means for reproducing the audio signal and video signal of the summary sections determined by the summary section determining means.

[0012] The present invention further proposes the video reproducing apparatus described above in which the summary section determining means comprises: a codebook that stores, in association with one another, feature quantities including at least the fundamental frequency or pitch period, power, the temporal variation characteristic of a dynamic feature quantity, or inter-frame differences of these, together with their appearance probability in the emphasized state and their appearance probability in the calm state; an emphasized state probability calculation unit that uses the codebook to obtain the emphasized-state appearance probability corresponding to the feature quantities obtained by analyzing the audio signal frame by frame, and calculates from it the probability of being in the emphasized state; a calm state probability calculation unit that uses the codebook to obtain the calm-state appearance probability corresponding to the feature quantities obtained by analyzing the audio signal frame by frame, and calculates from it the probability of being in the calm state; a summary section provisional determination unit that calculates, for each audio signal section, the ratio of the emphasized-state probability to the calm-state probability, accumulates the durations of the audio signal sections in descending order of that probability ratio to calculate the summary time, and provisionally determines the summary sections; and a summary section determination unit that determines as the summary sections those audio signal sections for which the ratio of the summary sections to the whole section to be summarized satisfies the summarization rate.
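The two probability calculation units and the ratio can be illustrated with a deliberately simple model. This section does not say how the per-frame appearance probabilities are combined, so the sketch assumes the simplest option, treating frames as independent and multiplying their probabilities (done in the log domain to avoid underflow). The codebook layout, a dict from code index to `(p_emphasized, p_calm)`, and the function name are hypothetical.

```python
import math


def log_probability_ratio(codes, codebook):
    """Log of (emphasized-state probability / calm-state probability).

    codes: sequence of codebook indices, one per frame of the section.
    codebook: dict mapping code -> (p_emphasized, p_calm), hypothetical layout.
    Assumes frames are independent, so section probabilities are products
    of per-frame appearance probabilities.
    """
    log_emphasized = sum(math.log(codebook[code][0]) for code in codes)
    log_calm = sum(math.log(codebook[code][1]) for code in codes)
    return log_emphasized - log_calm


codebook = {0: (0.2, 0.4), 1: (0.6, 0.3)}
ratio = math.exp(log_probability_ratio([1, 1, 0], codebook))
print(ratio)  # a ratio above 1 means the emphasized state is the more likely one
```

A section would then be ranked, or compared against the coefficient, using this ratio.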

[0013] The present invention further proposes the video reproducing apparatus described above in which the summary section determining means comprises: a codebook that stores, in association with one another, feature quantities including at least the fundamental frequency or pitch period, power, the temporal variation characteristic of a dynamic feature quantity, or inter-frame differences of these, together with their appearance probability in the emphasized state and their appearance probability in the calm state; an emphasized state probability calculation unit that uses the codebook to obtain the emphasized-state appearance probability and the calm-state appearance probability corresponding to the feature quantities obtained by analyzing the audio signal frame by frame, and calculates the probability of being in the emphasized state from the emphasized-state appearance probability; a calm state probability calculation unit that calculates the probability of being in the calm state from the calm-state appearance probability; a summary section provisional judgment unit that provisionally judges as summary sections those audio signal sections whose ratio of emphasized-state probability to calm-state probability exceeds a predetermined coefficient; and a summary section determination unit that determines the summary sections for each channel or for each speaker by calculating the predetermined coefficient at which the total duration of the summary sections approximately equals the predetermined summary time, or the summarization rate approximately equals the predetermined summarization rate.

[0014] The present invention further proposes a video reproducing program that is written in computer-readable code and causes a computer to execute any of the video reproducing methods described above.

[Operation] According to the video reproducing method of the present invention, the speech sections of the audio recorded on the recording medium that have a high probability of being in the emphasized state are extracted as summary sections. The important parts of the content are thus picked out and joined together to yield summarized audio and summarized video. As a result, even when the summary time is compressed to a short duration, the content can be well understood.

[0015] Moreover, because the recording medium used is one that can write and read large amounts of data in a short time, other content, or the earlier part of the content currently being recorded, can be read out while recording continues. The recorded portion of a program can therefore be summarized, and the summary reproduced, while the program is still being recorded. As a result, by showing the video being recorded and the summary video on separate display means, for example a main screen and a sub-screen, the viewer can watch both the current broadcast and the past broadcast. When reproduction of the summary ends, the viewer can continue watching with an understanding of the first part of the program.
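While a summary plays, recording continues and new unseen material piles up. The sketch below works through one possible reading of how the repeated summarization and the r / (1 + r) adjustment of paragraph [0008] interact; this reading is an assumption, not something this section states. If each pass summarizes, at rate q, exactly the material recorded during the previous pass, the passes form a geometric series with total playback time backlog × q / (1 − q). Substituting q = r / (1 + r) gives backlog × r, so the viewer catches up after the nominal summary time r × backlog. The function names are hypothetical.

```python
def adjusted_rate(r):
    """Adjusted summarization rate r / (1 + r) for 0 < r < 1."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must satisfy 0 < r < 1")
    return r / (1.0 + r)


def catch_up_time(backlog_seconds, rate, stop_below=1e-9):
    """Total playback time until the viewer has caught up with the recording.

    Each pass plays a summary of the current backlog at the given rate; the
    material recorded during that pass becomes the next backlog.
    """
    elapsed = 0.0
    backlog = backlog_seconds
    while backlog > stop_below:
        playback = rate * backlog
        elapsed += playback
        backlog = playback  # recorded while the summary was playing
    return elapsed


missed = 3600.0  # one hour missed
print(catch_up_time(missed, 0.5))                 # unadjusted rate
print(catch_up_time(missed, adjusted_rate(0.5)))  # adjusted rate
```

With r = 0.5, the unadjusted rate needs about as long as the missed hour itself to catch up, while the adjusted rate of 1/3 catches up in about half an hour, which is r times the missed duration.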

[0016] A feature of the present invention is that it uses a summarization method capable of summarizing content at any summarization rate (compression rate) requested by the user at the time of summary reproduction. This summarization method can be realized with the speech emphasis state determination method and speech summarization method proposed by the present applicant in the prior Japanese Patent Application No. 2001-241278. That method determines the utterance state of an arbitrary speech sub-paragraph, judges the speech sub-paragraph to be in the emphasized state if its probability of being in the emphasized state is greater than its probability of being in the calm state, and extracts the speech paragraph containing that speech sub-paragraph as a summary section.

【００１７】[0017]

[Embodiments of the Invention] The following describes the speech sub-paragraph extraction method, the speech paragraph extraction method, and the method of obtaining, for each speech sub-paragraph, the probability of being in the emphasized state and the probability of being in the calm state, all of which are used in the present invention. FIG. 5 shows the basic procedure of an embodiment of the previously proposed speech summarization method. In step S1, the input audio signal is analyzed to obtain speech feature quantities. In step S2, the speech sub-paragraphs of the input audio signal, and the speech paragraphs each composed of a plurality of speech sub-paragraphs, are extracted. In step S3, the utterance state is determined, that is, whether the frames constituting each speech sub-paragraph are in the calm state or the emphasized state. Based on this determination, summarized speech is created in step S4.

[0018] An example in which natural spoken language and conversational speech are summarized is described below. The speech feature quantities used are ones that, compared with spectral information and the like, can be obtained stably even in noisy environments and depend little on the speaker. The fundamental frequency (f0), power (p), the temporal variation characteristic of the dynamic feature quantity of speech (d), and the pause duration (silent section) (ps) are extracted from the input audio signal as speech feature quantities. Methods of extracting these feature quantities are described in, for example, "Acoustics and Acoustic Engineering" (Sadaoki Furui, Kindai Kagaku-sha, 1998), "Speech Coding" (Takehiro Moriya, Institute of Electronics, Information and Communication Engineers, 1998), "Digital Speech Processing" (Sadaoki Furui, Tokai University Press, 1985), and "A Study on Speech Analysis Algorithms Based on the Composite Sinusoidal Model" (Shigeki Sagayama, doctoral thesis, 1998).

The temporal variation of the dynamic feature quantity of speech is a parameter that serves as a measure of speaking rate, and the one described in Japanese Patent No. 2976998 may be used. That is, the temporal variation characteristic of the LPC spectrum coefficients, which reflect the spectral envelope, is obtained as the dynamic variation, and a speaking rate coefficient is derived from that temporal variation. More specifically, the LPC spectrum coefficients C1(t), ..., Ck(t) are extracted for each frame, and the dynamic feature quantity d (dynamic measure) is obtained by the following equation:

$$d(t) = \sum_{i=1}^{k} \left[ \frac{\sum_{f=t-f_0}^{t+f_0} f \times C_i(t)}{\sum_{f=t-f_0}^{t+f_0} f^{2}} \right]^{2}$$

Here, f0 is the number of speech frames before and after the current frame (it need not be an integral number of frames and may be a fixed time interval), k is the order of the LPC spectrum, and i = 1, 2, ..., k. As the speaking rate coefficient, either the number of local maxima of the change in the dynamic feature quantity per unit time or its rate of change per unit time is used.
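The dynamic measure can be sketched in code as follows. The printed formula is ambiguous about indexing, so this sketch assumes the common regression-slope reading: f runs over offsets from -f0 to +f0 around frame t and each coefficient is sampled at frame t + f. That reading, the function names, and the `coefficient_tracks[i][frame]` layout are assumptions for the example. The second function counts strict local maxima, the quantity paragraph [0018] names as one form of the speaking rate coefficient.

```python
def dynamic_measure(coefficient_tracks, t, half_width):
    """Sum over coefficients of the squared regression slope around frame t.

    coefficient_tracks: list of k lists; coefficient_tracks[i][frame] is the
        i-th spectrum coefficient at that frame (hypothetical layout).
    half_width: number of frames on each side of t (f0 in the text).
    Requires half_width <= t < len(track) - half_width.
    """
    offsets = range(-half_width, half_width + 1)
    denominator = sum(f * f for f in offsets)
    total = 0.0
    for track in coefficient_tracks:
        slope = sum(f * track[t + f] for f in offsets) / denominator
        total += slope * slope
    return total


def count_local_maxima(values):
    """Number of interior points strictly greater than both neighbours."""
    return sum(1 for n in range(1, len(values) - 1)
               if values[n - 1] < values[n] > values[n + 1])


tracks = [[0, 1, 2, 3, 4],   # rises by 1 per frame
          [0, 0, 0, 0, 0]]   # constant
print(dynamic_measure(tracks, t=2, half_width=2))
print(count_local_maxima([0, 1, 0, 2, 0]))
```

A coefficient that rises by one unit per frame contributes a squared slope of 1, and a constant coefficient contributes 0, so the measure grows with how fast the spectral envelope is changing.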

[0019] In the embodiment, one frame is, for example, 100 ms long with a shift of 50 ms. The average fundamental frequency of each frame is obtained (f0′), and likewise the average power of each frame (p′). The differences between f0′ of the current frame and f0′ of the frames i frames before and after it are taken as ±Δf0′i (Δ components). For power, the differences ±Δp′i (Δ components) between p′ of the current frame and p′ of the frames i frames before and after it are obtained in the same way.

f0′, ±Δf0′i, p′, and ±Δp′i are then normalized. For example, f0′ and ±Δf0′i are each divided by the average fundamental frequency of the entire speech waveform; the normalized values are written f0″ and ±Δf0″i. Similarly, p′ and ±Δp′i are divided by the average power of the entire speech waveform subject to utterance state determination. The average power of each speech sub-paragraph or speech paragraph, described later, may be used as the divisor instead. These normalized values are written p″ and ±Δp″i. The value of i is, for example, i = 4.

The number of dynamic measure peaks, that is, the number of local maxima of the change in the dynamic feature quantity, within ±T1 ms of the current frame is counted (dp). The Δ component (−Δdp) between this count and dp of the frame whose span includes the time T2 ms before the start of the current frame is obtained. The Δ component (+Δdp) between the ±T1 ms count and dp of the frame whose span includes the time T3 ms after the end of the current frame is also obtained. The values of T1, T2, and T3 are, for example, T1 = T2 = T3 = 450 ms. The durations of the silent sections before and after the frame are taken as ±ps. In step S1, the value of each of these speech feature parameters is extracted for every frame.
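The framing, Δ-component, and normalization steps of paragraph [0019] can be sketched as below. This is an illustration under assumptions: the input is a plain list of per-sample values (for example a pitch track or instantaneous power), the backward difference is taken as current minus past and the forward difference as future minus current (the text does not fix the signs), and all names are hypothetical.

```python
def frame_means(samples, rate_hz, frame_ms=100, shift_ms=50):
    """Mean of each overlapping frame (100 ms frames, 50 ms shift by default)."""
    frame = rate_hz * frame_ms // 1000
    shift = rate_hz * shift_ms // 1000
    return [sum(samples[start:start + frame]) / frame
            for start in range(0, len(samples) - frame + 1, shift)]


def normalized_with_deltas(values, i=4):
    """Normalize per-frame values and their +/- i-frame differences.

    Returns one (value, backward_delta, forward_delta) tuple per frame that
    has neighbours i frames away on both sides, each divided by the mean of
    all frame values (the whole-waveform average in the text).
    Sign convention is an assumption of this sketch.
    """
    mean = sum(values) / len(values)
    rows = []
    for t in range(i, len(values) - i):
        current = values[t]
        backward = current - values[t - i]
        forward = values[t + i] - current
        rows.append((current / mean, backward / mean, forward / mean))
    return rows


per_frame = frame_means(list(range(8)), rate_hz=20)
print(per_frame)
print(normalized_with_deltas([1, 2, 3, 4, 5, 6, 7, 8, 9], i=4))
```

Dividing by the whole-waveform average makes the features relative to the speaker's own typical pitch and loudness, which is consistent with the stated goal of features that depend little on the speaker.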

[0020] FIG. 6 shows an example of the method of extracting speech sub-paragraphs and speech paragraphs from the input speech in step S2. The speech sub-paragraph is the unit on which utterance state determination is performed. In step S201, the silent sections and voiced sections of the input audio signal are extracted. A frame is judged to be a silent section if, for example, its power is at or below a predetermined power value, and to be a voiced section if, for example, its correlation function is at or above a predetermined correlation function value. The voiced/unvoiced decision is commonly equated with the periodic/aperiodic distinction and made from the peak value of the autocorrelation function or the modified correlation function. The modified correlation function is the autocorrelation function of the prediction residual obtained by removing the spectral envelope from the short-time spectrum of the input signal. Voiced or unvoiced is decided by whether the peak of the modified correlation function exceeds a predetermined threshold, and the pitch period 1/f0 (fundamental frequency f0) is extracted from the delay time that gives that peak. Details of how these sections are extracted are given in, for example, "Digital Speech Processing" (Sadaoki Furui, Tokai University Press, 1985). Although the analysis of each speech feature quantity from the audio signal frame by frame has been described here, feature quantities corresponding to coefficients or codes already obtained by analysis, for example during coding, may instead be read out from the codebook used for the coding.
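The silent and voiced decisions of step S201 can be sketched per frame as below. Note the substitution: the text describes the modified correlation function (autocorrelation of the LPC prediction residual), whereas this sketch uses a plain normalized autocorrelation of the waveform, which is simpler but not the same technique. The thresholds, lag range, and function names are assumptions for the example.

```python
import math


def frame_power(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def autocorrelation_peak(frame, min_lag, max_lag):
    """Largest normalized autocorrelation in the lag range and the lag giving it."""
    energy = sum(x * x for x in frame)
    if energy == 0:
        return 0.0, 0
    best_value, best_lag = 0.0, 0
    for lag in range(min_lag, max_lag + 1):
        value = sum(frame[n] * frame[n + lag]
                    for n in range(len(frame) - lag)) / energy
        if value > best_value:
            best_value, best_lag = value, lag
    return best_value, best_lag


def classify_frame(frame, power_floor, voicing_floor, min_lag, max_lag):
    """Return ('silent'|'voiced'|'unvoiced', pitch period in samples or None)."""
    if frame_power(frame) <= power_floor:
        return "silent", None
    peak, lag = autocorrelation_peak(frame, min_lag, max_lag)
    if peak >= voicing_floor:
        return "voiced", lag  # the lag of the peak estimates the pitch period
    return "unvoiced", None


tone = [math.sin(2 * math.pi * n / 20) for n in range(200)]  # period: 20 samples
print(classify_frame(tone, power_floor=1e-4, voicing_floor=0.5,
                     min_lag=10, max_lag=40))
print(classify_frame([0.0] * 200, power_floor=1e-4, voicing_floor=0.5,
                     min_lag=10, max_lag=40))
```

Dividing the sample rate by the returned lag gives the fundamental frequency estimate, mirroring the text's relation between the pitch period 1/f0 and f0.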

[0021] In step S202, when the silent sections on both sides of voiced sections each last t seconds or longer, the portion containing the voiced sections enclosed by those silent sections is taken as a speech sub-paragraph. Here t is, for example, 400 ms. In step S203, the average power of the voiced sections in this speech sub-paragraph, preferably those in its latter half, is compared with β times the average power BA of the whole speech sub-paragraph, where β is a constant. If the former is smaller, the speech sub-paragraph is taken as a final speech sub-paragraph, and the span from the speech sub-paragraph following the previous final speech sub-paragraph up to the final speech sub-paragraph just detected is determined to be a speech paragraph. FIG. 7 schematically shows voiced sections, speech sub-paragraphs, and speech paragraphs. Speech sub-paragraphs are extracted under the condition stated above, namely that the silent sections enclosing the voiced sections last t seconds or longer. FIG. 7 shows speech sub-paragraphs j−1, j, and j+1. Speech sub-paragraph j consists of n voiced sections and has average power Pj. As a typical example of a voiced section, voiced section v contained in speech sub-paragraph j has average power pv. Speech paragraph k is extracted from speech sub-paragraph j and the power of the voiced sections that make up its latter half. When the mean of the average powers pi of the voiced sections from i = n−α to n is smaller than β times the average power Pj of speech sub-paragraph j, that is, when

$$\frac{1}{\alpha+1}\sum_{i=n-\alpha}^{n} p_i < \beta P_j \tag{1}$$

is satisfied, speech sub-paragraph j is taken as the final speech sub-paragraph of speech paragraph k. The constants α and β in Expression (1) are adjusted to extract the speech paragraphs. In the embodiment, α = 3 and β = 0.8. In this way, with the final speech sub-paragraphs as delimiters, the group of speech sub-paragraphs between adjacent final speech sub-paragraphs can be determined to be a speech paragraph.
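As a rough illustration of Expression (1) and of the grouping of speech sub-paragraphs into speech paragraphs, consider the following sketch. The function names are hypothetical. Two points are assumptions not stated in the text: the average power Pj is computed as the mean of the voiced-section powers, and a sub-paragraph with fewer than α+1 voiced sections is averaged over the sections it has.

```python
from typing import List

def is_final_subparagraph(voiced_powers: List[float],
                          alpha: int = 3, beta: float = 0.8) -> bool:
    """Expression (1): mean power of the last alpha+1 voiced sections < beta * Pj."""
    n = len(voiced_powers)
    if n == 0:
        raise ValueError("a sub-paragraph needs at least one voiced section")
    p_j = sum(voiced_powers) / n          # assumed definition of Pj
    tail = voiced_powers[-(alpha + 1):]   # sections i = n - alpha .. n
    return sum(tail) / len(tail) < beta * p_j

def group_into_paragraphs(subparagraphs: List[List[float]],
                          alpha: int = 3, beta: float = 0.8) -> List[List[int]]:
    """Close a paragraph at every final sub-paragraph; return index groups."""
    paragraphs, current = [], []
    for idx, powers in enumerate(subparagraphs):
        current.append(idx)
        if is_final_subparagraph(powers, alpha, beta):
            paragraphs.append(current)
            current = []
    if current:                            # trailing sub-paragraphs with no final one
        paragraphs.append(current)
    return paragraphs

fading = [10, 10, 10, 10, 2, 2, 2, 2]      # power drops in the latter half
flat = [5, 5, 5, 5, 5]                     # no drop
groups = group_into_paragraphs([flat, fading, flat])
```

With α = 3 and β = 0.8 as in the embodiment, the sub-paragraph whose power fades toward its end closes the paragraph, while the flat one does not.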

[0022] FIG. 8 shows an example of the method for determining the utterance state of a speech sub-paragraph in step S3 of FIG. 5. In step S301, the speech features of the input speech sub-paragraph are vector-quantized. For this purpose, a codebook storing at least two quantized speech features (codes) is created in advance. The usual practice is to match the speech features stored in the codebook against the speech features of the input speech, or of speech that has already been analyzed, and to identify the quantized speech feature in the codebook that minimizes the distortion (distance) between the speech features. FIG. 9 shows an example of how this codebook is created. Test subjects listen to a large amount of training speech and label the portions whose utterance state is calm and the portions whose utterance state is emphasized (S501).
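The matching step, in which the code with minimum distortion is selected, can be sketched as a nearest-neighbor search. The squared Euclidean distance and the function name `quantize` are assumptions for illustration; the text only speaks of a "distortion (distance)".

```python
from typing import List, Sequence

def quantize(feature: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Return the index of the code vector closest to `feature`."""
    def dist2(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist2(feature, codebook[k]))

codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
code_a = quantize((0.9, 1.2), codebook)
code_b = quantize((3.0, 0.1), codebook)
```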

[0023] For example, the test subjects gave the following reasons for judging an utterance to be in the emphasized state:
(a) the voice is loud, and nouns or conjunctions are drawn out;
(b) the beginning of the utterance is drawn out to assert a change of topic, or the voice is raised as if to consolidate opinions;
(c) the voice is made louder and higher to emphasize important nouns and the like;
(d) the voice is high-pitched but not very loud;
(e) the speaker, smiling wryly, glosses over his or her true feelings out of impatience;
(f) the pitch rises at the end of a phrase, as if seeking agreement from or putting a question to those around;
(g) the voice grows louder at the end of a phrase, slowly and forcefully, as if to drive the point home;
(h) the voice is loud and high, asserting the intention to cut in, and louder than the other party's;
(i) the speaker states true feelings or secrets that he or she would hesitate to say loudly, or a person who is normally loud says something important (for example, in a low, mumbling or whispering tone).
In this example, the calm state is an utterance that matches none of (a) to (i) above and that the test subjects felt to be calm.

[0024] The description above treats utterances as the object judged for the emphasized state, but the emphasized state can also be identified in music. When the emphasized state is to be identified from the voice in a piece of music with vocals, the reasons for perceiving emphasis are:
(a) the voice is loud and high;
(b) the voice is powerful;
(c) the voice is high and strongly accented;
(d) the voice is high and its quality changes;
(e) the voice is drawn out and loud;
(f) the voice is loud, high, and strongly accented;
(g) the voice is loud and high, and the singer is shouting;
(h) the voice is high and the accent changes;
(i) the voice is drawn out and loud, and high at the end of the phrase;
(j) the voice is high and drawn out;
(k) the voice is drawn out, shouting, and high;
(l) the end of the phrase rises and is powerful;
(m) the delivery is slow and somewhat strong;
(n) the melody is irregular;
(o) the melody is irregular and the voice is high.
The emphasized state can also be identified in instrumental music without vocals. The reasons for perceiving emphasis in that case include:
(a) the power of the whole emphasized part increases;
(b) the difference between high and low pitches is large;
(c) the power increases;
(d) the number of instruments changes;
(e) the melody or tempo changes.

[0025] Creating a codebook on the basis of these criteria makes it possible to summarize music as well as utterances. For each labeled section in the calm state and in the emphasized state, speech features are extracted (S502) and parameters are selected (S503), as in step S1 of FIG. 5. Using these parameters for the calm and emphasized labeled sections, a codebook is created with the LBG algorithm (S504). The LBG algorithm is described in, for example, Y. Linde, A. Buzo and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. COM-28, pp. 84-95, 1980. The codebook size is variable and is a power of two. For codebook creation it is preferable to use speech features normalized by the speech features of each speech sub-paragraph, of each suitable longer section, or of the whole training speech.
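For readers unfamiliar with the LBG algorithm, the following is a minimal sketch of its splitting procedure in pure Python. It is not the implementation used in the embodiment: the additive perturbation `eps`, the fixed number of refinement passes, and the squared Euclidean distortion are all assumptions, and `size` is assumed to be a power of two.

```python
from typing import List

Vector = List[float]

def _dist2(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _mean(vectors: List[Vector]) -> Vector:
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def lbg(data: List[Vector], size: int, eps: float = 0.01, iters: int = 20) -> List[Vector]:
    """Grow a codebook by repeated splitting until it holds `size` codes."""
    codebook = [_mean(data)]                      # start from the global centroid
    while len(codebook) < size:
        # Split every code into two slightly shifted copies.
        codebook = [[c + s for c in code] for code in codebook for s in (eps, -eps)]
        for _ in range(iters):                    # refine by nearest-code reassignment
            cells: List[List[Vector]] = [[] for _ in codebook]
            for v in data:
                k = min(range(len(codebook)), key=lambda j: _dist2(v, codebook[j]))
                cells[k].append(v)
            # Empty cells keep their previous code.
            codebook = [_mean(cell) if cell else code
                        for cell, code in zip(cells, codebook)]
    return codebook

data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
cb2 = sorted(lbg(data, 2))
cb4 = lbg(data, 4)
```

Because each pass doubles the number of codes, the resulting codebook size is naturally a power of two, which matches the statement about the codebook size above.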

[0026] In step S301 of FIG. 8, using this codebook, the speech features of the input speech sub-paragraph are normalized feature by feature, and the normalized speech features are matched or vector-quantized frame by frame to obtain a code (quantized speech feature) for each frame. The speech features extracted from the input speech signal at this point are the same as the parameters used to create the codebook. To identify the speech sub-paragraphs that contain the emphasized state, the codes in each speech sub-paragraph are used to obtain the likelihood of the utterance state for the calm state and for the emphasized state. For this purpose, the appearance probability of every code (quantized speech feature) is obtained in advance for the calm state and for the emphasized state, and each appearance probability is stored in the codebook paired with its code. An example of how to obtain these appearance probabilities follows. Suppose the codes of the speech features (one per frame) in one labeled section of the training speech used to create the codebook are, in time order, Ci, Cj, Ck, …, Cn. Let Pα(e) be the probability that labeled section α is in the emphasized state and Pα(n) the probability that it is in the calm state. Then

$$P_\alpha(e) = P_{emp}(C_i)\,P_{emp}(C_j \mid C_i)\cdots P_{emp}(C_n \mid C_i \cdots C_{n-1}) = P_{emp}(C_i)\prod_{x=i+1}^{n} P_{emp}(C_x \mid C_i \cdots C_{x-1})$$

$$P_\alpha(n) = P_{nrm}(C_i)\,P_{nrm}(C_j \mid C_i)\cdots P_{nrm}(C_n \mid C_i \cdots C_{n-1}) = P_{nrm}(C_i)\prod_{x=i+1}^{n} P_{nrm}(C_x \mid C_i \cdots C_{x-1})$$

Here Pemp(Cx | Ci…Cx−1) is the conditional probability that Cx appears in the emphasized state after the code sequence Ci…Cx−1, and Pnrm(Cx | Ci…Cx−1) is likewise the probability that Cx appears in the calm state after Ci…Cx−1. The product Π runs from x = i+1 to n. Pemp(Ci) is obtained by quantizing the training speech frame by frame, counting how many times Ci occurs in the portions labeled as emphasized, and dividing that count by the total number of codes (frames) in all of the training speech. Pnrm(Ci) is the number of times Ci occurs in the portions labeled as calm divided by the total number of codes.
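The counting rule for Pemp(Ci) and Pnrm(Ci) can be sketched as follows. The input layout (one `(code, label)` pair per frame, with labels `"emp"` and `"nrm"`) and the function name are assumptions for illustration. Following the text, both counts are divided by the total number of frames in the training speech.

```python
from collections import Counter
from typing import Dict, List, Tuple

def unigram_probs(frames: List[Tuple[int, str]]) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Return (Pemp, Pnrm) keyed by code, each normalized by the total frame count."""
    total = len(frames)
    emp = Counter(code for code, label in frames if label == "emp")
    nrm = Counter(code for code, label in frames if label == "nrm")
    p_emp = {code: k / total for code, k in emp.items()}
    p_nrm = {code: k / total for code, k in nrm.items()}
    return p_emp, p_nrm

frames = [(0, "emp"), (0, "emp"), (1, "emp"),
          (0, "nrm"), (1, "nrm"), (1, "nrm"), (1, "nrm"), (2, "nrm")]
p_emp, p_nrm = unigram_probs(frames)
```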

[0027] To simplify the state probabilities of labeled section α, this example uses an N-gram model (N < n) and sets

$$P_\alpha(e) = P_{emp}(C_n \mid C_{n-N+1}\cdots C_{n-1})$$

$$P_\alpha(n) = P_{nrm}(C_n \mid C_{n-N+1}\cdots C_{n-1})$$

That is, Pα(e) is the probability that Cn is obtained in the emphasized state after the N−1 preceding codes Cn−N+1…Cn−1, and Pα(n) is likewise the probability that Cn is obtained in the calm state after the same N−1 preceding codes. It is preferable to apply linear interpolation, in which the N-gram probability is linearly interpolated with the probabilities of lower-order M-grams (N ≥ M). All of these conditional probabilities are obtained from the quantized code sequences of the labeled training speech, but a code sequence obtained by quantizing the speech features of the input speech signal may have no counterpart in the training speech. For this reason, a higher-order conditional probability (one with a longer code sequence) is obtained by interpolating it with the unigram probability and the lower-order conditional probabilities. For example, linear interpolation is applied using the trigram (N = 3), the bigram (N = 2), and the unigram (N = 1). N-grams, linear interpolation, and trigrams are described in, for example, "Spoken Language Processing" (Kenji Kita, Satoshi Nakamura, Masaaki Nagata, Morikita Publishing, 1996, p. 29). The three probabilities are:

N = 3 (trigram): Pemp(Cn | Cn−2 Cn−1), Pnrm(Cn | Cn−2 Cn−1)
N = 2 (bigram): Pemp(Cn | Cn−1), Pnrm(Cn | Cn−1)
N = 1 (unigram): Pemp(Cn), Pnrm(Cn)

Using the three appearance probabilities of Cn in the emphasized state and the three in the calm state, Pemp(Cn | Cn−2 Cn−1) and Pnrm(Cn | Cn−2 Cn−1) are calculated by the following expressions, in which the left-hand side denotes the interpolated value:

$$P_{emp}(C_n \mid C_{n-2}C_{n-1}) = \lambda_{emp1}P_{emp}(C_n \mid C_{n-2}C_{n-1}) + \lambda_{emp2}P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3}P_{emp}(C_n) \tag{2}$$

$$P_{nrm}(C_n \mid C_{n-2}C_{n-1}) = \lambda_{nrm1}P_{nrm}(C_n \mid C_{n-2}C_{n-1}) + \lambda_{nrm2}P_{nrm}(C_n \mid C_{n-1}) + \lambda_{nrm3}P_{nrm}(C_n) \tag{3}$$

Let N be the number of trigram training samples, that is, suppose the codes C1, C2, …, CN are obtained in time order. According to the reference "Spoken Language Processing" cited above, the re-estimation formulas for λemp1, λemp2, and λemp3 are:

$$\lambda_{emp1} = \frac{1}{N}\sum_{n=1}^{N}\frac{\lambda_{emp1}P_{emp}(C_n \mid C_{n-2}C_{n-1})}{\lambda_{emp1}P_{emp}(C_n \mid C_{n-2}C_{n-1}) + \lambda_{emp2}P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3}P_{emp}(C_n)}$$

$$\lambda_{emp2} = \frac{1}{N}\sum_{n=1}^{N}\frac{\lambda_{emp2}P_{emp}(C_n \mid C_{n-1})}{\lambda_{emp1}P_{emp}(C_n \mid C_{n-2}C_{n-1}) + \lambda_{emp2}P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3}P_{emp}(C_n)}$$

$$\lambda_{emp3} = \frac{1}{N}\sum_{n=1}^{N}\frac{\lambda_{emp3}P_{emp}(C_n)}{\lambda_{emp1}P_{emp}(C_n \mid C_{n-2}C_{n-1}) + \lambda_{emp2}P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3}P_{emp}(C_n)}$$

The sum Σ runs from n = 1 to N. The coefficients λnrm1, λnrm2, and λnrm3 are obtained in the same way.
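The interpolation of Expressions (2) and (3) and one pass of the re-estimation formulas can be sketched as below. The function names and the data layout are assumptions: each training position supplies a tuple `(trigram, bigram, unigram)` of raw probabilities, ordered to match λ1, λ2, λ3. Skipping positions whose denominator is zero is also an assumption, since the text does not say how to treat them.

```python
from typing import List, Sequence, Tuple

Probs = Tuple[float, float, float]   # (trigram, bigram, unigram) for one position

def interpolate(lams: Sequence[float], p: Probs) -> float:
    """Right-hand side of Expression (2) or (3)."""
    return sum(l * q for l, q in zip(lams, p))

def reestimate(lams: Sequence[float], samples: List[Probs]) -> Tuple[float, float, float]:
    """One re-estimation pass over N training positions."""
    n = len(samples)
    acc = [0.0, 0.0, 0.0]
    for p in samples:
        denom = interpolate(lams, p)
        if denom == 0.0:
            continue                 # assumed handling: unseen position adds nothing
        for k in range(3):
            acc[k] += lams[k] * p[k] / denom
    return (acc[0] / n, acc[1] / n, acc[2] / n)

value = interpolate((0.5, 0.3, 0.2), (0.0, 0.4, 0.1))
uniform = reestimate((1 / 3, 1 / 3, 1 / 3), [(0.5, 0.5, 0.5)] * 4)
trigram_only = reestimate((0.2, 0.3, 0.5), [(1.0, 0.0, 0.0)])
```

In practice the pass would be repeated until the coefficients stop changing. When every position has a nonzero denominator, the three re-estimated coefficients sum to 1.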

[0028] In this example, when the codes obtained over the Nα frames of labeled section α are Ci1, Ci2, …, CiNα, the probability Pα(e) that labeled section α is in the emphasized state and the probability Pα(n) that it is in the calm state are

$$P_\alpha(e) = P_{emp}(C_{i3} \mid C_{i1}C_{i2})\cdots P_{emp}(C_{iN_\alpha} \mid C_{i(N_\alpha-2)}C_{i(N_\alpha-1)}) \tag{4}$$

$$P_\alpha(n) = P_{nrm}(C_{i3} \mid C_{i1}C_{i2})\cdots P_{nrm}(C_{iN_\alpha} \mid C_{i(N_\alpha-2)}C_{i(N_\alpha-1)}) \tag{5}$$

To make this calculation possible, the trigram, bigram, and unigram probabilities described above are obtained for every code and stored in the codebook. That is, the codebook stores, for each code, its speech feature together with its appearance probability in the emphasized state and, in this example, its appearance probability in the calm state. The appearance probability in the emphasized state is either (i) only the probability that the speech feature appears in the emphasized state regardless of the speech features of past frames (the unigram probability), or (ii) that probability combined with the conditional probabilities that the speech feature appears in the emphasized state, given for each frame-by-frame sequence of speech features leading from past frames to the current frame. The appearance probability in the calm state is defined in the same way: either only the probability that the speech feature appears in the calm state regardless of the speech features of past frames (the unigram probability), or that probability combined with the conditional probabilities that the speech feature appears in the calm state, given for each frame-by-frame sequence of speech features leading from past frames to the current frame.

[0029] For example, as shown in FIG. 10, the codebook stores, for each code C1, C2, …, its speech feature, its unigram probabilities in the emphasized and calm states, and its conditional probabilities in the emphasized and calm states, as one set. In step S302 of FIG. 8, the likelihood of the utterance state is obtained for the calm state and for the emphasized state from the probabilities stored in the codebook for the codes of all frames of the input speech sub-paragraph. FIG. 11 is a schematic diagram of the embodiment. It shows the first four frames of a speech sub-paragraph that starts at time t. As described above, the frame length is 100 ms and the frame shift is 50 ms, so the frame length is longer than the shift. Suppose code Ci is obtained for frame number f (time t to t+100 ms), code Cj for frame number f+1 (time t+50 to t+150 ms), code Ck for frame number f+2 (time t+100 to t+200 ms), and code Cl for frame number f+3 (time t+150 to t+250 ms), so that the codes in frame order are Ci, Cj, Ck, Cl. The trigram can then be calculated for frames numbered f+2 and above. Let Ps(e) be the probability that speech sub-paragraph s is in the emphasized state and Ps(n) the probability that it is in the calm state. The probabilities up to the fourth frame are

$$P_s(e) = P_{emp}(C_k \mid C_iC_j)\,P_{emp}(C_l \mid C_jC_k) \tag{6}$$

$$P_s(n) = P_{nrm}(C_k \mid C_iC_j)\,P_{nrm}(C_l \mid C_jC_k) \tag{7}$$

In this example, the unigram probabilities of Ck and Cl in the emphasized and calm states are obtained from the codebook, together with the conditional probabilities that Ck appears after Cj in the emphasized and calm states, and the conditional probabilities that Ck appears after Ci, Cj and that Cl appears after Cj, Ck in the emphasized and calm states. This gives:

$$P_{emp}(C_k \mid C_iC_j) = \lambda_{emp1}P_{emp}(C_k \mid C_iC_j) + \lambda_{emp2}P_{emp}(C_k \mid C_j) + \lambda_{emp3}P_{emp}(C_k) \tag{8}$$

$$P_{emp}(C_l \mid C_jC_k) = \lambda_{emp1}P_{emp}(C_l \mid C_jC_k) + \lambda_{emp2}P_{emp}(C_l \mid C_k) + \lambda_{emp3}P_{emp}(C_l) \tag{9}$$

$$P_{nrm}(C_k \mid C_iC_j) = \lambda_{nrm1}P_{nrm}(C_k \mid C_iC_j) + \lambda_{nrm2}P_{nrm}(C_k \mid C_j) + \lambda_{nrm3}P_{nrm}(C_k) \tag{10}$$

$$P_{nrm}(C_l \mid C_jC_k) = \lambda_{nrm1}P_{nrm}(C_l \mid C_jC_k) + \lambda_{nrm2}P_{nrm}(C_l \mid C_k) + \lambda_{nrm3}P_{nrm}(C_l) \tag{11}$$

Expressions (8) to (11) yield the probability Ps(e) of the emphasized state and the probability Ps(n) of the calm state up to the fourth frame, as given by Expressions (6) and (7). Pemp(Ck | CiCj) and Pnrm(Ck | CiCj) can be calculated at frame number f+2.
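The calculation of Expressions (6) to (11) for the four-frame example can be sketched as follows. The class `StateModel` and its dictionary layout are hypothetical stand-ins for the codebook of FIG. 10, and the probability and λ values are invented for the example. One model instance would be built for the emphasized state and another for the calm state.

```python
from typing import Dict, Sequence, Tuple

class StateModel:
    """Unigram, bigram and trigram tables for one state (assumed layout)."""

    def __init__(self, uni: Dict[str, float],
                 bi: Dict[Tuple[str, str], float],
                 tri: Dict[Tuple[str, str, str], float],
                 lams: Tuple[float, float, float]) -> None:
        self.uni, self.bi, self.tri, self.lams = uni, bi, tri, lams

    def cond(self, c: str, prev2: str, prev1: str) -> float:
        """Interpolated P(c | prev2 prev1), as in Expressions (8) to (11)."""
        l1, l2, l3 = self.lams
        return (l1 * self.tri.get((prev2, prev1, c), 0.0)
                + l2 * self.bi.get((prev1, c), 0.0)
                + l3 * self.uni.get(c, 0.0))

    def sequence_prob(self, codes: Sequence[str]) -> float:
        """Product over the third and later frames, as in Expressions (6) and (7)."""
        p = 1.0
        for a, b, c in zip(codes, codes[1:], codes[2:]):
            p *= self.cond(c, a, b)
        return p

# Illustrative values only.
emp = StateModel(
    uni={"Ck": 0.2, "Cl": 0.1},
    bi={("Cj", "Ck"): 0.5, ("Ck", "Cl"): 0.4},
    tri={("Ci", "Cj", "Ck"): 0.8, ("Cj", "Ck", "Cl"): 0.6},
    lams=(0.5, 0.3, 0.2),
)
ps_e = emp.sequence_prob(["Ci", "Cj", "Ck", "Cl"])
```

Code sequences never seen in training fall back to the bigram and unigram terms, which is the purpose of the interpolation. A sequence shorter than three frames has no trigram and yields 1.0 in this sketch.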

[0030] In this example, when the codes obtained over the Ns frames of speech sub-paragraph s are Ci1, Ci2, …, CiNs, the probability Ps(e) that speech sub-paragraph s is in the emphasized state and the probability Ps(n) that it is in the calm state are calculated by

$$P_s(e) = P_{emp}(C_{i3} \mid C_{i1}C_{i2})\cdots P_{emp}(C_{iN_s} \mid C_{i(N_s-2)}C_{i(N_s-1)})$$

$$P_s(n) = P_{nrm}(C_{i3} \mid C_{i1}C_{i2})\cdots P_{nrm}(C_{iN_s} \mid C_{i(N_s-2)}C_{i(N_s-1)})$$

In this example, speech sub-paragraph s is judged to be in the emphasized state if Ps(e) > Ps(n), and in the calm state if Ps(n) > Ps(e).
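The comparison of Ps(e) and Ps(n) can be sketched as below. Working with sums of logarithms instead of the raw product is a design choice of this sketch rather than something the text prescribes: a product of many per-frame probabilities underflows to zero for long sub-paragraphs, while the comparison of log sums gives the same ordering. The function names and the "undecided" result for a tie are assumptions.

```python
import math
from typing import Sequence

def log_prob(frame_probs: Sequence[float]) -> float:
    """Sum of log per-frame probabilities; -inf if any factor is zero."""
    total = 0.0
    for p in frame_probs:
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
    return total

def classify(emp_probs: Sequence[float], nrm_probs: Sequence[float]) -> str:
    """Emphasized if Ps(e) > Ps(n), calm if Ps(n) > Ps(e)."""
    le, ln = log_prob(emp_probs), log_prob(nrm_probs)
    if le > ln:
        return "emphasized"
    if ln > le:
        return "calm"
    return "undecided"               # the tie case is not covered by the text

short = classify([0.59, 0.44], [0.3, 0.2])
long_run = classify([0.01] * 1000, [0.009] * 1000)
```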

[0031] FIG. 12 shows an embodiment of a speech emphasis state determination device and a speech summarization device that use the speech sub-paragraph extraction method, the speech paragraph extraction method, and the method for obtaining the probabilities of the emphasized state and the calm state for each speech sub-paragraph described above. The input speech (input speech signal) whose emphasis state is to be determined, or whose summary sections are to be decided, is supplied to the input unit 11. Where necessary, the input unit 11 also converts the input speech signal to a digital signal. The digitized speech signal is stored in the storage unit 12 as needed. The speech feature extraction unit 13 extracts the speech features described above frame by frame. The extracted speech features are normalized by their average value as needed. The quantization unit 14 quantizes the speech features of each frame with reference to the codebook 15, and the quantized speech features are sent to the emphasis probability calculation unit 16 and the calm probability calculation unit 17. The codebook 15 is, for example, as shown in FIG. 10.

[0032] The emphasis probability calculation unit 16 calculates the appearance probability of the quantized speech feature in the emphasized state from the corresponding probabilities stored in the codebook 15, for example by Expression (8) or (9). Similarly, the calm probability calculation unit 17 calculates the appearance probability of the quantized speech feature in the calm state from the corresponding probabilities stored in the codebook 15, for example by Expression (10) or (11). The appearance probabilities in the emphasized state and in the calm state calculated for each frame by the emphasis probability calculation unit 16 and the calm probability calculation unit 17, and the speech features of each frame, are stored in the storage unit 12 together with the frame number assigned to each frame.

[0033] These units operate in sequence under the control of the control unit 19. In the embodiment of the speech summarization device, the blocks drawn with broken lines in FIG. 12 are added to the blocks drawn with solid lines. That is, the speech features of each frame stored in the storage unit 12 are sent to the silent section determination unit 21 and the voiced section determination unit 22. The silent section determination unit 21 determines for each frame whether it belongs to a silent section, and the voiced section determination unit 22 determines for each frame whether it belongs to a voiced section. Both determination results are supplied to the speech sub-paragraph determination unit 23. On the basis of these results, the speech sub-paragraph determination unit 23 determines, as explained in the embodiment of the method above, that a portion containing voiced sections enclosed by silent sections lasting a predetermined number of consecutive frames is a speech sub-paragraph. The determination result of the speech sub-paragraph determination unit 23 is written to the storage unit 12 and appended to the speech data sequence stored there, so that each group of frames enclosed by silent sections is given a speech sub-paragraph number. The determination result is also supplied to the final speech sub-paragraph determination unit 24.

[0034] The final speech sub-paragraph determination unit 24 detects final speech sub-paragraphs, for example by the technique described with reference to FIG. 7, and supplies the result to the speech paragraph determination unit 25. The speech paragraph determination unit 25 determines that the portion containing the speech sub-paragraphs between two final speech sub-paragraphs is a speech paragraph. This result is also written to the storage unit 12, and a speech paragraph number is assigned to the sequence of speech sub-paragraph numbers stored there. When the device operates as a speech summarization device, the emphasis probability calculation unit 16 and the calm probability calculation unit 17 read from the storage unit 12 the emphasis probability and calm probability of each frame making up each speech sub-paragraph, and the probabilities for each speech sub-paragraph are calculated, for example by Expressions (8) and (10). The emphasis state determination unit 18 compares the probabilities calculated for each speech sub-paragraph and determines whether that speech sub-paragraph is in the emphasized state. If even one speech sub-paragraph in a speech paragraph is determined to be in the emphasized state, the summary section extraction unit 26 extracts the speech paragraph containing it. Each unit is controlled by the control unit 19.
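The rule applied by the summary section extraction unit 26 (keep a speech paragraph if at least one of its speech sub-paragraphs is emphasized) can be sketched in a few lines. The representation of paragraphs as lists of sub-paragraph numbers and the function name are assumptions for illustration.

```python
from typing import Dict, List

def extract_summary_paragraphs(paragraphs: List[List[int]],
                               emphasized: Dict[int, bool]) -> List[List[int]]:
    """Keep every paragraph that contains at least one emphasized sub-paragraph."""
    return [p for p in paragraphs
            if any(emphasized.get(s, False) for s in p)]

paragraphs = [[0, 1], [2], [3, 4, 5]]
emphasized = {1: True, 4: True}        # sub-paragraphs judged emphasized
summary = extract_summary_paragraphs(paragraphs, emphasized)
```

Extracting whole paragraphs rather than isolated sub-paragraphs keeps the surrounding context of each emphasized portion in the summary.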

[0035] The above explains how a speech waveform is separated into speech sub-paragraphs and speech paragraphs, and how the probability of the emphasized state and the probability of the calm state are calculated for each speech sub-paragraph. Next, an embodiment of a speech processing method and a speech processing device is described that uses these methods and allows the summary rate to be set and changed freely. FIG. 13 shows the basic procedure of the embodiment of the speech processing method. In this embodiment, the speech emphasis probability calculation process is executed in step S11 to obtain the emphasis probability and the calm probability of each speech sub-paragraph.

[0036] Step S12 is the summary condition input step. In this step, for example, the user is presented with information prompting entry of a summary time, a summary rate, or a compression rate, and enters one of them. Alternatively, the input may be made by selecting one of several preset summary times, summary rates, or compression rates. In step S13, the extraction condition is changed repeatedly to determine an extraction condition that satisfies the summary time, summary rate, or compression rate entered in the summary condition input step S12.

[0037] Step S14 is the summary extraction step. In this step, the speech paragraphs to be adopted are determined using the extraction condition decided in the extraction condition change step S13, and the total duration of the adopted speech paragraphs is calculated. In step S15, the summary playback process is executed, and the sequence of speech paragraphs extracted in the summary extraction step S14 is played back. FIG. 14 shows the details of the speech emphasis probability calculation step shown in FIG. 13. In step S101, the speech waveform sequence to be summarized is separated into speech sub-paragraphs. In step S102, speech paragraphs are extracted from the sequence of speech sub-paragraphs separated in step S101. As explained with reference to FIG. 7, a speech paragraph consists of one or more speech sub-paragraphs and is a unit whose meaning can be understood.

【００３８】In steps S103 and S104, for each speech sub-paragraph obtained in step S101, the codebook described with reference to FIG. 10 and equations (8), (10), and so on are used to obtain the probability Ps(e) that the sub-paragraph is in the emphasized state (hereinafter, the emphasis probability) and the probability Ps(n) that it is in the calm state (hereinafter, the calm probability). In step S105, the emphasis probability Ps(e), the calm probability Ps(n), and related values obtained in steps S103 and S104 are sorted by speech sub-paragraph and stored in the storage means as a speech emphasis probability table. FIG. 15 shows an example of the speech emphasis probability table stored in the storage means. F1, F2, F3, ... in FIG. 15 denote sub-paragraph probability storage units that record the sub-paragraph emphasis probability Ps(e) and the sub-paragraph calm probability Ps(n) obtained for each speech sub-paragraph. Each of these storage units F1, F2, F3, ... holds the speech sub-paragraph number i assigned to the speech sub-paragraph S, its start time (measured from the beginning of the speech data sequence to be summarized), its end time, its emphasis probability, its calm probability, the number of frames fn that make up the sub-paragraph, and so on.
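
One possible in-memory layout for a single storage unit of the table described above is sketched here. The class name, field names, and sample values are illustrative assumptions; the text specifies only which quantities are stored, not how.

```python
from dataclasses import dataclass


@dataclass
class SubParagraphEntry:
    """One storage unit (F1, F2, F3, ...) of the speech emphasis probability table."""
    number: int        # speech sub-paragraph number i
    start: float       # start time, seconds from the beginning of the speech data
    end: float         # end time, seconds
    p_emphasis: float  # sub-paragraph emphasis probability Ps(e)
    p_calm: float      # sub-paragraph calm probability Ps(n)
    frames: int        # number of frames fn in the sub-paragraph

    @property
    def duration(self) -> float:
        """Length of the sub-paragraph in seconds."""
        return self.end - self.start


# Made-up values, for illustration only.
table = [
    SubParagraphEntry(1, 0.0, 2.5, 0.012, 0.020, 25),
    SubParagraphEntry(2, 2.5, 6.0, 0.031, 0.018, 35),
]
```

Keeping the start and end times next to the probabilities means that the later extraction steps can compute durations without going back to the waveform.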

【００３９】The condition entered in the summarization condition input step S12 is either a summarization rate r, indicating that the full length of the content is to be summarized to 1/X of its duration (X is a positive integer; r = 1/X is the reciprocal of the summarization rate recited in the claims), or a summary time t. Given this summarization condition, the extraction condition changing step S13 sets the weighting coefficient W to an initial value of W = 1 and passes it to the summary extraction step S14. With W = 1, the summary extraction step S14 compares the emphasis probability Ps(e) and the calm probability Ps(n) stored for each speech sub-paragraph in the speech emphasis probability table and extracts the speech sub-paragraphs for which W·Ps(e) > Ps(n). It then extracts every speech paragraph that contains at least one of the extracted sub-paragraphs and obtains the total duration MT (minutes) of the extracted speech paragraph sequence.

【００４０】The total duration MT (minutes) of the extracted speech paragraph sequence is compared with the predetermined summary time YT (minutes) fixed by the summarization condition. If MT ≈ YT (for example, the error of MT relative to YT is within about ± a few percent), the adopted speech paragraph sequence is reproduced as the summary speech as it is. If the error between MT and YT exceeds the specified tolerance and MT > YT, the total duration MT of the extracted speech paragraph sequence is judged to be longer than the summary time YT, and the extraction condition changing step S13 shown in FIG. 13 is executed again. On receiving the judgment that the sequence extracted with W = 1 is longer than YT, step S13 multiplies the emphasis probability Ps(e) by a weighting coefficient W smaller than its current value, giving W·Ps(e). (For the predetermined coefficient recited in claim 1, the value is instead made larger than its current value.) The weighting coefficient is obtained, for example, as W = 1 − 0.001 × K, where K is the loop count.

【００４１】That is, in the first loop, the emphasis probabilities Ps(e) obtained for all speech sub-paragraphs of the speech paragraph sequence read from the speech emphasis probability table are multiplied by the weighting coefficient W = 0.999 given by W = 1 − 0.001 × 1. The weighted emphasis probability W·Ps(e) of every speech sub-paragraph is compared with that sub-paragraph's calm probability Ps(n), and the sub-paragraphs for which W·Ps(e) > Ps(n) are extracted. Based on this result, the summary extraction step S14 extracts the speech paragraphs that contain the extracted sub-paragraphs and obtains the summary speech paragraph sequence again. The total duration MT (minutes) of this sequence is calculated and compared with the summary time YT (minutes) fixed by the summarization condition. If MT ≈ YT, the sequence is adopted as the summary speech and reproduced.

【００４２】If the result of the first weighting pass is still MT > YT, the extraction condition changing step is executed as a second loop. This time the weighting coefficient is W = 1 − 0.001 × 2, so every emphasis probability Ps(e) is weighted by W = 0.998. In this example, the extraction condition is thus changed so that W decreases slightly on every iteration, which gradually reduces the number of speech sub-paragraphs satisfying W·Ps(e) > Ps(n). In this way the state MT ≈ YT that satisfies the summarization condition can be found.
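
The loop over steps S13 and S14 can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary layout, the function names, the 5 % tolerance (the text says only "± a few percent"), and the sample data are assumptions. Only the MT > YT branch is handled, and the loop stops if MT drops below YT.

```python
def extract_paragraphs(sub_paragraphs, paragraphs, w):
    """Return the speech paragraphs containing at least one sub-paragraph
    for which W * Ps(e) > Ps(n)."""
    emphasized = {s["i"] for s in sub_paragraphs if w * s["pe"] > s["pn"]}
    return [p for p in paragraphs if emphasized.intersection(p["subs"])]


def total_minutes(paragraphs):
    """Total duration MT in minutes (start and end are in seconds)."""
    return sum(p["end"] - p["start"] for p in paragraphs) / 60.0


def summarize(sub_paragraphs, paragraphs, yt, tolerance=0.05, step=0.001,
              max_loops=1000):
    """Lower W = 1 - step * K until MT is within the tolerance of YT."""
    for k in range(max_loops):          # K = 0 is the initial state W = 1
        w = 1.0 - step * k
        chosen = extract_paragraphs(sub_paragraphs, paragraphs, w)
        mt = total_minutes(chosen)
        if abs(mt - yt) <= tolerance * yt or mt < yt:
            break                       # converged, or fell below the target
    return chosen, w, mt


# Made-up data: four sub-paragraphs grouped into three one-minute paragraphs.
subs = [
    {"i": 1, "pe": 0.010005, "pn": 0.01},
    {"i": 2, "pe": 0.015, "pn": 0.01},
    {"i": 3, "pe": 0.020, "pn": 0.01},
    {"i": 4, "pe": 0.001, "pn": 0.01},
]
paras = [
    {"start": 0, "end": 60, "subs": [1]},
    {"start": 60, "end": 120, "subs": [2, 4]},
    {"start": 120, "end": 180, "subs": [3]},
]
chosen, w, mt = summarize(subs, paras, yt=2.0)
```

With this data, W = 1 selects all three paragraphs (MT = 3 minutes). The first reduction to W = 0.999 drops sub-paragraph 1, so two paragraphs remain and MT matches the two-minute target.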

【００４３】In the above, MT ≈ YT was used as the convergence condition for the summary time MT, but MT can also be made to converge strictly to MT = YT. For example, if the summary falls 5 seconds short of the summarization condition and adding one more speech paragraph would overshoot it by 10 seconds, reproducing only 5 seconds of that speech paragraph makes the summary match the user's condition. These 5 seconds may be taken from around the speech sub-paragraph judged to be emphasized or from the beginning of the speech paragraph. If MT < YT is found in the initial state described above, a weighting coefficient W smaller than the current value is obtained, for example as W = 1 − 0.001 × K, and the calm probabilities Ps(n) are multiplied by it. Alternatively, if MT > YT is found in the initial state, the weighting coefficient may be made larger than the current value, W = 1 + 0.001 × K, and the calm probabilities Ps(n) multiplied by it.

【００４４】The summary reproduction step S15 has been described as reproducing the speech paragraph sequence extracted in the summary extraction step S14. For video information with audio, the video corresponding to the speech paragraphs extracted as summary speech can be cut out, joined, and reproduced together with the audio, which makes it possible to summarize a television broadcast, a movie, and so on. Also, the description above multiplies either the emphasis probability or the calm probability stored for each speech sub-paragraph in the speech emphasis probability table directly by the weighting coefficient W. To detect the emphasized state more accurately, however, it is preferable to raise W to the power F, the number of frames making up the speech sub-paragraph, and to weight by W^F.

【００４５】That is, the conditional emphasis probability Ps(e) calculated by equations (8) and (10) is the product of the per-frame probabilities of being in the emphasized state, and the calm probability Ps(n) is likewise the product of the per-frame probabilities of being in the calm state. To weight the emphasis probability Ps(e) correctly, therefore, the weighting coefficient W should be applied to each per-frame emphasis probability. If F is the number of frames in the speech sub-paragraph, the effective weight becomes W^F. As a result, the influence of the weighting scales with the number of frames F: the more frames a speech sub-paragraph has, that is, the longer it lasts, the more heavily it is weighted.
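
The per-frame weighting can be illustrated with a short sketch. Working in the log domain is an assumption made here to avoid numerical underflow when many small per-frame probabilities are multiplied; the text itself describes only the product form W^F · Ps(e). The function names and sample probabilities are hypothetical.

```python
import math


def weighted_log_emphasis(frame_pe, w):
    """log(W^F * Ps(e)), where Ps(e) is the product of the per-frame
    emphasis probabilities and F is the number of frames."""
    return len(frame_pe) * math.log(w) + sum(math.log(p) for p in frame_pe)


def is_emphasized(frame_pe, frame_pn, w):
    """True when W^F * Ps(e) > Ps(n), compared in the log domain."""
    log_pn = sum(math.log(p) for p in frame_pn)
    return weighted_log_emphasis(frame_pe, w) > log_pn


# Made-up per-frame probabilities for a two-frame sub-paragraph.
frame_pe = [0.5, 0.4]    # product Ps(e) = 0.2
frame_pn = [0.4, 0.41]   # product Ps(n) = 0.164
```

Here `is_emphasized(frame_pe, frame_pn, 1.0)` holds because 0.2 > 0.164, while W = 0.9 gives 0.9² × 0.2 = 0.162, which falls below 0.164. The same W would penalize a longer sub-paragraph more strongly, since the exponent F grows with the frame count.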

【００４６】However, if the aim is simply to change the extraction condition used to judge the emphasized state, multiplying the product of the per-frame emphasis probabilities, or the product of the per-frame calm probabilities, by the weighting coefficient W is enough. The weighting coefficient therefore need not be W^F. Also, the extraction condition was changed above by weighting the emphasis probability Ps(e) or calm probability Ps(n) obtained for each speech sub-paragraph, thereby changing the number of sub-paragraphs satisfying Ps(e) > Ps(n). As another method, the probability ratio Ps(e)/Ps(n) can be computed for every speech sub-paragraph and the corresponding speech signal sections (speech sub-paragraphs) accumulated in descending order of this ratio to obtain the sum of the summary sections. When the total time of the summary sections approximately matches the predetermined summary time, the speech signal sections accumulated so far are adopted as the summary sections, and the summary speech is assembled from them.
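
The ratio-ranking alternative can be sketched as below. The stopping rule (stop before the target would be exceeded) is one assumed reading of "approximately matches"; the dictionary layout, function name, and sample data are also illustrative.

```python
def select_by_ratio(sub_paragraphs, target_seconds):
    """Accumulate sub-paragraphs in descending order of Ps(e)/Ps(n)
    until adding the next one would exceed the target summary time."""
    ranked = sorted(sub_paragraphs, key=lambda s: s["pe"] / s["pn"],
                    reverse=True)
    chosen, total = [], 0.0
    for s in ranked:
        length = s["end"] - s["start"]
        if total + length > target_seconds:
            break
        chosen.append(s)
        total += length
    # Restore time order so the summary plays back chronologically.
    return sorted(chosen, key=lambda s: s["start"]), total


# Made-up data: ratios 3.0, 1.0 and 2.0 with lengths 10 s, 20 s and 30 s.
subs = [
    {"i": 1, "start": 0.0, "end": 10.0, "pe": 0.03, "pn": 0.01},
    {"i": 2, "start": 10.0, "end": 30.0, "pe": 0.01, "pn": 0.01},
    {"i": 3, "start": 30.0, "end": 60.0, "pe": 0.02, "pn": 0.01},
]
summary, total = select_by_ratio(subs, target_seconds=40.0)
```

For a 40-second target, sub-paragraphs 1 and 3 are selected (10 s + 30 s) and sub-paragraph 2, which has the lowest ratio, is left out. A single sort replaces the repeated re-weighting passes of the iterative method.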

【００４７】In this case, if the total duration of the assembled summary speech is longer or shorter than the summary time set by the summarization condition, the extraction condition can be changed by changing the threshold on the probability ratio Ps(e)/Ps(n) used to judge the emphasized state. This method of changing the extraction condition has the advantage of simplifying the processing needed to assemble summary speech that satisfies the summarization condition. The emphasis probability Ps(e) and calm probability Ps(n) of each speech sub-paragraph were described above as the product of the per-frame emphasis probabilities and the product of the per-frame calm probabilities, respectively. As another method, the per-frame probabilities can be averaged and the averages used as the sub-paragraph's emphasis probability Ps(e) and calm probability Ps(n).

【００４８】Accordingly, when Ps(e) and Ps(n) are calculated in this way, the weighting coefficient W can be applied to Ps(e) or Ps(n) as it is. FIG. 16 shows an embodiment of a speech processing apparatus in which the summarization rate can be set freely. This embodiment is characterized by adding the following to the configuration of the speech emphasis state summarizing apparatus shown in FIG. 12: a summarization condition input unit 31, a speech emphasis probability table 32, an emphasized sub-paragraph extraction unit 33, an extraction condition changing unit 34, and a provisional summary section determination unit 35. The provisional summary section determination unit 35 contains a total duration calculation unit 35A that obtains the total duration of the summary speech, a summary section determination unit 35B that judges whether the total duration calculated by unit 35A falls within the range of the summary time entered at the summarization condition input unit 31, and a summary speech storage and reproduction unit 35C that stores and reproduces summary speech matching the summarization condition.

【００４９】As described with reference to FIG. 11, speech feature quantities are obtained from the input speech frame by frame. Based on them, the emphasis probability calculation unit 16 and the calm probability calculation unit 17 calculate an emphasis probability and a calm probability for each frame, and these are stored in the storage unit 12 together with the frame number assigned to each frame. In addition, the speech sub-paragraph sequence number assigned to each speech sub-paragraph by the speech sub-paragraph determination unit is appended to the frame sequence number, so that every frame and speech sub-paragraph has an address. In the speech processing apparatus of this embodiment, the emphasis probability calculation unit 16 and the calm probability calculation unit 17 read the per-frame emphasis and calm probabilities from the storage unit 12, obtain from them the emphasis probability Ps(e) and calm probability Ps(n) of each speech sub-paragraph, and store these in the speech emphasis table 32.

【００５０】The speech emphasis table 32 stores the emphasis and calm probabilities obtained for each speech sub-paragraph of the speech waveforms of various contents, so that a summary can be produced whenever a user requests one. The user enters a summarization condition at the summarization condition input unit 31. Here the summarization condition means the name of the content to be summarized and the summarization rate relative to the full length of that content. The summarization rate may be entered, for example, as a fraction (summarize the full length to 1/10) or as a time (summarize to 10 minutes). If 1/10 is entered, for example, the summary time calculation unit 31A calculates one tenth of the full length of the content and sends the calculated summary time to the summary section determination unit 35B of the provisional summary section determination unit 35.

【００５１】When a summarization condition is entered at the summarization condition input unit 31, the control unit 19 starts generating the summary speech. It first reads the emphasis and calm probabilities of the content requested by the user from the speech emphasis table 32. These are sent to the emphasized sub-paragraph extraction unit 33, which extracts the numbers of the speech sub-paragraphs judged to be in the emphasized state. Two methods can be used to change the condition for extracting emphasized speech sections. The first is the method described above: the emphasis probability Ps(e) or calm probability Ps(n) is multiplied by a weighting coefficient W, which acts as the reciprocal of the probability ratio, the speech sub-paragraphs with W·Ps(e) > Ps(n) are extracted, and the summary speech is obtained from the speech paragraphs containing them. The second is to calculate the probability ratio Ps(e)/Ps(n) and accumulate sections in descending order of this ratio to obtain the summary time.

【００５２】When the extraction condition is changed by weighting, its initial value can be set as W = 1. When the emphasized state is judged from the value of the probability ratio Ps(e)/Ps(n) between the emphasis probability Ps(e) and calm probability Ps(n) of each speech sub-paragraph, the initial condition can be, for example, that a sub-paragraph is emphasized when Ps(e)/Ps(n) ≥ 1. The emphasized sub-paragraph extraction unit 33 sends data representing the numbers, start times, and end times of the speech sub-paragraphs judged to be emphasized under this initial setting to the provisional summary section determination unit 35. The provisional summary section determination unit 35 searches the speech paragraph sequence stored in the storage unit 12 for the speech paragraphs containing the sub-paragraph numbers judged to be emphasized and extracts them. The total duration calculation unit 35A calculates the total duration of the extracted speech paragraph sequence, and the summary section determination unit 35B compares it with the summary time entered as the summarization condition. If the result satisfies the summarization condition, the speech paragraph sequence is stored and reproduced by the summary speech storage and reproduction unit 35C. For reproduction, speech paragraphs are extracted from the numbers of the speech sub-paragraphs judged to be emphasized by the emphasized sub-paragraph extraction unit 33, and the audio data or video data of the content is read out by specifying the start and end times of those speech paragraphs and sent out as summary audio and summary video data.

【００５３】If the summary section determination unit 35B judges that the summarization condition is not satisfied, it outputs an extraction condition change command to the extraction condition changing unit 34. The extraction condition changing unit 34 changes the extraction condition and supplies the new condition to the emphasized sub-paragraph extraction unit 33. Following the new condition, the emphasized sub-paragraph extraction unit 33 again compares the emphasis and calm probabilities of each speech sub-paragraph stored in the speech emphasis probability table 32. Its result is sent again to the provisional summary section determination unit 35, which extracts the speech paragraphs containing the sub-paragraphs judged to be emphasized. The total duration of the extracted speech paragraphs is calculated, and the summary section determination unit 35B judges whether it satisfies the summarization condition. This operation is repeated until the summarization condition is satisfied, and the speech paragraph sequence that satisfies it is read from the storage unit 12 as summary audio and summary video data and delivered to the user terminal.

【００５４】The start and end times of a summary section were described above as the start and end times of the speech paragraph sequence judged to be a summary section. For content with video, another possible method is to detect the cut points of the video signal closest to the start and end times of that speech paragraph sequence and to define the start and end times of the summary section by the times of those cut points (detected, for example, from the signal generated at a scene change, as described in Japanese Patent Application Laid-Open No. 8-32924). When video cut points are used as the start and end times of the summary sections in this way, the switch between summary sections is synchronized with a change of picture, which has the advantage of being easier on the eye and making the summary easier to understand.
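
Aligning a summary section to video cut points might look like the sketch below. It assumes the cut-point times are already available as a sorted, non-empty list; how the cuts are detected is outside its scope, and the function names and sample times are hypothetical.

```python
from bisect import bisect_left


def nearest_cut(cut_times, t):
    """Return the cut time closest to t; cut_times must be sorted and non-empty."""
    i = bisect_left(cut_times, t)
    candidates = cut_times[max(i - 1, 0):i + 1]  # neighbours on either side of t
    return min(candidates, key=lambda c: abs(c - t))


def snap_section(cut_times, start, end):
    """Replace a section's start and end times by the nearest cut points."""
    return nearest_cut(cut_times, start), nearest_cut(cut_times, end)


# Made-up cut-point times in seconds.
cuts = [0.0, 9.5, 21.0, 30.2]
section = snap_section(cuts, 10.0, 20.0)
```

In this example the section 10.0 s to 20.0 s becomes 9.5 s to 21.0 s. A fuller version would need to handle the case where both ends snap to the same cut point.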

【００５５】As described above, the speech summarization method used in the present invention extracts, as summary sections, the speech sections in which the speech waveform is highly likely to be in the emphasized state, so the video extracted with the same start and end times is in many cases also an important part of the content. Consequently, the summarization method used in the present invention has the advantage of yielding summary information that accurately conveys the content to the viewer. FIG. 1 shows an example of the processing procedure of the video reproducing method according to the present invention. The present invention records the real-time video and audio signals being broadcast in association with their reproduction times and, while recording continues, reproduces a summary of the portion already recorded. This is hereinafter called catch-up reproduction.

【００５６】In step S111, the start time or start picture of the section to be caught up on is specified. For example, a viewer who leaves temporarily while watching a program can mark the time of leaving by pressing a button. Alternatively, a sensor on the door of the room can detect from the opening and closing of the door that the viewer has left the room and mark the time of leaving. Or, independently of any absence, part of an already recorded program can be played in fast forward and the viewer can choose any position in the video as the start of catch-up reproduction. In step S112, the summarization condition (summary time or summarization rate) is entered. This is done when the viewer returns. If the viewer was away for 30 minutes, for example, the viewer enters a summarization condition according to how far the content broadcast during those 30 minutes should be compressed for viewing. Alternatively, a predetermined default value, for example 3 minutes, can be used when the viewer enters nothing, or several candidates can be displayed for the viewer to choose from.

【００５７】Also, if the viewer comes home after a timer recording has already started automatically, for example, the recording start time is known from the timer setting, so the summary end time is determined by the instruction to start summary reproduction. If the summarization condition has been fixed in advance, for example by a default value, the span from the recording start time to the summary end time is summarized according to that condition. In step S113, the start of catch-up reproduction is specified. This also specifies the end point of the section to be summarized (the summary end time). The viewer may specify it by pressing a button that starts catch-up reproduction, in which case the time of the press is used. Alternatively, an open/close sensor on the door of the room may detect the viewer's entry, and the entry time may be used to start catch-up reproduction.
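
How steps S111 to S113 fix the section to be summarized and the target summary time can be illustrated with a small helper. The function and parameter names are assumptions; the 3-minute default is the example value mentioned in the text, and `x` stands for the X in a 1/X summarization rate.

```python
def catch_up_target(section_start, section_end, summary_time=None, x=None,
                    default_time=180.0):
    """Return (section length, target summary time), both in seconds.

    section_start: time marked when the viewer left (step S111)
    section_end:   time at which catch-up reproduction is requested (step S113)
    summary_time:  summary time entered by the viewer, if any (step S112)
    x:             summarize to 1/x of the section length, if entered instead
    default_time:  fallback when the viewer enters nothing (3 minutes here)
    """
    length = section_end - section_start
    if summary_time is not None:
        return length, summary_time
    if x is not None:
        return length, length / x
    return length, default_time
```

For a 30-minute absence from 600 s to 2400 s, `catch_up_target(600.0, 2400.0)` returns `(1800.0, 180.0)`, and entering a 1/10 rate with `x=10` gives the same 180-second target.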

【００５８】In step S114, reproduction of the video currently being broadcast is stopped. In step S115, summarization is performed and the summary video and audio are reproduced. Summarization identifies the summary sections, according to the specified summarization condition, in the audio signal from the catch-up start time or start picture set in step S111 to the summary end time entered in step S113, and reproduces the audio signal of those summary sections together with the video signal synchronized with them. In step S116, reproduction of the summary sections ends. In step S117, reproduction of the broadcast video is resumed.

[0059] FIG. 2 shows an example of a video reproducing apparatus 100 according to the present invention that performs the catch-up playback described above. The video reproducing apparatus 100 can be composed of a recording unit 101, an audio separation unit 102, an audio summarization unit 103, a summary section reading unit 104, and a mode switching unit 105. The recording unit 101 uses recording/playback means capable of high-speed writing and reading, such as a hard disk, a semiconductor memory, or a DVD-ROM. Because writing and reading can be performed at high speed, the already recorded portion of a program can be played back while the program currently being broadcast is still being recorded. The input signal S1 is supplied from a television tuner or the like and may be either an analog or a digital signal. Recording in the recording unit 101, however, is performed as a digital signal.

[0060] The audio separation unit 102 separates the audio from the video signal in the designated summarization target section and inputs this audio signal to the audio summarization unit 103. The audio summarization unit 103 uses the audio signal to extract emphasized portions and identify the summary sections. During recording, the audio summarization unit 103 continuously analyzes the audio signal, creates the speech emphasis probability table shown in FIG. 15 for each program being recorded, and stores it in a storage unit. Consequently, when summary playback of the recorded portion is requested partway through a broadcast, the summarization is performed using this table. The same table is used when the summary of a recorded program is viewed at a later date.

[0061] The summary section reading unit 104 reads the video signal with audio from the recording unit 101 according to the summary sections identified by the audio summarization unit 103 and outputs it to the mode switching unit 105. The mode switching unit 105 outputs the video signal with audio read by the summary section reading unit 104 as the summary video signal, which is presented to the viewer. In addition to mode a, which outputs the summary video, the mode switching unit 105 can be switched to playback mode b, which outputs the video signal read from the recording unit 101, and mode c, which presents the input signal S1 directly for viewing, so that the apparatus can be used in various ways. The catch-up playback method described above has a drawback, however: video broadcast while catch-up playback is in progress is not included in the summarization target section, so the viewer cannot watch it.
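The three output modes of the mode switching unit 105 amount to a selector over signal sources. A minimal sketch, assuming hypothetical names (`Mode`, `select_output`) and treating each signal as an opaque string:

```python
from enum import Enum


class Mode(Enum):
    SUMMARY = "a"    # summary video read by the summary section reading unit 104
    PLAYBACK = "b"   # video signal read from the recording unit 101
    LIVE = "c"       # input signal S1 passed straight through


def select_output(mode: Mode, summary: str, recorded: str, live: str) -> str:
    """Return the signal that the mode switching unit would forward."""
    if mode is Mode.SUMMARY:
        return summary
    if mode is Mode.PLAYBACK:
        return recorded
    return live


out = select_output(Mode("c"), "summary", "recorded", "S1")
```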

[0062] To eliminate this drawback, the present invention proposes the following embodiment. After playback of the summary sections ends, the previous playback start time is taken as the new summarization target start time and the time at which playback of the summary sections ended is taken as the new summary end time, and the summarization process and the playback of the summarized video and audio are repeated. The repetition ends when the interval between the previous playback start time and the summary playback end time becomes no longer than a predetermined time (for example, 5 to 10 seconds). In this case, however, a problem arises: the total summary playback lasts longer than the specified summarization rate or summarization time calls for. For example, let $T_A$ be the duration of the summarization target section, and let it be summarized at a summarization rate $r$ ($0 < r < 1$, where $r$ = summarization time (the total duration of the summary sections) / duration of the summarization target section). The summarization time of the first pass is then $T_1 = T_A r$. The second pass summarizes the interval covered by the first pass at the same rate, so its summarization time is $T_A r^2$. Because this process is repeated successively, the time required until the repetition ends is $T_A r / (1 - r)$.
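The closed form follows from a geometric series. As a worked check, assume the repetition is carried on indefinitely (in practice it stops once the remaining interval is 5 to 10 seconds or less, so the true total is slightly smaller), and write $T_{\text{total}}$, a symbol introduced here, for the total playback time:

```latex
\begin{align*}
T_{\text{total}} &= T_A r + T_A r^2 + T_A r^3 + \cdots
  = T_A \sum_{k=1}^{\infty} r^k
  = \frac{T_A r}{1 - r}, \qquad 0 < r < 1 .
\end{align*}
```

For example, with $T_A = 60$ minutes and $r = 0.2$, the total is $60 \times 0.2 / 0.8 = 15$ minutes rather than the 12 minutes that the specified rate implies.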

[0063] Here, the specified summarization rate $r$ is adjusted to $r/(1+r)$, and summarization is performed with this adjusted rate. The time required until the repetition ends then becomes $T_A r$, which matches the specified summarization rate. Similarly, even when a summarization time $T_1$ is specified, provided the duration $T_A$ of the summarization target section is given, the specified summarization rate is $r = T_1/T_A$, so the summarization rate may be adjusted to $T_1/(T_A + T_1)$, or equivalently the summarization time of the first pass may be adjusted to $T_A T_1/(T_A + T_1)$. FIG. 3 shows another embodiment for eliminating the above drawback. In this embodiment, the input signal S1 is output as is, so that the video currently being broadcast is displayed on the main screen 200 of the display (see FIG. 4), while a sub-screen conversion processing unit 106 is provided in the mode switching unit 105. The summary video and audio converted for sub-screen display are superimposed on the input signal S1 and output, and the summary video is displayed on the sub-screen 201 (see FIG. 4). This embodiment thus provides a mixed mode d.
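The effect of adjusting the summarization rate to r/(1+r) in paragraph [0063] can be checked numerically. The sketch below is an illustration, not the patent's implementation: `total_catch_up_time`, `adjusted_rate`, and `stop_below` are names introduced here, and each pass is assumed to take exactly the rate times the interval it summarizes.

```python
def total_catch_up_time(t_a: float, rate: float, stop_below: float = 5.0) -> float:
    """Total playback time when summarization is repeated on the backlog.

    Each pass plays rate * interval seconds; the broadcast that aired during
    that pass becomes the interval for the next pass. Repetition stops once
    the interval to summarize is stop_below seconds or less.
    """
    total = 0.0
    interval = t_a
    while interval > stop_below:
        played = rate * interval
        total += played
        interval = played  # backlog accumulated while this pass was playing
    return total


def adjusted_rate(rate: float) -> float:
    """Adjusted rate r / (1 + r) from paragraph [0063]."""
    return rate / (1.0 + rate)


T_A = 3600.0  # one hour of recorded programme, in seconds (example value)
r = 0.2       # specified summarization rate (example value)

# Without adjustment the total approaches T_A * r / (1 - r).
unadjusted = total_catch_up_time(T_A, r)
# With the adjusted rate the total approaches T_A * r.
adjusted = total_catch_up_time(T_A, adjusted_rate(r))
```

The small gap between each simulated total and its closed form comes from stopping the repetition at the `stop_below` threshold instead of summing the series indefinitely.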

[0064] According to this embodiment, the viewer can watch a summary of the previously broadcast portion of the program on the sub-screen 201 while watching the program currently on air on the main screen 200. Because the content broadcast while the summary is being viewed is received from the main screen 200, by the time the whole summary has been played back the viewer can follow the program almost without a break from its first part up to the point currently on air. The video reproducing method according to the present invention described above is realized by causing a computer to execute a video reproducing program. In this case, the video reproducing program is downloaded via a communication line, or stored on a recording medium such as a CD-ROM or a magnetic disk, and installed in the computer, and a processing unit such as the CPU in the computer executes the method of the present invention.

[0065]

[Effects of the Invention]

As described above, according to the present invention, a recorded program can be compressed to an arbitrary compression rate and summarized, and the resulting summary can be played back. The contents of a large number of recorded programs can therefore be identified in a short time, which has the advantage that, for example, finding the program one really wants to watch takes little time. Furthermore, even when a program that could not be watched from the beginning is on air, its first part can be summarized and viewed, so a viewer who starts watching partway through can still grasp the program as a whole and continue watching.

[Brief description of drawings]

FIG. 1 is a flowchart for explaining an example of the procedure for carrying out the video reproducing method according to the present invention.

FIG. 2 is a block diagram for explaining an embodiment of the video reproducing apparatus according to the present invention.

FIG. 3 is a block diagram for explaining a modified embodiment of the video reproducing apparatus according to the present invention.

FIG. 4 is a diagram showing an example of a display screen, for explaining a feature of the video reproducing apparatus according to the present invention.

FIG. 5 is a flowchart for explaining the previously proposed speech summarization method.

FIG. 6 is a flowchart for explaining the previously proposed method of extracting speech paragraphs.

FIG. 7 is a diagram for explaining the relationship between speech paragraphs and speech sub-paragraphs.

FIG. 8 is a flowchart showing an example of a method of determining the utterance state of an input speech sub-paragraph in step S2 shown in FIG. 5.

FIG. 9 is a flowchart showing an example of the procedure for creating the codebook used in the previously proposed speech summarization method.

FIG. 10 is a diagram showing an example of how the codebook used in the present invention is stored.

FIG. 11 is a waveform diagram for explaining the calculation of the utterance state likelihood.

FIG. 12 is a block diagram for explaining an embodiment of the previously proposed speech emphasis state determination apparatus and speech summarization apparatus.

FIG. 13 is a flowchart for explaining a summarization method in which the summarization rate can be changed freely.

FIG. 14 is a flowchart for explaining the operation of extracting the speech sub-paragraphs used for speech summarization, the operation of calculating the emphasis probability of each speech sub-paragraph, and the operation of extracting the calm probability of each speech sub-paragraph.

FIG. 15 is a diagram for explaining the structure of the speech emphasis probability table used in the speech summarization apparatus.

FIG. 16 is a block diagram for explaining an example of a speech summarization apparatus in which the summarization rate can be changed freely.

[Explanation of symbols]

100 Video reproducing apparatus
101 Recording unit
102 Audio separation unit
103 Audio summarization unit
104 Summary section reading unit
105 Mode switching unit
106 Sub-screen conversion processing unit

Continuation of front page
(51) Int. Cl.7, identification code, FI, theme code (reference): G10L 15/06; H04N 5/91 C; 15/10; G10L 3/00 531N; H04N 5/91; 521U; 513A
(72) Inventor: Osamu Mizuno, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo
F-term (reference): 5C053 HA27 JA23 LA11; 5D015 DD03 FF00 FF05 HH04 KK01

Claims

[Claims]

1. A video reproducing method characterized by: storing a real-time video signal and audio signal in association with a playback time; inputting a summary start time; inputting a summarization time, which is the total duration of the summary sections, or a summarization rate, which is the ratio of the total duration of the summary sections to the entire summarization target section; determining as summary sections, in accordance with the summarization time or summarization rate, the sections judged to be in the emphasized state in the audio signal of the summarization target section running from the summary start time to the present, which serves as the summary end time; and reproducing the audio signal and the video signal of the summary sections.

2. The video reproducing method according to claim 1, characterized in that the determination of the summary sections and the reproduction of the audio signal and the video signal of the summary sections are repeated, with the reproduction end time of the audio signal and the video signal of the summary sections taken as the new summary section end time and the reproduction end time of the summary sections taken as the new summary reproduction start time.

3. The video reproducing method according to claim 2, characterized in that the summarization rate r (r being a real number satisfying 0 < r < 1) is adjusted to r / (1 + r), and the summary sections are determined based on the adjusted summarization rate.

4. The video reproducing method according to claim 1, characterized in that: a codebook is used that stores, in association with one another, a feature quantity comprising at least the fundamental frequency or pitch period, the power, the temporal variation characteristic of a dynamic feature, or the inter-frame differences of these, the appearance probability in the emphasized state, and the appearance probability in the calm state; the appearance probability in the emphasized state corresponding to the feature quantity obtained by analyzing the audio signal frame by frame is obtained, and the appearance probability in the calm state corresponding to that feature quantity is obtained; the probability of being in the emphasized state is calculated from the appearance probability in the emphasized state, and the probability of being in the calm state is calculated from the appearance probability in the calm state; the probability ratio of the probability of being in the emphasized state to the probability of being in the calm state is calculated for each audio signal section; the summarization time is calculated by accumulating the durations of the audio signal sections in descending order of the probability ratio; and the audio signal sections for which the summarization rate, that is, the ratio of the summarization time to the entire summarization target section, equals the input summarization rate are determined as the summary sections.

5. A codebook is used that stores, in association with one another, a feature quantity comprising at least the fundamental frequency or pitch period, the power, the temporal variation characteristic of a dynamic feature, or the inter-frame differences of these, the appearance probability in the emphasized state, and the appearance probability in the calm state. The appearance probability in the emphasized state corresponding to the feature quantity obtained by analyzing the audio signal frame by frame is obtained, and the appearance probability in the calm state corresponding to that feature quantity is obtained. The probability of being in the emphasized state is calculated from the appearance probability in the emphasized state, and the probability of being in the calm state is calculated from the appearance probability in the calm state. Audio signal sections in which the probability ratio of the probability of being in the emphasized state to the probability of being in the calm state is greater than a predetermined coefficient are provisionally determined to be summary sections, and the total duration of the summary sections, or the summarization rate given by the ratio of the total duration of the summary sections to the duration of all audio signal sections, is calculated. The summary sections are determined by calculating the predetermined coefficient at which the total duration of the summary sections becomes approximately the predetermined summarization time, or at which the summarization rate becomes approximately the predetermined summarization rate. The video reproducing method according to any one of 1.

6. The video reproducing method according to claim 4, characterized in that: each frame of the audio signal is judged as to whether it is a silent section and whether it is a voiced section; a portion that contains a voiced section and is enclosed by silent sections of at least a predetermined number of frames is determined to be a speech sub-paragraph; a group of speech sub-paragraphs ending with a speech sub-paragraph in which the average power of the voiced sections is smaller than a predetermined constant multiple of the average power within that speech sub-paragraph is determined to be a speech paragraph; the audio signal sections are defined in units of speech paragraphs; and the summarization time is accumulated speech paragraph by speech paragraph.

7. A video reproducing apparatus characterized by comprising: storage means for storing a real-time video signal and audio signal in association with a playback time; summary start time input means for inputting a summary start time; summarization condition input means for inputting a summarization condition defined by a summarization time, which is the total duration of the summary sections, or a summarization rate, which is the ratio of the total duration of the summary sections to the entire summarization target section; summary section determining means for determining as summary sections, in accordance with the summarization condition, the sections judged to be in the emphasized state in the audio signal of the summarization target section running from the summary start time to the present, which serves as the summary end time; and reproducing means for reproducing the audio signal and the video signal of the summary sections determined by the summary section determining means.

8. The video reproducing apparatus according to claim 7, wherein the summary section determining means comprises: a codebook that stores, in association with one another, a feature quantity comprising at least the fundamental frequency or pitch period, the power, the temporal variation characteristic of a dynamic feature, or the inter-frame differences of these, the appearance probability in the emphasized state, and the appearance probability in the calm state; an emphasized-state probability calculation unit that uses the codebook to obtain the appearance probability in the emphasized state corresponding to the feature quantity obtained by analyzing the audio signal frame by frame, and calculates the probability of being in the emphasized state from that appearance probability; a calm-state probability calculation unit that uses the codebook to obtain the appearance probability in the calm state corresponding to the feature quantity obtained by analyzing the audio signal frame by frame, and calculates the probability of being in the calm state from that appearance probability; a provisional summary section determining unit that calculates, for each audio signal section, the probability ratio of the probability of being in the emphasized state to the probability of being in the calm state, calculates the summarization time by accumulating the durations of the audio signal sections in descending order of the probability ratio, and provisionally determines the summary sections; and a summary section determining unit that determines as the summary sections the audio signal sections for which the ratio of the summary sections to the entire summarization target section satisfies the summarization rate.

9. The video reproducing apparatus according to claim 7, wherein the summary section determining means comprises: a codebook that stores, in association with one another, a feature quantity comprising at least the fundamental frequency or pitch period, the power, the temporal variation characteristic of a dynamic feature, or the inter-frame differences of these, the appearance probability in the emphasized state, and the appearance probability in the calm state; an emphasized-state probability calculation unit that uses the codebook to obtain the appearance probability in the emphasized state corresponding to the feature quantity obtained by analyzing the audio signal frame by frame, and calculates the probability of being in the emphasized state from that appearance probability; a calm-state probability calculation unit that uses the codebook to obtain the appearance probability in the calm state corresponding to that feature quantity, and calculates the probability of being in the calm state from that appearance probability; a provisional summary section determining unit that provisionally determines as summary sections the audio signal sections in which the probability ratio of the probability of being in the emphasized state to the probability of being in the calm state is greater than a predetermined coefficient; and a summary section determining unit that determines the summary sections for each channel or each speaker by calculating the predetermined coefficient at which the summarization rate becomes approximately the predetermined summarization rate.

10. A video reproducing program written in computer-readable code that causes a computer to execute the video reproducing method according to claim 1.
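Claims 4 and 5 describe two ways of choosing summary sections from per-section probability ratios: accumulating sections in descending order of the ratio until the requested summarization rate is reached, or searching for a threshold coefficient on the ratio that yields approximately the requested rate. A minimal sketch of both, assuming the ratios have already been computed. The names `Section`, `select_by_rank`, and `select_by_threshold`, the bisection search, and the sample values are assumptions made here for illustration and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    start: float   # seconds
    end: float     # seconds
    ratio: float   # P(emphasized) / P(calm) for this audio signal section

    @property
    def duration(self) -> float:
        return self.end - self.start


def select_by_rank(sections: List[Section], rate: float) -> List[Section]:
    """Claim 4 style: add sections in descending ratio order until the
    accumulated time reaches rate * total duration."""
    budget = rate * sum(s.duration for s in sections)
    chosen, used = [], 0.0
    for s in sorted(sections, key=lambda s: s.ratio, reverse=True):
        if used >= budget:
            break
        chosen.append(s)
        used += s.duration
    return sorted(chosen, key=lambda s: s.start)  # play back in time order


def select_by_threshold(sections: List[Section], rate: float,
                        iterations: int = 40) -> List[Section]:
    """Claim 5 style: find a coefficient k so that the sections with
    ratio > k give roughly the requested rate without exceeding it.
    Bisection is an assumed search method, not stated in the claims."""
    total = sum(s.duration for s in sections)
    lo, hi = 0.0, max(s.ratio for s in sections)
    for _ in range(iterations):
        k = (lo + hi) / 2.0
        kept = sum(s.duration for s in sections if s.ratio > k)
        if kept / total > rate:
            lo = k   # too much kept: raise the threshold
        else:
            hi = k   # rate not exceeded: try a lower threshold
    return [s for s in sections if s.ratio > hi]


# Sample data (made up): five sections covering 100 s in total.
demo = [
    Section(0, 10, 0.5),
    Section(10, 30, 2.0),
    Section(30, 40, 1.2),
    Section(40, 60, 0.8),
    Section(60, 100, 3.0),
]
ranked = select_by_rank(demo, rate=0.6)
thresholded = select_by_threshold(demo, rate=0.6)
```

Because sections have different lengths, neither method can hit an arbitrary rate exactly. The ranking version may overshoot by up to one section, while this threshold version stays at or below the requested rate.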