JP4416133B2 - Audio summary information extraction device and recording medium

Info

Publication number: JP4416133B2
Application number: JP2005357528A
Authority: JP
Inventors: 勝菅野; 康之中島; 広昌柳原
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2005-12-12
Filing date: 2005-12-12
Publication date: 2010-02-17
Anticipated expiration: 2020-12-27
Also published as: JP2006146253A

Description

The present invention relates to an extraction device and a recording medium for audio summary information and audio-video summary information, which extract summary information (a summary) for efficiently grasping the content of uncompressed or compressed audio information, or of compressed audio-video information. The invention also relates to a playback device for audio-video summary information that enables fast and efficient browsing of audio or audio-video information by extracting the summary information at high speed in the compressed data domain from uncompressed or compressed audio information or compressed audio-video information.

Automatic video summarization has been reported, for example, by Tanaka, Wakimoto, and Kanda in "Development of Automatic Video Information Summarization and Browsing Technology by Scene Detection", IEICE Technical Report, IE99-20 (1999) (first prior art). In that work, scene change points in the video are detected, the scenes are organized into a hierarchical structure, and a video summary is created automatically by assigning a priority to each scene. In particular, a high priority is given to the scene immediately following a change of topic. Within the hierarchy, higher layers are assigned higher priority.

In "Using content models to build audio-video summaries" by J. Saarela and B. Merialdo, SPIE Conference on Storage and Retrieval for Image and Video Databases VII (1999) (second prior art), the creation of summary information for general-purpose video is treated as a constrained optimization problem. The constraints include minimum shot length, audio-video synchronization, video continuity, and audio-video redundancy. A content model of the video (a symbolic description) is built manually, and the summary is created from it.

R. Lienhart et al., "Video Abstracting", Communications of the ACM, Vol. 40, No. 12 (1997) (third prior art), aims at creating summary information specialized for movie trailers. The main steps are dividing the video into shots, analyzing clips that contain special events, selecting clips, and assembling the clips. The special events extracted include recognition of actors' faces, identification of dialogue, extraction of text from titles, and events such as gunfire and explosions.

Summary extraction from audio-video according to these prior arts has the following problems. The first prior art provides efficient summarization through hierarchical structuring of the video, but the structuring must be done manually, so for long audio-video content the hierarchical structuring required as preprocessing may take a great deal of time. Moreover, because priorities are assigned according to the layer to which a scene belongs, manual work is needed in practice in many cases. Finally, although the method targets video alone and could be extended to audio-video, it does not exploit the characteristics of the audio; audio information carrying important content may therefore go unused, and an appropriate summary may not be obtained.

In the second prior art, the video content model must be built manually, and no scheme is adopted for making effective use of audio characteristics. In addition, classifying video segments requires advanced recognition techniques that are difficult to realize on compressed data, so the processing load for summary extraction is expected to be large. The third prior art likewise requires such advanced processing, which reaches into the semantic content of the material.

Thus, in the prior art, manual processing often intervenes somewhere between the input of the audio-video information and the output of the summary information, and because audio information is not used effectively, summary creation based on audio characteristics is not considered. Furthermore, no effective method has so far been studied for automatic summary extraction from compressed or uncompressed audio alone.

From the structural standpoint of the summary as well, there is no guarantee that summary information is extracted uniformly from the whole of the audio-video information. In addition, the first and second prior arts do not adequately provide control that brings the result close to an externally specified summary information length.

The present invention has been made in view of the prior art described above. Its object is to provide an extraction device, a playback device, and a recording medium for audio summary information and audio-video summary information which, in extracting summary information (a summary) for efficiently grasping the content of uncompressed or compressed audio information or compressed audio-video information, analyze the spatio-temporal characteristics of the audio and video in the compressed data domain, evaluate them jointly as needed, and make it possible to extract summary information of a specified length uniformly from the entire content and at high speed.

To achieve this object, a first feature of the present invention is that it comprises: means for extracting subband data from input compressed audio content, or means for converting input audio content into subband data; means for calculating, from the subband data, a sum of subband energies weighted by band; means for calculating, from that sum, a subband energy sum per unit time; means for determining, as the summary information start time, the time position at which the subband energy sum per unit time is maximal, and for determining, as the summary information end time, the time position at which the subband energy sum per unit time becomes α times the maximum value (where 0 < α < 1); and means for extracting, as the summary information, the audio information in the interval between the summary information start time and the summary information end time.
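The start/end selection in the first feature can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `summary_interval`, the use of list indices as unit-time positions, and the choice to scan forward from the peak for the first unit whose energy has dropped to α times the maximum are all assumptions, since the text only states that the end time is where the sum "becomes α times the maximum".

```python
from typing import Sequence, Tuple


def summary_interval(unit_energy: Sequence[float], alpha: float) -> Tuple[int, int]:
    """Return (start, end) unit-time indices of the summary segment.

    unit_energy[t] is the weighted subband energy sum for unit time t.
    """
    if not unit_energy:
        raise ValueError("unit_energy must not be empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")

    peak = max(unit_energy)
    start = list(unit_energy).index(peak)  # first position of the maximum

    # Assumption: the end is the first later unit whose energy has fallen
    # to alpha * peak or below; fall back to the last unit otherwise.
    end = len(unit_energy) - 1
    for t in range(start + 1, len(unit_energy)):
        if unit_energy[t] <= alpha * peak:
            end = t
            break
    return start, end
```

With a larger α the segment ends sooner after the peak, so α acts as a simple control on the length of the extracted audio.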

A second feature of the present invention is that it provides a computer-readable recording medium storing a program for extracting audio summary information, the program causing a computer for extracting audio summary information to function as: at least one of means for extracting subband data from input compressed audio content and means for converting input audio content into subband data; means for calculating, from the subband data, a sum of subband energies weighted by band; means for calculating, from that sum, a subband energy sum per unit time; means for determining, as the summary information start time, the time position at which the subband energy sum per unit time is maximal, and for determining, as the summary information end time, the time position at which the subband energy sum per unit time becomes α times the maximum value (where 0 < α < 1); and means for extracting, as the summary information, the audio information in the interval between the summary information start time and the summary information end time.

According to the present invention, summary information for grasping content quickly and efficiently can be extracted from uncompressed or compressed audio information, or from compressed audio-video information. This extraction in turn enables high-speed browsing of audio-video information.

In addition, the length of the extracted summary information can be specified arbitrarily, and because the summary information is extracted uniformly from the audio-video information, the content as a whole can be grasped efficiently.

Furthermore, by describing the time information and other data contained in the summary information, a feature description of the audio-video information in the form of summary information becomes possible, and the invention can also be applied to content description standards such as MPEG-7.

It also becomes possible to provide advanced playback of the summary information, such as changing the display speed of the extracted summary information.

The present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an audio-video summary information extraction device according to a first embodiment of the present invention.

First, compressed audio-video information AV is input to the content analysis unit 1, which analyzes the temporal structure of the audio-video information AV. The content analysis unit 1 has the configuration shown in FIG. 2, and the compressed audio-video information AV is first input to the shot division unit 11. The shot division unit 11 divides the video information into shots S, counts the number of shots in the entire content, and holds the total as NS. Next, the division number determination unit 12 determines the division number NP by equation (1) below, using the total number of shots NS supplied by the shot division unit 11, the content length CL obtained from the audio-video information AV, and the externally specified summary information length SL.
NP = NS × SL / CL ... (1)
Here it is assumed that SL > CL / NS. For example, when content with a length CL of 1 hour is to be condensed into a summary of length SL of 5 minutes and the total number of shots NS is 240, the division number NP is 20. The content division 13 then divides the content into NP equal parts, and each equally divided section SAV is mapped to the shots S that belong to it.
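Equation (1) and the mapping of shots to equal sections can be sketched as below. The function names, the truncation of NP to an integer, and the rule that a shot belongs to the section containing its start time are assumptions made for illustration; the text does not specify how fractional values or shots straddling a section boundary are handled.

```python
from typing import List, Sequence


def division_count(total_shots: int, summary_len: float, content_len: float) -> int:
    """NP = NS * SL / CL (equation (1)); lengths share one time unit."""
    if total_shots <= 0 or content_len <= 0:
        raise ValueError("total_shots and content_len must be positive")
    if summary_len <= content_len / total_shots:
        raise ValueError("equation (1) assumes SL > CL / NS")
    return int(total_shots * summary_len / content_len)  # truncation is an assumption


def map_shots_to_sections(shot_starts: Sequence[float], content_len: float,
                          np_sections: int) -> List[List[int]]:
    """Assign each shot index to the equal-length section holding its start time."""
    section_len = content_len / np_sections
    sections: List[List[int]] = [[] for _ in range(np_sections)]
    for idx, start in enumerate(shot_starts):
        sec = min(int(start // section_len), np_sections - 1)
        sections[sec].append(idx)
    return sections
```

For the example in the text, `division_count(240, 300, 3600)` (seconds) gives 20, so each section is 180 seconds long and should contribute about 15 seconds of summary.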

The following processing is performed for each section SAV supplied by the divided section input 14.
First, the first shot Sn is input at the shot input 15. If the shot length SHLn of the input shot Sn is less than or equal to a predetermined length T, the shot length evaluation unit 16 excludes it from the candidates for the summary information SUM, and processing moves on to the input of the next shot Sn+1. If the shot length SHLn is greater than T, the shot Sn is treated as a candidate for the summary information SUM and is sent to the representative frame extraction unit 17.

The representative frame extraction unit 17 extracts either the first frame SFSn of the shot or a feature frame KFSn as the representative frame RFSn of the shot Sn. The feature frame KFSn can be extracted using, for example, the method described in Japanese Patent Application No. 2000-065259. The representative frame feature value extraction unit 18 then extracts the feature value CHFSn of the frame RFSn extracted by the representative frame extraction unit 17, and the shot feature value extraction unit 19 extracts the feature value CHSn of the shot Sn.

The feature value CHFSn of the first frame SFSn or feature frame KFSn serving as the representative frame RFSn of the shot Sn, the feature value CHSn of the shot Sn, or both, are sent to the representative shot evaluation unit 1A. As the feature value CHFSn of the first frame SFSn or the feature frame KFSn, a descriptor defined in MPEG-7 (Moving Picture Experts Group phase 7) can be used, for example.

The representative shot evaluation unit 1A judges the similarity between the feature values CHF of the representative frames RF of all shots already registered as representative shots RS in the current section SAV and the feature value CHFSn of the representative frame RFSn of the shot Sn sent from the representative frame feature value extraction unit 18. If the similarity between a feature value CHF and the feature value CHFSn of the input shot Sn is judged to be high, the shot is regarded as a repeated shot, the shot Sn is excluded from the candidates for the summary information SUM, and processing moves on to the input of the next shot Sn+1. If the similarity is judged to be low, the shot is regarded as an independent shot; the shot Sn is treated as a candidate for the summary information SUM, registered at the representative shot registration 1B, and then sent to the audio level evaluation unit 2. The feature values of a shot Sn registered as a representative shot are held separately and used in subsequent representative shot evaluations.

In judging shot similarity, the shot feature values CHS of the shots already registered as representative shots RS and the feature value CHSn of the shot Sn sent from the shot feature value extraction unit 19 may likewise be used instead of, or together with, the frame feature values.
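The repeated-shot check can be sketched as a small registry of representative-shot features. The text does not fix the feature or the similarity measure, so the normalized histogram feature, the histogram-intersection similarity, the class name `RepresentativeShots`, and the threshold value are all hypothetical choices for illustration.

```python
from typing import List, Sequence


def histogram_intersection(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity in [0, 1] between two normalized histograms (assumed feature)."""
    return sum(min(x, y) for x, y in zip(a, b))


class RepresentativeShots:
    """Features of shots already registered as representative in one section."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.features: List[Sequence[float]] = []

    def try_register(self, feature: Sequence[float]) -> bool:
        """Return False for a repeated shot; otherwise register it and return True."""
        for registered in self.features:
            if histogram_intersection(registered, feature) >= self.threshold:
                return False  # too similar to an earlier representative shot
        self.features.append(feature)
        return True

    def clear(self) -> None:
        """Forget all registered features, e.g. when a new section starts."""
        self.features.clear()
```

Keeping the registry per section means a shot is only compared with representative shots from the same section, which matches the per-section processing described above.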

In the audio level evaluation unit 2, for the shot Sn input from the content analysis unit 1, the subband data extraction unit 21 of FIG. 3 extracts the subband data SDSn of the audio portion of the shot Sn. The silence analysis unit 22 then calculates feature values for silence, and if the silence determination unit 23 judges that the whole of the shot Sn, or Y% or more of it, is silent, the shot Sn is excluded from the candidates for the summary information SUM and processing moves on to the input of the next shot Sn+1. If, on the other hand, the silence determination unit 23 judges that (100−Y)% or more of the audio portion of the shot Sn is non-silent, the shot Sn is sent to the low-level sound analysis unit 24. As the silence analysis method in the silence analysis unit 22 and the silence determination method in the silence determination unit 23, the method described in Japanese Patent Application No. Hei 10-235543 can be used, for example.

The low-level sound analysis unit 24 likewise estimates the audio level LSn from the subband data SDSn extracted by the subband data extraction unit 21. If audio at or below a specified, sufficiently low level THLL occupies the whole of the shot Sn, or Z% or more of it, the shot Sn is excluded from the candidates for the summary information SUM and processing moves on to the input of the next shot Sn+1. If, on the other hand, (100−Z)% or more of the audio portion of the shot Sn exceeds the audio level THLL, the shot Sn is treated as a candidate for the summary information SUM and is sent to the high-level sound analysis unit 26.
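The two percentage tests (Y% silent, Z% low-level) can be sketched as one helper applied twice. This is a simplification under stated assumptions: silence is approximated here by a plain level threshold `silence_level`, whereas the text refers to a separate method for silence analysis, and `levels` is assumed to hold one estimated level per audio frame of the shot.

```python
from typing import Sequence


def fraction_at_or_below(levels: Sequence[float], threshold: float) -> float:
    """Fraction of frames whose level is at or below the threshold."""
    if not levels:
        raise ValueError("levels must not be empty")
    return sum(1 for v in levels if v <= threshold) / len(levels)


def passes_level_checks(levels: Sequence[float], silence_level: float,
                        low_level: float, y_percent: float,
                        z_percent: float) -> bool:
    """Return True if the shot stays a summary candidate after both tests."""
    # Silence test: drop the shot if Y% or more of it is (assumed) silent.
    if fraction_at_or_below(levels, silence_level) * 100.0 >= y_percent:
        return False
    # Low-level test: drop the shot if Z% or more is at or below THLL.
    if fraction_at_or_below(levels, low_level) * 100.0 >= z_percent:
        return False
    return True
```

Both tests are cheap because they only need per-frame levels, which fits the stated aim of working in the compressed data domain.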

In the high-level sound analysis unit 26, based on the subband data SDSn for the shot Sn obtained from the subband data extraction unit 21, the unit time density calculation unit 261 of FIG. 4 calculates the unit time density Dsn of audio at or above a certain level THHL in the shot Sn. When the audio information is encoded with MPEG audio, the unit time density Dsn per second can be obtained, for example, as follows.
Dsn = (NAF_THHL / NAF) × AAL_THHL ... (2)

Here, NAF_THHL is the number of audio frames per second whose level is at or above THHL, NAF is the number of audio frames per second, and AAL_THHL is the average level of those NAF_THHL frames.
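Equation (2) can be written out directly. In this sketch the input `frame_levels` is assumed to hold the levels of the audio frames falling within one second, and returning 0.0 when no frame reaches THHL is an assumption, since the text does not define that case.

```python
from typing import Sequence


def unit_time_density(frame_levels: Sequence[float], thhl: float) -> float:
    """Dsn = (NAF_THHL / NAF) * AAL_THHL for the frames of one second."""
    naf = len(frame_levels)                        # NAF: frames per second
    loud = [v for v in frame_levels if v >= thhl]  # frames at or above THHL
    if naf == 0 or not loud:
        return 0.0                                 # assumption: undefined case maps to 0
    aal_thhl = sum(loud) / len(loud)               # AAL_THHL: mean level of loud frames
    return (len(loud) / naf) * aal_thhl
```

The value grows both with the share of loud frames and with how loud they are, so a short burst and a sustained loud passage can be told apart.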

The subband energy sum calculation unit 262 calculates the sum SBE of the subband energies weighted by subband, and on that basis the unit time subband energy calculation unit 263 calculates the subband energy sum SBsn per unit time. The unit time subband energy SBsn per second can be obtained, for example, by equation (3) below.

Here, αk is the weight for subband k, and sbk is the energy of subband k in a given frame.
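Equation (3) itself is not reproduced in this text, so the following is only one plausible reading built from the two symbols that are defined: each frame contributes the weighted sum of αk·sbk over its subbands, and the per-frame values are added up over the frames of one unit time. The function names and the `frames_per_unit` parameter are hypothetical.

```python
from typing import List, Sequence


def weighted_frame_energy(subband_energy: Sequence[float],
                          weights: Sequence[float]) -> float:
    """Sum over k of alpha_k * sb_k for a single frame."""
    if len(subband_energy) != len(weights):
        raise ValueError("one weight per subband is required")
    return sum(w * e for w, e in zip(weights, subband_energy))


def unit_time_energy(frames: Sequence[Sequence[float]],
                     weights: Sequence[float],
                     frames_per_unit: int) -> List[float]:
    """Assumed form of SBsn: per-frame weighted energies summed per unit time."""
    if frames_per_unit <= 0:
        raise ValueError("frames_per_unit must be positive")
    per_frame = [weighted_frame_energy(f, weights) for f in frames]
    return [sum(per_frame[i:i + frames_per_unit])
            for i in range(0, len(per_frame), frames_per_unit)]
```

Choosing larger weights for the bands of interest (for example speech bands) is one way such a weighting could be used; the text does not say which bands are emphasized.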

Next, if the unit time density determination unit 264 finds that the shot Sn has a unit time density Dsn exceeding a threshold THD, processing moves to the unit time subband energy determination unit 265. If a unit time subband energy SBsn exceeding a threshold THSB exists, the unit time subband energy determination unit 265 judges the shot Sn containing that audio to be a candidate for the summary information, the high-level sound analysis routine is exited, and processing moves to the motion activity evaluation unit 3.

Conversely, if the unit time density Dsn of audio at or above the level THHL is less than the threshold THD, or if the unit time subband energy SBsn is less than the threshold THSB, the shot Sn containing that audio is excluded from the summary information candidates and processing moves on to the input of the next shot Sn+1.

In the configuration of FIG. 3, the silence analysis unit 22, the low-level sound analysis unit 24, and the high-level sound analysis unit 26 are not all required; it suffices to provide at least one of them.

In the motion activity evaluation unit 3, for the shot Sn input from the audio level evaluation unit 2, the motion vector extraction unit 31 of FIG. 5 extracts the motion vector information MV of all frames belonging to the shot Sn. The motion activity calculation unit 32 then uses the motion vector information MV to calculate the motion activity MASn of the shot Sn as a whole, and from that the unit time motion activity calculation unit 33 calculates the motion activity MA for a certain unit time (for example, 1 second or 1 frame). For MPEG-encoded video, for example, the processing in the motion activity calculation unit 32 and the unit time motion activity calculation unit 33 can be expressed as follows. Motion activity per second is assumed here.

Here, ASMV is the sum, within a frame, of the absolute values of the motion vectors whose magnitude is X or greater; NMB is the number of macroblocks in the frame that have a motion vector of magnitude X or greater; NPSn is the number of predictive-coded frames in the shot; and NVF is the number of video frames per second.

The motion activity determination unit 34 compares the calculated motion activity MA with a specified threshold THMA. If MA exceeds THMA, the shot Sn is excluded from the candidates for the summary information SUM and processing moves on to the input of the next shot Sn+1. If the motion activity MA of the shot Sn in a given unit time does not exceed the threshold THMA, the shot Sn is treated as a candidate for the summary information SUM.

Through the above processing, in each section SAV obtained by dividing the audio-video information AV at equal intervals, a shot S selected as a candidate for the summary information SUM is input to the summary information registration unit 4, and the information on the shot S is stored in the shot memory 41 of FIG. 6. The judgment unit 42 then judges whether all shots in the section SAV have been processed; if not, the next shot is processed.

When all shots in the section SAV have been processed, the summary information length determination unit 43 judges whether the total shot length SSL of all the shots S in the section SAV is sufficiently close to the average summary information length per section, MSL (= SL/NP). If SSL and MSL are regarded as sufficiently close, summary extraction for the section SAV ends and the time information and the like are registered at the summary information registration 44. Processing then moves on to the input of the next section SAVn+1. At this point, the feature values of the representative shots registered at the representative shot registration 1B are cleared. If SSL and MSL are not regarded as close, the threshold changing unit 47 changes some or all of the thresholds used in judging candidates for the summary information SUM, and the processing from the shot length evaluation unit 16 through the motion activity evaluation unit 3 is repeated recursively on the candidates extracted so far until SSL and MSL are sufficiently close.
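The length-control loop can be sketched as repeated re-filtering of the current candidates with a progressively stricter threshold. This is a simplified illustration under several assumptions: each candidate is reduced to a hypothetical `(length, score)` pair, a single threshold on `score` stands in for "some or all" of the thresholds, the threshold is raised by a fixed `step`, and "sufficiently close" is read as SSL being no more than `tolerance` above MSL. The case where the candidates are already too short is not handled here.

```python
from typing import List, Tuple

Shot = Tuple[float, float]  # (length in seconds, score); score is hypothetical


def fit_section(shots: List[Shot], msl: float, threshold: float, step: float,
                tolerance: float = 0.1,
                max_iter: int = 100) -> Tuple[List[Shot], float]:
    """Tighten the threshold until the kept shots total about MSL seconds."""
    kept = [s for s in shots if s[1] > threshold]
    for _ in range(max_iter):
        ssl = sum(length for length, _ in kept)  # SSL: total candidate length
        if ssl <= msl * (1.0 + tolerance):
            break                                # close enough to MSL
        threshold += step                        # stricter candidate test
        kept = [s for s in kept if s[1] > threshold]  # re-filter current candidates only
    return kept, threshold
```

Because every section is fitted to the same MSL, the summary draws roughly equal time from each part of the content, which is how the uniform coverage described above is obtained.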

This processing is performed for all sections SAV, so that the summary information SUM of the audio-video information AV is finally obtained and registered. If description of the summary information has been requested, processing moves to the summary information description unit 5; otherwise processing ends.

In the summary information description unit 5, as shown in FIG. 7, the summary information description unit 51 describes at least the time information for every shot S extracted as the summary information SUM, and the summary information output unit 52 outputs it as a summary information description file. The description format can be, for example, a format defined in MPEG-7. When the description has been completed for all sections SAV, all processing ends. The shots S extracted as the summary information SUM can also be concatenated and saved as a separate file, either as a single audio-video file or as separate audio and video files.

The functions of the audio-video summary information extraction device of the embodiment described above can be implemented in software (a program). The software can be recorded on a recording medium such as an optical disk, a floppy (registered trademark) disk, or a hard disk.

FIG. 8 shows an example of a program recorded on the recording medium 100. Recorded on the recording medium 100 are a content analysis function 111 for compressed audio-video information, an audio level evaluation function 112, a motion activity evaluation function 113, a summary information registration function 114, and a summary information description function 115. The motion activity evaluation function 113 may be omitted.

The content analysis function 111 can consist of a function for dividing video information into shots, a function for dividing the input content into equally spaced sections according to a given criterion, and at least one of a function for evaluating the length of the shots contained in those sections and a function for determining repeated shots. The audio level evaluation function 112 can consist of a function for determining silence, a function for determining low-level sound, and a function for determining high-level sound.

The motion activity evaluation function 113 can be composed of a function that extracts motion vector data from the frames belonging to a shot, a function that calculates motion activity from the extracted motion vector data, a function that calculates motion activity per unit time, and a function that uses the per-unit-time motion activity to identify summary information candidates.

The extracted summary information can be combined either as audio-video or as separate audio and video, and the combined summary information can be recorded on the recording medium 100 as a separate file.

The recording medium 100 also includes transmission media, such as a network, that record and hold data temporarily.

FIG. 9 is a block diagram showing the configuration of an audio summary information extraction apparatus according to a second embodiment of the present invention.

First, when compressed audio information CA is input, the subband data extraction unit A1 extracts subband data SD. The extracted subband data SD is sent to the high-level sound evaluation unit A3. The subband data extraction unit A1 operates in the same way as the subband data extraction unit 51 in the silence / low-level sound evaluation unit 5 described in the first embodiment. When uncompressed audio information UA is input, on the other hand, the subband analysis unit A2 performs subband analysis on the input audio, and the resulting subband data SD is likewise sent to the high-level sound evaluation unit A3.

In the high-level sound evaluation unit A3, the subband energy sum calculation unit A31 and the unit-time subband energy sum calculation unit A32 of FIG. 10 have the same functions as the subband energy sum calculation unit 262 and the unit-time subband energy sum calculation unit 263 included in the high-level sound analysis unit 26 described in the first embodiment. From the input subband data SD, they calculate the subband energy sum SBE and the subband energy sum per unit time SB, respectively.
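The two calculations performed by units A31 and A32 can be sketched as follows. This is a minimal illustration, not the implementation used in the apparatus: it assumes that the energy of a subband is the sum of its squared samples, that the per-band weights are supplied by the caller, and that a unit time is a fixed number of consecutive frames. The function names and data layout are hypothetical.

```python
from typing import List, Sequence


def subband_energy_sum(frame: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> float:
    """Band-weighted sum of subband energies (SBE) for one audio frame.

    `frame` holds one sequence of samples per subband. Taking the energy
    as the sum of squared samples is an assumption of this sketch.
    """
    if len(frame) != len(weights):
        raise ValueError("one weight per subband is required")
    return sum(w * sum(x * x for x in band) for w, band in zip(weights, frame))


def unit_time_sums(sbe: Sequence[float], frames_per_unit: int) -> List[float]:
    """Subband energy sum per unit time (SB).

    Adds up SBE over consecutive, non-overlapping windows of
    `frames_per_unit` frames; the last window may be shorter.
    """
    if frames_per_unit <= 0:
        raise ValueError("frames_per_unit must be positive")
    return [sum(sbe[i:i + frames_per_unit])
            for i in range(0, len(sbe), frames_per_unit)]
```

Non-overlapping windows keep the sketch short; a sliding window would be an equally plausible reading of "unit time".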

Next, the summary information start time determination unit A33 determines the time position at which the unit-time subband energy sum SB reaches its maximum as the summary information start time T_start. The summary information end time determination unit A34 determines the time position at which the unit-time subband energy sum SB equals α times the maximum value (0 < α < 1) as the summary information end time T_end. Here, T_start < T_end.
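The start and end time selection can be sketched as below. Because sampled values of SB rarely equal α times the maximum exactly, this sketch assumes that T_end is the first position after T_start at which SB has fallen to α times the maximum or below, and that the end of the content is used if no such position exists. Both choices, and the function name, are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def summary_interval(sb: Sequence[float], alpha: float,
                     unit_seconds: float = 1.0) -> Tuple[float, float]:
    """Return (T_start, T_end) in seconds from the unit-time energy sums SB."""
    if not sb:
        raise ValueError("sb must not be empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")

    peak = max(sb)
    start = list(sb).index(peak)      # first position of the maximum
    threshold = alpha * peak

    end = len(sb)                     # fallback: end of the content
    for i in range(start + 1, len(sb)):
        if sb[i] <= threshold:        # SB has dropped to alpha * max
            end = i
            break

    # end > start always holds, so T_start < T_end as the text requires
    return start * unit_seconds, end * unit_seconds
```

For example, with SB = [1, 5, 10, 8, 6, 4, 2] and α = 0.5, the maximum is at position 2 and the first later value at or below 5 is at position 5, so the summary spans positions 2 to 5.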

The summary information registration unit A4 registers the summary information on the basis of the summary information start time T_start and the summary information end time T_end determined by the high-level sound evaluation unit A3. The summary information description unit A5 then describes at least this time information and outputs it as a summary information description file. The description format may be, for example, a format defined by MPEG-7.
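A description file carrying the time information might be produced along the following lines. The element names used here (SummaryDescription, SummarySegment, Start, Duration) are hypothetical placeholders chosen for this sketch and are not taken from the MPEG-7 schema; a real implementation would follow whichever description format is adopted.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def describe_summary(segments: Iterable[Tuple[float, float]]) -> str:
    """Serialize (T_start, T_end) pairs, in seconds, as a small XML document.

    Each segment is written as a start time and a duration. The element
    names are illustrative only and do not follow any standard schema.
    """
    root = ET.Element("SummaryDescription")
    for start, end in segments:
        if not start < end:
            raise ValueError("each segment needs T_start < T_end")
        seg = ET.SubElement(root, "SummarySegment")
        ET.SubElement(seg, "Start").text = f"{start:.3f}"
        ET.SubElement(seg, "Duration").text = f"{end - start:.3f}"
    return ET.tostring(root, encoding="unicode")
```

Writing a start time plus a duration, rather than a start and end time, is one of the two alternatives the claims allow for the time information.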

When there are multiple items of audio information, the above processing is performed on every item.

The functions of the audio summary information extraction apparatus of the embodiment described above can be implemented in software (a program), and the software can be recorded on a recording medium 100 such as an optical disk, a floppy (registered trademark) disk, or a hard disk. The extracted audio summary information can also be recorded on the recording medium 100 as a separate file.

FIG. 11 shows an example of the program recorded on the recording medium 100. The recording medium 100 holds a subband data extraction function 121 and/or a subband analysis function 122, a high-level sound evaluation function 123, a summary information registration function 124, and a summary information description function 125.

FIG. 12 is a block diagram showing an embodiment of the audio-video summary information playback apparatus of the present invention.

When the audio-video summary information SUM extracted by the above means is input, the audio-video separation unit P1 separates it into a video element VSUM and an audio element ASUM. Next, the video speed conversion unit P2 converts the video playback speed by spatially thinning out the video element VSUM according to an externally supplied conversion speed parameter SP. Similarly, the audio speed conversion unit P3 thins out the audio element ASUM in time, according to the conversion speed parameter SP and at the same rate as the video element VSUM, to convert the audio playback speed. Audio playback speed conversion can be achieved by, for example, periodically skipping audio frames, or by combining repeated playback with periodic skipping. To play audio at 1.5 times normal speed, the former method uses
<Frame numbers to play> 1, 2, 4, 5, 7, 8, 10, 11, ...
that is, it plays two consecutive frames and then skips the following frame. The latter method uses
<Frame numbers to play> 1, 1, 4, 4, 7, 7, 10, 10, ...
that is, it plays the same frame twice and then skips the following two frames.
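The two frame schedules above can be generated with a short sketch such as the following. The function names and the parameterization by play, skip, and repeat counts are illustrative assumptions; the text specifies only the two 1.5x patterns.

```python
from typing import List


def skip_schedule(num_frames: int, play: int = 2, skip: int = 1) -> List[int]:
    """Periodic skipping: play `play` consecutive frames, then skip `skip`.

    Frame numbers are 1-based, as in the text.
    """
    period = play + skip
    return [n for n in range(1, num_frames + 1) if (n - 1) % period < play]


def repeat_skip_schedule(num_frames: int, repeat: int = 2,
                         skip: int = 2) -> List[int]:
    """Repeat plus skip: play one frame `repeat` times, then skip `skip` frames."""
    schedule: List[int] = []
    n = 1
    while n <= num_frames:
        schedule.extend([n] * repeat)
        n += 1 + skip                 # advance past the played and skipped frames
    return schedule


print(skip_schedule(11))          # [1, 2, 4, 5, 7, 8, 10, 11]
print(repeat_skip_schedule(11))   # [1, 1, 4, 4, 7, 7, 10, 10]
```

In both schedules, two frame periods of output cover three frames of source material, which gives the 1.5x speed described in the text.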

The speed-converted video element VSUM' and audio element ASUM' are input to the audio-video multiplexing and synchronization unit P4, where they are multiplexed and synchronized to produce speed-converted audio-video summary information SUM'. The resulting audio-video summary information SUM' is then displayed and played back.

FIG. 1 is a block diagram showing the overall configuration of an embodiment of the present invention.
FIG. 2 is a block diagram showing the detailed configuration of the content analysis unit of FIG. 1.
FIG. 3 is a block diagram showing the detailed configuration of the audio level evaluation unit of FIG. 1.
FIG. 4 is a block diagram showing the detailed configuration of the high-level sound analysis unit of FIG. 3.
FIG. 5 is a block diagram showing the detailed configuration of the motion activity evaluation unit of FIG. 1.
FIG. 6 is a block diagram showing the detailed configuration of the summary information registration unit of FIG. 1.
FIG. 7 is a block diagram showing the detailed configuration of the summary information description unit of FIG. 1.
FIG. 8 is a diagram showing an outline of the program recorded on the recording medium.
FIG. 9 is a block diagram showing the configuration of an audio summary information extraction apparatus according to another embodiment of the present invention.
FIG. 10 is a block diagram showing the detailed configuration of the high-level sound evaluation unit of FIG. 8.
FIG. 11 is a diagram showing an outline of the program recorded on the recording medium.
FIG. 12 is a block diagram showing the configuration of an audio summary information playback apparatus according to another embodiment of the present invention.

Explanation of symbols

1: content analysis unit, 2: audio level evaluation unit, 3: motion activity evaluation unit, 4: summary information registration unit, 5: summary information description unit, A1: subband data extraction unit, A2: subband analysis unit, A3: high-level sound evaluation unit, A4: summary information registration unit, A5: summary information description unit, P1: audio-video separation unit, P2: video speed conversion unit, P3: audio speed conversion unit, P4: audio-video multiplexing and synchronization unit, 100: recording medium.

Claims

An audio summary information extraction apparatus comprising:
means for extracting subband data from input compressed audio content;
means for calculating, from the subband data, a sum of subband energies weighted by band;
means for calculating, from the sum of subband energies, a subband energy sum per unit time;
means for determining the time position at which the subband energy sum per unit time is maximum as a summary information start time, and the time position at which the subband energy sum per unit time is α times the maximum value (where 0 < α < 1) as a summary information end time; and
means for extracting, as summary information, the audio information in the section from the summary information start time to the summary information end time.

An audio summary information extraction apparatus comprising:
means for converting input audio content into subband data;
means for calculating, from the subband data, a sum of subband energies weighted by band;
means for calculating, from the sum of subband energies, a subband energy sum per unit time;
means for determining the time position at which the subband energy sum per unit time is maximum as a summary information start time, and the time position at which the subband energy sum per unit time is α times the maximum value (where 0 < α < 1) as a summary information end time; and
means for extracting, as summary information, the audio information in the section from the summary information start time to the summary information end time.

The audio summary information extraction apparatus according to claim 1 or 2,
further comprising means for describing, as time information of the extracted summary information, either the start time and end time of the summary information or its start time and duration.

The audio summary information extraction apparatus according to any one of claims 1 to 3,
further comprising means for outputting and saving the extracted summary information as a separate file.

A computer-readable recording medium on which is recorded a program for extracting audio summary information, the program causing a computer, in order to extract audio summary information, to function as:
at least one of means for extracting subband data from input compressed audio content and means for converting input audio content into subband data;
means for calculating, from the subband data, a sum of subband energies weighted by band;
means for calculating, from the sum of subband energies, a subband energy sum per unit time;
means for determining the time position at which the subband energy sum per unit time is maximum as a summary information start time, and the time position at which the subband energy sum per unit time is α times the maximum value (where 0 < α < 1) as a summary information end time; and
means for extracting, as summary information, the audio information in the section from the summary information start time to the summary information end time.