JP2002199332A

JP2002199332A - Audio summary information, and extraction device therefor, reproduction device and recording medium

Info

Publication number: JP2002199332A
Application number: JP2000396820A
Authority: JP
Inventors: Masaru Sugano; 勝菅野; Yasuyuki Nakajima; 康之中島; Hiromasa Yanagihara; 広昌柳原
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2000-12-27
Filing date: 2000-12-27
Publication date: 2002-07-12
Anticipated expiration: 2020-12-27
Also published as: JP3838483B2

Abstract

PROBLEM TO BE SOLVED: To provide audio summary information, an extraction device therefor, a reproduction device, and a recording medium that extract summary information of a designated length from audio or audio-video information, uniformly across the whole content and at high speed. SOLUTION: A content analysis unit 1 analyzes the temporal structure of the video information of input audio-video content. An audio level evaluation unit 2 evaluates the audio level of the audio information accompanying the video information whose temporal structure has been analyzed. A motion activity evaluation unit 3 evaluates the motion activity of each shot that unit 2 has judged to be a candidate for the summary information. A summary information registration unit 4 registers the audio-video summary information evaluated by the motion activity evaluation unit.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to an extraction device for audio summary information and audio-video summary information, and to a recording medium, which extract summary information (a summary) for efficiently grasping the content of uncompressed or compressed audio information or of compressed audio-video information. It also relates to a reproduction device for audio-video summary information that can provide fast and efficient browsing of audio information or audio-video information by extracting the summary information at high speed in the compressed data domain.

【０００２】[0002]

[Prior Art] Automatic video summarization is reported, for example, by Tanaka, Wakimoto and Kanda in "Development of Automatic Summarization / Browsing Technology of Video Information by Scene Detection", IEICE Technical Report, IE99-20 (1999) (first prior art). In that work, scene change points of the video are detected, the scenes are organized into a hierarchy, and a video summary is created automatically by assigning a priority to each scene. In particular, scenes immediately after a turning point of the topic are given a high priority. Within the hierarchy, higher layers are given higher priority.

[0003] J. Saarela and B. Merialdo, "Using content models to build audio-video summaries", SPIE Conference on Storage and Retrieval for Image and Video Databases VII (1999) (second prior art), treats the creation of general-purpose video summary information (a summary) as a constrained optimization problem. The constraints include the minimum shot length, synchronization of audio and video, continuity of the video, and redundancy between audio and video. A content model of the video (a symbolic description) is built manually, and the summary is then created from it.

[0004] R. Lienhart et al., "Video Abstracting", Communications of the ACM, Vol. 40, No. 12 (1997) (third prior art), aims at creating summary information specialized for movie trailers. The main steps are division of the video into shots, analysis of clips that contain special events, selection of clips, and assembly of the clips. The special events extracted are recognition of actors' faces, identification of dialogue, extraction of text information from titles, and events such as gunfire and explosions.

【０００５】[0005]

[Problems to Be Solved by the Invention] Extraction of audio-video summary information by these prior arts has the following problems. The first prior art provides efficient summarization through hierarchical structuring of the video, but the structuring must be done manually, and for long audio-video content this structuring, a preprocessing step required before summary extraction, may take a great deal of time. In addition, because priorities are assigned according to the layer to which a scene belongs, manual work is in practice often required. Furthermore, since the method targets video alone, it can be extended to audio-video content, but it does not exploit the characteristics of the audio. Audio information that may carry important content therefore goes unused, and appropriate summary information may not be obtained.

[0006] In the second prior art, the content model of the video must be built manually, and no method is adopted that makes effective use of the audio characteristics. Moreover, classifying video segments requires advanced recognition techniques that are difficult to realize on compressed data, so the processing needed for summary extraction is expected to be heavy. The third prior art likewise requires such advanced processing, which reaches into the semantic content of the material.

[0007] Thus, in the prior art, manual processing often intervenes between the input of the audio-video information and the output of the summary information, and because the audio information is not used effectively, creation of summary information based on audio characteristics is not considered. Furthermore, no effective method has so far been studied for automatic summary extraction from compressed or uncompressed audio alone.

[0008] From the structural viewpoint of the summary information, there is also no guarantee that the summary is extracted uniformly from the whole of the audio-video information. In addition, the first and second prior arts do not adequately provide control that brings the summary close to an externally specified summary length.

[0009] The present invention has been made in view of the prior art described above. Its object is to provide audio summary information, an extraction device for audio-video summary information, a reproduction device, and a recording medium which, in a device that extracts summary information (a summary) for efficiently grasping the content of uncompressed or compressed audio information or of compressed audio-video information, analyze the spatio-temporal characteristics of the audio and video in the compressed data domain, evaluate them jointly where necessary, and thereby make it possible to extract summary information of a specified length uniformly from the whole content and at high speed.

【００１０】[0010]

[Means for Solving the Problems] To achieve the above object, a first feature of the present invention is an audio-video summary information extraction device comprising: content analysis means for analyzing the temporal structure of the video information of input audio-video content; audio level evaluation means for evaluating the audio level of the audio information accompanying the video information whose temporal structure has been analyzed; and summary information registration means for registering the audio-video summary information evaluated by the audio level evaluation means.

[0011] A second feature of the present invention is an audio summary information extraction device comprising: means for extracting subband data from input compressed audio content; means for calculating, from the subband data, the sum of subband energies weighted by band; means for evaluating the subband energy sum per unit time; and means for extracting audio summary information.

[0012] A third feature of the present invention is a computer-readable recording medium on which is recorded a program to be executed by a computer, the program including: a function of analyzing the temporal structure of the video information of input audio-video content; a function of evaluating the audio level of the audio information accompanying the video information whose temporal structure has been analyzed; a function of evaluating motion activity within the images; and a summary information registration function of registering the audio-video summary information for which the audio level and the motion activity have been evaluated.

[0013] A fourth feature of the present invention is an audio-video summary information reproduction device comprising means for reproducing, in synchronization, the video elements and audio elements extracted by the audio-video summary information extraction device.

【００１４】[0014]

[Embodiments of the Invention] The present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an audio-video summary information extraction device according to a first embodiment of the present invention.

[0015] First, compressed audio-video information AV is input to a content analysis unit 1, which analyzes the temporal structure of the audio-video information AV. The content analysis unit 1 is configured as shown in FIG. 2, and the compressed audio-video information AV is first input to a shot division unit 11. The shot division unit 11 divides the video information into shots S, counts the shots over the entire content, and holds the total number of shots as NS. Next, a division number determination unit 12 determines the division number NP by equation (1) below, using the total number of shots NS supplied by the shot division unit 11, the content length CL obtained from the audio-video information AV, and an externally specified summary length SL.

[0016] NP = NS × SL / CL ... (1)

where SL > CL / NS. For example, when content with a length CL of one hour is to be condensed into summary information with a length SL of five minutes and the total number of shots NS is 240, the division number NP is 20. Content division 13 then divides the content into NP equal parts, and each equally divided section SAV is mapped to the shots S that belong to it.
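Equation (1) and the equal division can be illustrated with a minimal Python sketch. The function names, the rounding of NP to the nearest integer, and the rule of assigning a shot to a section by its start time are all assumptions made for illustration; the text does not specify them.

```python
def division_count(num_shots: int, summary_len_s: float, content_len_s: float) -> int:
    """Equation (1): NP = NS * SL / CL, valid only when SL > CL / NS."""
    if num_shots <= 0 or content_len_s <= 0:
        raise ValueError("NS and CL must be positive")
    if summary_len_s <= content_len_s / num_shots:
        raise ValueError("SL must exceed CL / NS")
    # Rounding to the nearest integer is an assumption of this sketch.
    return max(1, round(num_shots * summary_len_s / content_len_s))


def map_shots_to_sections(shot_starts_s, content_len_s, num_sections):
    """Assign each shot index to one of NP equal sections by start time (assumed rule)."""
    section_len = content_len_s / num_sections
    sections = [[] for _ in range(num_sections)]
    for idx, start in enumerate(shot_starts_s):
        # Clamp so a shot starting exactly at the end falls in the last section.
        sec = min(int(start // section_len), num_sections - 1)
        sections[sec].append(idx)
    return sections
```

With the figures from the example in the text (NS = 240, SL = 300 s, CL = 3600 s), `division_count` gives 20 sections of 180 s each.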

[0017] The following processing is performed for each section SAV supplied by divided section input 14. First, shot input 15 supplies the first shot Sn. In a shot length evaluation unit 16, if the shot length SHLn of the input shot Sn is no longer than a predetermined length T, the shot is excluded from the candidates for the summary information SUM and processing moves on to input of the next shot Sn+1. If the shot length SHLn is longer than T, the shot Sn is treated as a candidate for the summary information SUM and is sent to a representative frame extraction unit 17.

[0018] The representative frame extraction unit 17 extracts, as the representative frame RFSn of shot Sn, either the first frame SFSn of the shot or a key frame KFSn. The key frame KFSn can be extracted, for example, by the method described in Japanese Patent Application No. 2000-065259. A representative frame feature extraction unit 18 then extracts the feature value CHFSn of the frame RFSn extracted by the representative frame extraction unit 17, and a shot feature extraction unit 19 extracts the feature value CHSn of the shot Sn.

[0019] Either or both of the feature value CHFSn of the representative frame RFSn of shot Sn (the first frame SFSn or the key frame KFSn) and the feature value CHSn of shot Sn are sent to a representative shot evaluation unit 1A. As the feature value CHFSn of the first frame SFSn or the key frame KFSn, descriptors defined in MPEG-7 (Moving Picture Experts Group phase 7), for example, can be used.

[0020] The representative shot evaluation unit 1A judges the similarity between the feature values CHF of the representative frames RF of all shots already registered as representative shots RS in the current section SAV and the feature value CHFSn of the representative frame RFSn of shot Sn sent from the representative frame feature extraction unit 18. If the similarity between a feature value CHF and the feature value CHFSn of the input shot Sn is judged to be high, the shot is regarded as a repeated shot, Sn is excluded from the candidates for the summary information SUM, and processing moves on to input of the next shot Sn+1. If the similarity is judged to be low, the shot is regarded as an independent shot, Sn is treated as a candidate for the summary information SUM, the shot is registered by representative shot registration 1B, and it is then sent to the audio level evaluation unit 2. The feature values of a shot Sn registered as a representative shot are held separately and used in evaluating subsequent representative shots.
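The repeated-shot test in the representative shot evaluation unit 1A can be sketched as follows. The text leaves the feature and the similarity measure open (it only mentions MPEG-7 descriptors as one option), so the normalized histograms, the histogram-intersection similarity, and the threshold value used here are illustrative assumptions.

```python
def histogram_intersection(a, b):
    """Similarity in [0, 1] between two normalized histograms (assumed feature form)."""
    return sum(min(x, y) for x, y in zip(a, b))


def select_representative_shots(features, sim_threshold=0.8):
    """Return indices of shots kept as independent within one section.

    A shot is dropped as a repeated shot when it is similar to any shot
    already registered; kept shots serve as references for later shots.
    """
    kept = []
    for idx, feat in enumerate(features):
        is_repeat = any(
            histogram_intersection(feat, features[k]) >= sim_threshold for k in kept
        )
        if not is_repeat:
            kept.append(idx)
    return kept
```

Because the registered features are cleared when processing moves to the next section, this function would be called once per section SAV.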

[0021] In judging shot similarity, the shot feature values CHS of the shots already registered as representative shots RS and the feature value CHSn of shot Sn sent from the shot feature extraction unit 19 may likewise be used instead of, or together with, the representative frame feature values.

[0022] In the audio level evaluation unit 2, for the shot Sn input from the content analysis unit 1, a subband data extraction unit 21 in FIG. 3 extracts the subband data SDSn of the audio portion of shot Sn. A silence analysis unit 22 then calculates a feature value for silence, and if a silence determination unit 23 judges that the whole of shot Sn, or Y% or more of it, is silent, shot Sn is excluded from the candidates for the summary information SUM and processing moves on to input of the next shot Sn+1. If the silence determination unit 23 instead judges that (100 - Y)% or more of the audio portion of shot Sn contains sound, shot Sn is sent to a low-level sound analysis unit 24. The silence analysis in unit 22 and the silence determination in unit 23 can use, for example, the method described in Japanese Patent Application No. Hei 10-235543.

[0023] The low-level sound analysis unit 24 likewise estimates the audio level LSn from the subband data SDSn extracted by the subband data extraction unit 21. If audio at or below a specified, sufficiently low level THLL occupies the whole of shot Sn, or Z% or more of it, shot Sn is excluded from the candidates for the summary information SUM and processing moves on to input of the next shot Sn+1. If audio exceeding the level THLL occupies (100 - Z)% or more of shot Sn, the shot is treated as a candidate for the summary information SUM and is sent to a high-level sound analysis unit 26.
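The low-level sound test is a ratio test over the shot, and the silence test of paragraph [0022] has the same shape with Y in place of Z. A minimal sketch follows, assuming that one level estimate per audio frame is already available; how the level LSn is estimated from subband data is not modeled here, and the function names are hypothetical.

```python
def fraction_at_or_below(levels, threshold):
    """Fraction of frames whose level is at or below the threshold."""
    if not levels:
        raise ValueError("shot has no audio frames")
    return sum(1 for v in levels if v <= threshold) / len(levels)


def passes_low_level_gate(frame_levels, thll, z_percent):
    """Return False when audio at or below THLL fills Z% or more of the shot."""
    return fraction_at_or_below(frame_levels, thll) * 100.0 < z_percent
```

A shot for which the function returns False would be dropped, and processing would move to the next shot.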

[0024] In the high-level sound analysis unit 26, based on the subband data SDSn of shot Sn obtained from the subband data extraction unit 21, a unit time density calculation unit 261 in FIG. 4 calculates the unit time density Dsn of audio at or above a certain level THHL in shot Sn. When the audio information is encoded as MPEG audio, the unit time density Dsn per second can be obtained, for example, as follows.

[0025] Dsn = (NAF_THHL / NAF) × AAL_THHL ... (2)

Here, NAF_THHL is the number of frames per second whose level is THHL or higher, NAF is the number of audio frames per second, and AAL_THHL is the average level of the NAF_THHL frames.
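Equation (2) can be written directly as a small Python function. The input is assumed to be the list of per-frame audio levels for one second of audio; returning 0.0 when no frame reaches THHL is a choice made for this sketch, since the average level is undefined in that case.

```python
def unit_time_density(frame_levels, thhl):
    """Equation (2) for one second of audio: (NAF_THHL / NAF) * AAL_THHL."""
    naf = len(frame_levels)
    if naf == 0:
        raise ValueError("no audio frames in this second")
    loud = [v for v in frame_levels if v >= thhl]
    if not loud:
        return 0.0  # assumed convention: no frame reaches THHL
    aal_thhl = sum(loud) / len(loud)  # average level of the loud frames
    return (len(loud) / naf) * aal_thhl
```

For example, with levels `[10, 20, 30, 40]` and THHL = 30, two of the four frames qualify with an average level of 35, so the density is 0.5 × 35 = 17.5.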

[0026] In addition, a subband energy sum calculation unit 262 calculates the sum SBE of the subband energies weighted by subband, and on that basis a unit time subband energy calculation unit 263 calculates the subband energy sum SBsn per unit time. The unit time subband energy SBsn per second can be obtained, for example, by equation (3).

[Equation (3) is not reproduced in this text.]

Here, αk is the weight for subband k, and sbk is the energy of subband k in a given frame.
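Since equation (3) itself does not survive in this text, the following Python sketch only illustrates the quantities that the paragraph defines. The per-frame weighted sum follows from the definitions of αk and sbk; accumulating those sums over the frames of one second is an assumption about the form of SBsn, not the patent's stated formula.

```python
def weighted_subband_energy(frame_energies, weights):
    """SBE for one frame: the sum over k of alpha_k * sb_k."""
    if len(frame_energies) != len(weights):
        raise ValueError("one weight per subband is required")
    return sum(a * e for a, e in zip(weights, frame_energies))


def unit_time_subband_energy(frames, weights):
    """Assumed form of SBsn: per-frame weighted sums accumulated over one second."""
    return sum(weighted_subband_energy(f, weights) for f in frames)
```

The weights let some frequency bands count more than others when judging how energetic a second of audio is.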

[0027] Next, if a unit time density determination unit 264 finds that shot Sn has a unit time density Dsn exceeding a threshold THD, processing moves to a unit time subband energy determination unit 265. If the unit time subband energy determination unit 265 finds a unit time subband energy SBsn exceeding a threshold THSB, the shot Sn containing that audio is judged to be a candidate for the summary information, the high-level sound analysis routine is exited, and processing moves to the motion activity evaluation unit 3.

[0028] Conversely, if the unit time density Dsn of audio at or above the level THHL is below the threshold THD, or if the unit time subband energy SBsn is below the threshold THSB, the shot Sn containing that audio is excluded from the candidates for the summary information and processing moves on to input of the next shot Sn+1.
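The two-stage decision of paragraphs [0027] and [0028] can be sketched as below. The inputs are assumed to be lists holding one density value and one subband energy value per second of the shot; reading "has a value exceeding the threshold" as "at least one second exceeds it" is an interpretation made for this sketch.

```python
def is_high_level_candidate(densities, subband_energies, thd, thsb):
    """Keep the shot only if some second exceeds THD in density and
    some second exceeds THSB in weighted subband energy."""
    if not any(d > thd for d in densities):
        return False  # density gate failed: shot is excluded
    return any(e > thsb for e in subband_energies)
```

The density test runs first, so the energy test is evaluated only for shots that already contain a sufficiently dense stretch of loud audio.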

[0029] In the configuration of FIG. 3, the silence analysis unit 22, the low-level sound analysis unit 24, and the high-level sound analysis unit 26 are not all required; it is sufficient that at least one of them is provided.

[0030] In the motion activity evaluation unit 3, for the shot Sn input from the audio level evaluation unit 2, a motion vector extraction unit 31 in FIG. 5 extracts the motion vector information MV of all frames belonging to shot Sn. A motion activity calculation unit 32 then uses the motion vector information MV to calculate the motion activity MASn of shot Sn as a whole, and a unit time motion activity calculation unit 33 uses it to calculate the motion activity MA for a certain unit of time (for example, one second or one frame). For MPEG-encoded video, the processing in the motion activity calculation unit 32 and the unit time motion activity calculation unit 33 can be expressed, for example, as follows, assuming motion activity per second.

[The equations for MASn and MA are not reproduced in this text.]

Here, ASMV is the sum within a frame of the absolute values of the motion vectors whose magnitude is X or greater, NMB is the number of macroblocks in the frame that have a motion vector of magnitude X or greater, NPSn is the number of predictively coded frames in the shot, and NVF is the number of video frames per second.

[0031] A motion activity determination unit 34 compares the motion activity MA thus obtained with a specified threshold THMA. If MA exceeds THMA, shot Sn is excluded from the candidates for the summary information SUM and processing moves on to input of the next shot Sn+1. If the motion activity MA of shot Sn for the given unit of time does not exceed the threshold THMA, shot Sn is treated as a candidate for the summary information SUM.
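The equations for motion activity are not available in this text, so the sketch below is only one plausible reading of the quantities defined in paragraph [0030]. It takes ASMV / NMB as the mean magnitude of the large motion vectors in a frame, averages that over the predictively coded frames of the shot, and scales by NVF to obtain a per-second figure. This normalization and the function names are assumptions; only the final threshold comparison comes directly from paragraph [0031].

```python
def motion_activity_per_second(frames, min_magnitude, nvf):
    """frames: one list of motion-vector magnitudes per predictively coded frame.

    Assumed formula: the mean over frames of (ASMV / NMB), multiplied by NVF.
    """
    if not frames:
        raise ValueError("shot has no predictively coded frames")
    total = 0.0
    for magnitudes in frames:
        large = [m for m in magnitudes if m >= min_magnitude]
        if large:
            total += sum(large) / len(large)  # ASMV / NMB for this frame
    return total / len(frames) * nvf


def passes_motion_gate(ma, thma):
    """A shot stays a candidate only when MA does not exceed THMA."""
    return ma <= thma
```

Note the direction of the test: shots with motion activity above the threshold are the ones excluded.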

[0032] Through the above processing, each shot S selected as a candidate for the summary information SUM within a section SAV, obtained by dividing the audio-video information AV at equal intervals, is input to the summary information registration unit 4, and the information on shot S is stored in a shot memory 41 in FIG. 6. A determination unit 42 then judges whether all shots in the section SAV have been processed; if not, the next shot is processed.

[0033] When all shots in the section SAV have been processed, a summary length determination unit 43 judges whether the total SSL of the shot lengths of all shots S in the section SAV is sufficiently close to the average summary length per section, MSL (= SL / NP). If SSL and MSL are regarded as sufficiently close, summary extraction for the section SAV ends and summary information registration 44 registers the time information and related data. Processing then moves on to input of the next section SAVn+1, at which point the feature values of the representative shots registered by representative shot registration 1B are cleared. If SSL and MSL are not regarded as close, a threshold changing unit 47 changes some or all of the thresholds used to judge candidates for the summary information SUM, and the processing from the shot length evaluation unit 16 through the motion activity evaluation unit 3 is repeated recursively on the candidates extracted so far until SSL and MSL are sufficiently close.
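The length-control loop can be sketched as follows. The patent adjusts some or all of several thresholds; for illustration this sketch collapses them into a single hypothetical score per shot and one threshold that is raised in fixed steps. The step size, the iteration limit, and the stopping rule once the total falls below MSL are all assumptions.

```python
def fit_section_summary(candidates, msl, tolerance, threshold, step=0.1, max_iter=50):
    """candidates: list of (shot_length_s, score) pairs for one section SAV.

    Re-filter the surviving candidates with a stricter threshold until the
    total shot length SSL is within `tolerance` of the target MSL.
    """
    selected = list(candidates)
    for _ in range(max_iter):
        selected = [c for c in selected if c[1] >= threshold]
        ssl = sum(length for length, _ in selected)
        # Stop when close enough, or when further tightening cannot help.
        if abs(ssl - msl) <= tolerance or ssl < msl:
            break
        threshold += step
    return selected
```

With the figures used earlier in the description (SL = 300 s, NP = 20), the per-section target MSL would be 15 s.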

[0034] These steps are carried out for every section SAV, finally yielding the summary information SUM of the audio-video information AV, which is then registered. If description of the summary information has been requested, processing moves to a summary information description unit 5; otherwise processing ends.

[0035] In the summary information description unit 5, as shown in FIG. 7, a summary information description unit 51 describes at least the time information of every shot S extracted as the summary information SUM, and a summary information output unit 52 outputs it as a summary information description file. The format defined in MPEG-7, for example, can be used as the description format. When the description has been completed for all sections SAV, all processing ends. In addition, all shots S extracted as the summary information SUM can be concatenated and saved as a separate file, either as a single audio-video file or as separate audio and video files.

[0036] The functions of the audio-video summary information extraction device of the above embodiment can be implemented in software (a program). The software can be recorded on a recording medium such as an optical disc, a floppy (registered trademark) disk, or a hard disk.

[0037] FIG. 8 shows an example of a program recorded on the recording medium 100. Recorded on the recording medium 100 are a content analysis function 111 for compressed audio-video information, an audio level evaluation function 112, a motion activity evaluation function 113, a summary information registration function 114, and a summary information description function 115. The motion activity evaluation function 113 may be omitted.

[0038] The content analysis function 111 can comprise a function of dividing video information into shots, a function of dividing the input content into equally spaced sections according to a given criterion, and at least one of a function of evaluating the length of the shots contained in each equally spaced section and a function of determining repetitive shots. The audio level evaluation function 112 can comprise a function of determining silence, a function of determining low-level sound, and a function of determining high-level sound.

[0039] The motion activity evaluation function 113 can comprise a function of extracting motion vector data of the frames belonging to a shot, a function of calculating motion activity from the extracted motion vector data, a function of calculating motion activity per unit time, and a function of determining candidates for the summary information using the motion activity per unit time.

[0040] The extracted summary information can be combined either as audio-video or as separate audio and video, and the combined summary information can be recorded on the recording medium 100 as a separate file.

[0041] The recording medium 100 also includes transmission media, such as a network, that temporarily record and hold data.

[0042] FIG. 9 is a block diagram showing the configuration of an audio summary information extraction device according to a second embodiment of the present invention.

[0043] First, when compressed audio information CA is input, a sub-band data extraction unit A1 extracts sub-band data SD. The extracted sub-band data SD is sent to a high-level sound evaluation unit A3. The sub-band data extraction unit A1 operates in the same way as the sub-band data extraction unit 51 in the silence/low-level sound evaluation unit 5 described in the first embodiment. When uncompressed audio information UA is input, on the other hand, a sub-band analysis unit A2 performs sub-band analysis on the input audio, and the resulting sub-band data SD is likewise sent to the high-level sound evaluation unit A3.

[0044] In the high-level sound evaluation unit A3, a sub-band energy sum calculation unit A31 and a unit-time sub-band energy sum calculation unit A32, shown in FIG. 10, have the same functions as the sub-band energy sum calculation unit 262 and the unit-time sub-band energy sum calculation unit 263 included in the high-level sound analysis unit 26 of the first embodiment. From the input sub-band data SD, they calculate the sub-band energy sum SBE and the sub-band energy sum per unit time SB, respectively.
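The two sums can be sketched as follows. This is a minimal illustration under stated assumptions, not the device's actual computation: each frame is assumed to carry one amplitude value per sub-band, energy is taken as the squared amplitude, and the per-band weights as well as the names `subband_energy_sum` and `unit_time_energy_sums` are hypothetical.

```python
from typing import List, Sequence


def subband_energy_sum(frame: Sequence[float], weights: Sequence[float]) -> float:
    """SBE for one frame: band-weighted sum of squared sub-band values."""
    if len(frame) != len(weights):
        raise ValueError("one weight is required per sub-band")
    return sum(w * s * s for w, s in zip(weights, frame))


def unit_time_energy_sums(
    frames: Sequence[Sequence[float]],
    weights: Sequence[float],
    frames_per_unit: int,
) -> List[float]:
    """SB: the SBE values accumulated over consecutive units of time."""
    if frames_per_unit <= 0:
        raise ValueError("frames_per_unit must be positive")
    sbe = [subband_energy_sum(frame, weights) for frame in frames]
    # Group the per-frame sums into blocks of frames_per_unit frames.
    return [
        sum(sbe[i:i + frames_per_unit])
        for i in range(0, len(sbe), frames_per_unit)
    ]


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0], [2.0, 0.0]]
    print(unit_time_energy_sums(frames, weights=[1.0, 0.5], frames_per_unit=2))
```

In this sketch the last unit may cover fewer frames than the others when the frame count is not a multiple of `frames_per_unit`.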

[0045] Next, a summary information start time determination unit A33 determines the time position at which the unit-time sub-band energy sum SB is at its maximum as the summary information start time T_start. A summary information end time determination unit A34 determines the time position at which the unit-time sub-band energy sum SB becomes α times the maximum value (0 < α < 1) as the summary information end time T_end. Here, T_start < T_end.
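The start and end time decision can be sketched as below. This assumes the end position is searched for after the peak, so that the summary runs from T_start to T_end, and it treats "becomes α times the maximum" as the first unit at which SB has fallen to α times the maximum or lower, since an exact match is unlikely in sampled data. The function name `summary_interval`, the `unit_seconds` parameter, and the fallback to the last unit are assumptions made for this example.

```python
from typing import Sequence, Tuple


def summary_interval(
    sb: Sequence[float], alpha: float, unit_seconds: float = 1.0
) -> Tuple[float, float]:
    """Return (T_start, T_end) in seconds from unit-time energy sums SB."""
    if not sb:
        raise ValueError("sb must not be empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")

    # T_start: the unit at which SB is at its maximum (first one on ties).
    peak = max(range(len(sb)), key=sb.__getitem__)
    threshold = alpha * sb[peak]

    # T_end: the first later unit at which SB has dropped to the threshold.
    end = len(sb) - 1  # assumed fallback when SB never drops that far
    for i in range(peak + 1, len(sb)):
        if sb[i] <= threshold:
            end = i
            break
    return peak * unit_seconds, end * unit_seconds


if __name__ == "__main__":
    print(summary_interval([1.0, 8.0, 6.0, 3.0, 5.0], alpha=0.5))
```

A smaller α yields a longer summary, because SB has to decay further before the end position is reached.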

[0046] A summary information registration unit A4 registers the summary information based on the summary information start time T_start and the summary information end time T_end determined by the high-level sound evaluation unit A3. A summary information description unit A5 then describes at least this time information and outputs it as a summary information description file. The description format may be, for example, a format defined by MPEG-7.

[0047] When there are multiple pieces of audio information, the above processing is performed on all of them.

[0048] The functions of the audio summary information extraction device of the embodiment described above can be implemented in software (a program), and the software can be recorded on a recording medium 100 such as an optical disk, a floppy disk, or a hard disk. The extracted audio summary information can also be recorded on the recording medium 100 as an individual file.

[0049] FIG. 11 shows an example of the program recorded on the recording medium 100. The recording medium 100 stores a sub-band data extraction function 121 and/or a sub-band analysis function 122, a high-level sound evaluation function 123, a summary information registration function 124, and a summary information description function 125.

[0050] FIG. 12 is a block diagram showing an embodiment of the audio-video summary information playback device of the present invention.

[0051] When the audio-video summary information SUM extracted by the above means is input, an audio-video separation unit P1 separates it into a video element VSUM and an audio element ASUM. Next, a video speed conversion unit P2 thins out the video element VSUM temporally according to an externally supplied conversion speed parameter SP, thereby converting the video playback speed. Similarly, an audio speed conversion unit P3 thins out the audio element ASUM temporally, at the same rate as the video element VSUM, according to the conversion speed parameter SP, thereby converting the audio playback speed. Audio playback speed conversion can be realized by, for example, periodically skipping audio frames, or by combining repeated playback with periodic skipping. To play audio at 1.5 times normal speed, the former method plays two consecutive frames and then skips the next frame:

<Frame numbers played> 1, 2, 4, 5, 7, 8, 10, 11, ...

The latter method plays the same frame twice and then skips the next two frames:

<Frame numbers played> 1, 1, 4, 4, 7, 7, 10, 10, ...
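The two frame schedules for 1.5 times speed can be reproduced with the short sketch below. The generator names and their parameters are hypothetical and introduced only for illustration; frame numbers start at 1, as in the example above.

```python
from itertools import islice
from typing import Iterator


def skip_schedule(play: int = 2, skip: int = 1) -> Iterator[int]:
    """Periodic skipping: play `play` consecutive frames, then skip `skip`."""
    frame = 1
    while True:
        for offset in range(play):
            yield frame + offset
        frame += play + skip


def repeat_skip_schedule(repeat: int = 2, skip: int = 2) -> Iterator[int]:
    """Repetition with skipping: play one frame `repeat` times, then skip `skip`."""
    frame = 1
    while True:
        for _ in range(repeat):
            yield frame
        frame += 1 + skip


if __name__ == "__main__":
    print(list(islice(skip_schedule(), 8)))         # periodic skipping
    print(list(islice(repeat_skip_schedule(), 8)))  # repetition plus skipping
```

With the default parameters, both schedules advance through three source frames in the time of two played frames, which gives the 1.5 times speed-up.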

[0052] The speed-converted video element VSUM´ and audio element ASUM´ are input to an audio-video multiplexing/synchronization unit P4, which multiplexes and synchronizes them to produce speed-converted audio-video summary information SUM´. The resulting audio-video summary information SUM´ is then displayed and played back.

[0053]

[Effects of the Invention] As is clear from the above description, the present invention makes it possible to extract, from uncompressed or compressed audio information or from compressed audio-video information, summary information for grasping their contents quickly and efficiently. This extraction also enables high-speed browsing of audio-video information.

[0054] Furthermore, the length of the extracted summary information can be specified arbitrarily, and because the summary information is extracted uniformly from the audio-video information, the content as a whole can be grasped efficiently.

[0055] Furthermore, by describing the time information and other data contained in the summary information, a feature description can be produced as the summary information of the corresponding audio-video information, and the invention can also be applied to content description standards such as MPEG-7.

[0056] Furthermore, advanced playback of the summary information, such as converting the display speed of the extracted summary information, can be provided.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the overall configuration of an embodiment of the present invention.

FIG. 2 is a block diagram showing the detailed configuration of the content analysis unit in FIG. 1.

FIG. 3 is a block diagram showing the detailed configuration of the audio level evaluation unit in FIG. 1.

FIG. 4 is a block diagram showing the detailed configuration of the high-level sound analysis unit in FIG. 3.

FIG. 5 is a block diagram showing the detailed configuration of the motion activity evaluation unit in FIG. 1.

FIG. 6 is a block diagram showing the detailed configuration of the summary information registration unit in FIG. 1.

FIG. 7 is a block diagram showing the detailed configuration of the summary information description unit in FIG. 1.

FIG. 8 is a diagram showing an outline of the program recorded on the recording medium.

FIG. 9 is a block diagram showing the configuration of an audio summary information extraction device according to another embodiment of the present invention.

FIG. 10 is a block diagram showing the detailed configuration of the high-level sound evaluation unit in FIG. 8.

FIG. 11 is a diagram showing an outline of the program recorded on the recording medium.

FIG. 12 is a block diagram showing the configuration of an audio summary information playback device according to another embodiment of the present invention.

[Explanation of symbols]

1: content analysis unit; 2: audio level evaluation unit; 3: motion activity evaluation unit; 4: summary information registration unit; 5: summary information description unit; A1: sub-band data extraction unit; A2: sub-band analysis unit; A3: high-level sound evaluation unit; A4: summary information registration unit; A5: summary information description unit; P1: audio-video separation unit; P2: video speed conversion unit; P3: audio speed conversion unit; P4: audio-video multiplexing/synchronization unit; 100: recording medium.

Continuation of front page

(72) Inventor: Hiromasa Yanagihara, c/o KDD R&D Laboratories, Inc., 2-1-15 Ohara, Kamifukuoka-shi, Saitama

F-terms (reference): 5C053 FA14 GB19 HA30 JA01; 5D044 AB02 BC01 BC03 CC04 DE91 GK11 HL02; 5L096 BA20 GA08 GA59 HA04 HA08 JA11

Claims

[Claims]

1. An audio-video summary information extraction device comprising: content analysis means for analyzing the temporal structure of the video information of input audio-video content; audio level evaluation means for evaluating the audio level of the audio information accompanying the video information whose temporal structure has been analyzed; and means for registering the audio-video summary information evaluated by the audio level evaluation means.

2. The audio-video summary information extraction device according to claim 1, wherein the content analysis means comprises means for dividing the video information into shots, means for dividing the input content into equally spaced sections according to a given criterion, and at least one of means for evaluating the length of a shot included in each equally spaced section and means for determining a repetitive shot, and wherein a shot is excluded from the summary information when its length is smaller than a predetermined value or when it is determined to be a repetitive shot.

3. The audio-video summary information extraction device according to claim 2, wherein the means for determining a repetitive shot comprises means for determining a representative frame of each divided shot, means for extracting a feature value of the representative frame, and means for performing a similar image search by comparing the feature value with the feature values of other representative frames, and wherein a shot whose representative frame is determined to be a similar image by the similar image search is excluded as a repetitive shot.

4. The audio-video summary information extraction device according to claim 2, wherein the means for determining a repetitive shot comprises means for extracting a feature value of each divided shot and means for performing a similar shot search by comparing the feature value with the feature values of other shots, and wherein a shot found to be similar by the similar shot search is determined to be a repetitive shot.

5. The audio-video summary information extraction device according to claim 2, wherein the audio level evaluation means comprises at least one of means for determining silence and means for determining low-level sound, and wherein a shot is excluded from the summary information when its sound is evaluated as low-level sound.

6. The audio-video summary information extraction device according to claim 5, further comprising means for determining high-level sound, wherein the means for determining high-level sound comprises means for extracting sub-band data from compressed audio information, means for estimating an audio level from the sub-band data, and means for calculating a temporal density of the audio level from audio levels exceeding a given threshold, and wherein candidates for the summary information are determined using the temporal density of the audio level.

7. The audio-video summary information extraction device according to claim 5, further comprising means for determining high-level sound, wherein the means for determining high-level sound comprises means for extracting sub-band data from compressed audio information, means for calculating a band-weighted sum of sub-band energies from the sub-band data, and means for calculating the sum of the sub-band energies per unit time, and wherein candidates for the summary information are determined using at least one of the sub-band energy sum and the sub-band energy sum per unit time.

8. The audio-video summary information extraction device according to claim 1, further comprising motion activity evaluation means for evaluating motion in the images, wherein the audio-video summary information evaluated by the audio level evaluation means and the motion activity evaluation means is registered.

9. The audio-video summary information extraction device according to claim 8, wherein the motion activity evaluation means comprises means for extracting motion vector data of the frames belonging to the shot, means for calculating motion activity from the extracted motion vector data, and means for calculating motion activity per unit time, and wherein candidates for the summary information are determined using the motion activity per unit time.

10. The audio-video summary information extraction device according to claim 1, wherein the determination process is repeated, while recursively controlling the threshold used as the criterion for determining the summary information, until the summary information reaches the externally specified length.

11. The audio-video summary information extraction device according to claim 1, further comprising means for describing, as the time information of the extracted summary information, at least the start time and end time, or the start time and duration, of the summary information.

12. The audio-video summary information extraction device according to claim 1, further comprising means for combining the extracted summary information either as audio-video or as separate audio and video, and for outputting and saving the combined summary information as a separate file.

13. The audio-video summary information extraction device according to claim 1, further comprising means for extracting, as a video element of the extracted summary information, the first frame or a characteristic frame of a shot as a representative frame, and means for extracting, as an audio element of the summary information, the audio information accompanying the shot.

14. The audio-video summary information extraction device according to claim 13, further comprising means for describing at least the time position within the content as the time information of the extracted video element, and the start time and end time, or the start time and duration, within the content as the time information of the extracted audio element.

15. An audio summary information extraction device comprising: means for extracting sub-band data from input compressed audio content; means for calculating a band-weighted sum of sub-band energies from the sub-band data; means for evaluating the sum of the sub-band energies per unit time; and means for extracting audio summary information.

16. An audio summary information extraction device comprising: means for extracting sub-band data from input compressed audio content; means for calculating a band-weighted sum of sub-band energies from the sub-band data; means for calculating the sum of the sub-band energies per unit time; means for determining the start time of the summary information; means for determining the end time of the summary information; and means for extracting, as the summary information, the audio information of the section between the start time and the end time.

17. An audio summary information extraction device comprising: means for converting input audio content into sub-band data; means for calculating a band-weighted sum of sub-band energies from the sub-band data; means for evaluating the sum of the sub-band energies; and means for extracting audio summary information.

18. An audio summary information extraction device comprising: means for converting input audio content into sub-band data; means for calculating a band-weighted sum of sub-band energies from the sub-band data; means for calculating the sum of the sub-band energies per unit time; means for determining the start time of the summary information; means for determining the end time of the summary information; and means for extracting, as the summary information, the audio information of the section between the start time and the end time.

19. The audio summary information extraction device according to claim 15, further comprising means for describing, as the time information of the extracted summary information, at least the start time and end time, or the start time and duration, of the summary information.

20. The audio summary information extraction device according to claim 15, further comprising means for outputting and saving the extracted summary information as an individual file.

21. An audio-video summary information playback device comprising means for synchronizing and playing back the video element and the audio element extracted by the audio-video summary information extraction device according to claim 13.

22. An audio-video summary information playback device comprising: means for converting the audio playback speed by temporally thinning out the audio element extracted by the audio-video summary information extraction device according to claim 13; and means for synchronizing and playing back the speed-converted audio element and the video element.

23. An audio-video summary information playback device comprising: means for converting the video playback speed by temporally thinning out the video information of a shot extracted as the summary information; means for converting the audio playback speed by temporally thinning out the audio information of the shot extracted as the summary information at the same rate as the video information; and means for synchronizing and playing back the video information and audio information whose playback speeds have been converted.

24. The audio-video summary information playback device according to claim 22, wherein the audio playback speed conversion is achieved by periodically skipping audio frames.

25. The audio-video summary information playback device according to claim 22, wherein the audio playback speed conversion is achieved by combining repeated playback of audio frames with periodic skipping.

26. A computer-readable recording medium on which is recorded a program to be executed by a computer, the program comprising: a function of analyzing the temporal structure of the video information of input audio-video content; a function of evaluating the audio level of the audio information accompanying the video information whose temporal structure has been analyzed; and a summary information registration function of registering audio-video summary information evaluated for the audio level and motion activity.

27. The recording medium according to claim 26, further comprising a function of evaluating motion activity in the images, wherein the summary information registration function registers the audio-video summary information evaluated for the audio level and the motion activity.

28. At least one of a sub-band data extraction function of extracting sub-band data from input compressed audio content and a sub-band analysis function of converting input audio content into sub-band data, a function of evaluating high-level sound, and a function of registering summary information of the audio content.