JP2002140712A

JP2002140712A - AV signal processor, AV signal processing method, program and recording medium

Info

Publication number: JP2002140712A
Application number: JP2001170611A
Authority: JP
Inventors: Hiromasa Shibata; 浩正柴田; Walker Toby; ウォーカートビー
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2000-07-14
Filing date: 2001-06-06
Publication date: 2002-05-17
Anticipated expiration: 2021-06-06
Also published as: US7027508B2; US20060114992A1; US20020061136A1; JP4683253B2

Abstract

PROBLEM TO BE SOLVED: To detect scene boundaries. SOLUTION: In step S1, the input video data is divided into video segments, audio segments, or, where possible, both. In step S2, features representing the characteristics of each segment are calculated. In step S3, the similarity between segments is measured using those features. In step S4, it is determined whether each segment lies at a scene break. Specifically, the video/audio processor treats each segment in turn as the present and, using the dissimilarity metric calculated in step S3 and the features extracted in step S2, detects whether the segments similar to that reference segment are concentrated more heavily in the past or in the future. It then examines how this ratio changes over time and judges whether the segment is a scene boundary.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an AV signal processing apparatus and method, a program, and a recording medium, and more particularly to an AV signal processing apparatus and method, a program, and a recording medium suitable for selecting and reproducing a desired portion of a series of video signals.

[0002]

2. Description of the Related Art

A user may want to find and play back a desired portion, such as a part of particular interest, within a video application made up of a large amount of varied video data, for example a television program recorded as video data.

[0003] A common technique for extracting desired video content in this way is the storyboard, a panel of images laid out side by side that depict the main scenes of the application. A storyboard is produced by breaking the video data down into so-called shots and displaying a representative image for each shot. Most such video extraction techniques automatically detect and extract shots from the video data, as described, for example, in "G. Ahanger and T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-4, 1996".

[0004]

PROBLEMS TO BE SOLVED BY THE INVENTION

A typical 30-minute television program, however, contains hundreds of shots. With the conventional video extraction technique described above, the user therefore has to examine a storyboard filled with an enormous number of extracted shots, and understanding such a storyboard places a heavy burden on the user.

[0005] The conventional video extraction technique also has the problem that many shots are redundant, for example in a conversation scene where the camera alternates between two people as the speaker changes. Shots thus sit too low in the hierarchy to serve as the unit for extracting video structure and carry a large amount of wasted information, so conventional techniques that extract shots are not convenient for the user.

[0006] Other video extraction techniques rely on highly specialized knowledge of a particular content genre, such as news or football games, as described, for example, in "A. Merlino, D. Morey and M. Maybury, Broadcast news navigation using story segmentation, Proc. of ACM Multimedia 97, 1997" and JP-A-10-136297. These techniques give good results for the target genre but are of no use for other genres, and because they are tied to a genre they cannot easily be generalized.

[0007] Another video extraction technique extracts so-called story units, as described, for example, in US Pat. No. 5,708,767. This technique is not fully automated, however: user operation is required to decide which shots show the same content. It also has the problems that the computation required is complex and that it applies only to video information.

[0008] Yet another video extraction technique identifies scenes by combining shot detection with detection of silent portions, as described, for example, in JP-A-9-214879. This technique, however, works only when a silent portion coincides with a shot boundary.

[0009] Another video extraction technique detects repeated similar shots in order to reduce redundancy in the storyboard display, as described, for example, in "H. Aoki, S. Shimotsuji and O. Hori, A shot classification method to select effective key-frames for video browsing, IPSJ Human Interface SIG Notes, 7:43-50, 1996" and JP-A-9-93588. This technique, however, applies only to video information and cannot be applied to audio information.

[0010] Furthermore, several problems arise when these conventional techniques are implemented in home appliances such as set-top boxes and digital video recorders, mainly because the conventional techniques assume post-processing. Specifically, there are the following three problems.

[0011] The first problem is that the number of segments depends on the length of the content, and even when the length is constant the number of shots it contains is not. The amount of memory needed for scene detection therefore cannot be fixed in advance and has to be set excessively large. This is a serious problem for home appliances with little memory.

[0012] The second problem is that home appliances require real-time processing, in which a given process must always finish within a given time. Because the number of segments cannot be fixed and post-processing must be performed, it is difficult to always finish processing within the allotted time. Real-time processing becomes even harder when the low-performance CPU fitted to a home appliance has to be used.

[0013] The third problem is that, because post-processing is required as described above, scene detection cannot be completed each time a segment is generated. If recording stops partway for some reason, the results up to that point cannot be obtained. In other words, sequential processing during recording is impossible, which is a serious problem for home appliances.

[0014] In addition, the conventional techniques determine scenes from repeating patterns of segments or from other ways of grouping segments, so they yield a single, fixed scene detection result. It is therefore impossible to judge whether a detected boundary is more or less likely to be an actual scene boundary, and the number of detected scenes cannot be controlled in stages.

[0015] Furthermore, when presenting an overview of a video, the number of scenes shown should be kept as small as possible for ease of viewing. When the number of scenes that can be shown is limited, the question arises of which scenes to show. If the importance of each scene were known, the scenes could be shown in order of importance. The conventional techniques, however, provide no measure of how important a given scene is.

[0016] The present invention has been made in view of these circumstances, and its object is to detect scene boundaries so that recorded video data can be played back from any scene.

[0017]

MEANS FOR SOLVING THE PROBLEMS

The AV signal processing apparatus of the present invention comprises: feature extraction means for extracting features of segments, each formed by a series of frames constituting an AV signal; calculation means for calculating a metric for measuring the similarity of features between a reference segment and other segments; similarity measurement means for measuring, using the metric, the similarity between the reference segment and the other segments; measurement value calculation means for calculating, using the similarity measured by the similarity measurement means, a measurement value indicating the likelihood that the reference segment is a scene boundary; and boundary determination means for analyzing changes in the temporal pattern of the measurement values calculated by the measurement value calculation means and determining, based on the analysis result, whether the reference segment is a scene boundary.

[0018] The AV signal may include at least one of a video signal and an audio signal.

[0019] The AV signal processing apparatus of the present invention may further include intensity value calculation means for calculating an intensity value indicating the degree of change in the measurement value corresponding to the reference segment.

[0020] The measurement value calculation means may find segments similar to the reference segment within a predetermined time range, analyze the temporal distribution of the similar segments, and calculate the measurement value by quantifying the ratio of similar segments lying in the past to those lying in the future.

[0021] The boundary determination means may also base its determination of whether the reference segment is a scene boundary on the sum of the absolute values of the measurement values.

[0022] When the AV signal includes a video signal, the AV signal processing apparatus of the present invention may further include audio segment generation means for detecting shots, which are the basic units of video segments, and generating audio segments.

[0023] When the AV signal includes an audio signal, the AV signal processing apparatus of the present invention may further include audio segment generation means for generating audio segments using at least one of a feature of the audio signal and silent intervals.

[0024] The features of the video signal may include at least a color histogram.
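
As an illustration of the color histogram feature mentioned above, the following minimal Python sketch computes a coarse, normalized RGB histogram and an L1 distance between two histograms. The function names, the number of bins, and the choice of L1 distance are assumptions made for this example; the specification states only that the video features include at least a color histogram.

```python
from typing import Iterable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each component in 0..255


def color_histogram(pixels: Iterable[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return a normalized histogram with bins_per_channel**3 bins."""
    n = bins_per_channel
    hist = [0] * (n ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each 0..255 component to a bin index in 0..n-1.
        ri, gi, bi = r * n // 256, g * n // 256, b * n // 256
        hist[(ri * n + gi) * n + bi] += 1
        count += 1
    if count == 0:
        return [0.0] * len(hist)
    return [h / count for h in hist]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """L1 distance between normalized histograms (0 = identical, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


if __name__ == "__main__":
    red_frame = [(255, 0, 0)] * 10
    blue_frame = [(0, 0, 255)] * 10
    print(histogram_distance(color_histogram(red_frame), color_histogram(blue_frame)))
```

A per-segment feature could then be formed, for instance, by averaging the histograms of the frames in a segment; that aggregation step is likewise an assumption and is not taken from the text.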

[0025] The features of the audio signal may include at least one of volume and spectrum.

[0026] The boundary determination means may determine whether the reference segment is a scene boundary by comparing the measurement value with a preset threshold.

[0027] The AV signal processing method of the present invention comprises: a feature extraction step of extracting features of segments, each formed by a series of frames constituting an AV signal; a calculation step of calculating a metric for measuring the similarity of features between a reference segment and other segments; a similarity measurement step of measuring, using the metric, the similarity between the reference segment and the other segments; a measurement value calculation step of calculating, using the similarity measured in the similarity measurement step, a measurement value indicating the likelihood that the reference segment is a scene boundary; and a boundary determination step of analyzing changes in the temporal pattern of the measurement values calculated in the measurement value calculation step and determining, based on the analysis result, whether the reference segment is a scene boundary.

[0028] The program of the present invention causes a computer to execute: a feature extraction step of extracting features of segments, each formed by a series of frames constituting an AV signal; a calculation step of calculating a metric for measuring the similarity of features between a reference segment and other segments; a similarity measurement step of measuring, using the metric, the similarity between the reference segment and the other segments; a measurement value calculation step of calculating, using the similarity measured in the similarity measurement step, a measurement value indicating the likelihood that the reference segment is a scene boundary; and a boundary determination step of analyzing changes in the temporal pattern of the measurement values calculated in the measurement value calculation step and determining, based on the analysis result, whether the reference segment is a scene boundary.

[0029] The program on the recording medium of the present invention comprises: a feature extraction step of extracting features of segments, each formed by a series of frames constituting an AV signal; a calculation step of calculating a metric for measuring the similarity of features between a reference segment and other segments; a similarity measurement step of measuring, using the metric, the similarity between the reference segment and the other segments; a measurement value calculation step of calculating, using the similarity measured in the similarity measurement step, a measurement value indicating the likelihood that the reference segment is a scene boundary; and a boundary determination step of analyzing changes in the temporal pattern of the measurement values calculated in the measurement value calculation step and determining, based on the analysis result, whether the reference segment is a scene boundary.

[0030] In the AV signal processing apparatus and method and the program of the present invention, features of segments formed by a series of frames constituting an AV signal are extracted; a metric for measuring the similarity of features between a reference segment and other segments is calculated; the similarity between the reference segment and the other segments is measured using the metric; and a measurement value indicating the likelihood that the reference segment is a scene boundary is calculated using the measured similarity. Changes in the temporal pattern of the calculated measurement values are then analyzed, and whether the reference segment is a scene boundary is determined based on the analysis result.

[0031]

EMBODIMENTS OF THE INVENTION

The object of the present invention is to divide video data into scenes. Dividing here means detecting the boundaries between scenes. A scene consists of one or more segments. Because each scene has its own characteristic features, comparing the segments on either side of the boundary between adjacent scenes reveals a marked difference in their features. In other words, a scene boundary is where such a marked difference appears, and detecting it makes it possible to divide scenes in units of segments.

[0032] As in the conventional techniques described above, the target video data is first divided into segments. The resulting segments form a time series, and for each segment it must be determined whether a scene boundary lies between it and the next segment. Taking each segment in turn as the reference, the method examines where in time the segments similar to it are located among its neighbors.

[0033] Where a scene boundary exists, a change point is detected at which the pattern shifts abruptly, within a short time, from similar segments being concentrated in the past to their being concentrated in the future. The span from one change point to the next is one scene. To find where such a pattern change occurs, it is enough to look at local changes just before and after the scene boundary.
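
The statement that one scene runs from one change point to the next can be sketched as follows in Python. This is only an illustrative sketch: the function name, the 0-based indexing, and the convention that a boundary index names the first segment of a new scene are assumptions of this example.

```python
from typing import List, Sequence


def split_into_scenes(num_segments: int, boundaries: Sequence[int]) -> List[range]:
    """Group segment indices 0..num_segments-1 into scenes.

    Each entry of `boundaries` is taken to be the index of the first
    segment of a new scene (an assumed convention for this sketch).
    """
    if num_segments <= 0:
        return []
    # Scene starts: segment 0 plus every valid change point, without duplicates.
    starts = sorted({0, *(b for b in boundaries if 0 < b < num_segments)})
    ends = starts[1:] + [num_segments]
    return [range(s, e) for s, e in zip(starts, ends)]


if __name__ == "__main__":
    for scene in split_into_scenes(15, [3, 11]):
        print(list(scene))
```

With 15 segments and change points at indices 3 and 11, the sketch yields three scenes covering segments 0 to 2, 3 to 10, and 11 to 14.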

[0034] Furthermore, by measuring the magnitude of this local change, scene division can be controlled in stages. This is possible because it has been found empirically that visual change points coincide well with semantic change points between scenes. On this basis, the present invention detects scene boundaries and divides video data and the like into scenes. The scene boundary information also makes it possible to display the video data in an easily viewable form.

[0035] Next, an outline of the present invention is described concretely. First, the characteristics of the case where a boundary between scenes exists and the case where it does not are described in turn. FIG. 2 shows a concrete example of video data. In the figure, the video data is shown in units of segments and consists of three scenes, scene 1 to scene 3. The time axis runs to the right. A region containing no boundary is called a non-boundary region and a region containing a boundary is called a boundary region; both are shown in detail in FIG. 4.

[0036] The non-boundary region shown in FIG. 4(A) is a portion lying within scene 2: the time range of segments 3 to 11, which contains no boundary with another scene. By contrast, the boundary region in FIG. 4(B) is the time range of segments 8 to 15, where two scenes adjoin, and it includes the boundary between scene 2 and scene 3.

[0037] First, the characteristics of the non-boundary region, which represents the case where no boundary exists, are described. Because the non-boundary region consists only of mutually similar segments, the segments similar to a reference segment within it are distributed roughly evenly between the past and the future. The distribution pattern of similar segments therefore shows no distinctive change.

[0038] Unlike the non-boundary region, the boundary region is a time range that includes the point where two scenes adjoin. A scene here means a group of segments that are highly similar to one another. Segments 8 to 11, which belong to scene 2, thus adjoin segments 12 to 15, which belong to the different scene 3, and the features of the segments differ on either side of the boundary.

[0039] To detect a scene boundary, each segment is first taken as the temporal reference (the present). Detection is then achieved by examining, for each reference, how the temporal distribution pattern of the most similar segments (whether they lie in the past or in the future relative to the reference) changes.

[0040] As the boundary region in FIG. 4(B) shows, as segments 8 to 11 become the temporal reference in turn and the reference approaches the boundary, the proportion of the most similar segments that lie in the past rather than the future rises, reaching 100% immediately before the boundary (the end of the scene). Immediately after the boundary (the start of the next scene), the proportion lying in the future rather than the past is 100%, and it then falls as segments 12 to 15 become the temporal reference in turn.

[0041] Changes in the pattern of this past-to-future ratio of the most similar segments therefore make it possible to identify locations that are likely to be scene boundaries. Moreover, because this characteristic pattern is very likely to appear locally, near a scene boundary, the boundary can be identified from the pattern change by examining only its vicinity. In other words, the time range over which the distribution of similar segments is examined need not be made larger than necessary.
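
The past/future ratio described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the method as claimed: the L1 dissimilarity, the window size, the number `k` of most similar segments, the tie-breaking by temporal distance, and the `jump` threshold are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]  # one feature vector per segment (assumed representation)


def l1(a: Vector, b: Vector) -> float:
    """Dissimilarity between two feature vectors (L1 distance, assumed)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def past_ratio(features: Sequence[Vector], i: int, window: int = 4, k: int = 3) -> float:
    """Fraction of the k most similar neighbours of segment i that lie in the past.

    Only segments within `window` positions of i are examined, reflecting the
    point that a local time range is sufficient.
    """
    lo, hi = max(0, i - window), min(len(features), i + window + 1)
    neighbours = [j for j in range(lo, hi) if j != i]
    if not neighbours:
        return 0.5  # no information: treat past and future as balanced
    # Most similar first; ties are broken by temporal closeness to i.
    nearest = sorted(neighbours, key=lambda j: (l1(features[i], features[j]), abs(j - i)))[:k]
    return sum(1 for j in nearest if j < i) / len(nearest)


def find_boundaries(features: Sequence[Vector], window: int = 4, k: int = 3,
                    jump: float = 0.5) -> List[int]:
    """Return indices i where the ratio drops sharply between segment i-1 and i,
    i.e. similar segments switch from mostly past to mostly future."""
    ratios = [past_ratio(features, i, window, k) for i in range(len(features))]
    return [i for i in range(1, len(ratios)) if ratios[i - 1] - ratios[i] >= jump]


if __name__ == "__main__":
    # Toy data: six segments of one scene followed by six of another.
    toy = [[0.0]] * 6 + [[10.0]] * 6
    print(find_boundaries(toy))
```

On the toy data the ratio reaches 1.0 for the last segment of the first scene and 0.0 for the first segment of the second, mirroring the 100% figures in the description, so the sketch reports a boundary before segment 6.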

[0042] When these pattern changes are expressed numerically, the degree of change in the value tracks the degree of visual change between scenes. Experience and experimental results show that the degree of visual change between scenes in turn tracks the degree of semantic change between scenes. If this numerical value is taken as the boundary measurement value, scenes can therefore be detected at a level corresponding to the degree of semantic change, according to the magnitude of the value.
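
One way to picture this graded control is to keep only boundary candidates whose measurement value is large enough, or only the strongest few. The Python sketch below is illustrative only: the (index, strength) pair layout, the function names, and the example strengths are assumptions, not values from the specification.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[int, float]  # (segment index, boundary strength); assumed layout


def boundaries_above(candidates: Sequence[Candidate], threshold: float) -> List[int]:
    """Keep candidates whose strength reaches the threshold, in time order."""
    return sorted(i for i, s in candidates if s >= threshold)


def strongest_boundaries(candidates: Sequence[Candidate], n: int) -> List[int]:
    """Keep the n strongest candidates, returned in time order."""
    top = sorted(candidates, key=lambda c: c[1], reverse=True)[:n]
    return sorted(i for i, _ in top)


if __name__ == "__main__":
    found = [(6, 1.0), (14, 0.4), (20, 0.7)]  # hypothetical strengths
    for t in (0.3, 0.5, 0.9):
        print(t, boundaries_above(found, t))
```

Raising the threshold yields progressively fewer, coarser scenes, while `strongest_boundaries` illustrates how a fixed number of scenes could be chosen when display space is limited.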

[0043] Next, a video/audio processing apparatus according to an embodiment of the present invention is described. Before that, the video data that the apparatus processes is described.

[0044] In the present invention, the video data to be processed is modeled as shown in FIG. 1 and has a data structure organized into three hierarchical levels: frames, segments, and scenes. At the lowest level, the video data consists of a series of frames. One level above the frames, the video data consists of segments, each formed from a run of consecutive frames. At the highest level, the video data consists of scenes, each formed by grouping segments on the basis of a meaningful relationship.

[0045] This video data generally contains both video and audio information. That is, the frames in this video data include video frames, each a single still image, and audio frames, each representing audio information sampled at a rate of several kHz to several tens of kHz.

[0046] A video segment consists of a series of video frames captured continuously by a single camera and is generally called a shot.

[0047] An audio segment, on the other hand, can be defined in many ways, and the following are examples. An audio segment may be delimited by silent periods in the video data, detected by a generally well-known method. An audio segment may also be formed from a run of audio frames classified into a small number of categories, such as speech, music, noise, and silence, as described in "D. Kimber and L. Wilcox, Acoustic Segmentation for Audio Browsers, Xerox Parc Technical Report". Further, an audio segment may be determined by detecting a large change in some feature between two consecutive audio frames as an audio transition, as described in "S. Pfeiffer, S. Fischer and E. Wolfgang, Automatic Audio Content Analysis, Proceeding of ACM Multimedia 96, Nov. 1996, pp21-30".
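
As a rough illustration of the first definition (segments delimited by silent periods), the Python sketch below splits a sequence of per-frame energy values at sufficiently long quiet runs. The energy representation, the silence level, and the minimum run length are assumptions made for this example; the text refers only to a generally well-known method.

```python
from typing import List, Sequence, Tuple


def split_on_silence(energies: Sequence[float], silence_level: float = 0.01,
                     min_silence_frames: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) pairs of audio-frame indices, end exclusive, for the
    stretches of sound separated by at least `min_silence_frames` quiet frames."""
    segments: List[Tuple[int, int]] = []
    start = None    # first frame of the segment currently being built
    quiet_run = 0   # number of consecutive quiet frames seen so far
    for i, energy in enumerate(energies):
        if energy < silence_level:
            quiet_run += 1
            if start is not None and quiet_run >= min_silence_frames:
                # The silence is long enough: close the segment where it began.
                segments.append((start, i - quiet_run + 1))
                start = None
        else:
            if start is None:
                start = i
            quiet_run = 0
    if start is not None:
        # Close the last segment, dropping any trailing quiet frames.
        segments.append((start, len(energies) - quiet_run))
    return segments


if __name__ == "__main__":
    energy = [0.5, 0.6, 0, 0, 0, 0, 0.4, 0.3, 0, 0.2, 0]
    print(split_on_silence(energy))
```

In the example, the four-frame silence splits the audio into two segments, while the single quiet frame inside the second stretch is too short to count as a boundary.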

[0048] A scene is a higher-level unit based on the meaning of the video content. Scenes are subjective and depend on the content or genre of the video data. A scene consists of video segments or audio segments whose features are similar to one another.

[0049] Here, for each segment in the video data, a change point is detected at which the segments with similar features in its neighborhood shift distinctively from being concentrated in the past to being concentrated in the future, and the span from one change point to the next is taken as one scene. Such a pattern corresponds to a scene break because the segments in each scene have different features, so the similarity relationships between segments change sharply at a scene boundary. This is closely related to the high-level meaningful structure of the video data, and a scene represents such a high-level meaningful unit within the video data.

【００５０】次に、本発明の一実施の形態である映像音
声処理装置の構成例について、図３を参照して説明す
る。映像音声処理装置は、上述したビデオデータにおけ
るセグメントの特徴量を用いてセグメント間の類似性を
測定し、これらのセグメントをシーンにまとめてビデオ
構造を自動的に抽出するものであり、映像セグメントお
よび音声セグメントの両方に適用できるものである。Next, an example of the configuration of a video/audio processing apparatus according to an embodiment of the present invention will be described with reference to FIG. 3. The video/audio processing apparatus measures the similarity between segments using the feature amounts of the segments in the video data described above, groups these segments into scenes, and automatically extracts the video structure. It can be applied to both video segments and audio segments.

【００５１】映像音声処理装置は、図３に示すように、
入力されるビデオデータのストリームを映像または音
声、あるいは両方のセグメントに分割するビデオ分割部
１１、ビデオデータの分割情報を記憶するビデオセグメ
ントメモリ１２、各映像セグメントにおける特徴量を抽
出する映像特徴量抽出部１３、各音声セグメントにおけ
る特徴量を抽出する音声特徴量抽出部１４、映像セグメ
ントおよび音声セグメントの特徴量を記憶するセグメン
ト特徴量メモリ１５、映像セグメントおよび音声セグメ
ントをシーンにまとめるシーン検出部１６、および２つ
のセグメント間の類似性を測定する特徴量類似性測定部
１７より構成される。As shown in FIG. 3, the video/audio processing apparatus comprises: a video dividing unit 11 that divides an input video data stream into video segments, audio segments, or both; a video segment memory 12 that stores the division information of the video data; a video feature amount extraction unit 13 that extracts the feature amount of each video segment; an audio feature amount extraction unit 14 that extracts the feature amount of each audio segment; a segment feature amount memory 15 that stores the feature amounts of the video segments and audio segments; a scene detection unit 16 that groups the video segments and audio segments into scenes; and a feature amount similarity measuring unit 17 that measures the similarity between two segments.

【００５２】ビデオ分割部１１は、入力される、例え
ば、MPEG(Moving Picture Experts Group)１、MPEG２、
またはいわゆるＤＶ(Digital Video)などの圧縮ビデオ
データフォーマットを含む種々のディジタル化されたフ
ォーマットにおける映像データと音声データとからなる
ビデオデータのストリームを映像、音声またはこれらの
両方のセグメントに分割するものである。The video dividing unit 11 divides an input video data stream, consisting of video data and audio data in any of various digitized formats, including compressed video data formats such as MPEG (Moving Picture Experts Group) 1, MPEG2, or so-called DV (Digital Video), into video segments, audio segments, or both.

【００５３】ビデオ分割部１１は、入力されるビデオデ
ータが圧縮フォーマットであった場合、この圧縮ビデオ
データを完全伸張することなく直接処理することができ
る。ビデオ分割部１１は、入力されたビデオデータを処
理し、映像セグメントと音声セグメントとに分割する。
また、ビデオ分割部１１は、入力したビデオデータを分
割した結果である分割情報を後段のビデオセグメントメ
モリ１２に出力する。さらに、ビデオ分割部１１は、映
像セグメントと音声セグメントとに応じて、分割情報を
後段の映像特徴量抽出部１３および音声特徴量抽出部１
４に出力する。When the input video data is in a compressed format, the video dividing unit 11 can process the compressed video data directly without fully decompressing it. The video dividing unit 11 processes the input video data and divides it into video segments and audio segments. The video dividing unit 11 also outputs division information, which is the result of dividing the input video data, to the downstream video segment memory 12. Further, the video dividing unit 11 outputs the division information to the downstream video feature amount extraction unit 13 and audio feature amount extraction unit 14, according to whether a segment is a video segment or an audio segment.

【００５４】ビデオセグメントメモリ１２は、ビデオ分
割部１１から供給されたビデオデータの分割情報を記憶
する。また、ビデオセグメントメモリ１２は、後述する
シーン検出部１６からの問い合わせに応じて、分割情報
をシーン検出部１６に出力する。The video segment memory 12 stores the division information of the video data supplied from the video division section 11. In addition, the video segment memory 12 outputs division information to the scene detection unit 16 in response to an inquiry from the scene detection unit 16 described later.

【００５５】映像特徴量抽出部１３は、ビデオ分割部１
１によりビデオデータを分割して得た各映像セグメント
の特徴量を抽出する。映像特徴量抽出部１３は、圧縮映
像データを完全伸張することなく直接処理することがで
きる。映像特徴量抽出部１３は、抽出した各映像セグメ
ントの特徴量を後段のセグメント特徴量メモリ１５に出
力する。The video feature amount extraction unit 13 extracts the feature amount of each video segment obtained when the video dividing unit 11 divides the video data. The video feature amount extraction unit 13 can process compressed video data directly without fully decompressing it. The video feature amount extraction unit 13 outputs the extracted feature amount of each video segment to the downstream segment feature amount memory 15.

【００５６】音声特徴量抽出部１４は、ビデオ分割部１
１によりビデオデータを分割して得た各音声セグメント
の特徴量を抽出する。音声特徴量抽出部１４は、圧縮音
声データを完全伸張することなく直接処理することがで
きる。音声特徴量抽出部１４は、抽出した各音声セグメ
ントの特徴量を後段のセグメント特徴量メモリ１５に出
力する。The audio feature amount extraction unit 14 extracts the feature amount of each audio segment obtained when the video dividing unit 11 divides the video data. The audio feature amount extraction unit 14 can process compressed audio data directly without fully decompressing it. The audio feature amount extraction unit 14 outputs the extracted feature amount of each audio segment to the downstream segment feature amount memory 15.

【００５７】セグメント特徴量メモリ１５は、映像特徴
量抽出部１３および音声特徴量抽出部１４からそれぞれ
供給された映像セグメントおよび音声セグメントの特徴
量を記憶する。セグメント特徴量メモリ１５は、後述す
る特徴量類似性測定部１７からの問い合わせに応じて、
記憶している特徴量やセグメントを特徴量類似性測定部
１７に出力する。The segment feature amount memory 15 stores the feature amounts of the video segments and audio segments supplied from the video feature amount extraction unit 13 and the audio feature amount extraction unit 14, respectively. In response to an inquiry from the feature amount similarity measuring unit 17 described later, the segment feature amount memory 15 outputs the stored feature amounts and segments to the feature amount similarity measuring unit 17.

【００５８】シーン検出部１６は、ビデオセグメントメ
モリ１２に保持された分割情報と、セグメント間の類似
性とを用いて、映像セグメントおよび音声セグメントが
シーンの境界であるかを判断する。シーン検出部１６
は、各セグメントの近隣の最も類似な特徴量を持つセグ
メントの分布パターンが、過去に集中した状態から未来
に集中した状態へ切り替わる変化点を特定することによ
り、シーンの境界を検出し先頭部と最後部を確定する。
シーン検出部１６は、セグメントが発生する毎に1セグ
メント分、時系列的に移動させ、近隣の最も類似してい
るセグメントの分布パターンを測定する。シーン検出部
１６は、特徴量類似性測定部１７を用いて、近隣のセグ
メントで最も類似しているものの数を特定する。すなわ
ち、特徴空間における特徴量の最近傍の数を求める。そ
してセグメントの最近傍の類似セグメントがそのセグメ
ントを境にして過去に存在するものと未来に存在するも
のとの個数の違いのパターンの変化からシーンの境界を
特定する。The scene detection unit 16 uses the division information held in the video segment memory 12 and the similarity between segments to determine whether a video segment or audio segment is at a scene boundary. The scene detection unit 16 detects scene boundaries, and thereby fixes the beginning and end of each scene, by identifying the change point at which the distribution of the most similar segments in the neighborhood of each segment switches from being concentrated in the past to being concentrated in the future. Each time a segment is generated, the scene detection unit 16 advances by one segment in time order and measures the distribution pattern of the most similar neighboring segments. The scene detection unit 16 uses the feature amount similarity measuring unit 17 to determine the number of most similar segments among the neighbors, that is, the number of nearest neighbors of the feature amount in feature space. It then identifies a scene boundary from the change in the pattern of the difference between the number of nearest similar segments that lie in the past and the number that lie in the future relative to the segment.
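
The neighbor counting described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the apparatus's actual implementation: the function names, the L1 distance, the neighborhood size `window`, the number of nearest neighbors `k`, and the simple "past-heavy then future-heavy" rule are all hypothetical choices made for the example.

```python
def l1_distance(a, b):
    """Assumed dissimilarity: sum of absolute differences of two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def past_future_counts(features, index, window=4, k=3, dist=l1_distance):
    """Count how many of the k most similar neighbors of segment `index`
    lie before it (past) and after it (future) within `window` segments."""
    lo = max(0, index - window)
    hi = min(len(features), index + window + 1)
    neighbors = [j for j in range(lo, hi) if j != index]
    # Order neighbors from most similar to least similar to the reference segment.
    neighbors.sort(key=lambda j: dist(features[index], features[j]))
    nearest = neighbors[:k]
    past = sum(1 for j in nearest if j < index)
    future = sum(1 for j in nearest if j > index)
    return past, future


def boundary_candidates(features, window=4, k=3):
    """Return segment indices where the pattern flips from past-heavy to future-heavy."""
    counts = [past_future_counts(features, i, window, k) for i in range(len(features))]
    candidates = []
    for i in range(1, len(counts)):
        prev_past, prev_future = counts[i - 1]
        cur_past, cur_future = counts[i]
        if prev_past > prev_future and cur_future > cur_past:
            candidates.append(i)
    return candidates
```

For a sequence whose first four segments resemble one another and whose last four resemble one another, the sketch reports the first segment of the second group as the boundary candidate.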

【００５９】特徴量類似性測定部１７は、各セグメント
とその近隣のセグメントとの類似性を測定する。特徴量
類似性測定部１７は、あるセグメントに関する特徴量を
検索するようにセグメント特徴量メモリ１５に問いかけ
る。The feature quantity similarity measuring unit 17 measures the similarity between each segment and its neighboring segments. The feature similarity measurement unit 17 queries the segment feature memory 15 to search for a feature related to a certain segment.

【００６０】ビデオデータ記録部１８は、ビデオストリ
ームおよびビデオデータに関する各種のデータである、
いわゆる付加情報データを記録する。ここにシーン検出
部１６から出力されたシーン境界情報およびシーンに対
して計算された強度値が保存される。The video data recording unit 18 records the video stream together with so-called additional information data, which is various data relating to the video data. The scene boundary information output from the scene detection unit 16 and the intensity values calculated for the scenes are stored here.

【００６１】ビデオ表示部１９は、ビデオデータ記録部
１８からのビデオデータを、各種付加情報データに基
き、サムネイルのような表示方法やランダムアクセス方
法などを実現する。これはユーザの視聴方法に自由度を
増やし、利便性良くビデオデータを表示する。The video display unit 19 displays the video data from the video data recording unit 18 and, based on the various additional information data, provides display methods such as thumbnails and random access. This gives the user more freedom in how to view the video data and displays it conveniently.

【００６２】制御部２０は、ドライブ２１を制御して、
磁気ディスク２２、光ディスク２３、光磁気ディスク２
４、または半導体メモリ２５に記憶されている制御用プ
ログラムを読み出し、読み出した制御用プログラムに基
づいて、映像音声処理装置の各部を制御する。The control unit 20 controls the drive 21 to read a control program stored on the magnetic disk 22, the optical disk 23, the magneto-optical disk 24, or the semiconductor memory 25, and controls each unit of the video/audio processing apparatus based on the control program that was read.

【００６３】映像音声処理装置は、図５に概略を示すよ
うな一連の処理を行うことによって、シーンを検出す
る。The video/audio processing apparatus detects scenes by performing a series of processes as outlined in FIG. 5.

【００６４】まず、映像音声処理装置は、同図に示すよ
うに、ステップＳ１において、ビデオ分割を行う。すな
わち映像音声処理装置は、ビデオ分割部１１に入力され
たビデオデータを映像セグメントまたは音声セグメント
のいずれか、あるいは可能であればその両方に分割す
る。First, as shown in the same figure, the video/audio processing apparatus performs video division in step S1. That is, the apparatus divides the video data input to the video dividing unit 11 into video segments, audio segments, or, if possible, both.

【００６５】映像音声処理装置が適用するビデオ分割方
法には、特に前提要件を設けない。例えば、映像音声処
理装置は、“G. Ahanger and T.D.C. Little, A survey
oftechnologies for parsing and indexing digital v
ideo, J. of Visual Communication and Image Represe
ntation 7:28-4, 1996”に記載されているような方法に
よりビデオ分割を行う。このようなビデオ分割の方法
は、当該技術分野ではよく知られたものであり、映像音
声処理装置は、いかなるビデオ分割方法も適用できるも
のとする。The video division method applied by the video/audio processing apparatus has no particular prerequisites. For example, the apparatus performs video division by a method such as that described in "G. Ahanger and T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-4, 1996". Such video division methods are well known in the art, and the apparatus can apply any video division method.

【００６６】次に、映像音声処理装置は、ステップＳ２
において、特徴量の抽出を行う。すなわち映像音声処理
装置は、映像特徴量抽出部１３や音声特徴量抽出部１４
により、そのセグメントの特徴を表す特徴量を計算す
る。映像音声処理装置では、例えば、各セグメントの時
間長や、カラーヒストグラムやテクスチャフィーチャと
いった映像特徴量や、周波数解析結果、レベル、ピッチ
といった音声特徴量やアクティビティ測定結果等が、適
用可能な特徴量として計算される。勿論、映像音声処理
装置は、適用可能な特徴量としてこれらに限定されるも
のではない。Next, in step S2, the video/audio processing apparatus extracts feature amounts. That is, the video feature amount extraction unit 13 and the audio feature amount extraction unit 14 calculate feature amounts that represent the features of each segment. Applicable feature amounts include, for example, the time length of each segment; video feature amounts such as color histograms and texture features; audio feature amounts such as frequency analysis results, level, and pitch; and activity measurement results. The applicable feature amounts are, of course, not limited to these.

【００６７】続いて、映像音声処理装置は、ステップＳ
３において、特徴量を用いたセグメントの類似性測定を
行う。すなわち映像音声処理装置は、特徴量類似性測定
部１７により非類似性測定を行い、その測定基準によ
り、セグメントとその近隣のセグメントがどの程度類似
しているかを測定する。映像音声処理装置は、先のステ
ップＳ２において抽出した特徴量を用いて、非類似性測
定基準を計算する。Subsequently, in step S3, the video/audio processing apparatus measures the similarity of segments using the feature amounts. That is, the feature amount similarity measuring unit 17 performs a dissimilarity measurement, and the resulting metric is used to measure how similar each segment is to its neighboring segments. The apparatus calculates the dissimilarity metric using the feature amounts extracted in step S2.

【００６８】そして、映像音声処理装置は、ステップＳ
４において、セグメントがシーンの切れ目にあたるか否
かを判断する。すなわち、映像音声処理装置は、先のス
テップＳ３において計算した非類似性測定基準と、先の
ステップＳ２において抽出した特徴量とを用いて、各セ
グメントを現在と見なし、近接の類似したセグメント
が、その基準とするセグメントに対し過去か未来かどち
らに存在比率が高いかを求め、その存在比の率変化のパ
ターンを調べ、シーンの境界であるか否かの判断をす
る。映像音声処理装置は、このようにして最終的に各セ
グメントがシーンの切れ目であるか否かを出力する。Then, in step S4, the video/audio processing apparatus determines whether each segment corresponds to a scene break. That is, using the dissimilarity metric calculated in step S3 and the feature amounts extracted in step S2, the apparatus treats each segment in turn as the present, determines whether the similar neighboring segments are more numerous in the past or in the future relative to that reference segment, examines the pattern of change in this ratio, and determines whether the segment is a scene boundary. In this way, the apparatus finally outputs whether or not each segment is a scene break.

【００６９】このような一連の処理を経ることによっ
て、映像音声処理装置は、ビデオデータからシーンを検
出することができる。Through such a series of processing, the video / audio processing apparatus can detect a scene from video data.

【００７０】したがって、ユーザは、この結果を用いる
ことによって、ビデオデータの内容を要約したり、ビデ
オデータ中の興味のあるポイントに迅速にアクセスした
りすることが可能となる。Therefore, by using the result, the user can summarize the contents of the video data and quickly access a point of interest in the video data.

【００７１】以下、上述した処理の各ステップをより詳
細に説明する。Hereinafter, each step of the above processing will be described in more detail.

【００７２】ステップＳ１におけるビデオ分割について
説明する。映像音声処理装置は、ビデオ分割部１１に入
力されたビデオデータを映像セグメントまたは音声セグ
メントのいずれか、あるいは可能であればその両方に分
割するが、このビデオデータにおけるセグメントの境界
を自動的に検出するための技術は多くのものがあり、映
像音声処理装置において、このビデオ分割方法に特別な
前提要件を設けないことは上述した通りである。The video division in step S1 will be described. The video/audio processing apparatus divides the video data input to the video dividing unit 11 into video segments, audio segments, or, if possible, both. Many techniques exist for automatically detecting segment boundaries in video data, and, as described above, the apparatus places no special prerequisites on the video division method.

【００７３】一方、映像音声処理装置において、後の処
理によるシーン検出の精度は、本質的に、基礎となるビ
デオ分割の精度に依存する。なお、映像音声処理装置に
おけるシーン検出は、ある程度ビデオ分割時のエラーを
許容することができる。特に、映像音声処理装置におい
て、ビデオ分割は、セグメント検出が不十分である場合
よりも、セグメント検出を過度に行う場合の方が好まし
い。映像音声処理装置は、類似したセグメントの検出が
過度である結果である限り、一般に、シーン検出の際に
検出過度であるセグメントを同一シーンとしてまとめる
ことができる。On the other hand, the accuracy of scene detection in the later processing of the video/audio processing apparatus depends essentially on the accuracy of the underlying video division. Scene detection in the apparatus can, however, tolerate some errors in video division. In particular, over-segmentation is preferable to under-segmentation: as long as the over-detected segments are similar to one another, the apparatus can generally merge them into the same scene during scene detection.

【００７４】ステップＳ２における特徴量抽出について
説明する。特徴量とは、セグメントの特徴を表すととも
に、異なるセグメント間の類似性を測定するためのデー
タを供給するセグメントの属性である。映像音声処理装
置は、映像特徴量抽出部１３や音声特徴量抽出部１４に
おいて各セグメントの特徴量を計算し、セグメントの特
徴を表す。The feature value extraction in step S2 will be described. The feature amount is an attribute of the segment that represents the feature of the segment and supplies data for measuring the similarity between different segments. In the video / audio processing apparatus, the video feature extraction unit 13 and the audio feature extraction unit 14 calculate the feature of each segment, and represent the feature of the segment.

【００７５】映像音声処理装置は、いかなる特徴量の具
体的詳細にも依存するものではないが、映像音声処理装
置において用いて効果的であると考えられる特徴量とし
ては、例えば以下に示す映像特徴量、音声特徴量、映像
音声共通特徴量のようなものがある。映像音声処理装置
において適用可能となるこれら特徴量の必要条件は、非
類似性の測定が可能であることである。また映像音声処
理装置は、効率化のために、特徴量抽出と上述したビデ
オ分割とを同時に行うことがある。以下に説明する特徴
量は、このような処理を可能にするものである。Although the video/audio processing apparatus does not depend on the specific details of any particular feature amount, feature amounts considered effective for use in the apparatus include, for example, the video feature amounts, audio feature amounts, and video/audio common feature amounts described below. The requirement for a feature amount to be applicable in the apparatus is that its dissimilarity can be measured. For efficiency, the apparatus may also perform feature extraction and the video division described above at the same time. The feature amounts described below make such processing possible.

【００７６】上記特徴量としては、まず映像に関するも
のが挙げられる。以下では、これを映像特徴量と称する
ことにする。映像セグメントは、連続する映像フレーム
により構成されるため、映像セグメントから適切な映像
フレームを抽出することによって、その映像セグメント
の描写内容を、抽出した映像フレームで特徴付けること
が可能である。すなわち映像セグメントの類似性は、適
切に抽出された映像フレームの類似性で代替可能であ
る。つまり映像特徴量は、映像音声処理装置で用いるこ
とができる重要な特徴量の１つである。この場合の映像
特徴量は、単独では静的な情報しか表せないが、映像音
声処理装置は、後述するような方法を適用することによ
って、この映像特徴量に基づく映像セグメントの動的な
特徴を抽出する。The first kind of feature amount relates to video and is referred to below as a video feature amount. Since a video segment consists of consecutive video frames, extracting suitable video frames from the segment makes it possible to characterize what the segment depicts by the extracted frames. That is, the similarity of video segments can be replaced by the similarity of suitably extracted video frames. The video feature amount is thus one of the important feature amounts usable in the video/audio processing apparatus. A video feature amount by itself can express only static information, but the apparatus extracts the dynamic features of a video segment based on this video feature amount by applying a method described later.

【００７７】映像特徴量として既知のものは多数存在す
るが、シーン検出のためには以下に示す色特徴量（ヒス
トグラム）および映像相関が、計算コストと精度との良
好な兼ね合いを与えることを見出したことから、映像音
声処理装置は、映像特徴として、色特徴量および映像相
関を用いることにする。Although there are many known image feature amounts, it has been found that, for scene detection, the following color feature amounts (histogram) and image correlation give a good balance between calculation cost and accuracy. Therefore, the video and audio processing device uses the color feature amount and the video correlation as the video features.

【００７８】映像音声処理装置において、映像における
色は、２つの映像が類似しているかを判断する際の重要
な材料となる。カラーヒストグラムを用いて映像の類似
性を判断することは、例えば“G. Ahanger and T.D.C.
Little, A survey of technologies for parsing and i
ndexing digital video, J. of Visual Communication
and Image Representation 7:28-4, 1996”に記載され
ているように、よく知られている。In the video/audio processing apparatus, color is an important cue when judging whether two images are similar. Judging the similarity of images using color histograms is well known, as described, for example, in "G. Ahanger and T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-4, 1996".

【００７９】ここでカラーヒストグラムとは、例えばLU
VやRGB等の３次元色空間をｎ個の領域に分割し、映像に
おける画素の、各領域での出現頻度の相対的割合を計算
したものである。そして、得られた情報からは、ｎ次元
ベクトルが与えられる。圧縮されたビデオデータについ
ては、例えば米国特許５７０８７６７号公報に記載され
ているように、カラーヒストグラムを、圧縮データから
直接抽出することができる。Here, a color histogram is obtained by dividing a three-dimensional color space such as LUV or RGB into n regions and calculating the relative frequency with which the pixels of an image fall in each region. The result is an n-dimensional vector. For compressed video data, the color histogram can be extracted directly from the compressed data, as described, for example, in US Pat. No. 5,708,767.

【００８０】映像音声処理装置では、セグメントを構成
する映像（MPEG1／2，DVなど一般的に使われている方
式）における元々のYUV色空間のヒストグラムベクトル
を得る。The video / audio processing apparatus obtains the original histogram vector in the YUV color space in the video (a generally used method such as MPEG1 / 2 or DV) constituting the segment.

【００８１】映像音声処理装置では、セグメントを構成
する映像（MPEG1／2，DVなど一般的に使われている方
式）における元来のYUV色空間を、色チャンネル当たり
２ビットでサンプリングして構成した、２^2・3＝６４次
元のヒストグラムベクトルを得る。In the video/audio processing apparatus, the original YUV color space of the video constituting a segment (in commonly used formats such as MPEG1/2 and DV) is sampled at 2 bits per color channel, giving a histogram vector of 2^(2·3) = 64 dimensions.
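
As a rough illustration of the 64-dimensional histogram, the sketch below keeps the two most significant bits of each channel and counts pixels per bin. It assumes 8-bit Y, U, and V values already available as decoded pixels (the text notes that the histogram can also be obtained from compressed data, which is not shown here); the function name and the bin ordering are hypothetical.

```python
def yuv_histogram_64(pixels):
    """Return a 64-bin relative-frequency histogram for a list of (Y, U, V) pixels.

    Assumes 8-bit channel values in 0..255. Each channel keeps only its two
    most significant bits, giving 4 * 4 * 4 = 64 bins.
    """
    counts = [0] * 64
    for y, u, v in pixels:
        # Bin layout (an assumption): Y in bits 5-4, U in bits 3-2, V in bits 1-0.
        bin_index = ((y >> 6) << 4) | ((u >> 6) << 2) | (v >> 6)
        counts[bin_index] += 1
    total = len(pixels)
    if total == 0:
        return [0.0] * 64
    return [c / total for c in counts]
```

Normalizing by the pixel count makes histograms of frames with different resolutions comparable.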

【００８２】このようなヒストグラムは、映像の全体的
な色調を表すが、これには時間情報が含まれていない。
そこで、映像音声処理装置では、もう１つの映像特徴量
として、映像相関を計算する。映像音声処理装置でのシ
ーン検出において、複数の類似セグメントが互いに交差
した構造は、それがまとまった１つのシーン構造である
ことの有力な指標となる。Such a histogram represents the overall color tone of an image but contains no time information. The video/audio processing apparatus therefore calculates video correlation as another video feature amount. In scene detection by the apparatus, a structure in which multiple similar segments are interleaved with one another is a strong indicator that they form a single coherent scene.

【００８３】例えば会話場面において、カメラの位置
は、２人の話し手の間を交互に移動するが、カメラは通
常、同一の話し手を再度撮影するときには、ほぼ同じ位
置に戻る。このような場合における構造を検出するため
には、グレイスケールの縮小映像に基づく相関がセグメ
ントの類似性の良好な指標となることを見出したことか
ら、映像音声処理装置では、元の映像をＭ×Ｎの大きさ
のグレイスケール映像に間引き縮小し、これを用いて映
像相関を計算する。ここで、ＭとＮは、両方とも小さな
値で十分であり、例えば８×８である。つまり、これら
の縮小グレイスケール映像は、ＭＮ次元の特徴量ベクト
ルとして解釈される。For example, in a conversation scene, the camera alternates between two speakers, and it usually returns to almost the same position when it shoots the same speaker again. Because correlation based on reduced grayscale images was found to be a good indicator of segment similarity for detecting such structure, the apparatus decimates the original image down to a grayscale image of size M×N and uses it to calculate video correlation. Small values suffice for both M and N, for example 8×8. These reduced grayscale images are thus interpreted as MN-dimensional feature vectors.
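
The reduction to an MN-dimensional vector might look like the following sketch. It assumes the grayscale image is a list of rows of intensity values, takes M as the number of columns and N as the number of rows (the text does not say which is which), and uses plain decimation without filtering; the function name is hypothetical.

```python
def reduce_grayscale(image, m=8, n=8):
    """Decimate a grayscale image (list of rows) to m columns by n rows
    and flatten the result into an m*n-dimensional feature vector."""
    height = len(image)
    width = len(image[0])
    vector = []
    for r in range(n):
        # Pick one source row for each output row by simple decimation.
        src_row = image[r * height // n]
        for c in range(m):
            vector.append(src_row[c * width // m])
    return vector
```

Two such vectors can then be compared with any dissimilarity metric defined on real-valued vectors.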

【００８４】さらに上述した映像特徴量とは異なる特徴
量としては、音声に関するものが挙げられる。以下で
は、この特徴量を音声特徴量と称することにする。音声
特徴量とは、音声セグメントの内容を表すことができる
特徴量であり、映像音声処理装置は、この音声特徴量と
して、周波数解析、ピッチ、レベル等を用いることがで
きる。これらの音声特徴量は、種々の文献により知られ
ているものである。Further, as a feature amount different from the above-mentioned video feature amount, there is a feature amount related to audio. Hereinafter, this feature will be referred to as a voice feature. The audio feature is a feature that can represent the content of an audio segment, and the video / audio processing device can use frequency analysis, pitch, level, and the like as the audio feature. These speech features are known from various documents.

【００８５】まず、映像音声処理装置は、フーリエ変換
等の周波数解析を行うことによって、単一の音声フレー
ムにおける周波数情報の分布を決定することができる。
映像音声処理装置は、例えば、１つの音声セグメントに
わたる周波数情報の分布を表すために、FFT（Fast Four
ier Transform；高速フーリエ変換）成分、周波数ヒス
トグラム、パワースペクトル、ケプストラム(Cepstru
m)、その他の特徴量を用いることができる。First, the video/audio processing apparatus can determine the distribution of frequency information in a single audio frame by performing frequency analysis such as a Fourier transform. To represent the distribution of frequency information over one audio segment, the apparatus can use, for example, FFT (Fast Fourier Transform) components, a frequency histogram, a power spectrum, a cepstrum, or other feature amounts.

【００８６】また、映像音声処理装置は、平均ピッチや
最大ピッチなどのピッチや、平均ラウドネスや最大ラウ
ドネスなどの音声レベルもまた、音声セグメントを表す
有効な音声特徴量として用いることができる。The video / audio processing apparatus can also use pitches such as an average pitch and a maximum pitch, and audio levels such as an average loudness and a maximum loudness as effective audio feature amounts representing audio segments.

【００８７】さらに他の特徴量としては、映像音声共通
特徴量が挙げられる。これは、特に映像特徴量でもなく
音声特徴量でもないが、映像音声処理装置において、シ
ーン内のセグメントの特徴を表すのに有用な情報を与え
るものである。映像音声処理装置は、この映像音声共通
特徴量として、セグメント長とアクティビティとを用い
る。As another feature, there is a video / audio common feature. This is not a video feature amount nor an audio feature amount, but provides information useful for representing characteristics of a segment in a scene in a video / audio processing apparatus. The video / audio processing device uses the segment length and the activity as the video / audio common feature amount.

【００８８】映像音声処理装置は、映像音声共通特徴量
として、セグメント長を用いることができる。このセグ
メント長は、セグメントにおける時間長である。一般
に、シーンは、そのシーンに固有のリズム特徴を有す
る。このリズム特徴は、シーン内のセグメント長の変化
として表れる。例えば、迅速に連なった短いセグメント
は、コマーシャルを表す。一方、会話シーンにおけるセ
グメントは、コマーシャルの場合よりも長く、また会話
シーンには、相互に組み合わされたセグメントが互いに
類似しているという特徴がある。映像音声処理装置は、
このような特徴を有するセグメント長を映像音声共通特
徴量として用いることができる。The video/audio processing apparatus can use segment length as a video/audio common feature amount. The segment length is the time length of a segment. In general, a scene has rhythm characteristics peculiar to it, which appear as variation in the lengths of the segments in the scene. For example, a rapid succession of short segments indicates a commercial. Segments in a conversation scene, by contrast, are longer than those in a commercial, and a conversation scene is characterized by interleaved segments that are similar to one another. The apparatus can use segment length, which has these characteristics, as a video/audio common feature amount.

【００８９】また、映像音声処理装置は、映像音声共通
特徴量として、アクティビティを用いることができる。
アクティビティとは、セグメントの内容がどの程度動的
あるいは静的であるように感じられるかを表す指標であ
る。例えば、視覚的に動的である場合、アクティビティ
は、カメラが対象物に沿って迅速に移動する度合い、ま
たは撮影されているオブジェクトが迅速に変化する度合
いを表す。The video / audio processing apparatus can use an activity as the video / audio common feature.
The activity is an index indicating how much the content of the segment feels dynamic or static. For example, if visually dynamic, the activity represents the degree to which the camera moves quickly along the object or the object being photographed changes rapidly.

【００９０】このアクティビティは、カラーヒストグラ
ムのような特徴量のフレーム間非類似性の平均値を測定
することにより、間接的に計算される。ここで、フレー
ムｉとフレームｊとの間で測定された特徴量Ｆに対する
非類似性測定基準をｄＦ（ｉ，ｊ）と定義すると、映像
アクティビティＶＦは、次式（１）のように定義され
る。This activity is calculated indirectly by measuring the average inter-frame dissimilarity of a feature amount such as a color histogram. If the dissimilarity metric for feature amount F measured between frame i and frame j is defined as d_F(i, j), the video activity V_F is defined by the following equation (1).

【数１】 (Equation 1)

【００９１】式（１）において、ｂとｆはそれぞれ、１
セグメントにおける最初と最後のフレームのフレーム番
号である。映像音声処理装置は、具体的には、例えば上
述したヒストグラムを用いて、映像アクティビティＶＦ
を計算する。In equation (1), b and f are the frame numbers of the first and last frames in one segment, respectively. Specifically, the video/audio processing apparatus calculates the video activity V_F using, for example, the histogram described above.
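
Equation (1) itself is not reproduced in this text, so the sketch below only follows the verbal description: it averages the dissimilarity between consecutive frames from frame b to frame f. Averaging over consecutive pairs and dividing by the number of pairs (f − b) are assumptions, as are the function names and the use of an L1 distance between histograms.

```python
def video_activity(frame_features, dist):
    """Mean dissimilarity between consecutive frames of one segment.

    `frame_features[0]` corresponds to frame b and `frame_features[-1]`
    to frame f; a one-frame segment is treated as having zero activity.
    """
    pairs = len(frame_features) - 1
    if pairs < 1:
        return 0.0
    total = sum(dist(frame_features[i], frame_features[i + 1]) for i in range(pairs))
    return total / pairs


def l1_distance(a, b):
    """Assumed histogram dissimilarity: sum of absolute bin differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

A segment whose frames all share the same histogram yields an activity of zero, while rapid changes between frames raise the value.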

【００９２】ところで、上述した映像特徴量を始めとす
る特徴量は、基本的にはセグメントの静的情報を表すも
のであることは上述した通りであるが、セグメントの特
徴を正確に表すためには、その動的情報も考慮する必要
がある。そこで、映像音声処理装置は、以下に示すよう
な特徴量のサンプリング方法により動的情報を表す。As described above, the feature amounts discussed so far, including the video feature amounts, basically represent static information about a segment; to represent the features of a segment accurately, its dynamic information must also be considered. The video/audio processing apparatus therefore represents dynamic information by the feature sampling method described below.

【００９３】映像音声処理装置は、例えば図５に示すよ
うに、１セグメント内の異なる時点から１以上の静的な
特徴量を抽出する。このとき、映像音声処理装置は、特
徴量の抽出数を、そのセグメント表現における忠実度の
最大化とデータ冗長度の最小化とのバランスをとること
により決定する。例えば、セグメント内のある１画像が
当該セグメントのキーフレームとして指定可能な場合に
は、そのキーフレームから計算されたヒストグラムが、
抽出すべきサンプリング特徴量となる。The video/audio processing apparatus extracts one or more static feature amounts from different points in time within one segment, for example as shown in FIG. 5. The apparatus determines the number of feature amounts to extract by balancing maximum fidelity of the segment representation against minimum data redundancy. For example, if a single image in the segment can be designated as the key frame of the segment, the histogram calculated from that key frame is the sampled feature amount to extract.

【００９４】映像音声処理装置は、後述するサンプリン
グ方法を用いて、対象とするセグメントにおいて、特徴
として抽出可能なサンプルのうち、どのサンプルを選択
するかを決定する。The video / audio processing apparatus determines which of the samples that can be extracted as features in the target segment is to be selected by using a sampling method described later.

【００９５】ところで、あるサンプルが常に所定の時
点、例えばセグメント内の最後の時点において選択され
る場合を考える。この場合、黒フレームへ変化してゆく
（フェードしてゆく）任意の２つのセグメントについて
は、サンプルが同一の黒フレームとなるため、同一の特
徴量が得られる結果になる恐れがある。すなわち、これ
らのセグメントの映像内容がいかなるものであれ、選択
した２つのフレームは、極めて類似していると判断され
てしまう。このような問題は、サンプルが良好な代表値
でないために発生するものである。Now, consider a case where a certain sample is always selected at a predetermined time, for example, the last time in a segment. In this case, for any two segments that change (fade) to a black frame, the samples have the same black frame, so that the same feature value may be obtained. That is, whatever the video content of these segments is, the selected two frames are determined to be extremely similar. Such a problem occurs because the sample is not a good representative value.

【００９６】そこで、映像音声処理装置は、このように
固定点で特徴量を抽出するのではなく、セグメント全体
における統計的な代表値を抽出することとする。ここで
は、一般的な特徴量のサンプリング方法を２つの場合、
すなわち、特徴量を実数のｎ次元ベクトルとして表すこ
とができる第１の場合と、非類似性測定基準しか利用で
きない第２の場合とについて説明する。なお、第１の場
合は、ヒストグラムやパワースペクトル等、最もよく知
られている映像特徴量および音声特徴量が含まれる。Therefore, instead of extracting the feature amount at a fixed point in this way, the video/audio processing apparatus extracts statistically representative values over the whole segment. A general feature sampling method is described here for two cases: the first, in which the feature amount can be represented as a real-valued n-dimensional vector, and the second, in which only a dissimilarity metric is available. The first case covers the best-known video and audio feature amounts, such as histograms and power spectra.

【００９７】第１の場合においては、サンプル数ｋは予
め決められており、映像音声処理装置は、“L. Kaufman
and P.J. Rousseeuw, Finding Groups in Data:An Int
roduction to Cluster Analysis, John-Wiley and son
s, 1990”に記載されてよく知られているｋ平均値クラ
スタリング法(k-means-clustering method)を用いて、
セグメント全体についての特徴量をｋ個の異なるグルー
プに自動的に分割する。そして、映像音声処理装置は、
サンプル値として、ｋ個の各グループから、グループの
重心値（centroid）またはこの重心値に近いサンプルを
選択する。映像音声処理装置におけるこの処理の複雑度
は、サンプル数に関して単に直線的に増加するに留ま
る。In the first case, the number of samples k is fixed in advance, and the video/audio processing apparatus uses the well-known k-means clustering method described in "L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John-Wiley and sons, 1990" to automatically divide the feature amounts of the whole segment into k distinct groups. From each of the k groups, the apparatus then selects as a sample value the centroid of the group or a sample close to the centroid. The complexity of this processing increases only linearly with the number of samples.
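
The first case can be illustrated with a small k-means sketch that returns, for each group, the frame whose feature vector lies closest to the group centroid. This is a plain Lloyd-style iteration written for illustration only; the evenly spaced initialization, the fixed iteration count, the squared Euclidean distance, and the function names are assumptions, and the sketch expects 1 ≤ k ≤ number of vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_samples(vectors, k, iterations=20):
    """Cluster per-frame feature vectors into k groups and return, for each
    non-empty group, the index of the frame closest to the group centroid."""
    # Assumed initialization: k frames spread evenly over the segment.
    step = len(vectors) / k
    centroids = [list(vectors[int(i * step)]) for i in range(k)]
    dim = len(vectors[0])
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every frame to its nearest centroid.
        groups = [[] for _ in range(k)]
        for idx, vec in enumerate(vectors):
            nearest = min(range(k), key=lambda c: squared_distance(vec, centroids[c]))
            groups[nearest].append(idx)
        # Update step: move each centroid to the mean of its group.
        for c, members in enumerate(groups):
            if members:  # an empty group keeps its previous centroid
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    samples = []
    for c, members in enumerate(groups):
        if members:
            samples.append(
                min(members, key=lambda i: squared_distance(vectors[i], centroids[c]))
            )
    return samples
```

Returning frame indices rather than centroids corresponds to the option of choosing "a sample close to the centroid", which keeps every sampled feature amount tied to an actual frame.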

【００９８】一方、第２の場合においては、映像音声処
理装置は、“L. Kaufman and P.J.Rousseeuw, Finding
Groups in Data:An Introduction to Cluster Analysi
s, John-Wiley and sons, 1990”に記載されているｋ−
メドイドアルゴリズム法(k-medoids algorithm method)
を用いて、ｋ個のグループを形成する。そして、映像音
声処理装置は、サンプル値として、ｋ個の各グループ毎
に、上述したグループのメドイド(medoid)を用いる。In the second case, on the other hand, the video/audio processing apparatus forms k groups using the k-medoids algorithm described in "L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John-Wiley and sons, 1990". The apparatus then uses the medoid of each of the k groups as a sample value.
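
For the second case, where only a dissimilarity function is available, a medoid-based grouping can be sketched as below. Note that this is a simple alternating assign/update heuristic, not the specific procedure of the cited book; the initialization, the iteration limit, and the function name are assumptions, and the sketch expects 1 ≤ k ≤ number of items.

```python
def kmedoid_samples(items, k, dist, iterations=20):
    """Group items using only a dissimilarity function `dist` and return
    the indices of the k medoids (alternating assign/update heuristic)."""
    n = len(items)
    step = n / k
    medoids = [int(i * step) for i in range(k)]  # assumed initialization
    for _ in range(iterations):
        # Assignment step: attach every item to its nearest medoid.
        groups = [[] for _ in range(k)]
        for idx in range(n):
            nearest = min(range(k), key=lambda c: dist(items[idx], items[medoids[c]]))
            groups[nearest].append(idx)
        # Update step: the medoid minimizes total dissimilarity within its group.
        new_medoids = []
        for c, members in enumerate(groups):
            if not members:
                new_medoids.append(medoids[c])
                continue
            best = min(
                members,
                key=lambda m: sum(dist(items[m], items[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids
```

Because a medoid is always one of the original items, no averaging of feature amounts is needed, which is why this approach works when the feature amount cannot be expressed as a vector.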

【００９９】なお、映像音声処理装置においては、抽出
された動的特徴を表す特徴量についての非類似性測定基
準を構成する方法は、その基礎となる静的な特徴量の非
類似性測定基準に基づくが、これについては後述する。In the video / audio processing apparatus, the method of constructing the dissimilarity metric for the feature quantity representing the extracted dynamic feature is based on the dissimilarity metric of the static feature quantity on which the feature is based. , Which will be described later.

【０１００】このようにして、映像音声処理装置は、静
的な特徴量を複数抽出し、これら複数の静的な特徴量を
用いることで、動的特徴を表すことができる。As described above, the video / audio processing apparatus can extract a plurality of static features and use the plurality of static features to represent a dynamic feature.

[0101] As described above, the video and audio processing apparatus can extract various feature values. In general, each of these feature values alone is often insufficient to represent the features of a segment. The apparatus can therefore combine them to select a set of feature values that complement each other. For example, by combining the color histogram and the video correlation described above, the apparatus can obtain more information than either feature value provides alone.

[0102] Next, the measurement of segment similarity using the feature values in step S3 of FIG. 5 will be described. The feature similarity measuring unit 17 of the video and audio processing apparatus measures the similarity of segments using a dissimilarity metric, which is a function that computes a real value measuring how dissimilar two feature values are. A small value of the dissimilarity metric indicates that the two feature values are similar, and a large value indicates that they are dissimilar. Here, the function that calculates the dissimilarity of two segments S_1 and S_2 with respect to a feature value F is defined as the dissimilarity metric d_F(S_1, S_2). This function must satisfy the relationship given by the following equation (2).

【数２】 (Equation 2)

[0103] Some dissimilarity metrics are applicable only to specific feature values. In general, however, as described in G. Ahanger and T.D.C. Little, "A survey of technologies for parsing and indexing digital video," J. of Visual Communication and Image Representation 7:28-4, 1996, and in L. Kaufman and P.J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis," John Wiley and Sons, 1990, many dissimilarity metrics are applicable to measuring the similarity of feature values represented as points in an n-dimensional space.

[0104] Specific examples are the Euclidean distance, the inner product, and the L1 distance. Because the L1 distance in particular works effectively for various feature values, including histograms and video correlation, the video and audio processing apparatus adopts the L1 distance. When A and B are two n-dimensional vectors, the L1 distance d_L1(A, B) between A and B is given by the following equation (3).

(Equation 3)
$$d_{L1}(A, B) = \sum_{i=1}^{n} \lvert A_i - B_i \rvert \qquad (3)$$
Here, the subscript i of A and B denotes the i-th element of the n-dimensional vectors A and B, respectively.
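The L1 distance is the sum of the absolute element-wise differences of two vectors. A minimal Python sketch follows, applied to two normalized four-bin histograms; the histogram values and the name `l1_distance` are illustrative assumptions.

```python
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two n-dimensional feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical normalized color histograms of two segments.
hist_a = [0.5, 0.25, 0.25, 0.0]
hist_b = [0.25, 0.25, 0.25, 0.25]
print(l1_distance(hist_a, hist_b))  # 0.5
```

A value of 0 means the two histograms are identical; larger values mean the segments are less alike under this feature.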

[0105] As described above, the video and audio processing apparatus extracts static feature values at various points in a segment as the feature value representing a dynamic feature. To determine the similarity between two extracted dynamic feature values, the apparatus uses the dissimilarity metric between the underlying static feature values. In many cases, the dissimilarity of dynamic feature values is best determined from the dissimilarity value of the most similar pair of static feature values, one selected from each dynamic feature value. In this case, the dissimilarity metric between two extracted dynamic feature values SF_1 and SF_2 is defined by the following equation (4).

(Equation 4)
$$d_S(SF_1, SF_2) = \min_{F_1 \in SF_1,\; F_2 \in SF_2} d_F(F_1, F_2) \qquad (4)$$

[0106] Here, the function d_F(F_1, F_2) in equation (4) denotes the dissimilarity metric for the underlying static feature value F. In some cases, the maximum or the average of the feature dissimilarities may be used instead of the minimum.
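The comparison of two dynamic feature values can be sketched as follows, treating each dynamic feature as a list of static feature vectors sampled from a segment. The sketch assumes the L1 distance as the underlying static metric; the parameter name `aggregate` and the sample data are hypothetical.

```python
from statistics import mean
from typing import Callable, Iterable, List, Sequence

Vector = Sequence[float]


def l1_distance(a: Vector, b: Vector) -> float:
    """Assumed underlying static metric d_F."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dynamic_dissimilarity(
    sf1: List[Vector],
    sf2: List[Vector],
    aggregate: Callable[[Iterable[float]], float] = min,
) -> float:
    """Aggregate the static dissimilarity over every pair (F1 in SF1, F2 in SF2).

    `min` uses the most similar pair; `max` or `mean` are the alternatives
    the text allows.
    """
    pairwise = [l1_distance(f1, f2) for f1 in sf1 for f2 in sf2]
    return aggregate(pairwise)


# Hypothetical static features sampled at two points in each segment.
sf1 = [[0.0, 1.0], [1.0, 0.0]]
sf2 = [[1.0, 0.0], [1.0, 1.0]]
print(dynamic_dissimilarity(sf1, sf2))        # 0.0 (one pair is identical)
print(dynamic_dissimilarity(sf1, sf2, max))   # 2.0
print(dynamic_dissimilarity(sf1, sf2, mean))  # 1.0
```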

[0107] In determining the similarity of segments, a single feature value is often insufficient, and the video and audio processing apparatus frequently needs to combine information from many feature values of the same segment. As one method of doing so, the apparatus calculates the dissimilarity based on the various feature values as a weighted combination of the individual dissimilarities. That is, when k feature values F_1, F_2, ..., F_k exist, the apparatus uses the dissimilarity metric d_F(S_1, S_2) for the combined feature values shown in the following equation (5).

(Equation 5)
$$d_F(S_1, S_2) = \sum_{i=1}^{k} w_i \, d_{F_i}(S_1, S_2) \qquad (5)$$

[0108] Here, {w_i} are weighting coefficients satisfying Σ_i w_i = 1.
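A weighted combination of per-feature dissimilarities can be sketched in a few lines of Python. The sketch assumes the per-feature dissimilarity values have already been computed; the function name and the example numbers are illustrative only.

```python
from typing import Sequence


def combined_dissimilarity(
    per_feature: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted sum of per-feature dissimilarities; the weights must sum to 1."""
    if len(per_feature) != len(weights):
        raise ValueError("one weight is required per feature")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * d for w, d in zip(weights, per_feature))


# Hypothetical dissimilarities for two features (e.g. color histogram and
# video correlation), weighted equally.
print(combined_dissimilarity([0.5, 0.25], [0.5, 0.5]))  # 0.375
```

Because the weights sum to 1, the combined value stays within the range spanned by the individual dissimilarities.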

[0109] As described above, the video and audio processing apparatus can calculate the dissimilarity metric using the feature values extracted in step S2 of FIG. 5 and thereby measure the similarity between segments.

[0110] Next, the scene segmentation in step S4 of FIG. 5 will be described. Using the dissimilarity metric and the extracted feature values, the video and audio processing apparatus detects changes in the distribution pattern of the most similar neighboring segments of each segment, determines whether the segment is at a scene break, and outputs the result.

[0111] To detect scenes, the video and audio processing apparatus performs the following four processes.

[0112] In process (1), taking each segment in turn as the reference, a fixed number of the most similar segments within a fixed time frame are detected.

[0113] In process (2), after process (1), the ratio between the numbers of similar segments existing in the past and in the future relative to the reference segment is calculated (in practice, for example, by subtracting the number of similar segments in the past from the number in the future), and the result is used as the boundary measurement value.

[0114] In process (3), the change over time of the boundary measurement values obtained in process (2) is examined as the reference moves from segment to segment, and the segment position is detected at which several consecutive values with a high past ratio are followed by several consecutive values with a high future ratio.

[0115] In process (4), when the pattern of process (3) is found, the absolute values of the boundary measurement values are summed, and this sum is called the scene intensity value. When the scene intensity value exceeds a predetermined threshold, the position is determined to be a scene boundary.

[0116] These processes will be described concretely with reference to FIG. 6. In process (1), as shown for example in FIG. 6(A), a time frame of an arbitrary number k of segments in the past and k segments in the future is set around each segment (here, k = 5), and N similar segments are detected within this time frame (here, N = 4). Time advances toward the future as the number denoting each segment increases. The darkly shaded segment 7 in the middle of the figure is the reference segment at a given time, and the segments similar to it are the more lightly shaded segments 4, 6, 9, and 10. Four similar segments are extracted here, two in the past and two in the future.

[0117] In process (2), the boundary measurement value is calculated either by dividing (the number in the past) by (the number in the future) or by subtracting (the number in the past) from (the number in the future). Here, the latter method is used. Each boundary measurement value is denoted F_i, where i is the position (number) of the segment. Calculated by the latter method, the boundary measurement value F_6 of FIG. 6(A) is 0.
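The subtraction form of the boundary measurement value can be checked against the segment numbers given in the text for FIG. 6(A) through FIG. 6(D). The following Python sketch assumes the similar segments have already been found; the function name `boundary_measure` is illustrative.

```python
from typing import Dict, List


def boundary_measure(reference: int, similar: List[int]) -> int:
    """(number of similar segments in the future) - (number in the past)."""
    future = sum(1 for s in similar if s > reference)
    past = sum(1 for s in similar if s < reference)
    return future - past


# Reference segment -> similar segments, as listed for FIG. 6(A) to 6(D).
fig6: Dict[int, List[int]] = {
    7: [4, 6, 9, 10],
    10: [5, 8, 9, 11],
    11: [6, 7, 9, 10],
    12: [13, 14, 15, 16],
}
values = {ref: boundary_measure(ref, sim) for ref, sim in fig6.items()}
print(values)  # {7: 0, 10: -2, 11: -4, 12: 4}
```

A negative value means the similar segments lie mostly in the past, and a positive value means they lie mostly in the future.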

[0118] In process (3), the calculation of process (2) is carried out along the time axis. FIG. 6(B) shows the state three segments after FIG. 6(A): for segment 10, three similar segments (5, 8, and 9) exist in the past and one (segment 11) exists in the future. The boundary measurement value is then F_10 = 1 − 3 = −2.

[0119] FIG. 6(C) shows the state one segment further on, just before the scene boundary: the similar segments 6, 7, 9, and 10 of segment 11 are all concentrated in the past. The boundary measurement value is then F_11 = 0 − 4 = −4.

[0120] FIG. 6(D) shows the state one segment after FIG. 6(C), immediately after crossing the boundary into a new scene, where segment 12 is the first segment of the scene. The similar segments are 13, 14, 15, and 16. The pattern has now changed so that all the similar segments exist in the future, and F_12 = 4 − 0 = 4.

[0121] Finally, FIG. 6(E) shows the case of segment 13, one segment further on. Similarly, F_13 = 3 − 1 = 2. With this method, a negative sign indicates that the proportion of similar segments is larger in the past, and a positive sign indicates that it is larger in the future. The boundary measurement value F_i then changes in a pattern such as
0 ... (−2) → (−4) → (+4) → (+2) ... (6)

[0122] The point where the value changes from (−4) to (+4) corresponds to the scene boundary. When the reference segment is in the middle of a scene, as in FIG. 6(A), the similar segments within the time frame are distributed almost evenly between the past and the future of that segment. As the reference segment approaches the scene boundary, however, the proportion existing in the past increases, as in FIG. 6(B); it reaches 100% in FIG. 6(C); and immediately after the boundary is crossed, as in FIG. 6(D), the proportion existing in the future becomes 100%. By detecting such a pattern, the change point at which the distribution swings from almost 100% in the past to almost 100% in the future can be associated with a scene break.

[0123] Even within a non-boundary region of a scene, the pattern may temporarily (for only one segment) change from a high past ratio to a high future ratio. In most cases, however, this is not a scene boundary, because many such temporary changes occur by chance. A scene boundary is judged to be likely when a pattern is detected in which several consecutive boundary measurement values with a high past ratio of similar segments are followed by several consecutive boundary measurement values with a high future ratio. Otherwise, the position is probably not a scene boundary and is not regarded as one.

[0124] In process (4), after process (3), the boundary measurement values are summed to calculate the "strength" of the scene boundary point. To measure this strength, the absolute values of the boundary measurement values are added. The degree of change in this value corresponds to the degree of visual change of the scene, and the degree of visual change of the scene corresponds to the degree of semantic change of the scene. The magnitude of this value therefore makes it possible to detect scenes according to the magnitude of their semantic change.

[0125] Here, this sum of absolute values is defined as the scene intensity value V_i, where i denotes the segment number. For example, the sum of the absolute values of four boundary measurement values is used: F_{i-2}, F_{i-1}, F_i, and F_{i+1}, that is, the values of the two preceding segments, of the segment itself, and of the one following segment.
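The four-value form of the scene intensity value can be illustrated with the boundary measurement values of pattern (6). In this Python sketch the list index stands in for the segment number, and the function name `scene_intensity` is an assumption made for illustration.

```python
from typing import Sequence


def scene_intensity(f: Sequence[int], i: int) -> int:
    """V_i = |F_{i-2}| + |F_{i-1}| + |F_i| + |F_{i+1}|."""
    if i - 2 < 0 or i + 1 >= len(f):
        raise IndexError("the four-value window does not fit at this position")
    return sum(abs(v) for v in f[i - 2 : i + 2])


# Boundary measurement values of pattern (6): 0, -2, -4, +4, +2.
pattern = [0, -2, -4, 4, 2]
print(scene_intensity(pattern, 3))  # 12, at the position where -4 changes to +4
```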

[0126] In theory, the boundary measurement value at a scene boundary changes, as shown above with F_{i-1} → F_i taking the values −4 → +4, from a state in which 100% of the similar segments exist in the past to one in which 100% exist in the future.

[0127] Thus, a large change occurs within a single segment interval at a scene boundary. Moreover, a pattern change in which the absolute values of the boundary measurement values remain large over four or more segments, as in pattern (6), is unlikely except near a scene boundary. Given this property, the desired scene detection can be performed by treating only positions whose scene intensity value V_i is at or above a certain magnitude as actual scene boundaries.

[0128] FIG. 7 is a graph of the result obtained with about 30 minutes of video data recorded from an actual music program. The vertical axis shows the scene intensity value and the horizontal axis shows the segments. The segments marked with dark bars are the actual scene boundaries (here, the first segment of each scene). In this result, when a scene intensity value of 12 or more is taken to indicate a scene boundary, the detected boundaries match the actual scenes with a probability of 6/7.

[0129] The process by which the graph of FIG. 7 is generated will be described with reference to FIG. 8. The processing described here is performed by the scene detection unit 16 of the video and audio processing apparatus, and the following steps are executed each time a segment is generated.

[0130] In step S11, for each segment, the feature similarity measuring unit 17 is used to detect the N nearest similar segments within a range of ±k segments centered on that segment, and the number existing in the past and the number existing in the future are obtained.

[0131] In step S12, the boundary measurement value F_i of each segment is calculated as the number of similar segments existing in the future minus the number existing in the past, among the N similar segments obtained in step S11, and is stored.
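Steps S11 and S12 can be sketched together in Python. The sketch assumes that each segment is represented by a single feature vector and that the L1 distance serves as the dissimilarity metric; the function name, the one-dimensional toy features, and the handling of segments near the ends of the data are illustrative assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def l1_distance(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def boundary_measure(features: List[Vector], i: int, k: int, n_similar: int) -> int:
    """Step S11: find the n_similar nearest segments within +/-k of segment i.
    Step S12: return (count in the future) - (count in the past)."""
    lo = max(0, i - k)
    hi = min(len(features), i + k + 1)
    candidates = [j for j in range(lo, hi) if j != i]
    # Smallest dissimilarity first; sorted() is stable, so ties keep time order.
    nearest = sorted(
        candidates, key=lambda j: l1_distance(features[i], features[j])
    )[:n_similar]
    future = sum(1 for j in nearest if j > i)
    past = sum(1 for j in nearest if j < i)
    return future - past


# Toy data: six segments of one scene followed by six of a very different scene.
features = [[float(v)] for v in (0, 1, 2, 3, 4, 5, 100, 101, 102, 103, 104, 105)]
series = [boundary_measure(features, i, k=5, n_similar=4) for i in range(3, 8)]
print(series)  # [0, -2, -4, 4, 2]
```

With this toy data the values around the scene change follow the same 0, −2, −4, +4, +2 progression as pattern (6). Only a window of 2k + 1 segments is examined for each reference segment, so the work per segment is bounded.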

[0132] In step S13, places likely to be scene boundaries are identified from the change in the pattern of the boundary measurement values F_{i-n}, ..., F_i, ..., F_{i+n} of 2n segments. Here, n is the number of boundary measurement values needed to observe the pattern change between the past ratio and the future ratio around segment i.

[0133] Here, three conditions on a change pattern that suggests a scene boundary are defined as follows.
(1) The boundary measurement values F_{i-n} through F_{i+n} are not uniformly 0.
(2) The boundary measurement values F_{i-n} through F_{i-1} are 0 or less.
(3) The boundary measurement values F_i through F_{i+n} are 0 or more.

[0134] It is then determined whether all three conditions (1) through (3) above are satisfied. If so, a scene boundary is judged to be likely, and the process proceeds to step S14. Otherwise, the process proceeds to step S16.
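The three-condition test of step S13 can be sketched as a single predicate over the stored boundary measurement values. In this Python sketch the list index stands in for the segment number; returning `False` when the window does not fit inside the data is an assumption, since the text does not say how the ends of the data are handled.

```python
from typing import Sequence


def suggests_boundary(f: Sequence[int], i: int, n: int) -> bool:
    """True when F_{i-n}..F_{i+n} satisfy all three conditions."""
    if i - n < 0 or i + n >= len(f):
        return False  # assumption: an incomplete window never qualifies
    window = f[i - n : i + n + 1]
    before = f[i - n : i]      # F_{i-n} .. F_{i-1}
    after = f[i : i + n + 1]   # F_i .. F_{i+n}
    not_all_zero = any(v != 0 for v in window)
    past_heavy = all(v <= 0 for v in before)
    future_heavy = all(v >= 0 for v in after)
    return not_all_zero and past_heavy and future_heavy


f = [0, -2, -4, 4, 2, 0]
print(suggests_boundary(f, 3, 2))  # True: -2, -4 followed by 4, 2, 0
print(suggests_boundary(f, 2, 2))  # False: F_2 = -4 is not 0 or more
```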

[0135] In step S14, the boundary measurement values from step S13 are applied to the following equation to calculate the scene intensity value V_i from the boundary measurement values F_{i-n}, ..., F_i, ..., F_{i+n}.
$$V_i = |F_{i-n}| + \cdots + |F_{i-1}| + |F_i| + \cdots + |F_{i+n}|$$

[0136] When a threshold condition on the intensity value has been set and a scene intensity value satisfying it appears, that value is judged to have the required strength of visual change, the position is taken to be one of the scene boundaries of the video data being processed, and the position is output. When no condition on the intensity value is required, the intensity value of each segment is output to the video data recording unit 18 as additional information data and recorded.

[0137] Scene boundaries are detected by repeating the above processing. Each scene is formed by the group of segments contained between one boundary and the next.

[0138] As described above, the video and audio processing apparatus to which the present invention is applied extracts scene structure. Experiments have already verified that the series of processes of the apparatus described above can extract the scene structure of video data with a wide variety of content, such as television dramas and movies.

[0139] The number of scene boundaries detected can be adjusted by arbitrarily changing the scene intensity value used as the threshold. Adjusting this value therefore makes it possible to detect scene boundaries in a way better suited to various kinds of content.

[0140] Furthermore, when the scenes of video data of a certain length are displayed as a list, the list can be made easier to read by limiting the number of scenes detected. This raises the question of which scenes should be included in the list so that the video data is easy to grasp. The scenes to use in the list can be decided according to the order of importance of the scenes obtained. The present invention provides the scene intensity value as a measure of the importance of each scene obtained, and the number of scenes can be changed by changing that measure (changing the scene intensity threshold), so that a convenient viewing presentation can be given according to the user's interest.

[0141] Moreover, changing the number of scenes does not require the scene detection processing to be performed again; the stored time series of intensity values can be processed simply by changing the scene intensity threshold.
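Re-thresholding a stored intensity series needs only one pass over the recorded values. In the following Python sketch, the stored series, the use of `None` for segments that did not pass the pattern test, and the function name `boundaries_at` are all hypothetical.

```python
from typing import List, Optional


def boundaries_at(intensity: List[Optional[int]], threshold: int) -> List[int]:
    """Segment positions whose stored scene intensity is at or above the threshold."""
    return [
        i for i, v in enumerate(intensity) if v is not None and v >= threshold
    ]


# Hypothetical stored series; None marks segments with no candidate boundary.
stored = [None, 3, 12, None, 5, 14, None, 8]
print(boundaries_at(stored, 12))  # [2, 5]        coarse view, few scenes
print(boundaries_at(stored, 5))   # [2, 4, 5, 7]  finer view, more scenes
```

Lowering the threshold yields more boundaries from the same stored data, without repeating the similarity search.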

[0142] As described above, the present invention solves all of the above-mentioned problems of the prior art.

[0143] First, with the video and audio processing apparatus, the user does not need to know the semantic structure of the video data in advance.

[0144] Furthermore, the processing that the video and audio processing apparatus performs on each segment includes the following items.
(1) Extracting feature values.
(2) Measuring the dissimilarity between pairs of segments within a time region containing a fixed number of segments.
(3) Extracting a fixed number of the most similar segments using the dissimilarity measurement results.
(4) Calculating the boundary measurement value from the existence ratio of the similar segments.
(5) Obtaining the intensity value of a scene boundary point using the boundary measurement values.

[0145] All of these processes impose a light computational load. The apparatus can therefore also be applied to home electronic devices such as set-top boxes, digital video recorders, and home servers.

[0146] In addition, by detecting scenes, the video and audio processing apparatus can provide a basis for new high-level access for video browsing. By visualizing the content of video data using a high-level video structure such as scenes rather than segments, the apparatus enables easy content-based access to the video data. For example, when the apparatus displays the scenes, the user can quickly grasp the gist of a program and quickly find the parts of interest.

[0147] Furthermore, scene detection by the video and audio processing apparatus provides a basis for automatically creating an outline or summary of video data. In general, creating a coherent summary requires decomposing the video data into meaningful components that can be recombined, rather than combining random fragments of it. The scenes detected by the apparatus serve as the basis for creating such a summary.

[0148] The present invention is not limited to the embodiment described above. For example, the feature values used to measure the similarity between segments may of course be other than those described above, and other modifications can naturally be made as appropriate without departing from the spirit of the present invention.

[0149] Furthermore, according to the present invention, scenes that are important change points in the content structure can be obtained by arbitrarily changing the scene intensity value, because the intensity value corresponds to the degree of change in the content. That is, when browsing a video, the number of detected scenes can be controlled by adjusting the scene intensity threshold, and the number of content items displayed can be increased or decreased according to the purpose.

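[0150] In other words, the so-called browsing granularity of the content can be controlled freely according to the purpose. For example, when watching a one-hour video, the intensity value is first set high to show a short summary consisting of the scenes that are important to the content. If the viewer then becomes more interested and wants to look in more detail, lowering the intensity value displays a summary composed of finer scenes. Moreover, with the method of the present invention, unlike the prior art, detection need not be performed again each time the intensity value is adjusted; it is sufficient simply to process the stored time series of intensity values.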
ranularity) can be freely controlled according to the purpose. For example, when watching an hour of video, first set the intensity value high to show a short summary of scenes that are important to the content. Then, if interest increases and one wants to look in detail, lowering the intensity value can display a summary made up of more detailed scenes. Moreover, using the method of the present invention, unlike the prior art, there is no need to perform detection again each time the intensity value is adjusted, and it is sufficient to simply process the stored intensity value time series.

[0151] Implementation in home devices such as set-top boxes and digital video recorders has the following effects.

[0152] The first effect is that the number of segments examined can be fixed. Because the scene detection of the present invention is achieved by examining local changes in the similar segments of each segment, the amount of memory required for processing can be fixed, so that it can be implemented even in home devices with little memory, such as set-top boxes and digital recorders.

[0153] The second effect is that the time required to process each segment can be made constant. This suits home devices such as set-top boxes and digital video recorders, which must always complete a given process within a given time.

[0154] The third effect is that sequential processing is possible, in which a new segment is processed each time one process finishes. As a result, when a home device such as a set-top box or digital video recorder finishes recording a video signal or the like, the processing can be finished almost at the same time as the recording ends. Also, even if recording stops for some reason, the records made up to that point can be retained.

[0155] The series of processes described above can be executed by hardware, but can also be executed by software. When the series of processes is executed by software, the program constituting the software is installed from a recording medium into a computer built into dedicated hardware or into, for example, a general-purpose personal computer that can execute various functions when various programs are installed.

[0156] As shown in FIG. 3, this recording medium is constituted not only by package media on which the program is recorded and which are distributed separately from the computer to provide the program to the user, such as the magnetic disk 22 (including a floppy disk), the optical disk 23 (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), the magneto-optical disk 24 (including an MD (Mini Disc)), or the semiconductor memory 25, but also by a ROM, a hard disk, or the like on which the program is recorded and which is provided to the user already built into the computer.

[0157] In this specification, the steps describing the program recorded on the recording medium include not only processing performed in time series in the described order but also processing executed in parallel or individually without necessarily being processed in time series.

[0158] In this specification, a system refers to an entire apparatus composed of a plurality of devices.

【０１５９】[0159]

[Effects of the Invention] As described above, according to the AV signal processing apparatus and method and the program of the present invention, a measurement criterion for measuring the similarity of feature values between a reference segment and other segments is calculated, the similarity between the reference segment and the other segments is measured using that criterion, and the measured similarity is used to calculate a measurement value indicating the possibility that the reference segment is a scene boundary. Scene boundaries can therefore be detected.
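The chain summarized above (dissimilarity criterion, similarity measurement, boundary measurement value, decision) can be illustrated with a short Python sketch. It is only one possible reading under stated assumptions and should not be taken as the patented method: the L1 distance, the fixed neighbourhood `window`, the choice of the `n_similar` nearest segments, the normalisation to the range [-1, 1], and the decision rule in `detect_boundaries` are all assumptions, and every function name is hypothetical.

```python
from typing import Callable, List, Sequence

Feature = Sequence[float]


def l1_distance(a: Feature, b: Feature) -> float:
    """Dissimilarity between two feature vectors (assumed here: L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def boundary_measure(
    features: List[Feature],
    ref: int,
    window: int,
    n_similar: int,
    dissim: Callable[[Feature, Feature], float] = l1_distance,
) -> float:
    """Return a value in [-1, 1] for the reference segment `features[ref]`.

    Negative: its most similar neighbours lie mostly in the past.
    Positive: its most similar neighbours lie mostly in the future.
    """
    lo = max(0, ref - window)
    hi = min(len(features), ref + window + 1)
    neighbours = [i for i in range(lo, hi) if i != ref]
    # Keep the n_similar neighbours that are least dissimilar to the reference.
    neighbours.sort(key=lambda i: dissim(features[ref], features[i]))
    nearest = neighbours[:n_similar]
    if not nearest:
        return 0.0
    future = sum(1 for i in nearest if i > ref)
    past = len(nearest) - future
    return (future - past) / len(nearest)


def detect_boundaries(
    features: List[Feature],
    window: int = 3,
    n_similar: int = 3,
    threshold: float = 0.5,
) -> List[int]:
    """Return indices i such that a scene boundary is assumed between i-1 and i.

    Assumed rule: the measure switches from past-dominant (<= -threshold)
    to future-dominant (>= threshold) between consecutive segments.
    """
    m = [boundary_measure(features, i, window, n_similar)
         for i in range(len(features))]
    return [i for i in range(1, len(m))
            if m[i - 1] <= -threshold and m[i] >= threshold]
```

For example, with eight one-dimensional features in which the first four lie near 0 and the last four lie near 5, the last segment of the first group sees its similar segments only in the past and the first segment of the second group sees them only in the future, so the sketch reports a single boundary at index 4.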

[Brief description of the drawings]

FIG. 1 is a diagram showing a hierarchical model of video data.

FIG. 2 is a diagram for explaining the boundary regions and non-boundary regions of a scene.

FIG. 3 is a block diagram showing a configuration example of a video/audio processing apparatus according to an embodiment of the present invention.

FIG. 4 is a diagram for explaining the boundary region of a scene.

FIG. 5 is a flowchart illustrating the operation of the video/audio processing apparatus.

FIG. 6 is a diagram showing an example of distribution patterns of similar segments.

FIG. 7 is a diagram showing a scene detection result.

FIG. 8 is a flowchart illustrating the processing of the scene detection unit 16.

[Explanation of symbols]

11 video division unit, 12 video segment memory, 13 video feature extraction unit, 14 audio feature extraction unit, 15 segment feature memory, 16 scene detection unit, 17 feature similarity measurement unit, 18 video data recording unit, 19 video display unit, 20 control unit, 21 driver, 22 magnetic disk, 23 optical disk, 24 magneto-optical disk, 25 semiconductor memory

Continued from the front page. F-terms (reference): 5C053 FA14 GB09 HA29 LA06 LA11; 5D015 FF06; 5L096 AA02 CA04 FA23 FA37 HA01 JA03 JA11

Claims

1. An AV signal processing apparatus that detects and analyzes patterns reflecting the semantic structure of the content of a supplied AV signal and detects scenes, which are meaningful delimiters, the apparatus comprising: feature value extraction means for extracting a feature value of a segment formed by a series of frames constituting the AV signal; calculation means for calculating a measurement criterion for measuring the similarity of the feature value between a reference segment and another segment; similarity measurement means for measuring, using the measurement criterion, the similarity between the reference segment and the other segment; measurement value calculation means for calculating, using the similarity measured by the similarity measurement means, a measurement value indicating the possibility that the reference segment is a boundary of the scene; and boundary determination means for analyzing a change in the temporal pattern of the measurement value calculated by the measurement value calculation means and determining, based on the analysis result, whether the reference segment is a boundary of the scene.

2. The AV signal processing device according to claim 1, wherein the AV signal includes at least one of a video signal and an audio signal.

3. The AV signal processing apparatus according to claim 1, further comprising an intensity value calculating means for calculating an intensity value indicating a degree of change of said measured value corresponding to said reference segment.

4. The AV signal processing apparatus according to claim 1, wherein the measurement value calculation means obtains similar segments within a predetermined time region relative to the reference segment, analyzes the temporal distribution of the similar segments, and calculates the measurement value by expressing numerically the ratio in which the similar segments are present in the past and in the future.

5. The AV signal processing apparatus according to claim 1, wherein the boundary determination means determines whether the reference segment is a boundary of the scene based on a sum of the absolute values of the measurement values.

6. The AV signal processing apparatus according to claim 2, further comprising video segment generating means for detecting, when the AV signal includes a video signal, a shot serving as the basic unit of a video segment and generating the video segment.

7. The AV signal processing apparatus according to claim 2, further comprising audio segment generating means for generating, when the AV signal includes an audio signal, an audio segment using at least one of a feature value of the audio signal and a silent section.

8. The AV signal processing apparatus according to claim 2, wherein the feature amount of the video signal includes at least a color histogram.

9. The AV signal processing device according to claim 2, wherein the feature amount of the audio signal includes at least one of a volume and a spectrum.

10. The AV signal processing apparatus according to claim 1, wherein the boundary determination means determines whether the reference segment is a boundary of the scene by comparing the measurement value with a preset threshold value.

11. An AV signal processing method for an AV signal processing apparatus that detects and analyzes patterns reflecting the semantic structure of the content of a supplied AV signal and detects scenes, which are meaningful delimiters, the method comprising: a feature value extraction step of extracting a feature value of a segment formed by a series of frames constituting the AV signal; a calculation step of calculating a measurement criterion for measuring the similarity of the feature value between a reference segment and another segment; a similarity measurement step of measuring, using the measurement criterion, the similarity between the reference segment and the other segment; a measurement value calculation step of calculating, using the similarity measured in the similarity measurement step, a measurement value indicating the possibility that the reference segment is a boundary of the scene; and a boundary determination step of analyzing a change in the temporal pattern of the measurement value calculated in the measurement value calculation step and determining, based on the analysis result, whether the reference segment is a boundary of the scene.

12. A program for causing a computer that detects and analyzes patterns reflecting the semantic structure of the content of a supplied AV signal and detects scenes, which are meaningful delimiters, to execute: a feature value extraction step of extracting a feature value of a segment formed by a series of frames constituting the AV signal; a calculation step of calculating a measurement criterion for measuring the similarity of the feature value between a reference segment and another segment; a similarity measurement step of measuring, using the measurement criterion, the similarity between the reference segment and the other segment; a measurement value calculation step of calculating, using the similarity measured in the similarity measurement step, a measurement value indicating the possibility that the reference segment is a boundary of the scene; and a boundary determination step of analyzing a change in the temporal pattern of the measurement value calculated in the measurement value calculation step and determining, based on the analysis result, whether the reference segment is a boundary of the scene.

13. A recording medium on which a computer-readable program is recorded, the program being an AV signal processing program for detecting and analyzing patterns reflecting the semantic structure of the content of a supplied AV signal and detecting scenes, which are meaningful delimiters, and causing a computer to execute: a feature value extraction step of extracting a feature value of a segment formed by a series of frames constituting the AV signal; a calculation step of calculating a measurement criterion for measuring the similarity of the feature value between a reference segment and another segment; a similarity measurement step of measuring, using the measurement criterion, the similarity between the reference segment and the other segment; a measurement value calculation step of calculating, using the similarity measured in the similarity measurement step, a measurement value indicating the possibility that the reference segment is a boundary of the scene; and a boundary determination step of analyzing a change in the temporal pattern of the measurement value calculated in the measurement value calculation step and determining, based on the analysis result, whether the reference segment is a boundary of the scene.