JPH10257436A

JPH10257436A - Automatic hierarchical structuring method for moving image and browsing method using the same

Info

Publication number: JPH10257436A
Application number: JP5534097A
Authority: JP
Inventors: Atsushi Matsushita; 温松下; Kenichi Okada; 謙一岡田
Original assignee: Individual
Current assignee: Individual
Priority date: 1997-03-10
Filing date: 1997-03-10
Publication date: 1998-09-25

Abstract

PROBLEM TO BE SOLVED: To extract scenes quickly and accurately by taking a hybrid-encoded moving image, dividing it into shots to form a hierarchical structure, and extracting scenes by integrating the shots according to their similarity. SOLUTION: The MPEG1 moving image is not fully decoded; only the minimum necessary information is decoded, which enables high-speed processing. The processing load is reduced by exploiting characteristics of the MPEG1 encoding algorithm, such as inter-frame prediction and the availability of a simplified image from decoding only the DC components. Only I pictures are decoded, and a reduced image of the source frame is obtained by simplified decoding. Decoding B and P pictures requires decoding information that is not used directly, whereas an I picture is encoded entirely within its own frame and needs no such information. Moreover, because the DC component of an intra macroblock in an I picture can be decoded without the computationally expensive IDCT, a DC image can be obtained at high speed.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a method for automatically structuring an encoded moving image into a hierarchy, and to a browsing method using that method, with the aim of obtaining a video browser based on the moving image and its analysis data.

[0002]

2. Description of the Related Art

At present, the use of moving image information has not gone beyond simple video playback. A moving image is handled frame by frame: if codes are attached at the time of shooting, frames can be extracted by those codes, and otherwise a desired frame can be extracted only when some specific relationship is known, for example by detecting or playing back the frame using playback time as the key.

[0003]

PROBLEMS TO BE SOLVED BY THE INVENTION

However, extracting a desired frame from given moving image information without any such code is extremely difficult, and becomes impossible under time constraints. For example, one frame generally lasts 1/30 second, so there are 1,800 frames per minute and 108,000 frames per hour.

[0004] Conventional moving image information therefore has the problem that an arbitrary frame cannot be extracted from it in a short time.

[0005]

MEANS FOR SOLVING THE PROBLEMS

The present invention solves this problem by dividing an encoded moving image into shots, integrating the shots using the similarity of each shot to extract scenes, thereby structuring the moving image automatically into a hierarchy, and using this data to build a browsing tool for the moving image.

[0006] That is, the present invention is an automatic hierarchical structuring method for a moving image, characterized in that an encoded moving image is divided into shots and the shots are then integrated, using the similarity of each divided shot, to extract scenes; and characterized in that the moving image is encoded by MPEG. It is further characterized in that, when shots are detected from the encoded moving image, the characteristics of MPEG are used to process at high speed, and in that representative frames are extracted when calculating the similarity between shots. It is also characterized in that the similarity between shots is obtained by fuzzy inference, and in that scene extraction is performed using a defined degree of connection between shots. A further invention is a moving image browsing method characterized in that an encoded moving image is divided into shots, the shots are integrated using the similarity of each divided shot to extract scenes so that the moving image is automatically structured into a hierarchy, and the hierarchically structured data is used to make it easy to grasp the content of the whole moving image and to find a desired scene or shot.

[0007] The encoding referred to above uses the MPEG1 compression algorithm. The official name of MPEG1 is "Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s".

[0008] The hybrid coding is performed by DCT and quantization, motion-compensated inter-frame prediction, and entropy coding. Because each of these methods is well known, a detailed description is omitted.

[0009] Next, the MPEG1 encoding and decoding system is described with reference to FIG. 1. The video input is pre-processed and enters the video encoder, passes through system multiplexing, and is written to the storage medium. It is then system-demultiplexed, enters the video decoder and, after processing, becomes the video output.

[0010] In the MPEG1 video encoder, an input image is processed into data as shown in FIG. 2. In the MPEG1 video decoder, data in the input buffer is processed into the display buffer as shown in FIG. 3.

[0011] As described above, MPEG1 is intended for storage media such as CD-ROM. Storage media require trick modes such as fast forward, rewind, playback from a midpoint, and reverse playback. To realize these trick modes, MPEG1 adopts a Group of Pictures (hereinafter GOP) structure.

[0012] In MPEG1, encoded picture data is generated from the preceding and following pictures, so a single picture does not carry complete information by itself. For this reason, random access is made possible in units of a GOP, which groups several pictures together. Specifically, each GOP always contains at least one picture (an I picture) that is self-contained and uses no information from preceding or following pictures, and the other pictures in the GOP can be reproduced on the basis of this picture. One GOP usually groups about 15 pictures (FIG. 4).

[0013] MPEG1 performs both forward prediction from a past reproduced picture and backward prediction from a future reproduced picture. This is called bidirectional prediction.

[0014] To realize bidirectional prediction, MPEG1 defines three picture types: I pictures, P pictures, and B pictures.

[0015] In addition to these, a D picture (DC-coded picture) is defined. It is encoded using only information within the frame and consists only of the DC components of the DCT coefficients; it never coexists in the same sequence with the other three picture types.

[0016] The introduction of bidirectionally predicted B pictures in MPEG1 greatly improves prediction efficiency and helps improve image quality at high compression.

[0017] As shown in FIG. 6, the image data has a six-layer hierarchical structure: sequence, GOP, picture, slice, macroblock (MB), and block.

[0018] Sequence layer: a bit stream representing one continuous video starts with a sequence header, followed by one or more GOPs, and ends with one sequence end code. A sequence header may be placed immediately before any GOP, but within one continuous video every data element other than the quantization matrices must be the same as in the first sequence header.

[0019] This makes random access into the middle of a sequence possible.

[0020] GOP layer: contains one GOP.

[0021] Picture layer: contains one picture, which is an I picture, P picture, B picture, or D picture.

[0022] Slice layer: a slice is a series of an arbitrary number of macroblocks in raster-scan order, running from the upper left of the picture toward the lower right. Slices may neither overlap nor leave gaps between them, but slice positions may differ from picture to picture. Because a synchronization code is assigned to the head of each slice's data, synchronization can be recovered at the next slice even if a data read error occurs during decoding. In addition, because each slice can be decoded independently, slices can be processed in parallel to speed up decoding.

[0023] Macroblock layer: a macroblock consists of a luminance component of 16 pixels × 16 lines and two chrominance components of 8 pixels × 8 lines each at the corresponding spatial position in the picture. One macroblock therefore consists of four luminance blocks and two chrominance blocks. The order and arrangement of the blocks within a macroblock are shown in FIG. 6. Motion compensation and inter-frame prediction are performed in units of macroblocks.

[0024] Block layer: a block is the unit of DCT processing and consists of a luminance or chrominance component of 8 pixels × 8 lines.

[0025]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is a method for automatically structuring a moving image into a hierarchy, in which an encoded moving image is divided into shots, the shots are integrated using the similarity of each shot, and scenes are extracted.

[0026] It is also a moving image browsing method that uses the hierarchically structured data to grasp the overall content of the moving image and to make it easy to search for a desired scene, shot, or frame.

[0027] The present invention makes it extremely easy to grasp the content of a moving image and to search for a desired scene.

[0028]

EXAMPLE

An example of the present invention is described with reference to the drawings.

[0029] First, the moving image is divided into shots, which can be extracted from physical features and are relatively easy to detect. Scenes are then extracted by integrating the shots according to the similarity between them. Because a shot is difficult to handle as a piece of moving image, several representative frames are selected from each shot for this purpose (FIG. 7).

[0030] In general, a moving image contains a large amount of data, so the amount of processing is enormous. For an encoded moving image, decoding is also required, which increases the processing load further.

[0031] In the present invention, high-speed processing is made possible by decoding only the minimum necessary information rather than fully decoding the MPEG1 moving image; the processing load is reduced by exploiting characteristics of the MPEG1 encoding algorithm, such as inter-frame prediction and the acquisition of a simplified image by decoding DC components. The simplified decoding of I pictures in an MPEG moving image and the method of comparing frames are described first, followed by the details of each processing step in order.

[0032] A moving image is composed of a large number of still images (frames), so the image information of each frame is indispensable for analyzing it. However, decoding an MPEG1 moving image involves a relatively large amount of processing, which makes high-speed processing difficult.

[0033] Therefore, only the I pictures are decoded rather than all frames, and instead of full decoding, a reduced image of the original frame is obtained by simplified decoding. This simplified decoding relies on the fact that decoding the DC component of the DCT coefficients gives the average color of the original block. That is, only the DC component of each block's DCT coefficients is decoded, and an image is built in which each block is represented by the resulting average color (FIG. 8).

[0034] An image obtained in this way is called a DC image. Because each block is 8 × 8 pixels, the DC image is 1/8 the size of the original image in both height and width.
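The relationship between block averages and the size of the DC image can be illustrated with a minimal sketch. It is not the bitstream-level procedure described here, which reads the DC coefficients directly and skips the IDCT. It assumes an already decoded single-channel frame held as a list of rows, and the function name `dc_image` is hypothetical.

```python
def dc_image(frame, block=8):
    """Return the block-average ("DC") image of a 2-D list of pixel values.

    Assumption: `frame` is an already decoded single-channel image whose
    height and width are multiples of `block`. A real implementation would
    take the DC coefficients from the MPEG1 stream instead of averaging.
    """
    height, width = len(frame), len(frame[0])
    out = []
    for by in range(0, height, block):
        row = []
        for bx in range(0, width, block):
            # Sum all pixels of one block and divide by the block area.
            total = sum(frame[y][x]
                        for y in range(by, by + block)
                        for x in range(bx, bx + block))
            row.append(total / (block * block))
        out.append(row)
    return out


# A 16-line x 24-pixel frame in which every 8x8 block has a constant value.
frame = [[(x // 8) * 10 + (y // 8) for x in range(24)] for y in range(16)]
small = dc_image(frame)
print(len(small), len(small[0]))  # 2 3
```

The output has one value per 8 × 8 block, so it is 1/8 of the input in each dimension.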

[0035] Decoding B and P pictures requires decoding not only the picture itself but also information that is not used directly, such as motion vector information and the referenced pictures. An I picture, by contrast, is coded entirely within the frame, so no such information needs to be decoded. Furthermore, the DC component of an intra macroblock in an I picture can be decoded by Equation (1) below without performing the computationally expensive IDCT. The DC image can therefore be obtained very quickly.

[0036]

(Equation 1)

[0037] Here, Y_k, Cb_k' and Cr_k' are the luminance and chrominance components of the average color of each block (k and k' are block numbers), and DY_k, DCb_k' and DCr_k' are the DC components of each block.

[0038] Table 1 shows the measured processing times for an MPEG1 moving image of about 30 minutes (with a GOP of the type shown in FIG. 5) when all frames are decoded and when only the DC images of the I pictures are decoded. Decoding only the DC images takes about 1/20 of the time needed to decode all frames.

[0039]

[Table 1]

[0040] FIG. 9 shows an example of a DC image together with its original image.

[0041] Measures of similarity used to compare frames fall broadly into two groups: comparison of pixel values and comparison of color histograms. Comparison by color histogram is insensitive to camera and subject motion, which makes it convenient as a similarity measure; on the other hand, it contains no spatial information at all, so completely different images may have the same color histogram. Several attempts have been made to add spatial information to color histogram comparison, but all of them require fairly complex processing.

[0042] The present invention uses D_histarea of Equation (2) as the distance based on color histograms and D_pixsum of Equation (3) as the distance based on pixel values. Calculating the similarity from the combination of these two values yields a similarity that takes spatial information into account without sacrificing simplicity of processing.

[0043]

(Equation 2)

[0044]

(Equation 3)
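Equations (2) and (3) are not reproduced in this text, so the following sketch uses assumed stand-ins rather than the patent's own definitions: half the L1 distance between normalized histograms in place of D_histarea, and the mean absolute pixel difference in place of D_pixsum. All function names are hypothetical. The example shows why the two distances complement each other: a mirrored image has the same histogram but entirely different pixels.

```python
def histogram(pixels, bins=8, max_value=256):
    """Normalized histogram of a flat list of integer pixel values."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    n = len(pixels)
    return [c / n for c in counts]


def hist_distance(a, b, bins=8):
    """Assumed stand-in for D_histarea: half the L1 distance between
    normalized histograms (0 = identical, 1 = disjoint)."""
    ha, hb = histogram(a, bins), histogram(b, bins)
    return 0.5 * sum(abs(x - y) for x, y in zip(ha, hb))


def pixel_distance(a, b, max_value=255):
    """Assumed stand-in for D_pixsum: mean absolute pixel difference,
    scaled to [0, 1]."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (len(a) * max_value)


left = [0, 0, 255, 255]    # dark on the left, bright on the right
right = [255, 255, 0, 0]   # mirrored layout, same colors
print(hist_distance(left, right))   # 0.0: the histograms are identical
print(pixel_distance(left, right))  # 1.0: every pixel differs
```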

[0045] Simplified fuzzy inference is used to calculate the similarity from the two values. With fuzzy inference, the relationship between the color histogram distance, the pixel value distance, and the similarity of the images can be described without formulating it rigorously. Simplified fuzzy inference is also simple and can be executed at high speed.

[0046] The fuzzy rules used are given by Equation (4) below.

[0047]

(Equation 4)

[0048] Here, i is the rule number, I is the number of rules, and c_i is a real value representing the consequent part, taking a value in [0, 1]. A_a and B_b are the membership functions for the respective feature values; for each feature value, three membership functions, "small", "medium" and "large", are set as shown in FIG. 10.

[0049] The degree of fit to each rule is obtained by Equation (5), and the final inference result, that is, the similarity s between the images, is then obtained by Equation (6). s takes a value in [0, 1].

[0050]

(Equation 5)

[0051]

(Equation 6)
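Equations (4) to (6) and FIG. 10 are not reproduced here, so the sketch below shows only the general shape of simplified fuzzy inference under stated assumptions: triangular "small", "medium" and "large" membership functions on [0, 1], the product of the two membership values as the degree of fit, and a fit-weighted average of the consequents as the output. The consequent table `CONSEQUENTS` and all function names are hypothetical.

```python
def small(x):
    return max(0.0, 1.0 - 2.0 * x)


def medium(x):
    return max(0.0, 1.0 - 2.0 * abs(x - 0.5))


def large(x):
    return max(0.0, 2.0 * x - 1.0)


MEMBERSHIPS = (small, medium, large)

# Hypothetical consequents c_i in [0, 1]: rows are the level of the
# histogram distance, columns the level of the pixel distance.
CONSEQUENTS = (
    (1.0, 0.8, 0.5),
    (0.8, 0.5, 0.2),
    (0.5, 0.2, 0.0),
)


def similarity(d_hist, d_pix):
    """Similarity s in [0, 1] from two distances assumed to lie in [0, 1]."""
    d_hist = min(1.0, max(0.0, d_hist))
    d_pix = min(1.0, max(0.0, d_pix))
    num = 0.0
    den = 0.0
    for a, mu_a in enumerate(MEMBERSHIPS):
        for b, mu_b in enumerate(MEMBERSHIPS):
            fit = mu_a(d_hist) * mu_b(d_pix)   # assumed form of the fit
            num += fit * CONSEQUENTS[a][b]
            den += fit
    return num / den if den > 0 else 0.0


print(similarity(0.0, 0.0))  # 1.0: both distances are "small"
print(similarity(1.0, 1.0))  # 0.0: both distances are "large"
```

Because only nine products and one division are needed per frame pair, an inference of this shape stays cheap, which matches the stated motivation for the simplified method.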

[0052] Shots are detected by detecting the cut points between them.

[0053] Detecting a cut point amounts to detecting a point where the correlation between frames is low. In the present invention, however, the correlation is examined not by directly comparing frame features but from the way the MPEG1 stream has been encoded. MPEG1 compresses by prediction based on the correlation between frames; conversely, examining how prediction was performed reveals the correlation between frames.

[0054] Using the MPEG1 coding information instead of computing new frame features keeps the amount of computation small, and because not all of the information has to be decoded, high-speed processing is possible.

[0055] The procedure first detects cut points from the way B pictures make references, and then confirms them by examining the references made in P pictures and the changes between I pictures. The description below uses a GOP with N = 15 and M = 3, as in FIG. 11, as an example.
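As a small illustration of what N = 15 and M = 3 mean, the sketch below lists the picture types of one GOP, where N is the number of pictures in the GOP and M is the spacing between anchor (I or P) pictures. FIG. 11 is not reproduced here, so starting the pattern at the I picture is an assumption, and `gop_pattern` is a hypothetical helper.

```python
def gop_pattern(n=15, m=3):
    """Picture types of one GOP, listed from its I picture onward.

    n: number of pictures in the GOP
    m: spacing between anchor pictures (I or P)
    """
    types = []
    for i in range(n):
        if i == 0:
            types.append("I")      # one intra-coded picture per GOP
        elif i % m == 0:
            types.append("P")      # forward-predicted anchor
        else:
            types.append("B")      # bidirectionally predicted
    return "".join(types)


print(gop_pattern())  # IBBPBBPBBPBBPBB
```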

[0056] A B picture refers to both the preceding and the following I or P picture. A B picture generally contains four types of macroblock (MB): IMB, FMB, BMB and BiMB. An IMB makes no reference at all, while an FMB, a BMB and a BiMB make forward, backward and bidirectional references, respectively. These references are illustrated in FIG. 12.

[0057] Within a shot, that is, where the correlation between frames is high, the numbers of references to the past and to the future are roughly equal. When a cut point lies between a B picture and a frame it refers to, however, its dependence shifts heavily toward either the past or the future, and the mix of macroblock types becomes skewed. This is shown in FIG. 12 (b), (c) and (d). FIG. 12 depicts the extreme case; in practice, references across a cut point do not disappear completely.

[0058] As this shows, how a B picture refers to the preceding and following frames can be judged from the composition of its macroblock types. This is defined as the dependency relat of the B picture, as in Equation (7).

[0059]

(Equation 7)

[0060] Here, N_F, N_B and N_Bi are the numbers of FMBs, BMBs and BiMBs contained in the B picture, respectively.

[0061] The larger the difference between N_F and N_B and the smaller N_Bi, the larger the absolute value of relat, indicating a stronger bias in the references.
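Equation (7) is not reproduced in this text. The sketch below therefore uses one assumed form that has the properties just described: the difference between forward and backward macroblock counts divided by the total number of predicted macroblocks. It should be read as an illustration, not as the patent's definition.

```python
def relat(n_f, n_b, n_bi):
    """Assumed dependency measure of a B picture, in [-1, 1].

    n_f, n_b, n_bi: numbers of forward, backward and bidirectional
    macroblocks. Positive values mean forward (past) references dominate.
    """
    total = n_f + n_b + n_bi
    if total == 0:
        return 0.0          # no predicted macroblocks at all
    return (n_f - n_b) / total


print(relat(100, 100, 130))  # 0.0: balanced references inside a shot
print(relat(300, 10, 20))    # about 0.88: strongly biased toward the past
```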

[0062] In a GOP such as that of FIG. 11, consider two P pictures (or I pictures) and the two B pictures between them (for example f₇, f₈, f₉, f₁₀), written P₁B₂B₃P₄. Every cut point then necessarily appears in one of the forms P₁|B₂B₃P₄, P₁B₂|B₃P₄ or P₁B₂B₃|P₄, where | denotes the cut point. These correspond to (b), (c) and (d) of FIG. 12, respectively. As FIG. 12 shows, in every case the references of both B pictures, B₂ and B₃, become biased and the absolute value of the dependency becomes large.

[0063] Accordingly, if Equation (8) below is satisfied, a cut point is judged to exist in one of the forms P₁|B₂B₃P₄, P₁B₂|B₃P₄ or P₁B₂B₃|P₄.

[0064]

(Equation 8)

[0065] Next, which of the patterns P₁|B₂B₃P₄, P₁B₂|B₃P₄ or P₁B₂B₃|P₄ applies is determined in order to fix the exact cut point. As Equation (7) shows, relat is positive when references from the past dominate and negative when references from the future dominate. Using this, the cut point can be determined as in Equation (9).

[0066]

(Equation 9)
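The sign-based decision can be sketched as follows. Equations (8) and (9) are not reproduced here, so the threshold test (both absolute values must exceed a hypothetical `threshold`) is an assumption; only the use of the signs follows from the text, where a positive relat means references from the past dominate.

```python
def locate_cut(relat_b2, relat_b3, threshold=0.5):
    """Return where the cut lies within P1 B2 B3 P4, or None.

    relat_b2, relat_b3: dependency values of the two B pictures.
    threshold: hypothetical minimum absolute dependency for a cut.
    """
    if abs(relat_b2) < threshold or abs(relat_b3) < threshold:
        return None                  # references are not biased enough
    if relat_b2 < 0 and relat_b3 < 0:
        return "P1|B2B3P4"           # both B pictures lean on the future
    if relat_b2 > 0 and relat_b3 < 0:
        return "P1B2|B3P4"           # B2 leans on the past, B3 on the future
    if relat_b2 > 0 and relat_b3 > 0:
        return "P1B2B3|P4"           # both B pictures lean on the past
    return None                      # inconsistent signs, no single cut


print(locate_cut(0.9, -0.8))  # P1B2|B3P4
```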

[0067] In this way, cut points can be detected from the reference information of B pictures.

[0068] Detection based on B-picture references alone can produce false detections when there is noise or a momentary change in the picture, such as an object passing in front of the camera. This is thought to be because the distance between a B picture and the pictures it refers to is short. The result is therefore checked using the reference information of P pictures as well, which refer to more distant pictures.

[0069] When a cut point lies between a P picture and the I or P picture it refers to, the P picture can make almost no references, so its number of IMBs should increase. Accordingly, if Equation (10) below is satisfied, the cut point obtained from the B pictures in between is regarded as a false detection and is removed.

[0070]

(Equation 10)

[0071] Here, N_I is the number of IMBs contained in the P picture, and N is the total number of MBs in the P picture.
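The confirmation step can be sketched as below. Equation (10) is not reproduced, so both the form of the test (a ratio N_I / N compared with a limit) and the limit `min_intra_ratio` are assumptions. The direction of the test follows the text: a real cut should raise the number of intra macroblocks in the P picture, so a low ratio suggests a false detection.

```python
def is_false_cut(n_intra, n_total, min_intra_ratio=0.5):
    """True when the P picture has too few intra macroblocks to support a cut.

    n_intra: number of intra macroblocks (N_I) in the P picture
    n_total: total number of macroblocks (N) in the P picture
    min_intra_ratio: hypothetical threshold on N_I / N
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_intra / n_total < min_intra_ratio


print(is_false_cut(20, 330))   # True: the P picture still predicts well
print(is_false_cut(300, 330))  # False: mostly intra, consistent with a cut
```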

[0072] A further check is made by comparing I pictures, which are even farther apart than P pictures.

[0073] In a GOP of the type shown in FIG. 11, a cut point derived from the two B pictures (f₁, f₂) that precede the I picture f₃ cannot be checked using a P picture, because the I picture f₃ makes no references. Checking the result with the I pictures is therefore also needed to catch false detections in this part.

[0074] The color histogram distance D_histarea (Equation (2)) between the DC images of the I picture (f₃) and the I picture of the preceding GOP is examined. If Equation (11) below is satisfied, the cut point obtained from the B pictures in between is regarded as a false detection and is removed.

[0075]

(Equation 11)

[0076] Next, the selection of representative frames for a shot is described. A shot is a short unit compared with the whole moving image, but a 5-second shot, for example, is still a set of 150 frames (at 30 fps), which is awkward to compare, display, or extract feature values from as it is. In general, therefore, when shots are handled, frames that represent the shot are selected from it, and comparison, display and other processing are performed on these representative frames.

[0077] In the present invention, representative frames are used to obtain the similarity between shots when shots are integrated to extract scenes. They are also used to indicate the content of a shot concisely when the analyzed structure of the moving image is presented to the user. It is therefore necessary to select the frames that best represent the content of the shot.

[0078] Much previous work on shots mechanically uses the first or the middle frame of a shot as its representative frame. A frame chosen in that way, however, can hardly be said to represent the content of the shot well. In the present invention, the frame closest to the average of the frames contained in the shot is selected as the representative frame.

[0079] Not every shot is almost motionless. Some shots:

(1) contain camera motion (pan, zoom, etc.) partway through;
(2) have the camera or an object in the picture moving continuously;
(3) contain very intense motion.

For such shots it is difficult to represent the whole shot with a single frame, and useful information may be lost. Such shots are therefore represented by several representative frames.

[0080] Selecting several representative frames also reduces the effect of missed cut points. If a missed cut point causes what are really several shots to be merged into one, selecting a single representative frame means that only the information of one of the original shots is used. If several frames are selected on the basis of content, the information of each original shot is retained and can be used in scene extraction.

[0081] To select the minimum necessary set of representative frames, the frames in the shot are first clustered. From each resulting cluster, the frame closest to the cluster average is selected, and these frames are taken as the representative frames of the shot.
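The selection of one frame per cluster can be sketched as follows, assuming the clustering has already been done and each frame is given as a flat list of feature values (for example a flattened DC image). The squared Euclidean distance to the cluster mean is an assumption made for brevity, and both function names are hypothetical.

```python
def closest_to_mean(frames):
    """Index of the frame closest to the element-wise mean of `frames`."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(dim)]

    def sq_dist(f):
        return sum((f[i] - mean[i]) ** 2 for i in range(dim))

    # min() keeps the first index when several frames are equally close.
    return min(range(n), key=lambda k: sq_dist(frames[k]))


def representative_frames(clusters):
    """Pick, from each cluster, the frame closest to that cluster's mean."""
    return [cluster[closest_to_mean(cluster)] for cluster in clusters]


clusters = [
    [[0, 0], [1, 1], [5, 5]],   # mean is [2, 2], so [1, 1] is closest
    [[10, 10], [12, 12]],       # tie, the first frame is kept
]
print(representative_frames(clusters))  # [[1, 1], [10, 10]]
```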

[0082] However, if every frame in the shot were a candidate for representative frame, the processing load could grow as shots become longer. Only I pictures are therefore used as candidates. This reduces not only the processing needed for selection but also the processing needed to decode the source moving image. It is also convenient that, in MPEG1, the quantization characteristics of I, P, and B pictures are generally varied to raise coding efficiency, so the I picture often has the best quality.

[0083] Furthermore, the I picture is not fully decoded. The DC image described above is used instead, which reduces the processing load at decoding time.

[0084] The specific processing procedure is as follows (FIG. 13). For a shot that contains no I picture at all, that is, a very short shot, the first P picture appearing in the shot is used as the representative frame; if no P picture exists either, a B picture is used. A very short shot can be assumed to change very little, so this mechanical rule is sufficient.

[0085] (1) The I pictures contained in the shot are decoded in simplified form to extract their DC images.

[0086] (2) The DC images from the low-motion portions of the shot are taken as the initial clusters. Whether motion is low is determined by examining the number of IMBs (intra-coded macroblocks) contained in the B and P pictures (Equation (12)).

[0087]

[Equation 12]

[0088] (3) Clustering is performed starting from the initial clusters, classifying the I pictures in the shot into several clusters. Clustering uses the group average method, with D_histarea (Equation (2)) as the distance between elements. Clustering continues until the smallest distance between clusters exceeds a threshold.

[0089] (4) After clustering ends, one representative frame is selected from each cluster. First, the DC images of the I pictures in the cluster are averaged to create an average DC image (Equation (13)).

[0090]

[Equation 13]

[0091] (5) The I picture I_k whose DC image DC_k has the smallest distance D_pixsum (Equation (3)) from the average DC image is taken as the representative frame.

[0092] The representative frames of a shot are selected in this way.
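Steps (3) to (5) above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: DC images are assumed to be flattened tuples of pixel values, `pix_dist` (sum of absolute pixel differences) is a hypothetical stand-in for both D_histarea and D_pixsum because Equations (2) and (3) are not reproduced in this section, and the initial clusters are simply passed in by the caller.

```python
from itertools import combinations

def pix_dist(a, b):
    """Hypothetical stand-in distance: sum of absolute pixel differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def group_average(c1, c2, dist):
    """Group-average linkage: mean distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def cluster_frames(initial_clusters, dist, threshold):
    """Merge the closest pair of clusters until the smallest
    inter-cluster distance exceeds the threshold (step 3)."""
    clusters = [list(c) for c in initial_clusters]
    while len(clusters) > 1:
        d, i, j = min(
            (group_average(clusters[i], clusters[j], dist), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if d > threshold:
            break
        clusters[i].extend(clusters[j])  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

def representative(cluster, dist=pix_dist):
    """Average the DC images pixel by pixel (step 4) and return the
    frame closest to that average (step 5)."""
    n = len(cluster)
    mean = [sum(px) / n for px in zip(*cluster)]
    return min(cluster, key=lambda frame: dist(frame, mean))

# Toy data: five two-pixel "DC images", each starting as its own cluster.
frames = [(0, 0), (1, 1), (10, 10), (11, 11), (12, 12)]
clusters = cluster_frames([[f] for f in frames], pix_dist, threshold=5)
reps = [representative(c) for c in clusters]
```

In this toy run the frames fall into two clusters, and each cluster contributes one representative frame, so a shot with two visually distinct parts ends up with two representatives.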

[0093] In actual processing, the representative frames are selected in parallel with cut point detection, so that the DC images of the I pictures and the macroblock information of the P and B pictures can be obtained efficiently.

[0094] In a conversation scene, for example, the speakers are often filmed alternately, so similar shots repeat. A single scene thus often contains several shots that resemble one another. This property is exploited to integrate shots and extract scenes.

[0095] The similarity between shots is based on the similarity s (Equation (6)) between the representative frames of the shots. Because a shot may have more than one representative frame, the similarity is computed for every combination of representative frames, and the maximum value is taken as the similarity between the shots (FIG. 14).
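The maximum-over-all-combinations rule can be sketched as below. The function `frame_similarity` is passed in as a parameter because the similarity s of Equation (6) is not reproduced in this section; the `toy_similarity` used in the example is purely hypothetical and operates on scalar stand-ins for frames.

```python
from itertools import product

def shot_similarity(reps_a, reps_b, frame_similarity):
    """Similarity between two shots: the largest similarity found
    over every pairing of their representative frames."""
    return max(frame_similarity(a, b) for a, b in product(reps_a, reps_b))

def toy_similarity(a, b):
    """Hypothetical frame similarity in (0, 1]; 1 means identical."""
    return 1 / (1 + abs(a - b))

# Shot A has two representative frames, shot B has two as well.
best = shot_similarity([0, 10], [9, 50], toy_similarity)
```

Taking the maximum means two shots count as similar as soon as any one pair of their representative frames matches well, which suits shots that change content partway through.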

[0096] The simplest way to extract scenes from the similarity between shots is, as shown in FIG. 15, to treat everything between two similar shots (shots whose similarity is very high) as belonging to the same scene.
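A minimal sketch of this simple method, assuming the shot similarities are available as a symmetric matrix `sim` and that "similar" means exceeding a fixed `threshold` (both names are illustrative, not from the source):

```python
def simple_scenes(sim, threshold):
    """Group shots into scenes: whenever shots i and j (i < j) are
    similar, every shot from i to j is placed in the same scene.
    Returns a list of (first_shot, last_shot) index pairs."""
    n = len(sim)
    # reach[i]: furthest later shot that is similar to shot i (or i itself)
    reach = [
        max((j for j in range(i + 1, n) if sim[i][j] > threshold), default=i)
        for i in range(n)
    ]
    scenes, start, end = [], 0, 0
    for i in range(n):
        if i > end:  # no similar pair spans the cut just before shot i
            scenes.append((start, end))
            start = i
        end = max(end, reach[i])
    if n:
        scenes.append((start, end))
    return scenes

# Six shots; shots 0 and 2 are similar, and so are shots 1 and 3.
n_shots = 6
sim = [[0.0] * n_shots for _ in range(n_shots)]
for i, j in [(0, 2), (1, 3)]:
    sim[i][j] = sim[j][i] = 0.9
scenes = simple_scenes(sim, threshold=0.8)
```

Here shots 0 to 3 merge into one scene while shots 4 and 5 each stand alone. The result depends entirely on the single threshold, which is the weakness discussed next.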

[0097] This approach, however, has two problems: (1) the threshold that decides whether shots are similar or not strongly affects the result, and (2) when no pair of shots has very high similarity but many pairs have moderate similarity, the shots cannot be treated as one scene, so the method lacks flexibility.

[0098] Therefore, a degree of connection connect_{n,n+1}, which expresses the degree to which shot_n and shot_{n+1} are continuous (that is, belong to the same scene), is defined as in Equation (14), and scenes are extracted using this degree of connection.

[0099]

[Equation 14]

[0100] Here, N denotes the range of shots to be compared, and s_{ij} is the similarity between shot_i and shot_j.

[0101] The degree of connection connect_{n,n+1} is thus obtained not only from shot_n and shot_{n+1} but from the similarities s_{ij} between all the shots in their vicinity. In FIG. 16, for example, connect_{3,4} is obtained using not only the similarity s_{34} between shot_3 and shot_4 but also the similarity s_{25} between shot_2 and shot_5. This is because, even if shot_3 and shot_4 are not similar at all, they can be considered to belong to the same scene as long as shot_2 and shot_5 are similar.

[0102] Shots that are far apart in time, however, are unlikely to belong to the same scene. Instead, there may be shots that belong to different scenes yet happen to have high similarity. To prevent missed detections from this cause as far as possible, the range of shots compared is limited to N.

[0103] The degree of connection connect_{n,n+1} obtained from Equation (14) changes, for example, as shown in FIG. 17.

[0104] Scene changes are determined from this variation in the degree of connection, and scenes are extracted. Here, when the difference between a peak and a valley of the variation is larger than the threshold threshold_SCENE, the cut point whose degree of connection forms that valley is taken as a scene change point.
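The peak-and-valley rule can be sketched as follows. The source does not say which neighbouring peak a valley is compared against, so this sketch makes an assumption: a valley counts as a scene change only if it lies more than the threshold below both adjacent peaks (or below the single adjacent peak at either end of the sequence). The list `connect` is assumed to hold one degree of connection per cut point, in order.

```python
def scene_change_points(connect, threshold):
    """Return indices of cut points whose degree of connection is a
    valley lying more than `threshold` below the neighbouring peaks."""
    n = len(connect)
    changes = []
    for k in range(n):
        # A valley is lower than its left neighbour and not higher
        # than its right neighbour (leftmost point of any plateau).
        is_left_low = k == 0 or connect[k - 1] > connect[k]
        is_right_low = k == n - 1 or connect[k + 1] >= connect[k]
        if not (is_left_low and is_right_low):
            continue
        # Climb uphill on each side to reach the adjacent peaks.
        left = k
        while left > 0 and connect[left - 1] >= connect[left]:
            left -= 1
        right = k
        while right < n - 1 and connect[right + 1] >= connect[right]:
            right += 1
        drops = [connect[p] - connect[k] for p in (left, right) if p != k]
        if drops and min(drops) > threshold:
            changes.append(k)
    return changes

# Degrees of connection at seven consecutive cut points.
connect = [0.9, 0.8, 0.2, 0.7, 0.75, 0.6, 0.9]
points = scene_change_points(connect, threshold=0.3)
```

With this data only the deep valley at index 2 is reported; the shallow dip at index 5 is ignored because it lies only about 0.15 below the peak to its left. Comparing against the larger of the two drops instead would be an equally plausible reading of the text and would report both.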

[0105]

[Effects of the Invention] According to the present invention, a given moving image is hybrid-encoded and divided into a hierarchical structure, and shots are integrated according to the similarity between the divided shots to extract scenes, so scenes are extracted quickly and accurately. Because extracting shots or frames from a scene is comparatively easy, scenes, shots, or cuts can consequently be extracted from a moving image quickly and accurately.

[0106] All of the above processing can be automated by installing appropriate software on hardware currently in use, so a desired scene, shot, or cut can be provided automatically in response to an appropriate input instruction.

[0107] In experiments, cut detection on the moving images listed in Table 2 gave the results shown in Table 3.

[0108]

[Table 2]

[0109]

[Table 3]

[0110] The processing times for cut point detection are shown in Table 4.

[0111]

[Table 4]

[0112] The processing times for cut point detection using a simpler algorithm are shown in Table 5.

[0113]

[Table 5]

[0114] Next, scene extraction was performed using the shots obtained from cut point detection. The variation in the degree of connection is shown in FIGS. 18, 19, and 20.

[0115] Table 6 shows the results of scene extraction from the degrees of connection in FIGS. 18, 19, and 20. The threshold for scene change point detection was set to threshold_SCENE = 0.3.

[0116] Table 6 shows that at least 75% of the scene change points were detected for every moving image. Given the high speed of the method and the fact that it uses no semantic analysis or prior knowledge, this can be considered sufficiently practical. About 1/4 to 1/3 of the detections were shifted one shot before or after the actual scene change point. This stems from a weakness of the algorithm: for a shot adjacent to a scene change point, the degree of connection becomes low when the scene to which the shot properly belongs contains no shot that is highly similar to it.

[0117]

[Table 6]

[0118] The processing time for the entire structure analysis is shown in Table 7, which confirms that sufficiently high speed is maintained.

[0119]

[Table 7]

[Brief description of the drawings]

FIG. 1 is a diagram of the MPEG1 encoding/decoding system of the present invention.

FIG. 2 is a block diagram of the MPEG1 video encoder.

FIG. 3 is a diagram of the MPEG1 video decoder.

FIG. 4 is a diagram showing an example of a GOP.

FIG. 5 is a diagram showing the order of pictures in the original image and in the stream.

FIG. 6 is a diagram of the hierarchical structure of MPEG1.

FIG. 7 is a flowchart of the structure analysis processing.

FIG. 8 is a diagram showing the generation of a DC image.

FIG. 9 is a diagram showing an example of a DC image.

FIG. 10 is a diagram showing the shape of the membership functions used in the fuzzy inference that obtains the similarity.

FIG. 11 is a diagram showing an example of a GOP.

FIG. 12 shows (a) normal referencing, (b) the case where a cut point lies before the past P picture, (c) the case where a cut point lies between B pictures, and (d) the case where a cut point lies before the future P picture.

FIG. 13 is a flowchart of the representative frame selection algorithm, showing (a) extraction of the DC images, (b) determination of the initial clusters, (c) clustering, (d) creation of an average image from the DC images in a cluster, and (e) selection of the image closest to the average image as the representative frame.

FIG. 14 is a diagram showing the similarity between shots.

FIG. 15 is a diagram of simple scene extraction.

FIG. 16 is a diagram showing the degree of connection when N = 3.

FIG. 17 is a diagram showing an example of the variation in the degree of connection connect_{n,n+1}.

FIG. 18 is a diagram showing the variation in the degree of connection for moving image A.

FIG. 19 is a diagram showing the variation in the degree of connection for moving image B.

FIG. 20 is a diagram showing the variation in the degree of connection for moving image C.

Claims


1. A method for automatic hierarchical structuring of a moving image, characterized in that an encoded moving image is divided into shots, and the shots are integrated using the similarity between the divided shots to extract scenes.

2. The method according to claim 1, wherein encoding of the moving image is performed by MPEG.

3. The moving image automatic hierarchical structuring method according to claim 1, wherein when a shot is detected from an encoded moving image, high-speed processing is performed using characteristics of MPEG.

4. The method according to claim 1, wherein a representative frame is extracted when calculating the degree of similarity between shots.

5. The method according to claim 1, wherein the similarity between shots is obtained by fuzzy inference.

6. The moving image automatic hierarchical structuring method according to claim 1, wherein the scene extraction processing is performed based on a defined degree of connection between shots.

7. A moving image browsing method, characterized in that an encoded moving image is divided into shots, the shots are integrated using the similarity between the divided shots to extract scenes so that the moving image is automatically given a hierarchical structure, and the hierarchically structured data is used to easily grasp the content of the entire moving image and to detect a desired scene or shot.